pullword: Unsupervised Word Discovery

Introduction

With the growing availability of digitized text data, there is a great need for effective computational tools that automatically extract knowledge from text.

Chinese differs most significantly from alphabet-based languages in that it does not mark word boundaries. Most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications.

Pullword is developed for discovering words in small or large volumes of unstructured Chinese text. It also proposes ways to order the discovered words and to conduct higher-level context analyses. It is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora.

Implementations

The implementations mainly follow this post.

golang version
javascript version
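
For readers new to vocabulary-free word discovery, here is a minimal Go sketch of one generic idea: score each adjacent character pair by how much more often it occurs than its two characters would co-occur by chance. This is an illustration only and is not taken from the pullword implementations or the post they follow. The names `Candidate` and `discoverBigrams`, the `minCount` threshold, and the sample sentence are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical record for one scored two-character string.
type Candidate struct {
	Text     string
	Count    int
	Cohesion float64
}

// discoverBigrams is an illustrative helper, not pullword's API.
// It counts single characters and adjacent pairs, then scores each pair as
// count(xy) * n / (count(x) * count(y)), where n is the number of characters.
// A score well above 1 means the pair occurs more often than chance predicts.
func discoverBigrams(text string, minCount int) []Candidate {
	runes := []rune(text) // iterate over characters, not bytes
	uni := map[rune]int{}
	bi := map[[2]rune]int{}
	for i, r := range runes {
		uni[r]++
		if i+1 < len(runes) {
			bi[[2]rune{r, runes[i+1]}]++
		}
	}

	n := float64(len(runes))
	var out []Candidate
	for pair, c := range bi {
		if c < minCount {
			continue // too rare to score reliably
		}
		score := float64(c) * n / (float64(uni[pair[0]]) * float64(uni[pair[1]]))
		out = append(out, Candidate{Text: string(pair[:]), Count: c, Cohesion: score})
	}

	// Highest cohesion first; ties are broken by text so the order is stable.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Cohesion != out[j].Cohesion {
			return out[i].Cohesion > out[j].Cohesion
		}
		return out[i].Text < out[j].Text
	})
	return out
}

func main() {
	// Hypothetical sample text with no word boundaries marked.
	text := "我看电影他看电影你看书买电影票"
	for _, c := range discoverBigrams(text, 2) {
		fmt.Printf("%s count=%d cohesion=%.2f\n", c.Text, c.Count, c.Cohesion)
	}
}
```

On this sample, 电影 ("movie") ranks first, but the non-word 看电 also passes the frequency threshold. A single cohesion score is rarely enough on its own, which is why practical word-discovery methods usually combine several criteria and longer candidates.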

Online demo

javascript demo

NLP NLU