pullword: Unsupervised Word Discovery

Go to Project Site

Introduction

With the growing availability of digitized text data, there is a great need for effective computational tools that automatically extract knowledge from text.

The Chinese language differs most significantly from alphabet-based languages in that it does not mark word boundaries. Most existing Chinese text-mining methods require a prespecified vocabulary and/or a large, relevant training corpus, which may not be available in some applications.

Pullword is developed for discovering words in small or large volumes of unstructured Chinese text. It also proposes ways to rank the discovered words and to conduct higher-level context analyses. It is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora.

Implementations

The implementations mainly follow this post.
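
The referenced post is not reproduced here, so the following is only a rough sketch of one common unsupervised approach to word discovery, not necessarily what Pullword does. It scores each candidate substring by internal cohesion (pointwise mutual information across its weakest split) and by the entropy of its neighbouring characters. The function name `discover_words`, its parameters `max_len` and `min_count`, and the way the two scores are combined are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def discover_words(text, max_len=3, min_count=2):
    """Rank candidate substrings of `text` (hypothetical helper, not Pullword's API)."""
    n = len(text)
    counts = Counter()            # frequency of every substring up to max_len
    left = defaultdict(Counter)   # characters seen immediately before a candidate
    right = defaultdict(Counter)  # characters seen immediately after a candidate

    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            gram = text[i:i + size]
            counts[gram] += 1
            if size > 1:
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + size < n:
                    right[gram][text[i + size]] += 1

    def prob(gram):
        # Relative frequency among all substrings of the same length.
        return counts[gram] / (n - len(gram) + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    scored = []
    for gram, c in counts.items():
        if len(gram) < 2 or c < min_count:
            continue
        # Cohesion: PMI at the weakest split point of the candidate.
        cohesion = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        # Freedom: the less varied side of the candidate's context.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        scored.append((gram, c, cohesion, freedom))

    # Assumed ranking: sum of the two scores, ties broken by frequency.
    scored.sort(key=lambda r: (r[2] + r[3], r[1]), reverse=True)
    return scored


if __name__ == "__main__":
    sample = "电影院的电影很好看，我喜欢电影院。他去电影院看电影。"
    for gram, count, cohesion, freedom in discover_words(sample)[:5]:
        print(gram, count, round(cohesion, 3), round(freedom, 3))
```

In practice the input would first be split on punctuation and non-Chinese characters so that candidates do not span sentence boundaries; that step is left out here to keep the sketch short.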

Online demo