Colada: (Word) Classes with Online LDA

Grzegorz Chrupała grzegorz.chrupala@gmail.com

Colada implements online and minibatch word class induction using Latent Dirichlet Allocation (LDA) with an online Gibbs sampler.

Chrupała (2011) describes how to use LDA to induce soft word classes from text. Song et al. (2005) and Canini et al. (2009) describe a simple modification of the Gibbs sampler for LDA that makes it run online. Colada brings these two ideas together. Slides from CLIN 2012 comparing Colada with another word class induction algorithm (Chrupała and Alishahi 2010): https://bitbucket.org/gchrupala/delta-h/downloads/slides.pdf

We have used Colada in batch mode (Chrupała 2012) and in online mode (Alishahi and Chrupała 2012).
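
To make the idea of an online Gibbs sampler for LDA more concrete, here is a minimal Python sketch of the simplest variant, in which each incoming token gets its topic sampled once from the collapsed conditional given the counts seen so far and is never resampled. This is an illustration only, not Colada's Haskell implementation: the class `OnlineLDA`, its method names, and the hyperparameter defaults `alpha` and `beta` are all assumptions made for this example, and it omits the minibatch passes that Colada supports.

```python
import random
from collections import defaultdict


class OnlineLDA:
    """Illustrative single-pass sampler (hypothetical, not Colada's code)."""

    def __init__(self, num_topics, vocab_size, alpha=0.1, beta=0.01, seed=0):
        self.num_topics = num_topics
        self.vocab_size = vocab_size
        self.alpha = alpha  # assumed document-topic smoothing
        self.beta = beta    # assumed topic-word smoothing
        self.rng = random.Random(seed)
        # doc_topic[d][k]: tokens of document d assigned to topic k
        self.doc_topic = defaultdict(lambda: [0] * num_topics)
        # topic_word[k][w]: times word w was assigned to topic k
        self.topic_word = [defaultdict(int) for _ in range(num_topics)]
        # topic_total[k]: total tokens assigned to topic k
        self.topic_total = [0] * num_topics

    def weights(self, doc, word):
        """Unnormalized P(z = k | doc, word, counts so far) for each k."""
        doc_counts = self.doc_topic[doc]
        return [
            (doc_counts[k] + self.alpha)
            * (self.topic_word[k].get(word, 0) + self.beta)
            / (self.topic_total[k] + self.vocab_size * self.beta)
            for k in range(self.num_topics)
        ]

    def observe(self, doc, word):
        """Sample a topic for one new token and update the counts."""
        w = self.weights(doc, word)
        k = self.rng.choices(range(self.num_topics), weights=w)[0]
        self.doc_topic[doc][k] += 1
        self.topic_word[k][word] += 1
        self.topic_total[k] += 1
        return k


if __name__ == "__main__":
    model = OnlineLDA(num_topics=3, vocab_size=4)
    for doc, word in [("d1", 0), ("d1", 1), ("d2", 2), ("d2", 3)]:
        print(doc, word, model.observe(doc, word))
```

Because earlier assignments are never revisited, the model can be queried after every token, which is what makes per-token output from an evolving model possible.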

Installation

The simplest way to start using Colada is to install the Haskell Platform and then run the following commands in the console:

cabal update
cabal install colada --prefix=$HOME

This will install the latest version of colada in $HOME/bin. Run

colada help

for usage help.

Quick start

The input format is one word per line, with sentences separated by a blank line. If the file INPUT contains your input sentences, the following command induces a model with 20 word classes, using minibatches of 100 sentences with 10 passes over each batch, and prints to OUTPUT the word class distribution for each token according to the evolving model:

colada learn --topic-num=20 --batch-size=100 --passes=10 \
  --progressive --lambda=1.0 MODEL < INPUT > OUTPUT

The model will be saved in the file MODEL. To display a human-readable summary of the induced classes, execute:

colada summary MODEL
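
If your corpus is not yet in the one-word-per-line format described above, a small script can produce it. The Python sketch below assumes the sentences are already tokenized into lists of strings; the function names `to_colada_input` and `from_colada_input` are hypothetical helpers for this example and are not part of Colada.

```python
def to_colada_input(sentences):
    """Render tokenized sentences as one token per line,
    with a blank line between consecutive sentences."""
    blocks = ["\n".join(tokens) for tokens in sentences if tokens]
    if not blocks:
        return ""
    return "\n\n".join(blocks) + "\n"


def from_colada_input(text):
    """Parse one-token-per-line text back into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            # A blank line closes the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    corpus = [["the", "cat", "sleeps"], ["dogs", "bark"]]
    with open("INPUT", "w", encoding="utf-8") as handle:
        handle.write(to_colada_input(corpus))
```

The resulting file can then be passed to colada learn on standard input as shown in the command above.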

References

  • Grzegorz Chrupała. 2012. Hierarchical clustering of word class distributions. NAACL-HLT 2012 Workshop on the Induction of Linguistic Structure.
  • Afra Alishahi and Grzegorz Chrupała. 2012. Concurrent Acquisition of Word Meaning and Lexical Categories. EMNLP-CoNLL 2012.
  • Grzegorz Chrupała. 2011. Efficient induction of probabilistic word classes with LDA. IJCNLP 2011.
  • Grzegorz Chrupała and Afra Alishahi. 2010. Online Entropy-based Model of Lexical Category Acquisition. CoNLL 2010.
  • Xiaodan Song, Ching-Yung Lin, Belle L. Tseng, and Ming-Ting Sun. 2005. Modeling and Predicting Personal Information Dissemination Behavior. KDD 2005.
  • Kevin R. Canini, Lei Shi, and Thomas L. Griffiths. 2009. Online Inference of Topics with Latent Dirichlet Allocation. AISTATS 2009.
