Colada: (Word) Classes with Online LDA
======================================

Grzegorz Chrupała <>

Colada implements online and minibatch word class induction using
Latent Dirichlet Allocation (LDA) with an online Gibbs sampler.

Chrupała (2011) describes how to use LDA to induce soft word classes
from text. Song et al. (2005) and Canini et al. (2009) describe a simple
modification of the Gibbs sampler for LDA to make it run online. Colada
brings these two ideas together.

Slides from CLIN 2012 comparing Colada with another word class induction
algorithm (Chrupała and Alishahi 2010):

We have used Colada in Chrupała (2012), in batch mode, and in Alishahi
and Chrupała (2012), in online mode.

Installation
------------

The simplest way to start using colada is to install the Haskell
platform. Then run the following commands in the console:

    cabal update
    cabal install colada --bindir=$HOME/bin

This will install the latest version of colada in `$HOME/bin`.
Run `colada help` for usage help.

Quick start
-----------

The input format is one word per line, with sentences separated by a
blank line.

If the file INPUT contains your input sentences, the following command
induces a model with 20 word classes, using minibatches of 100 sentences
with 10 passes over each batch. While learning, it prints to OUTPUT the
word class distribution for each token according to the evolving model:

    colada learn --topic-num=20 --batch-size=100 --passes=10 \
           --progressive --lambda=1.0 MODEL < INPUT > OUTPUT

The model will be saved in the file MODEL.

To display a human-readable summary of the induced classes, execute:

    colada summary MODEL

To output the unnormalized class distribution for each word type,
execute:

    colada word-type-classes MODEL

References
----------

- Grzegorz Chrupała. 2012. Hierarchical clustering of word class
  distributions. NAACL-HLT 2012 Workshop on the Induction of Linguistic
  Structure.
- Afra Alishahi and Grzegorz Chrupała. 2012. Concurrent Acquisition of
  Word Meaning and Lexical Categories. EMNLP-CoNLL 2012.
- Grzegorz Chrupała. 2011. Efficient induction of probabilistic word
  classes with LDA. IJCNLP 2011.
- Grzegorz Chrupała and Afra Alishahi. 2010. Online Entropy-based Model
  of Lexical Category Acquisition. CoNLL 2010.
- Xiaodan Song, Ching-Yung Lin, Belle L. Tseng, and Ming-Ting Sun. 2005.
  Modeling and Predicting Personal Information Dissemination Behavior.
  KDD 2005.
- Kevin R. Canini, Lei Shi, and Thomas L. Griffiths. 2009. Online
  Inference of Topics with Latent Dirichlet Allocation. AISTATS 2009.
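
Preparing input (sketch)
------------------------

The input format described under Quick start (one word per line, a blank
line between sentences) can be produced from plain text with a few lines
of code. The following is a minimal sketch, not part of colada itself: the
function name `to_colada_input` is hypothetical, and it assumes each
sentence arrives as a single string whose tokens are already separated by
whitespace. Tokenization and case handling are left to you, since this
README does not specify them.

```python
def to_colada_input(sentences):
    """Render sentences as one word per line, with a blank line between sentences.

    Assumption: each sentence is a string of whitespace-separated tokens.
    Sentences with no tokens are skipped so they do not produce stray blank lines.
    """
    blocks = []
    for sentence in sentences:
        tokens = sentence.split()
        if tokens:
            # One token per line within a sentence.
            blocks.append("\n".join(tokens))
    if not blocks:
        return ""
    # A blank line separates consecutive sentences; end with a final newline.
    return "\n\n".join(blocks) + "\n"
```

Writing the returned string to a file gives something you can pass to
`colada learn` on standard input, as in the Quick start command.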