LDA word class induction
Grzegorz Chrupała <grzegorz.chrupala@gmail.com>

This software implements the LDA word class induction method described
in Grzegorz Chrupała, 2011, Efficient induction of probabilistic word
classes with LDA, IJCNLP.


Requirements: Java 1.6 and a Unix-like shell environment.


You can test the software on the included example text. To extract
features with at least 3 occurrences and induce 50 and 100 word
classes, run:

./bin/run.sh extract-features example.txt 3 example-3
./bin/run.sh induce-word-classes example-3 50  example-3-50
./bin/run.sh induce-word-classes example-3 100 example-3-100

In the output directories you will find the following files:

wordtype-class-probs - Word class probabilities given word type
      P(z|w_0). Zero probabilities are omitted.

feature-class-counts - Co-occurrence counts of word classes and
      features, i.e. unnormalized P(z|w_{-1}) and
      P(z|w_{+1}). Previous-word features are suffixed with '^L' and
      next-word features with '^R'.

wordtype-counts - Word type counts.
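
Because feature-class-counts holds raw counts, turning the counts for
one feature into P(z | feature) only requires dividing each count by
the feature's total. The Python sketch below illustrates this on counts
already loaded into a dict mapping class id to count. The function name
normalize_counts is hypothetical and not part of this package, and
parsing is left out because the file layout is not described here.

```python
def normalize_counts(class_counts):
    """Turn {class_id: count} for one feature into {class_id: probability}.

    Hypothetical helper for illustration; not part of the package.
    Classes with a zero count are dropped, mirroring how zero
    probabilities are omitted from wordtype-class-probs.
    """
    total = sum(class_counts.values())
    if total == 0:
        return {}
    return {z: n / total for z, n in class_counts.items() if n > 0}
```

For example, counts of 3 and 1 for two classes give probabilities of
0.75 and 0.25.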


From inside the package directory, run the top-level script
bin/run.sh. It supports two actions: extracting features and
inducing word classes.

./bin/run.sh extract-features INPUT-TEXT MIN-OCCUR OUTPUT-DIR
  INPUT-TEXT is the file containing tokenized, space-separated text,
  one sentence per line.
  MIN-OCCUR is the minimum number of occurrences required for a
  feature to be extracted. Features occurring less frequently are
  ignored.
  OUTPUT-DIR is the directory where the files with extracted features
  will be stored.
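
As a rough illustration of the kind of features involved, the Python
sketch below counts previous-word ('^L') and next-word ('^R') features
for every token and drops features seen fewer than MIN-OCCUR times.
It is not the package's implementation: the function name
extract_features, the in-memory result, and the reading of MIN-OCCUR
as a threshold on a feature's total count over the whole text are all
assumptions made for the example.

```python
from collections import Counter


def extract_features(sentences, min_occur):
    """Return {word: Counter(feature -> count)}, keeping frequent features.

    Hypothetical helper for illustration; not the package's actual code.
    Each sentence is a string of space-separated tokens.
    """
    feature_totals = Counter()
    per_word = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            feats = []
            if i > 0:
                feats.append(tokens[i - 1] + "^L")  # previous word
            if i + 1 < len(tokens):
                feats.append(tokens[i + 1] + "^R")  # next word
            for feat in feats:
                feature_totals[feat] += 1
                per_word.setdefault(word, Counter())[feat] += 1

    # Assumed reading of MIN-OCCUR: threshold on a feature's total count.
    kept = {f for f, n in feature_totals.items() if n >= min_occur}

    filtered = {}
    for word, feats in per_word.items():
        kept_feats = Counter({f: n for f, n in feats.items() if f in kept})
        if kept_feats:
            filtered[word] = kept_feats
    return filtered
```

Called on the three sentences "the cat sat", "the dog sat" and
"the cat ran" with min_occur set to 2, the feature 'dog^R' (seen once)
is dropped, while 'the^L' (seen three times) is kept.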

./bin/run.sh induce-word-classes INPUT-DIR NUM-CLASSES OUTPUT-DIR
  INPUT-DIR is a directory created by the extract-features command.
  NUM-CLASSES is the number of classes to induce.
  OUTPUT-DIR is the directory where the results will be written.


This package relies on the Mallet toolkit, which is included in the
lib directory.

The script sets the maximum Java heap size to 2G. For large datasets
you will need a larger heap; the limit can be changed in bin/run.sh.