hiera

Hiera implements the algorithm for hierarchical clustering of word-class probability distributions described in [Chrupala-2012]. It is an agglomerative clustering algorithm in which the distance between clusters is defined as the Jensen-Shannon divergence between the probability distributions over classes associated with each word type.
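As a rough illustration of the distance measure, the sketch below computes the Jensen-Shannon divergence between two class distributions. It assumes the standard equal-weight definition with base-2 logarithms; hiera's own implementation may differ in detail (for example in weighting or log base). The function names and the second distribution (`other`) are hypothetical and chosen only for this example.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence between distributions p and q."""
    # The mixture m is non-zero wherever p or q is non-zero,
    # so the divisions inside kl_divergence are always safe.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

dog = [0.3, 0.2, 0.2, 0.2, 0.1, 0.0]    # distribution from the usage example
other = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]  # hypothetical second word

print(js_divergence(dog, dog))    # identical distributions give zero
print(js_divergence(dog, other))  # a small positive value
```

With base-2 logarithms the divergence is symmetric and lies between 0 and 1, which makes it convenient as a distance between word types or clusters.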

Installation

The easiest way to start using hiera is to install the Haskell Platform. Then execute:

cabal install --bindir=$HOME/bin

This will install the hiera executable in the directory $HOME/bin. If this directory is in your PATH, hiera is ready to use.

Usage

To build a model (a tree) from word class distributions:

cat INPUT | hiera build MIN_COUNT MODEL +RTS -N8 -RTS

The input format is one record per line, with columns separated by spaces. The first column should contain the word, the second column the absolute word frequency, and the subsequent columns the probabilities of the word classes for this word. For example:

dog 2000 0.3 0.2 0.2 0.2 0.1 0.0

The word is "dog", it appeared 2000 times, class 1 has probability 0.3, and so on.
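To make the record layout concrete, here is a minimal sketch of a parser for one input line. It is not part of hiera. The function name `parse_record` is hypothetical, and the sketch assumes that the frequency is an integer and that columns are separated by whitespace.

```python
def parse_record(line):
    """Split one input line into (word, count, class probabilities)."""
    fields = line.split()
    word = fields[0]                        # column 1: the word
    count = int(fields[1])                  # column 2: absolute frequency
    probs = [float(x) for x in fields[2:]]  # remaining columns: class probabilities
    return word, count, probs

word, count, probs = parse_record("dog 2000 0.3 0.2 0.2 0.2 0.1 0.0")
print(word, count, probs)
```

A check that the probabilities sum to roughly 1 is a cheap way to catch malformed records before building a model.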

MIN_COUNT is the word frequency threshold for including a word in the model. Set it to 1 to use all the words. Bear in mind that hiera's runtime grows quickly with the size of the input. Building models with more than about 2000 words may be impractically slow.
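Since model size matters so much for runtime, it can help to check how many records survive a given threshold before running a build. The sketch below is a standalone helper, not hiera code. It assumes the threshold is inclusive (a word is kept when its frequency is at least MIN_COUNT), which is inferred from the advice to set it to 1 to keep all words. The second record is made up for the example.

```python
def filter_by_min_count(lines, min_count):
    """Keep records whose frequency (second column) is at least min_count."""
    return [line for line in lines if int(line.split()[1]) >= min_count]

records = [
    "dog 2000 0.3 0.2 0.2 0.2 0.1 0.0",
    "axolotl 3 0.0 0.1 0.1 0.1 0.2 0.5",  # hypothetical rare word
]

kept = filter_by_min_count(records, 100)
print(len(kept), "of", len(records), "records kept")
```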

The option -N8, passed to the GHC runtime system between +RTS and -RTS, specifies how many threads to use (here, 8).

The tree can be visually displayed with the following command:

hiera display MODEL

Once the model is built, you can assign cluster IDs to word distributions. These IDs are paths in the tree to the best matching node. Note that the word distributions to be clustered can include both the ones that were used to build the model and additional words:

cat INPUT | hiera label MAX_SIZE MODEL > OUTPUT

MAX_SIZE is the maximum size of the path (set to a large number to use the full path). The format of the INPUT is the same as specified above. The format of the output is one line per word, with a string of 0s and 1s representing the path.
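Because each label is a path of binary choices, shorter paths can be read as coarser clusters. The sketch below shows one way to post-process such labels. It assumes that a path is written from the root downwards, so that truncation keeps the leading characters and two words sharing a prefix fall under the same subtree. The README does not state this, and the helper names and example paths are hypothetical.

```python
def truncate_path(path, max_size):
    """Keep at most max_size leading steps of a 0/1 path string."""
    return path[:max_size]

def common_prefix_length(a, b):
    """Number of leading steps two paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical labels for two words
path_a = "0110"
path_b = "0101"

print(truncate_path(path_a, 2))              # coarser cluster ID
print(common_prefix_length(path_a, path_b))  # depth of the shared subtree
```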

[Chrupala-2012] Grzegorz Chrupała. 2012. Hierarchical clustering of word class distributions. NAACL-HLT 2012 Workshop on the Induction of Linguistic Structure.