
Hierarchical Clustering

This module clusters the words in a corpus hierarchically and uses the cluster information, rather than the words themselves, when creating N-grams.

Background info on Brown Clustering: https://d396qusza40orc.cloudfront.net/nlangp/brown.pdf

Once the clustering algorithm has run, each word in the vocabulary is assigned a bitstring (e.g. "word" = 100010010110). To form clusters, the module assigns each word to a cluster based on the first n bits of its bitstring (e.g. "word" would be assigned to cluster 10001 if n = 5).
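
A minimal sketch of that truncation step (the helper name here is hypothetical, not part of the module):

    def cluster_label(word, clusters, bits):
        # All words whose bitstrings share the same first `bits` bits
        # land in the same cluster.
        return clusters[word][:bits]

    clusters = {"word": "100010010110"}
    assert cluster_label("word", clusters, 5) == "10001"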

NOTE: The implementation used can be found at https://github.com/mheilman/tan-clustering. With this implementation, using pointwise mutual information to create the clusters is much faster than using Brown clustering, so as of right now this module uses pointwise mutual information.

NOTE: I'm not sure how well this works; it seems to put a large share of the words into one big cluster, which probably isn't useful.

Code

Assuming that a word-to-cluster dictionary exists (e.g. in a pickled file somewhere), cluster_ngrams_from_dir goes through all the text files in a directory, parses each sentence in each file, and creates N-grams while replacing words with their cluster designations; a simplified sketch follows the argument list below.

The arguments are:

  • indir: path to a directory of .txt files
  • n: int. Order of the N-grams
  • clusters: dict mapping words to bitstrings
  • bits: int. Number of leading bits to consider; creates at most 2^bits clusters
  • stem: {'snowball', 'porter', 'lemma', None}. Which stemmer/lemmatizer to use. Defaults to None
  • stop_words: Boolean. Whether to include stopwords. Defaults to True
  • punctuation: Boolean. Whether to include punctuation. Defaults to True
  • freq: Boolean. Whether to return frequencies. Defaults to False
  • outdir: directory to write to. Defaults to the directory of fpath
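
The module's source is authoritative, but the core replace-then-N-gram loop might look roughly like this simplified sketch (whitespace tokenization stands in for real sentence parsing, and the stem, stop_words, punctuation, and outdir options are omitted):

    import os
    from collections import Counter

    def cluster_ngrams_from_dir_sketch(indir, n, clusters, bits, freq=False):
        """Simplified sketch: replace each token with the first `bits`
        bits of its cluster bitstring, then build N-grams over the
        cluster labels. Words missing from the dictionary are kept
        as-is here; the real module may handle them differently."""
        results = {}
        for fname in os.listdir(indir):
            if not fname.endswith(".txt"):
                continue
            with open(os.path.join(indir, fname)) as f:
                tokens = f.read().split()  # stand-in for sentence parsing
            labels = [clusters[t][:bits] if t in clusters else t
                      for t in tokens]
            ngrams = [tuple(labels[i:i + n])
                      for i in range(len(labels) - n + 1)]
            results[fname] = Counter(ngrams) if freq else ngrams
        return results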

Command Line

For command-line usage, type python clusters.py -h or python clusters.py --help. There are a few different scenarios:

If clustering hasn't been done and a corpus file doesn't exist

In this case the clustering algorithm needs to run, so the command line should include the flags:

  • --write_corpus: this will write a corpus file (each document on its own line, each token separated by whitespace)
  • --save_clusters: this will save the cluster information. You can put a directory after this flag; the default is a sub-directory of the directory containing the corpus called brown_files
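
A hypothetical invocation for this case (whether the input directory is passed positionally is an assumption; python clusters.py -h shows the actual signature):

    python clusters.py path/to/txt_files --write_corpus --save_clusters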

If clustering hasn't been done and a corpus file already exists

In this case the clustering algorithm needs to be run, but a corpus file doesn't need to be created, so include the flags:

  • --corpus_file: follow this flag with the path to the corpus file
  • --save_clusters: this will save the cluster information. You can put a directory after this flag; the default is a sub-directory of the directory containing the corpus called brown_files
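
Again a hypothetical invocation, with the same caveat about the positional argument:

    python clusters.py path/to/txt_files --corpus_file path/to/corpus.txt --save_clusters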

If clustering has already been done

In this case there should be a .pkl file containing a dictionary of the cluster information for each word (as created when the --save_clusters flag is included), so include the flag:

  • --cluster_file: follow this flag with the path to the .pkl file containing the word-to-cluster dictionary
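
Loading that dictionary back is plain pickle usage (the path below is illustrative, not a fixed module path):

    import pickle

    # Load the word-to-cluster dictionary saved by --save_clusters.
    with open("brown_files/clusters.pkl", "rb") as f:
        clusters = pickle.load(f)

    # Echoing the example above: the first 5 bits give the cluster at bits=5.
    print(clusters["word"][:5])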

TODO:

  • Include thresholding
  • Find a way to figure out the right number of clusters to get good results? (What's right? What are good results?)
