platform / N-grams

This is a general-purpose tool for extracting N-gram frequencies (frequency distributions over phrases of length N) from documents.

  • Input: a directory of pre-processed text files.

The process so far is as follows:

  1. (Optional) Get POS tags and combine them with the words
  2. Create N-grams of length N
  3. (Optional) Stem the words
  4. Count the N-grams and create ID dictionaries
  5. Combine counts to make document frequencies
  6. (Optional) Delete N-grams that appear in too few documents (see the threshold argument below)
  7. Save to file
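
Steps 2, 4, 5 and 6 can be sketched in plain Python as follows. This is only an illustration, not the code in ngrams.py: the function names make_ngrams, document_frequencies and filter_by_threshold are hypothetical, whitespace tokenization is assumed, and the sketch assumes that an N-gram is kept when it appears in at least threshold documents.

```python
from collections import Counter


def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_frequencies(docs, n):
    """Count, for each n-gram, the number of documents it occurs in."""
    df = Counter()
    for tokens in docs:
        # set() so that each document contributes at most 1 per n-gram
        df.update(set(make_ngrams(tokens, n)))
    return df


def filter_by_threshold(df, threshold):
    """Keep n-grams present in at least `threshold` documents (assumed rule)."""
    return {gram: count for gram, count in df.items() if count >= threshold}


docs = [
    "the quick brown fox".split(),
    "the quick red fox".split(),
    "a quick brown dog".split(),
]
df = document_frequencies(docs, n=2)
kept = filter_by_threshold(df, threshold=2)
print(sorted(kept.items()))
```

With these three toy documents, only the bigrams "the quick" and "quick brown" occur in two documents, so they are the only ones that survive a threshold of 2.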

The main method in ngrams.py creates a streaming iterator over the corpus, then saves and outputs two objects:
a sparse matrix whose rows are the documents and whose columns are the n-grams, and a Dictionary object whose
id2token dict records which n-gram each column corresponds to.
The arguments are:

  • fpath: path of file
  • n: order of the n-grams (the N in N-gram)
  • tfif: Boolean. Whether to apply the tf-idf transformation
  • stem: {'snowball','porter','lemma',None} what kind of stemmer to use. Defaults to None.
  • stop_words: Boolean. Whether to include stopwords. Defaults to True
  • tag: {'ap','nltk','stanford',None}. What POS tagger to use. Defaults to None
  • tag_pattern: list of tag patterns to allow in simplified form. Defaults to None. 'default' gives the default set of tag patterns. See below for more on the simplified form and the default tag patterns.
  • punctuation: Boolean. Whether to include punctuation. Defaults to True
  • threshold: minimum number of documents a gram has to be present in. Defaults to 5.
  • split_clauses: Boolean. Whether to split on clauses
  • outdir: directory to write to. Defaults to indir/ngram_results
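
To illustrate the kind of output described above, the following sketch builds an ID dictionary and sparse document rows from per-document n-gram counts, with an optional tf-idf weighting. It is a simplified stand-in rather than the tool's implementation: the helper names build_id2token, to_sparse_rows and tfidf are hypothetical, rows are plain {column_id: value} dicts instead of whatever matrix type ngrams.py uses, and the weighting tf * log(N / df) is one common variant that may differ from the tool's.

```python
import math
from collections import Counter


def build_id2token(doc_counts):
    """Assign a column ID to every n-gram, in sorted order for determinism."""
    vocab = sorted({gram for counts in doc_counts for gram in counts})
    return dict(enumerate(vocab))


def to_sparse_rows(doc_counts, id2token):
    """One {column_id: count} dict per document; zero entries are omitted."""
    token2id = {tok: i for i, tok in id2token.items()}
    return [{token2id[g]: c for g, c in counts.items()} for counts in doc_counts]


def tfidf(rows):
    """Weight each count by log(N / df), one common tf-idf variant."""
    n_docs = len(rows)
    df = Counter(col for row in rows for col in row)
    return [{col: tf * math.log(n_docs / df[col]) for col, tf in row.items()}
            for row in rows]


doc_counts = [
    Counter({"quick brown": 2, "brown fox": 1}),
    Counter({"quick brown": 1, "red fox": 1}),
]
id2token = build_id2token(doc_counts)
rows = to_sparse_rows(doc_counts, id2token)
weighted = tfidf(rows)
print(id2token)
print(rows)
```

An n-gram that occurs in every document gets weight 0 under this variant, which is why "quick brown" drops out of the weighted rows in practice.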

Simplified tag pattern form:
An example of a mapping from universal to simplified tags is VBN -> V or NN -> N. An example tag pattern is "AN". From the command line, tag patterns are specified as in the following example: --tp AN VN NVV. Passing --tp default selects the default set of tag patterns: 'AN', 'NN', 'AAN', 'ANN', 'NAN', 'NPN', 'VN', 'VAN', 'VNN', 'VPN', 'ANV', 'NVV', 'VDN'. (From a paper? Which one?)
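
As a rough illustration of how simplified tag patterns could be applied, the sketch below maps each tag to a single letter and keeps only the n-grams whose letter sequence is an allowed pattern. Only VBN -> V and NN -> N come from the text above; the rest of the mapping (JJ -> A, IN -> P, DT -> D, and the first-letter rule itself) and the names simplify and matches_pattern are assumptions, not the mapping ngrams.py actually uses.

```python
# Assumed prefix-to-letter mapping; only V* -> V and N* -> N are documented.
SIMPLIFIED = {"V": "V", "N": "N", "J": "A", "I": "P", "D": "D"}

DEFAULT_PATTERNS = {
    "AN", "NN", "AAN", "ANN", "NAN", "NPN", "VN",
    "VAN", "VNN", "VPN", "ANV", "NVV", "VDN",
}


def simplify(tag):
    """Map a tag such as 'VBN' to one letter; unknown tags become 'X'."""
    return SIMPLIFIED.get(tag[:1], "X")


def matches_pattern(tagged_ngram, patterns=DEFAULT_PATTERNS):
    """True if the n-gram's simplified tag sequence is an allowed pattern."""
    signature = "".join(simplify(tag) for _, tag in tagged_ngram)
    return signature in patterns


ngrams = [
    (("red", "JJ"), ("fox", "NN")),                      # A N   -> kept
    (("the", "DT"), ("fox", "NN")),                      # D N   -> dropped
    (("fox", "NN"), ("has", "VBZ"), ("eaten", "VBN")),   # N V V -> kept
]
kept = [gram for gram in ngrams if matches_pattern(gram)]
print(kept)
```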

The output directory depends on indir. If indir is a directory, outdir is indir/ngram_results. If indir is a zip archive
and zipdir is the directory containing that archive, outdir is zipdir/ngram_results. The following files are written to outdir:

  • outdir/mat.pkl: the sparse feature matrix
  • outdir/dictionary.pkl: the dictionary
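
The output-location rule above can be sketched as follows. The function names resolve_outdir and output_paths are hypothetical, and using zipfile.is_zipfile to detect archives is an assumption about how the check could be made, not a description of ngrams.py.

```python
import os
import zipfile


def resolve_outdir(indir):
    """Return the ngram_results directory for a directory or zip archive."""
    if os.path.isdir(indir):
        base = indir
    elif zipfile.is_zipfile(indir):
        # For an archive, results go next to it, in its containing directory.
        base = os.path.dirname(os.path.abspath(indir))
    else:
        raise ValueError("indir must be a directory or a zip archive: %r" % indir)
    return os.path.join(base, "ngram_results")


def output_paths(indir):
    """Paths of the two pickled outputs named in the list above."""
    outdir = resolve_outdir(indir)
    return (os.path.join(outdir, "mat.pkl"),
            os.path.join(outdir, "dictionary.pkl"))
```

Either file can then be read back with pickle.load, provided the libraries that define the pickled objects are importable.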

Command Line

For command-line usage help, run python ngrams.py -h or python ngrams.py --help.
