Phraser

Transforms documents by linking together phrases using n-gram frequencies.
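As a rough illustration of what linking phrases by n-gram frequency can look like, the sketch below joins adjacent token pairs that occur often enough across the corpus. It is not taken from ``phrases.py``: the function name `link_phrases`, the `min_count` parameter, and the underscore joiner are all assumptions made for this example, and the actual implementation may score and select phrases differently.

```python
from collections import Counter


def link_phrases(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times in the corpus.

    Hypothetical helper for illustration only; not part of phrases.py.
    """
    # Count every adjacent pair of tokens (bigram) over all documents.
    bigram_counts = Counter(
        pair for tokens in docs for pair in zip(tokens, tokens[1:])
    )

    linked = []
    for tokens in docs:
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and bigram_counts[pair] >= min_count:
                # Frequent bigram: emit it as a single linked phrase.
                out.append("_".join(pair))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        linked.append(out)
    return linked


docs = [
    ["machine", "learning", "is", "fun"],
    ["i", "like", "machine", "learning"],
]
linked_docs = link_phrases(docs)
```

Here only the pair ("machine", "learning") occurs twice, so it is the only pair that gets linked; every other token is left as a unigram.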

The method main in ``phrases.py`` creates a streaming iterator over the corpus, then saves and outputs two objects: a sparse matrix whose rows are the documents and whose columns are the unigrams and phrases, and a Dictionary object whose id2token dict records which gram each column corresponds to. The arguments are:

  • indir: path to the directory of txt files.
  • tfidf: Boolean. Apply tf-idf.
  • stem: {'snowball','porter','lemma',None}. Stemmer to use. Defaults to None.
  • stop_words: Boolean. Include stopwords. Defaults to True.
  • tag: {'ap','nltk','stanford'}. POS tagger to use. Defaults to 'ap'.
  • allowed_tags: POS tags that are allowed for unigrams. Defaults to nouns.
  • punctuation: Boolean. Include punctuation. Defaults to True.
  • threshold: minimum number of documents a gram must be present in. Defaults to 5.
  • split_clauses: Boolean. Split on clauses.
  • outdir: directory to write to. Defaults to indir/results.
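
To make the output shape and the threshold argument concrete, here is a minimal sketch of a document-by-gram count matrix with an id2token mapping. It is an assumption-laden stand-in rather than the code in ``phrases.py``: the function `build_matrix` is hypothetical, and a list of `{column_id: count}` dicts is used in place of a real sparse matrix so the example runs with the standard library alone.

```python
from collections import Counter


def build_matrix(docs, threshold=5):
    """Build sparse rows plus an id2token dict from tokenized documents.

    Hypothetical helper for illustration only; not part of phrases.py.
    """
    # Document frequency: in how many documents does each gram appear?
    doc_freq = Counter(gram for grams in docs for gram in set(grams))

    # Keep only grams present in at least `threshold` documents.
    vocab = sorted(g for g, df in doc_freq.items() if df >= threshold)
    token2id = {gram: i for i, gram in enumerate(vocab)}
    id2token = {i: gram for gram, i in token2id.items()}

    # One sparse row per document: column id -> count of that gram.
    rows = []
    for grams in docs:
        counts = Counter(g for g in grams if g in token2id)
        rows.append({token2id[g]: c for g, c in counts.items()})
    return rows, id2token


docs = [
    ["neural_network", "model", "model"],
    ["neural_network", "data"],
    ["model", "data", "neural_network"],
    ["rare_gram"],
]
rows, id2token = build_matrix(docs, threshold=2)
```

With `threshold=2`, "rare_gram" appears in only one document and is dropped, so the last row is empty. Looking up a column id in `id2token` recovers the gram that column counts.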

Command Line
For command-line usage help, enter ``python phrases.py -h`` or ``python phrases.py --help``.
