platform / Dependencies and Syntactic N-grams

This set of scripts produces dependency relations and syntactic n-grams.

Background on English dependency relations:

Dependency Parsing

The Dependency class has five attributes:

  • relation: the relation between the head word and the dependent word
  • head: the head word
  • dependent: the dependent word
  • headPos: the position of the head word
  • depPos: the position of the dependent word
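As a rough illustration of the five attributes above, the class could be sketched as a small dataclass. The field types and the 1-based word positions are assumptions made for this example, not taken from the scripts themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One dependency edge, holding the five attributes listed above."""
    relation: str   # relation between head and dependent, e.g. "nsubj"
    head: str       # the head word
    dependent: str  # the dependent word
    headPos: int    # position of the head word (assumed 1-based here)
    depPos: int     # position of the dependent word (assumed 1-based here)


# "dogs bark": "dogs" (position 1) is the nominal subject of "bark" (position 2).
dep = Dependency(relation="nsubj", head="bark", dependent="dogs",
                 headPos=2, depPos=1)
print(dep.head, dep.relation, dep.dependent)
```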

The method deps_from_file in gets the Dependency objects from the Stanford Dependency Parser


The process so far is as follows:

  1. Get Dependency Parse from Stanford Dependency Parser using
  2. Create SN-Grams up to length N
  3. (Optional) Stem the words
  4. Count the SN-Grams and create ID dictionaries
  5. Save to file
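Steps 2 and 4 can be sketched as follows. This is a minimal illustration, not the scripts' actual code: the function name sngrams, the namedtuple stand-in for the Dependency class, and the convention that position 0 is an artificial ROOT are all assumptions. An sn-gram is taken here to be any path of 1 to n words that follows head-to-dependent arcs in the parse.

```python
from collections import Counter, defaultdict, namedtuple

# Stand-in for the Dependency class, with the same five attributes.
Dependency = namedtuple("Dependency", "relation head dependent headPos depPos")


def sngrams(deps, n):
    """Return every head-to-dependent path of 1..n words as a tuple of words."""
    words = {}                    # position -> word
    children = defaultdict(list)  # head position -> dependent positions
    for d in deps:
        words[d.depPos] = d.dependent
        if d.headPos != 0:        # assumption: position 0 is the artificial ROOT
            words[d.headPos] = d.head
            children[d.headPos].append(d.depPos)

    grams = []

    def extend(path):
        # Record the current path, then grow it through each dependent.
        grams.append(tuple(words[p] for p in path))
        if len(path) < n:
            for child in children[path[-1]]:
                extend(path + [child])

    for pos in sorted(words):
        extend([pos])
    return grams


# "the dog barks"
deps = [
    Dependency("det", "dog", "the", 2, 1),
    Dependency("nsubj", "barks", "dog", 3, 2),
    Dependency("root", "ROOT", "barks", 0, 3),
]
counts = Counter(sngrams(deps, 2))
print(counts)
```

Unlike ordinary n-grams, the words in a path need not be adjacent in the sentence: ("barks", "dog") is an sn-gram here because "dog" depends on "barks".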

The method main in creates a streaming corpus iterator over the corpus, then saves and outputs two objects: a sparse matrix whose rows are the documents and whose columns are the sngrams, and a Dictionary object whose id2token dict records which sngram each column corresponds to. The arguments are:

  • indir: file path for input directory (only looks at .txt files)
  • n: Max number of words in sn-gram
  • tfidf: Boolean. Whether to do the tfidf transformation
  • relations: relations to exclude, defaults to [None]
  • stem: {'snowball','porter','lemma',None} what kind of stemmer to use. Defaults to None.
  • stop_words: Boolean. Whether to include stopwords. Defaults to True.
  • rel: Boolean. Whether to include relations in the sngrams. Defaults to False.
  • split_clauses: Boolean. Whether to split on clauses
  • outdir: file path for output directory, defaults to indir/results if outdir = None
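The relationship between the matrix columns and the id2token dict can be shown with a small sketch. Plain Python dicts stand in for the real sparse matrix and Dictionary object, and build_matrix is a hypothetical name used only for this example.

```python
from collections import Counter


def build_matrix(docs):
    """docs: one list of sn-grams per document.

    Returns (rows, id2token), where rows[i] maps column id -> count for
    document i and id2token maps column id -> sn-gram.
    """
    token2id = {}
    rows = []
    for grams in docs:
        row = Counter()
        for gram in grams:
            # Assign the next free column id the first time a gram is seen.
            col = token2id.setdefault(gram, len(token2id))
            row[col] += 1
        rows.append(dict(row))
    id2token = {col: gram for gram, col in token2id.items()}
    return rows, id2token


docs = [
    [("barks", "dog"), ("dog", "the")],
    [("barks", "dog"), ("barks", "dog")],
]
rows, id2token = build_matrix(docs)
print(rows)
print(id2token)
```

Only non-zero counts are stored per row, which is the property that makes a sparse representation worthwhile when most sn-grams occur in few documents.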

If indir is a directory, outdir is indir/sngram_results. If indir is a zip archive and zipdir is the directory containing it, outdir is zipdir/sngram_results. Two files are written:

  • outdir/mat.pkl: the sparse feature matrix
  • outdir/dictionary.pkl: the dictionary
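A minimal sketch of locating and loading the two output files, assuming they are ordinary pickle files and following the sngram_results rule stated above. The helper names resolve_outdir and load_results are hypothetical, and the demo pickles placeholder objects rather than real results.

```python
import pickle
import tempfile
import zipfile
from pathlib import Path


def resolve_outdir(indir, outdir=None):
    """Pick the results directory for a directory or zip-archive input."""
    if outdir is not None:
        return Path(outdir)
    indir = Path(indir)
    # A zip archive writes next to itself; a directory writes inside itself.
    is_zip = indir.is_file() and zipfile.is_zipfile(indir)
    base = indir.parent if is_zip else indir
    return base / "sngram_results"


def load_results(outdir):
    """Load the pickled matrix and dictionary from outdir."""
    outdir = Path(outdir)
    with open(outdir / "mat.pkl", "rb") as f:
        mat = pickle.load(f)
    with open(outdir / "dictionary.pkl", "rb") as f:
        dictionary = pickle.load(f)
    return mat, dictionary


with tempfile.TemporaryDirectory() as tmp:
    out = resolve_outdir(tmp)
    out.mkdir()
    # Placeholder objects stand in for the real matrix and dictionary.
    with open(out / "mat.pkl", "wb") as f:
        pickle.dump([{0: 1}], f)
    with open(out / "dictionary.pkl", "wb") as f:
        pickle.dump({0: ("barks", "dog")}, f)
    mat, dictionary = load_results(out)
    print(out.name, mat, dictionary)
```

When loading real results, the libraries that defined the pickled objects must be importable, and pickle files should only be loaded from a trusted source.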

Command Line

For command line usage help enter python -h or python --help.

From the command line, main is run with the given arguments.
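For illustration only, the arguments of main could map onto a command line along these lines. Every flag spelling below is a guess and may differ from the real script; only the argument names and the stated defaults come from the list above. Making indir and n positional is also an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments of main."""
    parser = argparse.ArgumentParser(
        description="Build sn-gram features from the .txt files in indir.")
    parser.add_argument("indir", help="input directory (or zip archive)")
    parser.add_argument("n", type=int, help="max number of words in an sn-gram")
    parser.add_argument("--tfidf", action="store_true",
                        help="apply the tfidf transformation")
    parser.add_argument("--relations", nargs="*", default=[None],
                        help="relations to exclude")
    parser.add_argument("--stem", choices=["snowball", "porter", "lemma"],
                        default=None, help="kind of stemmer to use")
    parser.add_argument("--no-stop-words", dest="stop_words",
                        action="store_false",
                        help="drop stopwords (they are included by default)")
    parser.add_argument("--rel", action="store_true",
                        help="include relations in the sn-grams")
    parser.add_argument("--split-clauses", action="store_true",
                        help="split on clauses")
    parser.add_argument("--outdir", default=None, help="output directory")
    return parser


# Parse an explicit argument list instead of sys.argv for the demo.
args = build_parser().parse_args(
    ["corpus", "3", "--stem", "porter", "--no-stop-words"])
print(vars(args))
```

argparse adds the -h and --help options automatically, so a parser built this way prints its usage message without extra code.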

TODO: Include extra options such as:


  • Will including relations cause the sn-grams to be so sparse that we never get any that occur more than a little bit?