
platform / Word2Vec Word Vectors

Word2Vec

This feature extractor constructs word vectors using Google's Word2Vec algorithm, implemented in gensim.models.Word2Vec.

Code

The functions load_train and load_train_save in embedding.py are the entry points; they call everything else. load_train_save loads the corpus in the proper form, trains Word2Vec, and saves three things: the trained Word2Vec object, a matrix in which each row is the average word vector over all the words in a document, and a list recording which row corresponds to which document. The arguments for load_train_save are:

  • indir: path to directory of txt files
  • stem: one of {'snowball', 'porter', 'lemma', None}. Which stemmer to use. Defaults to None.
  • stop_words: Boolean. Whether to include stopwords. Defaults to True
  • punctuation: Boolean. Whether to include punctuation. Defaults to True
  • split_clauses: Boolean. Whether to split the text on clauses.
  • size, window, min_count, workers: see the gensim Word2Vec documentation.
  • finalize: Boolean. Whether to finalize the model so that no further training can be done (saves memory). Defaults to True.
  • mem_efficient: Boolean. Whether to use memory-efficient corpus loading (in case the corpus is larger than what fits in memory). Defaults to False.
  • outfile: file name to write to (without extension). Defaults to indir/results/word2vec
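
To make the saved document matrix and row list more concrete, here is a minimal sketch of the averaging step in plain Python. It is not the code in embedding.py: the names StreamingCorpus and average_vectors are hypothetical, the toy lookup table stands in for a trained Word2Vec model, and the whitespace tokenization, the skipping of out-of-vocabulary words, and the all-zero row for documents with no known words are all assumptions made for illustration. Reading one file at a time is one plausible reading of what mem_efficient does.

```python
import os
import tempfile


class StreamingCorpus:
    """Yield (file name, tokens) for each .txt file in a directory, one at a time.

    Hypothetical helper: only one document is held in memory at once.
    """

    def __init__(self, indir):
        self.indir = indir

    def __iter__(self):
        for name in sorted(os.listdir(self.indir)):
            if name.endswith(".txt"):
                path = os.path.join(self.indir, name)
                with open(path, encoding="utf-8") as handle:
                    # Assumption: lowercase and split on whitespace.
                    yield name, handle.read().lower().split()


def average_vectors(corpus, vectors, dim):
    """Return (matrix, doc_names); row i is the mean vector of doc_names[i]."""
    matrix, doc_names = [], []
    for name, tokens in corpus:
        # Assumption: words without a vector are skipped.
        known = [vectors[t] for t in tokens if t in vectors]
        if known:
            row = [sum(column) / len(known) for column in zip(*known)]
        else:
            # Assumption: a document with no known words gets a zero row.
            row = [0.0] * dim
        matrix.append(row)
        doc_names.append(name)
    return matrix, doc_names


# Toy 2-dimensional vectors standing in for a trained model's lookup table.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [1.0, 1.0]}

with tempfile.TemporaryDirectory() as indir:
    for name, text in [("a.txt", "cat dog"), ("b.txt", "fish fish unknown")]:
        with open(os.path.join(indir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    matrix, doc_names = average_vectors(StreamingCorpus(indir), toy_vectors, dim=2)

print(doc_names)  # ['a.txt', 'b.txt']
print(matrix)     # [[0.5, 0.5], [1.0, 1.0]]
```

The list doc_names plays the role of the saved list that maps rows back to documents: row i of the matrix belongs to doc_names[i].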
