
Latent Dirichlet Allocation

This set of scripts uses the output of the feature extraction scripts to create a corpus for training a topic model using LDA. Here's a good non-technical explanation of LDA.

Code

These scripts require output from one of the feature extraction scripts. The functions load_train, train_save, and load_train_save are the important functions that run the rest. load_train_save loads the information output by a feature extraction script into a gensim-friendly structure, trains a topic model, and saves the result; the other two functions do either just the loading and training, or just the training and saving. Here's an explanation of the gensim-friendly format; a short sketch of it also appears after the argument list below. The arguments for load_train_save are:

  • dict_path: path to a pickled tuple t with the following structure:
    • t[0] = document frequencies over the corpus
    • t[1] = dict with grams as keys and ids as values
    • t[2] = dict with ids as keys and grams as values
      When running a feature extraction script, this is the output file named `doc_freqs` followed by the gram number
  • indir: path to a directory of pickle files, each containing a tuple t with the following structure:
    • t[0] = term frequencies over the corpus
    • t[1] = dict with grams as keys and ids as values
    • t[2] = dict with ids as keys and grams as values
      When running a feature extraction script, these are the pickle output files
  • num_topics: number of topics
  • outfile: file name to write to (without extension). Defaults to indir
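
The sketch below illustrates how the pickled tuples described above can map onto gensim's bag-of-words corpus and how a topic model could be trained from them. The file names and the per-document layout of the term frequencies are assumptions, not part of this repository; adjust them to match the actual pickle contents.

```python
# Minimal sketch of the gensim-friendly format, assuming each pickle holds
# the (frequencies, gram -> id, id -> gram) tuples described above.
import pickle
from gensim import models

# dict_path tuple: (document frequencies, gram -> id, id -> gram)
with open("doc_freqs_1.pkl", "rb") as f:        # assumed file name
    doc_freqs, gram2id, id2gram = pickle.load(f)

# indir pickle: (term frequencies, gram -> id, id -> gram)
# Assuming term frequencies are stored per document as {gram id: count},
# a gensim bag-of-words corpus is just a list of (id, count) pairs per document.
with open("corpus_chunk.pkl", "rb") as f:       # assumed file name
    term_freqs, _, _ = pickle.load(f)
corpus = [sorted(doc.items()) for doc in term_freqs]

# Train and save an LDA model directly with gensim.
lda = models.LdaModel(corpus=corpus, id2word=id2gram, num_topics=100)
lda.save("topic_model.lda")                     # assumed output name
```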
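A hedged usage example of load_train_save with the arguments listed above. The module name `lda_train` and the paths are placeholders; import the function from wherever it actually lives in this repository.

```python
# Hypothetical call to load_train_save; paths and module name are assumptions.
from lda_train import load_train_save

load_train_save(
    dict_path="features/doc_freqs_1",  # pickled (doc freqs, gram -> id, id -> gram)
    indir="features/pickles/",         # directory of per-chunk pickle files
    num_topics=100,                    # number of LDA topics to learn
    outfile="reviews_lda",             # output name, without extension
)
```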
