
Feature_Extraction_Overview

This is an overview of the feature extraction step.

Code

Currently, to run the feature extraction code, use run_all.py, which is located in the feature extraction directory. run_all.py takes two arguments: the corpus directory and a configuration file containing all of the options. A sample configuration file (config.txt) and a file explaining each option (config_explain.txt) are in the feature extraction directory. The script creates sub-directories of the corpus directory for each type of feature extraction (e.g. ngrams_results) and saves the relevant data there.

The output from all of the feature extraction steps is combined into a single sparse matrix (currently in CSR form) and saved as final_results/feature_mat.pkl; this matrix can be used for the analysis. The names of the features (e.g. the gram for an n-gram feature or the topic number for a topic model feature) are saved as a list in final_results/feature_names.pkl. The index of a name in this list corresponds to its column in the feature matrix.
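For reference, the saved outputs can be loaded back along these lines (a minimal sketch; it assumes the files were written with Python's pickle module and that the matrix is a SciPy CSR matrix, as described above):

```python
# Minimal sketch: load the combined feature matrix and feature names.
# Assumes both files were written with Python's pickle module.
import pickle

with open("final_results/feature_mat.pkl", "rb") as f:
    feature_mat = pickle.load(f)      # SciPy CSR matrix: documents x features

with open("final_results/feature_names.pkl", "rb") as f:
    feature_names = pickle.load(f)    # list; index i names column i of the matrix

# Inspect one feature of one document.
docid, col = 0, 5
print(feature_names[col], feature_mat[docid, col])
```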

Input

The input is the pre-processed text from the pre-processing step. This is either a set of text files with an associated metadata CSV file, or a Postgres database with a DocText table and a Metadata table.
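As an illustration, reading either source might look like the sketch below; the file names and the DocText column names are assumptions, not part of the pipeline:

```python
# Illustrative sketch of the two input sources. File names and the
# DocText column names are assumptions.
import csv
import glob
import os

def load_from_files(corpus_dir, metadata_csv="metadata.csv"):
    """Read pre-processed text files and the associated metadata CSV."""
    with open(os.path.join(corpus_dir, metadata_csv)) as f:
        metadata = list(csv.DictReader(f))
    texts = {}
    for path in glob.glob(os.path.join(corpus_dir, "*.txt")):
        with open(path) as f:
            texts[os.path.basename(path)] = f.read()
    return texts, metadata

def load_from_postgres(conn):
    """Read documents from the DocText table (column names assumed)."""
    with conn.cursor() as cur:
        cur.execute("SELECT docid, text FROM DocText")
        return dict(cur.fetchall())
```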

Output

The output is the set of extracted features. The main thing is to establish a dictionary of features, so that each feature can be referred to by an integer identifier. Then a particular feature of a particular document can be indexed by [docid, gramid].
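In concrete terms, the dictionary and the indexing could look like this toy sketch (feature ids assigned in order of first appearance; all names are illustrative):

```python
# Toy sketch: build a feature dictionary and index counts by [docid, gramid].
from scipy.sparse import dok_matrix

docs = [["the", "cat", "sat"], ["the", "dog"]]   # toy tokenized corpus
feature_ids = {}                                 # feature -> integer id
for doc in docs:
    for gram in doc:
        feature_ids.setdefault(gram, len(feature_ids))

counts = dok_matrix((len(docs), len(feature_ids)), dtype=int)
for docid, doc in enumerate(docs):
    for gram in doc:
        counts[docid, feature_ids[gram]] += 1

counts = counts.tocsr()                          # CSR, as used downstream
print(counts[0, feature_ids["cat"]])             # frequency of "cat" in doc 0
```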

We will experiment with four formats.

The simplest version of this is just a CSV file for each document, with the name of the file being [docid].txt. Each row is [feature_id, frequency], where feature_id is an integer identifier that links to a separate dictionary CSV file, with rows [feature_id, feature], where "feature" corresponds to a gram, phrase, dependency relation, etc.
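A minimal sketch of writing this format (the helper names are made up for illustration):

```python
# Sketch of format 1: one CSV per document plus a shared dictionary CSV.
import csv

def write_doc_csv(docid, freqs):
    """freqs: dict mapping feature_id -> frequency for one document."""
    with open(f"{docid}.txt", "w", newline="") as f:
        writer = csv.writer(f)
        for feature_id, frequency in sorted(freqs.items()):
            writer.writerow([feature_id, frequency])

def write_dictionary_csv(feature_ids, path="dictionary.csv"):
    """feature_ids: dict mapping feature string -> integer id."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for feature, feature_id in sorted(feature_ids.items(), key=lambda kv: kv[1]):
            writer.writerow([feature_id, feature])
```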

Second, each document could be saved in a SciPy sparse matrix format (see http://www.philippsinger.info/?p=464). We could save it in HDF5 format using hickle: https://github.com/telegraphic/hickle.
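For example, a round trip with hickle might look like this (a sketch; it assumes hickle's support for SciPy sparse types, and the filename is illustrative):

```python
# Sketch of format 2: one sparse row vector per document, stored as HDF5.
import hickle
from scipy.sparse import csr_matrix

# Toy document vector: feature 17 occurs 3 times, feature 240 once.
doc_vec = csr_matrix(([3, 1], ([0, 0], [17, 240])), shape=(1, 10000))

hickle.dump(doc_vec, "doc_42.hkl")    # write HDF5
loaded = hickle.load("doc_42.hkl")    # read it back
assert (loaded != doc_vec).nnz == 0   # round-trip check
```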

Third, each document could consist of a row in a Postgres table, with columns [docid, gramids], where gramids is an array of integers.
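A sketch of this format with psycopg2 (the table name and connection string are assumptions; psycopg2 adapts Python lists to Postgres arrays):

```python
# Sketch of format 3: one row per document with an integer-array column.
import psycopg2

conn = psycopg2.connect("dbname=corpus")  # connection string assumed
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_grams (
            docid   integer PRIMARY KEY,
            gramids integer[]             -- feature ids present in the doc
        )
    """)
    cur.execute(
        "INSERT INTO doc_grams (docid, gramids) VALUES (%s, %s)",
        (42, [17, 240, 993]),             # psycopg2 maps lists to arrays
    )
```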

Fourth, each (docid, gramid) pair (if the gram shows up in the document) could be represented as a row in a Postgres table, with columns [docid, gramid, freq].
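And a sketch of the fourth format (table name again assumed); one advantage of this layout is that per-feature frequency queries become plain SQL aggregates:

```python
# Sketch of format 4: one row per (docid, gramid) pair with its frequency.
import psycopg2

conn = psycopg2.connect("dbname=corpus")  # connection string assumed
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS doc_gram_freqs (
            docid  integer,
            gramid integer,
            freq   integer,
            PRIMARY KEY (docid, gramid)
        )
    """)
    cur.execute("INSERT INTO doc_gram_freqs VALUES (%s, %s, %s)", (42, 17, 3))

    # Total frequency of one feature across the corpus:
    cur.execute("SELECT sum(freq) FROM doc_gram_freqs WHERE gramid = %s", (17,))
    print(cur.fetchone()[0])
```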
