This is an overview of the feature extraction step.
Currently, to run the feature extraction code, use run_all.py, which is located in the feature extraction directory. run_all.py takes a directory and a configuration file: the directory is the corpus directory, and the configuration file holds all of the options. The feature extraction directory contains a sample configuration file, config.txt, and a file explaining each option, config_explain.txt.
The script creates a sub-directory of the corpus directory for each type of feature extraction (e.g. ngrams_results) and saves the relevant data there. The output of all the feature extraction types is combined into a single sparse matrix (currently in CSR form), which can be used for the analysis and is saved as final_results/feature_mat.pkl. The names of the features (e.g. the gram for an ngram feature or the topic number for a topic model feature) are saved as a list in final_results/feature_names.pkl. The index of a name in the feature names list corresponds to its column in the feature matrix.
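The correspondence between the feature names list and the matrix columns can be sketched as follows. This is only an illustration: it uses plain Python lists for the three CSR arrays (data, indices, indptr) instead of loading the actual pickled matrix, and the feature names, values, and the row_features helper are made up for the example.

```python
# Hypothetical toy data: 2 documents, 4 features.
# Position i in feature_names labels column i of the matrix.
feature_names = ["the", "the cat", "topic_0", "topic_1"]

# CSR layout: the entries of row r are data[indptr[r]:indptr[r + 1]],
# and indices[...] gives the column of each of those entries.
data = [3.0, 1.0, 0.5, 2.0, 0.5]
indices = [0, 1, 2, 0, 3]
indptr = [0, 3, 5]


def row_features(row):
    """Return {feature name: value} for the nonzero entries of one row."""
    start, end = indptr[row], indptr[row + 1]
    return {
        feature_names[col]: val
        for col, val in zip(indices[start:end], data[start:end])
    }


doc0 = row_features(0)
doc1 = row_features(1)
```

If feature_mat.pkl holds a SciPy csr_matrix, the same three arrays are available as its data, indices, and indptr attributes, so the lookup works the same way.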
The input is the pre-processed text from the pre-processing step. This is either a set of text files with an associated metadata CSV file, or a Postgres database with a DocText table and a Metadata table.
The output is the set of extracted features. The main task is to establish a dictionary of features, so that each feature can be referred to by an integer identifier. A particular feature of a particular document can then be indexed by [docid, gramid].
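A minimal sketch of such a feature dictionary is shown below. It assumes each document has already been reduced to a list of feature strings; the function name build_feature_index and the sample documents are hypothetical, and identifiers are simply assigned in order of first appearance.

```python
from collections import Counter


def build_feature_index(docs):
    """Map each feature to an integer id and count it per document.

    docs: {docid: list of feature strings}
    Returns (feature_ids, counts), where counts is keyed by (docid, gramid).
    """
    feature_ids = {}
    counts = {}
    for docid, features in docs.items():
        for feature, freq in Counter(features).items():
            # Assign the next unused integer the first time a feature is seen.
            gramid = feature_ids.setdefault(feature, len(feature_ids))
            counts[(docid, gramid)] = freq
    return feature_ids, counts


# Hypothetical sample input.
docs = {0: ["tax", "cut", "tax"], 1: ["cut", "budget"]}
feature_ids, counts = build_feature_index(docs)
```

Only (docid, gramid) pairs that actually occur are stored, which keeps the structure sparse in the same way as the feature matrix.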
We will experiment with four formats.
The simplest version is a CSV file for each document, named [docid].txt. Each row is [feature_id, frequency], where feature_id is an integer identifier that links to a separate dictionary CSV file with rows [feature_id, feature]. Here "feature" is a gram, phrase, dependency relation, etc.
Third, each document could be a row in a Postgres table with columns [docid, gramids], where gramids is an array of integers.
Fourth, each (docid, gramid) pair for which the gram appears in the document could be a row in a Postgres table with columns [docid, gramid, freq].
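The per-document CSV format and the two table layouts described above can be compared with a small sketch. Everything here is an assumption for illustration: the sample data, the dictionary file name dictionary.csv, and the choice to list each gramid once in the array layout. The table layouts are shown only as Python tuples, with no database connection.

```python
import csv
import os
import tempfile

# Hypothetical inputs: a feature dictionary and per-document counts.
feature_ids = {"tax": 0, "cut": 1, "budget": 2}
doc_counts = {0: {0: 2, 1: 1}, 1: {1: 1, 2: 1}}  # docid -> {gramid: freq}


def write_csv_format(out_dir):
    """Write one [docid].txt per document plus a dictionary CSV file."""
    dict_path = os.path.join(out_dir, "dictionary.csv")  # assumed file name
    with open(dict_path, "w", newline="") as f:
        csv.writer(f).writerows((fid, feat) for feat, fid in feature_ids.items())
    for docid, freqs in doc_counts.items():
        with open(os.path.join(out_dir, f"{docid}.txt"), "w", newline="") as f:
            csv.writer(f).writerows(sorted(freqs.items()))


def array_rows():
    """Rows for a table with columns [docid, gramids]."""
    return [(docid, sorted(freqs)) for docid, freqs in sorted(doc_counts.items())]


def long_rows():
    """Rows for a table with columns [docid, gramid, freq]."""
    return [
        (docid, gramid, freq)
        for docid, freqs in sorted(doc_counts.items())
        for gramid, freq in sorted(freqs.items())
    ]


with tempfile.TemporaryDirectory() as tmp:
    write_csv_format(tmp)
    with open(os.path.join(tmp, "0.txt"), newline="") as f:
        doc0_rows = list(csv.reader(f))  # values come back as strings
```

The array layout keeps one row per document but drops the frequencies unless ids are repeated, while the long layout keeps frequencies at the cost of one row per (docid, gramid) pair.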