Questimate / Feature Editing
Overview
The feature files created by the tool are in Avro format and use one of two schemas:
- One for corpora without n-best lists, optionally with post-edited sentences and references
- One for corpora with n-best lists, that is, where there are multiple hypotheses for each sentence
There are two ways in which these feature files can be edited:
- Through a shell script that allows a few operations on the files
- Through a GUI
The operations that can be performed are of the following kinds:
- Dataset operations: On the whole dataset, that is, the whole file
- Record operations: On specific records (sentence-hypothesis pairs)
Dataset Operations
- Merging files horizontally: Merging features for the same corpus
- Merging files vertically: Concatenating the feature records of different corpora
- Splitting files horizontally: To create subsets of features (not yet implemented)
- Splitting files vertically: To divide the dataset into smaller datasets
- Removing features to get a subset of features
- Selective horizontal merging: Taking features from one set and adding them to another to get a third feature set (for the same corpus)
- Designating a particular feature as the 'class' or 'label' for prediction through Questimate
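The merging, splitting, and feature-removal operations above can be illustrated with a minimal Python sketch. It is not the tool's implementation: it works on plain in-memory dictionaries instead of Avro files, and the record layout (an "id" key plus a "features" mapping) and all function names are assumptions made for illustration only.

```python
# Hypothetical record layout: {"id": ..., "features": {name: value, ...}}
# This is an assumption for illustration, not the actual Avro schema.

def merge_horizontally(left, right, key="id"):
    """Combine the features of records that describe the same corpus entry."""
    right_by_key = {rec[key]: rec for rec in right}
    merged = []
    for rec in left:
        other = right_by_key.get(rec[key])
        if other is None:
            raise KeyError(f"no matching record for {key}={rec[key]!r}")
        # Features from the right-hand file win on a name clash.
        features = {**rec["features"], **other["features"]}
        merged.append({key: rec[key], "features": features})
    return merged

def merge_vertically(first, second):
    """Concatenate the records of two different corpora."""
    return list(first) + list(second)

def split_vertically(records, size):
    """Divide a dataset into consecutive chunks of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def remove_features(records, names):
    """Return copies of the records without the named features."""
    return [
        {**rec, "features": {k: v for k, v in rec["features"].items()
                             if k not in names}}
        for rec in records
    ]
```

In this sketch, horizontal merging requires every record in the first file to have a counterpart in the second; whether the tool enforces the same rule is not stated in this page.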
Record Operations
- Removing any NaN values
- Removing empty hypotheses
- Adding ratio features, if they are not already there
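The record operations above can be sketched in the same way. The sketch assumes that "removing" a NaN value means dropping the whole record that contains it, that each record carries a "hypothesis" string, and that a ratio feature is one existing feature divided by another; the page does not specify any of these, and the field and function names are hypothetical.

```python
import math

# Hypothetical record layout:
# {"hypothesis": "...", "features": {name: value, ...}}

def has_nan(record):
    """True if any feature value of the record is a float NaN."""
    return any(isinstance(v, float) and math.isnan(v)
               for v in record["features"].values())

def clean_records(records):
    """Drop records with an empty hypothesis or a NaN feature value."""
    return [rec for rec in records
            if rec["hypothesis"].strip() and not has_nan(rec)]

def add_ratio_feature(record, numerator, denominator, name):
    """Add features[numerator] / features[denominator] under `name`.

    The record is left unchanged if the ratio is already present, if
    either input feature is missing, or if the denominator is zero.
    """
    feats = record["features"]
    if name in feats or numerator not in feats:
        return record
    if not feats.get(denominator):
        return record
    feats[name] = feats[numerator] / feats[denominator]
    return record
```

Skipping the ratio when the denominator is zero keeps the operation from introducing new NaN or infinite values into a file that has just been cleaned.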