Questimate / Feature Editing

Overview

The feature files created by the tool are in Avro format and use one of two schemas:

  • One for corpora without n-best lists, but possibly with post-edited sentences and references
  • One for corpora with n-best lists, that is, for the case where there are multiple hypotheses for each sentence

There are two ways in which these feature files can be edited:

  • Through a shell script that allows a few operations on the files
  • Through a GUI

The operations that can be performed are of the following kinds:

  • Dataset operations: Applied to the whole dataset or file
  • Record operations: On specific records (sentence-hypothesis pairs)

Dataset Operations

  • Merging files horizontally: Merging features for the same corpus
  • Merging files vertically: Concatenating the features for different corpora
  • Splitting files horizontally: To create subsets of features (Not yet implemented)
  • Splitting files vertically: To divide the dataset into smaller datasets
  • Removing features to get a subset of features
  • Selective horizontal merging: Taking features from one set and adding them to another to get a third feature set (for the same corpus)
  • Designating a particular feature as the 'class' or 'label' for prediction through Questimate
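
The dataset operations above can be illustrated with a small sketch. This is not the tool's actual code: it assumes each record has already been read from the Avro file into a Python dict of feature names to values, and that records for the same corpus can be aligned through a hypothetical "id" field. All function names here are illustrative only.

```python
def merge_horizontal(left, right, key="id"):
    """Combine two feature sets for the same corpus, record by record.

    The "id" key is an assumption used to align records; features from
    `right` overwrite same-named features from `left`.
    """
    right_by_key = {rec[key]: rec for rec in right}
    merged = []
    for rec in left:
        other = right_by_key.get(rec[key])
        if other is None:
            raise KeyError(f"no matching record for {key}={rec[key]!r}")
        combined = dict(rec)
        for name, value in other.items():
            if name != key:
                combined[name] = value
        merged.append(combined)
    return merged


def merge_vertical(first, second):
    """Concatenate the records of two corpora into one dataset."""
    return list(first) + list(second)


def split_vertical(records, size):
    """Divide a dataset into consecutive chunks of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def remove_features(records, names, key="id"):
    """Drop the named features from every record, keeping the alignment key."""
    drop = set(names) - {key}
    return [{k: v for k, v in rec.items() if k not in drop} for rec in records]
```

In this sketch, selective horizontal merging would amount to calling remove_features on one set to keep only the wanted features and then passing the result to merge_horizontal.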

Record Operations

  • Removing any NaN values
  • Removing empty hypotheses
  • Adding ratio features, if they are not already there
