Questimate / Feature Extraction
The following types of features can be extracted directly from the command line:
- Surface features
- Word translation or association features
- N-gram language model features
- N-gram counts features
- Soul LM scores
- POS counts
- Model scores from n-best lists
- N-gram posterior probability scores
There is also a special category of features, called Label features. These are the classes or labels to be predicted, based on the other features.
To extract these features, the tool can load several kinds of resources, such as n-best lists, n-gram language models, lattices, and IBM1 scores. It also relies on external tools such as POS taggers and tools for building language models. Some of these tools are written in Java, so they can be called directly from the API; others are invoked via shell scripts.
All currently supported features are global (sentence-level) 'dense' features, as opposed to 'sparse' ones.
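To illustrate the distinction between dense and sparse features, here is a minimal Java sketch. It is not Questimate code: the class `DenseVsSparse`, its method names, and the example features (token counts and per-word indicators) are all hypothetical and chosen only for illustration. A dense representation gives every sentence pair a value for the same fixed set of features, whereas a sparse one stores only the features that actually occur.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class DenseVsSparse {

    // Hypothetical fixed feature inventory for the dense representation.
    static final String[] DENSE_NAMES = {"source_length", "target_length", "length_ratio"};

    // Whitespace tokenization; assumes the text is already tokenized.
    static String[] tokens(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Dense: one value per feature in DENSE_NAMES, in the same order, for every sentence pair.
    static double[] denseVector(String source, String target) {
        double s = tokens(source).length;
        double t = tokens(target).length;
        double ratio = (t == 0.0) ? 0.0 : s / t; // guard against empty target
        return new double[] {s, t, ratio};
    }

    // Sparse: only features that fire are stored, here one count per observed target word.
    static Map<String, Double> sparseVector(String target) {
        Map<String, Double> features = new TreeMap<>();
        for (String token : tokens(target)) {
            features.merge("word=" + token, 1.0, Double::sum);
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(DENSE_NAMES));
        System.out.println(Arrays.toString(denseVector("a b c", "x y")));
        System.out.println(sparseVector("the cat the"));
    }
}
```

The dense vector always has the same length regardless of the input, which is the property that the supported features share.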
Most features come in the following variants:
- Value on the source side
- Value on the target side
- Normalized values for the source and the target side
- Ratio features (source to target)
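The variants listed above can be sketched for a single surface feature as follows. This is a hedged illustration, not the tool's actual implementation: the class `FeatureVariants`, the feature names, the choice of punctuation count as the example feature, and normalization by sentence length in tokens are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FeatureVariants {

    // Whitespace tokenization; assumes the input is already tokenized.
    static String[] tokenize(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Example surface feature: number of tokens made up only of ASCII punctuation.
    static int countPunctuation(String sentence) {
        int count = 0;
        for (String token : tokenize(sentence)) {
            if (token.matches("\\p{Punct}+")) {
                count++;
            }
        }
        return count;
    }

    // Division that returns 0.0 instead of NaN or infinity when the denominator is zero.
    static double divide(double numerator, double denominator) {
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    // Computes the source, target, normalized, and ratio variants of the feature.
    static Map<String, Double> variants(String source, String target) {
        double src = countPunctuation(source);
        double tgt = countPunctuation(target);
        Map<String, Double> features = new LinkedHashMap<>();
        features.put("punct_source", src);
        features.put("punct_target", tgt);
        // Assumed normalization: divide by the sentence length in tokens.
        features.put("punct_source_norm", divide(src, tokenize(source).length));
        features.put("punct_target_norm", divide(tgt, tokenize(target).length));
        // Ratio feature: source value over target value.
        features.put("punct_ratio", divide(src, tgt));
        return features;
    }

    public static void main(String[] args) {
        Map<String, Double> f = variants("Hello , world !", "Bonjour le monde !");
        f.forEach((name, value) -> System.out.println(name + "\t" + value));
    }
}
```

Returning the variants in a fixed insertion order keeps the output dense: every sentence pair yields the same five columns.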
The Java API is flexible enough to make it easy to add code for extracting many other kinds of features.