Questimate / N-gram Counts Features

These features are based on n-gram counts, that is, on frequencies rather than probabilities. Currently only frequency quartile features are implemented:

  • OOVUnigrams: Number of unigrams in the sentence not found in the n-gram counts model
  • UnigramsFreqQuartile1: Number of unigrams in the sentence whose frequencies are in frequency quartile 1
  • UnigramsFreqQuartile2: Number of unigrams in the sentence whose frequencies are in frequency quartile 2
  • UnigramsFreqQuartile3: Number of unigrams in the sentence whose frequencies are in frequency quartile 3
  • UnigramsFreqQuartile4: Number of unigrams in the sentence whose frequencies are in frequency quartile 4

It may be that only the quartile 1 and quartile 4 features (apart from OOV) are relevant, but the other two have been included in case they also prove useful. For example, these features could behave differently for the three cases mentioned above (text, POS tagged, and POS tagged nolex).

Similar features are calculated for bigrams and trigrams.
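
The quartile features can be illustrated with a minimal Python sketch. It is not the Questimate implementation, and it rests on several assumptions: the counts model is a plain dict mapping n-gram tuples to frequencies, the quartile boundaries are taken over the frequencies of the distinct n-grams in that model, quartile 1 holds the lowest frequencies, and a frequency that falls exactly on a boundary goes to the lower quartile. The helper names (`ngrams`, `quartile_cutoffs`, `freq_quartile_features`) and the `prefix` parameter are hypothetical, as are the bigram and trigram feature names that `prefix` would produce.

```python
from statistics import quantiles


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, in sentence order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def quartile_cutoffs(counts):
    """Return the three cut points that split the model's frequencies into quartiles.

    Assumption: boundaries are computed over the frequencies of the distinct
    n-grams in the counts model (which must hold at least two entries).
    """
    return quantiles(counts.values(), n=4, method="inclusive")


def freq_quartile_features(tokens, counts, n=1, prefix="Unigrams"):
    """Count the sentence's n-grams that are OOV or fall in each frequency quartile."""
    cutoffs = quartile_cutoffs(counts)
    feats = {f"OOV{prefix}": 0}
    feats.update({f"{prefix}FreqQuartile{q}": 0 for q in range(1, 5)})
    for gram in ngrams(tokens, n):
        freq = counts.get(gram)
        if freq is None:
            # Not found in the n-gram counts model.
            feats[f"OOV{prefix}"] += 1
            continue
        # Quartile index = 1 + number of cut points strictly below the frequency.
        quartile = 1 + sum(freq > cut for cut in cutoffs)
        feats[f"{prefix}FreqQuartile{quartile}"] += 1
    return feats


# Toy counts model, invented purely for illustration.
toy_counts = {("the",): 100, ("on",): 50, ("cat",): 10, ("sat",): 5, ("mat",): 1}
features = freq_quartile_features("the cat sat on the rug".split(), toy_counts)
```

For bigrams or trigrams, the same function would be called with `n=2` or `n=3`, a counts model keyed by 2- or 3-tuples, and a matching `prefix`.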

Since these features are calculated for three different models, for 1-, 2- and 3-grams, on the source as well as the target side, and have normalized and ratio versions, they constitute the largest category in terms of number of features. Many of them may not turn out to be relevant for specific cases and may have to be eliminated during the feature selection stage.
