
Query expansion and keyword extraction


Whoosh provides methods for computing the "key terms" of a set of documents. For these methods, "key terms" means terms that are frequent in the given documents but relatively infrequent in the indexed collection as a whole.

Because this is a purely statistical operation, not a natural language processing or AI function, the quality of the results will vary based on the content, the size of the document collection, and the number of documents for which you extract keywords.
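
To make the statistical idea concrete, here is a minimal pure-Python sketch that does not use Whoosh at all. The ``tokenize`` and ``key_terms`` helpers, the whitespace tokenizer, and the simple frequency-ratio score are all assumptions made for illustration; Whoosh's own weighting models are more refined. The sketch also assumes the selected documents are part of the collection, so every selected term has a non-zero collection count.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def key_terms(selected_docs, collection, numterms=3):
    """Rank terms that are frequent in selected_docs but rare in collection."""
    selected = Counter(t for doc in selected_docs for t in tokenize(doc))
    overall = Counter(t for doc in collection for t in tokenize(doc))
    selected_total = sum(selected.values())
    overall_total = sum(overall.values())

    scores = {}
    for term, count in selected.items():
        # Relative frequency in the selection divided by relative frequency
        # in the whole collection. A value above 1 means the term is
        # over-represented in the selected documents.
        in_selection = count / selected_total
        in_collection = overall[term] / overall_total
        scores[term] = in_selection / in_collection

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:numterms]


collection = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the render engine draws the scene",
    "the render pass lights the scene",
]
selected_docs = collection[2:]  # pretend these are the documents of interest
top_terms = key_terms(selected_docs, collection)
```

A word such as "the" appears often in the selected documents but just as often everywhere else, so it scores below 1 and drops out of the top terms. With only four tiny documents the remaining scores are nearly identical, which illustrates why result quality depends on the size of the collection.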

These methods can be useful for providing the following features to users:

  • Search term expansion. You can extract key terms for the top N results from a query and suggest them to the user as additional/alternate query terms to try.
  • Tag suggestion. Extracting the key terms for a single document may yield useful suggestions for tagging the document.
  • "More like this". You can extract key terms for the top ten or so results from a query, remove the original query terms, and use the remaining keywords as the basis for another query that may find more documents using terms the user didn't think of.
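
The "more like this" step of dropping the original query terms and building a follow-up query can be sketched as below. This is an illustrative helper, not part of Whoosh: ``expand_query``, its parameters, and the sample scores are hypothetical, and the ``OR``-joined string is only one plausible way to phrase the new query.

```python
def expand_query(query_terms, scored_terms, numterms=5):
    """Build an OR query string from key terms, minus the original terms.

    query_terms: the terms the user originally typed.
    scored_terms: (term, score) pairs from some key term extraction step.
    """
    original = {term.lower() for term in query_terms}
    # Sort by descending score, then alphabetically for stable output.
    ranked = sorted(scored_terms, key=lambda pair: (-pair[1], pair[0]))
    suggestions = [term for term, _score in ranked
                   if term.lower() not in original]
    return " OR ".join(suggestions[:numterms])


# Made-up scores, as if extracted from the top results for "render".
scored = [("render", 0.9), ("scene", 0.7), ("shader", 0.6), ("camera", 0.4)]
new_query = expand_query(["render"], scored, numterms=2)
# new_query == "scene OR shader"
```

The same list of suggestions, shown to the user instead of being joined into a query, would serve for search term expansion or tag suggestion.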


Expansion models

The ExpansionModel subclasses in the :mod:`whoosh.classify` module implement different weighting functions for keywords. These models were translated into Python from the original Java implementations in Terrier.
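
As an example of what such a weighting function can look like, the following standalone sketch computes a Bose-Einstein ("Bo1") style weight, a model associated with Terrier's query expansion. The function name, parameter names, and sample numbers are assumptions for illustration, and the sketch is not guaranteed to match the code in :mod:`whoosh.classify` term for term.

```python
from math import log2


def bo1_score(weight_in_top, weight_in_collection, doc_count):
    """Bose-Einstein (Bo1) style weight for a single term.

    weight_in_top: frequency of the term in the selected (top) documents.
    weight_in_collection: frequency of the term in the whole collection.
    doc_count: number of documents in the collection.
    """
    # Average number of occurrences of the term per document.
    f = weight_in_collection / doc_count
    # The first part grows with frequency in the selected documents and
    # shrinks as the term becomes more common in the collection.
    return weight_in_top * log2((1.0 + f) / f) + log2(1.0 + f)


# Same frequency in the top documents, different collection frequency.
rare = bo1_score(weight_in_top=5, weight_in_collection=10, doc_count=1000)
common = bo1_score(weight_in_top=5, weight_in_collection=900, doc_count=1000)
```

Here ``rare`` comes out higher than ``common``: both terms occur five times in the selected documents, but the term that is scarce in the collection as a whole is treated as more characteristic of them.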