Files changed (21)
-#. Break the token stream into "fragments" (there are several different styles of fragmentation available).
-Calling ``index.create_in`` on a directory with an existing index will clear the current contents of the index.
-(Alternatively you can simply delete the index's files from the directory, e.g. if you only have one index in the directory, use ``shutil.rmtree`` to remove the directory and then recreate it.)
-Once you've created an Index object, you can add documents to the index with an ``IndexWriter`` object. The easiest way to get the ``IndexWriter`` is to call ``Index.writer()``::
-The IndexWriter's ``add_document(**kwargs)`` method accepts keyword arguments where the field name is mapped to a value::
-You don't have to fill in a value for every field. Whoosh doesn't care if you leave out a field from a document.
-Indexed fields must be passed a unicode value. Fields that are stored but not indexed (i.e. the STORED field type) can be passed any pickle-able object.
-Whoosh will happily allow you to add documents with identical values, which can be useful or annoying depending on what you're using the library for::
-This adds two documents to the index with identical path and title fields. See "updating documents" below for information on the update_document method, which uses "unique" fields to replace old documents instead of appending.
-If you have a field that is both indexed and stored, you can index a unicode value but store a different object if necessary (it's usually not, but sometimes this is really useful) using a "special" keyword argument _stored_<fieldname>. The normal value will be analyzed and indexed, but the "stored" value will show up in the results::
-An ``IndexWriter`` object is kind of like a database transaction. You specify a bunch of changes to the index, and then "commit" them all at once.
-If you want to close the writer without committing the changes, call ``cancel()`` instead of ``commit()``::
-Keep in mind that while you have a writer open (including a writer you opened that is still in scope), no other thread or process can get a writer or modify the index. A writer also keeps several files open. So you should always remember to call either ``commit()`` or ``cancel()`` when you're done with a writer object.
-A Whoosh index is really a container for one or more "sub-indexes" called segments. When you add documents to an index, instead of integrating the new documents with the existing documents (which could potentially be very expensive, since it involves resorting all the indexed terms on disk), Whoosh creates a new segment next to the existing segment. Then when you search the index, Whoosh searches both segments individually and merges the results so the segments appear to be one unified index. (This smart design is copied from Lucene.)
-So, having a few segments is more efficient than rewriting the entire index every time you add some documents. But searching multiple segments does slow down searching somewhat, and the more segments you have, the slower it gets. So Whoosh has an algorithm that runs when you call commit() that looks for small segments it can merge together to make fewer, bigger segments.
- The default: uses a heuristic (taken from KinoSearch) based on the Fibonacci sequence to merge "small" segments together.
- Do not merge segments, even if it means creating lots of small segments. This may be useful if you don't want to pay any speed penalty for merging when you're doing lots of small adds to the index. You'll want to somehow schedule an "optimization" (see below) at some point to merge the segments.
-The Index object also has an ``optimize()`` method that lets you optimize the index (merge all the segments together). It simply creates a writer and calls ``commit(OPTIMIZE)`` on it.
-(NO_MERGE, MERGE_SMALL, and OPTIMIZE are actually callables that implement the different "policies". It is possible for an expert user to implement a different algorithm for merging segments.)
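The idea of merging on commit can be sketched with plain Python lists. This is a toy illustration only, not Whoosh's policy code: the ``merge_small`` and ``optimize`` functions and the fixed size threshold are hypothetical, and the sketch does not implement the Fibonacci-based heuristic.

```python
def merge_small(segments, threshold=3):
    """Toy policy: combine every segment holding fewer than `threshold` documents.

    `segments` is a list of lists of documents. The function name and the
    fixed threshold are hypothetical, chosen only for illustration.
    """
    small = [seg for seg in segments if len(seg) < threshold]
    large = [seg for seg in segments if len(seg) >= threshold]
    if len(small) < 2:
        # Nothing worth merging: leave the segment list as it is.
        return list(segments)
    merged = [doc for seg in small for doc in seg]
    return large + [merged]


def optimize(segments):
    """Toy equivalent of a full optimization: merge everything into one segment."""
    return [[doc for seg in segments for doc in seg]]


segments = [["a", "b", "c", "d"], ["e"], ["f", "g"]]
after_commit = merge_small(segments)    # the two small segments are combined
after_optimize = optimize(segments)     # a single segment remains
```

A "policy" here is just a callable that takes the current segments and returns the new list, which is why swapping in a different merging algorithm is possible.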
-You can delete documents using identical methods on either the Index object or the IndexWriter object. In both cases, you need to call ``commit()`` on the object to write the deletions to disk.
- Deletes any documents where the given (indexed) field contains the given term. This is mostly useful for ID or KEYWORD fields.
-Note that "deleting" a document simply adds the document number to a list of deleted documents stored with the index. When you search the index, it knows not to return deleted documents in the results. However, the document's contents are still stored in the index, and certain statistics (such as term document frequencies) are not updated, until you merge the segments containing deleted documents (see merging above). (This is because removing the information immediately from the index would essentially involve rewriting the entire index on disk, which would be very inefficient.)
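The behaviour described above (mark as deleted now, physically remove at merge time) can be sketched in a few lines. ``ToySegment`` is a hypothetical stand-in written for this illustration, not a Whoosh class.

```python
class ToySegment:
    """Hypothetical stand-in for a segment: stored documents plus a deleted set."""

    def __init__(self, docs):
        self.docs = list(docs)   # document contents stay in place
        self.deleted = set()     # document numbers marked as deleted

    def delete_document(self, docnum):
        # "Deleting" only records the document number.
        self.deleted.add(docnum)

    def search(self, word):
        # Deleted documents are filtered out at search time.
        return [n for n, text in enumerate(self.docs)
                if n not in self.deleted and word in text.split()]

    def doc_frequency(self, word):
        # Statistics still count deleted documents until a merge happens.
        return sum(1 for text in self.docs if word in text.split())

    def merged(self):
        # Merging rewrites the segment without the deleted documents.
        return ToySegment(text for n, text in enumerate(self.docs)
                          if n not in self.deleted)


seg = ToySegment(["red apple", "green apple", "red pear"])
seg.delete_document(0)
hits = seg.search("red")                        # document 0 is hidden
stale_freq = seg.doc_frequency("red")           # still counts document 0
fresh_freq = seg.merged().doc_frequency("red")  # corrected after merging
```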
-If you want to "replace" (re-index) a document, you can delete the old document using one of the ``delete_*`` methods on ``Index`` or ``IndexWriter``, then use ``IndexWriter.add_document`` to add the new version. Or, you can use ``IndexWriter.update_document`` to do this in one step.
-For ``update_document`` to work, you must have marked at least one of the fields in the schema as "unique". Whoosh will then use the contents of the "unique" field(s) to search for documents to delete::
-If no existing document matches the unique fields of the document you're updating, update_document acts just like add_document.
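A minimal sketch of the delete-then-add behaviour, using a plain list of dictionaries in place of an index. The two functions below only share their names with the writer methods for readability. Deleting a document when *any* unique field matches is an assumption made for this illustration.

```python
def add_document(docs, **fields):
    """Append a document; nothing checks for duplicates."""
    docs.append(dict(fields))


def update_document(docs, unique_fields, **fields):
    """Delete documents matching a unique field value, then add the new one."""
    for name in unique_fields:
        if name in fields:
            # Keep only the documents whose value for this field differs.
            docs[:] = [d for d in docs if d.get(name) != fields[name]]
    add_document(docs, **fields)


docs = []
add_document(docs, path="/a", title="First")
add_document(docs, path="/a", title="First")    # duplicates are allowed
count_after_adds = len(docs)

update_document(docs, ["path"], path="/a", title="Second")  # replaces both
update_document(docs, ["path"], path="/b", title="Other")   # no match: plain add
```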
-"Unique" fields and update_document are simply convenient shortcuts for deleting and adding. Whoosh has no inherent concept of a unique identifier, and in no way enforces uniqueness when you use add_document.
-When you're indexing a collection of documents, you'll often want two code paths: one to index all the documents from scratch, and one to only update the documents that have changed (leaving aside web applications where you need to add/update documents according to user actions).
-Now, for a small collection of documents, indexing from scratch every time might actually be fast enough. But for large collections, you'll want to have the script only re-index the documents that have changed.
-To start we'll need to store each document's last-modified time, so we can check if the file has changed. In this example, we'll just use the mtime for simplicity::
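A minimal sketch of the change check, independent of Whoosh. The ``changed_files`` helper and the ``indexed_mtimes`` dictionary (standing in for modification times stored in the index) are hypothetical names introduced for this illustration.

```python
import os
import tempfile


def changed_files(paths, indexed_mtimes):
    """Return the paths whose mtime differs from the one stored at index time.

    `indexed_mtimes` maps path -> mtime recorded when the file was last
    indexed (a hypothetical stand-in for a stored field in the index).
    """
    to_index = []
    for path in paths:
        mtime = os.path.getmtime(path)
        if indexed_mtimes.get(path) != mtime:
            to_index.append(path)
    return to_index


with tempfile.TemporaryDirectory() as dirname:
    a = os.path.join(dirname, "a.txt")
    b = os.path.join(dirname, "b.txt")
    for path in (a, b):
        with open(path, "w") as f:
            f.write("hello")
        os.utime(path, (1000, 1000))   # fix atime/mtime for a repeatable demo

    stored = {p: os.path.getmtime(p) for p in (a, b)}   # "index" both files
    os.utime(b, (2000, 2000))                           # simulate an edit to b

    # With nothing indexed yet, every file needs indexing.
    first_pass = [os.path.basename(p) for p in changed_files([a, b], {})]
    # After indexing, only the modified file is picked up.
    second_pass = [os.path.basename(p) for p in changed_files([a, b], stored)]
```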
-Whoosh provides methods for computing the "key terms" of a set of documents. For these methods, "key terms" basically means terms that are frequent in the given documents, but relatively infrequent in the indexed collection as a whole.
-Because this is a purely statistical operation, not a natural language processing or AI function, the quality of the results will vary based on the content, the size of the document collection, and the number of documents for which you extract keywords.
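To make "frequent in the given documents, but relatively infrequent in the collection" concrete, here is a small sketch using a TF-IDF style weight. The weighting is an assumption chosen for illustration and is not one of the weighting functions shipped with Whoosh. The sketch also assumes the selected documents are part of the collection.

```python
import math
from collections import Counter


def key_terms(selected_docs, all_docs, numterms=5):
    """Rank terms that are frequent in `selected_docs` but rare in `all_docs`."""
    # How often each term occurs in the selected documents.
    term_freq = Counter(w for doc in selected_docs for w in doc.split())
    # In how many documents of the whole collection each term occurs.
    doc_freq = Counter(w for doc in all_docs for w in set(doc.split()))
    total = len(all_docs)
    scored = [(tf * math.log(total / doc_freq[word]), word)
              for word, tf in term_freq.items()]
    # Highest weight first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:numterms]]


collection = [
    "python search guide",
    "python index guide",
    "search engine guide",
    "pasta recipe guide",
    "pasta sauce recipe pasta guide",
]
# "guide" is frequent in the selection but appears everywhere, so it ranks last.
terms = key_terms(collection[3:], collection, numterms=4)
```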
-* Search term expansion. You can extract key terms for the top N results from a query and suggest them to the user as additional/alternate query terms to try.
-* Tag suggestion. Extracting the key terms for a single document may yield useful suggestions for tagging the document.
-* "More like this". You can extract key terms for the top ten or so results from a query (removing the original query terms), and use those key words as the basis for another query that may find more documents using terms the user didn't think of.
- Use the :meth:`~whoosh.searching.Searcher.document_number` or :meth:`~whoosh.searching.Searcher.document_numbers` methods of the :class:`whoosh.searching.Searcher` object to get the document numbers for the document(s) you want to extract keywords from.
- Use the :meth:`~whoosh.searching.Searcher.key_terms` method of :class:`whoosh.searching.Searcher` to extract the keywords, given the list of document numbers.
- For example, let's say you have an index of emails. To extract key terms from the ``content`` field of emails whose ``emailto`` field contains ``email@example.com``::
- Use the :meth:`~whoosh.searching.Results.key_terms` method of the :class:`whoosh.searching.Results` object to extract keywords from the top N documents of the result set.
- For example, to extract *five* key terms from the ``content`` field of the top *ten* documents of a results object::
-The ``ExpansionModel`` subclasses in the :mod:`whoosh.classify` module implement different weighting functions for key words. These models are translated into Python from original Java implementations in Terrier.
-The job of a query parser is to convert a *query string* submitted by a user into *query objects* (objects from the :mod:`whoosh.query` module) which
-Whoosh includes a few pre-made parsers for user queries in the :mod:`whoosh.qparser` module. The default parser is based on `pyparsing <http://pyparsing.wikispaces.com/>`_ and implements a query language similar to the one shipped with Lucene. The parser is quite powerful and how it builds query trees is fairly customizable.
-To create a :class:`whoosh.qparser.QueryParser` object, pass it the name of the *default field* to search and the schema of the index you'll be searching.
- You can instantiate a QueryParser object without specifying a schema; however, the parser will then not process the text of the user query (see :ref:`querying and indexing <index-query>` below). This is really only useful for debugging, when you want to see how QueryParser will build a query, but don't want to make up a schema just for testing.
-Once you have a QueryParser object, you can call ``parse()`` on it to parse a query string into a query object::
-See the :doc:`query language reference <querylang>` for the features and syntax of the default parser's query language.
-The QueryParser object takes terms without explicit fields and assigns them to the default field you specified when you created the object, so for example if you created the object with::
-However, you might want to let the user search *multiple* fields by default. For example, you might want "unfielded" terms to search both the ``title`` and ``content`` fields.
-In that case, you can use a :class:`whoosh.qparser.MultifieldParser`. This is just like the normal QueryParser, but instead of a default field name string, it takes a *sequence* of field names::
- The query class to use to join sub-queries when the user doesn't explicitly specify a boolean operator, such as ``AND`` or ``OR``.
- This must be a :class:`whoosh.query.Query` subclass (*not* an instantiated object) that accepts a list of subqueries in its ``__init__`` method. The default is :class:`whoosh.query.And`.
- This is useful if you want to change the default operator to ``OR``, or if you've written a custom operator you want the parser to use instead of the ones shipped with Whoosh.
- This must be a :class:`whoosh.query.Query` subclass (*not* an instantiated object) that accepts a fieldname string and term text unicode string in its ``__init__`` method. The default is :class:`whoosh.query.Term`.
- This is useful if you want to change the default term class to :class:`whoosh.query.Variations`, or if you've written a custom term class you want the parser to use instead of the ones shipped with Whoosh.
-The ``QueryParser`` class is designed to allow a certain amount of customization by subclassing. The methods invoked on the abstract syntax tree produced by pyparsing in turn call methods starting with ``make_``, such as ``make_term``, ``make_prefix``, etc. The methods are passed the parsed information (such as the fieldname and term text for ``make_term``) and return a ``Query`` object. You can subclass and replace these methods to do additional processing or return different Query types. See the source code of the ``PyparsingBasedParser`` and ``QueryParser`` classes in the ``qparser`` module.
-To implement a different query syntax, or for complete control over query parsing, you can write your own parser.
-A parser is simply a class or function that takes input from the user and generates :class:`whoosh.query.Query` objects from it. For example, you could write a function that parses queries specified in XML:
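As a rough illustration of the idea, the sketch below parses a small, made-up XML query format. To stay self-contained it returns nested tuples rather than real :class:`whoosh.query.Query` objects; the element names (``and``, ``or``, ``term``) and the ``parse_xml_query`` function are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_xml_query(text):
    """Convert a small, made-up XML query format into nested tuples.

    A real parser would build query objects here instead of tuples.
    """
    def convert(node):
        if node.tag == "term":
            return ("Term", node.get("field"), node.text)
        if node.tag in ("and", "or"):
            return (node.tag.capitalize(), [convert(child) for child in node])
        raise ValueError("unknown query element: %s" % node.tag)

    return convert(ET.fromstring(text))


query = parse_xml_query(
    '<and>'
    '<term field="title">whoosh</term>'
    '<or><term field="body">index</term><term field="body">search</term></or>'
    '</and>'
)
```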
-Some fields can be indexed, and some fields can be stored with the document so the field value is available in search results. Some fields will be both indexed and stored.
-The schema is the set of all possible fields in a document. Each individual document might only use a subset of the available fields in the schema.
-For example, a simple schema for indexing emails might have fields like ``from_addr``, ``to_addr``, ``subject``, ``body``, and ``attachments``, where the ``attachments`` field lists the names of attachments to the email. For emails without attachments, you would omit the attachments field.
- This type is for body text. It indexes (and optionally stores) the text and stores term positions to allow phrase searching.
- TEXT fields use StandardAnalyzer by default. To specify a different analyzer, use the analyzer keyword argument to the constructor, e.g. TEXT(analyzer=analysis.StemmingAnalyzer()). See TextAnalysis.
- By default, TEXT fields store position information for each indexed term, to allow you to search for phrases. If you don't need to be able to search for phrases in a text field, you can turn off storing term positions to save space. Use TEXT(phrase=False).
- By default, TEXT fields are not stored. Usually you will not want to store the body text in the search index, because you have the indexed documents themselves available to read or link to based on the search results. However, in some circumstances it can be useful (see HighlightingResults). Use TEXT(stored=True) to specify that the text should be stored in the index.
- This field type is designed for space- or comma-separated keywords. This type is indexed and searchable (and optionally stored). To save space, it does not support phrase searching.
- To store the value of the field in the index, use stored=True in the constructor. To automatically lowercase the keywords before indexing them, use lowercase=True.
- By default, the keywords are space separated. To separate the keywords by commas instead (to allow keywords containing spaces), use commas=True.
- The ID field type simply indexes (and optionally stores) the entire value of the field as a single unit (that is, it doesn't break it up into individual terms). This type of field does not store frequency information, so it's quite compact, but not very useful for scoring.
- Use ID for fields like url or path (the URL or file path of a document), date, category -- fields where the value must be treated as a whole, and each document only has one value for the field.
- By default, ID fields are not stored. Use ID(stored=True) to specify that the value of the field should be stored with the document for use in the search results. For example, you would want to store the value of a url field so you could provide links to the original in your search results.
- This field is stored with the document, but not indexed and not searchable. This is useful for document information you want to display to the user in the search results, but don't need to be able to search for.
-If you aren't specifying any constructor keyword arguments to one of the predefined fields, you can leave off the brackets (e.g. fieldname=TEXT instead of fieldname=TEXT()). Whoosh will instantiate the class for you.
-You can specify a field boost for a field. This is a multiplier applied to the score of any term found in the field. For example, to make terms found in the title field score twice as high as terms in the body field::
-The predefined field types listed above are subclasses of ``fields.FieldType``. ``FieldType`` is a pretty simple class. Its attributes contain information that defines the behavior of a field.
-The constructors for most of the predefined field types have parameters that let you customize these parts. For example:
-* The ``TEXT()`` constructor takes an ``analyzer`` keyword argument that is passed on to the format object.
-A ``Format`` object defines what kind of information a field records about each term, and how the information is stored on disk.
-The indexing code passes the unicode string for a field to the field's Format object. The Format object calls its analyzer (see text analysis) to break the string into tokens, then encodes information about each token.
-The STORED field type uses the Stored format (which does nothing, so STORED fields are not indexed). The ID type uses the Existence format. The KEYWORD type uses the Frequency format. The TEXT type uses the Positions format if it is instantiated with phrase=True (the default), or Frequency if phrase=False.
-In addition, the following formats are implemented for the possible convenience of expert users, but are not currently used in Whoosh:
-The main index is an inverted index. It maps terms to the documents they appear in. It is also sometimes useful to store a forward index, also known as a term vector, that maps documents to the terms that appear in them.
-If you set FieldType.vector to a Format object, the indexing code will use the Format object to store information about the terms in each document. Currently by default Whoosh does not make use of term vectors at all, but they are available to expert users who want to implement their own field types.
-The query.Phrase query object can use positions in postings (``FieldType.format=Positions``) or in vectors (``FieldType.vector=Positions``), but storing positions in the postings gives faster phrase searches.
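The difference between the inverted index and a forward index (term vector) is easiest to see side by side. The sketch below builds both from two tiny documents using plain dictionaries; it illustrates the data structures only and says nothing about how Whoosh lays them out on disk.

```python
from collections import defaultdict

docs = {0: "whoosh is a search library", 1: "a library of books"}

inverted = defaultdict(list)   # term -> document numbers containing it
forward = {}                   # document number -> terms it contains

for docnum, text in sorted(docs.items()):
    terms = set(text.split())
    forward[docnum] = sorted(terms)
    for term in terms:
        inverted[term].append(docnum)

# The inverted index answers "which documents contain this term?"
library_docs = inverted["library"]
# The forward index answers "which terms does this document contain?"
doc1_terms = forward[1]
```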
-Field names are mapped to numbers inside the Schema, and the numbers are used internally. This means you can add fields to an existing index, and you can rename fields (although there is no API for doing so), but you can't delete fields from an existing index. If you want to make drastic changes to the schema, you should reindex your documents from scratch with the new schema.
-Instead of sorting the matched documents by a score, you can sort them by the contents of one or more indexed field(s). These should be fields for which each document stores one term (i.e. an ID field type), for example "path", "id", "date", etc.
-If you require more complex sorting you can implement a custom :class:`whoosh.scoring.Sorter` object and pass it to the ``sortedby`` keyword argument::
-A sorting object is an object with an :meth:`~whoosh.scoring.Sorter.order` method, which takes a searcher and an unsorted list of document numbers, and returns a sorted list of document numbers.
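The shape of such an object can be sketched without Whoosh. ``ToySearcher`` and ``FieldLengthSorter`` below are hypothetical classes written for this illustration; only the ``order(searcher, docnums)`` signature is taken from the description above.

```python
class ToySearcher:
    """Hypothetical searcher exposing stored fields by document number."""

    def __init__(self, stored):
        self._stored = stored

    def stored_fields(self, docnum):
        return self._stored[docnum]


class FieldLengthSorter:
    """Orders documents by the length of a stored field (illustration only)."""

    def __init__(self, fieldname):
        self.fieldname = fieldname

    def order(self, searcher, docnums):
        # Takes an unsorted list of document numbers, returns a sorted list.
        def sort_key(docnum):
            value = searcher.stored_fields(docnum)[self.fieldname]
            return (len(value), docnum)   # docnum breaks ties
        return sorted(docnums, key=sort_key)


searcher = ToySearcher([{"title": "ccc"}, {"title": "a"}, {"title": "bb"}])
ordered = FieldLengthSorter("title").order(searcher, [0, 1, 2])
```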
-The :meth:`~whoosh.searching.Searcher.document` and :meth:`~whoosh.searching.Searcher.documents` methods on the Searcher object let you retrieve the stored fields of documents matching terms you pass in keyword arguments.
-* The entire value of each keyword argument is considered a single term; you can't search for multiple terms in the same field.
-It is sometimes useful to use the results of another query to influence the order of a :class:`whoosh.searching.Results` object.
-For example, you might have a "best bet" field. This field contains hand-picked keywords for documents. When the user searches for those keywords, you want those documents to be placed at the top of the results list. You could try to do this by boosting the "bestbet" field tremendously, but that can have unpredictable effects on scoring. It's much easier to simply run the query twice and combine the results::
- Any result documents that also appear in 'results' are moved to the top of the list of result documents.
- Any result documents that also appear in 'results' are moved to the top of the list of result documents. Then any other documents in 'results' are added on to the list of result documents.
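The two combining behaviours described above can be sketched on plain lists of document numbers. The function names are placeholders for this illustration, and preserving the original relative order within each group is an assumption.

```python
def move_to_top(hits, other):
    """Move documents in `hits` that also appear in `other` to the top."""
    other_set = set(other)
    top = [d for d in hits if d in other_set]
    rest = [d for d in hits if d not in other_set]
    return top + rest


def move_to_top_and_extend(hits, other):
    """As above, then append the documents of `other` missing from `hits`."""
    seen = set(hits)
    return move_to_top(hits, other) + [d for d in other if d not in seen]


hits = [10, 20, 30, 40]    # results of the user's query
bestbets = [30, 50]        # results of the "best bet" query

upgraded = move_to_top(hits, bestbets)
extended = move_to_top_and_extend(hits, bestbets)
```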
-Whoosh includes pure-Python spell-checking library functions that use the Whoosh search engine for back-end storage.
-If you have a Whoosh ``Index`` object and you want to open the spelling dictionary in the same directory as the index, you can re-use the ``Index`` object's ``Storage``::
-Whoosh lets you keep multiple indexes in the same directory by assigning the indexes different names. The default name for a regular index is ``_MAIN``. The default name for the index created by the SpellChecker object is ``SPELL`` (so you can keep your main index and a spelling index in the same directory by default). You can pass an ``indexname`` argument to the SpellChecker constructor to choose a different index name (for example, if you want to keep multiple spelling dictionaries in the same directory)::
-You need to populate the spell-checking dictionary with (properly spelled) words to check against. There are a few strategies for doing this:
- For example, if you've created an index for a collection of documents with the contents indexed in a field named ``content``, you can automatically add all the words from that field::
- The advantage of using the contents of an index field is that when you are spell checking queries on that index, the suggestions are tailored to the contents of the index. The disadvantage is that if the indexed documents contain spelling errors, then the spelling suggestions will also be erroneous.
- There are plenty of word lists available on the internet you can use to populate the spelling dictionary. ::
-* Use a combination of word lists and index field contents. For example, you could add words from a field, but only if they appear in the word list::
-Note that adding words to the dictionary should be done all at once. Each call to ``add_field()``, ``add_words()``, or ``add_scored_words()`` (see below) creates a writer, adds to the underlying index, and then closes the writer, just like adding documents to a regular Whoosh index. **DO NOT** do anything like this::
-**Be careful** not to add the same word to the spelling dictionary more than once. The ``SpellChecker`` code *does not* currently guard against this automatically.
-Once you have words in the spelling dictionary, you can use the ``suggest()`` method to check words::
-The ``number`` keyword argument sets the maximum number of suggestions to return (default is 3). ::
-Each word in the dictionary can have a "score" associated with it. When two or more suggestions have the same "edit distance" (number of differences) from the checked word, the score is used to order them in the suggestion list.
-By default the list of suggestions is only ordered by the number of differences between the suggestion and the original word. To make the ``suggest()`` method use word scores, use the ``usescores=True`` keyword argument. ::
-The main use for this is to use the word's frequency in the index as its score, so common words are suggested before obscure words. **Note** The ``add_field()`` method does this by default.
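The ordering rule (edit distance first, score as a tie-breaker) can be sketched in plain Python. This is an illustration of the ranking only: it uses a straightforward Levenshtein distance and a hypothetical standalone ``suggest`` function, and makes no claim about how the spell checker finds candidates internally.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def suggest(word, scored_words, number=3, usescores=False):
    """Rank dictionary words by distance, optionally breaking ties by score."""
    def sort_key(item):
        candidate, score = item
        distance = edit_distance(word, candidate)
        if usescores:
            return (distance, -score, candidate)   # higher score wins ties
        return (distance, candidate)
    ranked = sorted(scored_words.items(), key=sort_key)
    return [candidate for candidate, _ in ranked[:number]]


# Scores stand in for each word's frequency in the index.
dictionary = {"render": 2, "reader": 40, "renders": 5, "tender": 9}
plain = suggest("rendr", dictionary)
scored = suggest("rendr", dictionary, usescores=True)
```

With ``usescores=True`` the more frequent "tender" displaces "renders", since both are the same distance from the misspelled word.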
-If you want to add a list of words with scores manually, you can use the ``add_scored_words()`` method::
-For example, if you wanted to reverse the default behavior of ``add_field()`` so that *obscure* words would be suggested before common words, you could do this::
-If you want to spell check a user query, first parse the user's query into a ``whoosh.query.Query`` object tree, using the default parser or your own custom parser. For example::
-Then you can use the ``all_terms()`` or ``existing_terms()`` methods of the ``Query`` object to extract the set of terms used in the query. The two methods work in a slightly unusual way: instead of returning a list, you pass them a set, and they populate the set with the query terms::
-The ``all_terms()`` method simply adds all the terms found in the query. The ``existing_terms()`` method takes an IndexReader object and only adds terms from the query *that exist* in the reader's underlying index. ::
-Of course, it's more useful to spell check the terms that are *missing* from the index, not the ones that exist. The ``reverse=True`` keyword argument to ``existing_terms()`` lets us find the missing terms
-So now you have a set of ``("fieldname", "termtext")`` tuples. Now you can check them against the spelling dictionary::
-The spell checker is mainly intended to be "write-once, read-many". You can continually add words to the dictionary, but it is not possible to remove words or dynamically update the dictionary.
-Currently the best strategy available for keeping a spelling dictionary up-to-date with changing content is simply to **delete and re-create** the spelling dictionary periodically.