Commits

Matt Chaput committed cac9c3a

Fixed Expander API. Doc updates and fixes.


Files changed (21)

docs/source/api/qparser.rst

 ==============
 
 .. autoclass:: QueryParser
-    :inherited-members:
     :members:
     
 .. autoclass:: MultifieldParser
 
 .. autoclass:: SimpleNgramParser
 
+
+Exceptions
+==========
+
+.. autoexception:: QueryParserError
+

docs/source/api/query.rst

 
 .. autoclass:: TermRange
 
+.. autoclass:: Every
+
 
 Binary operations
 =================
 
-These binary operators are not generally created by the query parser in :mod:`whoosh.qparser`.
-Unless you specifically need these operations, you should use the normal query classes instead.
-
 .. autoclass:: Require
 
 .. autoclass:: AndMaybe

docs/source/api/scoring.rst

 
 .. autoclass:: BM25F
 
-.. autoclass:: Cosine
-
-.. autoclass:: DFree
-
-.. autoclass:: DLH13
-
-.. autoclass:: Hiemstra_LM
-
-.. autoclass:: InL2
-
 .. autoclass:: TF_IDF
 
 .. autoclass:: Frequency
 
 
+Scoring utility classes
+=======================
+
+.. autoclass:: MultiWeighting
+
+.. autoclass:: ReverseWeighting
+
+
 Sorting classes
 ===============
 

docs/source/api/searching.rst

 .. autoclass:: ResultsPage
 	:members:
 
+
+Facets
+======
+
+.. autoclass:: Facets
+

docs/source/api/writing.rst

     DOCLENGTH_TYPE.
 
 
-Writers
+Writer
 ======
 
 .. autoclass:: IndexWriter
     :members:
-    
+
+
+Utility writers
+===============
+
 .. autoclass:: AsyncWriter
     :members:
+    
+.. autoclass:: BatchWriter
+    :members:
+    
+    
+Posting writer
+==============
+
+.. autoclass:: PostingWriter
+
 
 Exceptions
 ==========

docs/source/highlight.rst

 Overview
 ========
 
-The highlight module requires that you have the text of the indexed 
-document available. You can keep the text in a stored field, or if the 
-original text is available in a file, database column, etc, just reload 
-it on the fly. Note that you might need to process the text to remove 
-e.g. HTML tags, wiki markup, etc.
+The highlight module requires that you have the text of the indexed document
+available. You can keep the text in a stored field, or if the original text is
+available in a file, database column, etc, just reload it on the fly. Note that
+you might need to process the text to remove e.g. HTML tags, wiki markup, etc.
 
 The highlight module works on a pipeline:
 
 #. Run the text through an analyzer to turn it into a token stream [#f1]_.
 
-#. Break the token stream into "fragments" (there are several different styles of fragmentation  available).
+#. Break the token stream into "fragments" (there are several different styles
+   of fragmentation available).
 
-#. Score each fragment based on how many matched query terms the fragment contains.
+#. Score each fragment based on how many matched query terms the fragment
+   contains.
 
 #. Format the highest scoring fragments for display.
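The four steps above can be sketched in a few lines of plain Python. This is
only a conceptual illustration: the regex tokenizer, fixed-size chunking, and
uppercase formatting are stand-ins, not Whoosh's actual APIs::

```python
import re

def highlight_sketch(text, terms, chunksize=6):
    # 1. Run the text through an "analyzer" to get a token stream.
    tokens = re.findall(r"\w+", text.lower())
    # 2. Break the token stream into fixed-size "fragments".
    fragments = [tokens[i:i + chunksize]
                 for i in range(0, len(tokens), chunksize)]
    # 3. Score each fragment by how many matched query terms it contains.
    best = max(fragments, key=lambda f: sum(t in terms for t in f))
    # 4. Format the highest scoring fragment: uppercase the matched terms.
    return " ".join(w.upper() if w in terms else w for w in best)

print(highlight_sketch(
    "The quick brown fox jumps over the lazy dog near the brown barn",
    {"brown", "fox"}))
# -> the quick BROWN FOX jumps over
```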
 
 .. rubric:: Footnotes
 
 .. [#f1]
-    Some search systems, such as Lucene, can use term vectors to highlight text 
+    Some search systems, such as Lucene, can use term vectors to highlight text
     without retokenizing it. In my tests I found that using a Position/Character
     term vector didn't give any speed improvement in Whoosh over retokenizing
     the text. This probably needs further investigation.
     The original text of the document.
 
 terms
-    An iterable containing the query words to match, e.g.
-    ("render", "shader").
+    An iterable containing the query words to match, e.g. ("render", "shader").
 
 analyzer
-    The analyzer to use to break the document text into tokens for
-    matching against the query terms. This is usually the analyzer
-    for the field the query terms are in.
+    The analyzer to use to break the document text into tokens for matching
+    against the query terms. This is usually the analyzer for the field the
+    query terms are in.
 
 fragmenter
     A fragmenter callable, see below.
     is BasicFragmentScorer, the default.
 
 minscore
-    The minimum score a fragment must have to be considered for
-    inclusion.
+    The minimum score a fragment must have to be considered for inclusion.
 
 order
-    An ordering function that determines the order of the "top"
-    fragments in the output text. This will usually be either
-    SCORE (highest scoring fragments first) or FIRST (highest
-    scoring fragments in their original order). (Whoosh also
-    includes LONGER (longer fragments first) and SHORTER (shorter
-    fragments first) as examples of scoring functions, but they
-    probably aren't as generally useful.)
-
-Example
--------
-
-
+    An ordering function that determines the order of the "top" fragments in the
+    output text. This will usually be either SCORE (highest scoring fragments
+    first) or FIRST (highest scoring fragments in their original order). (Whoosh
+    also includes LONGER (longer fragments first) and SHORTER (shorter fragments
+    first) as examples of scoring functions, but they probably aren't as
+    generally useful.)
 
 
 How it works
 Fragmenters
 -----------
 
-A fragmenter controls the policy of how to extract excerpts from the 
-original text. It is a callable that takes the original text, the set of 
-terms to match, and the token stream, and returns a sequence of Fragment 
-objects.
+A fragmenter controls the policy of how to extract excerpts from the original
+text. It is a callable that takes the original text, the set of terms to match,
+and the token stream, and returns a sequence of Fragment objects.
 
 The available fragmenters are:
 
 NullFragmenter
-    Returns the entire text as one "fragment". This can be useful if you
-    are highlighting a short bit of text and don't need to fragment it.
+    Returns the entire text as one "fragment". This can be useful if you are
+    highlighting a short bit of text and don't need to fragment it.
 
 SimpleFragmenter
-    Or maybe "DumbFragmenter", this just breaks the token stream into
-    equal sized chunks.
+    Or maybe "DumbFragmenter", this just breaks the token stream into equal
+    sized chunks.
 
 SentenceFragmenter
-    Tries to break the text into fragments based on sentence punctuation
-    (".", "!", and "?"). This object works by looking in the original
-    text for a sentence end as the next character after each token's
-    'endchar'. Can be fooled by e.g. source code, decimals, etc.
+    Tries to break the text into fragments based on sentence punctuation (".",
+    "!", and "?"). This object works by looking in the original text for a
+    sentence end as the next character after each token's 'endchar'. Can be
+    fooled by e.g. source code, decimals, etc.
 
 ContextFragmenter
-    This is a "smart" fragmenter that finds matched terms and then pulls
-    in surround text to form fragments. This fragmenter only yields
-    fragments that contain matched terms.
+    This is a "smart" fragmenter that finds matched terms and then pulls in
+    surrounding text to form fragments. This fragmenter only yields fragments
+    that contain matched terms.
 
 (See the docstrings for how to instantiate these.)
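As an illustration of the SentenceFragmenter approach (peeking at the character
that follows each token's 'endchar'), here is a standalone sketch that does not
use Whoosh's actual classes::

```python
import re

def sentence_fragments(text):
    # Sketch of the SentenceFragmenter idea: after each token, look at the
    # next character in the original text; ".", "!", or "?" ends a fragment.
    fragments, start = [], 0
    for match in re.finditer(r"\w+", text):
        following = text[match.end():match.end() + 1]
        if following in (".", "!", "?"):
            fragments.append(text[start:match.end() + 1].strip())
            start = match.end() + 1
    return fragments

print(sentence_fragments("Whoosh is fast. Is it pure Python? Yes!"))
# -> ['Whoosh is fast.', 'Is it pure Python?', 'Yes!']
```

As the text above notes, this heuristic is easily fooled by decimals and
source code, since any "." directly after a token ends a fragment.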
 
 Formatters
 ----------
 
-A formatter contols how the highest scoring fragments are turned into a 
-formatted bit of text for display to the user. It can return anything 
-(e.g. plain text, HTML, a Genshi event stream, a SAX event generater, 
-anything useful to the calling system).
+A formatter controls how the highest scoring fragments are turned into a
+formatted bit of text for display to the user. It can return anything (e.g.
+plain text, HTML, a Genshi event stream, a SAX event generator, anything useful
+to the calling system).
 
-Whoosh currently includes only two formatters, because I wrote this 
-module for myself and that's all I needed at the time. Unless you happen 
-to be using Genshi also, you'll probably need to implement your own 
-formatter. I'll try to add more useful formatters in the future.
+(Whoosh currently includes only a few formatters, because I wrote this module
+for myself and that's all I needed at the time.)
 
 UppercaseFormatter
     Converts the matched terms to UPPERCASE.
 
 HtmlFormatter
-	Outputs a string containing HTML tags (with a class attribute)
-	around the the matched terms.
+	Outputs a string containing HTML tags (with a class attribute) around the
+	matched terms.
 
 GenshiFormatter
     Outputs a Genshi event stream, with the matched terms wrapped in a
 Writing your own formatter
 --------------------------
 
-A formatter must be a callable (a function or an object with a __call__ 
-method). It is called with the following arguments::
+A formatter must be a callable (a function or an object with a __call__ method).
+It is called with the following arguments::
 
     formatter(text, fragments)
 
     An iterable of Fragment objects representing the top scoring
     fragments.
 
-The Fragment object is a simple object that has attributes containing 
-basic information about the fragment:
+The Fragment object is a simple object that has attributes containing basic
+information about the fragment:
 
 Fragment.startchar
     The index of the first character of the fragment.
     terms within the fragment.
 
 Fragment.matched_terms
-    For convenience: A frozenset of the text of the matched terms within
-    the fragment -- i.e. frozenset(t.text for t in self.matches).
+    For convenience: A frozenset of the text of the matched terms within the
+    fragment -- i.e. frozenset(t.text for t in self.matches).
 
 The basic work you need to do in the formatter is:
 
 
 * For each Token object in Fragment.matches, highlight the bits of the
    excerpt between Token.startchar and Token.endchar. (Remember that the
-   character indices refer to the original text, so you need to adjust
-   them for the excerpt.)
+   character indices refer to the original text, so you need to adjust them for
+   the excerpt.)
 
-The tricky part is that if you're adding text (e.g. inserting HTML tags 
-into the output), you have to be careful about keeping the character 
-indices straight.
+The tricky part is that if you're adding text (e.g. inserting HTML tags into the
+output), you have to be careful about keeping the character indices straight.
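Here is a standalone sketch of that bookkeeping, using namedtuples as
illustrative stand-ins for the module's Token and Fragment objects (the
attribute names mirror those described above; this is not Whoosh's actual
HtmlFormatter)::

```python
from collections import namedtuple

# Stand-ins for the highlight module's objects, for illustration only.
Token = namedtuple("Token", "startchar endchar")
Fragment = namedtuple("Fragment", "startchar endchar matches")

def html_format(text, fragments):
    out = []
    for frag in fragments:
        excerpt = text[frag.startchar:frag.endchar]
        # Every tag we insert shifts later indices, so track an offset.
        # Assumes frag.matches is ordered by startchar.
        offset = 0
        for tok in frag.matches:
            s = tok.startchar - frag.startchar + offset
            e = tok.endchar - frag.startchar + offset
            excerpt = excerpt[:s] + "<b>" + excerpt[s:e] + "</b>" + excerpt[e:]
            offset += len("<b>") + len("</b>")
        out.append(excerpt)
    return "...".join(out)

text = "the quick brown fox"
frag = Fragment(4, 19, [Token(10, 15)])  # "brown" inside "quick brown fox"
print(html_format(text, [frag]))
# -> quick <b>brown</b> fox
```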

docs/source/indexing.rst

 
 The schema you created the index with is pickled and stored with the index.
 
-You can keep multiple indexes in the same directory using the indexname keyword argument::
+You can keep multiple indexes in the same directory using the indexname keyword
+argument::
 
 	# Using the convenience functions
     ix = index.create_in("indexdir", schema=schema, indexname="usages")
 Clearing the index
 ==================
 
-Calling ``index.create_in`` on a directory with an existing index will clear the current contents of the index.
+Calling ``index.create_in`` on a directory with an existing index will clear the
+current contents of the index.
 
-To test whether a directory currently contains a valid index, use ``index.exists_in``::
+To test whether a directory currently contains a valid index, use
+``index.exists_in``::
 
     exists = index.exists_in("indexdir")
     usages_exists = index.exists_in("indexdir", indexname="usages")
 
-(Alternatively you can simply delete the index's files from the directory, e.g. if you only have one index in the directory, use ``shutil.rmtree`` to remove the directory and then recreate it.)
+(Alternatively you can simply delete the index's files from the directory, e.g.
+if you only have one index in the directory, use ``shutil.rmtree`` to remove the
+directory and then recreate it.)
 
 
 Indexing documents
 ==================
 
-Once you've created an Index object, you can add documents to the index with an ``IndexWriter`` object. The easiest way to get the ``IndexWriter`` is to call ``Index.writer()``::
+Once you've created an Index object, you can add documents to the index with an
+``IndexWriter`` object. The easiest way to get the ``IndexWriter`` is to call
+``Index.writer()``::
 
     ix = index.open_dir("index")
     writer = ix.writer()
 
-Creating a writer locks the index, so only one thread/process at once can have a writer open.
+Creating a writer locks the index, so only one thread/process at once can have a
+writer open.
 
-The IndexWriter's ``add_document(**kwargs)`` method accepts keyword arguments where the field name is mapped to a value::
+The IndexWriter's ``add_document(**kwargs)`` method accepts keyword arguments
+where the field name is mapped to a value::
 
     writer = ix.writer()
     writer.add_document(title=u"My document", content=u"This is my document!",
                         path=u"/c", tags=u"short", icon=u"/icons/book.png")
     writer.commit()
 
-You don't have to fill in a value for every field. Whoosh doesn't care if you leave out a field from a document.
+You don't have to fill in a value for every field. Whoosh doesn't care if you
+leave out a field from a document.
 
-Indexed fields must be passed a unicode value. Fields that are stored but not indexed (i.e. the STORED field type) can be passed any pickle-able object.
+Indexed fields must be passed a unicode value. Fields that are stored but not
+indexed (i.e. the STORED field type) can be passed any pickle-able object.
 
-Whoosh will happily allow you to add documents with identical values, which can be useful or annoying depending on what you're using the library for::
+Whoosh will happily allow you to add documents with identical values, which can
+be useful or annoying depending on what you're using the library for::
 
     writer.add_document(path=u"/a", title=u"A", content=u"Hello there")
     writer.add_document(path=u"/a", title=u"A", content=u"Deja vu!")
 
-This adds two documents to the index with identical path and title fields. See "updating documents" below for information on the update_document method, which uses "unique" fields to replace old documents instead of appending.
+This adds two documents to the index with identical path and title fields. See
+"updating documents" below for information on the update_document method, which
+uses "unique" fields to replace old documents instead of appending.
 
 
 Indexing and storing different values for the same field
 --------------------------------------------------------
 
-If you have a field that is both indexed and stored, you can index a unicode value but store a different object if necessary (it's usually not, but sometimes this is really useful) using a "special" keyword argument _stored_<fieldname>. The normal value will be analyzed and indexed, but the "stored" value will show up in the results::
+If you have a field that is both indexed and stored, you can index a unicode
+value but store a different object if necessary (it's usually not, but sometimes
+this is really useful) using a "special" keyword argument _stored_<fieldname>.
+The normal value will be analyzed and indexed, but the "stored" value will show
+up in the results::
 
     writer.add_document(title=u"Title to be indexed", _stored_title=u"Stored title")
 
 Finishing adding documents
 --------------------------
 
-An ``IndexWriter`` object is kind of like a database transaction. You specify a bunch of changes to the index, and then "commit" them all at once.
+An ``IndexWriter`` object is kind of like a database transaction. You specify a
+bunch of changes to the index, and then "commit" them all at once.
 
-Calling ``commit()`` on the ``IndexWriter`` saves the added documents to the index::
+Calling ``commit()`` on the ``IndexWriter`` saves the added documents to the
+index::
 
     writer.commit()
 
 Once your documents are in the index, you can search for them.
 
-If you want to close the writer without committing the changes, call ``cancel()`` instead of ``commit()``::
+If you want to close the writer without committing the changes, call
+``cancel()`` instead of ``commit()``::
 
     writer.cancel()
 
-Keep in mind that while you have a writer open (including a writer you opened and is still in scope), no other thread or process can get a writer or modify the index. A writer also keeps several open files. So you should always remember to call either commit() or cancel() when you're done with a writer object.
+Keep in mind that while you have a writer open (including a writer you opened
+and is still in scope), no other thread or process can get a writer or modify
+the index. A writer also keeps several open files. So you should always remember
+to call either commit() or cancel() when you're done with a writer object.
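One way to enforce that discipline is a small wrapper that always commits on
success and cancels on error. The stub classes below are stand-ins so the
sketch is self-contained; with a real index you would pass the Index object
itself::

```python
from contextlib import contextmanager

@contextmanager
def writing(ix):
    # Guarantees the writer is always committed or cancelled, so the
    # index lock and the writer's open files are released.
    writer = ix.writer()
    try:
        yield writer
    except BaseException:
        writer.cancel()
        raise
    else:
        writer.commit()

# Stubs standing in for a real Index/IndexWriter, to show the control flow.
class StubWriter:
    def __init__(self):
        self.state = "open"
    def add_document(self, **fields):
        pass
    def commit(self):
        self.state = "committed"
    def cancel(self):
        self.state = "cancelled"

class StubIndex:
    def __init__(self):
        self.last = None
    def writer(self):
        self.last = StubWriter()
        return self.last

ix = StubIndex()
with writing(ix) as w:
    w.add_document(title=u"My document")
print(ix.last.state)
# -> committed
```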
 
 
 Merging segments
 ================
 
-A Whoosh index is really a container for one or more "sub-indexes" called segments. When you add documents to an index, instead of integrating the new documents with the existing documents (which could potentially be very expensive, since it involves resorting all the indexed terms on disk), Whoosh creates a new segment next to the existing segment. Then when you search the index, Whoosh searches both segments individually and merges the results so the segments appear to be one unified index. (This smart design is copied from Lucene.)
+A Whoosh ``filedb`` index is really a container for one or more "sub-indexes"
+called segments. When you add documents to an index, instead of integrating the
+new documents with the existing documents (which could potentially be very
+expensive, since it involves resorting all the indexed terms on disk), Whoosh
+creates a new segment next to the existing segment. Then when you search the
+index, Whoosh searches both segments individually and merges the results so the
+segments appear to be one unified index. (This smart design is copied from
+Lucene.)
 
-So, having a few segments is more efficient than rewriting the entire index every time you add some documents. But searching multiple segments does slow down searching somewhat, and the more segments you have, the slower it gets. So Whoosh has an algorithm that runs when you call commit() that looks for small segments it can merge together to make fewer, bigger segments.
+So, having a few segments is more efficient than rewriting the entire index
+every time you add some documents. But searching multiple segments does slow
+down searching somewhat, and the more segments you have, the slower it gets. So
+Whoosh has an algorithm that runs when you call commit() that looks for small
+segments it can merge together to make fewer, bigger segments.
 
-The ``commit()`` method takes an argument that lets you control this "merge policy" explicitly::
+To prevent Whoosh from merging segments during a commit, use the ``merge``
+keyword argument::
 
-    from whoosh.writing import NO_MERGE, MERGE_SMALL, OPTIMIZE
-    writer.commit(MERGE_SMALL)
+    writer.commit(merge=False)
+    
+To merge all segments together, optimizing the index into a single segment,
+use the ``optimize`` keyword argument::
 
-:meth:`whoosh.writing.MERGE_SMALL`
+    writer.commit(optimize=True)
 
-    The default: uses a heuristic (taken from KinoSearch?) based on the Fibonacci sequence to merge "small" segments together.
+(The Index object also has an ``optimize()`` method that merges all of the
+index's segments into one. It simply creates a writer and calls
+``commit(optimize=True)`` on it.)
 
-:meth:`whoosh.writing.NO_MERGE`
-
-    Do not merge segments, even if it means creating lots of small segments. This may be useful if you don't want to pay any speed penalty for merging when you're doing lots of small adds to the index. You'll want to somehow schedule and "optimization" (see below) at some point to merge the segments.
-
-:meth:`whoosh.writing.OPTIMIZE`
-
-    Merge all segments together to finish with only one segment in the index.
-
-The Index object also has an ``optimize()`` method that lets you optimize the index (merge all the segments together). It simply creates a writer and calls ``commit(OPTIMIZE)`` on it.
-
-(NO_MERGE, MERGE_SMALL, and OPTIMIZE are actually callables that implement the different "policies". It is possible for an expert user to implement a different algorithm for merging segments.)
+For more control over segment merging, you can write your own merge policy
+function and use it as an argument to the ``commit()`` method. See the
+implementation of the ``NO_MERGE``, ``MERGE_SMALL``, and ``OPTIMIZE`` functions
+in the ``whoosh.filedb.filewriting`` module.
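As a purely conceptual illustration of the decision such a policy makes (this
is *not* the real function signature in ``whoosh.filedb.filewriting``), a
size-based heuristic might look like::

```python
def small_segments(segment_sizes, threshold=100):
    # Conceptual sketch of a MERGE_SMALL-style heuristic: fold the segments
    # that are cheap to rewrite into the new commit, and leave the big
    # segments on disk untouched.
    to_merge = [s for s in segment_sizes if s < threshold]
    to_keep = [s for s in segment_sizes if s >= threshold]
    return to_merge, to_keep

merge, keep = small_segments([5, 12, 400, 30, 2500])
print(merge, keep)
# -> [5, 12, 30] [400, 2500]
```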
 
 
 Deleting documents
 ==================
 
-You can delete documents using identical methods on either the Index object or the IndexWriter object. In both cases, you need to call ``commit()`` on the object to write the deletions to disk.
+You can delete documents using the following methods on an ``IndexWriter``
+object. You then need to call ``commit()`` on the writer to save the deletions
+to disk.
 
 ``delete_document(docnum)``
 
 
 ``is_deleted(docnum)``
 
-    Low-level method, returns True if the document with the given internal number is deleted.
+    Low-level method, returns True if the document with the given internal
+    number is deleted.
 
 ``delete_by_term(fieldname, termtext)``
 
-    Deletes any documents where the given (indexed) field contains the given term. This is mostly useful for ID or KEYWORD fields.
+    Deletes any documents where the given (indexed) field contains the given
+    term. This is mostly useful for ID or KEYWORD fields.
 
 ``delete_by_query(query)``
 
         # Save the deletion to disk
         ix.commit()
 
-Note that "deleting" a document simply adds the document number to a list of deleted documents stored with the index. When you search the index, it knows not to return deleted documents in the results. However, the document's contents are still stored in the index, and certain statistics (such as term document frequencies) are not updated, until you merge the segments containing deleted documents (see merging above). (This is because removing the information immediately from the index would essentially involving rewriting the entire index on disk, which would be very inefficient.)
+In the ``filedb`` backend, "deleting" a document simply adds the document number
+to a list of deleted documents stored with the index. When you search the index,
+it knows not to return deleted documents in the results. However, the document's
+contents are still stored in the index, and certain statistics (such as term
+document frequencies) are not updated, until you merge the segments containing
+deleted documents (see merging above). (This is because removing the information
+immediately from the index would essentially involve rewriting the entire
+index on disk, which would be very inefficient.)
 
 
 Updating documents
 ==================
 
-If you want to "replace" (re-index) a document, you can delete the old document using one of the ``delete_*`` methods on ``Index`` or ``IndexWriter``, then use ``IndexWriter.add_document`` to add the new version. Or, you can use ``IndexWriter.update_document`` to do this in one step.
+If you want to "replace" (re-index) a document, you can delete the old document
+using one of the ``delete_*`` methods on ``Index`` or ``IndexWriter``, then use
+``IndexWriter.add_document`` to add the new version. Or, you can use
+``IndexWriter.update_document`` to do this in one step.
 
-For ``update_document`` to work, you must have marked at least one of the fields in the schema as "unique". Whoosh will then use the contents of the "unique" field(s) to search for documents to delete::
+For ``update_document`` to work, you must have marked at least one of the fields
+in the schema as "unique". Whoosh will then use the contents of the "unique"
+field(s) to search for documents to delete::
 
     from whoosh.fields import Schema, ID, TEXT
 
 
 The "unique" field(s) must be indexed.
 
-If no existing document matches the unique fields of the document you're updating, update_document acts just like add_document.
+If no existing document matches the unique fields of the document you're
+updating, update_document acts just like add_document.
 
-"Unique" fields and update_document are simply convenient shortcuts for deleting and adding. Whoosh has no inherent concept of a unique identifier, and in no way enforces uniqueness when you use add_document.
+"Unique" fields and update_document are simply convenient shortcuts for deleting
+and adding. Whoosh has no inherent concept of a unique identifier, and in no way
+enforces uniqueness when you use add_document.
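The delete-then-add semantics can be modeled on a plain list of dictionaries
(a toy stand-in for the index, not Whoosh code)::

```python
def update_document(index_docs, unique_field, **fields):
    # Toy model of IndexWriter.update_document: delete any document whose
    # "unique" field matches, then add the new version. If nothing matches,
    # this behaves exactly like add_document.
    key = fields[unique_field]
    index_docs[:] = [d for d in index_docs if d.get(unique_field) != key]
    index_docs.append(fields)

docs = [{"path": u"/a", "content": u"old"}, {"path": u"/b", "content": u"other"}]
update_document(docs, "path", path=u"/a", content=u"new")
print(docs)
# -> [{'path': u'/b', 'content': u'other'}, {'path': u'/a', 'content': u'new'}]
```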
 
 
 Incremental indexing
 ====================
 
-When you're indexing a collection of documents, you'll often want two code paths: one to index all the documents from scratch, and one to only update the documents that have changed (leaving aside web applications where you need to add/update documents according to user actions).
+When you're indexing a collection of documents, you'll often want two code
+paths: one to index all the documents from scratch, and one to only update the
+documents that have changed (leaving aside web applications where you need to
+add/update documents according to user actions).
 
 Indexing everything from scratch is pretty easy. Here's a simple example::
 
       fileobj.close()
       writer.add_document(path=path, content=content)
 
-Now, for a small collection of documents, indexing from scratch every time might actually be fast enough. But for large collections, you'll want to have the script only re-index the documents that have changed.
+Now, for a small collection of documents, indexing from scratch every time might
+actually be fast enough. But for large collections, you'll want to have the
+script only re-index the documents that have changed.
 
-To start we'll need to store each document's last-modified time, so we can check if the file has changed. In this example, we'll just use the mtime for simplicity::
+To start we'll need to store each document's last-modified time, so we can check
+if the file has changed. In this example, we'll just use the mtime for
+simplicity::
 
     def get_schema():
       return Schema(path=ID(unique=True, stored=True), time=STORED, content=TEXT)
       modtime = os.path.getmtime(path)
       writer.add_document(path=path, content=content, time=modtime)
 
-Now we can modify the script to allow either "clean" (from scratch) or incremental indexing::
+Now we can modify the script to allow either "clean" (from scratch) or
+incremental indexing::
 
     def index_my_docs(dirname, clean=False):
       if clean:
         # The set of all paths we need to re-index
         to_index = set()
 
+        writer = ix.writer()
+
         # Loop over the stored fields in the index
-        for fields in searcher.doc_reader:
+        for fields in searcher.all_stored_fields():
           indexed_path = fields['path']
           indexed_paths.add(indexed_path)
 
           if not os.path.exists(indexed_path):
             # This file was deleted since it was indexed
-            ix.delete_by_term('path', indexed_path)
+            writer.delete_by_term('path', indexed_path)
 
           else:
            # Check if this file was changed since it was indexed
             indexed_time = fields['time']
             mtime = os.path.getmtime(indexed_path)
             if mtime > indexed_time:
-              # The file has changed, add it to the list of
-              # filese
+              # The file has changed, delete it and add it to the list of
+              # files to reindex
+              writer.delete_by_term('path', indexed_path)
               to_index.add(indexed_path)
 
-        writer = ix.writer()
-
         # Loop over the files in the filesystem
         # Assume we have a function that gathers the filenames of the
         # documents to be indexed

docs/source/intro.rst

 documents based on simple or complex search criteria.
 
 
-
-
 Getting help with Whoosh
 ------------------------
 

docs/source/keywords.rst

 Overview
 ========
 
-Whoosh provides methods for computing the "key terms" of a set of documents. For these methods, "key terms" basically means terms that are frequent in the given documents, but relatively infrequent in the indexed collection as a whole.
+Whoosh provides methods for computing the "key terms" of a set of documents. For
+these methods, "key terms" basically means terms that are frequent in the given
+documents, but relatively infrequent in the indexed collection as a whole.
 
-Because this is a purely statistical operation, not a natural language processing or AI function, the quality of the results will vary based on the content, the size of the document collection, and the number of documents for which you extract keywords.
+Because this is a purely statistical operation, not a natural language
+processing or AI function, the quality of the results will vary based on the
+content, the size of the document collection, and the number of documents for
+which you extract keywords.
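The intuition (frequent in the selected documents, rare in the collection as a
whole) can be sketched with a simple TF-IDF-style score. This is only an
illustration; Whoosh's actual ``ExpansionModel`` weightings in
:mod:`whoosh.classify` are different::

```python
import math

def key_terms(selected_docs, all_docs, numterms=3):
    # Score terms that are frequent in the selected documents but
    # relatively infrequent across the whole collection. Each document
    # is modeled as a set of terms.
    n = len(all_docs)
    local = {}
    for doc in selected_docs:
        for term in doc:
            local[term] = local.get(term, 0) + 1
    def idf(term):
        df = sum(term in doc for doc in all_docs)
        return math.log(n / df)
    scored = sorted(local, key=lambda t: local[t] * idf(t), reverse=True)
    return scored[:numterms]

docs = [{"render", "shader"}, {"render", "shader"},
        {"render", "mesh"}, {"render", "bone"}]
print(key_terms(docs[:2], docs))
# -> ['shader', 'render']   ("render" is in every document, so it scores 0)
```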
 
 These methods can be useful for providing the following features to users:
 
-* Search term expansion. You can extract key terms for the top N results from a query and suggest them to the user as additional/alternate query terms to try.
+* Search term expansion. You can extract key terms for the top N results from a
+  query and suggest them to the user as additional/alternate query terms to try.
 
-* Tag suggestion. Extracting the key terms for a single document may yield useful suggestions for tagging the document.
+* Tag suggestion. Extracting the key terms for a single document may yield
+  useful suggestions for tagging the document.
 
-* "More like this". You can extract key terms for the top ten or so results from a query (and removing the original query terms), and use those key words as the basis for another query that may find more documents using terms the user didn't think of.
-
+* "More like this". You can extract key terms for the top ten or so results from
+  a query (and removing the original query terms), and use those key words as
+  the basis for another query that may find more documents using terms the user
+  didn't think of.
 
 Usage
 =====
 
-* Extract keywords for an arbitrary set of documents.
+* Extract keywords for the top N documents in a
+  :class:`whoosh.searching.Results` object. *This requires that the field is
+  either vectored or stored*.
 
-  Use the :meth:`~whoosh.searching.Searcher.document_number` or :meth:`~whoosh.searching.Searcher.document_number` methods of the :class:`whoosh.searching.Searcher` object to get the document numbers for the document(s) you want to extract keywords from.
+  Use the :meth:`~whoosh.searching.Results.key_terms` method of the
+  :class:`whoosh.searching.Results` object to extract keywords from the top N
+  documents of the result set.
     
-  Use the :meth:`~whoosh.searching.Searcher.key_terms` method of :class:`whoosh.searching.Searcher` to extract the keywords, given the list of document numbers.
+  For example, to extract *five* key terms from the ``content`` field of the top
+  *ten* documents of a results object::
     
-  For example, let's say you have an index of emails. To extract key terms from the ``content`` field of emails whose ``emailto`` field contains ``matt@whoosh.ca``::
+        keywords = list(results.key_terms("content", docs=10, numterms=5))
+        
+* Extract keywords for an arbitrary set of documents. *This requires that the
+  field is either vectored or stored*.
+
+  Use the :meth:`~whoosh.searching.Searcher.document_number` or
+  :meth:`~whoosh.searching.Searcher.document_numbers` methods of the
+  :class:`whoosh.searching.Searcher` object to get the document numbers for the
+  document(s) you want to extract keywords from.
+    
+  Use the :meth:`~whoosh.searching.Searcher.key_terms` method of a
+  :class:`whoosh.searching.Searcher` to extract the keywords, given the list of
+  document numbers.
+    
+  For example, let's say you have an index of emails. To extract key terms from
+  the ``body`` field of emails whose ``emailto`` field contains
+  ``matt@whoosh.ca``::
     
         searcher = email_index.searcher()
         docnums = searcher.document_numbers(emailto=u"matt@whoosh.ca")
-        keywords = list(searcher.key_terms(docnums, "content"))
+        keywords = list(searcher.key_terms(docnums, "body"))
 
-* Extract keywords for the top N documents in a :class:`whoosh.searching.Results` object.
+* Extract keywords from arbitrary text not in the index.
 
-  Use the :meth:`~whoosh.searching.Results.key_terms` method of the :class:`whoosh.searching.Results` object to extract keywords from the top N documents of the result set.
-    
-  For example, to extract *five* key terms from the ``content`` field of the top *ten* documents of a results object::
-    
-        keywords = list(results.key_terms("content", docs=10, numterms=5))
-        
+  Use the :meth:`~whoosh.searching.Searcher.key_terms_from_text` method of a
+  :class:`whoosh.searching.Searcher` to extract the keywords, given the text::
+  
+        searcher = email_index.searcher()
+        keywords = list(searcher.key_terms_from_text("body", mytext))
+
 
 Expansion models
 ================
 
-The ``ExpansionModel`` subclasses in the :mod:`whoosh.classify` module implement different weighting functions for key words. These models are translated into Python from original Java implementations in Terrier.
+The ``ExpansionModel`` subclasses in the :mod:`whoosh.classify` module implement
+different weighting functions for key words. These models are translated into
+Python from original Java implementations in Terrier.
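
The general idea behind these models can be sketched in plain Python: score each
candidate term by how much more frequent it is in the top documents than in the
collection as a whole. (This log-ratio weighting is a simplified illustration
with made-up counts, not Terrier's actual Bo1/Bo2/KL formulas.)

```python
import math

def expand_terms(top_doc_freqs, collection_freqs, collection_size, numterms=5):
    """Rank candidate expansion terms by comparing their frequency in the
    top documents against their frequency in the whole collection.
    A simplified log-ratio model for illustration only."""
    scores = {}
    for term, tf in top_doc_freqs.items():
        # Probability of the term in the collection as a whole
        p_collection = collection_freqs.get(term, 1) / collection_size
        # Terms that are common in the top docs but rare overall score high
        scores[term] = tf * math.log2(1.0 / p_collection)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:numterms]]

# Hypothetical term counts from the top N documents of a search
top_freqs = {"render": 12, "shading": 9, "the": 40, "raytrace": 7}
coll_freqs = {"render": 50, "shading": 30, "the": 100000, "raytrace": 12}
print(expand_terms(top_freqs, coll_freqs, collection_size=200000, numterms=3))
```

Note how a very common word like "the" scores low even though it occurs often
in the top documents, which is the property all of these models share.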
     
 

docs/source/parsing.rst

 Overview
 ========
 
-The job of a query parser is to convert a *query string* submitted by a user into *query objects* (objects from the :mod:`whoosh.query` module) which 
+The job of a query parser is to convert a *query string* submitted by a user
+into *query objects* (objects from the :mod:`whoosh.query` module) which can
+then be run against the index.
 
 For example, the user query::
 
 
     And([Term("content", u"rendering"), Term("content", u"shading")])
 
-Whoosh includes a few pre-made parsers for user queries in the :mod:`whoosh.qparser` module. The default parser is based on `pyparsing <http://pyparsing.wikispaces.com/>` and implements a query language similar to the one shipped with Lucene. The parser is quite powerful and how it builds query trees is fairly customizable. 
+Whoosh includes a few pre-made parsers for user queries in the
+:mod:`whoosh.qparser` module. The default parser is based on `pyparsing
+<http://pyparsing.wikispaces.com/>`_ and implements a query language similar to
+the one shipped with Lucene. The parser is quite powerful, and the way it builds
+query trees is fairly customizable.
 
 
 Using the default parser
 ========================
 
-To create a :class:`whoosh.qparser.QueryParser` object, pass it the name of the *default field* to search and the schema of the index you'll be searching.
+To create a :class:`whoosh.qparser.QueryParser` object, pass it the name of the
+*default field* to search and the schema of the index you'll be searching.
 
     from whoosh.qparser import QueryParser
 
     
 .. tip::
 
-    You can instantiate a QueryParser object without specifying a schema, however the parser will not process the text of the user query (see :ref:`querying and indexing <index-query>` below). This is really only useful for debugging, when you want to see how QueryParser will build a query, but don't want to make up a schema just for testing.
+    You can instantiate a QueryParser object without specifying a schema,
+    however the parser will not process the text of the user query (see
+    :ref:`querying and indexing <index-query>` below). This is really only
+    useful for debugging, when you want to see how QueryParser will build a
+    query, but don't want to make up a schema just for testing.
 
-Once you have a QueryParser object, you can call ``parse()`` on it to parse a query string into a query object::
+Once you have a QueryParser object, you can call ``parse()`` on it to parse a
+query string into a query object::
 
     >>> parser.parse(u"alpha OR beta gamma")
     Or([Term("content", u"alpha"), Term("content", "beta")])
 
-See the :doc:`query language reference <querylang>` for the features and syntax of the default parser's query language.
+See the :doc:`query language reference <querylang>` for the features and syntax
+of the default parser's query language.
 
 
 Letting the user search multiple fields
 =======================================
 
-The QueryParser object takes terms without explicit fields and assigns them to the default field you specified when you created the object, so for example if you created the object with::
+The QueryParser object takes terms without explicit fields and assigns them to
+the default field you specified when you created the object, so for example if
+you created the object with::
 
     parser = QueryParser("content", schema=myschema)
     
 
     content:three content:blind content:mice
     
-However, you might want to let the user search *multiple* fields by default. For example, you might want "unfielded" terms to search both the ``title`` and ``content`` fields.
+However, you might want to let the user search *multiple* fields by default. For
+example, you might want "unfielded" terms to search both the ``title`` and
+``content`` fields.
 
-In that case, you can use a :class:`whoosh.qparser.MultifieldParser`. This is just like the normal QueryParser, but instead of a default field name string, it takes a *sequence* of field names::
+In that case, you can use a :class:`whoosh.qparser.MultifieldParser`. This is
+just like the normal QueryParser, but instead of a default field name string, it
+takes a *sequence* of field names::
 
     from whoosh.qparser import MultifieldParser
 
 QueryParser supports two extra keyword arguments:
 
 conjunction
-    The query class to use to join sub-queries when the user doesn't explicitly specify a boolean operator, such as ``AND`` or ``OR``.
+    The query class to use to join sub-queries when the user doesn't explicitly
+    specify a boolean operator, such as ``AND`` or ``OR``.
     
-    This must be a :class:`whoosh.query.Query` subclass (*not* an instantiated object) that accepts a list of subqueries in its ``__init__`` method. The default is :class:`whoosh.query.And`.
+    This must be a :class:`whoosh.query.Query` subclass (*not* an instantiated
+    object) that accepts a list of subqueries in its ``__init__`` method. The
+    default is :class:`whoosh.query.And`.
     
-    This is useful if you want to change the default operator to ``OR``, or if you've written a custom operator you want the parser to use instead of the ones shipped with Whoosh.
+    This is useful if you want to change the default operator to ``OR``, or if
+    you've written a custom operator you want the parser to use instead of the
+    ones shipped with Whoosh.
 
 termclass
     The query class to use to wrap single terms.
     
-    This must be a :class:`whoosh.query.Query` subclass (*not* an instantiated object) that accepts a fieldname string and term text unicode string in its ``__init__`` method. The default is :class:`whoosh.query.Term`.
+    This must be a :class:`whoosh.query.Query` subclass (*not* an instantiated
+    object) that accepts a fieldname string and term text unicode string in its
+    ``__init__`` method. The default is :class:`whoosh.query.Term`.
 
-    This is useful if you want to chnage the default term class to :class:`whoosh.query.Variations`, or if you've written a custom term class you want the parser to use instead of the ones shipped with Whoosh.
+    This is useful if you want to change the default term class to
+    :class:`whoosh.query.Variations`, or if you've written a custom term class
+    you want the parser to use instead of the ones shipped with Whoosh.
 
 >>> orparser = QueryParser("content", schema=myschema, conjunction=query.Or)
 
-Subclassing QueryParser
------------------------
-
-The ``QueryParser`` class is designed to allow a certain amount of customization by subclassing. The methods invoked on the abstract syntax tree produced by pyparsing in turn call methods starting with ``make_``, such as ``make_term``, ``make_prefix``, etc. The methods are passed the parsed information (such as the fieldname and term text for ``make_term``) and return a ``Query`` object. You can subclass and replace these methods to do additional processing or return difference Query types. See the source code of the ``PyparsingBasedParser`` and ``QueryParser`` classes in the ``qparser`` module.
 
 Writing your own parser
 -----------------------
 
-To implement a different query syntax, or for complete control over query parsing, you can write your own parser.
+To implement a different query syntax, or for complete control over query
+parsing, you can write your own parser.
 
-A parser is simply a class or function that takes input from the user and generates :class:`whoosh.query.Query` objects from it. For example, you could write a function that parses queries specified in XML:
+A parser is simply a class or function that takes input from the user and
+generates :class:`whoosh.query.Query` objects from it. For example, you could
+write a function that parses queries specified in XML:
 
 .. code-block:: xml
 

docs/source/quickstart.rst

 A quick introduction
 ====================
 
-The following code should give you some of the flavor of Whoosh. It uses 
-
 >>> from whoosh.index import create_in
 >>> from whoosh.fields import *
 >>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
 {"title": u"First document", "path": u"/a"}
 
 
-Creating an index
-=================
+The ``Index`` and ``Schema`` objects
+====================================
 
-At a high level, to begin using Whoosh you need an *index object*. The first
-time you create an index, you must define the index's *schema*. For example,
-this schema has two fields, "title" and "content"::
+To begin using Whoosh, you need an *index object*. The first time you create
+an index, you must define the index's *schema*. The schema lists the *fields*
+in the index. A field is a piece of information for each document in the index,
+such as its title or text content. A field can be *indexed* (meaning it can
+be searched) and/or *stored* (meaning the value that gets indexed is returned
+with the results; this is useful for fields such as the title).
+
+This schema has two fields, "title" and "content":
 
 	from whoosh.fields import Schema, TEXT
 	
 	schema = Schema(title=TEXT, content=TEXT)
 
-A Schema object defines the fields that are indexed for each document, and
-how/whether the content of the fields is indexed. You only need to do create
-the schema once, when you create the index. The schema is pickled and stored with
-the index.
+You only need to create the schema once, when you create the index. The
+schema is pickled and stored with the index.
 
 When you create the Schema object, you use keyword arguments to map field names
-to field types. The list of fields and their types defines what you are indexing and
-what's searchable. Whoosh comes with some very useful predefined field types, and you
-can easily create your own.
+to field types. The list of fields and their types defines what you are indexing
+and what's searchable. Whoosh comes with some very useful predefined field
+types, and you can easily create your own.
 
 :class:`whoosh.fields.ID`
-    This type simply indexes (and optionally stores) the entire value of the field as a
-    single unit (that is, it doesn't break it up into individual words). This is useful
-    for fields such as a file path, URL, date, category, etc.
+    This type simply indexes (and optionally stores) the entire value of the
+    field as a single unit (that is, it doesn't break it up into individual
+    words). This is useful for fields such as a file path, URL, date, category,
+    etc.
     
 :class:`whoosh.fields.STORED`
-    This field is stored with the document, but not indexed. This field type is not
-    indexed and not searchable. This is useful for document information you want to
-    display to the user in the search results.
+    This field is stored with the document, but not indexed. This field type is
+    not indexed and not searchable. This is useful for document information you
+    want to display to the user in the search results.
     
 :class:`whoosh.fields.KEYWORD`
-    This type is designed for space- or comma-separated keywords. This type is indexed
-    and searchable (and optionally stored). To save space, it does not support phrase
-    searching.
+    This type is designed for space- or comma-separated keywords. This type is
+    indexed and searchable (and optionally stored). To save space, it does not
+    support phrase searching.
     
 :class:`whoosh.fields.TEXT`
-    This type is for body text. It indexes (and optionally stores) the text and stores
-    term positions to allow phrase searching.
+    This type is for body text. It indexes (and optionally stores) the text and
+    stores term positions to allow phrase searching.
+
+:class:`whoosh.fields.NUMERIC`
+    This type is for numbers. You can store integers or floating point numbers.
+    
+:class:`whoosh.fields.BOOLEAN`
+    This type is for boolean (true/false) values.
+
+:class:`whoosh.fields.DATETIME`
+    This type is for ``datetime`` objects.
 
 :class:`whoosh.fields.NGRAM`
     TODO
 
-(As a shortcut, if you don't need to pass any arguments to the field type, you can just
-give the class name and Whoosh will instantiate the object for you.) ::
+(As a shortcut, if you don't need to pass any arguments to the field type, you
+can just give the class name and Whoosh will instantiate the object for you.) ::
 
     from whoosh.fields import Schema, STORED, ID, KEYWORD, TEXT
 
 
 See :doc:`schema` for more information.
 
-Once you have the schema, you can create an index using the ``create_index_in``
+Once you have the schema, you can create an index using the ``create_in``
 function::
 
 	import os.path
 	
 	if not os.path.exists("index"):
         os.mkdir("index")
-	index = create_index_in("index", schema)
+	ix = create_in("index", schema)
 
-At a low level, this involves creating a *storage* object to contain the index.
-A Storage object represents that medium in which the index will be stored. Usually this
-will be ``FileStorage``, which stores the index as a set of files in a directory.
-Whoosh includes a few other experimental storage backends. Future versions may include
-additional options, such as a SQL backend.
+(At a low level, this creates a *Storage* object to contain the index. A
+``Storage`` object represents the medium in which the index will be stored.
+Usually this will be ``FileStorage``, which stores the index as a set of files
+in a directory.)
 
-Here's how you would create the index using a storage object directly instead of
-the ``create_index_in`` convenience function::
-
-    import os, os.path
-    from whoosh.filedb.filestore import FileStorage
-
-    if not os.path.exists("index"):
-        os.mkdir("index")
-
-    storage = FileStorage("index")
-    index = storage.create_index(schema)
-
-
-Opening an index
-================
-
-After you've created an index, you can open it using the ``open_dir`` convenience
-function::
+After you've created an index, you can open it using the ``open_dir``
+convenience function::
 
 	from whoosh.index import open_dir
 	
-	index = open_dir("index")
+	ix = open_dir("index")
 	
-Or, using a storage object::
 
-	from whoosh.filedb.filestore import FileStorage
-	
-	storage = FileStorage("index")
-	index = storage.open_index()
+The ``IndexWriter`` object
+==========================
 
-
-Indexing documents
-==================
-
-OK, so we've got an Index object, now we can start adding documents. The writer() method
-of the Index object returns an ``IndexWriter`` object that lets you add documents to
-the index. The IndexWriter's ``add_document(**kwargs)`` method accepts keyword arguments
-where the field name is mapped to a value::
+OK, so we've got an Index object; now we can start adding documents. The
+``writer()`` method of the Index object returns an ``IndexWriter`` object that
+lets you add documents to the index. The IndexWriter's ``add_document(**kwargs)``
+method accepts keyword arguments where the field name is mapped to a value::
 
     writer = ix.writer()
     writer.add_document(title=u"My document", content=u"This is my document!",
 
 Two important notes:
 
-* You don't have to fill in a value for every field. Whoosh doesn't care if you leave
-  out a field from a document.
+* You don't have to fill in a value for every field. Whoosh doesn't care if you
+  leave out a field from a document.
 
-* Indexed fields must be passed a unicode value. Fields that are stored but not
-  indexed (STORED field type) can be passed any pickle-able object.
+* Indexed text fields must be passed a unicode value. Fields that are stored
+  but not indexed (STORED field type) can be passed any marshal-able object.
 
-If you have a field that is both indexed and stored, you can even index a unicode
-value but store a different object if necessary (it's usually not, but sometimes
-this is really useful) using this trick::
+If you have a text field that is both indexed and stored, you can index a
+unicode value but store a different object if necessary (it's usually not, but
+sometimes this is really useful) using this trick::
 
     writer.add_document(title=u"Title to be indexed", _stored_title=u"Stored title")
 
 
 See :doc:`indexing` for more information.
 
-Once your documents are in the index, you can search for them.
+Once your documents are committed to the index, you can search for them.
 
 
-Searching
-=========
-
-So, let's say a user has typed a search into a search box and you want to run that search on
-you index.
+The ``Searcher`` object
+=======================
 
 To begin searching the index, we'll need a Searcher object::
 
     searcher = ix.searcher()
 
-You can use the high-level ``find()`` method to run queries on the index.
-The first argument is the default field to search (for terms in the query string that
-aren't explicitly qualified with a field), and the second is the query string. The
-method returns a Results object.
+The Searcher's ``search()`` method takes a *Query object*. You can construct
+query objects directly or use a query parser to parse a query string.
 
-The Results object acts like a list of dictionaries, where each dictionary
-contains the stored fields of the document. The first document in the list is the most
-relevant based on the scoring algorithm::
+For example, this query would match documents that contain both "apple" and
+"bear" in the "content" field::
 
-	>>> results = searcher.find("content", u"second")
+    # Construct query objects directly
+    
+    from whoosh.query import *
+    myquery = And([Term("content", u"apple"), Term("content", u"bear")])
+
+To parse a query string, you can use the default query parser in the ``qparser``
+module. The first argument to the ``QueryParser`` constructor is the default
+field to search. This is usually the "body text" field. The second, optional
+argument is a schema the parser uses to understand how to parse each field::
+
+    # Parse a query string
+    
+    from whoosh.qparser import QueryParser
+    parser = QueryParser("content", schema=ix.schema)
+    myquery = parser.parse(querystring)
+    
+Once you have a ``Searcher`` and a query object, you can use the ``Searcher``'s
+``search()`` method to run the query and get a ``Results`` object::
+
+    >>> myquery = parser.parse(u"second")
+    >>> results = searcher.search(myquery)
     >>> print(len(results))
     1
     >>> print(results[0])
     {"title": "Second try", "path": "/b", "icon": "/icons/sheep.png"}
 
-At a lower level, the Searcher's ``search()`` method takes Query objects instead of
-a query string. You can construct query objects directly or use a query parser to
-parse a query string into Query objects.
-
-For example, this query would match documents that contain both "apple" and "bear"
-in the "content" field::
-
-	from whoosh.query import *
-
-	myquery = And([Term("content", u"apple"), Term("content", "bear")])
-	
-To parse a query string into Query objects, you can use the default query parser
-in the ``qparser`` module::
-
-    from whoosh.qparser import QueryParser
-    
-    parser = QueryParser("content", schema = ix.schema)
-
-The first argument, ``"content"``, specifies the default field to use when the user
-doesn't specify a field for a word/phrase/clause. This is usually the "body text"
-field. Specifying the schema lets the parser know which analyzers to use for which
-fields. If you don't have a schema (usually when you're testing the parser), you can
-omit the schema. In that case, the parser won't filter the query terms (for example,
-it won't lower-case them).
-
-The default ``QueryParser`` implements a query language very similar to Lucene's.
-It lets you connect terms with AND or OR, eleminate terms with NOT, group terms
-together into clauses with parentheses, do range, prefix, and wilcard queries,
-and specify different fields to search. By default it joins clauses together with
-AND (so by default, all terms you specify must be in the document for the document
-to match)::
+The default ``QueryParser`` implements a query language very similar to
+Lucene's. It lets you connect terms with ``AND`` or ``OR``, eliminate terms with
+``NOT``, group terms together into clauses with parentheses, do range, prefix,
+and wildcard queries, and specify different fields to search. By default it
+joins clauses together with ``AND`` (so by default, all terms you specify must
+be in the document for the document to match)::
 
     >>> print(parser.parse(u"render shade animate"))
     And([Term("content", "render"), Term("content", "shade"), Term("content", "animate")])
 
     >>> print(parser.parse(u"rend*"))
     Prefix("content", "rend")
-    
-We'll create a query object we can use to find a document in the index we created above::
-
-    query = parser.parse(u"second")
-
-Now you can use the searcher to find documents that match the query::
-
-    results = searcher.search(query)
 
 Whoosh includes extra features for dealing with search results, such as
 
 
 See :doc:`searching` for more information.
 
+
+
+

docs/source/schema.rst

 
 Each document can have multiple fields, such as title, content, url, date, etc.
 
-Some fields can be indexed, and some fields can be stored with the document so the contents of the field so the field value is available in search results. Some fields will be both indexed and stored.
+Some fields can be indexed, and some fields can be stored with the document so
+that the field value is available in search results. Some fields will be both
+indexed and stored.
 
-The schema is the set of all possible fields in a document. Each individual document might only use a subset of the available fields in the schema.
+The schema is the set of all possible fields in a document. Each individual
+document might only use a subset of the available fields in the schema.
 
-For example, a simple schema for indexing emails might have fields like ``from_addr``, ``to_addr``, ``subject``, ``body``, and ``attachments``, where the ``attachments`` field lists the names of attachments to the email. For emails without attachments, you would omit the attachments field.
+For example, a simple schema for indexing emails might have fields like
+``from_addr``, ``to_addr``, ``subject``, ``body``, and ``attachments``, where
+the ``attachments`` field lists the names of attachments to the email. For
+emails without attachments, you would omit the attachments field.
 
 
 Built-in field types
 Whoosh provides some useful predefined field types:
 
 :class:`whoosh.fields.TEXT`
-    This type is for body text. It indexes (and optionally stores) the text and stores term positions to allow phrase searching.
+    This type is for body text. It indexes (and optionally stores) the text and
+    stores term positions to allow phrase searching.
 
-    TEXT fields use StandardAnalyzer? by default. To specify a different analyzer, use the analyzer keyword argument to the constructor, e.g. TEXT(analyzer=analysis.StemmingAnalyzer()). See TextAnalysis?.
+    TEXT fields use ``StandardAnalyzer`` by default. To specify a different
+    analyzer, use the ``analyzer`` keyword argument to the constructor, e.g.
+    ``TEXT(analyzer=analysis.StemmingAnalyzer())``. See the documentation on
+    text analysis.
 
-    By default, TEXT fields store position information for each indexed term, to allow you to search for phrases. If you don't need to be able to search for phrases in a text field, you can turn off storing term positions to save space. Use TEXT(phrase=False).
+    By default, TEXT fields store position information for each indexed term, to
+    allow you to search for phrases. If you don't need to be able to search for
+    phrases in a text field, you can turn off storing term positions to save
+    space. Use TEXT(phrase=False).
 
-    By default, TEXT fields are not stored. Usually you will not want to store the body text in the search index. Usually you have the indexed documents themselves available to read or link to based on the search results, so you don't need to store their text in the search index. However, in some circumstances it can be useful (see HighlightingResults?). Use TEXT(stored=True) to specify that the text should be stored in the index.
+    By default, TEXT fields are not stored. Usually you will not want to store
+    the body text in the search index; typically you have the indexed documents
+    themselves available to read or link to based on the search results, so you
+    don't need to store their text in the search index. However, in some
+    circumstances it can be useful (for example, to highlight the matched terms
+    in search results). Use ``TEXT(stored=True)`` to specify that the text
+    should be stored in the index.
 
 :class:`whoosh.fields.KEYWORD`
-    This field type is designed for space- or comma-separated keywords. This type is indexed and searchable (and optionally stored). To save space, it does not support phrase searching.
+    This field type is designed for space- or comma-separated keywords. This
+    type is indexed and searchable (and optionally stored). To save space, it
+    does not support phrase searching.
 
-    To store the value of the field in the index, use stored=True in the constructor. To automatically lowercase the keywords before indexing them, use lowercase=True.
+    To store the value of the field in the index, use stored=True in the
+    constructor. To automatically lowercase the keywords before indexing them,
+    use lowercase=True.
 
-    By default, the keywords are space separated. To separate the keywords by commas instead (to allow keywords containing spaces), use commas=True.
+    By default, the keywords are space separated. To separate the keywords by
+    commas instead (to allow keywords containing spaces), use commas=True.
 
     If your users will use the keyword field for searching, use scorable=True.
 
 :class:`whoosh.fields.ID`
-    The ID field type simply indexes (and optionally stores) the entire value of the field as a single unit (that is, it doesn't break it up into individual terms). This type of field does not store frequency information, so it's quite compact, but not very useful for scoring.
+    The ID field type simply indexes (and optionally stores) the entire value of
+    the field as a single unit (that is, it doesn't break it up into individual
+    terms). This type of field does not store frequency information, so it's
+    quite compact, but not very useful for scoring.
 
-    Use ID for fields like url or path (the URL or file path of a document), date, category -- fields where the value must be treated as a whole, and each document only has one value for the field.
+    Use ID for fields like url or path (the URL or file path of a document),
+    date, category -- fields where the value must be treated as a whole, and
+    each document only has one value for the field.
 
-    By default, ID fields are not stored. Use ID(stored=True) to specify that the value of the field should be stored with the document for use in the search results. For example, you would want to store the value of a url field so you could provide links to the original in your search results.
+    By default, ID fields are not stored. Use ID(stored=True) to specify that
+    the value of the field should be stored with the document for use in the
+    search results. For example, you would want to store the value of a url
+    field so you could provide links to the original in your search results.
 
 :class:`whoosh.fields.STORED`
-    This field is stored with the document, but not indexed and not searchable. This is useful for document information you want to display to the user in the search results, but don't need to be able to search for.
+    This field is stored with the document, but not indexed and not searchable.
+    This is useful for document information you want to display to the user in
+    the search results, but don't need to be able to search for.
 
 :class:`whoosh.fields.NGRAM`
     TBD.
                     body=TEXT(analyzer=StemmingAnalyzer()),
                     tags=KEYWORD)
 
-If you aren't specifying any constructor keyword arguments to one of the predefined fields, you can leave off the brackets (e.g. fieldname=TEXT instead of fieldname=TEXT()). Whoosh will instantiate the class for you.
+If you aren't specifying any constructor keyword arguments to one of the
+predefined fields, you can leave off the brackets (e.g. fieldname=TEXT instead
+of fieldname=TEXT()). Whoosh will instantiate the class for you.
 
 
 Advanced schema setup
 Field boosts
 ------------
 
-You can specify a field boost for a field. This is a multiplier applied to the score of any term found in the field. For example, to make terms found in the title field score twice as high as terms in the body field::
+You can specify a field boost for a field. This is a multiplier applied to the
+score of any term found in the field. For example, to make terms found in the
+title field score twice as high as terms in the body field::
 
     schema = Schema(title=TEXT(field_boost=2.0), body=TEXT)
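
Conceptually, the boost is just a per-field multiplier folded into each term's
score (illustrative arithmetic only, not Whoosh's actual scoring function):

```python
def boosted_score(base_term_score, field_boost=1.0):
    # A field boost simply scales whatever score the term would
    # otherwise earn in that field.
    return base_term_score * field_boost

# With title boosted 2.0x, the same base score counts double in the title
print(boosted_score(1.5, field_boost=2.0))  # title field
print(boosted_score(1.5))                   # body field, default boost
```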
 
 Field types
 -----------
 
-The predefined field types listed above are subclasses of ``fields.FieldType``. ``FieldType`` is a pretty simple class. Its attributes contain information that define the behavior of a field.
+The predefined field types listed above are subclasses of ``fields.FieldType``.
+``FieldType`` is a pretty simple class. Its attributes contain information that
+define the behavior of a field.
 
 ============ =============== ======================================================
 Attribute     Type             Description
                              on an ``IndexWriter``.
 ============ =============== ======================================================
 
-The constructors for most of the predefined field types have parameters that let you customize these parts. For example:
+The constructors for most of the predefined field types have parameters that let
+you customize these parts. For example:
 
-* Most of the predefined field types take a stored keyword argument that sets FieldType.stored.
+* Most of the predefined field types take a ``stored`` keyword argument that
+  sets ``FieldType.stored``.
 
-* The ``TEXT()`` constructor takes an ``analyzer`` keyword argument that is passed on to the format object.
+* The ``TEXT()`` constructor takes an ``analyzer`` keyword argument that is
+  passed on to the format object.
 
 Formats
 -------
 
-A ``Format`` object defines what kind of information a field records about each term, and how the information is stored on disk.
+A ``Format`` object defines what kind of information a field records about each
+term, and how the information is stored on disk.
 
 For example, the Existence format would store postings like this:
 
 30    ``[7,12]``
 ===== =============
 
-The indexing code passes the unicode string for a field to the field's Format object. The Format object calls its analyzer (see text analysis) to break the string into tokens, then encodes information about each token.
+The indexing code passes the unicode string for a field to the field's Format
+object. The Format object calls its analyzer (see text analysis) to break the
+string into tokens, then encodes information about each token.
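
As a concept illustration (plain Python, not the actual Whoosh classes), that flow looks roughly like this: the format runs the analyzer over the text, then emits one encoded value per unique term:

```python
def toy_analyzer(text):
    """Stand-in analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()

def toy_frequency_postings(text, analyzer=toy_analyzer):
    """Stand-in for a Frequency-style format: record each term's count."""
    freqs = {}
    for token in analyzer(text):
        freqs[token] = freqs.get(token, 0) + 1
    return sorted(freqs.items())

def toy_positions_postings(text, analyzer=toy_analyzer):
    """Stand-in for a Positions-style format: record where each term occurs."""
    positions = {}
    for pos, token in enumerate(analyzer(text)):
        positions.setdefault(token, []).append(pos)
    return sorted(positions.items())

postings = toy_frequency_postings("To be or not to be")
# postings == [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```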
 
 Whoosh ships with the following pre-defined formats.
 
                 and at what positions.
 =============== ================================================================
 
-The STORED field type uses the Stored format (which does nothing, so STORED fields are not indexed). The ID type uses the Existence format. The KEYWORD type uses the Frequency format. The TEXT type uses the Positions format if it is instantiated with phrase=True (the default), or Frequency if phrase=False.
+The STORED field type uses the Stored format (which does nothing, so STORED
+fields are not indexed). The ID type uses the Existence format. The KEYWORD
+type uses the Frequency format. The TEXT type uses the Positions format if it
+is instantiated with ``phrase=True`` (the default), or Frequency if
+``phrase=False``.
 
-In addition, the following formats are implemented for the possible convenience of expert users, but are not currently used in Whoosh:
+In addition, the following formats are implemented for the possible convenience
+of expert users, but are not currently used in Whoosh:
 
 ================= ================================================================
 Class name        Description
 Vectors
 -------
 
-The main index is an inverted index. It maps terms to the documents they appear in. It is also sometimes useful to store a forward index, also known as a term vector, that maps documents to the terms that appear in them.
+The main index is an inverted index. It maps terms to the documents they appear
+in. It is also sometimes useful to store a forward index, also known as a term
+vector, that maps documents to the terms that appear in them.
 
 For example, imagine an inverted index like this for a field:
 
 3          ``[(text=apple, freq=1)]``
 ========== ======================================================
 
-If you set FieldType.vector to a Format object, the indexing code will use the Format object to store information about the terms in each document. Currently by default Whoosh does not make use of term vectors at all, but they are available to expert users who want to implement their own field types.
+If you set FieldType.vector to a Format object, the indexing code will use the
+Format object to store information about the terms in each document. Currently
+by default Whoosh does not make use of term vectors at all, but they are
+available to expert users who want to implement their own field types.
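
The two index directions can be sketched in a few lines of plain Python (a concept illustration, not Whoosh's storage format), using the same example documents as above:

```python
docs = {
    1: "render shade animate",
    2: "animate shade render",
    3: "apple",
}

# Forward index (term vector): document -> the terms it contains
forward = {docnum: sorted(set(text.split())) for docnum, text in docs.items()}

# Inverted index: term -> the documents it appears in
inverted = {}
for docnum, text in docs.items():
    for term in sorted(set(text.split())):
        inverted.setdefault(term, []).append(docnum)
```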
 
 Implementation notes
 --------------------
 
-The query.Phrase query object can use positions in postings (``FieldType.format=Positions``) or in vectors (``FieldType.vector=Positions``), but storing positions in the postings gives faster phrase searches.
+The query.Phrase query object can use positions in postings
+(``FieldType.format=Positions``) or in vectors (``FieldType.vector=Positions``),
+but storing positions in the postings gives faster phrase searches.
 
-Field names are mapped to numbers inside the Schema, and the numbers are used internally. This means you can add fields to an existing index, and you can rename fields (although there is no API for doing so), but you can't delete fields from an existing index. If you want to make drastic changes to the schema, you should reindex your documents from scratch with the new schema.
+Field names are mapped to numbers inside the Schema, and the numbers are used
+internally. This means you can add fields to an existing index, and you can
+rename fields (although there is no API for doing so), but you can't delete
+fields from an existing index. If you want to make drastic changes to the
+schema, you should reindex your documents from scratch with the new schema.
 

docs/source/searching.rst

 How to search
 =============
 
-Once you've created an index and added documents to it, you can search for those documents.
+Once you've created an index and added documents to it, you can search for those
+documents.
 
 The Searcher object
 ===================
 
     searcher = myindex.searcher()
 
-The Searcher object is the main high-level interface for reading the index. It has
-lots of useful methods for getting information about the index, such as
+The Searcher object is the main high-level interface for reading the index. It
+has lots of useful methods for getting information about the index, such as
 ``most_frequent_terms()``.
 
 >>> list(searcher.most_frequent_terms("content", 3))
 [(u"whoosh", 32), (u"index", 24), (u"document", 18)]
 
 However, the most important method on the Searcher object is
-:meth:`~whoosh.searching.Searcher.search`, which takes a :class:`whoosh.query.Query`
-object and returns a :class:`~whoosh.searching.Results` object::
+:meth:`~whoosh.searching.Searcher.search`, which takes a
+:class:`whoosh.query.Query` object and returns a
+:class:`~whoosh.searching.Results` object::
 
     from whoosh.qparser import QueryParser
     
     s = myindex.searcher()
     results = s.search(q)
 
-If you know you only need the top "N" documents (for example, you're creating an HTML
-page showing the top 10 results), you can specify that you only want that many documents
-to be scored and sorted::
+If you know you only need the top "N" documents (for example, you're creating an
+HTML page showing the top 10 results), you can specify that you only want that
+many documents to be scored and sorted::
 
     results = s.search(q, limit=10)
     
-You should set the limit whenever possible, because it's much more efficient than scoring
-and sorting every matching document.
+You should set the limit whenever possible, because it's much more efficient
+than scoring and sorting every matching document.
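
Why the limit helps can be seen with a plain-Python sketch (illustrative only, not Whoosh's internals): selecting the top N with a bounded heap never has to order the full result set the way a complete sort does.

```python
import heapq

# Fake scores for 50,000 matching documents (all distinct values).
scores = [(i * 7919) % 104729 / 104729.0 for i in range(50000)]

# Without a limit: score and sort every matching document.
full_order = sorted(scores, reverse=True)

# With limit=10: only the current best ten are ever tracked.
top_ten = heapq.nlargest(10, scores)

assert top_ten == full_order[:10]
```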
 
-Since display a page of results at a time is a common pattern, the ``search_page``
-method lets you conveniently retrieve only the results on a given page::
+Since displaying a page of results at a time is a common pattern, the
+``search_page`` method lets you conveniently retrieve only the results on a
+given page::
 
 	results = s.search_page(q, 1)
 
-The default page length is 10 hits. You can use the ``pagelen`` keyword argument to
-set a different page length::
+The default page length is 10 hits. You can use the ``pagelen`` keyword argument
+to set a different page length::
 
 	results = s.search_page(q, 5, pagelen=20)
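
The page arithmetic behind this is simple. Assuming 1-based page numbers, page ``p`` with page length ``n`` covers hits ``(p - 1) * n`` through ``p * n - 1`` (0-based). A quick sketch:

```python
def page_slice(pagenum, pagelen=10):
    """Return the 0-based (start, stop) slice for a 1-based page number."""
    start = (pagenum - 1) * pagelen
    return start, start + pagelen

hits = list(range(95))               # pretend the query matched 95 documents
start, stop = page_slice(5, pagelen=20)
page = hits[start:stop]              # hits 80..94: a short final page
```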
 
 Results object
 ==============
 
-The :class:`~whoosh.searching.Results` object acts like a list of the matched documents.
-You can use it to access the stored fields of each hit document, to display to the user.
+The :class:`~whoosh.searching.Results` object acts like a list of the matched
+documents. You can use it to access the stored fields of each hit document, to
+display to the user.
 
 >>> # How many documents matched?
 >>> len(results)
 Scoring
 -------
 
-Normally the list of result documents is sorted by *score*. The :mod:`whoosh.scoring` module
-contains implementations of various scoring algorithms. The default is
-:class:`~whoosh.scoring.BM25F`.
+Normally the list of result documents is sorted by *score*. The
+:mod:`whoosh.scoring` module contains implementations of various scoring
+algorithms. The default is :class:`~whoosh.scoring.BM25F`.
 
-You can set the scoring object to use when you create the searcher using the ``weighting``
-keyword argument::
+You can set the scoring object to use when you create the searcher using the
+``weighting`` keyword argument::
 
     s = myindex.searcher(weighting=whoosh.scoring.TF_IDF())
 
-A scoring object is an object with a :meth:`~whoosh.scoring.Weighting.score` method that
-takes information about the term to score and returns a score as a floating point number.
+A scoring object is an object with a :meth:`~whoosh.scoring.Weighting.score`
+method that takes information about the term to score and returns a score as a
+floating point number.
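
The contract is small enough to sketch in plain Python. This is a hypothetical illustration (the real ``Weighting.score`` method receives searcher and term information rather than bare numbers):

```python
import math

class ToyTFIDF:
    """Illustrative scorer: term frequency weighted by rarity in the collection."""

    def __init__(self, doc_count):
        self.doc_count = doc_count   # total documents in the collection

    def score(self, term_freq, doc_freq):
        # Weight terms that are frequent in this document but rare overall.
        idf = math.log(1.0 + self.doc_count / float(doc_freq))
        return term_freq * idf

w = ToyTFIDF(doc_count=1000)
common = w.score(term_freq=3, doc_freq=900)   # term found in most documents
rare = w.score(term_freq=3, doc_freq=3)       # term found in very few
assert rare > common
```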
 
 Sorting
 -------
 
-Instead of sorting the matched documents by a score, you can sort them by the contents of one or more indexed field(s). These should be fields for which each document stores one term (i.e. an ID field type), for example "path", "id", "date", etc.
+Instead of sorting the matched documents by a score, you can sort them by the
+contents of one or more indexed field(s). These should be fields for which each
+document stores one term (i.e. an ID field type), for example "path", "id",
+"date", etc.
 
 To sort by the contents of the "path" field::
 
 Custom sorters
 --------------
 
-If you require more complex sorting you can implement a custom :class:`whoosh.scoring.Sorter` object and pass it to the `sortedby` keyword argument::
+If you require more complex sorting you can implement a custom
+:class:`whoosh.scoring.Sorter` object and pass it to the `sortedby` keyword
+argument::
 
     results = s.search(myquery, sortedby=mysorter())
     
-A sorting object is an object with an :meth:`~whoosh.scoring.Sorter.order` method, which takes a searcher and an unsorted list of document numbers, and returns a sorted list of document numbers.
+A sorting object is an object with an :meth:`~whoosh.scoring.Sorter.order`
+method, which takes a searcher and an unsorted list of document numbers, and
+returns a sorted list of document numbers.
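
The ``order()`` contract can be sketched with plain Python. The ``keys`` mapping here is a hypothetical stand-in for whatever per-document value a real sorter would read through the searcher:

```python
class ToyReverseSorter:
    """Illustrative sorter: order document numbers by a per-document key, descending."""

    def __init__(self, keys):
        # keys: docnum -> sortable value (stand-in for data read via the searcher)
        self.keys = keys

    def order(self, searcher, docnums):
        return sorted(docnums, key=lambda d: self.keys[d], reverse=True)

sorter = ToyReverseSorter({1: "2009", 2: "2011", 3: "2010"})
ordered = sorter.order(None, [1, 2, 3])   # newest first: [2, 3, 1]
```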
 
 
 Convenience functions
 =====================
 
-The :meth:`~whoosh.searching.Searcher.document` and :meth:`~whoosh.searching.Searcher.documents` methods on the Searcher object let you retrieve the stored fields of documents matching terms you pass in keyword arguments.
+The :meth:`~whoosh.searching.Searcher.document` and
+:meth:`~whoosh.searching.Searcher.documents` methods on the Searcher object let
+you retrieve the stored fields of documents matching terms you pass in keyword
+arguments.
 
-This is especially useful for fields such as dates/times, identifiers, paths, and so on.
+This is especially useful for fields such as dates/times, identifiers, paths,
+and so on.
 
 >>> list(searcher.documents(indexeddate=u"20051225"))
 [{"title": u"Christmas presents"}, {"title": u"Turkey dinner report"}]
 
 * The results are not scored.
 * Multiple keywords are always AND-ed together.
-* The entire value of each keyword argument is considered a single term; you can't search for multiple terms in the same field.
+* The entire value of each keyword argument is considered a single term; you
+  can't search for multiple terms in the same field.
 
 
 Combining Results objects
 =========================
 
-It is sometimes useful to use the results of another query to influence the order of a :class:`whoosh.searching.Results` object.
+It is sometimes useful to use the results of another query to influence the
+order of a :class:`whoosh.searching.Results` object.
 
-For example, you might have a "best bet" field. This field contains hand-picked keywords for documents. When the user searches for those keywords, you want those documents to be placed at the top of the results list. You could try to do this by boosting the "bestbet" field tremendously, but that can have unpredictable effects on scoring. It's much easier to simply run the query twice and combine the results::
+For example, you might have a "best bet" field. This field contains hand-picked
+keywords for documents. When the user searches for those keywords, you want
+those documents to be placed at the top of the results list. You could try to do
+this by boosting the "bestbet" field tremendously, but that can have
+unpredictable effects on scoring. It's much easier to simply run the query twice
+and combine the results::
 
     # Parse the user query
     userquery = queryparser.parse(querystring)
 The Results object supports the following methods:
 
 ``Results.extend(results)``
-    Adds the documents in 'results' on to the end of the list of result documents.
+    Adds the documents in 'results' on to the end of the list of result
+    documents.
     
 ``Results.filter(results)``
     Removes the documents in 'results' from the list of result documents.
     
 ``Results.upgrade(results)``
-    Any result documents that also appear in 'results' are moved to the top of the list of result documents.
+    Any result documents that also appear in 'results' are moved to the top of
+    the list of result documents.
     
 ``Results.upgrade_and_extend(results)``
-    Any result documents that also appear in 'results' are moved to the top of the list of result documents. Then any other documents in 'results' are added on to the list of result documents.
+    Any result documents that also appear in 'results' are moved to the top of
+    the list of result documents. Then any other documents in 'results' are
+    added on to the list of result documents.
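
The semantics of ``upgrade_and_extend`` can be illustrated with plain lists of document numbers (a concept sketch; real ``Results`` objects also carry scores and stored fields):

```python
def upgrade_and_extend(results, other):
    """Move docs also found in 'other' to the front, then append the rest of 'other'."""
    other_set = set(other)
    upgraded = [d for d in results if d in other_set]
    remaining = [d for d in results if d not in other_set]
    result_set = set(results)
    extended = [d for d in other if d not in result_set]
    return upgraded + remaining + extended

user_hits = [10, 11, 12, 13]      # documents matching the user's query
bestbet_hits = [12, 99]           # documents matching the "best bet" query
combined = upgrade_and_extend(user_hits, bestbet_hits)
# combined == [12, 10, 11, 13, 99]
```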
 
 
 

docs/source/spelling.rst

 Overview
 --------
 
-Whoosh includes pure-Python spell-checking library functions that use the Whoosh search engine for back-end storage.
+Whoosh includes pure-Python spell-checking library functions that use the Whoosh
+search engine for back-end storage.
 
 To create a :class:`~whoosh.spelling.SpellChecker` object::
 
     # SpellChecker object needs a Storage object in which to put its index.
     speller = SpellChecker(st)
 
-If you have a Whoosh ``Index`` object and you want to open the spelling dictionary in the same directory as the index, you can re-use the ``Index`` object's ``Storage``::
+If you have a Whoosh ``Index`` object and you want to open the spelling
+dictionary in the same directory as the index, you can re-use the ``Index``
+object's ``Storage``::
 
     from whoosh import index
     
     # Start/open a spelling dictionary in the same directory
     speller = SpellChecker(ix.storage)
 
-Whoosh lets you keep multiple indexes in the same directory by assigning the indexes different names. The default name for a regular index is ``_MAIN``. The default name for the index created by the SpellChecker object is ``SPELL`` (so you can keep your main index and a spelling index in the same directory by default). You can pass an ``indexname`` argument to the SpellChecker constructor to choose a different index name (for example, if you want to keep multiple spelling dictionaries in the same directory)::
+Whoosh lets you keep multiple indexes in the same directory by assigning the
+indexes different names. The default name for a regular index is ``_MAIN``. The
+default name for the index created by the SpellChecker object is ``SPELL`` (so
+you can keep your main index and a spelling index in the same directory by
+default). You can pass an ``indexname`` argument to the SpellChecker constructor
+to choose a different index name (for example, if you want to keep multiple
+spelling dictionaries in the same directory)::
 
     speller = SpellChecker(st, indexname="COMMON_WORDS")
 
 Creating the spelling dictionary
 --------------------------------
 
-You need to populate the spell-checking dictionary with (properly spelled) words to check against. There are a few strategies for doing this:
+You need to populate the spell-checking dictionary with (properly spelled) words
+to check against. There are a few strategies for doing this:
 
 *   Add all the words that appear in a certain field in a Whoosh index.
  
-    For example, if you've created an index for a collection of documents with the contents indexed in a field named ``content``, you can automatically add all the words from that field::
+    For example, if you've created an index for a collection of documents with
+    the contents indexed in a field named ``content``, you can automatically add
+    all the words from that field::
     
         from whoosh import index
     
         # main index's 'content' field.
         speller.add_field(ix, "content")
         
-    The advantage of using the contents of an index field is that when you are spell checking queries on that index, the suggestions are tailored to the contents of the index. The disadvantage is that if the indexed documents contain spelling errors, then the spelling suggestions will also be erroneous.
+    The advantage of using the contents of an index field is that when you are
+    spell checking queries on that index, the suggestions are tailored to the
+    contents of the index. The disadvantage is that if the indexed documents
+    contain spelling errors, then the spelling suggestions will also be
+    erroneous.
  
 *   Use a preset list of words. The ``add_words`` method lets you add words from any iterable.
  
-    There are plenty of word lists available on the internet you can use to populate the spelling dictionary. ::
+    There are plenty of word lists available on the internet you can use to
+    populate the spelling dictionary. ::
     
         speller.add_words(["custom", "word", "list"])
     
         # directly
         speller.add_words(wordfile)
         
-*   Use a combination of word lists and index field contents. For example, you could add words from a field, but only if they appear in the word list::
+*   Use a combination of word lists and index field contents. For example, you
+    could add words from a field, but only if they appear in the word list::
  
         # Open the list of words (one on each line) and load it into a set
         wordfile = open("words.txt")
         speller.add_words(word for word in reader.lexicon("content")
                           if word in wordset)
 
-Note that adding words to the dictionary should be done all at once. Each call to ``add_field()``, ``add_words()``, or ``add_scored_words()`` (see below) creates a writer, adds to the underlying index, and the closes the writer, just like adding documents to a regular Whoosh index. **DO NOT** do anything like this::
+Note that adding words to the dictionary should be done all at once. Each call
+to ``add_field()``, ``add_words()``, or ``add_scored_words()`` (see below)
+creates a writer, adds to the underlying index, and then closes the writer,
+just like adding documents to a regular Whoosh index. **DO NOT** do anything
+like this::
 
     # This would be very slow
     for word in my_list_of_words:
         speller.add_words([word])
         
-**Be careful** not to add the same word to the spelling dictionary more than once. The ``SpellChecker`` code *does not* currently guard against this automatically.
+**Be careful** not to add the same word to the spelling dictionary more than
+once. The ``SpellChecker`` code *does not* currently guard against this
+automatically.
 
 Getting suggestions
 --------------------
 
-Once you have words in the spelling dictionary, you can use the ``suggest()`` method to check words::
+Once you have words in the spelling dictionary, you can use the ``suggest()``
+method to check words::
 
     >>> st = store.FileStorage("spelldict")
     >>> speller = SpellChecker(st)
     >>> speller.suggest("woosh")
     ["whoosh"]
     
-The ``number`` keyword argument sets the maximum number of suggestions to return (default is 3). ::
+The ``number`` keyword argument sets the maximum number of suggestions to return
+(default is 3). ::
 
     >>> # Get the top 5 suggested replacements for this word
     >>> speller.suggest("rundering", number=5)
 Word scores
 -----------
 
-Each word in the dictionary can have a "score" associated with it. When two or more suggestions have the same "edit distance" (number of differences) from the checked word, the score is used to order them in the suggestion list.
+Each word in the dictionary can have a "score" associated with it. When two or
+more suggestions have the same "edit distance" (number of differences) from the
+checked word, the score is used to order them in the suggestion list.
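
The "number of differences" measure here is the classic Levenshtein edit distance. A plain-Python sketch (illustrative only, not the ``SpellChecker`` internals) shows how candidates can be ordered by distance first and score second:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Candidates as (word, score) pairs; sort by distance, breaking ties by score.
candidates = [("whoosh", 5.0), ("woods", 1.0), ("whose", 3.0)]
ranked = sorted(candidates,
                key=lambda ws: (edit_distance("woosh", ws[0]), -ws[1]))
# ranked words: ["whoosh", "whose", "woods"]
```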
 
-By default the list of suggestions is only ordered by the number of differences between the suggestion and the original word. To make the ``suggest()`` method use word scores, use the ``usescores=True`` keyword argument. ::
+By default the list of suggestions is only ordered by the number of differences
+between the suggestion and the original word. To make the ``suggest()`` method
+use word scores, use the ``usescores=True`` keyword argument. ::
 
     speller.suggest("woosh", usescores=True)
 
-The main use for this is to use the word's frequency in the index as its score, so common words are suggested before obscure words. **Note** The ``add_field()`` method does this by default.
+The main use for this is to use the word's frequency in the index as its score,
+so common words are suggested before obscure words. **Note:** the
+``add_field()`` method does this by default.
 
-If you want to add a list of words with scores manually, you can use the ``add_scored_words()`` method::
+If you want to add a list of words with scores manually, you can use the
+``add_scored_words()`` method::
 
     # Takes an iterable of ("word", score) tuples
     speller.add_scored_words([("whoosh", 2.0), ("search", 1.0), ("find", 0.5)])
 
-For example, if you wanted to reverse the default behavior of ``add_field()`` so that *obscure* words would be suggested before common words, you could do this::
+For example, if you wanted to reverse the default behavior of ``add_field()`` so
+that *obscure* words would be suggested before common words, you could do this::
 
     # Open the main index
     ix = index.open_dir("index")
 Spell checking Whoosh queries
 -----------------------------
 
-If you want to spell check a user query, first parse the user's query into a ``whoosh.query.Query`` object tree, using the default parser or your own custom parser. For example::
+If you want to spell check a user query, first parse the user's query into a
+``whoosh.query.Query`` object tree, using the default parser or your own custom
+parser. For example::
 
     from whoosh.qparser import QueryParser
     parser = QueryParser("content", schema=my_schema)
     user_query = parser.parse(user_query_string)
     
-Then you can use the ``all_terms()`` or ``existing_terms()`` methods of the ``Query`` object to extract the set of terms used in the query. The two methods work in a slightly unusual way: instead of returning a list, you pass them a set, and they populate the set with the query terms::
+Then you can use the ``all_terms()`` or ``existing_terms()`` methods of the
+``Query`` object to extract the set of terms used in the query. The two methods
+work in a slightly unusual way: instead of returning a list, you pass them a
+set, and they populate the set with the query terms::
 
     termset = set()
     user_query.all_terms(termset)
     
-The ``all_terms()`` method simply adds all the terms found in the query. The ``existing_terms()`` method takes an IndexReader object and only adds terms from the query *that exist* in the reader's underlying index. ::
+The ``all_terms()`` method simply adds all the terms found in the query. The
+``existing_terms()`` method takes an IndexReader object and only adds terms from
+the query *that exist* in the reader's underlying index. ::
 
     reader = my_index.reader()
     termset = set()
     user_query.existing_terms(reader, termset)
     
-Of course, it's more useful to spell check the terms that are *missing* from the index, not the ones that exist. The ``reverse=True`` keyword argument to ``existing_terms()`` lets us find the missing terms
+Of course, it's more useful to spell check the terms that are *missing* from the
+index, not the ones that exist. The ``reverse=True`` keyword argument to
+``existing_terms()`` lets us find the missing terms::
 
     missing = set()
     user_query.existing_terms(reader, missing, reverse=True)
     
-So now you have a set of ``("fieldname", "termtext")`` tuples. Now you can check them against the spelling dictionary::
+So now you have a set of ``("fieldname", "termtext")`` tuples. Now you can check
+them against the spelling dictionary::
 
     # Load the main index
     ix = index.open_dir("index")
 Updating the spelling dictionary
 --------------------------------
 
-The spell checker is mainly intended to be "write-once, read-many". You can continually add words to the dictionary, but it is not possible to remove words or dynamically update the dictionary.
+The spell checker is mainly intended to be "write-once, read-many". You can
+continually add words to the dictionary, but it is not possible to remove words
+or dynamically update the dictionary.
 
-Currently the best strategy available for keeping a spelling dictionary up-to-date with changing content is simply to **delete and re-create** the spelling dictionary periodically.
+Currently the best strategy available for keeping a spelling dictionary
+up-to-date with changing content is simply to **delete and re-create** the
+spelling dictionary periodically.
 
-Note, to clear the spelling dictionary so you can start re-adding words, do this::
+Note, to clear the spelling dictionary so you can start re-adding words, do
+this::
 
     speller = SpellChecker(storage_object)
     speller.index(create=True)

src/whoosh/classify.py

 # Expansion models
 
 class ExpansionModel(object):
-    def __init__(self, ixreader, fieldname):
-        self.N = ixreader.doc_count_all()
-        self.collection_total = ixreader.field_length(fieldname)
+    def __init__(self, doc_count, field_length):
+        self.N = doc_count
+        self.collection_total = field_length
         self.mean_length = self.collection_total / self.N
     
     def normalizer(self, maxweight, top_total):
             scoring.Bo1Model by default.
         """
         
+        self.ixreader = ixreader
         self.fieldname = fieldname
         
         if type(model) is type:
-            model = model(ixreader, fieldname)
+            model = model(self.ixreader.doc_count_all(),
+                          self.ixreader.field_length(fieldname))
         self.model = model
         
         # Cache the collection frequency of every term in this field. This
-        # turns out to be much faster than reading each individual weight from
-        # the term index as we add words.
+        # turns out to be much faster than reading each individual weight
+        # from the term index as we add words.
         self.collection_freq = dict((word, freq) for word, _, freq
-                                      in ixreader.iter_field(fieldname))
+                                      in self.ixreader.iter_field(self.fieldname))
         
         # Maps words to their weight in the top N documents.
         self.topN_weight = defaultdict(float)
         
         # Total weight of all terms in the top N documents.
         self.top_total = 0
-        
+    
     def add(self, vector):
         """Adds forward-index information about one of the "top N" documents.
         
             
         self.top_total += total_weight
     
+    def add_document(self, docnum):
+        if self.ixreader.has_vector(docnum, self.fieldname):
+            self.add(self.ixreader.vector_as("weight", docnum, self.fieldname))
+        elif self.ixreader.field(self.fieldname).stored:
+            self.add_text(self.ixreader.stored_fields(docnum).get(self.fieldname))
+        else:
+            raise Exception("Field %r in document %s is not vectored "
+                            "or stored" % (self.fieldname, docnum))
+    
+    def add_text(self, string):
+        field = self.ixreader.field(self.fieldname)
+        self.add((text, weight) for text, freq, weight, value
+                 in field.index(string))
+    
     def expanded_terms(self, number, normalize=True):
         """Returns the N most important terms in the vectors added so far.
         

src/whoosh/fields.py

             self.vector.clean()
             
     def index(self, value, **kwargs):
-        """Returns an iterator of (termtext, frequency, encoded_value) tuples.
+        """Returns an iterator of (termtext, frequency, weight, encoded_value)
+        tuples.
         """
         
         if not self.format:
         return self.format.word_values(value, mode="index", **kwargs)
     
     def process_text(self, qstring, mode='', **kwargs):
+        """Returns an iterator of token strings corresponding to the given
+        string.
+        """
+        
         if not self.format:
             raise Exception("%s field has no format" % self)
         return (t.text for t

src/whoosh/filedb/fileindex.py

 from whoosh import __version__
 from whoosh.fields import Schema
 from whoosh.index import Index
-from whoosh.index import EmptyIndexError, OutOfDateError, IndexVersionError
+from whoosh.index import EmptyIndexError, IndexVersionError
 from whoosh.index import _DEF_INDEX_NAME
 from whoosh.store import Storage, LockError
 from whoosh.system import _INT_SIZE, _FLOAT_SIZE
             w.cancel()
 
     def doc_count_all(self):
-        info = self._read_toc()
-        return info.segments.doc_count_all()
+        return self._segments().doc_count_all()
 
     def doc_count(self):
-        info = self._read_toc()
-        return info.segments.doc_count()
+        return self._segments().doc_count()
+
+    def field_length(self, fieldname):
+        return self._segments().field_length(fieldname)
 
     # searcher
     
         """
         return sum(s.doc_count() for s in self.segments)
 
+    def field_length(self, fieldname):
+        return sum(s.field_length(fieldname) for s in self.segments)
 
     def has_deletions(self):
         """

src/whoosh/index.py

     """Represents an indexed collection of documents.
     """
     
-    def __init__(self, storage, schema=None, indexname=_DEF_INDEX_NAME):
-        """
-        :param storage: The :class:`whoosh.store.Storage` object in which this
-            index resides. See the store module for more details.
-        :param schema: A :class:`whoosh.fields.Schema` object defining the
-            fields of this index.
-        :param indexname: An optional name to use for the index. Use this if
-            you need to keep multiple indexes in the same storage object.
-        """
-        
-        self.storage = storage
-        self.indexname = indexname
-        
-        if schema is not None and not isinstance(schema, fields.Schema):
-            raise ValueError("%r is not a Schema object" % schema)
-        
-        self.schema = schema
-    
     def close(self):
         """Closes any open resources held by the Index object itself. This may
         not close all resources being used everywhere, for example by a
         """
         raise NotImplementedError
     
+    def field_length(self, fieldname):
+        """Returns the total length of the given field across all documents.
+        """
+        
+        raise NotImplementedError
+    
     def searcher(self, **kwargs):
         """Returns a Searcher object for this index. Keyword arguments are
         passed to the Searcher object's constructor.
         w = self.writer()
         w.delete_by_query(q, searcher=searcher)
         w.commit()
+        
     
 
 # Debugging functions

src/whoosh/matching.py

 
 You do not need to deal with the classes in this module unless you need to
 write your own Matcher implementation to provide some new functionality. These
-classes are not instantiated by the user.
+classes are not instantiated by the user. They are usually created by a
+:class:`~whoosh.query.Query` object's ``matcher()`` method, which returns the
+appropriate matcher to implement the query (for example, the ``Or`` query's
+``matcher()`` method returns a ``UnionMatcher`` object).
 
-Certain backends 
+Certain backends support "quality" optimizations. These backends have the
+ability to skip ahead when they know the current block of postings can't
+contribute to the top N documents. If the matcher tree and backend support
+these optimizations, the matcher's ``supports_quality()`` method will return
+``True``.
 """
 
 

src/whoosh/reading.py

         postreaders = []
         docoffsets = []
         for i, r in enumerate(self.readers):
-            format = r.schema[fieldname].format
+            format = r.field(fieldname).format
             if (fieldname, text) in r:
                 pr = r.postings(fieldname, text, scorefns=scorefns,
                                 exclude_docs=exclude_docs)

src/whoosh/searching.py

 
         # Copy attributes/methods from wrapped reader
         for name in ("stored_fields", "vector", "vector_as", "scorable",
-                     "lexicon", "frequency", "doc_field_length",
+                     "lexicon", "frequency", "field_length", "doc_field_length",
                      "max_field_length"):
             setattr(self, name, getattr(self.ixreader, name))