Commits

Matt Chaput committed 8dbd954

Finished first iteration of new spelling system (finally!!!).

  • Parent commits 0a8f4ba
  • Branches dawg


Files changed (11)

File docs/source/spelling.rst

 =====================================================
 
 .. note::
-
     In Whoosh 1.9 the old spelling system based on a separate N-gram index was
     replaced with this significantly more convenient and powerful
     implementation.
 Overview
 ========
 
-Whoosh can quickly suggest replacements for mis-typed words by returning a
-list of words from the index (or a dictionary) that are close to the mis-typed
-word::
+Whoosh can quickly suggest replacements for mis-typed words by returning a list
+of words from the index (or a dictionary) that are close to the mis-typed word::
 
     with ix.searcher() as s:
+        corrector = s.corrector("text")
         for mistyped_word in mistyped_words:
-            print s.suggest("text", mistyped_word, limit=3)
+            print corrector.suggest(mistyped_word, limit=3)
+
+See the :meth:`whoosh.spelling.Corrector.suggest` method documentation for
+information on the arguments.
 
 Currently the suggestion engine is more like a "typo corrector" than a real
-"spell checker" since it doesn't do the kind of sophisticated phonetic
-matching or semantic/contextual analysis a good spell checker would. However,
-it is still very useful.
+"spell checker" since it doesn't do the kind of sophisticated phonetic matching
+or semantic/contextual analysis a good spell checker might. However, it is
+still very useful.
 
-There are a two main strategies for where to get the correct words:
+There are two main strategies for correcting words:
 
 *   Use the terms from an index field.
 
     schema = Schema(text=TEXT(spelling=True))
 
 (If you have an existing index you want to enable spelling for, you can alter
-the schema in-place and use the :func:`whoosh.filedb.filewriting.add_spelling`
+the schema in-place using the :func:`whoosh.filedb.filewriting.add_spelling`
 function to create the missing word graph files.)
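+
+For example, a minimal sketch (assuming an existing index directory named
+``myindex`` with a ``content`` field)::
+
+    from whoosh import index
+    from whoosh.filedb.filewriting import add_spelling
+
+    # Open the existing index and add the missing word graph files
+    # for the "content" field
+    ix = index.open_dir("myindex")
+    add_spelling(ix, ["content"])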
 
-The advantage of using the contents of an index field is that when you are
-spell checking queries on that index, the suggestions are tailored to the
-contents of the index. The disadvantage is that if the indexed documents
-contain spelling errors, then the spelling suggestions will also be
-erroneous.
+.. tip::
+    You can get suggestions for fields without the ``spelling`` attribute, but
+    calculating the suggestions will be slower.
 
-Note that if you're stemming the content field, the spelling suggestions will
-be stemmed and so may appear strange (for example, "rend" instead of
-"render"). One solution is to create a second spelling field with the same
-content as the main field with an unstemmed analyzer::
+You can then use the :meth:`whoosh.searching.Searcher.corrector` method to get a
+corrector for a field::
+
+    corrector = searcher.corrector("content")
+
+The advantage of using the contents of an index field is that when you are spell
+checking queries on that index, the suggestions are tailored to the contents of
+the index. The disadvantage is that if the indexed documents contain spelling
+errors, then the spelling suggestions will also be erroneous.
+
+Note that if you're stemming the content field, the spelling suggestions will be
+stemmed and so may appear strange (for example, "rend" instead of "render").
+One solution is to create a second spelling field with the same content as the
+main field with an unstemmed analyzer::
 
     # Stemming analyzer for the main field
     s_ana = RegexTokenizer() | LowercaseFilter() | StemFilter()
                     unstemmed=TEXT(analyzer=u_ana, spelling=True))
 
 Then you can offer spelling suggestions based on the unstemmed field. You may
-even find it useful to let users search the unstemmed field when they know
-they want a specific form of a word.
+even find it useful to let users search the unstemmed field when they know they
+want a specific form of a word.
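+
+For example, a short sketch (assuming the ``unstemmed`` field from the schema
+above)::
+
+    with ix.searcher() as s:
+        # Pull suggestions from the unstemmed field instead of the
+        # stemmed main field
+        corrector = s.corrector("unstemmed")
+        print(corrector.suggest("rendering", limit=3))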
 
 
-Pulling suggestions from a word file
+Pulling suggestions from a word list
 ====================================
 
-There are plenty of word lists available on the internet you can use to
-populate the spelling dictionary.
+There are plenty of word lists available on the internet you can use to populate
+the spelling dictionary.
 
+(In the following examples, ``word_list`` can be a list of unicode strings, or a
+file object with one word on each line.)
 
+To create a :class:`whoosh.spelling.Corrector` object from a word list::
 
+    from whoosh.spelling import GraphCorrector
+    
+    corrector = GraphCorrector.from_word_list(word_list)
+    
+Creating a corrector directly from a word list can be slow for large word lists,
+so you can save a corrector's graph to a more efficient on-disk form like this::
 
+    graphfile = myindex.storage.create_file("words.graph")
+    # to_file() automatically closes the file when it's finished
+    corrector.to_file(graphfile)
 
+To open the graph file again very quickly::
 
+    graphfile = myindex.storage.open_file("words.graph")
+    corrector = GraphCorrector.from_graph_file(graphfile)
 
 
-Creating the spelling dictionary
-================================
+Merging two or more correctors
+==============================
 
+You can combine suggestions from two sources (for example, the contents of an
+index field and a word list) using a :class:`whoosh.spelling.MultiCorrector`::
 
-        
+    c1 = searcher.corrector("content")
+    c2 = GraphCorrector.from_graph_file(wordfile)
+    corrector = MultiCorrector([c1, c2])
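+
+The merged object is itself a :class:`whoosh.spelling.Corrector`, so you can
+call ``suggest()`` on it as usual, for example::
+
+    print(corrector.suggest("woosh", limit=3))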
+
+
+Correcting user queries
+=======================
+
+You can spell-check a user query using the
+:meth:`whoosh.searching.Searcher.correct_query` method::
+
+    from whoosh import qparser
+
+    # Parse the user query string
+    qp = qparser.QueryParser("content", myindex.schema)
+    q = qp.parse(qstring)
     
- 
-*   Use a preset list of words. The ``add_words`` method lets you add words from any iterable.
- 
-     ::
+    # Try correcting the query
+    with myindex.searcher() as s:
+        corrected = s.correct_query(q, qstring)
+        if corrected.query != q:
+            print("Did you mean:", corrected.string)
+
+The ``correct_query`` method returns an object with the following attributes:
+
+``query``
+    A corrected :class:`whoosh.query.Query` tree. You can test whether this
+    is equal (``==``) to the original parsed query to check if the corrector
+    actually changed anything.
+
+``string``
+    A corrected version of the user's query string.
+
+``tokens``
+    A list of corrected token objects representing the corrected terms. You
+    can use this to reformat the user query (see below).
+
+
+You can use a :class:`whoosh.highlight.Formatter` object to format the corrected
+query string. For example, use the :class:`~whoosh.highlight.HtmlFormatter` to
+format the corrected string as HTML::
+
+    from whoosh import highlight
     
-        speller.add_words(["custom", "word", "list"])
-    
-        # Assume this is file contains a list of words, one on each line
-        wordfile = open("words.txt")
-        
-        # add_words() takes an iterable, so we can pass it the file object
-        # directly
-        speller.add_words(wordfile)
-        
-*   Use a combination of word lists and index field contents. For example, you
-    could add words from a field, but only if they appear in the word list::
- 
-        # Open the list of words (one on each line) and load it into a set
-        wordfile = open("words.txt")
-        wordset = set(wordfile)
-        
-        # Open the main index
-        ix = index.open_dir("index")
-        reader = ix.reader()
-        
-        # Add words from the main index's 'content' field only if they
-        # appear in the word list
-        speller.add_words(word for word in reader.lexicon("content")
-                          if word in wordset)
+    hf = highlight.HtmlFormatter()
+    corrected = s.correct_query(q, qstring)
+    html = corrected.format_string(hf)
+     
+See the documentation for :meth:`whoosh.searching.Searcher.correct_query` for
+information on the defaults and arguments.
 
-Note that adding words to the dictionary should be done all at once. Each call
-to ``add_field()``, ``add_words()``, or ``add_scored_words()`` (see below)
-creates a writer, adds to the underlying index, and the closes the writer, just
-like adding documents to a regular Whoosh index. **DO NOT** do anything like
-this::
 
-    # This would be very slow
-    for word in my_list_of_words:
-        speller.add_words([word])
-        
-**Be careful** not to add the same word to the spelling dictionary more than
-once. The ``SpellChecker`` code *does not* currently guard against this
-automatically.
 
 
-Gettings suggestions
-====================
 
-Once you have words in the spelling dictionary, you can use the ``suggest()``
-method to check words::
 
-    >>> st = store.FileStorage("spelldict")
-    >>> speller = SpellChecker(st)
-    >>> speller.suggest("woosh")
-    ["whoosh"]
-    
-The ``number`` keyword argument sets the maximum number of suggestions to return
-(default is 3). ::
 
-    >>> # Get the top 5 suggested replacements for this word
-    >>> speller.suggest("rundering", number=5)
-    
-    >>> # Get only the top suggested replacement for this word
-    >>> speller.suggest("woosh", number=1)
-
-
-Word scores
-===========
-
-Each word in the dictionary can have a "score" associated with it. When two or
-more suggestions have the same "edit distance" (number of differences) from the
-checked word, the score is used to order them in the suggestion list.
-
-By default the list of suggestions is only ordered by the number of differences
-between the suggestion and the original word. To make the ``suggest()`` method
-use word scores, use the ``usescores=True`` keyword argument. ::
-
-    speller.suggest("woosh", usescores=True)
-
-The main use for this is to use the word's frequency in the index as its score,
-so common words are suggested before obscure words. **Note** The ``add_field()``
-method does this by default.
-
-If you want to add a list of words with scores manually, you can use the
-``add_scored_words()`` method::
-
-    # Takes an iterable of ("word", score) tuples
-    speller.add_scored_words([("whoosh", 2.0), ("search", 1.0), ("find", 0.5)])
-
-For example, if you wanted to reverse the default behavior of ``add_field()`` so
-that *obscure* words would be suggested before common words, you could do this::
-
-    # Open the main index
-    ix = index.open_dir("index")
-    reader = ix.reader()
-    
-    # IndexReader.iter_field() yields (term_text, doc_freq, index_freq) tuples
-    # for each term in the given field.
-    
-    # We pull out the term text and the index frequency of each term, and
-    # then invert the frequency so terms with lower frequencies get higher
-    # scores in the spelling dictionary
-    speller.add_scored_words((termtext, 1 / index_freq)
-                             for termtext, doc_freq, index_freq
-                             in reader.iter_field("content"))
-
-
-Spell checking Whoosh queries
-=============================
-
-If you want to spell check a user query, first parse the user's query into a
-``whoosh.query.Query`` object tree, using the default parser or your own custom
-parser. For example::
-
-    from whoosh.qparser import QueryParser
-    parser = QueryParser("content", schema=my_schema)
-    user_query = parser.parse(user_query_string)
-    
-Then you can use the ``all_terms()`` or ``existing_terms()`` methods of the
-``Query`` object to extract the set of terms used in the query. The two methods
-work in a slightly unusual way: instead of returning a list, you pass them a
-set, and they populate the set with the query terms::
-
-    termset = set()
-    user_query.all_terms(termset)
-    
-The ``all_terms()`` method simply adds all the terms found in the query. The
-``existing_terms()`` method takes an IndexReader object and only adds terms from
-the query *that exist* in the reader's underlying index. ::
-
-    reader = my_index.reader()
-    termset = set()
-    user_query.existing_terms(reader, termset)
-    
-Of course, it's more useful to spell check the terms that are *missing* from the
-index, not the ones that exist. The ``reverse=True`` keyword argument to
-``existing_terms()`` lets us find the missing terms
-
-    missing = set()
-    user_query.existing_terms(reader, missing, reverse=True)
-    
-So now you have a set of ``("fieldname", "termtext")`` tuples. Now you can check
-them against the spelling dictionary::
-
-    # Load the main index
-    ix = index.open_dir("index")
-    reader = ix.reader()
-    
-    # Load a spelling dictionary stored in the same directory
-    # as the main index
-    speller = SpellChecker(ix.storage)
-
-    # Extract missing terms from the user query
-    missing = set()
-    user_query.existing_terms(reader, missing, reverse=True)
-    
-    # Print a list of suggestions for each missing word
-    for fieldname, termtext in missing:
-        # Only spell check terms in the "content" field
-        if fieldname == "content":
-            suggestions = speller.suggest(termtext)
-            if suggestions:
-                print "%s not found. Might I suggest %r?" % (termtext, suggestions)
-
-
-Updating the spelling dictionary
-================================
-
-The spell checker is mainly intended to be "write-once, read-many". You can
-continually add words to the dictionary, but it is not possible to remove words
-or dynamically update the dictionary.
-
-Currently the best strategy available for keeping a spelling dictionary
-up-to-date with changing content is simply to **delete and re-create** the
-spelling dictionary periodically.
-
-Note, to clear the spelling dictionary so you can start re-adding words, do
-this::
-
-    speller = SpellChecker(storage_object)
-    speller.index(create=True)
-

File src/whoosh/highlight.py

     
     between = "..."
     
+    def _text(self, text):
+        return text
+    
     def format_token(self, text, token, replace=False):
         """Returns a formatted version of the given "token" object, which
         should have at least ``startchar`` and ``endchar`` attributes, and
         
         for t in fragment.matches:
             if t.startchar > index:
-                output.append(text[index:t.startchar])
+                output.append(self._text(text[index:t.startchar]))
             output.append(self.format_token(text, t, replace))
             index = t.endchar
         
-        output.append(text[index:fragment.endchar])
+        output.append(self._text(text[index:fragment.endchar]))
         return "".join(output)
     
     def format(self, fragments, replace=False):
         return self.between.join(formatted)
     
     def __call__(self, text, fragments):
+        # For backwards compatibility
         return self.format(fragments)
 
 
+class NullFormatter(Formatter):
+    """Formatter that does not modify the string.
+    """
+    
+    def format_token(self, text, token, replace=False):
+        return get_text(text, token, replace)
+
+
 class UppercaseFormatter(Formatter):
     """Returns a string in which the matched terms are in UPPERCASE.
     """
         self.seen = {}
         self.htmlclass = " ".join((self.classname, self.termclass))
     
+    def _text(self, text):
+        return htmlescape(text)
+    
     def format_token(self, text, token, replace=False):
         seen = self.seen
-        ttext = htmlescape(get_text(text, token, replace))
+        ttext = self._text(get_text(text, token, replace))
         if ttext in seen:
             termnum = seen[ttext]
         else:

File src/whoosh/qparser/plugins.py

     wordexpr = rcompile(r'\S+')
     
     class PhraseNode(syntax.TextNode):
-        def __init__(self, text, slop=1):
+        def __init__(self, text, textstartchar, slop=1):
             syntax.TextNode.__init__(self, text)
+            self.textstartchar = textstartchar
             self.slop = slop
         
         def r(self):
             # We want to process the text of the phrase into "words" (tokens),
             # and also record the startchar and endchar of each word
             
+            sc = self.textstartchar
             if parser.schema and fieldname in parser.schema:
                 field = parser.schema[fieldname]
                 if field.format:
                     char_ranges = []
                     for t in tokens:
                         words.append(t.text)
-                        char_ranges.append((t.startchar, t.endchar))
+                        char_ranges.append((sc + t.startchar, sc + t.endchar))
                 else:
                     # We have a field but it doesn't have a format object,
                     # for some reason (it's self-parsing?), so use process_text
                 char_ranges = []
                 for match in PhrasePlugin.wordexpr.finditer(text):
                     words.append(match.group(0))
-                    char_ranges.append((match.start(), match.end()))
+                    char_ranges.append((sc + match.start(), sc + match.end()))
             
             qclass = parser.phraseclass
-            q = qclass(fieldname, words, slop=self.slop, boost=self.boost)
-            q.char_ranges = char_ranges
+            q = qclass(fieldname, words, slop=self.slop, boost=self.boost,
+                       char_ranges=char_ranges)
             return attach(q, self)
     
     class PhraseTagger(RegexTagger):
-        def create(self, parser, matcher):
-            return PhrasePlugin.PhraseNode(matcher.group("text"))
+        def create(self, parser, match):
+            return PhrasePlugin.PhraseNode(match.group("text"),
+                                           match.start("text"))
     
     def __init__(self, expr='"(?P<text>.*?)"'):
         self.expr = expr

File src/whoosh/qparser/syntax.py

     """
     
     merging = False
+    has_boost = False
     
     def query(self, parser):
         assert len(self.nodes) == 2
         q = self.qclass(self.nodes[0].query(parser),
-                        self.nodes[1].query(parser),
-                                   boost=self.boost)
+                        self.nodes[1].query(parser))
         return attach(q, self)
 
 

File src/whoosh/query.py

 import re
 from array import array
 
+from whoosh.analysis import Token
 from whoosh.compat import u, xrange, text_type
 from whoosh.lang.morph_en import variations
 from whoosh.matching import (AndMaybeMatcher, DisjunctionMaxMatcher,
     return q
 
 
-def query_lists(q):
-    """Returns the leaves of the query tree, with the query hierarchy
-    represented as nested lists.
-    """
-    
-    if q.is_leaf():
-        return q
-    else:
-        return [query_lists(qq) for qq in q.children()]
-
-
-def term_lists(q, phrases=True):
+def token_lists(q, phrases=True):
     """Returns the terms in the query tree, with the query hierarchy
     represented as nested lists.
     """
     
     if q.is_leaf():
         if phrases or not isinstance(q, Phrase):
-            return list(q.terms())
+            return list(q.tokens())
     else:
         ls = []
         for qq in q.children():
-            t = term_lists(qq, phrases=phrases)
+            t = token_lists(qq, phrases=phrases)
             if len(t) == 1:
                 t = t[0]
             if t:
         
         return fn_wrapper(self)
 
-    def replace(self, oldtext, newtext):
+    def replace(self, fieldname, oldtext, newtext):
         """Returns a copy of this query with oldtext replaced by newtext (if
         oldtext was anywhere in this query).
         
         # The default implementation uses the apply method to "pass down" the
         # replace() method call
         if self.is_leaf():
-            return copy(self)
+            return copy.copy(self)
         else:
-            return self.apply(methodcaller("replace", oldtext, newtext))
+            return self.apply(methodcaller("replace", fieldname, oldtext, newtext))
 
     def copy(self):
         """Deprecated, just use ``copy.deepcopy``.
                 for t in q.terms():
                     yield t
 
-    def all_term_queries(self):
-        """Returns an iterator of :class:`Term` query objects corresponding to
-        all terms in this query tree.
-        
-        Note that this doesn't just return the actual :class:`Term` queries in
-        the query tree... it also yields the terms inside phrases, variations,
-        etc.
+    def all_tokens(self, boost=1.0):
+        """Returns an iterator of :class:`analysis.Token` objects corresponding
+        to all terms in this query tree. The Token objects will have the
+        ``fieldname``, ``text``, and ``boost`` attributes set. If the query
+        was built by the query parser, the Token objects will also have
+        ``startchar`` and ``endchar`` attributes indexing into the original
+        user query.
         """
         
-        for q in self.leaves():
-            if q.has_terms():
-                for t in q.term_queries():
-                    yield t
-
+        if self.is_leaf():
+            for token in self.tokens(boost):
+                yield token
+        else:
+            boost *= self.boost if hasattr(self, "boost") else 1.0
+            for child in self.children():
+                for token in child.all_tokens(boost):
+                    yield token
+        
     def terms(self):
         """Yields zero or more ("fieldname", "text") pairs searched for by this
         query object. You can check whether a query object targets specific
         To get all terms in a query tree, use :meth:`Query.iter_all_terms`.
         """
         
-        return []
+        for token in self.tokens():
+            yield (token.fieldname, token.text)
     
-    def term_queries(self):
-        """Yields zero or more :class:`Term` query objects corresponding to the
-        terms searched for by this query object. You can check whether a query
-        object targets specific terms before you call this method using
+    def tokens(self, boost=1.0):
+        """Yields zero or more :class:`analysis.Token` objects corresponding to
+        the terms searched for by this query object. You can check whether a
+        query object targets specific terms before you call this method using
         :meth:`Query.has_terms`.
         
-        The startchar and endchar indices will only be meaningful for queries
-        which were built by the query parser from a query string.
+        The Token objects will have the ``fieldname``, ``text``, and ``boost``
+        attributes set. If the query was built by the query parser, the Token
+        objects will also have ``startchar`` and ``endchar`` attributes
+        indexing into the original user query.
         
-        To get all tokens for a query tree, use
-        :meth:`Query.all_terms_queries`.
+        To get all tokens for a query tree, use :meth:`Query.all_tokens`.
         """
         
-        for fname, text in self.terms():
-            q = Term(fname, text, boost=self.boost)
-            q.startchar = self.startchar
-            q.endchar = self.endchar
-            yield q
-    
+        return []
+        
     def requires(self):
         """Returns a set of queries that are *known* to be required to match
         for the entire query to match. Note that other queries might also turn
         
         return self.fieldname
 
+    def with_boost(self, boost):
+        """Returns a COPY of this query with the boost set to the given value.
+        
+        If a query type does not accept a boost itself, it will try to pass the
+        boost on to its children, if any.
+        """
+        
+        q = self.copy()
+        q.boost = boost
+        return q
+
     def estimate_size(self, ixreader):
         """Returns an estimate of how many documents this query could
         potentially match (for example, the estimated size of a simple term
     def __hash__(self):
         return hash(self.__class__.__name__) ^ hash(self.child)
     
+    def _rewrap(self, child):
+        return self.__class__(child)
+    
     def is_leaf(self):
         return False
     
         yield self.child
     
     def apply(self, fn):
-        return self.__class__(fn(self.child))
+        return self._rewrap(fn(self.child))
     
     def requires(self):
         return self.child.requires()
     def field(self):
         return self.child.field()
     
+    def with_boost(self, boost):
+        return self._rewrap(self.child.with_boost(boost))
+    
     def estimate_size(self, ixreader):
         return self.child.estimate_size(ixreader)
     
     
     def matcher(self, searcher):
         return self.child.matcher(searcher)
+    
 
 
 class CompoundQuery(Query):
 
     def __repr__(self):
         r = "%s(%r" % (self.__class__.__name__, self.subqueries)
-        if self.boost != 1:
+        if hasattr(self, "boost") and self.boost != 1:
             r += ", boost=%s" % self.boost
         r += ")"
         return r
         for s in self.subqueries:
             s = s.normalize()
             if isinstance(s, self.__class__):
-                subqueries += [ss.normalize() for ss in s.subqueries]
+                subqueries += [ss.with_boost(ss.boost * s.boost) for ss in s]
             else:
                 subqueries.append(s)
         
+        # If every subquery is Null, this query is Null
         if all(q is NullQuery for q in subqueries):
             return NullQuery
 
+        # If there's an unfielded Every inside, then this query is Every
         if any((isinstance(q, Every) and q.fieldname is None) for q in subqueries):
             return Every()
 
         if len(subqs) == 1:
             sub = subqs[0]
             if not (self.boost == 1.0 and sub.boost == 1.0):
-                sub = copy.deepcopy(sub)
-                sub.boost *= self.boost
+                sub = sub.with_boost(sub.boost * self.boost)
             return sub
 
         return self.__class__(subqs, boost=self.boost)
     def has_terms(self):
         return True
 
-    def terms(self):
-        yield (self.fieldname, self.text)
+    def tokens(self, boost=1.0):
+        yield Token(fieldname=self.fieldname, text=self.text,
+                    boost=boost * self.boost, startchar=self.startchar,
+                    endchar=self.endchar, chars=True)
 
-    def term_queries(self):
-        yield self
-
-    def replace(self, oldtext, newtext):
+    def replace(self, fieldname, oldtext, newtext):
         q = copy.copy(self)
-        if q.text == oldtext:
+        if q.fieldname == fieldname and q.text == oldtext:
             q.text = newtext
         return q
 
 
 
 class ExpandingTerm(MultiTerm):
-    """Middleware class for queries such as FuzzyTerm and Variations that
-    expand into multiple queries, but come from a single term.
+    """Intermediate base class for queries such as FuzzyTerm and Variations
+    that expand into multiple queries, but come from a single term.
     """
     
     def has_terms(self):
         return True
     
-    def terms(self):
-        yield (self.fieldname, self.text)
+    def tokens(self, boost=1.0):
+        yield Token(fieldname=self.fieldname, text=self.text,
+                    boost=boost * self.boost, startchar=self.startchar,
+                    endchar=self.endchar, chars=True)
     
 
 class FuzzyTerm(ExpandingTerm):
 
     __str__ = __unicode__
 
-    def replace(self, oldtext, newtext):
+    def replace(self, fieldname, oldtext, newtext):
         q = copy.copy(self)
-        if q.text == oldtext:
+        if q.fieldname == fieldname and q.text == oldtext:
             q.text = newtext
         return q
 
                              self.startexcl, self.endexcl,
                              boost=self.boost)
 
-    def replace(self, oldtext, newtext):
-        q = self.copy()
-        if q.start == oldtext:
-            q.start = newtext
-        if q.end == oldtext:
-            q.end = newtext
-        return q
+    #def replace(self, fieldname, oldtext, newtext):
+    #    q = self.copy()
+    #    if q.fieldname == fieldname:
+    #        if q.start == oldtext:
+    #            q.start = newtext
+    #        if q.end == oldtext:
+    #            q.end = newtext
+    #    return q
     
     def _words(self, ixreader):
         fieldname = self.fieldname
 class Phrase(Query):
     """Matches documents containing a given phrase."""
 
-    # If a Phrase object is created by the query parser, it will set this
-    # attribute to a list of (startchar, endchar) pairs corresponding to the
-    # words
-    char_ranges = None
-
-    def __init__(self, fieldname, words, slop=1, boost=1.0):
+    def __init__(self, fieldname, words, slop=1, boost=1.0, char_ranges=None):
         """
         :param fieldname: the field to search.
         :param words: a list of words (unicode strings) in the phrase.
             phrase; the default of 1 means the phrase must match exactly.
         :param boost: a boost factor that to apply to the raw score of
             documents matched by this query.
+        :param char_ranges: if a Phrase object is created by the query parser,
+            it will set this attribute to a list of (startchar, endchar) pairs
+            corresponding to the words in the phrase
         """
 
         self.fieldname = fieldname
         self.words = words
         self.slop = slop
         self.boost = boost
+        self.char_ranges = char_ranges
 
     def __eq__(self, other):
         return (other and self.__class__ is other.__class__ and
     def has_terms(self):
         return True
 
-    def terms(self):
-        return ((self.fieldname, word) for word in self.words)
-
-    def term_queries(self):
+    def tokens(self, boost=1.0):
         char_ranges = self.char_ranges
         startchar = endchar = None
         for i, word in enumerate(self.words):
             if char_ranges:
                 startchar, endchar = char_ranges[i]
-            q = Term(self.fieldname, word, boost=self.boost)
-            q.startchar = startchar
-            q.endchar = endchar
-            yield q
+                
+            yield Token(fieldname=self.fieldname, text=word,
+                        boost=boost * self.boost, startchar=startchar,
+                        endchar=endchar, chars=True)
 
     def normalize(self):
         if not self.words:
 
         words = [w for w in self.words if w is not None]
         return self.__class__(self.fieldname, words, slop=self.slop,
-                              boost=self.boost)
+                              boost=self.boost, char_ranges=self.char_ranges)
 
-    def replace(self, oldtext, newtext):
-        q = self.copy()
-        for i in xrange(len(q.words)):
-            if q.words[i] == oldtext:
-                q.words[i] = newtext
+    def replace(self, fieldname, oldtext, newtext):
+        q = copy.copy(self)
+        if q.fieldname == fieldname:
+            for i, word in enumerate(q.words):
+                if word == oldtext:
+                    q.words[i] = newtext
         return q
 
     def _and_query(self):
     def __hash__(self):
         return hash(self.child) ^ hash(self.score)
     
-    def apply(self, fn):
-        return self.__class__(fn(self.child), self.score)
+    def _rewrap(self, child):
+        return self.__class__(child, self.score)
     
     def matcher(self, searcher):
         m = self.child.matcher(searcher)
     ``estimate_size()``, and/or ``estimate_min_size()``.
     """
     
-    def __init__(self, a, b, boost=1.0):
+    boost = 1.0
+    
+    def __init__(self, a, b):
         self.a = a
         self.b = b
         self.subqueries = (a, b)
-        self.boost = boost
 
     def __eq__(self, other):
         return (other and self.__class__ is other.__class__
-                and self.a == other.a and self.b == other.b
-                and self.boost == other.boost)
+                and self.a == other.a and self.b == other.b)
     
     def __hash__(self):
-        return (hash(self.__class__.__name__) ^ hash(self.a) ^ hash(self.b)
-                ^ hash(self.boost))
+        return (hash(self.__class__.__name__) ^ hash(self.a) ^ hash(self.b))
     
     def apply(self, fn):
-        return self.__class__(fn(self.a), fn(self.b), boost=self.boost)
+        return self.__class__(fn(self.a), fn(self.b))
     
     def field(self):
         f = self.a.field()
         if self.b.field() == f:
             return f
     
+    def with_boost(self, boost):
+        return self.__class__(self.a.with_boost(boost), self.b.with_boost(boost))
+    
     def normalize(self):
         a = self.a.normalize()
         b = self.b.normalize()
         elif b is NullQuery:
             return a
     
-        return self.__class__(a, b, boost=self.boost)
+        return self.__class__(a, b)
     
     def matcher(self, searcher):
         return self.matcherclass(self.a.matcher(searcher),
     def estimate_min_size(self, ixreader):
         return self.b.estimate_min_size(ixreader)
 
+    def with_boost(self, boost):
+        return self.__class__(self.a.with_boost(boost), self.b)
+
     def normalize(self):
         a = self.a.normalize()
         b = self.b.normalize()
         if a is NullQuery or b is NullQuery:
             return NullQuery
-        return self.__class__(a, b, boost=self.boost)
+        return self.__class__(a, b)
     
     def docs(self, searcher):
         return And(self.subqueries).docs(searcher)
             return NullQuery
         if b is NullQuery:
             return a
-        return self.__class__(a, b, boost=self.boost)
+        return self.__class__(a, b)
 
     def requires(self):
         return self.a.requires()
     JOINT = " ANDNOT "
     matcherclass = AndNotMatcher
 
+    def with_boost(self, boost):
+        return self.__class__(self.a.with_boost(boost), self.b)
+
     def normalize(self):
         a = self.a.normalize()
         b = self.b.normalize()
         elif b is NullQuery:
             return a
 
-        return self.__class__(a, b, boost=self.boost)
+        return self.__class__(a, b)
 
     def requires(self):
         return self.a.requires()

File src/whoosh/searching.py

         collector = Collector(limit=limit, usequality=optimize,
                           groupedby=groupedby, reverse=reverse)
         return collector.search(self, q, allow=filter, restrict=mask)
+    
+    def correct_query(self, q, qstring, correctors=None, allfields=False,
+                      terms=None, prefix=0, maxdist=2):
+        """Returns a corrected version of the given user query using a default
+        :class:`whoosh.spelling.SimpleQueryCorrector`.
+        
+        The default:
+        
+        * Corrects any words that don't appear in the index.
+        
+        * Takes suggestions from the words in the index. To make certain fields
+          use custom correctors, use the ``correctors`` argument to pass a
+          dictionary mapping field names to :class:`whoosh.spelling.Corrector`
+          objects.
+        
+        * ONLY CORRECTS FIELDS THAT HAVE THE ``spelling`` ATTRIBUTE in the
+          schema (or for which you pass a custom corrector). To automatically
+          check all fields, use ``allfields=True``. Spell checking fields
+          without ``spelling`` is slower.
+
+        Expert users who want more sophisticated correction behavior can create
+        a custom :class:`whoosh.spelling.QueryCorrector` and use that instead
+        of this method.
+        
+        Returns a :class:`whoosh.spelling.Correction` object with a ``query``
+        attribute containing the corrected :class:`whoosh.query.Query` object
+        and a ``string`` attribute containing the corrected query string.
+        
+        >>> from whoosh import qparser, highlight
+        >>> qtext = 'mary "litle lamb"'
+        >>> q = qparser.QueryParser("text", myindex.schema).parse(qtext)
+        >>> mysearcher = myindex.searcher()
+        >>> correction = mysearcher.correct_query(q, qtext)
+        >>> correction.query
+        <query.And ...>
+        >>> correction.string
+        'mary "little lamb"'
+        
+        You can use the ``Correction`` object's ``format_string`` method to
+        format the corrected query string using a
+        :class:`whoosh.highlight.Formatter` object. For example, you can format
+        the corrected string as HTML, emphasizing the changed words.
+        
+        >>> hf = highlight.HtmlFormatter(classname="change")
+        >>> correction.format_string(hf)
+        'mary "<strong class="change term0">little</strong> lamb"'
+        
+        :param q: the :class:`whoosh.query.Query` object to correct.
+        :param qstring: the original user query from which the query object was
+            created. You can pass None instead of a string, in which case the
+            ``string`` attribute of the returned object will also be None.
+        :param correctors: an optional dictionary mapping fieldnames to
+            :class:`whoosh.spelling.Corrector` objects. By default, this method
+            uses the contents of the index to spell check the terms in the
+            query. You can use this argument to "override" some fields with a
+            different corrector, for example a
+            :class:`whoosh.spelling.GraphCorrector`.
+        :param allfields: if True, automatically spell check all fields, not
+            just fields with the ``spelling`` attribute.
+        :param terms: a sequence of ``("fieldname", "text")`` tuples to correct
+            in the query. By default, this method corrects terms that don't
+            appear in the index. You can use this argument to override that
+            behavior and explicitly specify the terms that should be corrected.
+        :param prefix: suggested replacement words must share this number of
+            initial characters with the original word. Increasing this even to
+            just ``1`` can dramatically speed up suggestions, and may be
+            justifiable since spelling mistakes rarely involve the first
+            letter of a word.
+        :param maxdist: the maximum number of "edits" (insertions, deletions,
+            substitutions, or transpositions of letters) allowed between the
+            original word and any suggestion. Values higher than ``2`` may be
+            slow.
+        :rtype: :class:`whoosh.spelling.Correction`
+        """
+        
+        if correctors is None:
+            correctors = {}
+        
+        if allfields:
+            fieldnames = self.schema.names()
+        else:
+            fieldnames = [name for name, field in self.schema.items()
+                          if field.spelling]
+        for fieldname in fieldnames:
+            if fieldname not in correctors:
+                correctors[fieldname] = self.corrector(fieldname)
+        
+        if terms is None:
+            terms = []
+            for token in q.all_tokens():
+                if token.fieldname in correctors:
+                    terms.append((token.fieldname, token.text))
+        
+        from whoosh import spelling
+        
+        sqc = spelling.SimpleQueryCorrector(correctors, terms)
+        return sqc.correct_query(q, qstring)
         
 
 class Collector(object):
 
         self.docset = docs | otherdocs
         self.top_n = arein + notin + other
+        
+    def contains_term(self, fieldname, text):
+        """Returns True if the given term exists in at least one of the
+        documents in this results set.
+        """
+        
+        docset = self.docs()
+        minid = min(docset)
+        maxid = max(docset)
+        
+        field = self.searcher.schema[fieldname]
+        text = field.to_text(text)
+        postings = self.searcher.postings(fieldname, text)
+        postings.skip_to(minid)
+        for id in postings.all_ids():
+            if id in docset:
+                return True
+            if id >= maxid:
+                break
+        return False
 
 
 class Hit(object):

File src/whoosh/spelling.py

 from collections import defaultdict
 from heapq import heappush, heapreplace
 
+from whoosh import analysis, fields, highlight, query, scoring
 from whoosh.compat import xrange, string_type
-import whoosh.support.dawg as dawg
-from whoosh import analysis, fields, query, scoring
+from whoosh.support import dawg
 from whoosh.support.levenshtein import distance
 
 
+
 # Suggestion scorers
 
 def simple_scorer(word, cost):
 
 class Corrector(object):
     """Base class for spelling correction objects. Concrete sub-classes should
-    implement the ``suggestions`` method.
+    implement the ``_suggestions`` method.
     """
     
     def suggest(self, text, limit=5, maxdist=2, prefix=0):
             list of words.
         """
         
-        suggestions = self.suggestions
+        _suggestions = self._suggestions
         
         heap = []
         seen = set()
         for k in xrange(1, maxdist+1):
-            for item in suggestions(text, k, prefix, seen):
+            for item in _suggestions(text, k, prefix, seen):
                 if len(heap) < limit:
                     heappush(heap, item)
                 elif item < heap[0]:
         
         return [sug for _, sug in sorted(heap)]
         
-    def suggestions(self, text, maxdist, prefix, seen):
+    def _suggestions(self, text, maxdist, prefix, seen):
         """Low-level method that yields a series of (score, "suggestion")
         tuples.
         
         self.reader = reader
         self.fieldname = fieldname
     
-    def suggestions(self, text, maxdist, prefix, seen):
+    def _suggestions(self, text, maxdist, prefix, seen):
         fieldname = self.fieldname
         freq = self.reader.frequency
-        for sug in self.reader.terms_within(self.fieldname, text, maxdist,
+        for sug in self.reader.terms_within(fieldname, text, maxdist,
                                             prefix=prefix, seen=seen):
             yield ((maxdist, 0 - freq(fieldname, sug)), sug)
 
         self.word_graph = word_graph
         self.ranking = ranking or simple_scorer
     
-    def suggestions(self, text, maxdist, prefix, seen):
+    def _suggestions(self, text, maxdist, prefix, seen):
         ranking = self.ranking
         for sug in dawg.within(self.word_graph, text, maxdist, prefix=prefix,
                                seen=seen):
     def __init__(self, correctors):
         self.correctors = correctors
         
-    def suggestions(self, text, maxdist, prefix, seen):
+    def _suggestions(self, text, maxdist, prefix, seen):
         for corr in self.correctors:
-            for item in corr.suggestions(text, maxdist, prefix, seen):
+            for item in corr._suggestions(text, maxdist, prefix, seen):
                 yield item
 
 
     g.to_file(dbfile)
 
 
+# Query correction
+
+class Correction(object):
+    """Represents the corrected version of a user query string. Has the
+    following attributes:
+    
+    ``query``
+        The corrected :class:`whoosh.query.Query` object.
+    ``string``
+        The corrected user query string.
+    ``original_query``
+        The original :class:`whoosh.query.Query` object that was corrected.
+    ``original_string``
+        The original user query string.
+    ``tokens``
+        A list of token objects representing the corrected words.
+    
+    You can also use the :meth:`Correction.format_string` method to reformat
+    the corrected query string using a :class:`whoosh.highlight.Formatter`
+    class. For example, to display the corrected query string as HTML with
+    the changed words emphasized::
+    
+        from whoosh import highlight
+        
+        correction = mysearcher.correct_query(q, qstring)
+        
+        hf = highlight.HtmlFormatter(classname="change")
+        html = correction.format_string(hf)
+    """
+    
+    def __init__(self, q, qstring, corr_q, tokens):
+        self.original_query = q
+        self.query = corr_q
+        self.original_string = qstring
+        self.tokens = tokens
+        
+        if self.original_string and self.tokens:
+            self.string = self.format_string(highlight.NullFormatter())
+        else:
+            self.string = None
+    
+    def __repr__(self):
+        return "%s(%r, %r)" % (self.__class__.__name__, self.query, self.string)
+    
+    def format_string(self, formatter):
+        if not (self.original_string and self.tokens):
+            raise Exception("The original query isn't available") 
+        if isinstance(formatter, type):
+            formatter = formatter()
+        
+        fragment = highlight.Fragment(self.original_string, self.tokens)
+        return formatter.format_fragment(fragment, replace=True)
+
+
+# QueryCorrector objects
+
+class QueryCorrector(object):
+    """Base class for objects that correct words in a user query.
+    """
+    
+    def correct_query(self, q, qstring):
+        """Returns a :class:`Correction` object representing the corrected
+        form of the given query.
+        
+        :param q: the original :class:`whoosh.query.Query` tree to be
+            corrected.
+        :param qstring: the original user query. This may be None if the
+            original query string is not available, in which case the
+            ``Correction.string`` attribute will also be None.
+        :rtype: :class:`Correction`
+        """
+        
+        raise NotImplementedError
+
+
+class SimpleQueryCorrector(QueryCorrector):
+    """A simple query corrector based on a mapping of field names to
+    :class:`Corrector` objects, and a list of ``("fieldname", "text")`` tuples
+    to correct. Any terms in the query that appear in the list of term tuples
+    are corrected using the appropriate corrector.
+    """
+    
+    def __init__(self, correctors, terms, prefix=0, maxdist=2):
+        """
+        :param correctors: a dictionary mapping field names to
+            :class:`Corrector` objects.
+        :param terms: a sequence of ``("fieldname", "text")`` tuples
+            representing terms to be corrected.
+        :param prefix: suggested replacement words must share this number of
+            initial characters with the original word. Increasing this even to
+            just ``1`` can dramatically speed up suggestions, and may be
+            justifiable since spelling mistakes rarely involve the first
+            letter of a word.
+        :param maxdist: the maximum number of "edits" (insertions, deletions,
+            substitutions, or transpositions of letters) allowed between the
+            original word and any suggestion. Values higher than ``2`` may be
+            slow.
+        """
+        
+        self.correctors = correctors
+        self.termset = frozenset(terms)
+        self.prefix = prefix
+        self.maxdist = maxdist
+    
+    def correct_query(self, q, qstring):
+        correctors = self.correctors
+        termset = self.termset
+        prefix = self.prefix
+        maxdist = self.maxdist
+        
+        corrected_tokens = []
+        corrected_q = q
+        for token in q.all_tokens():
+            fname = token.fieldname
+            if (fname, token.text) in termset:
+                sugs = correctors[fname].suggest(token.text, prefix=prefix,
+                                                 maxdist=maxdist)
+                if sugs:
+                    sug = sugs[0]
+                    corrected_q = corrected_q.replace(token.fieldname,
+                                                      token.text, sug)
+                    token.text = sug
+                    corrected_tokens.append(token)
+
+        return Correction(q, qstring, corrected_q, corrected_tokens)
+
 #
 #
 #

File tests/test_highlighting.py

     htext = highlight.highlight(_doc, terms, sa, cf, hf)
     assert_equal(htext, 'alfa <strong class="match term0">bravo</strong> charlie...hotel <strong class="match term1">india</strong> juliet')
 
+def test_html_escape():
+    terms = frozenset(["bravo"])
+    sa = analysis.StandardAnalyzer()
+    wf = highlight.WholeFragmenter()
+    hf = highlight.HtmlFormatter()
+    htext = highlight.highlight(u('alfa <bravo "charlie"> delta'), terms, sa, wf, hf)
+    assert_equal(htext, 'alfa &lt;<strong class="match term0">bravo</strong> "charlie"&gt; delta')
+
 def test_maxclasses():
     terms = frozenset(("alfa", "bravo", "charlie", "delta", "echo"))
     sa = analysis.StandardAnalyzer()

File tests/test_queries.py

 
 def test_replace():
     q = And([Or([Term("a", "b"), Term("b", "c")], boost=1.2), Variations("a", "b", boost=2.0)])
-    q = q.replace("b", "BB")
+    q = q.replace("a", "b", "BB")
     assert_equal(q, And([Or([Term("a", "BB"), Term("b", "c")], boost=1.2),
                          Variations("a", "BB", boost=2.0)]))
 

File tests/test_results.py

             assert_equal(docnums[:-1], last)
             last = docnums
 
-
-
+def test_contains():
+    schema = fields.Schema(text=fields.TEXT)
+    ix = RamStorage().create_index(schema)
+    w = ix.writer()
+    w.add_document(text=u("alfa sierra tango"))
+    w.add_document(text=u("bravo charlie delta"))
+    w.add_document(text=u("charlie delta echo"))
+    w.add_document(text=u("delta echo foxtrot"))
+    w.commit()
+    
+    q = query.Or([query.Term("text", "bravo"), query.Term("text", "charlie")])
+    r = ix.searcher().search(q)
+    assert not r.contains_term("text", "alfa")
+    assert r.contains_term("text", "bravo")
+    assert r.contains_term("text", "charlie")
+    assert r.contains_term("text", "delta")
+    assert r.contains_term("text", "echo")
+    assert not r.contains_term("text", "foxtrot")
 
 
 

File tests/test_spelling.py

 from nose.tools import assert_equal, assert_not_equal  #@UnresolvedImport
 
 from whoosh import fields, highlight, query, spelling
+from whoosh.analysis import Token
 from whoosh.compat import u, text_type
 from whoosh.filedb.filestore import RamStorage
 from whoosh.qparser import QueryParser
         
         assert_equal(list(dawg.flatten(dw.root.edge("test"))), ["special", "specials"])
     
-
 def test_multisegment():
     schema = fields.Schema(text=fields.TEXT(spelling=True))
     ix = RamStorage().create_index(schema)
     gf.close()
 
 def test_query_highlight():
-    text = "alfa bravo charlie delta"
     qp = QueryParser("a", None)
-    q = qp.parse(text)
-    tqs = [tq for tq in q.all_term_queries() if tq.text == "bravo"]
-    fragment = highlight.Fragment(text, tqs)
-    hl = highlight.HtmlFormatter().format_fragment(fragment)
-    assert_equal(hl, 'alfa <strong class="match term0">bravo</strong> charlie delta')
+    hf = highlight.HtmlFormatter()
     
+    def do(text, terms):
+        q = qp.parse(text)
+        tks = [tk for tk in q.all_tokens() if tk.text in terms]
+        for tk in tks:
+            if tk.startchar is None or tk.endchar is None:
+                assert False, tk
+        fragment = highlight.Fragment(text, tks)
+        return hf.format_fragment(fragment)
+    
+    assert_equal(do("a b c d", ["b"]),
+                 'a <strong class="match term0">b</strong> c d')
+    assert_equal(do('a (x:b OR y:"c d") e', ("b", "c")),
+                 'a (x:<strong class="match term0">b</strong> OR y:"<strong class="match term1">c</strong> d") e')
 
 def test_query_terms():
     qp = QueryParser("a", None)
     q = qp.parse("alfa b:(bravo OR c:charlie) delta")
     assert_equal(sorted(q.iter_all_terms()), [("a", "alfa"), ("a", "delta"),
                                               ("b", "bravo"), ("c", "charlie")])
-    assert_equal(query.term_lists(q), [("a", "alfa"),
-                                       [("b", "bravo"), ("c", "charlie")],
-                                       ("a", "delta")])
     
     q = qp.parse("alfa brav*")
     assert_equal(sorted(q.iter_all_terms()), [("a", "alfa")])
-    assert_equal(query.term_lists(q), [("a", "alfa")])
     
-    q = qp.parse('alfa "bravo charlie" delta')
-    assert_equal(query.term_lists(q), [("a", "alfa"),
-                                       [("a", "bravo"), ("a", "charlie")],
-                                       ("a", "delta")])
+    q = qp.parse('a b:("b c" d)^2 e')
+    tokens = [(t.fieldname, t.text, t.boost) for t in q.all_tokens()]
+    assert_equal(tokens, [('a', 'a', 1.0), ('b', 'b', 2.0), ('b', 'c', 2.0),
+                          ('b', 'd', 2.0), ('a', 'e', 1.0)])
 
-    q = qp.parse('a b:("b c" d)^2 e')
-    assert_equal(list(q.all_term_queries()), [query.Term('a', 'a'),
-                                              query.Term('b', 'b', boost=2.0),
-                                              query.Term('b', 'c', boost=2.0),
-                                              query. Term('b', 'd', boost=2.0),
-                                              query.Term('a', 'e')])
+def test_correct_query():
+    schema = fields.Schema(a=fields.TEXT(spelling=True), b=fields.TEXT)
+    ix = RamStorage().create_index(schema)
+    w = ix.writer()
+    w.add_document(a=u("alfa bravo charlie delta"))
+    w.add_document(a=u("delta echo foxtrot golf"))
+    w.add_document(a=u("golf hotel india juliet"))
+    w.add_document(a=u("juliet kilo lima mike"))
+    w.commit()
+    
+    s = ix.searcher()
+    qp = QueryParser("a", ix.schema)
+    qtext = u('alpha ("brovo november" OR b:dolta) detail')
+    q = qp.parse(qtext)
+    
+    c = s.correct_query(q, qtext)
+    assert_equal(c.query.__unicode__(), '(a:alfa AND (a:"bravo november" OR b:dolta) AND a:detail)')
+    assert_equal(c.string, 'alfa ("bravo november" OR b:dolta) detail')
 
+    qtext = u('alpha b:("brovo november" a:delta) detail')
+    q = qp.parse(qtext)
+    c = s.correct_query(q, qtext)
+    assert_equal(c.query.__unicode__(), '(a:alfa AND b:"brovo november" AND a:delta AND a:detail)')
+    assert_equal(c.string, 'alfa b:("brovo november" a:delta) detail')
+    
+    hf = highlight.HtmlFormatter(classname="c")
+    assert_equal(c.format_string(hf), '<strong class="c term0">alfa</strong> b:("brovo november" a:<strong class="c term1">delta</strong>) detail')
+    
 
 
-
-
-
-