Commits

Matt Chaput committed a910e00

Refactored searching.Collector into collectors module. Added hit collapsing.
Fixes issue #259.


Files changed (10)

docs/source/api/collectors.rst

+=====================
+``collectors`` module
+=====================
+
+.. automodule:: whoosh.collectors
+
+
+Base classes
+============
+
+.. autoclass:: Collector
+    :members:
+
+.. autoclass:: ScoredCollector
+    :members:
+
+.. autoclass:: WrappingCollector
+    :members:
+
+
+Basic collectors
+================
+
+.. autoclass:: TopCollector
+
+.. autoclass:: UnlimitedCollector
+
+.. autoclass:: SortingCollector
+
+
+Wrappers
+========
+
+.. autoclass:: FilterCollector
+
+.. autoclass:: FacetCollector
+
+.. autoclass:: CollapseCollector
+
+.. autoclass:: TimeLimitCollector
+
+.. autoclass:: TermsCollector
+

docs/source/api/searching.rst

 .. autoclass:: Searcher
     :members:
 
-.. autoclass:: Collector
-    :members:
-
 
 Results classes
 ===============
 .. autoclass:: ResultsPage
     :members:
 
+
+Exceptions
+==========
+
+.. autoexception:: NoTermsException
+
+.. autoexception:: TimeLimit
+

docs/source/searching.rst

 
     results = s.search_page(q, 1)
 
-The default page length is 10 hits. You can use the ``pagelen`` keyword argument
-to set a different page length::
+The default page length is 10 hits. You can use the ``pagelen`` keyword
+argument to set a different page length::
 
     results = s.search_page(q, 5, pagelen=20)
 
 See :doc:`highlight` and :doc:`keywords` for information on these topics.
 
 
-Convenience functions
+Filtering results
+=================
+
+You can use the ``filter`` keyword argument to ``search()`` to specify a set of
+documents to permit in the results. The argument can be a
+:class:`whoosh.query.Query` object, a :class:`whoosh.searching.Results` object,
+or a set-like object containing document numbers. The searcher caches filters,
+so if, for example, you use the same query filter with a searcher multiple
+times, the additional searches will be faster because the searcher caches the
+results of running the filter query.
+
+You can also specify a ``mask`` keyword argument to specify a set of documents
+that are not permitted in the results.
+
+::
+
+    with myindex.searcher() as s:
+        qp = qparser.QueryParser("content", myindex.schema)
+        user_q = qp.parse(query_string)
+
+        # Only show documents in the "rendering" chapter
+        allow_q = query.Term("chapter", "rendering")
+        # Don't show any documents where the "tag" field contains "todo"
+        restrict_q = query.Term("tag", "todo")
+
+        results = s.search(user_q, filter=allow_q, mask=restrict_q)
+
+(If you specify both a ``filter`` and a ``mask``, and a matching document
+appears in both, the ``mask`` "wins" and the document is not permitted.)
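The combined effect of ``filter`` and ``mask`` can be sketched independently of Whoosh with plain sets of document numbers. The ``permitted`` helper and the docnums below are purely illustrative, not part of the Whoosh API:

```python
# Illustrative sketch of filter/mask semantics over plain sets of docnums.
# A document survives only if it is in the allow set (when one is given)
# and not in the mask set; the mask "wins" over the filter.

def permitted(docnums, allow=None, mask=None):
    """Return the docnums that survive an allow set and a mask set."""
    result = []
    for docnum in docnums:
        if allow is not None and docnum not in allow:
            continue  # not in the filter set: dropped
        if mask is not None and docnum in mask:
            continue  # in the mask set: dropped, even if also allowed
        result.append(docnum)
    return result

matches = [1, 2, 3, 4, 5]
print(permitted(matches, allow={2, 3, 4}, mask={4, 5}))  # [2, 3]
```

Document 4 appears in both sets and is dropped, matching the "mask wins" rule described above.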
+
+To find out how many results were filtered out of the results, use
+``results.filtered_count`` (or ``resultspage.results.filtered_count``)::
+
+    with myindex.searcher() as s:
+        qp = qparser.QueryParser("content", myindex.schema)
+        user_q = qp.parse(query_string)
+      
+        # Filter documents older than 7 days
+        old_q = query.DateRange("created", None, datetime.now() - timedelta(days=7))
+        results = s.search(user_q, mask=old_q)
+      
+        print("Filtered out %d older documents" % results.filtered_count)
+
+
+Which terms from my query matched?
+==================================
+
+You can use the ``terms=True`` keyword argument to ``search()`` to have the
+search record which terms in the query matched which documents::
+
+    with myindex.searcher() as s:
+        results = s.search(myquery, terms=True)
+
+You can then get information about which terms matched from the
+:class:`whoosh.searching.Results` and :class:`whoosh.searching.Hit` objects::
+
+    # Was this results object created with terms=True?
+    if results.has_matched_terms():
+        # What terms matched in the results?
+        print(results.matched_terms())
+        
+        # What terms matched in each hit?
+        for hit in results:
+            print(hit.matched_terms())
+
+
+.. _collapsing:
+
+Collapsing results
+==================
+
+Whoosh lets you eliminate all but the top N documents with the same facet key
+from the results. This can be useful in a few situations:
+
+* Eliminating duplicates at search time.
+
+* Restricting the number of matches per source. For example, in a web search
+  application, you might want to show at most three matches from any website.
+
+Whether a document should be collapsed is determined by the value of a
+"collapse facet". If a document has an empty collapse key, it will never be
+collapsed, but otherwise only the top N documents with the same collapse key
+will appear in the results.
+
+See :doc:`/facets` for information on facets.
+
+::
+
+    with myindex.searcher() as s:
+        # Set the facet to collapse on and the maximum number of documents per
+        # facet value (default is 1)
+        results = s.search(myquery, collapse="hostname", collapse_limit=3)
+
+        # Dictionary mapping collapse keys to the number of documents that
+        # were filtered out by collapsing on that key
+        print(results.collapsed_counts)
+
+Collapsing works with both scored and sorted results. You can use any of the
+facet types available in the :mod:`whoosh.sorting` module.
+
+By default, Whoosh uses the results order (score or sort key) to determine
+the documents to collapse. For example, in scored results, the best scoring
+documents would be kept. You can optionally specify a ``collapse_order`` facet
+to control which documents to keep when collapsing.
+
+For example, in a product search you could display results sorted by decreasing
+price, and eliminate all but the highest rated item of each product type::
+
+    from whoosh import sorting
+
+    with myindex.searcher() as s:
+        price_facet = sorting.FieldFacet("price", reverse=True)
+        type_facet = sorting.FieldFacet("type")
+        rating_facet = sorting.FieldFacet("rating", reverse=True)
+
+        results = s.search(myquery,
+                           sortedby=price_facet,  # Sort by reverse price
+                           collapse=type_facet,  # Collapse on product type
+                           collapse_order=rating_facet  # Collapse to highest rated
+                           )
+
+The collapsing happens during the search, so it is usually more efficient than
+finding everything and post-processing the results. However, if the collapsing
+eliminates a large number of documents, collapsed search can take longer
+because the search has to consider more documents and remove many
+already-collected documents.
+
+Since this collector must sometimes go back and remove already-collected
+documents, if you use it in combination with
+:class:`~whoosh.collectors.TermsCollector` and/or
+:class:`~whoosh.collectors.FacetCollector`, those collectors may contain
+information about documents that were filtered out of the final results by
+collapsing.
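The collapsing bookkeeping described above can be sketched in plain Python, outside Whoosh's collector machinery. The hit tuples, the ``key_of`` callback, and the ``collapse`` helper here are hypothetical illustrations, not Whoosh API:

```python
# Illustrative sketch of collapsing: hits arrive in results order (best
# first), and at most `limit` hits per collapse key are kept. Hits with an
# empty collapse key are never collapsed. Returns the kept hits and a dict
# mapping each collapse key to the number of hits removed for that key.
from collections import defaultdict

def collapse(hits, key_of, limit=1):
    counts = defaultdict(int)     # collapse key -> hits kept so far
    collapsed = defaultdict(int)  # collapse key -> hits removed
    kept = []
    for hit in hits:
        key = key_of(hit)
        if not key:
            # Empty collapse key: never collapsed
            kept.append(hit)
        elif counts[key] < limit:
            counts[key] += 1
            kept.append(hit)
        else:
            collapsed[key] += 1
    return kept, dict(collapsed)

# Hypothetical hits as (docnum, hostname) pairs, already in score order
hits = [(0, "a.com"), (1, "b.com"), (2, "a.com"), (3, "a.com"), (4, "")]
kept, removed = collapse(hits, key_of=lambda h: h[1], limit=2)
print(kept)     # [(0, 'a.com'), (1, 'b.com'), (2, 'a.com'), (4, '')]
print(removed)  # {'a.com': 1}
```

Because the real collector works on a live stream of matches rather than a finished list, it sometimes has to remove already-collected documents, which is the source of the interactions with other collectors noted above.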
+
+
+Time limited searches
 =====================
 
+::
+
+    from whoosh.collectors import TimeLimitCollector, TimeLimit
+
+    with myindex.searcher() as s:
+        # Get a collector object
+        c = s.collector(limit=None, sortedby="title_exact")
+        # Wrap it in a TimeLimitCollector and set the time limit to 10 seconds
+        tlc = TimeLimitCollector(c, timelimit=10.0)
+
+        # Try searching
+        try:
+            s.search_with_collector(myquery, tlc)
+        except TimeLimit:
+            print("Search took too long, aborting!")
+
+        # You can still get partial results from the collector
+        results = tlc.results()
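The general pattern behind a time-limited search (a background timer sets a flag, and the search loop checks the flag on every step and aborts by raising an exception) can be sketched without Whoosh. The names here are illustrative and not the actual ``TimeLimitCollector`` internals:

```python
# Illustrative timer-and-flag sketch of a time-limited loop. A Timer sets
# an Event after `timelimit` seconds; the loop checks the Event each step
# and raises TimeLimit (redefined locally for this sketch) if it is set.
import threading
import time

class TimeLimit(Exception):
    pass

def limited_search(steps, timelimit, step_duration=0.0):
    timedout = threading.Event()
    timer = threading.Timer(timelimit, timedout.set)
    timer.start()
    try:
        collected = []
        for step in range(steps):
            if timedout.is_set():
                # Partial results in `collected` are still available to
                # the caller that catches this exception
                raise TimeLimit
            time.sleep(step_duration)
            collected.append(step)
        return collected
    finally:
        # Always cancel the timer so it doesn't fire after we finish
        timer.cancel()

print(limited_search(5, timelimit=10.0))  # completes: [0, 1, 2, 3, 4]
```

The design point is that the matching loop is never interrupted mid-step; it only notices the flag between documents, which is why partial results remain in a consistent state.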
+
+
+Convenience methods
+===================
+
 The :meth:`~whoosh.searching.Searcher.document` and
 :meth:`~whoosh.searching.Searcher.documents` methods on the Searcher object let
 you retrieve the stored fields of documents matching terms you pass in keyword
 >>> print searcher.document(path=u"/a/b/c")
 {"title": "Document C"}
 
-These convenience functions have some limitations:
+These methods have some limitations:
 
 * The results are not scored.
 * Multiple keywords are always AND-ed together.

src/whoosh/collectors.py

+# Copyright 2012 Matt Chaput. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+#    1. Redistributions of source code must retain the above copyright notice,
+#       this list of conditions and the following disclaimer.
+#
+#    2. Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in the
+#       documentation and/or other materials provided with the distribution.
+#
+# THIS SOFTWARE IS PROVIDED BY MATT CHAPUT ``AS IS'' AND ANY EXPRESS OR
+# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
+# EVENT SHALL MATT CHAPUT OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
+# OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# The views and conclusions contained in the software and documentation are
+# those of the authors and should not be interpreted as representing official
+# policies, either expressed or implied, of Matt Chaput.
+
+"""
+This module contains "collector" objects. Collectors provide a way to gather
+"raw" results from a :class:`whoosh.matching.Matcher` object, implement
+sorting, filtering, collation, etc., and produce a
+:class:`whoosh.searching.Results` object.
+
+The basic collectors are:
+
+TopCollector
+    Returns the top N matching results sorted by score, using block-quality
+    optimizations to skip blocks of documents that can't contribute to the top
+    N. The :meth:`whoosh.searching.Searcher.search` method uses this type of
+    collector by default or when you specify a ``limit``.
+
+UnlimitedCollector
+    Returns all matching results sorted by score. The
+    :meth:`whoosh.searching.Searcher.search` method uses this type of collector
+    when you specify ``limit=None`` or you specify a limit equal to or greater
+    than the number of documents in the searcher.
+
+SortingCollector
+    Returns all matching results sorted by a :class:`whoosh.sorting.Facet`
+    object. The :meth:`whoosh.searching.Searcher.search` method uses this type
+    of collector when you use the ``sortedby`` parameter.
+
+Here's an example of a simple collector that instead of remembering the matched
+documents just counts up the number of matches::
+
+    class CountingCollector(Collector):
+        def prepare(self, top_searcher, q, requires_matcher=False):
+            # Always call super method in prepare
+            Collector.prepare(self, top_searcher, q, requires_matcher)
+            
+            self.count = 0
+        
+        def collect(self, sub_docnum):
+            self.count += 1
+    
+    c = CountingCollector()
+    mysearcher.search_with_collector(myquery, c)
+    print(c.count)
+
+There are also several wrapping collectors that extend or modify the
+functionality of other collectors. The :meth:`whoosh.searching.Searcher.search`
+method uses many of these when you specify various parameters.
+
+NOTE: collectors are not designed to be reentrant or thread-safe. It is
+generally a good idea to create a new collector for each search.
+"""
+
+import threading
+from array import array
+from bisect import insort
+from collections import defaultdict
+from heapq import heapify, heappush, heapreplace
+
+from whoosh import sorting
+from whoosh.compat import abstractmethod, iteritems, itervalues, xrange
+from whoosh.searching import Hit, Results, TimeLimit
+from whoosh.util import now
+
+
+# Base class
+
+class Collector(object):
+    """Base class for collectors.
+    """
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        """This method is called before a search.
+        
+        Subclasses can override this to perform set-up work, but
+        they should still call the superclass's method because it sets several
+        necessary attributes on the collector object:
+        
+        self.top_searcher
+            The top-level searcher.
+        self.q
+            The query object
+        self.requires_matcher
+            Whether a wrapping collector requires that this collector's matcher
+            be in a valid state at every call to ``collect()``. If this is
+            ``False``, the collector is free to use faster methods that don't
+            necessarily keep the matcher updated, such as
+            ``matcher.all_ids()``.
+        
+        :param top_searcher: the top-level :class:`whoosh.searching.Searcher`
+            object.
+        :param q: the :class:`whoosh.query.Query` object being searched for.
+        :param requires_matcher: whether a wrapping collector needs this
+            collector to step through the matches individually.
+        """
+
+        self.top_searcher = top_searcher
+        self.q = q
+        self.requires_matcher = requires_matcher
+
+        self.runtime = None
+        self.starttime = now()
+        self.docset = set()
+
+    def set_subsearcher(self, subsearcher, offset):
+        """This method is called each time the collector starts on a new
+        sub-searcher.
+        
+        Subclasses can override this to perform set-up work, but
+        they should still call the superclass's method because it sets several
+        necessary attributes on the collector object:
+        
+        self.subsearcher
+            The current sub-searcher. If the top-level searcher is atomic, this
+            is the same as the top-level searcher.
+        self.offset
+            The document number offset of the current searcher. You must add
+            this number to the document number passed to
+            :meth:`Collector.collect` to get the top-level document number
+            for use in results.
+        self.matcher
+            A :class:`whoosh.matching.Matcher` object representing the matches
+            for the query in the current sub-searcher.
+        """
+
+        self.subsearcher = subsearcher
+        self.offset = offset
+        self.matcher = self.q.matcher(subsearcher)
+
+    def collect_matches(self):
+        """This method calls :meth:`Collector.matches` and then for each
+        matched document calls :meth:`Collector.collect`. Sub-classes that
+        want to intervene between finding matches and adding them to the
+        collection (for example, to filter out certain documents) can override
+        this method.
+        """
+
+        collect = self.collect
+        for sub_docnum in self.matches():
+            collect(sub_docnum)
+
+    @abstractmethod
+    def collect(self, sub_docnum):
+        """This method is called for every matched document. It should do the
+        work of adding a matched document to the results, and it should return
+        an object to use as a "sorting key" for the given document (such as the
+        document's score, a key generated by a facet, or just None). Subclasses
+        must implement this method.
+        
+        If you want the score for the current document, use
+        ``self.matcher.score()``.
+        
+        Overriding methods should add the current document offset
+        (``self.offset``) to the ``sub_docnum`` to get the top-level document
+        number for the matching document to add to results.
+        
+        :param sub_docnum: the document number of the current match within the
+            current sub-searcher. You must add ``self.offset`` to this number
+            to get the document's top-level document number.
+        """
+
+        raise NotImplementedError
+
+    @abstractmethod
+    def sort_key(self, sub_docnum):
+        """Returns a sorting key for the current match. This should return the
+        same value returned by :meth:`Collector.collect`, but without the side
+        effect of adding the current document to the results.
+
+        If the collector has been prepared with ``requires_matcher=True``,
+        this method can use ``self.matcher`` to get information, for example
+        the score. Otherwise, it should only use the provided ``sub_docnum``,
+        since the matcher may be in an inconsistent state.
+        
+        Subclasses must implement this method.
+        """
+
+        raise NotImplementedError
+
+    def remove(self, global_docnum):
+        """Removes a document from the collector. Note that this method uses
+        the global document number, as opposed to :meth:`Collector.collect`,
+        which takes a segment-relative docnum.
+        """
+
+        items = self.items
+        for i in xrange(len(items)):
+            if items[i][1] == global_docnum:
+                items.pop(i)
+                return
+        raise KeyError(global_docnum)
+
+    def _step_through_matches(self):
+        matcher = self.matcher
+        while matcher.is_active():
+            yield matcher.id()
+            matcher.next()
+
+    def matches(self):
+        """Yields a series of relative document numbers for matches
+        in the current subsearcher.
+        """
+
+        # We jump through a lot of hoops to avoid stepping through the matcher
+        # "manually" if we can because all_ids() is MUCH faster
+        if self.requires_matcher:
+            return self._step_through_matches()
+        else:
+            return self.matcher.all_ids()
+
+    def finish(self):
+        """This method is called after a search.
+        
+        Subclasses can override this to perform clean-up work, but
+        they should still call the superclass's method because it sets several
+        necessary attributes on the collector object:
+        
+        self.runtime
+            The time (in seconds) the search took.
+        """
+
+        self.runtime = now() - self.starttime
+
+    def _results(self, items, **kwargs):
+        # Fills in a Results object with the invariant information and the
+        # given "items" (a list of (score, docnum) tuples)
+        return Results(self.top_searcher, self.q, items, runtime=self.runtime,
+                       **kwargs)
+
+    @abstractmethod
+    def results(self):
+        """Returns a :class:`~whoosh.searching.Results` object containing the
+        results of the search. Subclasses must implement this method.
+        """
+
+        raise NotImplementedError
+
+
+# Scored collectors
+
+class ScoredCollector(Collector):
+    """Base class for collectors that sort the results based on document score.
+    """
+
+    def __init__(self, replace=10, **kwargs):
+        """
+        :param replace: Number of matches between attempts to replace the
+            matcher with a more efficient version.
+        """
+
+        Collector.__init__(self, **kwargs)
+        self.replace = replace
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        # This collector requires a valid matcher at each step
+        Collector.prepare(self, top_searcher, q, True)
+
+        if top_searcher.weighting.use_final:
+            self.final_fn = top_searcher.weighting.final
+        else:
+            self.final_fn = None
+
+        # Heap containing top N (score, 0-docnum) pairs
+        self.items = []
+        # Minimum score a document must have to make it into the top N. This is
+        # used by the block-quality optimizations
+        self.minscore = 0
+        # Number of times the matcher was replaced (for debugging)
+        self.replaced_times = 0
+        # Number of blocks skipped by quality optimizations (for debugging)
+        self.skipped_times = 0
+
+    def sort_key(self, sub_docnum):
+        return 0 - self.matcher.score()
+
+    def collect(self, sub_docnum):
+        # Do common work to calculate score and top-level document number
+        global_docnum = self.offset + sub_docnum
+
+        score = self.matcher.score()
+        if self.final_fn:
+            score = self.final_fn(self.top_searcher, global_docnum, score)
+
+        # Call specialized method on subclass
+        return self._collect(global_docnum, score)
+
+    def matches(self):
+        minscore = self.minscore
+        matcher = self.matcher
+        usequality = self._use_block_quality()
+        replace = self.replace
+        replacecounter = 0
+
+        # A flag to indicate whether we should check block quality at the start
+        # of the next loop
+        checkquality = True
+
+        while matcher.is_active():
+            # If the replacement counter has reached 0, try replacing the
+            # matcher with a more efficient version
+            if replace:
+                if replacecounter == 0 or self.minscore != minscore:
+                    self.matcher = matcher = matcher.replace(minscore or 0)
+                    self.replaced_times += 1
+                    if not matcher.is_active():
+                        break
+                    usequality = self._use_block_quality()
+                    replacecounter = self.replace
+                    minscore = self.minscore
+                replacecounter -= 1
+
+            # If we're using block quality optimizations, and the checkquality
+            # flag is true, try to skip ahead to the next block with the
+            # minimum required quality
+            if usequality and checkquality and minscore is not None:
+                self.skipped_times += matcher.skip_to_quality(minscore)
+                # Skipping ahead might have moved the matcher to the end of the
+                # posting list
+                if not matcher.is_active():
+                    break
+
+            yield matcher.id()
+
+            # Move to the next document. This method returns True if the
+            # matcher has entered a new block, so we should check block quality
+            # again.
+            checkquality = matcher.next()
+
+
+class TopCollector(ScoredCollector):
+    """A collector that only returns the top "N" scored results.
+    """
+
+    def __init__(self, limit=10, usequality=True, **kwargs):
+        """
+        :param limit: the maximum number of results to return.
+        :param usequality: whether to use block-quality optimizations. This may
+            be useful for debugging.
+        """
+
+        ScoredCollector.__init__(self, **kwargs)
+        self.limit = limit
+        self.usequality = usequality
+
+    def _use_block_quality(self):
+        return (self.usequality
+                and not self.top_searcher.weighting.use_final
+                and self.matcher.supports_block_quality())
+
+    # ScoredCollector.collect calls this
+    def _collect(self, global_docnum, score):
+        items = self.items
+
+        # Document numbers are negated before putting them in the heap so that
+        # higher document numbers have lower "priority" in the queue. Lower
+        # document numbers should always come before higher document numbers
+        # with the same score to keep the order stable.
+        if len(items) < self.limit:
+            # The heap isn't full, so add this document
+            heappush(items, (score, 0 - global_docnum))
+            # Negate score to act as sort key so higher scores appear first
+            return 0 - score
+        elif score > items[0][0]:
+            # The heap is full, but if this document has a high enough
+            # score to make the top N, add it to the heap
+            heapreplace(items, (score, 0 - global_docnum))
+            self.minscore = items[0][0]
+            # Negate score to act as sort key so higher scores appear first
+            return 0 - score
+        else:
+            return 0
+
+    def remove(self, global_docnum):
+        negated = 0 - global_docnum
+        items = self.items
+
+        # Search through the results for the document and remove it
+        for i in xrange(len(items)):
+            if items[i][1] == negated:
+                items.pop(i)
+                # Restore the heap invariant
+                heapify(items)
+                self.minscore = items[0][0]
+                return
+
+        # The document wasn't on the list... somebody's confused!
+        raise KeyError(global_docnum)
+
+    def results(self):
+        # The items are stored as (positive score, negative docnum) so the heap
+        # keeps the highest scores and lowest docnums, in order from lowest to
+        # highest. Since for the results we want the highest scores first,
+        # sort the heap in reverse order
+        items = self.items
+        items.sort(reverse=True)
+        # De-negate the docnums for presentation to the user
+        items = [(score, 0 - docnum) for score, docnum in items]
+        return self._results(items)
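The ``(score, 0 - docnum)`` heap trick used by ``TopCollector`` can be seen in isolation with a small standalone sketch (illustrative, not part of this module):

```python
# Illustrative sketch of the (score, 0 - docnum) min-heap trick: heapq
# keeps the *smallest* tuple at items[0], so negating the docnum means
# that among entries with equal scores, the highest docnum is evicted
# first, keeping the order stable in favor of lower docnums.
from heapq import heappush, heapreplace

def top_n(scored_docs, limit):
    """Keep the top `limit` (score, docnum) pairs from a stream."""
    items = []
    for score, docnum in scored_docs:
        if len(items) < limit:
            heappush(items, (score, 0 - docnum))
        elif score > items[0][0]:
            # The heap root is the current weakest entry; replace it
            heapreplace(items, (score, 0 - docnum))
    # Reverse sort: highest score first; on ties, the larger negated
    # docnum sorts first, i.e. the lower docnum
    items.sort(reverse=True)
    return [(score, 0 - docnum) for score, docnum in items]

docs = [(1.0, 7), (2.0, 3), (2.0, 5), (0.5, 1), (3.0, 9)]
print(top_n(docs, 3))  # [(3.0, 9), (2.0, 3), (2.0, 5)]
```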
+
+
+class UnlimitedCollector(ScoredCollector):
+    """A collector that returns all scored results.
+    """
+
+    def __init__(self, reverse=False):
+        ScoredCollector.__init__(self)
+        self.reverse = reverse
+
+    def _use_block_quality(self):
+        return False
+
+    # ScoredCollector.collect calls this
+    def _collect(self, global_docnum, score):
+        self.items.append((score, global_docnum))
+        self.docset.add(global_docnum)
+        # Negate score to act as sort key so higher scores appear first
+        return 0 - score
+
+    def results(self):
+        # Sort by negated scores so that higher scores go first, then by
+        # document number to keep the order stable when documents have the
+        # same score
+        self.items.sort(key=lambda x: (0 - x[0], x[1]), reverse=self.reverse)
+        return self._results(self.items, docset=self.docset)
+
+
+# Sorting collector
+
+class SortingCollector(Collector):
+    """A collector that returns results sorted by a given
+    :class:`whoosh.sorting.Facet` object. See :doc:`/facets` for more
+    information.
+    """
+
+    def __init__(self, sortedby, limit=10, reverse=False, **kwargs):
+        """
+        :param sortedby: see :doc:`/facets`.
+        :param reverse: If True, reverse the overall results. Note that you
+            can reverse individual facets in a multi-facet sort key as well.
+        """
+
+        Collector.__init__(self, **kwargs)
+        self.sortfacet = sorting.MultiFacet.from_sortedby(sortedby)
+        self.limit = limit
+        self.reverse = reverse
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        self.categorizer = self.sortfacet.categorizer(top_searcher)
+        # If the categorizer requires a valid matcher, pass that requirement
+        # on when preparing so collection steps through matches individually
+        Collector.prepare(self, top_searcher, q,
+                          self.categorizer.requires_matcher)
+        # List of (sortkey, docnum) pairs
+        self.items = []
+
+    def set_subsearcher(self, subsearcher, offset):
+        Collector.set_subsearcher(self, subsearcher, offset)
+        self.categorizer.set_searcher(subsearcher, offset)
+
+    def sort_key(self, sub_docnum):
+        return self.categorizer.key_for(self.matcher, sub_docnum)
+
+    def collect(self, sub_docnum):
+        global_docnum = self.offset + sub_docnum
+        sortkey = self.categorizer.key_for(self.matcher, sub_docnum)
+        self.items.append((sortkey, global_docnum))
+        self.docset.add(global_docnum)
+        return sortkey
+
+    def results(self):
+        items = self.items
+        items.sort(reverse=self.reverse)
+        if self.limit:
+            items = items[:self.limit]
+        return self._results(items, docset=self.docset)
+
+
+# Wrapping collectors
+
+class WrappingCollector(Collector):
+    """Base class for collectors that wrap other collectors.
+    """
+
+    def __init__(self, child):
+        self.child = child
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        self.child.prepare(top_searcher, q, requires_matcher)
+
+    def set_subsearcher(self, subsearcher, offset):
+        self.child.set_subsearcher(subsearcher, offset)
+
+    def collect_matches(self):
+        for sub_docnum in self.matches():
+            self.collect(sub_docnum)
+
+    def sort_key(self, sub_docnum):
+        return self.child.sort_key(sub_docnum)
+
+    def collect(self, sub_docnum):
+        return self.child.collect(sub_docnum)
+
+    def matches(self):
+        return self.child.matches()
+
+    def finish(self):
+        self.child.finish()
+
+    def results(self):
+        return self.child.results()
+
+
+# Allow and disallow collector
+
+class FilterCollector(WrappingCollector):
+    """A collector that lets you allow and/or restrict certain document numbers
+    in the results::
+    
+        uc = collectors.UnlimitedCollector()
+    
+        ins = query.Term("chapter", "rendering")
+        outs = query.Term("status", "restricted")
+        fc = FilterCollector(uc, allow=ins, restrict=outs)
+        
+        mysearcher.search_with_collector(myquery, fc)
+        print(fc.results())
+
+    This collector discards a document if:
+    
+    * The allowed set is not None and a document number is not in the set, or
+    * The restrict set is not None and a document number is in the set.
+    
+    (So, if the same document number is in both sets, that document will be
+    discarded.)
+    
+    If you have a reference to the collector, you can use
+    ``FilterCollector.filtered_count`` to get the number of matching documents
+    filtered out of the results by the collector.
+    """
+
+    def __init__(self, child, allow=None, restrict=None):
+        """
+        :param child: the collector to wrap.
+        :param allow: a query, Results object, or set-like object containing
+            document numbers that are allowed in the results, or None (meaning
+            everything is allowed).
+        :param restrict: a query, Results object, or set-like object containing
+            document numbers to disallow from the results, or None (meaning
+            nothing is disallowed).
+        """
+
+        self.child = child
+        self.allow = allow
+        self.restrict = restrict
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        self.child.prepare(top_searcher, q, requires_matcher)
+
+        allow = self.allow
+        restrict = self.restrict
+        ftc = top_searcher._filter_to_comb
+
+        self._allow = ftc(allow) if allow else None
+        self._restrict = ftc(restrict) if restrict else None
+        self.filtered_count = 0
+
+    def collect_matches(self):
+        child = self.child
+        _allow = self._allow
+        _restrict = self._restrict
+        filtered_count = self.filtered_count
+
+        for sub_docnum in child.matches():
+            global_docnum = child.offset + sub_docnum
+            if ((_allow and global_docnum not in _allow)
+                or (_restrict and global_docnum in _restrict)):
+                filtered_count += 1
+                continue
+
+            child.collect(sub_docnum)
+
+        self.filtered_count = filtered_count
+
+    def results(self):
+        r = self.child.results()
+        r.filtered_count = self.filtered_count
+        r.allowed = self.allow
+        r.restricted = self.restrict
+        return r
+
+
+# Facet grouping collector
+
+class FacetCollector(WrappingCollector):
+    """A collector that creates groups of documents based on
+    :class:`whoosh.sorting.Facet` objects. See :doc:`/facets` for more
+    information.
+    
+    This collector is used if you specify a ``groupedby`` parameter in the
+    :meth:`whoosh.searching.Searcher.search` method. You can use the
+    :meth:`whoosh.searching.Results.groups` method to access the facet groups.
+    
+    If you have a reference to the collector, you can also use
+    ``FacetCollector.facetmaps`` to access the groups directly::
+
+        uc = collectors.UnlimitedCollector()
+        fc = FacetCollector(uc, sorting.FieldFacet("category"))
+        mysearcher.search_with_collector(myquery, fc)
+        print(fc.facetmaps)
+    """
+
+    def __init__(self, child, groupedby, maptype=None):
+        """
+        :param groupedby: see :doc:`/facets`.
+        :param maptype: a :class:`whoosh.sorting.FacetMap` type to use for any
+            facets that don't specify their own.
+        """
+
+        self.child = child
+        self.facets = sorting.Facets.from_groupedby(groupedby)
+        self.maptype = maptype
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        facets = self.facets
+
+        # For each facet we're grouping by:
+        # - Create a facetmap (to hold the groups)
+        # - Create a categorizer (to generate document keys)
+        self.facetmaps = {}
+        self.categorizers = {}
+        # True if any of the categorizers require the matcher to work
+        reqm = requires_matcher
+        for facetname, facet in facets.items():
+            self.facetmaps[facetname] = facet.map(self.maptype)
+
+            ctr = facet.categorizer(top_searcher)
+            self.categorizers[facetname] = ctr
+            if ctr.requires_matcher:
+                reqm = True
+
+        # If any of the categorizers require the matcher, tell the child
+        # collector we need it
+        self.child.prepare(top_searcher, q, reqm)
+
+    def set_subsearcher(self, subsearcher, offset):
+        child = self.child
+        child.set_subsearcher(subsearcher, offset)
+
+        # Tell each categorizer about the new subsearcher and offset
+        for categorizer in itervalues(self.categorizers):
+            categorizer.set_searcher(child.subsearcher, child.offset)
+
+    def collect(self, sub_docnum):
+        matcher = self.child.matcher
+        global_docnum = sub_docnum + self.child.offset
+
+        # We want the sort key for the document so we can (by default) sort
+        # the facet groups
+        sortkey = self.child.collect(sub_docnum)
+
+        # For each facet we're grouping by
+        for name, categorizer in iteritems(self.categorizers):
+            add = self.facetmaps[name].add
+
+            # We have to do more work if the facet allows overlapping groups
+            if categorizer.allow_overlap:
+                for key in categorizer.keys_for(matcher, sub_docnum):
+                    add(categorizer.key_to_name(key), global_docnum, sortkey)
+            else:
+                key = categorizer.key_for(matcher, sub_docnum)
+                key = categorizer.key_to_name(key)
+                add(key, global_docnum, sortkey)
+
+        return sortkey
+
+    def results(self):
+        r = self.child.results()
+        r._facetmaps = self.facetmaps
+        return r
+
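The per-facet bookkeeping in ``collect()`` above can be sketched standalone (the ``facet_groups`` helper and its key-function stand-ins are illustrative, not part of Whoosh): for each facet, every collected document is filed under its group key along with its sort key.

```python
from collections import defaultdict

def facet_groups(docs, categorizers):
    """For each facet name, build a map from group key to the list of
    (global_docnum, sortkey) pairs filed under that key, mirroring what
    FacetCollector stores in `facetmaps`. `categorizers` maps a facet
    name to a plain key function (a stand-in for the categorizer's
    key_for/key_to_name pair)."""
    facetmaps = {name: defaultdict(list) for name in categorizers}
    for global_docnum, sortkey in docs:
        for name, keyfn in categorizers.items():
            facetmaps[name][keyfn(global_docnum)].append(
                (global_docnum, sortkey))
    return {name: dict(m) for name, m in facetmaps.items()}
```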
+
+# Collapsing collector
+
+class CollapseCollector(WrappingCollector):
+    """A collector that collapses results based on a facet. That is, it
+    eliminates all but the top N results that share the same facet key.
+    Documents with an empty key for the facet are never eliminated.
+    
+    The "top" results within each group is determined by the result ordering
+    (e.g. highest score in a scored search) or an optional second "ordering"
+    facet.
+    
+    If you have a reference to the collector, you can also use
+    ``CollapseCollector.collapsed_counts`` to access the number of documents
+    eliminated based on each key::
+    
+        tc = TopCollector(limit=20)
+        cc = CollapseCollector(tc, "group", limit=3)
+        mysearcher.search_with_collector(myquery, cc)
+        print(cc.collapsed_counts)
+    
+    See :ref:`collapsing` for more information.
+    """
+
+    def __init__(self, child, keyfacet, limit=1, order=None):
+        """
+        :param child: the collector to wrap.
+        :param keyfacet: a :class:`whoosh.sorting.Facet` to use for collapsing.
+            All but the top N documents that share a key will be eliminated
+            from the results.
+        :param limit: the maximum number of documents to keep for each key.
+        :param order: an optional :class:`whoosh.sorting.Facet` to use
+            to determine the "top" document(s) to keep when collapsing. The
+            default (``order=None``) uses the results order (e.g. the
+            highest score in a scored search).
+        """
+
+        self.child = child
+        self.keyfacet = sorting.MultiFacet.from_sortedby(keyfacet)
+
+        self.limit = limit
+        if order:
+            self.orderfacet = sorting.MultiFacet.from_sortedby(order)
+        else:
+            self.orderfacet = None
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        # Categorizer for getting the collapse key of a document
+        self.keyer = self.keyfacet.categorizer(top_searcher)
+        # Categorizer for getting the collapse order of a document
+        self.orderer = None
+        if self.orderfacet:
+            self.orderer = self.orderfacet.categorizer(top_searcher)
+
+        # Dictionary mapping keys to lists of (sortkey, global_docnum) pairs
+        # representing the best docs for that key
+        self.lists = defaultdict(list)
+        # Dictionary mapping keys to the number of documents that have been
+        # filtered out with that key
+        self.collapsed_counts = defaultdict(int)
+
+        # If the keyer or orderer require a valid matcher, tell the child
+        # collector we need it
+        reqm = (self.keyer.requires_matcher
+                or (self.orderer and self.orderer.requires_matcher))
+        self.child.prepare(top_searcher, q, reqm)
+
+    def set_subsearcher(self, subsearcher, offset):
+        self.child.set_subsearcher(subsearcher, offset)
+        self.keyer.set_searcher(subsearcher, offset)
+        if self.orderer:
+            self.orderer.set_searcher(subsearcher, offset)
+
+    def collect_matches(self):
+        lists = self.lists
+        limit = self.limit
+        keyer = self.keyer
+        orderer = self.orderer
+        collapsed_counts = self.collapsed_counts
+
+        child = self.child
+        offset = child.offset
+        for sub_docnum in child.matches():
+            # Collapsing category key
+            ckey = keyer.key_to_name(keyer.key_for(child.matcher, sub_docnum))
+            if ckey is None:
+                # If the document isn't in a collapsing category, just add it
+                child.collect(sub_docnum)
+            else:
+                global_docnum = offset + sub_docnum
+
+                if orderer:
+                    # If user specified a collapse order, use it
+                    sortkey = orderer.key_for(child.matcher, sub_docnum)
+                else:
+                    # Otherwise, use the results order
+                    sortkey = child.sort_key(sub_docnum)
+
+                # Current list of best docs for this collapse key
+                best = lists[ckey]
+                add = False
+                if len(best) < limit:
+                    # If the list is not full yet, just add this document
+                    add = True
+                elif sortkey < best[-1][0]:
+                    # If the list is full but this document has a lower sort
+                    # key than the highest key currently in the list, replace
+                    # the "least-best" document
+                    # Tell the child collector to remove the document
+                    child.remove(best.pop()[1])
+                    # Remember that a document was filtered
+                    collapsed_counts[ckey] += 1
+                    add = True
+
+                if add:
+                    insort(best, (sortkey, global_docnum))
+                    child.collect(sub_docnum)
+                else:
+                    # Remember that a document was filtered
+                    collapsed_counts[ckey] += 1
+
+    def results(self):
+        r = self.child.results()
+        r.collapsed_counts = self.collapsed_counts
+        return r
+
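The per-key bookkeeping in ``collect_matches()`` above can be sketched standalone (the ``collapse`` helper and its tuple format are illustrative, not part of Whoosh): each collapse key keeps a sorted list of at most ``limit`` (sortkey, docnum) pairs, lower sort keys ranking higher, and every displaced or rejected document is counted.

```python
from bisect import insort
from collections import defaultdict

def collapse(hits, limit=1):
    """`hits` is an iterable of (sortkey, docnum, key) tuples; lower
    sortkeys rank higher, and a key of None is never collapsed.
    Returns (kept docnums, {key: eliminated count})."""
    best = defaultdict(list)   # key -> sorted [(sortkey, docnum), ...]
    kept = []                  # docnums outside any collapse group
    counts = defaultdict(int)  # key -> number of eliminated documents
    for sortkey, docnum, key in hits:
        if key is None:
            kept.append(docnum)
            continue
        lst = best[key]
        if len(lst) < limit:
            # The list for this key isn't full yet: just add the doc
            insort(lst, (sortkey, docnum))
        elif sortkey < lst[-1][0]:
            # Better than the current "least-best" doc: replace it
            lst.pop()
            counts[key] += 1
            insort(lst, (sortkey, docnum))
        else:
            counts[key] += 1
    for lst in best.values():
        kept.extend(docnum for _, docnum in lst)
    return kept, dict(counts)
```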
+
+# Time limit collector
+
+class TimeLimitCollector(WrappingCollector):
+    """A collector that raises a :class:`TimeLimit` exception if the search
+    does not complete within a certain number of seconds.
+    
+    ::
+
+        uc = collectors.UnlimitedCollector()
+        tlc = TimeLimitCollector(uc, timelimit=5.8)
+        try:
+            mysearcher.search_with_collector(myquery, tlc)
+        except collectors.TimeLimit:
+            print("The search ran out of time!")
+        
+        # We can still get partial results from the collector
+        print(tlc.results())
+    """
+
+    def __init__(self, child, timelimit, greedy=False):
+        """
+        :param child: the collector to wrap.
+        :param timelimit: the maximum amount of time (in seconds) to
+            allow for searching. If the search takes longer than this, it will
+            raise a ``TimeLimit`` exception.
+        :param greedy: if ``True``, the collector will finish adding the most
+            recent hit before raising the ``TimeLimit`` exception.
+        """
+        self.child = child
+        self.timelimit = timelimit
+        self.greedy = greedy
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        self.child.prepare(top_searcher, q, requires_matcher)
+
+        # Start a timer thread. If the timer fires, it will call this object's
+        # _timestop() method
+        self.timedout = False
+        self.timer = threading.Timer(self.timelimit, self._timestop)
+        self.timer.start()
+
+    def _timestop(self):
+        self.timer = None
+        # Set an attribute that will be noticed in the collect_matches() loop
+        self.timedout = True
+
+    def collect_matches(self):
+        child = self.child
+        greedy = self.greedy
+
+        for sub_docnum in child.matches():
+            # If the timer fired since the last loop and we're not greedy,
+            # raise the exception
+            if self.timedout and not greedy:
+                raise TimeLimit
+
+            child.collect(sub_docnum)
+
+            # If the timer fired since we entered the loop or it fired earlier
+            # but we were greedy, raise now
+            if self.timedout:
+                raise TimeLimit
+
+    def finish(self):
+        if self.timer:
+            self.timer.cancel()
+        self.timer = None
+        self.child.finish()
+
+
+# Matched terms collector
+
+class TermsCollector(WrappingCollector):
+    """A collector that remembers which terms appeared in which terms appeared
+    in each matched document.
+    
+    This collector is used if you specify ``terms=True`` in the
+    :meth:`whoosh.searching.Searcher.search` method.
+    
+    If you have a reference to the collector, you can also use
+    ``TermsCollector.termdocs`` and ``TermsCollector.docterms`` to access the
+    matched terms directly::
+    
+        uc = collectors.UnlimitedCollector()
+        tc = TermsCollector(uc)
+        mysearcher.search_with_collector(myquery, tc)
+        # tc.termdocs is a dictionary mapping (fieldname, text) tuples to
+        # arrays of document numbers
+        print(tc.termdocs)
+        # tc.docterms is a dictionary mapping docnums to lists of
+        # (fieldname, text) tuples
+        print(tc.docterms)
+    """
+
+    def __init__(self, child, settype=set):
+        self.child = child
+        self.settype = settype
+
+    def prepare(self, top_searcher, q, requires_matcher=False):
+        # This collector requires a valid matcher at each step
+        self.child.prepare(top_searcher, q, True)
+
+        # A dictionary mapping (fieldname, text) pairs to arrays of docnums
+        self.termdocs = defaultdict(lambda: array("I"))
+        # A dictionary mapping docnums to lists of (fieldname, text) pairs
+        self.docterms = defaultdict(list)
+
+    def set_subsearcher(self, subsearcher, offset):
+        child = self.child
+        child.set_subsearcher(subsearcher, offset)
+        # Store a list of all the term matchers in the matcher tree
+        self.termmatchers = list(child.matcher.term_matchers())
+
+    def collect(self, sub_docnum):
+        child = self.child
+        termdocs = self.termdocs
+        docterms = self.docterms
+
+        child.collect(sub_docnum)
+
+        global_docnum = child.offset + sub_docnum
+        # For each term matcher...
+        for tm in self.termmatchers:
+            # If the term matcher is matching the current document...
+            if tm.is_active() and tm.id() == sub_docnum:
+                # Add it to the list of matching documents for the term
+                term = tm.term()
+                termdocs[term].append(global_docnum)
+                docterms[global_docnum].append(term)
+
+    def results(self):
+        r = self.child.results()
+        r.termdocs = dict(self.termdocs)
+        r.docterms = dict(self.docterms)
+        return r
+
+
+

src/whoosh/highlight.py

 
         for text in texts:
             m = results.searcher.postings(fieldname, text)
-            docset = results._termlists[(fieldname, text)]
+            docset = set(results.termdocs[(fieldname, text)])
             for docnum in sorted_ids:
                 if docnum in docset:
                     m.skip_to(docnum)

src/whoosh/searching.py

 from whoosh.util import now, lru_cache
 
 
-class TimeLimit(Exception):
-    pass
-
-
 class NoTermsException(Exception):
     """Exception raised you try to access matched terms on a :class:`Results`
     object was created without them. To record which terms matched in which
     message = "Results were created without recording terms"
 
 
+class TimeLimit(Exception):
+    """Raised by :class:`whoosh.collectors.TimeLimitCollector` if the time
+    limit is reached before the search finishes. If you have a reference to
+    the collector, you can get partial results by calling
+    :meth:`whoosh.collectors.TimeLimitCollector.results`.
+    """
+
+
 # Searcher class
 
 class Searcher(object):
             for docnum in method(self):
                 yield docnum
 
-    def search(self, q, limit=10, sortedby=None, reverse=False, groupedby=None,
-               optimize=True, filter=None, mask=None, terms=False,
-               maptype=None):
-        """Runs the query represented by the ``query`` object and returns a
-        Results object.
+    def collector(self, limit=10, sortedby=None, reverse=False, groupedby=None,
+                  collapse=None, collapse_limit=1, collapse_order=None,
+                  optimize=True, filter=None, mask=None, terms=False,
+                  maptype=None):
+        """Low-level method: returns a configured
+        :class:`whoosh.collectors.Collector` object based on the given
+        arguments. You can use this object with
+        :meth:`Searcher.search_with_collector` to search.
+        
+        See the documentation for the :meth:`Searcher.search` method for a
+        description of the parameters.
+        
+        This method may be useful to get a basic collector object and then wrap
+        it with another collector from ``whoosh.collectors`` or with a custom
+        collector of your own::
+        
+            # Equivalent of
+            # results = mysearcher.search(myquery, limit=10)
+            # but with a time limit...
+
+            # Create a TopCollector
+            c = mysearcher.collector(limit=10)
+
+            # Wrap it with a TimeLimitCollector with a time limit of
+            # 10.5 seconds
+            from whoosh.collectors import TimeLimitCollector
+            c = TimeLimitCollector(c, 10.5)
+
+            # Search using the custom collector and get the results
+            # from it (search_with_collector() does not return them)
+            mysearcher.search_with_collector(myquery, c)
+            results = c.results()
+        """
+
+        from whoosh import collectors
+
+        if limit is not None and limit < 1:
+            raise ValueError("limit must be >= 1")
+
+        if sortedby:
+            c = collectors.SortingCollector(sortedby, limit=limit,
+                                            reverse=reverse)
+        elif groupedby or reverse or not limit or limit >= self.doc_count():
+            # A collector that gathers every matching document
+            c = collectors.UnlimitedCollector(reverse=reverse)
+        else:
+            # A collector that uses block quality optimizations and a heap
+            # queue to only collect the top N documents
+            c = collectors.TopCollector(limit, usequality=optimize)
+
+        if groupedby:
+            c = collectors.FacetCollector(c, groupedby, maptype=maptype)
+        if terms:
+            c = collectors.TermsCollector(c)
+        if collapse:
+            c = collectors.CollapseCollector(c, collapse, limit=collapse_limit,
+                                             order=collapse_order)
+
+        # Filtering wraps last so it sees the docs first
+        if filter or mask:
+            c = collectors.FilterCollector(c, filter, mask)
+        return c
+
+    def search(self, q, **kwargs):
+        """Runs a :class:`whoosh.query.Query` object on this searcher and
+        returns a :class:`Results` object. See :doc:`/searching` for more
+        information.
+
+        This method takes many keyword arguments (documented below).
 
         See :doc:`/facets` for information on using ``sortedby`` and/or
-        ``groupedby``.
+        ``groupedby``. See :ref:`collapsing` for more information on using
+        ``collapse``, ``collapse_limit``, and ``collapse_order``.
 
-        :param query: a :class:`whoosh.query.Query` object.
+        :param q: a :class:`whoosh.query.Query` object to use to match
+            documents.
         :param limit: the maximum number of documents to score. If you're only
             interested in the top N documents, you can set limit=N to limit the
-            scoring for a faster search.
+            scoring for a faster search. Default is 10.
         :param sortedby: see :doc:`/facets`.
-        :param reverse: Reverses the direction of the sort.
+        :param reverse: Reverses the direction of the sort. Default is False.
         :param groupedby: see :doc:`/facets`.
         :param optimize: use optimizations to get faster results when possible.
+            Default is True.
         :param filter: a query, Results object, or set of docnums. The results
             will only contain documents that are also in the filter object.
         :param mask: a query, Results object, or set of docnums. The results
-            will not contain documents that are also in the mask object.
+            will not contain any documents that are in the mask object.
         :param terms: if True, record which terms were found in each matching
-            document. You can use :meth:`Results.contains_term` or
-            :meth:`Hit.contains_term` to check whether a hit contains a
-            particular term.
+            document. See :doc:`/searching` for more information. Default is
+            False.
         :param maptype: by default, the results of faceting with ``groupedby``
             is a dictionary mapping group names to ordered lists of document
             numbers in the group. You can pass a
             to specify a different (usually faster) method for grouping. For
             example, ``maptype=sorting.Count`` would store only the count of
             documents in each group, instead of the full list of document IDs.
+        :param collapse: a :doc:`facet </facets>` to use to collapse the
+            results. See :ref:`collapsing` for more information.
+        :param collapse_limit: the maximum number of documents to allow with
+            the same collapse key. See :ref:`collapsing` for more information.
+        :param collapse_order: an optional ordering :doc:`facet </facets>`
+            to control which documents are kept when collapsing. The default
+            (``collapse_order=None``) uses the results order (e.g. the highest
+            scoring documents in a scored search).
         :rtype: :class:`Results`
         """
 
-        if limit is not None and limit < 1:
-            raise ValueError("limit must be >= 1")
+        # Call the collector() method to build a collector based on the
+        # parameters passed to this method
+        c = self.collector(**kwargs)
+        # Call the lower-level method to run the collector
+        self.search_with_collector(q, c)
+        # Return the results object from the collector
+        return c.results()
 
-        collector = Collector(limit=limit, usequality=optimize,
-                              groupedby=groupedby, terms=terms,
-                              maptype=maptype)
+    def search_with_collector(self, q, collector):
+        """Low-level method: runs a :class:`whoosh.query.Query` object on this
+        searcher using the given :class:`whoosh.collectors.Collector` object
+        to collect the results::
+        
+            myquery = query.Term("content", "cabbage")
+        
+            uc = collectors.UnlimitedCollector()
+            tc = TermsCollector(uc)
+            
+            mysearcher.search_with_collector(myquery, tc)
+            print(tc.docterms)
+            print(tc.results())
+        
+        Note that this method does not return a :class:`Results` object. You
+        need to access the collector to get a results object or other
+        information the collector might hold after the search.
+        
+        :param q: a :class:`whoosh.query.Query` object to use to match
+            documents.
+        :param collector: a :class:`whoosh.collectors.Collector` object to feed
+            the results into.
+        """
 
-        if sortedby:
-            return collector.sort(self, q, sortedby, reverse=reverse,
-                                  allow=filter, restrict=mask)
+        collector.prepare(self, q)
+
+        # Make a list of subsearchers (if the searcher is atomic, it's a list
+        # of one)
+        if self.is_atomic():
+            subsearchers = [(self, 0)]
         else:
-            return collector.search(self, q, allow=filter, restrict=mask)
+            subsearchers = self.subsearchers
+
+        try:
+            # for each sub-searcher, run the query and collect the matching
+            # docs
+            for subsearcher, offset in subsearchers:
+                collector.set_subsearcher(subsearcher, offset)
+                collector.collect_matches()
+        finally:
+            collector.finish()
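The driver loop above (prepare, then set_subsearcher/collect_matches per sub-searcher, then finish, even on error) can be exercised with a minimal stand-in collector. Everything here is hypothetical illustration, not Whoosh API; a "subsearcher" is reduced to a plain list of sub-document numbers:

```python
class CountingCollector:
    """A minimal stand-in for the protocol that
    search_with_collector() drives: prepare(), set_subsearcher(),
    collect_matches(), finish()."""

    def prepare(self, top_searcher, q, requires_matcher=False):
        self.count = 0
        self.finished = False

    def set_subsearcher(self, subsearcher, offset):
        # Here a "subsearcher" is just a list of sub-document numbers
        self.subsearcher = subsearcher
        self.offset = offset

    def collect_matches(self):
        for sub_docnum in self.subsearcher:
            self.count += 1

    def finish(self):
        self.finished = True

def drive(collector, subsearchers, q=None):
    # Mirrors the loop in search_with_collector()
    collector.prepare(None, q)
    try:
        for subsearcher, offset in subsearchers:
            collector.set_subsearcher(subsearcher, offset)
            collector.collect_matches()
    finally:
        collector.finish()

c = CountingCollector()
drive(c, [([0, 1, 2], 0), ([0, 1], 3)])
```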
 
     def correct_query(self, q, qstring, correctors=None, allfields=False,
                       terms=None, prefix=0, maxdist=2):
         return sqc.correct_query(q, qstring)
 
 
-class Collector(object):
-    """A Collector finds the matching documents, scores them, collects them
-    into a list, and produces a Results object from them.
-
-    Normally you do not need to instantiate an instance of the base
-    Collector class, the :meth:`Searcher.search` method does that for you.
-
-    If you create a custom Collector instance or subclass you can use its
-    ``search()`` method instead of :meth:`Searcher.search`::
-
-        mycollector = MyCollector()
-        results = mycollector.search(mysearcher, myquery)
-
-    **Do not** re-use or share Collector instances between searches. You
-    should create a new Collector instance for each search.
-
-    To limit the amount of time a search can take, pass the number of
-    seconds to the ``timelimit`` keyword argument::
-
-        # Limit the search to 4.5 seconds
-        col = Collector(timelimit=4.5, greedy=False)
-        # If this call takes more than 4.5 seconds, it will raise a
-        # whoosh.searching.TimeLimit exception
-        try:
-            r = searcher.search(myquery, collector=col)
-        except TimeLimit, tl:
-            # You can still retrieve partial results from the collector
-            r = col.results()
-
-    If the ``greedy`` keyword is ``True``, the collector will finish adding
-    the most recent hit before raising the ``TimeLimit`` exception.
-    """
-
-    def __init__(self, limit=10, usequality=True, groupedby=None,
-                 timelimit=None, greedy=False, terms=False, replace=10,
-                 maptype=None):
-        """
-        :param limit: the maximum number of hits to collect. If this is None,
-            collect all hits.
-        :param usequality: whether to use block quality optimizations when
-            available. This is mostly useful for debugging purposes.
-        :param groupedby: see :doc:`/facets` for information.
-        :param timelimit: the maximum amount of time (in possibly fractional
-            seconds) to allow for searching. If the search takes longer than
-            this, it will raise a ``TimeLimit`` exception.
-        :param greedy: if ``True``, the collector will finish adding the most
-            recent hit before raising the ``TimeLimit`` exception.
-        :param terms: if ``True``, record which terms matched in each document.
-        :param maptype: a :class:`whoosh.sorting.FacetMap` type to use for all
-            facets that don't specify their own.
-        """
-
-        self.limit = limit
-        self.usequality = usequality
-        self.replace = replace
-        self.timelimit = timelimit
-        self.greedy = greedy
-        self.maptype = maptype
-        self.termlists = defaultdict(set) if terms else None
-
-        self.facets = None
-        if groupedby:
-            self.facets = sorting.Facets.from_groupedby(groupedby)
-
-        self.replaced_times = 0
-        self.skipped_times = 0
-
-    def should_add_all(self):
-        """Returns True if this collector needs to add all found documents (for
-        example, if ``limit=None``), or False if this collector should only
-        add the top N found documents.
-        """
-
-        limit = self.limit
-        if limit:
-            limit = min(limit, self.searcher.doc_count_all())
-        return not limit
-
-    def use_block_quality(self, searcher, matcher=None):
-        """Returns True if this collector can use block quality optimizations
-        (usequality is True, the matcher supports block quality, the weighting
-        does not use the final() method, etc.).
-        """
-
-        use = (self.usequality
-               and not searcher.weighting.use_final
-               and not self.should_add_all())
-        if matcher:
-            use = use and matcher.supports_block_quality()
-        return use
-
-    def _score_fn(self, searcher):
-        w = searcher.weighting
-        if w.use_final:
-            def scorefn(matcher):
-                score = matcher.score()
-                return w.final(searcher, matcher.id(), score)
-        else:
-            scorefn = None
-        return scorefn
-
-    def _set_categorizers(self, searcher, offset):
-        if self.facets:
-            self.categorizers = {}
-            for name, facet in self.facets.items():
-                catter = facet.categorizer(searcher)
-                catter.set_searcher(searcher, offset)
-                self.categorizers[name] = catter
-
-    def _set_filters(self, allow, restrict):
-        if allow:
-            allow = self.searcher._filter_to_comb(allow)
-        self.allow = allow
-        if restrict:
-            restrict = self.searcher._filter_to_comb(restrict)
-        self.restrict = restrict
-
-    def _set_timer(self):
-        # If this collector is time limited, start the timer thread
-        self.timer = None
-        if self.timelimit:
-            self.timer = threading.Timer(self.timelimit, self._timestop)
-            self.timer.start()
-
-    def _reset(self):
-        self.facetmaps = {}
-        self.items = []
-        self.timedout = False
-        self.runtime = -1
-        self.minscore = None
-        if self.facets:
-            self.facetmaps = dict((facetname, facet.map(self.maptype))
-                                  for facetname, facet in self.facets.items())
-        else:
-            self.facetmaps = {}
-
-    def _timestop(self):
-        # Called by the Timer when the time limit expires. Set an attribute on
-        # the collector to indicate that the timer has expired and the
-        # collector should raise a TimeLimit exception at the next consistent
-        # state.
-        self.timer = None
-        self.timedout = True
-
-    def collect(self, docid, offsetid, sortkey):
-        docset = self.docset
-        if docset is not None:
-            docset.add(offsetid)
-
-        if self.facets is not None:
-            for name, catter in self.categorizers.items():
-                add = self.facetmaps[name].add
-                if catter.allow_overlap:
-                    for key in catter.keys_for_id(docid):
-                        add(catter.key_to_name(key), offsetid, sortkey)
-                else:
-                    key = catter.key_to_name(catter.key_for_id(docid))
-                    add(key, offsetid, sortkey)
-
-    def search(self, searcher, q, allow=None, restrict=None):
-        """Top-level method call which uses the given :class:`Searcher` and
-        :class:`whoosh.query.Query` objects to return a :class:`Results`
-        object.
-
-        >>> # This is the equivalent of calling searcher.search(q)
-        >>> col = Collector()
-        >>> results = col.search(searcher, q)
-
-        This method takes care of calling :meth:`Collector.add_searcher`
-        for each sub-searcher in a collective searcher. You should only call
-        this method on a top-level searcher.
-        """
-
-        self.searcher = searcher
-        self.q = q
-        self._set_filters(allow, restrict)
-        self._reset()
-        self._set_timer()
-
-        # If we're not using block quality, then we can add every document
-        # number to a set as we see it, because we're not skipping low-quality
-        # blocks
-        self.docset = set() if not self.use_block_quality(searcher) else None
-
-        # Perform the search
-        t = now()
-
-        if searcher.is_atomic():
-            searchers = [(searcher, 0)]
-        else:
-            searchers = searcher.subsearchers
-
-        for s, offset in searchers:
-            scorefn = self._score_fn(s)
-            self.subsearcher = s
-            self._set_categorizers(s, offset)
-            self.add_matches(q, offset, scorefn)
-
-        # If we started a time limit timer thread, cancel it
-        if self.timelimit and self.timer:
-            self.timer.cancel()
-
-        self.runtime = now() - t
-        return self.results()
-
-    def add_matches(self, q, offset, scorefn):
-        items = self.items
-        limit = self.limit
-        addall = self.should_add_all()
-
-        for score, offsetid in self.pull_matches(q, offset, scorefn):
-            # Document numbers are negated before putting them in the heap so
-            # that higher document numbers have lower "priority" in the queue.
-            # Lower document numbers should always come before higher document
-            # numbers with the same score to keep the order stable.
-            negated_offsetid = 0 - offsetid
-
-            if addall:
-                # We're just adding all matches
-                items.append((score, negated_offsetid))
-            elif len(items) < limit:
-                # The heap isn't full, so add this document
-                heappush(items, (score, negated_offsetid))
-            else:
-                # The heap is full, but if this document has a high enough
-                # score to make the top N, add it to the heap
-                if score > items[0][0]:
-                    heapreplace(items, (score, negated_offsetid))
-                    self.minscore = items[0][0]
-
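The negated-docnum strategy in `add_matches` above can be sketched in isolation. This is a standalone illustration, not part of the patch; `top_n` and its signature are invented for the example:

```python
import heapq

# Standalone sketch of the top-N strategy used by add_matches above.
# Entries go onto a min-heap as (score, -docnum). heapq keeps the N best
# scores, and because the heap's smallest entry is evicted first, among
# equal scores the entry with the *highest* docnum (smallest -docnum) is
# evicted before lower docnums, which keeps the ordering stable.
def top_n(matches, limit):
    heap = []
    for score, docnum in matches:
        if len(heap) < limit:
            heapq.heappush(heap, (score, -docnum))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, -docnum))
    # Re-negate docnums and sort: best score first, lowest docnum on ties
    return sorted(((s, -n) for s, n in heap), key=lambda x: (-x[0], x[1]))
```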
-    def pull_matches(self, q, offset, scorefn):
-        """Low-level method that yields ``(score, offsetid)`` pairs from the
-        given query's matcher. Called by :meth:`Collector.add_matches`.
-        """
-
-        allow = self.allow
-        restrict = self.restrict
-        replace = self.replace
-        collect = self.collect
-        minscore = self.minscore
-        replacecounter = 0
-        timelimited = bool(self.timelimit)
-
-        matcher = q.matcher(self.subsearcher)
-        usequality = self.use_block_quality(self.subsearcher, matcher)
-
-        termlists = self.termlists
-        recordterms = termlists is not None
-        if recordterms:
-            termmatchers = list(matcher.term_matchers())
-        else:
-            termmatchers = None
-
-        # A flag to indicate whether we should check block quality at the start
-        # of the next loop
-        checkquality = True
-
-        while matcher.is_active():
-            # If the replacement counter has reached 0, try replacing the
-            # matcher with a more efficient version
-            if replace:
-                if replacecounter == 0 or self.minscore != minscore:
-                    matcher = matcher.replace(minscore or 0)
-                    self.replaced_times += 1
-                    if not matcher.is_active():
-                        break
-                    usequality = self.use_block_quality(self.subsearcher,
-                                                        matcher)
-                    replacecounter = replace
-                    minscore = self.minscore
-                    if recordterms:
-                        termmatchers = list(matcher.term_matchers())
-                replacecounter -= 1
-
-            # Check whether the time limit expired since the last match
-            if timelimited and self.timedout and not self.greedy:
-                raise TimeLimit
-
-            # If we're using block quality optimizations, and the checkquality
-            # flag is true, try to skip ahead to the next block with the
-            # minimum required quality
-            if usequality and checkquality and minscore is not None:
-                self.skipped_times += matcher.skip_to_quality(minscore)
-                # Skipping ahead might have moved the matcher to the end of the
-                # posting list
-                if not matcher.is_active():
-                    break
-
-            # The current document ID
-            docid = matcher.id()
-            offsetid = docid + offset
-
-            # Check whether the document is filtered
-            if ((not allow or offsetid in allow)
-                and (not restrict or offsetid not in restrict)):
-                # Collect and yield this document
-                if scorefn:
-                    score = scorefn(matcher)
-                    collect(docid, offsetid, score)
-                else:
-                    score = matcher.score()
-                    collect(docid, offsetid, 0 - score)
-                yield (score, offsetid)
-
-            # If recording terms, add the document to the termlists
-            if recordterms:
-                for m in termmatchers:
-                    if m.is_active() and m.id() == docid:
-                        termlists[m.term()].add(offsetid)
-
-            # Check whether the time limit expired
-            if timelimited and self.timedout:
-                raise TimeLimit
-
-            # Move to the next document. This method returns True if the
-            # matcher has entered a new block, so we should check block quality
-            # again.
-            checkquality = matcher.next()
-
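The allow/restrict check that appears in both `pull_matches` and `pull_unscored_matches` reduces to a small predicate. The sketch below is illustrative only; `permitted` is not a Whoosh function:

```python
def permitted(docid, allow=None, restrict=None):
    # None (or an empty set) means "no constraint" on that side.
    # A document passes if it is in the allow set (when one is given)
    # and not in the restrict set (when one is given).
    return ((not allow or docid in allow)
            and (not restrict or docid not in restrict))
```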
-    def sort(self, global_searcher, q, sortedby, reverse=False, allow=None,
-             restrict=None):
-        self.searcher = global_searcher
-        self.q = q
-        self.docset = set()
-        self._set_filters(allow, restrict)
-        self._reset()
-        self._set_timer()
-
-        items = self.items
-        limit = self.limit
-        heapfn = nlargest if reverse else nsmallest
-        addall = self.should_add_all()
-
-        facet = sorting.MultiFacet.from_sortedby(sortedby)
-        catter = facet.categorizer(global_searcher)
-        t = now()
-
-        if global_searcher.is_atomic():
-            searchers = [(global_searcher, 0)]
-        else:
-            searchers = global_searcher.subsearchers
-
-        for segment_searcher, offset in searchers:
-            self.subsearcher = segment_searcher
-            self._set_categorizers(segment_searcher, offset)
-            catter.set_searcher(segment_searcher, offset)
-
-            if catter.requires_matcher or self.termlists:
-                ls = list(self.pull_matches(q, offset, catter.key_for_matcher))
-            else:
-                kfi = catter.key_for_id
-                ls = list(self.pull_unscored_matches(q, offset, kfi))
-
-            if addall:
-                items.extend(ls)
-            else:
-                items = heapfn(limit, items + ls)
-
-        self.items = items
-        self.runtime = now() - t
-        return self.results(scores=False, reverse=reverse)
-
-    def pull_unscored_matches(self, q, offset, keyfn):
-        allow = self.allow
-        restrict = self.restrict
-        collect = self.collect
-        timelimited = bool(self.timelimit)
-
-        matcher = q.matcher(self.subsearcher)
-        for docnum in matcher.all_ids():
-            # Check whether the time limit expired since the last match
-            if timelimited and self.timedout and not self.greedy:
-                raise TimeLimit
-
-            # The current document ID
-            offsetid = docnum + offset
-
-            # Check whether the document is filtered
-            if ((not allow or offsetid in allow)
-                and (not restrict or offsetid not in restrict)):
-                # Collect and yield this document
-                key = keyfn(docnum)
-                collect(docnum, offsetid, key)
-                yield (key, offsetid)
-
-            # Check whether the time limit expired
-            if timelimited and self.timedout:
-                raise TimeLimit
-
-    def results(self, scores=True, reverse=False):
-        """Returns the current results from the collector. This is useful for
-        getting the results out of a collector that was stopped by a time
-        limit exception.
-        """
-
-        if scores:
-            # Docnums were negated before being stored in the heap (so that
-            # lower docnums win ties); re-negate them before returning
-            items = [(x[0], 0 - x[1]) for x in self.items]
-
-            # Sort by negated scores so that higher scores go first, then by
-            # document number to keep the order stable when documents have the
-            # same score
-            items.sort(key=lambda x: (0 - x[0], x[1]))
-        else:
-            items = sorted(self.items, reverse=reverse)
-
-        return Results(self.searcher, self.q, items, self.docset,
-                       facetmaps=self.facetmaps, runtime=self.runtime,
-                       filter=self.allow, mask=self.restrict,
-                       termlists=self.termlists)
-
-
 class Results(object):
     """This object is returned by a Searcher. This object represents the
     results of a search query. You can mostly use it as if it was a list of
     so keeps all files used by it open.
     """
 
-    def __init__(self, searcher, q, top_n, docset, facetmaps=None, runtime=-1,
-                 filter=None, mask=None, termlists=None, highlighter=None):
+    def __init__(self, searcher, q, top_n, docset=None, facetmaps=None,
+                 runtime=0, highlighter=None):
         """
         :param searcher: the :class:`Searcher` object that produced these
             results.
         self.docset = docset
         self._facetmaps = facetmaps or {}
         self.runtime = runtime
-        self._filter = filter
-        self._mask = mask
-        self._termlists = termlists
-
         self.highlighter = highlighter or highlight.Highlighter()
         self._char_cache = {}
 
         docset = set(self.searcher.docs_for_query(self.q))
 
         # Apply the filter and mask, if any, from the original search
-        if self._filter:
-            docset.intersection_update(self._filter)
-        if self._mask:
-            docset.difference_update(self._mask)
+        if hasattr(self, "allowed"):
+            docset.intersection_update(self.allowed)
+        if hasattr(self, "restricted"):
+            docset.difference_update(self.restricted)
 
         self.docset = docset
 
         return self.docset
 
     def copy(self):
-        """Returns a copy of this results object.
+        """Returns a copy of this results object, with deep copies of the
+        mutable ``docset`` and ``top_n`` members.
         """
 
-        topcopy = list(self.top_n)
-        setcopy = copy.copy(self.docset)
-        return self.__class__(self.searcher, self.q, topcopy, setcopy,
-                              runtime=self.runtime, filter=self._filter,
-                              mask=self._mask)
+        # Shallow copy self to get attributes
+        r = copy.copy(self)
+        # Deep copies of docset and top_n in case they're modified
+        r.docset = copy.deepcopy(self.docset)
+        r.top_n = copy.deepcopy(self.top_n)
+        return r
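The pattern in the new `copy()` (shallow-copy the object to carry over all attributes cheaply, then deep-copy only the members a caller might mutate) can be shown on a toy class. `Box` is a hypothetical stand-in, not a Whoosh class:

```python
import copy

class Box:
    """Toy stand-in for Results: one cheap attribute, one mutable one."""
    def __init__(self, name, items):
        self.name = name
        self.items = items

    def clone(self):
        c = copy.copy(self)                   # shallow: shares all attributes
        c.items = copy.deepcopy(self.items)   # re-copy what callers mutate
        return c
```

Mutating `clone().items` leaves the original untouched, while immutable attributes are shared at no cost.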
 
     def score(self, n):
         """Returns the score for the document at the Nth position in the list
         >>>
         """
 
-        return self._termlists is not None
+        return hasattr(self, "docterms") and hasattr(self, "termdocs")
 
     def matched_terms(self):
         """Returns the set of ``("fieldname", "text")`` tuples representing
         set([("content", "bravo")])
         """
 
-        if self._termlists is None:
+        if not self.has_matched_terms():
             raise NoTermsException
-        return set(self._termlists.keys())
+
+        return set(self.termdocs.keys())
 
     def _get_fragmenter(self):
         return self.highlighter.fragmenter
 
     >>> r = searcher.search(query.Term("content", "render"))
     >>> r[0]
-    <Hit {title=u"Rendering the scene"}>
+    <Hit {title=u"Rendering the scene"}>
     >>> r[0].rank
     0
     >>> r[0].docnum == 4592
         """
         :param results: the Results object this hit belongs to.
         :param pos: the position in the results list of this hit, for example
-            pos=0 means this is the first (highest scoring) hit.
+            pos=0 means this is the first (highest scoring) hit.
         :param docnum: the document number of this hit.
         :param score: the score of this hit.
         """
         ...   print("Doesn't contain:", q.all_terms() - hit.matched_terms())
         """
 
-        termlists = self.results._termlists
-        if termlists is None:
+        if not self.results.has_matched_terms():
             raise NoTermsException
-
-        # termlists maps terms->set of docnums, so we have to check every term
-        # to see if this document is in its list
-        s = set()
-        for term in termlists.keys():
-            if self.docnum in termlists[term]:
-                s.add(term)
-        return s
+        return self.results.docterms[self.docnum]
 
     def highlights(self, fieldname, text=None, top=3):
         """Returns highlighted snippets from the given field::
                                        top=top, numterms=numterms, model=model,
                                        normalize=normalize, filter=filter)
 
-    def contains_term(self, fieldname, text):
-        """Returns True if the given query term exists in this document. This
-        only works for terms that were in the original query.
-        """
-
-        termlists = self.results._termlists
-        if termlists is not None:
-            term = (fieldname, text)
-            if term in termlists:
-                docset = termlists[term]
-                return self.docnum in docset
-
-        return False
-
     def __repr__(self):
         return "<%s %r>" % (self.__class__.__name__, self.fields())
 

src/whoosh/sorting.py

    searches each segment to let the categorizer set up whatever segment-
    specific data it needs.
     
-    ``Collector.allow_overlap`` should be True if the caller should use the
-    ``keys_for_id`` method instead of ``key_for_id`` to group documents into
-    potentially overlapping groups.
+    ``Categorizer.allow_overlap`` should be ``True`` if the caller can use the
+    ``keys_for`` method instead of ``key_for`` to group documents into
+    potentially overlapping groups. The default is ``False``.
+    
+    If a categorizer subclass can categorize the document using only the
+    document number, it should set ``Categorizer.requires_matcher`` to ``False``
+    (this is the default) and NOT USE the given matcher in the ``key_for`` or
+    ``keys_for`` methods, since in that case ``segment_docnum`` is not
+    guaranteed to be consistent with the given matcher. If a categorizer
+    subclass needs to access information on the matcher, it should set
+    ``requires_matcher`` to ``True``. This will prevent the caller from using
+    optimizations that might leave the matcher in an inconsistent state.
     """
 
     allow_overlap = False
 
         pass
 
-    def key_for_matcher(self, matcher):
-        """Returns a key for the given matcher. The default implementation
-        simply gets the matcher's current document ID and calls ``key_for_id``,
-        but a subclass can override this if it needs information from the
-        matcher to compute the key.
+    def key_for(self, matcher, segment_docnum):
+        """Returns a key for the current match.
+        
+        :param matcher: a :class:`whoosh.matching.Matcher` object. If
+            ``self.requires_matcher`` is ``False``, DO NOT use this object,
+            since it may be inconsistent. Use the given ``segment_docnum``
+            instead.
+        :param segment_docnum: the segment-relative document number of the
+            current match.
         """
 
-        return self.key_for_id(matcher.id())
-
-    def key_for_id(self, docid):
-        """Returns a key for the given **segment-relative** document number.
-        """
+        # Backwards compatibility
+        if hasattr(self, "key_for_id"):
+            return self.key_for_id(segment_docnum)
+        elif hasattr(self, "key_for_matcher"):
+            return self.key_for_matcher(matcher)
 
         raise NotImplementedError(self.__class__)
 
-    def keys_for_id(self, docid):
-        """Yields a series of keys for the given **segment-relative** document
-        number. This method will be called instead of ``key_for_id`` if
-        ``Categorizer.allow_overlap==True``.
+    def keys_for(self, matcher, segment_docnum):
+        """Yields a series of keys for the current match.
+        
+        This method will be called instead of ``key_for`` if
+        ``self.allow_overlap`` is ``True``.
+        
+        :param matcher: a :class:`whoosh.matching.Matcher` object. If
+            ``self.requires_matcher`` is ``False``, DO NOT use this object,
+            since it may be inconsistent. Use the given ``segment_docnum``
+            instead.
+        :param segment_docnum: the segment-relative document number of the
+            current match.
         """
 
+        # Backwards compatibility
+        if hasattr(self, "keys_for_id"):
+            return self.keys_for_id(segment_docnum)
+
         raise NotImplementedError(self.__class__)
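Under the new contract, a categorizer that can key documents from the docnum alone leaves `requires_matcher` as `False` and ignores the matcher argument entirely. A minimal hypothetical sketch (kept self-contained here, so it does not subclass the real `Categorizer`):

```python
class ParityCategorizer:
    # Sketch of a categorizer under the new key_for signature. It never
    # touches the matcher, so requires_matcher stays False and the
    # collector is free to apply matcher-replacement optimizations.
    allow_overlap = False
    requires_matcher = False

    def key_for(self, matcher, segment_docnum):
        # `matcher` is deliberately unused: it may be inconsistent with
        # segment_docnum when requires_matcher is False.
        return segment_docnum % 2
```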
 
     def key_to_name(self, key):
             r = segment_searcher.reader()