Commits

Matt Chaput committed a7eee8a

Cleaned up collector logic. Added ability to overlap field facets. Added/revised docs.

Comments (0)

Files changed (6)

docs/source/facets.rst

-=====================================
-How to present faceted search results
-=====================================
+====================
+Sorting and faceting
+====================
 
 .. note::
-    The API for sorting and faceting changed in Whoosh 1.5.
+    The API for sorting and faceting changed in Whoosh 2.0.
 
 Overview
 ========
 
+Sorting and faceting search results in Whoosh is based on **facets**. Each
+facet associates a value with each document in the search results, allowing you
+to sort by the keys or use them to group the documents. Whoosh includes a
+variety of **facet types** you can use for sorting and grouping (see below).
+
+
+Sorting
+=======
+
+By default, the results of a search are sorted with the highest-scoring
+documents first. You can use the ``sortedby`` keyword argument to order the
+results by some other criteria instead, such as the value of a field.
+
+
+A note about sorting by fields
+------------------------------
+
+When sorting by fields, the **indexed terms** in the field are used as the
+"value" to sort by, **not** the stored values. For example, take this index::
+
+    schema = fields.Schema(title=fields.TEXT(stored=True))
+    ix = index.create_in("indexdir", schema)
+    w = ix.writer()
+    w.add_document(title=u"Best Bet")
+    w.add_document(title=u"First Action")
+    w.commit()
+    
+If you sort this index by "title", you might expect the results to be
+"Best Bet" followed by "First Action", but in fact it will be the reverse! This
+is because Whoosh is sorting by **terms**, not the original text you indexed.
+For fields with multiple terms, it just picks the (alphabetically) first one,
+so the document containing "action" sorts before the one with "best".
+
+For this reason, you don't want to sort by TEXT fields. Instead, you should set
+up separate, single-term fields that you can sort by. You can duplicate content
+if you want to be able to sort by the original value of a TEXT field::
+
+    schema = fields.Schema(title=fields.TEXT(stored=True),
+                           sort_title=fields.ID)
+    ix = index.create_in("indexdir", schema)
+    w = ix.writer()
+    for title in titles:
+        w.add_document(title=title, sort_title=title)
+    w.commit()
+
+
+The sortedby keyword argument
+-----------------------------
+
+You can use the following objects as ``sortedby`` values:
+
+A ``FacetType`` object
+    Uses this object to sort the documents. See below for the available facet
+    types.
+
+A field name string
+    Converts the field name into a ``FieldFacet`` (see below) and uses it to
+    sort the documents.
+
+A list of ``FacetType`` objects and/or field name strings
+    Bundles the facets together into a ``MultiFacet`` so you can sort by
+    multiple keys. Note that this shortcut does not allow you to reverse
+    the sort direction of individual facets. To do that, you need to construct
+    the ``MultiFacet`` object yourself.
+
+.. note::
+    You can use the ``reverse=True`` keyword argument to the
+    ``Searcher.search()`` method to reverse the overall sort direction. This
+    is more efficient than reversing each individual facet.
+
+
+Examples
+--------
+
+Sort by the value of the size field::
+
+    results = searcher.search(myquery, sortedby="size")
+    
+Sort by the reverse (highest-to-lowest) order of the "price" field::
+
+    facet = sorting.FieldFacet("price", reverse=True)
+    results = searcher.search(myquery, sortedby=facet)
+
+Sort by ascending size and then descending price::
+
+    mf = sorting.MultiFacet()
+    mf.add_field("size")
+    mf.add_field("price", reverse=True)
+    results = searcher.search(myquery, sortedby=mf)
+    
+    # or...
+    sizes = sorting.FieldFacet("size")
+    prices = sorting.FieldFacet("price", reverse=True)
+    results = searcher.search(myquery, sortedby=[sizes, prices])
+    
+Sort by the "category" field, then by the document's score::
+
+    cats = sorting.FieldFacet("category")
+    scores = sorting.ScoreFacet()
+    results = searcher.search(myquery, sortedby=[cats, scores])
+
+
+Grouping
+========
+
 It is often very useful to present "faceted" search results to the user.
-Faceting is dynamic clustering of search results into categories. The
+Faceting is dynamic grouping of search results into categories. The
 categories let users view a slice of the total results based on the
 category they're interested in, similar to how the Spotlight quick results
 work on Mac OS X.
 
-.. tip::
-    Whoosh currently only supports **non-overlapping** categories. A document
-    cannot belong to multiple facets at the same time. (It is not an error if
-    the facets overlap; each document will simply be sorted into one category
-    arbitrarily.)
 
-Faceting relies on field caches. See :doc:`fieldcaches` for information about
-field caches.
+The groupedby keyword argument
+------------------------------
 
+You can use the following objects as ``groupedby`` values:
 
-Categorizing search results by field
-====================================
+A ``FacetType`` object
+    Uses this object to group the documents. See below for the available facet
+    types. The facet name will automatically be set to ``"facet"``.
 
-When you use the :meth:`Searcher.search` method, add the `groups` keyword
-argument to specify how to categorize the results::
+A field name string
+    Converts the field name into a ``FieldFacet`` (see below) and uses it to
+    group the documents. The name of the field is used as the facet name.
 
-    # Group by the value of the "tag" field
-    results = searcher.search(my_query, groupedby=["tag"])
+A dictionary mapping facet names to FacetType objects
+    Sets up multiple grouping criteria.
+
+A ``Facets`` object
+    This object is a lot like using a dictionary, but has some convenience
+    methods to make setting up multiple groupings a little easier.
+
+Examples
+--------
+
+Group by the value of the "category" field::
+
+    results = searcher.search(myquery, groupedby="category")
     
-    # Retrieve the groups from the results object
-    groups = results.groups("tag")
+Group by the value of the "category" field and also by the value of the
+"tags" field::
 
-The ``groups()`` method simply returns a dictionary mapping category names
-to lists of **document IDs**. The document IDs will be in their relative
-order from the original results.
+    cats = sorting.FieldFacet("category")
+    tags = sorting.FieldFacet("tags", allow_overlap=True)
+    results = searcher.search(myquery, groupedby={"cats": cats, "tags": tags})
+    
+    # ...or, use a Facets object for a little less duplication
+    facets = sorting.Facets()
+    facets.add_field("category")
+    facets.add_field("tags", allow_overlap=True)
+    results = searcher.search(myquery, groupedby=facets)
 
-    {"small": [5, 1, 4, 8, 2], "medium": [3, 0, 6], "large": [9, 7]}
+To group results by the *intersected values of multiple fields*, use a
+``MultiFacet`` object (see below). For example, if you have two fields named
+``tag`` and ``size``, you could group the results by all combinations of the
+``tag`` and ``size`` fields, such as ``('tag1', 'small')``,
+``('tag2', 'small')``, ``('tag1', 'medium')``, and so on::
 
-The last thing you need to know is how to translate document numbers into
-something you can display. The ``Searcher`` object's ``stored_fields()``
-method takes a document number and returns the document's stored fields as a
-dictionary::
+    # Generate a grouping from the combination of the "tag" and "size" fields
+    mf = sorting.MultiFacet("tag", "size")
+    results = searcher.search(myquery, groupedby={"tag/size": mf})
+
+
+Getting the faceted groups
+--------------------------
+
+The ``Results.groups("facetname")`` method returns a dictionary mapping
+category names to lists of **document IDs**::
+
+    {"small": [1, 2, 4, 5, 8], "medium": [0, 3, 6], "large": [7, 9]}
+
+The ``Searcher`` object's ``stored_fields()`` method takes a document number
+and returns the document's stored fields as a dictionary::
 
     for category_name in categories:
         print "Top 5 documents in the %s category" % category_name
         if len(doclist) > 5:
             print "  (%s more)" % (len(doclist) - 5)
 
-You can use the categories dictionary and ``stored_fields()`` to display the
-categories and results any way you want in your application.
 
+If you just want to **count** the number of documents in each group, instead of
+generating a full list of the documents, use the ``groupids=False`` keyword
+argument::
 
-Getting multiple categorizations of results
-===========================================
+    results = searcher.search(myquery, groupedby="size", groupids=False)
+    groups = results.groups("size")
+    # {"small": 5, "medium": 3, "large": 2}
 
 To generate multiple groupings, you can name multiple fields in the list you
 pass to the ``groupedby`` keyword::
 
     # Generate separate groupings for the "tag" and "size" fields
-    results = searcher.search(my_query, groupedby=["tag", "size"])
+    results = searcher.search(myquery, groupedby=["tag", "size"])
     
     # Get the groupings by "tag"
     tag_groups = results.groups("tag")
     size_groups = results.groups("size")
 
 
-Categorizing by multiple fields
-===============================
+Facet types
+===========
 
-To group results by the *combined values of multiple fields*, use a tuple of
-field names instead of a field name. For example, if you have two fields named
-``tag`` and ``size``, you could group the results by all combinations of the
-``tag`` and ``size`` field, such as ``('tag1', 'small')``,
-``('tag2', 'small')``, ``('tag1', 'medium')``, and so on::
+FieldFacet
+----------
 
-    # Generate a grouping from the combination of the "tag" and "size" fields
-    results = searcher.search(my_query, groupedby=[("tag", "size")])
+This is the most common facet type. It sorts or groups based on the
+value in a certain field in each document. This generally works best
+(or at all) if each document has only one term in the field (e.g. an ID
+field)::
+
+    # Sort search results by the value of the "path" field
+    facet = sorting.FieldFacet("path")
+    results = searcher.search(myquery, sortedby=facet)
     
-    groups = results.groups(("tag", "size"))
+    # Group search results by the value of the "parent" field
+    facet = sorting.FieldFacet("parent")
+    results = searcher.search(myquery, groupedby=facet)
+    parent_groups = results.groups("parent")
 
+By default, ``FieldFacet`` only supports **non-overlapping** grouping, where a
+document cannot belong to multiple groups at the same time (each document will
+be sorted into one category arbitrarily). To get overlapping groups with
+multi-valued fields, use the ``allow_overlap=True`` keyword argument::
 
-Categorizing based on custom queries
-====================================
+    facet = sorting.FieldFacet(fieldname, allow_overlap=True)
 
-If you need more complex categories, you can set up categories defined by
-queries. For example, you can create price categories using range queries::
+This supports overlapping group membership where documents have more than one
+term in a field (e.g. KEYWORD fields). If you don't need overlapping, don't
+use ``allow_overlap`` because it's slower and less memory-efficient.
+
+
+QueryFacet
+----------
+
+You can set up categories defined by arbitrary queries. For example, you can
+group names into alphabetical ranges using term range queries::
 
     # Use queries to define each category
     # (Here I'll assume "name" is an ID field)
-    categories = {}
-    category["$0 - $100"] = query.NumericRange("price", 0, 100)
-    category["$101 - $500"] = query.NumericRange("price", 101, 500)
-    category["$501 - $1000"] = query.NumericRange("price", 501, 1000)
+    qdict = {}
+    qdict["A-D"] = query.TermRange("name", "a", "d")
+    qdict["E-H"] = query.TermRange("name", "e", "h")
+    qdict["I-L"] = query.TermRange("name", "i", "l")
+    # ...
     
-    # Define the facets on the searcher. If save=True, the cached
-    # facets will be saved to disk for future use. Use save=False to
-    # avoid this for one-off queries.
-    searcher.define_facets("pricerange", categories, save=False)
+    qfacet = sorting.QueryFacet(qdict)
+    r = searcher.search(myquery, groupedby={"firstltr": qfacet})
+    
+By default, ``QueryFacet`` only supports **non-overlapping** grouping, where a
+document cannot belong to multiple groups at the same time (each document will
+be sorted into one category arbitrarily). To let a document fall into every
+group whose query it matches, use the ``allow_overlap=True`` keyword argument::
 
-Now you can use ``pricerange`` as if it was the name of a field for the
-purposes of grouping and sorting::
+    facet = sorting.QueryFacet(querydict, allow_overlap=True)
 
-    r = searcher.search(my_query, groupedby=["princerange"])
 
+RangeFacet
+----------
 
+The ``RangeFacet`` is for NUMERIC field types. It divides a range of possible
+values into groups. For example, to group documents based on price into
+buckets $100 "wide"::
 
+    pricefacet = sorting.RangeFacet("price", 0, 1000, 100)
+    
+The first argument is the name of the field. The next two arguments are the
+full range to be divided. Values outside this range (in this example, values
+below 0 and above 1000) will be sorted into the "missing" (None) group. The
+fourth argument is the "gap size", the size of the divisions in the range.
 
+The "gap" can be a list instead of a single value. In that case, the values in
+the list will be used to set the size of the initial divisions, with the last
+value in the list being the size for all subsequent divisions. For example::
 
+    pricefacet = sorting.RangeFacet("price", 0, 1000, [5, 10, 35, 50])
+    
+...will set up divisions of 0-5, 5-15, 15-50, 50-100, and then use 50 as the
+size for all subsequent divisions (i.e. 100-150, 150-200, and so on).
 
+The ``hardend`` keyword argument controls whether the last division is clamped
+to the end of the range or allowed to go past the end of the range. For
+example, this::
 
+    facet = sorting.RangeFacet("num", 0, 10, 4, hardend=False)
+    
+...gives divisions 0-4, 4-8, and 8-12, while this::
 
+    facet = sorting.RangeFacet("num", 0, 10, 4, hardend=True)
+    
+...gives divisions 0-4, 4-8, and 8-10. (The default is ``hardend=False``.)
 
+.. note::
+    The ranges/buckets are always **inclusive** at the start and **exclusive**
+    at the end.
 
+
+DateRangeFacet
+--------------
+
+This is like ``RangeFacet`` but for DATETIME fields. The start and end values
+must be ``datetime.datetime`` objects, and the gap(s) must be
+``datetime.timedelta`` objects.
+
+For example::
+
+    from datetime import datetime, timedelta
+
+    start = datetime(2000, 1, 1)
+    end = datetime.now()
+    gap = timedelta(days=365)
+    bdayfacet = sorting.DateRangeFacet("birthday", start, end, gap)
+
+As with ``RangeFacet``, you can use a list of gaps and the ``hardend`` keyword
+argument.
+
+
+ScoreFacet
+----------
+
+This facet uses the document's relevance score as the sorting key. It is
+mainly useful as a secondary criterion when sorting by another facet first.
+
+For example, to sort by the "category" field, then for documents with the same
+category, sort by the document's score::
+
+    cats = sorting.FieldFacet("category")
+    scores = sorting.ScoreFacet()
+    results = searcher.search(myquery, sortedby=[cats, scores])
+
+The ``ScoreFacet`` always sorts higher scores before lower scores.
+
+.. note::
+    While using ``sortedby=ScoreFacet()`` should give the same results as using
+    the default scored ordering (``sortedby=None``), using the facet will be
+    slower because Whoosh automatically turns off many optimizations when
+    sorting.
+
+
+FunctionFacet
+-------------
+
+This facet lets you pass a custom function to compute the sorting/grouping key
+for documents. (Using this facet type may be easier than subclassing
+``FacetType`` and ``Categorizer`` to set up some custom behavior.)
+
+The function will be called with the index searcher and index document ID as
+arguments. For example, if you have an index with term vectors::
+
+    schema = fields.Schema(id=fields.STORED,
+                           text=fields.TEXT(stored=True, vector=True))
+    ix = RamStorage().create_index(schema)
+    
+...you could use a function to sort documents higher the closer they are to
+having equal occurrences of two terms::
+    
+    def fn(searcher, docnum):
+        v = dict(searcher.vector_as("frequency", docnum, "text"))
+        # Sort documents that have equal number of "alfa" and "bravo" first
+        return 0 - (1.0 / (abs(v.get("alfa", 0) - v.get("bravo", 0)) + 1.0))
+
+    facet = sorting.FunctionFacet(fn)
+    results = searcher.search(myquery, sortedby=facet)
+
+
+MultiFacet
+==========
+
+This facet type returns a composite of the keys returned by two or more
+sub-facets, allowing you to sort/group by the intersected values of multiple
+facets.
+
+``MultiFacet`` has methods for adding facets::
+
+    myfacet = sorting.RangeFacet("size", 0, 1000, 10)
+
+    mf = sorting.MultiFacet()
+    mf.add_field("category")
+    mf.add_field("price", reverse=True)
+    mf.add_facet(myfacet)
+    mf.add_score()
+
+You can also pass a list of field names and/or ``FacetType`` objects to the
+initializer::
+
+    prices = sorting.FieldFacet("price", reverse=True)
+    scores = sorting.ScoreFacet()
+    mf = sorting.MultiFacet("category", prices, myfacet, scores)
+
+
+Missing values
+==============
+
+* When sorting, documents without any terms in a given field, or whatever else
+  constitutes "missing" for different facet types, will always sort to the end.
+
+* When grouping, "missing" documents will appear in a group with the
+  key ``None``.
+
+
+Expert: writing your own facet
+==============================
+
+TBD.
+
+

src/whoosh/fields.py

         faceting in this field.
         """
         
-        return u("")
+        return u('\uFFFF')
     
     def to_text(self, value):
         """Returns a textual representation of the value. Non-textual fields

src/whoosh/filedb/fieldcache.py

     each document with a value through the array.
     """
     
-    def __init__(self, order=None, texts=None, hastexts=True, default=u(""),
-                 typecode="I"):
+    def __init__(self, order=None, texts=None, hastexts=True,
+                 default=u('\uFFFF'), typecode="I"):
         """
         :param order: an array of ints.
         :param texts: a list of text values.
 # Streaming cache file writer
 
 class FieldCacheWriter(object):
-    def __init__(self, dbfile, size=0, hastexts=True, code="I", default=u("")):
+    def __init__(self, dbfile, size=0, hastexts=True, code="I",
+                 default=u('\uFFFF')):
         self.dbfile = dbfile
         self.order = array(self.code, [0] * size)
         self.hastexts = hastexts

src/whoosh/searching.py

             dls = doclists_for_searcher(self)
             self.ixreader.define_facets(name, dls, save=save)
     
-    def categorize_query(self, q, fieldname, counts=False):
-        """This method is deprecated. Instead use the ``groupedby`` argument to
-        the :meth:`Searcher.search` method.
-        """
-        
-        if isinstance(fieldname, (tuple, list)):
-            facet = sorting.MultiFacet(fieldname)
-        else:
-            facet = sorting.FieldFacet(fieldname)
-        
-        results = self.search(q, groupedby={fieldname: facet})        
-        return results.groups(fieldname)
-    
     def docs_for_query(self, q, leafs=True):
         if self.subsearchers and leafs:
             for s, offset in self.subsearchers:
         """Runs the query represented by the ``query`` object and returns a
         Results object.
         
+        See :doc:`/facets` for information on using ``sortedby`` and/or
+        ``groupedby``.
+        
         :param query: a :class:`whoosh.query.Query` object.
         :param limit: the maximum number of documents to score. If you're only
             interested in the top N documents, you can set limit=N to limit the
             scoring for a faster search.
-        :param sortedby: the name of a field to sort by, or a
-            :class:`whoosh.sorting.Sorter` object.
+        :param sortedby: see :doc:`/facets`.
         :param reverse: Reverses the direction of the sort.
-        :param groupedby: a list of field names or facet names, or a
-            :class:`whoosh.sorting.Facets` object. If you supply this
-            argument, you can use the :meth:`Results.groups` method on the
-            results object to retrieve a dictionary mapping field/facet values
-            to document numbers.
+        :param groupedby: see :doc:`/facets`.
         :param optimize: use optimizations to get faster results when possible.
         :param filter: a query, Results object, or set of docnums. The results
             will only contain documents that are also in the filter object.
             raise ValueError("limit must be >= 1")
 
         collector = Collector(limit=limit, usequality=optimize,
-                              groupedby=groupedby, reverse=reverse)
+                              groupedby=groupedby, reverse=reverse,
+                              groupids=groupids)
         
         if sortedby:
             return collector.sort(self, q, sortedby, allow=filter,
         return scorefn
     
     def _set_categorizers(self, searcher, offset):
+        groups = self.groups
         if self.facets:
             self.categorizers = dict((name, facet.categorizer(searcher))
                                      for name, facet in self.facets.items())
-            for catter in self.categorizers.values():
+            
+            for name, catter in self.categorizers.items():
+                if self.groupids and name not in groups:
+                    groups[name] = defaultdict(list)
+                elif name not in groups:
+                    groups[name] = defaultdict(int)
+                
                 catter.set_searcher(searcher, offset)
     
     def _set_filters(self, allow, restrict):
         self.timer = None
         self.timedout = True
     
+    def _add_to_group(self, name, key, offsetid):
+        if self.groupids:
+            self.groups[name][key].append(offsetid)
+        else:
+            self.groups[name][key] += 1
+    
     def collect(self, id, offsetid):
         docset = self.docset
         if docset is not None:
             docset.add(offsetid)
         
         if self.facets is not None:
-            groups = self.groups
+            add = self._add_to_group
             for name, catter in self.categorizers.items():
-                key = catter.key_to_name(catter.key_for_id(id))
-                if self.groupids:
-                    if name not in groups:
-                        groups[name] = defaultdict(list)
-                    groups[name][key].append(offsetid)
+                if catter.allow_overlap:
+                    for key in catter.keys_for_id(id):
+                        add(name, catter.key_to_name(key), offsetid)
                 else:
-                    if name not in groups:
-                        groups[name] = defaultdict(int)
-                    groups[name][key] += 1
-    
-    def sort(self, searcher, q, sortedby, allow=None, restrict=None):
-        self.searcher = searcher
-        self.q = q
-        self.docset = set()
-        self._set_filters(allow, restrict)
-        self._reset()
-        self._set_timer()
-        
-        facet = sorting.MultiFacet.from_sortedby(sortedby)
-        catter = facet.categorizer(searcher)
-        keyfn = catter.key_for_matcher
-        t = now()
-        if searcher.is_atomic():
-            self._set_categorizers(searcher, 0)
-            catter.set_searcher(searcher, 0)
-            self.add_sorted_matches(searcher, q, 0, keyfn)
-        else:
-            for s, offset in searcher.subsearchers:
-                self._set_categorizers(s, offset)
-                catter.set_searcher(s, offset)
-                self.add_sorted_matches(s, q, offset, keyfn)
-        self.runtime = now() - t
-        return self.results(scores=False)
-    
-    def add_sorted_matches(self, searcher, q, offset, keyfn):
-        items = self.items
-        limit = self.limit
-        reverse = self.reverse
-        heapfn = nlargest if reverse else nsmallest
-        addall = self.should_add_all()
-        matcher = q.matcher(searcher)
-        
-        ls = list(self.pull_matches(searcher, matcher, offset, keyfn, False))
-        if addall:
-            items.extend(ls)
-        else:
-            self.items = heapfn(limit, items + ls)
+                    key = catter.key_to_name(catter.key_for_id(id))
+                    add(name, key, offsetid)
     
     def search(self, searcher, q, allow=None, restrict=None):
         """Top-level method call which uses the given :class:`Searcher` and
         
         # Perform the search
         t = now()
+        
         if searcher.is_atomic():
-            scorefn = self._score_fn(searcher)
-            self._set_categorizers(searcher, 0)
-            self.add_matches(searcher, q, 0, scorefn)
+            searchers = [(searcher, 0)]
         else:
-            for s, offset in searcher.subsearchers:
-                scorefn = self._score_fn(s)
-                self._set_categorizers(s, offset)
-                self.add_matches(s, q, offset, scorefn)
+            searchers = searcher.subsearchers
+            
+        for s, offset in searchers:
+            scorefn = self._score_fn(s)
+            self._set_categorizers(s, offset)
+            self.add_matches(s, q, offset, scorefn)
         
         # If we started a time limit timer thread, cancel it
         if self.timelimit and self.timer:
             # matcher with a more efficient version
             if replace:
                 if replacecounter == 0 or self.minscore != minscore:
-                    #print "minscore=", minscore, "max_qual=", matcher.max_quality()
                     matcher = matcher.replace(minscore or 0)
                     if not matcher.is_active():
                         break
             # minimum required quality
             if usequality and checkquality and minscore is not None:
                 matcher.skip_to_quality(minscore)
-                #print "skipped=", skipped
                 # Skipping ahead might have moved the matcher to the end of the
                 # posting list
                 if not matcher.is_active():
                 yield (score, offsetid)
             
             # Check whether the time limit expired
-            if self.timedout:
+            if timelimited and self.timedout:
                 raise TimeLimit
             
             # Move to the next document. This method returns True if the
             # again.
             checkquality = matcher.next()
     
+    def sort(self, searcher, q, sortedby, allow=None, restrict=None):
+        self.searcher = searcher
+        self.q = q
+        self.docset = set()
+        self._set_filters(allow, restrict)
+        self._reset()
+        self._set_timer()
+        
+        items = self.items
+        limit = self.limit
+        heapfn = nlargest if self.reverse else nsmallest
+        addall = self.should_add_all()
+        
+        facet = sorting.MultiFacet.from_sortedby(sortedby)
+        catter = facet.categorizer(searcher)
+        t = now()
+        
+        if searcher.is_atomic():
+            searchers = [(searcher, 0)]
+        else:
+            searchers = searcher.subsearchers
+            
+        for s, offset in searchers:
+            self._set_categorizers(s, offset)
+            catter.set_searcher(s, offset)
+            matcher = q.matcher(s)
+            
+            if catter.requires_matcher:
+                ls = list(self.pull_matches(s, matcher, offset,
+                                            catter.key_for_matcher, False))
+            else:
+                ls = list(self.pull_unscored_matches(matcher, offset,
+                                                     catter.key_for_id))
+            
+            if addall:
+                items.extend(ls)
+            else:
+                items = heapfn(limit, items + ls)
+        
+        self.items = items
+        self.runtime = now() - t
+        return self.results(scores=False)
+    
+    def pull_unscored_matches(self, matcher, offset, keyfn):
+        allow = self.allow
+        restrict = self.restrict
+        collect = self.collect
+        timelimited = bool(self.timelimit)
+        
+        for id in matcher.all_ids():
+            # Check whether the time limit expired since the last match
+            if timelimited and self.timedout and not self.greedy:
+                raise TimeLimit
+            
+            # The current document ID 
+            offsetid = id + offset
+            
+            # Check whether the document is filtered
+            if ((not allow or offsetid in allow)
+                and (not restrict or offsetid not in restrict)):
+                # Collect and yield this document
+                collect(id, offsetid)
+                yield (keyfn(id), offsetid)
+            
+            # Check whether the time limit expired
+            if timelimited and self.timedout:
+                raise TimeLimit
+    
     def results(self, scores=True):
         """Returns the current results from the collector. This is useful for
         getting the results out of a collector that was stopped by a time

src/whoosh/sorting.py

 
 from array import array
 
-from whoosh.compat import string_type, xrange
+from whoosh.compat import string_type, u, xrange
 from whoosh.fields import DEFAULT_LONG
 from whoosh.support.times import (long_to_datetime, datetime_to_long,
                                   timedelta_to_usecs)
     The collector will call a categorizer's ``set_searcher`` method as it
     searches each segment to let the categorizer set up whatever segment-
     specific data it needs.
+    
+    ``Categorizer.allow_overlap`` should be True if the caller should use the
+    ``keys_for_id`` method instead of ``key_for_id`` to group documents into
+    potentially overlapping groups.
     """
     
+    allow_overlap = False
+    requires_matcher = False
+    
     def set_searcher(self, searcher, docoffset):
         """Called by the collector when the collector moves to a new segment.
         The ``searcher`` will be atomic. The ``docoffset`` is the offset of
         """Returns a key for the given **segment-relative** document number.
         """
         
-        raise NotImplementedError
+        raise NotImplementedError(self.__class__)
+    
+    def keys_for_id(self, docid):
+        """Yields a series of keys for the given **segment-relative** document
+        number. This method will be called instead of ``key_for_id`` if
+        ``Categorizer.allow_overlap==True``.
+        """
+        
+        raise NotImplementedError(self.__class__)
     
     def key_to_name(self, key):
         """Returns a representation of the key to be used as a dictionary key
         return key
     
 
-class ScoreFacet(FacetType):
-    """Uses a document's score as a sorting criterion.
-    
-    For example, to sort by the ``tag`` field, and then within that by relative
-    score::
-    
-        tag_score = MultiFacet(["tag", ScoreFacet()])
-        results = searcher.search(myquery, sortedby=tag_score)
-    """
-    
-    def categorizer(self, searcher):
-        return self.ScoreCategorizer(searcher)
-    
-    class ScoreCategorizer(Categorizer):
-        def __init__(self, searcher):
-            w = searcher.weighting
-            self.use_final = w.use_final
-            if w.use_final:
-                self.final = w.final
-        
-        def set_searcher(self, searcher, offset):
-            self.searcher = searcher
-    
-        def key_for_matcher(self, matcher):
-            score = matcher.score()
-            if self.use_final:
-                score = self.final(self.searcher, matcher.id(), score)
-            # Negate the score so higher values sort first
-            return 0 - score
-
-
-class FunctionFacet(FacetType):
-    """Lets you pass an arbitrary function that will compute the key. This may
-    be easier than subclassing FacetType and Categorizer to set up the desired
-    behavior.
-    
-    The function is called with the arguments ``(searcher, docid)``, where the
-    ``searcher`` may be a composite searcher, and the ``docid`` is an absolute
-    index document number (not segment-relative).
-    
-    For example, to use the number of words in the document's "content" field
-    as the sorting/faceting key::
-    
-        fn = lambda s, docid: s.doc_field_length(docid, "content")
-        lengths = FunctionFacet(fn)
-    """
-    
-    def __init__(self, fn):
-        self.fn = fn
-    
-    def categorizer(self, searcher):
-        return self.FunctionCategorizer(searcher, self.fn)
-    
-    class FunctionCategorizer(Categorizer):
-        def __init__(self, searcher, fn):
-            self.searcher = searcher
-            self.fn = fn
-        
-        def set_searcher(self, searcher, docoffset):
-            self.offset = docoffset
-        
-        def key_for_id(self, docid):
-            return self.fn(self.searcher, docid + self.offset)
-
-
 class FieldFacet(FacetType):
     """Sorts/facest by the contents of a field.
     
     This facet returns different categorizers based on the field type.
     """
     
-    def __init__(self, fieldname, reverse=False):
+    def __init__(self, fieldname, reverse=False, allow_overlap=False):
         """
         :param fieldname: the name of the field to sort/facet on.
         :param reverse: if True, when sorting, reverse the sort order of this
             facet.
+        :param allow_overlap: if True, when grouping, allow documents to appear
+            in multiple groups when they have multiple terms in the field.
         """
         
         self.fieldname = fieldname
         self.reverse = reverse
+        self.allow_overlap = allow_overlap
     
     def categorizer(self, searcher):
         from whoosh.fields import NUMERIC, DATETIME
         # actual key functions will always be called per-segment following a
         # Categorizer.set_searcher method call
         fieldname = self.fieldname
-        schema = searcher.schema
+        field = None
+        if fieldname in searcher.schema:
+            field = searcher.schema[fieldname]
         
-        if fieldname in schema and isinstance(schema[fieldname], DATETIME):
+        if self.allow_overlap:
+            return self.OverlappingFieldCategorizer(fieldname)
+        
+        elif isinstance(field, DATETIME):
             # Return a subclass of NumericFieldCategorizer that formats dates
-            return DateFieldCategorizer(fieldname, self.reverse)
+            return self.DateFieldCategorizer(fieldname, self.reverse)
         
-        elif fieldname in schema and isinstance(schema[fieldname], NUMERIC):
+        elif isinstance(field, NUMERIC):
             # Numeric fields are naturally reversible
-            return NumericFieldCategorizer(fieldname, self.reverse)
+            return self.NumericFieldCategorizer(fieldname, self.reverse)
         
         elif self.reverse:
             # If we need to "reverse" a string field, we need to do more work
-            return RevFieldCategorizer(searcher, fieldname, self.reverse)
+            return self.RevFieldCategorizer(searcher, fieldname, self.reverse)
         
         else:
             # Straightforward: use the field cache to sort/categorize
-            return FieldCategorizer(fieldname)
+            return self.FieldCategorizer(fieldname)
 
+    class FieldCategorizer(Categorizer):
+        """Categorizer for regular, unreversed fields. Just uses the
+        fieldcache to get the keys.
+        """
+        
+        def __init__(self, fieldname):
+            self.fieldname = fieldname
+        
+        def set_searcher(self, searcher, docoffset):
+            self.fieldcache = searcher.reader().fieldcache(self.fieldname)
+        
+        def key_for_id(self, docid):
+            return self.fieldcache.key_for(docid)
+        
+        def key_to_name(self, key):
+            if key == u('\uFFFF'):
+                return None
+            else:
+                return key
+    
+    class NumericFieldCategorizer(Categorizer):
+        """Categorizer for numeric fields, which are naturally reversible.
+        """
+        
+        def __init__(self, fieldname, reverse):
+            self.fieldname = fieldname
+            self.reverse = reverse
+        
+        def set_searcher(self, searcher, docoffset):
+            self.default = searcher.schema[self.fieldname].sortable_default()
+            self.fieldcache = searcher.reader().fieldcache(self.fieldname)
+        
+        def key_for_id(self, docid):
+            value = self.fieldcache.key_for(docid)
+            if self.reverse:
+                return 0 - value
+            else:
+                return value
+        
+        def key_to_name(self, key):
+            if key == self.default:
+                return None
+            else:
+                return key
 
-class FieldCategorizer(Categorizer):
-    def __init__(self, fieldname):
-        self.fieldname = fieldname
+    class DateFieldCategorizer(NumericFieldCategorizer):
+        """Categorizer for date fields. Same as NumericFieldCategorizer, but
+        converts the numeric keys back to dates for better labels.
+        """
+        
+        def key_to_name(self, key):
+            if key == DEFAULT_LONG:
+                return None
+            else:
+                return long_to_datetime(key)
     
-    def set_searcher(self, searcher, docoffset):
-        self.fieldcache = searcher.reader().fieldcache(self.fieldname)
-    
-    def key_for_id(self, docid):
-        return self.fieldcache.key_for(docid)
-
-
-class NumericFieldCategorizer(Categorizer):
-    def __init__(self, fieldname, reverse):
-        self.fieldname = fieldname
-        self.reverse = reverse
-    
-    def set_searcher(self, searcher, docoffset):
-        self.default = searcher.schema[self.fieldname].sortable_default()
-        self.fieldcache = searcher.reader().fieldcache(self.fieldname)
-    
-    def key_for_id(self, docid):
-        value = self.fieldcache.key_for(docid)
-        if self.reverse:
-            return 0 - value
-        else:
-            return value
-    
-    def key_to_name(self, key):
-        if key == self.default:
-            return None
-        else:
-            return key
-
-
-class DateFieldCategorizer(NumericFieldCategorizer):
-    def key_to_name(self, key):
-        if key == DEFAULT_LONG:
-            return None
-        else:
-            return long_to_datetime(key)
-
-
-class RevFieldCategorizer(Categorizer):
-    def __init__(self, reader, fieldname, reverse):
-        # Cache the relative positions of all docs with the given field
-        # across the entire index
-        dc = reader.doc_count_all()
-        arry = array("i", [0] * dc)
-        field = self.searcher.schema[fieldname]
-        for i, (t, _) in enumerate(field.sortable_values(reader, fieldname)):
-            if reverse:
-                i = 0 - i
-            postings = reader.postings(fieldname, t)
-            for docid in postings.all_ids():
-                arry[docid] = i
-        self.array = arry
+    class RevFieldCategorizer(Categorizer):
+        """Categorizer for reversed fields. Since keys for non-numeric fields
+        are arbitrary data, it's not possible to "negate" them to reverse the
+        sort order. So, this object builds an array caching the order of
+        all documents according to the field, then uses the cached order as a
+        numeric key.
+        """
         
-    def set_searcher(self, searcher, docoffset):
-        self.searcher = searcher
-        self.docoffset = docoffset
-    
-    def key_for_id(self, docid):
-        return self.array[docid + self.docoffset]
+        def __init__(self, reader, fieldname, reverse):
+            # Cache the relative positions of all docs with the given field
+            # across the entire index
+            dc = reader.doc_count_all()
+            arry = array("i", [dc + 1] * dc)
+            field = reader.schema[fieldname]
+            for i, (t, _) in enumerate(field.sortable_values(reader, fieldname)):
+                if reverse:
+                    i = dc - i
+                postings = reader.postings(fieldname, t)
+                for docid in postings.all_ids():
+                    arry[docid] = i
+            self.array = arry
+            
+        def set_searcher(self, searcher, docoffset):
+            self.searcher = searcher
+            self.docoffset = docoffset
+        
+        def key_for_id(self, docid):
+            return self.array[docid + self.docoffset]
+        
+    class OverlappingFieldCategorizer(Categorizer):
+        """Categorizer for fields with multiple terms per document. Allows
+        a document to appear in every group it has a term for, so groups
+        may overlap.
+        """
+        
+        allow_overlap = True
+        
+        def __init__(self, fieldname):
+            self.fieldname = fieldname
+        
+        def set_searcher(self, searcher, docoffset):
+            fieldname = self.fieldname
+            dc = searcher.doc_count_all()
+            field = searcher.schema[fieldname]
+            reader = searcher.reader()
+            
+            self.lists = [[] for _ in xrange(dc)]
+            for t, _ in field.sortable_values(reader, fieldname):
+                postings = reader.postings(fieldname, t)
+                for docid in postings.all_ids():
+                    self.lists[docid].append(t)
+        
+        def keys_for_id(self, docid):
+            return self.lists[docid] or None
+        
+        def key_for_id(self, docid):
+            ls = self.lists[docid]
+            if ls:
+                return ls[0]
+            else:
+                return None
 
 
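The overlap support above amounts to inverting a per-document term list into per-term document groups; `OverlappingFieldCategorizer` builds exactly such lists from the postings. A standalone sketch of that inversion (plain Python; the data is illustrative, matching the `test_overlapping_facet` test in this commit):

```python
from collections import defaultdict

def overlapping_groups(doc_terms):
    # Invert docid -> [terms] into term -> [docids]; a document lands
    # in every group it has a term for, so the groups may overlap.
    groups = defaultdict(list)
    for docid, terms in enumerate(doc_terms):
        for t in terms:
            groups[t].append(docid)
    return dict(groups)

docs = [["alfa", "bravo", "charlie"],
        ["bravo", "charlie", "delta"],
        ["charlie", "delta", "echo"],
        ["delta", "echo", "alfa"],
        ["echo", "alfa", "bravo"]]
# overlapping_groups(docs)["alfa"] -> [0, 3, 4]
```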
 class QueryFacet(FacetType):
     """Sorts/facets based on the results of a series of queries.
     """
     
-    def __init__(self, querydict, other=None):
+    def __init__(self, querydict, other=None, allow_overlap=False):
         """
         :param querydict: a dictionary mapping keys to
             :class:`whoosh.query.Query` objects.
-        return self.QueryCategorizer(self.querydict, self.other)
+        return self.QueryCategorizer(self.querydict, self.other,
+                                     self.allow_overlap)
     
     class QueryCategorizer(Categorizer):
-        def __init__(self, querydict, other):
+        def __init__(self, querydict, other, allow_overlap=False):
             self.querydict = querydict
             self.other = other
+            self.allow_overlap = allow_overlap
             
         def set_searcher(self, searcher, offset):
             self.docsets = {}
                 if docid in self.docsets[qname]:
                     return qname
             return self.other
+        
+        def keys_for_id(self, docid):
+            found = False
+            for qname in self.docsets:
+                if docid in self.docsets[qname]:
+                    yield qname
+                    found = True
+            if not found:
+                yield None
 
 
 class RangeFacet(QueryFacet):
     
         prices = RangeFacet("price", 0, 1000, 100)
         results = searcher.search(myquery, groupedby=prices)
+        
+    The ranges/buckets are always **inclusive** at the start and **exclusive**
+    at the end.
     """
     
     def __init__(self, fieldname, start, end, gap, hardend=False):
         gap = timedelta(days=365)
         bdays = RangeFacet("birthday", startdate, enddate, gap)
         results = searcher.search(myquery, groupedby=bdays)
+        
+    The ranges/buckets are always **inclusive** at the start and **exclusive**
+    at the end.
     """
     
     def __init__(self, fieldname, startdate, enddate, delta, hardend=False):
     
     def _range_name(self, startval, endval):
         return (long_to_datetime(startval), long_to_datetime(endval))
+
+
+class ScoreFacet(FacetType):
+    """Uses a document's score as a sorting criterion.
+    
+    For example, to sort by the ``tag`` field, and then within that by relative
+    score::
+    
+        tag_score = MultiFacet(["tag", ScoreFacet()])
+        results = searcher.search(myquery, sortedby=tag_score)
+    """
+    
+    def categorizer(self, searcher):
+        return self.ScoreCategorizer(searcher)
+    
+    class ScoreCategorizer(Categorizer):
+        requires_matcher = True
+        
+        def __init__(self, searcher):
+            w = searcher.weighting
+            self.use_final = w.use_final
+            if w.use_final:
+                self.final = w.final
+        
+        def set_searcher(self, searcher, offset):
+            self.searcher = searcher
+    
+        def key_for_matcher(self, matcher):
+            score = matcher.score()
+            if self.use_final:
+                score = self.final(self.searcher, matcher.id(), score)
+            # Negate the score so higher values sort first
+            return 0 - score
+
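The score negation above exists because sort keys ascend; negating puts the highest-scoring documents first when a score facet is combined with another key, as in ``MultiFacet(["tag", ScoreFacet()])``. The effect in miniature, on hypothetical hit data:

```python
# (tag, score) pairs standing in for matched documents.
hits = [("alfa", 0.5), ("bravo", 0.9), ("alfa", 0.8), ("bravo", 0.2)]

# Ascending sort on (tag, -score): tags alphabetical, and within each
# tag the highest-scoring hit first -- the same trick as key_for_matcher.
ranked = sorted(hits, key=lambda hit: (hit[0], 0 - hit[1]))
```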
+
+class FunctionFacet(FacetType):
+    """Lets you pass an arbitrary function that will compute the key. This may
+    be easier than subclassing FacetType and Categorizer to set up the desired
+    behavior.
+    
+    The function is called with the arguments ``(searcher, docid)``, where the
+    ``searcher`` may be a composite searcher, and the ``docid`` is an absolute
+    index document number (not segment-relative).
+    
+    For example, to use the number of words in the document's "content" field
+    as the sorting/faceting key::
+    
+        fn = lambda s, docid: s.doc_field_length(docid, "content")
+        lengths = FunctionFacet(fn)
+    """
+    
+    def __init__(self, fn):
+        self.fn = fn
+    
+    def categorizer(self, searcher):
+        return self.FunctionCategorizer(searcher, self.fn)
+    
+    class FunctionCategorizer(Categorizer):
+        def __init__(self, searcher, fn):
+            self.searcher = searcher
+            self.fn = fn
+        
+        def set_searcher(self, searcher, docoffset):
+            self.offset = docoffset
+        
+        def key_for_id(self, docid):
+            return self.fn(self.searcher, docid + self.offset)
     
 
 class MultiFacet(FacetType):
         self.facets.append(FieldFacet(fieldname, reverse=reverse))
         return self
     
-    def add_query(self, querydict, other="none"):
-        self.facets.append(QueryFacet(querydict, other=other))
+    def add_query(self, querydict, other=None, allow_overlap=False):
+        self.facets.append(QueryFacet(querydict, other=other,
+                                      allow_overlap=allow_overlap))
+        return self
+    
+    def add_score(self):
+        self.facets.append(ScoreFacet())
         return self
     
     def add_facet(self, facet):
         def __init__(self, catters):
             self.catters = catters
         
+        @property
+        def requires_matcher(self):
+            return any(c.requires_matcher for c in self.catters)
+        
         def set_searcher(self, searcher, docoffset):
             for catter in self.catters:
                 catter.set_searcher(searcher, docoffset)
         
         return self.facets.items()
     
-    def add_field(self, fieldname):
+    def add_field(self, fieldname, allow_overlap=False):
         """Adds a :class:`FieldFacet` for the given field name (the field name is
         automatically used as the facet name).
         """
-        self.facets[fieldname] = FieldFacet(fieldname)
+        self.facets[fieldname] = FieldFacet(fieldname,
+                                            allow_overlap=allow_overlap)
         return self
     
-    def add_query(self, name, querydict, other="none"):
+    def add_query(self, name, querydict, other=None, allow_overlap=False):
         """Adds a :class:`QueryFacet` under the given ``name``.
         
         :param name: a name for the facet.
             :class:`whoosh.query.Query` objects.
         """
         
-        self.facets[name] = QueryFacet(querydict, other=other)
+        self.facets[name] = QueryFacet(querydict, other=other,
+                                       allow_overlap=allow_overlap)
         return self
     
     def add_facet(self, name, facet):

tests/test_sorting.py

 from __future__ import with_statement
+from datetime import datetime, timedelta
 import random
 
 from nose.tools import assert_equal  #@UnresolvedImport
         
         assert not r3.fieldcache_loaded("id")
         r3.close()
+
+@skip_if_unavailable("multiprocessing")
+@skip_if(lambda: True)
+def test_mp_fieldcache():
+    schema = fields.Schema(key=fields.KEYWORD(stored=True))
+    with TempIndex(schema, "mpfieldcache") as ix:
+        domain = list(u("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))
+        random.shuffle(domain)
+        w = ix.writer()
+        for char in domain:
+            w.add_document(key=char)
+        w.commit()
         
+        tasks = [MPFCTask(ix.storage, ix.indexname) for _ in xrange(4)]
+        for task in tasks:
+            task.start()
+        for task in tasks:
+            task.join()
+
 def test_sortedby():
     try_sort("id", lambda d: d["id"])
     try_sort("id", lambda d: d["id"], limit=5)
     
     with ix.searcher() as s:
         r = s.search(query.Every(), groupedby="tag")
-        assert_equal(r.groups("tag"), {'': [2, 4], 'bravo': [3], 'alfa': [0, 1]})
+        assert_equal(r.groups("tag"), {None: [2, 4], 'bravo': [3], 'alfa': [0, 1]})
 
 def test_missing_numeric_facet():
     schema = fields.Schema(id=fields.STORED, tag=fields.NUMERIC)
         assert_equal(r.groups("tag"), {None: [2, 4], 0: [3], 1: [0, 1]})
 
 def test_date_facet():
-    from datetime import datetime
-    
     schema = fields.Schema(id=fields.STORED, date=fields.DATETIME)
     ix = RamStorage().create_index(schema)
     w = ix.writer()
                                        (9, 12): [9]})
 
 def test_daterange_facet():
-    from datetime import datetime, timedelta
-    
     schema = fields.Schema(id=fields.STORED, date=fields.DATETIME)
     ix = RamStorage().create_index(schema)
     w = ix.writer()
                                         (dt(2001, 1, 11, 0, 0), dt(2001, 1, 16, 0, 0)): [0],
                                         None: [2]})
 
-@skip_if_unavailable("multiprocessing")
-@skip_if(lambda: True)
-def test_mp_fieldcache():
-    schema = fields.Schema(key=fields.KEYWORD(stored=True))
-    with TempIndex(schema, "mpfieldcache") as ix:
-        domain = list(u("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))
-        random.shuffle(domain)
-        w = ix.writer()
-        for char in domain:
-            w.add_document(key=char)
-        w.commit()
-        
-        tasks = [MPFCTask(ix.storage, ix.indexname) for _ in xrange(4)]
-        for task in tasks:
-            task.start()
-        for task in tasks:
-            task.join()
-
+def test_overlapping_facet():
+    schema = fields.Schema(id=fields.STORED, tags=fields.KEYWORD)
+    ix = RamStorage().create_index(schema)
+    w = ix.writer()
+    w.add_document(id=0, tags=u("alfa bravo charlie"))
+    w.add_document(id=1, tags=u("bravo charlie delta"))
+    w.add_document(id=2, tags=u("charlie delta echo"))
+    w.add_document(id=3, tags=u("delta echo alfa"))
+    w.add_document(id=4, tags=u("echo alfa bravo"))
+    w.commit()
+    
+    with ix.searcher() as s:
+        of = sorting.FieldFacet("tags", allow_overlap=True)
+        r = s.search(query.Every(), groupedby={"tags": of})
+        assert_equal(r.groups("tags"), {'alfa': [0, 3, 4], 'bravo': [0, 1, 4],
+                                        'charlie': [0, 1, 2], 'delta': [1, 2, 3],
+                                        'echo': [2, 3, 4]})
+
 def test_field_facets():
     def check(method):
         with TempIndex(get_schema()) as ix:
             method(ix)
             with ix.searcher() as s:
-                q = query.Every()
-                groups = s.categorize_query(q, "tag")
+                results = s.search(query.Every(), groupedby="tag")
+                groups = results.groups("tag")
                 assert (sorted(groups.items())
-                        == [(u('one'), [long_type(0), long_type(6)]), (u('three'),
-                            [long_type(1), long_type(3),
-                             long_type(7), long_type(8)]),
-                            (u('two'), [long_type(2), long_type(4), long_type(5)])])
+                        == [(u('one'), [0, 6]),
+                            (u('three'), [1, 3, 7, 8]),
+                            (u('two'), [2, 4, 5])])
     
     check(make_single_index)
     check(make_multi_index)
                 assert_equal(groups, {'a-i': u('abcdefghi'), 'j-r': u('jklmnopqr'),
                                       's-z': u('stuvwxyz')})
             
-            check(s.categorize_query(query.Every(), "range"))
+            check(s.search(query.Every(), groupedby="range").groups("range"))
 
-            r = s.search(query.Every(), groupedby="range")
-            check(r.groups("range"))
-            
         with ix.searcher() as s:
             assert not s.reader().fieldcache_available("range")
 
                    (u('bravo'), u('small')): [3]}
         
         with ix.searcher() as s:
-            cats = s.categorize_query(query.Every(), ("tag", "size"))
-            assert_equal(cats, correct)
-            
-            from whoosh.sorting import MultiFacet
-            
-            facet = MultiFacet(["tag", "size"])
+            facet = sorting.MultiFacet(["tag", "size"])
             r = s.search(query.Every(), groupedby={"tag/size" : facet})
             cats = r.groups(("tag/size"))
             assert_equal(cats, correct)