Matt Chaput committed 357c255 Merge

Merge with flexisort branch.


Files changed (21)

docs/source/facets.rst

-=====================================
-How to present faceted search results
-=====================================
+====================
+Sorting and faceting
+====================
 
 .. note::
-    The API for sorting and faceting changed in Whoosh 1.5.
+    The API for sorting and faceting changed in Whoosh 2.0.
 
 Overview
 ========
 
+Sorting and faceting search results in Whoosh is based on **facets**. Each
+facet associates a value with each document in the search results, allowing
+you to sort by the keys or use them to group the documents. Whoosh includes a
+variety of **facet types** you can use for sorting and grouping (see below).
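+
+For example, a minimal sketch (assuming an existing searcher and an index
+with a "size" field)::
+
+    # Sort the results by the indexed values of the "size" field
+    results = searcher.search(myquery, sortedby="size")
+
+    # Group the results by the values of the "size" field
+    results = searcher.search(myquery, groupedby="size")
+    print results.groups("size")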
+
+
+Sorting
+=======
+
+By default, the results of a search are sorted with the highest-scoring
+documents first. You can use the ``sortedby`` keyword argument to order the
+results by some other criteria instead, such as the value of a field.
+
+
+A note about sorting by fields
+------------------------------
+
+When sorting by fields, the **indexed terms** in the field are used as the
+"value" to sort by, **not** the stored values. For example, take this index::
+
+    schema = fields.Schema(title=fields.TEXT(stored=True))
+    ix = index.create_in("indexdir", schema)
+    w = ix.writer()
+    w.add_document(title=u"Best Bet")
+    w.add_document(title=u"First Action")
+    w.commit()
+    
+If you sort this index by "title", you might expect the results to be
+"Best Bet" followed by "First Action", but in fact it will be the reverse! This
+is because Whoosh is sorting by **terms**, not the original text you indexed.
+For fields with multiple terms, it just picks the (alphabetically) first one,
+so the document containing "action" sorts before the one with "best".
+
+For this reason, you don't want to sort by TEXT fields. Instead, you should
+set up separate, single-term fields to sort by. You can duplicate content
+if you want to be able to sort by the original value of a TEXT field::
+
+    schema = fields.Schema(title=fields.TEXT(stored=True),
+                           sort_title=fields.ID)
+    ix = index.create_in("indexdir", schema)
+    w = ix.writer()
+    for title in titles:
+        w.add_document(title=title, sort_title=title)
+    w.commit()
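+
+You can then sort on the single-term field while still displaying the stored
+original (a sketch continuing the example above)::
+
+    results = searcher.search(myquery, sortedby="sort_title")
+    for hit in results:
+        print hit["title"]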
+
+
+The sortedby keyword argument
+-----------------------------
+
+You can use the following objects as ``sortedby`` values:
+
+A ``FacetType`` object
+    Uses this object to sort the documents. See below for the available facet
+    types.
+
+A field name string
+    Converts the field name into a ``FieldFacet`` (see below) and uses it to
+    sort the documents.
+
+A list of ``FacetType`` objects and/or field name strings
+    Bundles the facets together into a ``MultiFacet`` so you can sort by
+    multiple keys. Note that this shortcut does not allow you to reverse
+    the sort direction of individual facets. To do that, you need to construct
+    the ``MultiFacet`` object yourself.
+
+.. note::
+    You can use the ``reverse=True`` keyword argument to the
+    ``Searcher.search()`` method to reverse the overall sort direction. This
+    is more efficient than reversing each individual facet.
+
+
+Examples
+--------
+
+Sort by the value of the size field::
+
+    results = searcher.search(myquery, sortedby="size")
+    
+Sort by the reverse (highest-to-lowest) order of the "price" field::
+
+    facet = sorting.FieldFacet("price", reverse=True)
+    results = searcher.search(myquery, sortedby=facet)
+
+Sort by ascending size and then descending price::
+
+    mf = sorting.MultiFacet()
+    mf.add_field("size")
+    mf.add_field("price", reverse=True)
+    results = searcher.search(myquery, sortedby=mf)
+    
+    # or...
+    sizes = sorting.FieldFacet("size")
+    prices = sorting.FieldFacet("price", reverse=True)
+    results = searcher.search(myquery, sortedby=[sizes, prices])
+    
+Sort by the "category" field, then by the document's score::
+
+    cats = sorting.FieldFacet("category")
+    scores = sorting.ScoreFacet()
+    results = searcher.search(myquery, sortedby=[cats, scores])
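+
+Sort by "size" as above, but reverse the overall sort direction with the
+``reverse`` keyword argument (see the note above)::
+
+    results = searcher.search(myquery, sortedby="size", reverse=True)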
+
+
+Grouping
+========
+
 It is often very useful to present "faceted" search results to the user.
-Faceting is dynamic clustering of search results into categories. The
+Faceting is dynamic grouping of search results into categories. The
 categories let users view a slice of the total results based on the category
 they're interested in, similarly to how the Spotlight quick results work on
 Mac OS X.
 
-.. tip::
-    Whoosh currently only supports **non-overlapping** categories. A document
-    cannot belong to multiple facets at the same time. (It is not an error if
-    the facets overlap; each document will simply be sorted into one category
-    arbitrarily.)
 
-Faceting relies on field caches. See :doc:`fieldcaches` for information about
-field caches.
+The groupedby keyword argument
+------------------------------
 
+You can use the following objects as ``groupedby`` values:
 
-Categorizing search results by field
-====================================
+A ``FacetType`` object
+    Uses this object to group the documents. See below for the available facet
+    types. The facet name will automatically be set to ``"facet"``.
 
-When you use the :meth:`Searcher.search` method, add the `groups` keyword
-argument to specify how to categorize the results::
+A field name string
+    Converts the field name into a ``FieldFacet`` (see below) and uses it to
+    group the documents. The name of the field is used as the facet name.
 
-    # Group by the value of the "tag" field
-    results = searcher.search(my_query, groupedby=["tag"])
+A dictionary mapping facet names to FacetType objects
+    Sets up multiple grouping criteria.
+
+A ``Facets`` object
+    This object is a lot like using a dictionary, but has some convenience
+    methods to make setting up multiple groupings a little easier.
+
+Examples
+--------
+
+Group by the value of the "category" field::
+
+    results = searcher.search(myquery, groupedby="category")
     
-    # Retrieve the groups from the results object
-    groups = results.groups("tag")
+Group by the value of the "category" field and also by the value of the "tags"
+field and a date range::
 
-The ``groups()`` method simply returns a dictionary mapping category names
-to lists of **document IDs**. The document IDs will be in their relative
-order from the original results.
+    cats = sorting.FieldFacet("category")
+    tags = sorting.FieldFacet("tags", allow_overlap=True)
+    results = searcher.search(myquery, groupedby={"cats": cats, "tags": tags})
+    
+    # ...or use a Facets object, which has a little less duplication
+    facets = sorting.Facets()
+    facets.add_field("category")
+    facets.add_field("tags", allow_overlap=True)
+    results = searcher.search(myquery, groupedby=facets)
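+
+If you pass a bare ``FacetType`` object as ``groupedby``, the facet name
+defaults to ``"facet"`` (as noted above), so you retrieve the groups under
+that name (a sketch)::
+
+    facet = sorting.FieldFacet("category")
+    results = searcher.search(myquery, groupedby=facet)
+    groups = results.groups("facet")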
 
-    {"small": [5, 1, 4, 8, 2], "medium": [3, 0, 6], "large": [9, 7]}
+To group results by the *intersected values of multiple fields*, use a
+``MultiFacet`` object (see below). For example, if you have two fields named
+``tag`` and ``size``, you could group the results by all combinations of the
+``tag`` and ``size`` fields, such as ``('tag1', 'small')``,
+``('tag2', 'small')``, ``('tag1', 'medium')``, and so on::
 
-The last thing you need to know is how to translate document numbers into
-something you can display. The ``Searcher`` object's ``stored_fields()``
-method takes a document number and returns the document's stored fields as a
-dictionary::
+    # Generate a grouping from the combination of the "tag" and "size" fields
+    mf = sorting.MultiFacet("tag", "size")
+    results = searcher.search(myquery, groupedby={"tag/size": mf})
+
+
+Getting the faceted groups
+--------------------------
+
+The ``Results.groups("facetname")`` method returns a dictionary mapping
+category names to lists of **document IDs**, for example::
+
+    {"small": [1, 2, 4, 5, 8], "medium": [0, 3, 6], "large": [7, 9]}
+
+The ``Searcher`` object's ``stored_fields()`` method takes a document number
+and returns the document's stored fields as a dictionary::
 
     groups = results.groups("facetname")
     for category_name in groups:
         doclist = groups[category_name]
         print "Top 5 documents in the %s category" % category_name
         for docnum in doclist[:5]:
             print "  ", searcher.stored_fields(docnum)
         if len(doclist) > 5:
             print "  (%s more)" % (len(doclist) - 5)
 
-You can use the categories dictionary and ``stored_fields()`` to display the
-categories and results any way you want in your application.
 
+If you just want to **count** the number of documents in each group, instead of
+generating a full list of the documents, use the ``groupids=False`` keyword
+argument::
 
-Getting multiple categorizations of results
-===========================================
+    results = searcher.search(myquery, groupedby="size")
+    groups = results.groups("size")
+    # {"small": 5, "medium": 3, "large": 2}
 
 To generate multiple groupings, you can name multiple fields in the list you
 pass to the ``groupedby`` keyword::
 
     # Generate separate groupings for the "tag" and "size" fields
-    results = searcher.search(my_query, groupedby=["tag", "size"])
+    results = searcher.search(myquery, groupedby=["tag", "size"])
     
     # Get each grouping separately from the results
     tag_groups = results.groups("tag")
     size_groups = results.groups("size")
 
 
-Categorizing by multiple fields
-===============================
+Facet types
+===========
 
-To group results by the *combined values of multiple fields*, use a tuple of
-field names instead of a field name. For example, if you have two fields named
-``tag`` and ``size``, you could group the results by all combinations of the
-``tag`` and ``size`` field, such as ``('tag1', 'small')``,
-``('tag2', 'small')``, ``('tag1', 'medium')``, and so on::
+FieldFacet
+----------
 
-    # Generate a grouping from the combination of the "tag" and "size" fields
-    results = searcher.search(my_query, groupedby=[("tag", "size")])
+This is the most common facet type. It sorts or groups based on the
+value in a certain field in each document. This generally works best
+(or at all) if each document has only one term in the field (e.g. an ID
+field)::
+
+    # Sort search results by the value of the "path" field
+    facet = sorting.FieldFacet("path")
+    results = searcher.search(myquery, sortedby=facet)
     
-    groups = results.groups(("tag", "size"))
+    # Group search results by the value of the "parent" field
+    facet = sorting.FieldFacet("parent")
+    results = searcher.search(myquery, groupedby=facet)
+    parent_groups = results.groups("parent")
 
+By default, ``FieldFacet`` only supports **non-overlapping** grouping, where a
+document cannot belong to multiple groups at the same time (each document will
+be sorted into one category arbitrarily). To get overlapping groups with
+multi-valued fields, use the ``allow_overlap=True`` keyword argument::
 
-Categorizing based on custom queries
-====================================
+    facet = sorting.FieldFacet(fieldname, allow_overlap=True)
 
-If you need more complex categories, you can set up categories defined by
-queries. For example, you can create price categories using range queries::
+This supports overlapping group membership where documents have more than one
+term in a field (e.g. KEYWORD fields). If you don't need overlapping, don't
+use ``allow_overlap`` because it's slower and less memory-efficient.
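+
+For example, with a KEYWORD field named "tags", a document tagged with both
+"red" and "blue" would appear in both the "red" and "blue" groups (a sketch)::
+
+    facet = sorting.FieldFacet("tags", allow_overlap=True)
+    results = searcher.search(myquery, groupedby={"tags": facet})
+    groups = results.groups("tags")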
+
+
+QueryFacet
+----------
+
+You can set up categories defined by arbitrary queries. For example, you can
+group names into alphabetical ranges using term-range queries::

     # Use queries to define each category
     # (Here I'll assume "name" is an ID field, so I'll use
     # TermRange)
-    categories = {}
-    category["$0 - $100"] = query.NumericRange("price", 0, 100)
-    category["$101 - $500"] = query.NumericRange("price", 101, 500)
-    category["$501 - $1000"] = query.NumericRange("price", 501, 1000)
+    qdict = {}
+    qdict["A-D"] = query.TermRange("name", "a", "d")
+    qdict["E-H"] = query.TermRange("name", "e", "h")
+    qdict["I-L"] = query.TermRange("name", "i", "l")
+    # ...
     
-    # Define the facets on the searcher. If save=True, the cached
-    # facets will be saved to disk for future use. Use save=False to
-    # avoid this for one-off queries.
-    searcher.define_facets("pricerange", categories, save=False)
+    qfacet = sorting.QueryFacet(qdict)
+    r = searcher.search(myquery, groupedby={"firstltr": qfacet})
+    
+By default, ``QueryFacet`` only supports **non-overlapping** grouping, where a
+document cannot belong to multiple groups at the same time (each document will
+be sorted into one category arbitrarily). To get overlapping groups when a
+document may match more than one query, use the ``allow_overlap=True`` keyword
+argument::
 
-Now you can use ``pricerange`` as if it was the name of a field for the
-purposes of grouping and sorting::
+    facet = sorting.QueryFacet(querydict, allow_overlap=True)
 
-    r = searcher.search(my_query, groupedby=["princerange"])
 
+RangeFacet
+----------
 
+The ``RangeFacet`` is for NUMERIC field types. It divides a range of possible
+values into groups. For example, to group documents based on price into
+buckets $100 "wide"::
 
+    pricefacet = sorting.RangeFacet("price", 0, 1000, 100)
+    
+The first argument is the name of the field. The next two arguments are the
+full range to be divided. Values outside this range (in this example, values
+below 0 and above 1000) will be sorted into the "missing" (None) group. The
+fourth argument is the "gap size", the size of the divisions in the range.
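+
+As with other facet types, you can pass a ``RangeFacet`` to ``groupedby`` and
+retrieve the buckets from the results (a sketch continuing the example
+above)::
+
+    results = searcher.search(myquery, groupedby={"price": pricefacet})
+    groups = results.groups("price")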
 
+The "gap" can be a list instead of a single value. In that case, the values in
+the list will be used to set the size of the initial divisions, with the last
+value in the list being the size for all subsequent divisions. For example::
 
+    pricefacet = sorting.RangeFacet("price", 0, 1000, [5, 10, 35, 50])
+    
+...will set up divisions of 0-5, 5-15, 15-50, 50-100, and then use 50 as the
+size for all subsequent divisions (i.e. 100-150, 150-200, and so on).
 
+The ``hardend`` keyword argument controls whether the last division is clamped
+to the end of the range or allowed to go past the end of the range. For
+example, this::
 
+    facet = sorting.RangeFacet("num", 0, 10, 4, hardend=False)
+    
+...gives divisions 0-4, 4-8, and 8-12, while this::
 
+    facet = sorting.RangeFacet("num", 0, 10, 4, hardend=True)
+    
+...gives divisions 0-4, 4-8, and 8-10. (The default is ``hardend=False``.)
 
+.. note::
+    The ranges/buckets are always **inclusive** at the start and **exclusive**
+    at the end.
 
+
+DateRangeFacet
+--------------
+
+This is like ``RangeFacet`` but for DATETIME fields. The start and end values
+must be ``datetime.datetime`` objects, and the gap(s) must be
+``datetime.timedelta`` objects.
+
+For example::
+
+    from datetime import datetime, timedelta
+
+    start = datetime(2000, 1, 1)
+    end = datetime.now()
+    gap = timedelta(days=365)
+    bdayfacet = sorting.DateRangeFacet("birthday", start, end, gap)
+
+As with ``RangeFacet``, you can use a list of gaps and the ``hardend`` keyword
+argument.
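+
+For example, reusing ``start`` and ``end`` from above, this uses a list of
+gaps so the initial divisions are finer than the later ones (a sketch; the
+intervals are illustrative)::
+
+    gaps = [timedelta(days=30), timedelta(days=90), timedelta(days=365)]
+    facet = sorting.DateRangeFacet("birthday", start, end, gaps)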
+
+
+ScoreFacet
+----------
+
+This facet lets you use a document's relevance score as a sorting key, which
+is mainly useful in combination with other facets.
+
+For example, to sort by the "category" field, then for documents with the same
+category, sort by the document's score::
+
+    cats = sorting.FieldFacet("category")
+    scores = sorting.ScoreFacet()
+    results = searcher.search(myquery, sortedby=[cats, scores])
+
+The ``ScoreFacet`` always sorts higher scores before lower scores.
+
+.. note::
+    While using ``sortedby=ScoreFacet()`` should give the same results as using
+    the default scored ordering (``sortedby=None``), using the facet will be
+    slower because Whoosh automatically turns off many optimizations when
+    sorting.
+
+
+FunctionFacet
+-------------
+
+This facet lets you pass a custom function to compute the sorting/grouping key
+for documents. (Using this facet type may be easier than subclassing FacetType
+and Categorizer to set up some custom behavior.)
+
+The function will be called with the index searcher and index document ID as
+arguments. For example, if you have an index with term vectors::
+
+    schema = fields.Schema(id=fields.STORED,
+                           text=fields.TEXT(stored=True, vector=True))
+    ix = RamStorage().create_index(schema)
+    
+...you could use a function to sort documents higher the closer they are to
+having equal occurrences of two terms::
+    
+    def fn(searcher, docnum):
+        v = dict(searcher.vector_as("frequency", docnum, "text"))
+        # Sort documents that have equal number of "alfa" and "bravo" first
+        return 0 - (1.0 / (abs(v.get("alfa", 0) - v.get("bravo", 0)) + 1.0))
+
+    facet = sorting.FunctionFacet(fn)
+    results = searcher.search(myquery, sortedby=facet)
+
+
+MultiFacet
+==========
+
+This facet type returns a composite of the keys returned by two or more
+sub-facets, allowing you to sort/group by the intersected values of multiple
+facets.
+
+``MultiFacet`` has methods for adding facets::
+
+    myfacet = sorting.RangeFacet("size", 0, 1000, 10)
+
+    mf = sorting.MultiFacet()
+    mf.add_field("category")
+    mf.add_field("price", reverse=True)
+    mf.add_facet(myfacet)
+    mf.add_score()
+
+You can also pass a list of field names and/or ``FacetType`` objects to the
+initializer::
+
+    prices = sorting.FieldFacet("price", reverse=True)
+    scores = sorting.ScoreFacet()
+    mf = sorting.MultiFacet("category", prices, myfacet, scores)
+
+
+Missing values
+==============
+
+* When sorting, documents without any terms in a given field, or whatever else
+  constitutes "missing" for different facet types, will always sort to the end.
+
+* When grouping, "missing" documents will appear in a group with the
+  key ``None``.
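+
+For example, when grouping by "size", documents with no term in that field
+end up under the ``None`` key (a sketch)::
+
+    groups = results.groups("size")
+    # e.g. {"small": [1, 2], "large": [3], None: [0, 4]}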
+
+
+Expert: writing your own facet
+==============================
+
+TBD.

src/whoosh/analysis.py

                            xrange, next)
 from whoosh.lang.dmetaphone import double_metaphone
 from whoosh.lang.porter import stem
-from whoosh.util import lru_cache, unbound_cache
+from whoosh.util import lru_cache, unbound_cache, rcompile
 
 
 # Default list of stop words (words so common it's usually wasteful to index
 
 # Pre-configured regular expressions
 
-default_pattern = re.compile(r"\w+(\.?\w+)*", re.UNICODE)
-url_pattern = re.compile("""
+default_pattern = rcompile(r"\w+(\.?\w+)*")
+url_pattern = rcompile("""
 (
     [A-Za-z+]+://          # URL protocol
     \\S+?                  # URL body
 ) | (                      # or...
     \w+([:.]?\w+)*         # word characters, with optional internal colons/dots
 )
-""", re.VERBOSE | re.UNICODE)
+""", verbose=True)
 
 
 # Utility functions
             than matching on the expression.
         """
         
-        if isinstance(expression, string_type):
-            self.expression = re.compile(expression, re.UNICODE)
-        else:
-            self.expression = expression
+        self.expression = rcompile(expression)
         self.gaps = gaps
     
     def __eq__(self, other):
                 pos += 1
 
 
+class PathTokenizer(Tokenizer):
+    """A simple tokenizer that given a string ``"/a/b/c"`` yields tokens
+    ``["/a", "/a/b", "/a/b/c"]``.
+    """
+    
+    def __init__(self, expression="[^/]+"):
+        self.expr = rcompile(expression, re.UNICODE)
+    
+    def __call__(self, value, **kwargs):
+        assert isinstance(value, text_type), "%r is not unicode" % value
+        token = Token(**kwargs)
+        for match in self.expr.finditer(value):
+            token.text = value[:match.end()]
+            yield token
+
+
 # Filters
 
 class Filter(Composable):

src/whoosh/fields.py

 from whoosh.support.times import datetime_to_long
 
 
+# "Default" values to indicate missing values when sorting and faceting numeric
+# fields. There's no "out-of-band" value possible (except for floats, where we
+# use NaN), so we try to be conspicuous at least by using the maximum possible
+# value.
+NUMERIC_DEFAULTS = {"b": 2**7-1, "B": 2**8-1, "h": 2**15-1, "H": 2**16-1,
+                    "i": 2**31-1, "I": 2**32-1, "q": 2**63-1, "Q": 2**64-1,
+                    "f": float("nan"), "d": float("nan"),
+                    }
+DEFAULT_LONG = NUMERIC_DEFAULTS["q"]
+
 # Exceptions
 
 class FieldConfigurationError(Exception):
     analyzer = format = vector = scorable = stored = unique = None
     indexed = True
     multitoken_query = "first"
-    sortable_type = text_type
     sortable_typecode = None
     spelling = False
     
         else:
             return False
     
+    def sortable_default(self):
+        """Returns a default value to use for "missing" values when sorting or
+        faceting in this field.
+        """
+        
+        return u('\uFFFF')
+    
     def to_text(self, value):
         """Returns a textual representation of the value. Non-textual fields
         (such as NUMERIC and DATETIME) will override this to encode objects
         for sorting. The default implementation simply returns the texts of all
         terms in the field.
         
-        The value of the field's ``sortable_type`` attribute should contain the
-        type of the second item (the sortable value) in the pairs, e.g.
-        ``unicode`` or ``int``.
-        
         This can be overridden by field types such as NUMERIC where some values
         in a field are not useful for sorting, and where the sortable values
         can be expressed more compactly as numbers.
         """
         
         self.type = type
-        if self.type is int:
-            self.sortable_type = int
-            if PY3:
-                self._to_text = long_to_text
-                self._from_text = text_to_long
-                self.sortable_typecode = "q" if signed else "Q"
-            else:
-                self._to_text = int_to_text
-                self._from_text = text_to_int
-                self.sortable_typecode = "i" if signed else "I"
-        elif self.type is long_type:
+        if self.type is long_type:
+            # This will catch the Python 3 int type
             self._to_text = long_to_text
             self._from_text = text_to_long
-            self.sortable_type = long_type
             self.sortable_typecode = "q" if signed else "Q"
+        elif self.type is int:
+            self._to_text = int_to_text
+            self._from_text = text_to_int
+            self.sortable_typecode = "i" if signed else "I"
         elif self.type is float:
             self._to_text = float_to_text
             self._from_text = text_to_float
         self.analyzer = IDAnalyzer()
         self.format = Existence(field_boost=field_boost)
     
+    def sortable_default(self):
+        return NUMERIC_DEFAULTS[self.sortable_typecode]
+    
     def _tiers(self, num):
         t = self.type
         if t is int and not PY3:

src/whoosh/filedb/fieldcache.py

     each document with a value through the array.
     """
     
-    def __init__(self, order=None, texts=None, hastexts=True, default=u(""),
-                 typecode="I"):
+    def __init__(self, order=None, texts=None, hastexts=True,
+                 default=u('\uFFFF'), typecode="I"):
         """
         :param order: an array of ints.
         :param texts: a list of text values.
     # Class constructor for building a field cache from a reader
     
     @classmethod
-    def from_field(cls, ixreader, fieldname, default=u("")):
+    def from_field(cls, ixreader, fieldname):
         """Creates an in-memory field cache from a reader.
         
         >>> r = ix.reader()
         texts = None
         if hastexts:
             typecode = "I"
-            texts = [default]
+            texts = [field.sortable_default()]
+            defaultnum = 0
         else:
             typecode = field.sortable_typecode
+            defaultnum = field.sortable_default()
         
         doccount = ixreader.doc_count_all()
         # Python does not support arrays of long long see Issue 1172711
         if typecode.lower() == "q":
-            order = [0] * doccount
+            order = [defaultnum] * doccount
         else:
-            order = array(typecode, [0] * doccount)
+            order = array(typecode, [defaultnum] * doccount)
         
         enum = enumerate(field.sortable_values(ixreader, fieldname))
         for i, (text, sortable) in enum:
                 newcode = "H"
             
             if newcode != order.typecode:
-                order = array(newcode, order)
+                # Can't use an array as the source for another array
+                order = array(newcode, iter(order))
                 typecode = newcode
         
         return cls(order, texts, hastexts=hastexts, typecode=typecode)
                 newcode = "B"
             elif len(self.texts) < 65535:
                 newcode = "H"
-            
             if newcode != self.order.typecode:
-                self.order = array(newcode, self.order)
+                self.order = array(newcode, iter(self.order))
                 self.typecode = newcode
         else:
             dbfile.write_uint(0)  # No texts
     
     def groups(self, docnums, counts=False):
         """Returns a dictionary mapping key values to document numbers. If
-        ``counts_only`` is True, the returned dictionary maps key values to the
+        ``counts`` is True, the returned dictionary maps key values to the
         number of documents in that 
         """
         
 # Streaming cache file writer
 
 class FieldCacheWriter(object):
-    def __init__(self, dbfile, size=0, hastexts=True, code="I", default=u("")):
+    def __init__(self, dbfile, size=0, hastexts=True, code="I",
+                 default=u('\uFFFF')):
         self.dbfile = dbfile
         self.order = array(code, [0] * size)
         self.hastexts = hastexts

src/whoosh/filedb/filepostings.py

             self._write_block()
 
     def finish(self, inlinelimit=1):
+        assert isinstance(inlinelimit, integer_types)
         if self.block is None:
             raise Exception("Called finish() when not in a block")
 

src/whoosh/filedb/filereading.py

         return self.dc
 
     def stored_fields(self, docnum):
+        assert docnum >= 0
         schema = self.schema
         return dict(item for item
                     in iteritems(self.storedfields[docnum])

src/whoosh/filedb/filewriting.py

 except ImportError:
     has_sqlite = False
 
-from whoosh.compat import iteritems, text_type
+from whoosh.compat import integer_types, iteritems, text_type
 from whoosh.fields import UnknownFieldError
 from whoosh.filedb.fileindex import Segment
 from whoosh.filedb.filepostings import FilePostingWriter
         pf = self.storage.create_file(segment.termposts_filename)
         pw = FilePostingWriter(pf, blocklimit=blocklimit)
         # Terms writer
-        self.termswriter = TermsWriter(self.schema, ti, pw, self.dawg,
-                                       self.wordsets)
+        self.termswriter = TermsWriter(self.schema, ti, pw, self.dawg)
         
         if self.schema.has_vectored_fields():
             # Vector index
         
         # Posting lists with <= this number of postings will be inlined into
         # the terms index instead of being written to the posting file
+        assert isinstance(inlinelimit, integer_types)
         self.inlinelimit = inlinelimit
         
         self.spelling = False

src/whoosh/qparser/common.py

 """
 
 from __future__ import print_function
-import re
 
 from whoosh.compat import string_type
 
+
 class QueryParserError(Exception):
     def __init__(self, cause, msg=None):
         super(QueryParserError, self).__init__(str(cause))
         self.cause = cause
 
 
-def rcompile(pattern, flags=0, verbose=False):
-    """A wrapper for re.compile that checks whether "pattern" is a regex object
-    or a string to be compiled, and automatically adds the re.UNICODE flag.
-    """
-    
-    if not isinstance(pattern, string_type):
-        # If it's not a string, assume it's already a compiled pattern
-        return pattern
-    if verbose:
-        flags |= re.VERBOSE
-    return re.compile(pattern, re.UNICODE | flags)
-
-
 def get_single_text(field, text, **kwargs):
     """Returns the first token from an analyzer's output.
     """

src/whoosh/qparser/dateparse.py

 
 from whoosh.compat import string_type, iteritems
 from whoosh.qparser import plugins, syntax
-from whoosh.qparser.common import rcompile
 from whoosh.qparser.taggers import Tagger
 from whoosh.support.relativedelta import relativedelta
 from whoosh.support.times import (adatetime, timespan, fill_in, is_void,
                                   TimeError, relative_days)
+from whoosh.util import rcompile
 
 
 class DateParseError(Exception):

src/whoosh/qparser/default.py

                 plugins.GroupPlugin(),
                 plugins.OperatorsPlugin(),
                 plugins.BoostPlugin(),
+                plugins.EveryPlugin(),
                 ]
 
     def add_plugins(self, pins):

src/whoosh/qparser/plugins.py

 from whoosh import query
 from whoosh.compat import iteritems, u, PY3
 from whoosh.qparser import syntax
-from whoosh.qparser.common import rcompile, attach
+from whoosh.qparser.common import attach
 from whoosh.qparser.taggers import RegexTagger, FnTagger
+from whoosh.util import rcompile
 
 
 class Plugin(object):
     
     def create(self, parser, match):
         kwargs = match.groupdict()
-        
-        if not PY3:
-            def convert_unicode_keys(d):
-                if isinstance(d, dict):
-                	return dict([(k.encode("utf-8"), convert_unicode_keys(v))\
-                	        for k, v in iteritems(d)])
-                return d
-            kwargs = convert_unicode_keys(kwargs)
-        
         return self.nodetype(**kwargs)
 
 
         return top
 
 
+class EveryPlugin(TaggingPlugin):
+    expr = "[*]:[*]"
+    priority = -1
+    
+    def create(self, parser, match):
+        return self.EveryNode()
+        
+    class EveryNode(syntax.SyntaxNode):
+        def r(self):
+            return "*:*"
+        
+        def query(self, parser):
+            return query.Every()
+
+
 class FieldsPlugin(TaggingPlugin):
     """Adds the ability to specify the field of a clause.
     """
         def create(self, parser, match):
             return syntax.FieldnameNode(match.group("text"), match.group(0))
     
-    def __init__(self, expr=r"(?P<text>\w+):", remove_unknown=True):
+    def __init__(self, expr=r"(?P<text>\w+|[*]):", remove_unknown=True):
         """
         :param expr: the regular expression to use for tagging fields.
         :param remove_unknown: if True, converts field specifications for

src/whoosh/qparser/taggers.py

 # those of the authors and should not be interpreted as representing official
 # policies, either expressed or implied, of Matt Chaput.
 
-from whoosh.qparser.common import rcompile
+from whoosh.util import rcompile
 
 
 # Tagger objects

src/whoosh/query.py

         reader = searcher.reader()
         
         if fieldname in (None, "", "*"):
+            # This takes into account deletions
             doclist = list(reader.all_doc_ids())
         elif reader.supports_caches() and reader.fieldcache_available(self.fieldname):
             # If the reader has a field cache, use it to quickly get the list
         return ListMatcher(doclist, all_weights=self.boost)
 
 
-
-
 class _NullQuery(Query):
     "Represents a query that won't match anything."
     

src/whoosh/searching.py

 import threading
 import weakref
 from collections import defaultdict
-from heapq import heappush, heapreplace
+from heapq import heappush, heapreplace, nlargest, nsmallest
 from math import ceil
 
-from whoosh import classify, highlight, query, scoring
+from whoosh import classify, highlight, query, scoring, sorting
 from whoosh.compat import (iteritems, itervalues, iterkeys, xrange, text_type,
                            string_type)
 from whoosh.reading import TermNotFound
         return self.search(q, **kwargs)
 
     def sorter(self, *args, **kwargs):
-        """Returns a :class:`whoosh.sorting.Sorter` object for this searcher.
-        See the documentation for ``Sorter`` for how to use the sorter object
-        to get sorted search results.
+        """This method is deprecated. Instead of using a Sorter, configure a
+        :class:`whoosh.sorting.FieldFacet` or
+        :class:`whoosh.sorting.MultiFacet` and pass it to the
+        :meth:`Searcher.search` method's ``sortedby`` keyword argument.
         """
         
-        from whoosh.sorting import Sorter
-        
-        return Sorter(self, *args, **kwargs)
-
-    def sort_query_using(self, q, fn, filter=None):
-        """Returns a :class:`Results` object with the documents matching the
-        given query ordered according score returned by the given function.
-        
-        The function is called for each matching document with two arguments:
-        this searcher and the document number. The function can usually only
-        work with information that can be accessed based on the document
-        number, such as stored fields, term vectors, and field caches.
-        
-        For example, assume an index where the "text" field was indexed with
-        term vectors. The following function loads the term vector for each
-        document and ranks documents containing equal occurrences of the terms
-        "love" and "hate" highest::
-        
-            def fn(searcher, docnum):
-                # Create a dictionary of term text to frequency
-                v = dict(searcher.vector_as("frequency", docnum, "text"))
-                # Give highest scores to documents that have equal numbers
-                # of the two terms
-                return 1.0 / (abs(v["love"] - v["hate"]) + 1.0)
-            
-            with myindex.searcher() as s:
-                q = And([Term("text", u"love"), Term("text", u"hate")])
-                results = s.sort_query_using(q, fn)
-        
-        (Note that the "function" can be an object with a ``__call__`` method.
-        This may be useful for sharing information between calls.)
-        
-        :param q: the query to run.
-        :param fn: a function to run on each document number to determine the
-            document's "score". Higher values appear earlier in the results.
-        :param filter: a query, Results object, or set of docnums. The results
-            will only contain documents that are also in the filter object.
-        """
-        
-        t = now()
-        comb = self._filter_to_comb(filter)
-        ls = [(fn(self, docnum), docnum) for docnum in self.docs_for_query(q)
-              if (not comb) or docnum in comb]
-        docset = set(docnum for _, docnum in ls)
-        ls.sort(key=lambda x: (0 - x[0], x[1]))
-        return Results(self, q, ls, docset, runtime=now() - t, filter=filter)
+        return sorting.Sorter(self, *args, **kwargs)
 
     def define_facets(self, name, qs, save=False):
         def doclists_for_searcher(s):
             dls = doclists_for_searcher(self)
             self.ixreader.define_facets(name, dls, save=save)
     
-    def categorize_query(self, q, fieldname, counts=False):
-        groups = {}
-        if isinstance(fieldname, string_type):
-            fieldname = (fieldname, )
-        
-        if self.subsearchers:
-            for s, offset in self.subsearchers:
-                r = s.reader()
-                r.group_docs_by(fieldname, q.docs(s), groups, counts=counts,
-                                offset=offset)
-        else:
-            self.ixreader.group_docs_by(fieldname, q.docs(self), groups,
-                                        counts=counts)
-        return groups
-    
     def docs_for_query(self, q, leafs=True):
         if self.subsearchers and leafs:
             for s, offset in self.subsearchers:
                 yield docnum
     
     def search(self, q, limit=10, sortedby=None, reverse=False, groupedby=None,
-               optimize=True, filter=None, mask=None):
+               optimize=True, filter=None, mask=None, groupids=True):
         """Runs the query represented by the ``query`` object and returns a
         Results object.
         
+        See :doc:`/facets` for information on using ``sortedby`` and/or
+        ``groupedby``.
+        
         :param query: a :class:`whoosh.query.Query` object.
         :param limit: the maximum number of documents to score. If you're only
             interested in the top N documents, you can set limit=N to limit the
             scoring for a faster search.
-        :param sortedby: the name of a field to sort by, or a tuple of field
-            names to sort by multiple fields. This is a shortcut for using a
-            :class:`whoosh.sorting.Sorter` object to do a simple sort. To do
-            complex sorts (where different fields are sorted in different
-            directions), use :meth:`Searcher.sorter` to get a sorter and use it
-            to perform the sorted search.
+        :param sortedby: see :doc:`/facets`.
         :param reverse: Reverses the direction of the sort.
-        :param groupedby: a list of field names or facet names. If this
-            argument is not None, you can use the :meth:`Results.groups` method
-            on the results object to retrieve a dictionary mapping field/facet
-            values to document numbers.
+        :param groupedby: see :doc:`/facets`.
         :param optimize: use optimizations to get faster results when possible.
         :param filter: a query, Results object, or set of docnums. The results
             will only contain documents that are also in the filter object.
         :param mask: a query, Results object, or set of docnums. The results
             will not contain documents that are also in the mask object.
+        :param groupids: by default, faceting groups map keys to lists of
+            document numbers associated with that key. To map to a simple count
+            of the number of documents instead of a list, use
+            ``groupids=False``.
         :rtype: :class:`Results`
         """
 
         if limit is not None and limit < 1:
             raise ValueError("limit must be >= 1")
 
-        if sortedby is not None:
-            sorter = self.sorter(sortedby=sortedby)
-            return sorter.sort_query(q, limit=limit, reverse=reverse,
-                                     filter=filter)
+        collector = Collector(limit=limit, usequality=optimize,
+                              groupedby=groupedby, reverse=reverse,
+                              groupids=groupids)
         
-        collector = Collector(limit=limit, usequality=optimize,
-                              groupedby=groupedby, reverse=reverse)
-        return collector.search(self, q, allow=filter, restrict=mask)
-        
+        if sortedby:
+            return collector.sort(self, q, sortedby, allow=filter,
+                                  restrict=mask)
+        else:
+            return collector.search(self, q, allow=filter, restrict=mask)
+    
     def correct_query(self, q, qstring, correctors=None, allfields=False,
                       terms=None, prefix=0, maxdist=2):
         """Returns a corrected version of the given user query using a default
 
 class Collector(object):
     def __init__(self, limit=10, usequality=True, replace=10, groupedby=None,
-                 timelimit=None, greedy=False, reverse=False):
+                 timelimit=None, greedy=False, reverse=False, groupids=True):
         """A Collector finds the matching documents, scores them, collects them
         into a list, and produces a Results object from them.
         
         self.timelimit = timelimit
         self.greedy = greedy
         self.reverse = reverse
+        self.groupids = groupids
         
-        # The groupedby attribute is expected to be a sequence of field names
-        if isinstance(groupedby, string_type):
-            groupedby = (groupedby, )
-        self.groupedby = groupedby
+        self.facets = None
+        if groupedby:
+            self.facets = sorting.Facets.from_groupedby(groupedby)
     
     def should_add_all(self):
         """Returns True if this collector needs to add all found documents (for
         limit = self.limit
         if limit:
             limit = min(limit, self.searcher.doc_count_all())
-        return not limit or self.groupedby
+        return not limit
     
     def use_block_quality(self, searcher, matcher=None):
         """Returns True if this collector can use block quality optimizations
             use = use and matcher.supports_block_quality()
         return use
     
-    def score(self, searcher, matcher):
-        """Called to compute the score for the current document in the given
-        :class:`whoosh.matching.Matcher`.
-        """
+    def _score_fn(self, searcher):
+        w = searcher.weighting
+        if w.use_final:
+            def scorefn(matcher):
+                score = matcher.score()
+                return w.final(searcher, matcher.id(), score)
+        else:
+            scorefn = None
+        return scorefn
+    
+    def _set_categorizers(self, searcher, offset):
+        groups = self.groups
+        if self.facets:
+            self.categorizers = dict((name, facet.categorizer(searcher))
+                                     for name, facet in self.facets.items())
+            
+            for name, catter in self.categorizers.items():
+                if self.groupids and name not in groups:
+                    groups[name] = defaultdict(list)
+                elif name not in groups:
+                    groups[name] = defaultdict(int)
+                
+                catter.set_searcher(searcher, offset)
+    
+    def _set_filters(self, allow, restrict):
+        if allow:
+            allow = self.searcher._filter_to_comb(allow)
+        self.allow = allow
+        if restrict:
+            restrict = self.searcher._filter_to_comb(restrict)
+        self.restrict = restrict
+    
+    def _set_timer(self):
+        # If this collector is time limited, start the timer thread
+        self.timer = None
+        if self.timelimit:
+            self.timer = threading.Timer(self.timelimit, self._timestop)
+            self.timer.start()
+    
+    def _reset(self):
+        self.groups = {}
+        self.items = []
+        self.timedout = False
+        self.runtime = -1
+        self.minscore = None
+    
+    def _timestop(self):
+        # Called by the Timer when the time limit expires. Set an attribute on
+        # the collector to indicate that the timer has expired and the
+        # collector should raise a TimeLimit exception at the next consistent
+        # state.
+        self.timer = None
+        self.timedout = True
+    
+    def _add_to_group(self, name, key, offsetid):
+        if self.groupids:
+            self.groups[name][key].append(offsetid)
+        else:
+            self.groups[name][key] += 1
+    
+    def collect(self, id, offsetid):
+        docset = self.docset
+        if docset is not None:
+            docset.add(offsetid)
         
-        w = searcher.weighting
-        score = matcher.score()
-        if w.use_final:
-            score = w.final(searcher, matcher.id(), score)
-        return score
+        if self.facets is not None:
+            add = self._add_to_group
+            for name, catter in self.categorizers.items():
+                if catter.allow_overlap:
+                    for key in catter.keys_for_id(id):
+                        add(name, catter.key_to_name(key), offsetid)
+                else:
+                    key = catter.key_to_name(catter.key_for_id(id))
+                    add(name, key, offsetid)
     
     def search(self, searcher, q, allow=None, restrict=None):
         """Top-level method call which uses the given :class:`Searcher` and
         
         self.searcher = searcher
         self.q = q
-        
-        if allow:
-            allow = searcher._filter_to_comb(allow)
-        self.allow = allow
-        if restrict:
-            restrict = searcher._filter_to_comb(restrict)
-        self.restrict = restrict
-        
-        self.groups = {}
-        self.items = []
-        self.minscore = None
-        self.timedout = False
-        self.runtime = -1
+        self._set_filters(allow, restrict)
+        self._reset()
+        self._set_timer()
         
         # If we're not using block quality, then we can add every document
         # number to a set as we see it, because we're not skipping low-quality
         # blocks
         self.docset = set() if not self.use_block_quality(searcher) else None
         
-        # If this collector is time limited, start the timer thread
-        self.timer = None
-        if self.timelimit:
-            self.timer = threading.Timer(self.timelimit, self._timestop)
-            self.timer.start()
-        
         # Perform the search
         t = now()
+        
         if searcher.is_atomic():
-            self.add_matches(searcher, q)
+            searchers = [(searcher, 0)]
         else:
-            for s, offset in searcher.subsearchers:
-                self.add_matches(s, q, offset=offset)
+            searchers = searcher.subsearchers
+            
+        for s, offset in searchers:
+            scorefn = self._score_fn(s)
+            self._set_categorizers(s, offset)
+            self.add_matches(s, q, offset, scorefn)
         
         # If we started a time limit timer thread, cancel it
         if self.timelimit and self.timer:
         self.runtime = now() - t
         return self.results()
     
-    def _timestop(self):
-        # Called by the Timer when the time limit expires. Set an attribute on
-        # the collector to indicate that the timer has expired and the
-        # collector should raise a TimeLimit exception at the next consistent
-        # state.
-        self.timer = None
-        self.timedout = True
-    
-    def add_matches(self, searcher, q, offset=0):
-        allow = self.allow
-        restrict = self.restrict
+    def add_matches(self, searcher, q, offset, scorefn):
         items = self.items
-        groups = self.groups
         limit = self.limit
         addall = self.should_add_all()
         matcher = q.matcher(searcher)
         usequality = self.use_block_quality(searcher, matcher)
         
-        # If this collector has grouping enabled, set up the key functions
-        keyfns = None
-        if self.groupedby:
-            keyfns = {}
-            for name in self.groupedby:
-                keyfns[name] = searcher.reader().key_fn(name)
-        
-        for offsetid, score in self.pull_matches(searcher, matcher, usequality,
-                                                 offset):
-            if allow and offsetid not in allow:
-                continue
-            if restrict and offsetid in restrict:
-                continue
-            
-            if keyfns:
-                for name, keyfn in iteritems(keyfns):
-                    if name not in groups:
-                        groups[name] = defaultdict(list)
-                    key = keyfn(offsetid - offset)
-                    groups[name][key].append(offsetid)
-            
+        for score, offsetid in self.pull_matches(searcher, matcher, offset,
+                                                 scorefn, usequality):
             # Document numbers are negated before putting them in the heap so
             # that higher document numbers have lower "priority" in the queue.
             # Lower document numbers should always come before higher document
                 if score > items[0][0]:
                     heapreplace(items, (score, negated_offsetid))
                     self.minscore = items[0][0]
-                    
-    def pull_matches(self, searcher, matcher, usequality, offset):
+    
+    def pull_matches(self, searcher, matcher, offset, scorefn, usequality):
         """Low-level method yields (docid, score) pairs from the given matcher.
         Called by :meth:`Collector.add_matches`.
         """
         
-        scorefn = self.score
+        allow = self.allow
+        restrict = self.restrict
         replace = self.replace
-        docset = self.docset
+        collect = self.collect
         minscore = self.minscore
         replacecounter = 0
         timelimited = bool(self.timelimit)
             # matcher with a more efficient version
             if replace:
                 if replacecounter == 0 or self.minscore != minscore:
-                    #print "minscore=", minscore, "max_qual=", matcher.max_quality()
                     matcher = matcher.replace(minscore or 0)
                     if not matcher.is_active():
                         break
             # minimum required quality
             if usequality and checkquality and minscore is not None:
                 matcher.skip_to_quality(minscore)
-                #print "skipped=", skipped
                 # Skipping ahead might have moved the matcher to the end of the
                 # posting list
                 if not matcher.is_active():
             # The current document ID 
             id = matcher.id()
             offsetid = id + offset
-            # If we're keeping track of IDs encountered, add this one
-            if docset is not None:
-                docset.add(offsetid)
             
-            # If we're using quality optimizations, check whether the current
-            # posting has higher quality than the minimum before yielding it.
-            score = scorefn(searcher, matcher)
-            yield (offsetid, score)
+            # Check whether the document is filtered
+            if ((not allow or offsetid in allow)
+                and (not restrict or offsetid not in restrict)):
+                # Collect and yield this document
+                collect(id, offsetid)
+                if scorefn:
+                    score = scorefn(matcher)
+                else:
+                    score = matcher.score()
+                yield (score, offsetid)
             
             # Check whether the time limit expired
-            if self.timedout:
+            if timelimited and self.timedout:
                 raise TimeLimit
             
             # Move to the next document. This method returns True if the
             # matcher has entered a new block, so we should check block quality
             # again.
             checkquality = matcher.next()
+    
+    def sort(self, searcher, q, sortedby, allow=None, restrict=None):
+        self.searcher = searcher
+        self.q = q
+        self.docset = set()
+        self._set_filters(allow, restrict)
+        self._reset()
+        self._set_timer()
+        
+        items = self.items
+        limit = self.limit
+        heapfn = nlargest if self.reverse else nsmallest
+        addall = self.should_add_all()
+        
+        facet = sorting.MultiFacet.from_sortedby(sortedby)
+        catter = facet.categorizer(searcher)
+        t = now()
+        
+        if searcher.is_atomic():
+            searchers = [(searcher, 0)]
+        else:
+            searchers = searcher.subsearchers
             
-                    
-    def results(self):
+        for s, offset in searchers:
+            self._set_categorizers(s, offset)
+            catter.set_searcher(s, offset)
+            matcher = q.matcher(s)
+            
+            if catter.requires_matcher:
+                ls = list(self.pull_matches(s, matcher, offset,
+                                            catter.key_for_matcher, False))
+            else:
+                ls = list(self.pull_unscored_matches(matcher, offset,
+                                                     catter.key_for_id))
+            
+            if addall:
+                items.extend(ls)
+            else:
+                items = heapfn(limit, items + ls)
+        
+        self.items = items
+        self.runtime = now() - t
+        return self.results(scores=False)
+    
+    def pull_unscored_matches(self, matcher, offset, keyfn):
+        allow = self.allow
+        restrict = self.restrict
+        collect = self.collect
+        timelimited = bool(self.timelimit)
+        
+        for id in matcher.all_ids():
+            # Check whether the time limit expired since the last match
+            if timelimited and self.timedout and not self.greedy:
+                raise TimeLimit
+            
+            # The current document ID 
+            offsetid = id + offset
+            
+            # Check whether the document is filtered
+            if ((not allow or offsetid in allow)
+                and (not restrict or offsetid not in restrict)):
+                # Collect and yield this document
+                collect(id, offsetid)
+                yield (keyfn(id), offsetid)
+            
+            # Check whether the time limit expired
+            if timelimited and self.timedout:
+                raise TimeLimit
+    
+    def results(self, scores=True):
         """Returns the current results from the collector. This is useful for
         getting the results out of a collector that was stopped by a time
         limit exception.
         """
         
-        # Docnums are stored as negative for reasons too obscure to go into
-        # here, re-negate them before returning
-        items = [(x[0], 0 - x[1]) for x in self.items]
+        if scores:
+            # Docnums are stored as negative for reasons too obscure to go into
+            # here, re-negate them before returning
+            items = [(x[0], 0 - x[1]) for x in self.items]
         
-        # Sort by negated scores so that higher scores go first, then by
-        # document number to keep the order stable when documents have the same
-        # score
-        items.sort(key=lambda x: (0 - x[0], x[1]), reverse=self.reverse)
+            # Sort by negated scores so that higher scores go first, then by
+            # document number to keep the order stable when documents have the
+            # same score
+            items.sort(key=lambda x: (0 - x[0], x[1]), reverse=self.reverse)
+        else:
+            items = sorted(self.items, reverse=self.reverse)
         
         return Results(self.searcher, self.q, items, self.docset,
                        groups=self.groups, runtime=self.runtime,
         # If you're using this collector, you need to examine all documents
         return True
     
-    def add_matches(self, searcher, q, offset=0):
+    def add_matches(self, searcher, q, offset, scorefn):
         sup = super(TermTrackingCollector, self)
         self.matchers = []
         q = self._tag(q)
-        return sup.add_matches(searcher, q, offset=offset)
+        return sup.add_matches(searcher, q, offset, scorefn)
     
-    def pull_matches(self, searcher, matcher, usequality, offset):
+    def pull_matches(self, searcher, matcher, offset, scorefn, usequality):
         super_method = super(TermTrackingCollector, self).pull_matches
         
-        for offsetid, score in super_method(searcher, matcher, usequality,
-                                            offset):
+        for score, offsetid in super_method(searcher, matcher, offset,
+                                            scorefn, usequality):
             for key, m in self.matchers:
                 if m.is_active() and m.id() == offsetid - offset:
                     if key not in self.catalog:
                         self.catalog[key] = set()
                     self.catalog[key].add(offsetid)
             
-            yield (offsetid, score)
+            yield (score, offsetid)
     
     def _tag(self, q):
         # Takes a query and returns a copy of the query with a TaggedQuery
         fields associated with a document ID.
         """
         
-        return self._groups[name]
+        if name not in self._groups:
+            raise KeyError("%r not in group names %r" % (name, self._groups.keys()))
+        return dict(self._groups[name])
     
     def _load_docs(self):
         # If self.docset is None, that means this results object was created

src/whoosh/sorting.py

 # policies, either expressed or implied, of Matt Chaput.
 
 from array import array
-from heapq import nlargest, nsmallest
 
-from whoosh.compat import string_type
-from whoosh.searching import Results
-from whoosh.util import now
+from whoosh.compat import string_type, u, xrange
+from whoosh.fields import DEFAULT_LONG
+from whoosh.support.times import (long_to_datetime, datetime_to_long,
+                                  timedelta_to_usecs)
 
 
+# Legacy sorting object
+
 class Sorter(object):
-    """This object does the work of sorting search results.
+    """This is a legacy interface. The functionality of the Sorter object was
+    moved into the :class:`FacetType` classes and the
+    :class:`whoosh.searching.Collector` in Whoosh 2.0. The old Sorter API is
+    still supported for backwards-compatibility, but it simply forwards to the
+    new API.
     
-    For simple sorting (where all fields go in the same direction), you can
-    just use the ``sortedby`` and ``reverse`` arguments to
-    :meth:`whoosh.searching.Searcher.search`::
-    
-        # Sort by ascending group
-        r = searcher.search(myquery, sortedby="group")
-        # Sort by ascending path and the ascending price price
-        r = searcher.search(myquery, sortedby=("path", "price"))
-        # Sort by descending path
-        r = searcher.search(myquery, sortedby="path", reverse=True)
-    
-    These are the equivalent of using the sorter directly::
-    
-        # Sort by ascending path and the ascending price price
-        sorter = searcher.sorter()
-        sorter.add_field("path")
-        sorter.add_field("price")
-        r = sorter.sort_query(myquery)
-    
-    For complex sorting (where some fields are ascending and some fields are
-    descending), you must instantiate a sorter object from the searcher and
-    specify the fields to sort by::
-    
-        # Sort by ascending group and then descending price
-        sorter = searcher.sorter()
-        sorter.add_field("group")
-        sorter.add_field("price", reverse=True)
-        r = sorter.sort_query(myquery)
-    
-    Alternatively, you can set up the sort criteria using a keyword argument::
-    
-        # Sort by ascending group and then descending price
-        crits = [("group", False), ("price", True)]
-        sorter = searcher.sorter(criteria=crits)
-        r = sorter.sort_query(myquery)
-    
-    Note that complex sorting can be much slower on large indexes than a
-    sort in which all fields are sorted in the same direction. Also, when you
-    do this type of sort on a multi-segment index, the sort cannot reuse field
-    caches and must recreate a field cache-like structure across the entire
-    index, which can effectively double memory usage for cached fields.
-    
-    You can re-use a configured sorter with different queries. However, the
-    sorter object always returns results from the searcher it was created with.
-    If the index changes and you refresh the searcher, you need to recreate the
-    sorter object to see the updates.
+    See :doc:`/facets` for information on the new API.
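+    
+    For example, this legacy usage::
+    
+        sorter = searcher.sorter()
+        sorter.add_field("group")
+        sorter.add_field("price", reverse=True)
+        r = sorter.sort_query(myquery)
+    
+    ...is now equivalent to building a :class:`MultiFacet` yourself (a
+    sketch using the facet API defined below)::
+    
+        mf = MultiFacet()
+        mf.add_field("group")
+        mf.add_field("price", reverse=True)
+        r = searcher.search(myquery, sortedby=mf)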
     """
 
-    def __init__(self, searcher, criteria=None, sortedby=None):
-        """
-        :param searcher: a :class:`whoosh.searching.Searcher` object to use for
-            searching.
-        :param criteria: a list of ``(fieldname, reversed)`` tuples, where the
-            second value in each tuple is a boolean indicating whether to
-            reverse the order of the sort for that field. Alternatively you can
-            use the :meth:`Sorter.add_field` method on the instantiated sorter.
-        :param sortedby: a convenience that generates a proper "criteria" list
-            from a fieldname string or list of fieldnames, to set up the sorter
-            for a simple search.
+    def __init__(self, searcher):
+        self.searcher = searcher
+        self.multi = MultiFacet()
+        
+    def add_field(self, fieldname, reverse=False):
+        self.multi.add_field(fieldname, reverse=reverse)
+    
+    def sort_query(self, q, limit=None, reverse=False, filter=None, mask=None,
+                   groupedby=None):
+        from whoosh.searching import Collector
+        
+        collector = Collector(limit=limit, groupedby=groupedby, reverse=reverse)
+        return collector.sort(self.searcher, q, self.multi, allow=filter,
+                              restrict=mask)
+    
+
+# Faceting objects
+
+class FacetType(object):
+    """Base class for "facets", aspects that can be sorted/faceted.
+    """
+    
+    def categorizer(self, searcher):
+        """Returns a :class:`Categorizer` corresponding to this facet.
         """
         
-        self.searcher = searcher
-        self.criteria = criteria or []
-        if sortedby:
-            if isinstance(sortedby, string_type):
-                sortedby = [sortedby]
-            for fieldname in sortedby:
-                self.criteria.append((fieldname, False))
-        
-        self.arrays = None
+        raise NotImplementedError
+    
 
-    def add_field(self, fieldname, reverse=False):
-        """Adds a field to the sorting criteria. Results are sorted by the
-        fields in the order you add them. For example, if you do::
-        
-            sorter.add_field("group")
-            sorter.add_field("price")
-            
-        ...the results are sorted by ``group``, and for results with the same
-        value of ``group``, are then sorted by ``price``.
-        
-        :param fieldname: the name of the field to sort by.
-        :param reverse: if True, reverses the natural ordering of the field.
+class Categorizer(object):
+    """Base class for categorizer objects which compute a key value for a
+    document based on certain criteria, for use in sorting/faceting.
+    
+    Categorizers are created by FacetType objects through the
+    :meth:`FacetType.categorizer` method. The
+    :class:`whoosh.searching.Searcher` object passed to the ``categorizer``
+    method may be a composite searcher (that is, wrapping a multi-reader), but
+    categorizers are always run **per-segment**, with segment-relative document
+    numbers.
+    
+    The collector will call a categorizer's ``set_searcher`` method as it
+    searches each segment to let the categorizer set up whatever segment-
+    specific data it needs.
+    
+    ``Categorizer.allow_overlap`` should be ``True`` if the caller should use
+    the ``keys_for_id`` method instead of ``key_for_id`` to group documents
+    into potentially overlapping groups.
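+    
+    A minimal sketch of a custom categorizer (the class name and grouping
+    rule here are hypothetical)::
+    
+        class EvenOddCategorizer(Categorizer):
+            def set_searcher(self, searcher, docoffset):
+                # Remember this segment's offset into the whole index
+                self.docoffset = docoffset
+            
+            def key_for_id(self, docid):
+                # docid is segment-relative; add the offset to get the
+                # absolute docnum, then group by even/odd
+                return (docid + self.docoffset) % 2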
+    """
+    
+    allow_overlap = False
+    requires_matcher = False
+    
+    def set_searcher(self, searcher, docoffset):
+        """Called by the collector when the collector moves to a new segment.
+        The ``searcher`` will be atomic. The ``docoffset`` is the offset of
+        the segment's document numbers relative to the entire index. You can
+        use the offset to get absolute index docnums by adding the offset to
+        segment-relative docnums.
         """
         
-        self.criteria.append((fieldname, reverse))
+        pass
     
-    def is_simple(self):
-        """Returns ``True`` if this is a "simple" sort (all the fields are
-        sorted in the same direction).
+    def key_for_matcher(self, matcher):
+        """Returns a key for the given matcher. The default implementation
+        simply gets the matcher's current document ID and calls ``key_for_id``,
+        but a subclass can override this if it needs information from the
+        matcher to compute the key.
         """
         
-        if len(self.criteria) < 2:
-            return True
+        return self.key_for_id(matcher.id())
+    
+    def key_for_id(self, docid):
+        """Returns a key for the given **segment-relative** document number.
+        """
         
-        firstdir = self.criteria[0][1]
-        return all(c[1] == firstdir for c in self.criteria)
+        raise NotImplementedError(self.__class__)
     
-    def _results(self, q, docnums, docset, runtime):
-        top_n = [(None, docnum) for docnum in docnums]
-        return Results(self.searcher, q, top_n, docset, runtime=runtime)
+    def keys_for_id(self, docid):
+        """Yields a series of keys for the given **segment-relative** document
+        number. This method will be called instead of ``key_for_id`` if
+        ``Categorizer.allow_overlap==True``.
+        """
+        
+        raise NotImplementedError(self.__class__)
     
-    def _simple_sort_query(self, q, limit=None, reverse=False, filter=None):
-        # If the direction of all sort fields is the same, we can use field
-        # caches to do the sorting
+    def key_to_name(self, key):
+        """Returns a representation of the key to be used as a dictionary key
+        in faceting. For example, the sorting key for date fields is a large
+        integer; this method translates it into a ``datetime`` object to make
+        the groupings clearer.
+        """
         
-        t = now()
-        docset = set()
-        sortedby = [c[0] for c in self.criteria]
-        reverse = self.criteria[0][1] ^ reverse
-        comb = self.searcher._filter_to_comb(filter)
+        return key
+    
+
+class FieldFacet(FacetType):
+    """Sorts/facest by the contents of a field.
+    
+    For example, to sort by the contents of the "path" field in reverse order,
+    and facet by the contents of the "tag" field::
+    
+        paths = FieldFacet("path", reverse=True)
+        tags = FieldFacet("tag")
+        results = searcher.search(myquery, sortedby=paths, groupedby=tags)
+    
+    This facet returns different categorizers based on the field type.
+    """
+    
+    def __init__(self, fieldname, reverse=False, allow_overlap=False):
+        """
+        :param fieldname: the name of the field to sort/facet on.
+        :param reverse: if True, when sorting, reverse the sort order of this
+            facet.
+        :param allow_overlap: if True, when grouping, allow documents to appear
+            in multiple groups when they have multiple terms in the field.
+        """
         
-        if self.searcher.subsearchers:
-            heap = []
-            
-            # I wish I could actually do a heap thing here, but the Python heap
-            # queue only works with greater-than, and I haven't thought of a
-            # smart way to get around that yet, so I'm being dumb and using
-            # nlargest/nsmallest on the heap + each subreader list :(
-            op = nlargest if reverse else nsmallest
-            
-            for s, offset in self.searcher.subsearchers:
-                # This searcher is wrapping a MultiReader, so push the sorting
-                # down to the leaf readers and then combine the results.
-                docnums = [docnum for docnum in q.docs(s)
-                           if (not comb) or docnum + offset in comb]
-                
-                # Add the docnums to the docset
-                docset.update(docnums)
-                
-                # Ask the reader to return a list of (key, docnum) pairs to
-                # sort by. If limit=None, the returned list is not sorted. If
-                # limit=True, it is sorted.
-                r = s.reader()
-                srt = r.key_docs_by(sortedby, docnums, limit, reverse=reverse,
-                                    offset=offset)
-                if limit:
-                    # Pick the "limit" smallest/largest items from the current
-                    # and new list
-                    heap = op(limit, heap + srt)
-                else:
-                    # If limit=None, we'll just add everything to the "heap"
-                    # and sort it at the end.
-                    heap.extend(srt)
-            
-            # Sort the heap and take the docnums
-            docnums = [docnum for _, docnum in sorted(heap, reverse=reverse)]
-            
+        self.fieldname = fieldname
+        self.reverse = reverse
+        self.allow_overlap = allow_overlap
+    
+    def categorizer(self, searcher):
+        from whoosh.fields import NUMERIC, DATETIME
+        
+        # The searcher we're passed here may wrap a multireader, but the
+        # actual key functions will always be called per-segment following a
+        # Categorizer.set_searcher method call
+        fieldname = self.fieldname
+        field = None
+        if fieldname in searcher.schema:
+            field = searcher.schema[fieldname]
+        
+        if self.allow_overlap:
+            return self.OverlappingFieldCategorizer(fieldname)
+        
+        elif isinstance(field, DATETIME):
+            # Return a subclass of NumericFieldCategorizer that formats dates
+            return self.DateFieldCategorizer(fieldname, self.reverse)
+        
+        elif isinstance(field, NUMERIC):
+            # Numeric fields are naturally reversible
+            return self.NumericFieldCategorizer(fieldname, self.reverse)
+        
+        elif self.reverse:
+            # If we need to "reverse" a string field, we need to do more work
+            return self.RevFieldCategorizer(searcher, fieldname, self.reverse)
+        
         else:
-            # This searcher is wrapping an atomic reader, so we don't need to
-            # get tricky combining the results of multiple readers, just ask
-            # the reader to sort the results.
-            r = self.searcher.reader()
-            docnums = [docnum for docnum in q.docs(self.searcher)
-                       if (not comb) or docnum in comb]
-            docnums = r.sort_docs_by(sortedby, docnums, reverse=reverse)
-            docset = set(docnums)
-            
-            # I artificially enforce the limit here, even thought the current
-            # implementation can't use it, so that the results don't change
-            # based on single- vs- multi-segment.
-            docnums = docnums[:limit]
+            # Straightforward: use the field cache to sort/categorize
+            return self.FieldCategorizer(fieldname)
+
+    class FieldCategorizer(Categorizer):
+        """Categorizer for regular, unreversed fields. Just uses the
+        fieldcache to get the keys.
+        """
         
-        runtime = now() - t
-        return self._results(q, docnums, docset, runtime)
+        def __init__(self, fieldname):
+            self.fieldname = fieldname
+        
+        def set_searcher(self, searcher, docoffset):
+            self.fieldcache = searcher.reader().fieldcache(self.fieldname)
+        
+        def key_for_id(self, docid):
+            return self.fieldcache.key_for(docid)
+        
+        def key_to_name(self, key):
+            if key == u('\uFFFF'):
+                return None
+            else:
+                return key
     
-    def _complex_cache(self):
-        self.arrays = []
-        r = self.searcher.reader()
-        for name, reverse in self.criteria:
-            arry = array("i", [0] * r.doc_count_all())
-            field = self.searcher.schema[name]
-            for i, (t, _) in enumerate(field.sortable_values(r, name)):
+    class NumericFieldCategorizer(Categorizer):
+        """Categorizer for numeric fields, which are naturally reversible.
+        """
+        
+        def __init__(self, fieldname, reverse):
+            self.fieldname = fieldname
+            self.reverse = reverse
+        
+        def set_searcher(self, searcher, docoffset):
+            self.default = searcher.schema[self.fieldname].sortable_default()
+            self.fieldcache = searcher.reader().fieldcache(self.fieldname)
+        
+        def key_for_id(self, docid):
+            value = self.fieldcache.key_for(docid)
+            if self.reverse:
+                return 0 - value
+            else:
+                return value
+        
+        def key_to_name(self, key):
+            if key == self.default:
+                return None
+            else:
+                return key
+
+    class DateFieldCategorizer(NumericFieldCategorizer):
+        """Categorizer for date fields. Same as NumericFieldCategorizer, but
+        converts the numeric keys back to dates for better labels.
+        """
+        
+        def key_to_name(self, key):
+            if key == DEFAULT_LONG:
+                return None
+            else:
+                return long_to_datetime(key)
+    
+    class RevFieldCategorizer(Categorizer):
+        """Categorizer for reversed fields. Since keys for non-numeric fields
+        are arbitrary data, it's not possible to "negate" them to reverse the
+        sort order. So, this object builds an array caching the order of
+        all documents according to the field, then uses the cached order as a
+        numeric key.
+        """
+        
+        def __init__(self, reader, fieldname, reverse):
+            # Cache the relative positions of all docs with the given field
+            # across the entire index
+            dc = reader.doc_count_all()
+            arry = array("i", [dc + 1] * dc)
+            field = reader.schema[fieldname]
+            for i, (t, _) in enumerate(field.sortable_values(reader, fieldname)):
                 if reverse:
-                    i = 0 - i
-                postings = r.postings(name, t)
+                    i = dc - i
+                postings = reader.postings(fieldname, t)
                 for docid in postings.all_ids():
                     arry[docid] = i
-            self.arrays.append(arry)
+            self.array = arry
+            
+        def set_searcher(self, searcher, docoffset):
+            self.searcher = searcher
+            self.docoffset = docoffset
+        
+        def key_for_id(self, docid):
+            return self.array[docid + self.docoffset]
+        
+    class OverlappingFieldCategorizer(Categorizer):
+        allow_overlap = True
+        
+        def __init__(self, fieldname):
+            self.fieldname = fieldname
+        
+        def set_searcher(self, searcher, docoffset):
+            fieldname = self.fieldname
+            dc = searcher.doc_count_all()
+            field = searcher.schema[fieldname]
+            reader = searcher.reader()
+            
+            self.lists = [[] for _ in xrange(dc)]
+            for t, _ in field.sortable_values(reader, fieldname):
+                postings = reader.postings(fieldname, t)
+                for docid in postings.all_ids():
+                    self.lists[docid].append(t)
+        
+        def keys_for_id(self, docid):
+            return self.lists[docid] or None
+        
+        def key_for_id(self, docid):
+            ls = self.lists[docid]
+            if ls:
+                return ls[0]
+            else:
+                return None
 
-    def _complex_key_fn(self, docnum):
-        return tuple(arry[docnum] for arry in self.arrays)
 
-    def _complex_sort_query(self, q, limit=None, reverse=False, filter=None):
-        t = now()
-        if self.arrays is None:
-            self._complex_cache()
-        comb = self.searcher._filter_to_comb(filter)
-        docnums = [docnum for docnum in self.searcher.docs_for_query(q)
-                   if (not comb) or docnum in comb]
-        docnums.sort(key=self._complex_key_fn, reverse=reverse)
-        docset = set(docnums)
-        
-        # I artificially enforce the limit here, even thought the current
-        # implementation can't use it, so that the results don't change based
-        # on single- vs- multi-segment.
-        if limit:
-            docnums = docnums[:limit]
-        runtime = now() - t
-        return self._results(q, docnums, docset, runtime)
-
-    def sort_query(self, q, limit=None, reverse=False, filter=None):
-        """Returns a :class:`whoosh.searching.Results` object for the given
-        query, sorted according to the fields set up using the
-        :meth:`Sorter.add_field` method.
-        
-        The parameters have the same meaning as for the
-        :meth:`whoosh.searching.Searcher.search` method.
+class QueryFacet(FacetType):
+    """Sorts/facets based on the results of a series of queries.
+    """
+    
+    def __init__(self, querydict, other=None, allow_overlap=False):
+        """
+        :param querydict: a dictionary mapping keys to
+            :class:`whoosh.query.Query` objects.
+        :param other: the key to use for documents that don't match any of the
+            queries.
+        :param allow_overlap: if True, when grouping, allow documents to
+            appear in multiple groups when they match multiple queries.
+        """
         
-        if self.is_simple():
-            meth = self._simple_sort_query
+        self.querydict = querydict
+        self.other = other
+        self.allow_overlap = allow_overlap
+    
+    def categorizer(self, searcher):
+        return self.QueryCategorizer(self.querydict, self.other,
+                                     self.allow_overlap)
+    
+    class QueryCategorizer(Categorizer):
+        def __init__(self, querydict, other, allow_overlap=False):
+            self.querydict = querydict
+            self.other = other
+            self.allow_overlap = allow_overlap
+            
+        def set_searcher(self, searcher, offset):
+            self.docsets = {}
+            for qname, q in self.querydict.items():
+                docset = set(q.docs(searcher))
+                if docset:
+                    self.docsets[qname] = docset
+            self.offset = offset
+        
+        def key_for_id(self, docid):
+            for qname in self.docsets:
+                if docid in self.docsets[qname]:
+                    return qname
+            return self.other
+        
+        def keys_for_id(self, docid):
+            found = False
+            for qname in self.docsets:
+                if docid in self.docsets[qname]:
+                    yield qname
+                    found = True
+            if not found:
+                yield self.other
+
+
+class RangeFacet(QueryFacet):
+    """Sorts/facets based on numeric ranges. For textual ranges, use
+    :class:`QueryFacet`.
+    
+    For example, to facet the "price" field into $100 buckets, up to $1000::
+    
+        prices = RangeFacet("price", 0, 1000, 100)
+        results = searcher.search(myquery, groupedby=prices)
+        
+    The ranges/buckets are always **inclusive** at the start and **exclusive**
+    at the end.
+    """
+    
+    def __init__(self, fieldname, start, end, gap, hardend=False):
+        """
+        :param fieldname: the numeric field to sort/facet on.
+        :param start: the start of the entire range.
+        :param end: the end of the entire range.
+        :param gap: the size of each "bucket" in the range. This can be a
+            sequence of sizes. For example, ``gap=[1,5,10]`` will use 1 as the
+            size of the first bucket, 5 as the size of the second bucket, and
+            10 as the size of all subsequent buckets.
+        :param hardend: if True, the end of the last bucket is clamped to the
+            value of ``end``. If False (the default), the last bucket is always
+            ``gap`` sized, even if that means the end of the last bucket is
+            after ``end``.
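+        
+        For example, ``RangeFacet("price", 0, 1000, [10, 100])`` (the field
+        name is hypothetical) makes one 10-unit bucket at the start of the
+        range, then 100-unit buckets for the rest.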
+        """
+        
+        self.fieldname = fieldname
+        self.start = start
+        self.end = end
+        self.gap = gap
+        self.hardend = hardend
+        self._queries()
+    
+    def _range_name(self, startval, endval):
+        return (startval, endval)
+    
+    def _queries(self):
+        from whoosh import query
+        
+        if not self.gap:
+            raise Exception("No gap secified (%r)" % self.gap)
+        if isinstance(self.gap, (list, tuple)):
+            gaps = self.gap
+            gapindex = 0
         else:
-            meth = self._complex_sort_query
+            gaps = [self.gap]
+            gapindex = -1
+        
+        self.querydict = {}
+        cstart = self.start
+        while cstart < self.end:
+            thisgap = gaps[gapindex]
+            if gapindex >= 0:
+                gapindex += 1
+                if gapindex == len(gaps):
+                    gapindex = -1
             
-        return meth(q, limit, reverse, filter)
+            cend = cstart + thisgap
+            if self.hardend:
+                cend = min(self.end, cend)
+            
+            rangename = self._range_name(cstart, cend)
+            q = query.NumericRange(self.fieldname, cstart, cend, endexcl=True)
+            self.querydict[rangename] = q
+            
+            cstart += thisgap
+    
+    def categorizer(self, searcher):
+        return QueryFacet(self.querydict).categorizer(searcher)
     
 
+class DateRangeFacet(RangeFacet):
+    """Sorts/facets based on date ranges.
+    
+    For example, to facet the "birthday" field into year-sized buckets::
+    
+        startdate = datetime(1920, 1, 1)
+        enddate = datetime.now()
+        gap = timedelta(days=365)
+        bdays = DateRangeFacet("birthday", startdate, enddate, gap)
+        results = searcher.search(myquery, groupedby=bdays)
+        
+    The ranges/buckets are always **inclusive** at the start and **exclusive**
+    at the end.
+    """
+    
+    def __init__(self, fieldname, startdate, enddate, delta, hardend=False):
+        """
+        :param fieldname: the datetime field to sort/facet on.
+        :param startdate: the start of the entire range.
+        :param enddate: the end of the entire range.
+        :param delta: a timedelta object representing the size of each "bucket"
+            in the range. This can be a sequence of timedeltas. For example,
+            ``delta=[timedelta(days=1), timedelta(days=5), timedelta(days=10)]``
+            will use 1 day as the size of the first bucket, 5 days as the size
+            of the second bucket, and 10 days as the size of all subsequent
+            buckets.
+        :param hardend: if True, the end of the last bucket is clamped to the
+            value of ``enddate``. If False (the default), the last bucket is
+            always ``delta`` sized, even if that means the end of the last
+            bucket is after ``enddate``.
+        """
+        
+        self.fieldname = fieldname
+        self.start = datetime_to_long(startdate)
+        self.end = datetime_to_long(enddate)
+        self.hardend = hardend
+        if isinstance(delta, (list, tuple)):
+            self.gap = [timedelta_to_usecs(d) for d in delta]
+        else:
+            self.gap = timedelta_to_usecs(delta)
+        self._queries()
+    
+    def _range_name(self, startval, endval):
+        return (long_to_datetime(startval), long_to_datetime(endval))