Commits

Matt Chaput committed cb200be

Renamed ComplexPhrasePlugin to SequencePlugin. Added optional slop factor.
Added optional prefix to fuzzy syntax.
Added optional slop factor to phrase syntax.
Added docs for sequence and fuzzy queries.
Added unit tests.

Files changed (5)

docs/source/parsing.rst

     date:[31 march 2001 to]
 
 
+Adding fuzzy term queries
+-------------------------
+
+Fuzzy queries are good for catching misspellings and similar words.
+The :class:`whoosh.qparser.FuzzyTermPlugin` lets you search for "fuzzy" terms,
+that is, terms that don't have to match exactly. The fuzzy term will match any
+similar term within a certain number of "edits" (character insertions,
+deletions, and/or transpositions -- this is called the "Damerau-Levenshtein
+edit distance").
+
+To add the fuzzy plugin::
+
+    parser = qparser.QueryParser("fieldname", my_index.schema)
+    parser.add_plugin(qparser.FuzzyTermPlugin())
+
+Once you add the fuzzy plugin to the parser, you can specify a fuzzy term by
+adding a ``~`` to the end of the term, optionally followed by a maximum edit
+distance. If you don't specify an edit distance, the default is ``1``.
+
+For example, the following "fuzzy" term query::
+
+    cat~
+
+would match ``cat`` and all terms in the index within one "edit" of ``cat``,
+for example ``cast`` (insert ``s``), ``at`` (delete ``c``), and ``act``
+(transpose ``c`` and ``a``).
+
+If you want ``cat`` to match ``bat``, that requires two edits (delete ``c`` and
+insert ``b``), so you would need to set the maximum edit distance to ``2``::
+
+    cat~2
+
+Because each additional edit you allow increases the number of possibilities
+that must be checked, edit distances greater than ``2`` can be very slow.
+
+It is often useful to require that the first few characters of a fuzzy term
+match exactly. This is called a prefix. You can set the length of the prefix
+by adding a slash and a number after the edit distance. For example, to use
+a maximum edit distance of ``2`` and a prefix length of ``3``::
+
+    johannson~2/3
+
+You can specify a prefix without specifying an edit distance::
+
+    johannson~/3
+
+The default prefix length is ``0``.
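+
+For example, here is a minimal sketch of checking how a fuzzy term is parsed
+from Python (the schema and field name are only placeholders)::
+
+    from whoosh import fields, qparser
+
+    schema = fields.Schema(title=fields.TEXT)
+    parser = qparser.QueryParser("title", schema)
+    parser.add_plugin(qparser.FuzzyTermPlugin())
+
+    # Maximum edit distance 2, prefix length 3
+    q = parser.parse(u"johannson~2/3")
+    # q should be a whoosh.query.FuzzyTerm with q.maxdist == 2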
+
+
+Allowing complex phrase queries
+-------------------------------
+
+The default parser setup allows phrase (proximity) queries such as::
+
+    "whoosh search library"
+
+The default phrase query tokenizes the text between the quotes and creates a
+search for those terms in proximity.
+
+If you want to do more complex proximity searches, you can replace the phrase
+plugin with the :class:`whoosh.qparser.SequencePlugin`, which allows any query
+between the quotes. For example::
+
+    "(john OR jon OR jonathan~) peters*"
+
+The sequence syntax lets you add a "slop" factor just like the regular phrase::
+
+    "(john OR jon OR jonathan~) peters*"~2
+
+To replace the default phrase plugin with the sequence plugin::
+
+    parser = qparser.QueryParser("fieldname", my_index.schema)
+    parser.remove_plugin_class(qparser.PhrasePlugin)
+    parser.add_plugin(qparser.SequencePlugin())
+
+Alternatively, you could keep the default phrase plugin and give the sequence
+plugin different syntax by specifying a regular expression for the start/end
+marker when you create the sequence plugin. The regular expression should have
+a named group ``slop`` for the slop factor. For example::
+
+    parser = qparser.QueryParser("fieldname", my_index.schema)
+    parser.add_plugin(qparser.SequencePlugin("!(~(?P<slop>[1-9][0-9]*))?"))
+
+This would allow you to use regular phrase queries and sequence queries at the
+same time::
+
+    "regular phrase" AND !sequence query~2!
+
+
 Advanced customization
 ======================
 

docs/source/querylang.rst

 This is the field in which any terms the user does not explicitly specify a field
 for will be searched.
 
+Whoosh's query parser is capable of parsing different and/or additional syntax
+through the use of plug-ins. See :doc:`parsing`.
+
 
 Individual terms and phrases
 ============================
 Note that a field must store Position information for phrase searching to work in
 that field.
 
+Normally when you specify a phrase, the maximum difference in position between
+each word in the phrase is 1 (that is, the words must be right next to each
+other in the document). You can add a ``~`` followed by a number after the
+closing quote to allow a larger gap (the "slop"). For example, the following
+matches if a document has ``library`` within 5 words after ``whoosh``::
+
+    "whoosh library"~5
+
 
 Boolean operators
 =================

src/whoosh/qparser/plugins.py

     The maximum edit distance can only be a single digit. Note that edit
     distances greater than 2 can take an extremely long time and are generally
     not useful.
+    
+    You can specify a prefix length using ``~n/m``. For example, to allow a
+    maximum edit distance of 2 and require a prefix match of 3 characters::
+    
+        johannson~2/3
+    
+    To specify a prefix with the default edit distance::
+    
+        johannson~/3
     """
 
+    expr = rcompile("""
+    (?<=\\S)                          # Only match right after non-space
+    ~                                 # Initial tilde
+    (?P<maxdist>[0-9])?               # Optional maxdist
+    (/                                # Optional prefix slash
+        (?P<prefix>[1-9][0-9]*)       # prefix
+    )?                                # (end prefix group)
+    """, verbose=True)
+
     class FuzzinessNode(syntax.SyntaxNode):
-        def __init__(self, maxdist):
+        def __init__(self, maxdist, prefix, original):
             self.maxdist = maxdist
+            self.prefix = prefix
+            self.original = original
 
         def __repr__(self):
             return "<~%d>" % (self.maxdist,)
     class FuzzyTermNode(syntax.TextNode):
         qclass = query.FuzzyTerm
 
-        def __init__(self, wordnode, maxdist):
+        def __init__(self, wordnode, maxdist, prefix):
             self.fieldname = wordnode.fieldname
             self.text = wordnode.text
             self.boost = wordnode.boost
             self.startchar = wordnode.startchar
             self.endchar = wordnode.endchar
             self.maxdist = maxdist
+            self.prefix = prefix
 
         def r(self):
             return "%s ~%d" % (self.text, self.maxdist)
 
         def query(self, parser):
+            # Use the superclass's query() method to create a FuzzyTerm query
+            # (it looks at self.qclass), just because it takes care of some
+            # extra checks and attributes
             q = syntax.TextNode.query(self, parser)
+            # Set FuzzyTerm-specific attributes
             q.maxdist = self.maxdist
+            q.prefix = self.prefix
             return q
 
-    def __init__(self, expr="[~](?P<maxdist>[0-9])?"):
-        self.expr = rcompile(expr)
-
     def create(self, parser, match):
         mdstr = match.group("maxdist")
-        if mdstr is None or mdstr == '':
-            maxdist = 1
-        else:
-            maxdist = int(mdstr)
-        return self.FuzzinessNode(maxdist)
+        maxdist = int(mdstr) if mdstr else 1
+
+        pstr = match.group("prefix")
+        prefix = int(pstr) if pstr else 0
+
+        return self.FuzzinessNode(maxdist, prefix, match.group(0))
 
     def filters(self, parser):
         return [(self.do_fuzzyterms, 0)]
 
     def do_fuzzyterms(self, parser, group):
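+        # Merge each WordNode that is immediately followed by a FuzzinessNode
+        # into a single FuzzyTermNode (consuming the fuzziness marker); a
+        # stray fuzziness marker with no preceding word is converted back into
+        # a plain word node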
         newgroup = group.empty_copy()
-        for i, node in enumerate(group):
+        i = 0
+        while i < len(group):
+            node = group[i]
             if i < len(group) - 1 and isinstance(node, syntax.WordNode):
                 nextnode = group[i + 1]
                 if isinstance(nextnode, self.FuzzinessNode):
-                    node = self.FuzzyTermNode(node, nextnode.maxdist)
+                    node = self.FuzzyTermNode(node, nextnode.maxdist,
+                                              nextnode.prefix)
+                    i += 1
             if isinstance(node, self.FuzzinessNode):
-                continue
+                node = syntax.to_word(node)
             if isinstance(node, syntax.GroupNode):
                 node = self.do_fuzzyterms(parser, node)
 
             newgroup.append(node)
+            i += 1
         return newgroup
 
 
 
     class PhraseTagger(RegexTagger):
         def create(self, parser, match):
-            return PhrasePlugin.PhraseNode(match.group("text"),
-                                           match.start("text"))
+            text = match.group("text")
+            textstartchar = match.start("text")
+            slopstr = match.group("slop")
+            slop = int(slopstr) if slopstr else 1
+            return PhrasePlugin.PhraseNode(text, textstartchar, slop)
 
-    def __init__(self, expr='"(?P<text>.*?)"'):
+    def __init__(self, expr='"(?P<text>.*?)"(~(?P<slop>[1-9][0-9]*))?'):
         self.expr = expr
 
     def taggers(self, parser):
         return [(self.PhraseTagger(self.expr), 0)]
 
 
-class ComplexPhrasePlugin(Plugin):
+class SequencePlugin(Plugin):
     """Adds the ability to group arbitrary queries inside double quotes to
     produce a query matching the individual sub-queries in sequence.
     
     
         qp = qparser.QueryParser("field", my_schema)
         qp.remove_plugin_class(qparser.PhrasePlugin)
-        qp.add_plugin(qparser.ComplexPhrasePlugin())
+        qp.add_plugin(qparser.SequencePlugin())
     
     This enables parsing "phrases" such as::
     
         "(jon OR john OR jonathan~1) smith*"
     """
 
-    def __init__(self, expr='["]'):
+    def __init__(self, expr='["](~(?P<slop>[1-9][0-9]*))?'):
         """
         :param expr: a regular expression for the marker at the start and end
             of a phrase. The default is the double-quotes character.
         """
 
         self.expr = expr
 
-    class ComplexPhraseNode(syntax.GroupNode):
+    class SequenceNode(syntax.GroupNode):
         qclass = query.Sequence
 
     class QuoteNode(syntax.MarkerNode):
-        pass
-
-    class QuoteTagger(RegexTagger):
-        def create(self, parser, match):
-            return ComplexPhrasePlugin.QuoteNode()
+        def __init__(self, slop=None):
+            self.slop = int(slop) if slop else 1
 
     def taggers(self, parser):
-        return [(self.QuoteTagger(self.expr), 0)]
+        return [(FnTagger(self.expr, self.QuoteNode, "quote"), 0)]
 
     def filters(self, parser):
         return [(self.do_quotes, 650)]
 
     def do_quotes(self, parser, group):
+        # New group to copy nodes into
         newgroup = group.empty_copy()
-        phrase = None
+        # Buffer for sequence nodes; when it's None, it means we're not in
+        # a sequence
+        seq = None
+
+        # Start copying nodes from group to newgroup. When we find a quote
+        # node, start copying nodes into the buffer instead. When we find
+        # the next (end) quote, put the buffered nodes into a SequenceNode
+        # and add it to newgroup.
         for node in group:
             if isinstance(node, syntax.GroupNode):
+                # Recurse
                 node = self.do_quotes(parser, node)
 
             if isinstance(node, self.QuoteNode):
-                if phrase is None:
-                    # Start a new phrase
-                    phrase = []
+                if seq is None:
+                    # Start a new sequence
+                    seq = []
                 else:
-                    # End the current phrase
-                    newgroup.append(self.ComplexPhraseNode(phrase))
-                    phrase = None
-            elif phrase is None:
-                # Not in a phrase, add directly
+                    # End the current sequence
+                    sn = self.SequenceNode(seq, slop=node.slop)
+                    newgroup.append(sn)
+                    seq = None
+            elif seq is None:
+                # Not in a sequence, add directly
                 newgroup.append(node)
             else:
-                # In a phrase, add it to the buffer
-                phrase.append(node)
+                # In a sequence, add it to the buffer
+                seq.append(node)
 
         # We can end up with buffered nodes if there was an unbalanced quote;
-        # just put the nodes back into the group
-        if phrase is not None:
-            newgroup.extend(phrase)
+        # just add the buffered nodes directly to newgroup
+        if seq is not None:
+            newgroup.extend(seq)
 
         return newgroup
 

tests/test_parse_plugins.py

                  '(content:alfa AND (reverse:bravo OR reverse:ovarb))')
 
 
-def test_complex_phrase():
+def test_fuzzy_plugin():
+    ana = analysis.StandardAnalyzer("\\S+")
+    schema = fields.Schema(f=fields.TEXT(analyzer=ana))
+    qp = default.QueryParser("f", schema)
+    qp.add_plugin(plugins.FuzzyTermPlugin())
+
+    q = qp.parse("bob~")
+    assert_equal(q.__class__, query.FuzzyTerm)
+    assert_equal(q.fieldname, "f")
+    assert_equal(q.text, "bob")
+    assert_equal(q.maxdist, 1)
+
+    q = qp.parse("Alfa Bravo~ Charlie")
+    assert_equal(q.__class__, query.And)
+    assert_equal(q[0].__class__, query.Term)
+    assert_equal(q[0].text, "alfa")
+    assert_equal(q[1].__class__, query.FuzzyTerm)
+    assert_equal(q[1].fieldname, "f")
+    assert_equal(q[1].text, "bravo")
+    assert_equal(q[1].maxdist, 1)
+    assert_equal(q[2].__class__, query.Term)
+    assert_equal(q[2].text, "charlie")
+
+    q = qp.parse("Alfa Bravo~2 Charlie")
+    assert_equal(q.__class__, query.And)
+    assert_equal(q[0].__class__, query.Term)
+    assert_equal(q[0].text, "alfa")
+    assert_equal(q[1].__class__, query.FuzzyTerm)
+    assert_equal(q[1].fieldname, "f")
+    assert_equal(q[1].text, "bravo")
+    assert_equal(q[1].maxdist, 2)
+    assert_equal(q[2].__class__, query.Term)
+    assert_equal(q[2].text, "charlie")
+
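+    # The tilde must immediately follow a term; preceded by whitespace it is
+    # not fuzzy syntax, so "~2" parses as an ordinary term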
+    q = qp.parse("alfa ~2 bravo")
+    assert_equal(q.__class__, query.And)
+    assert_equal(q[0].__class__, query.Term)
+    assert_equal(q[0].text, "alfa")
+    assert_equal(q[1].__class__, query.Term)
+    assert_equal(q[1].text, "~2")
+    assert_equal(q[2].__class__, query.Term)
+    assert_equal(q[2].text, "bravo")
+
+    qp = default.QueryParser("f", None)
+    q = qp.parse("'bob~'")
+    assert_equal(q.__class__, query.Term)
+    assert_equal(q.fieldname, "f")
+    assert_equal(q.text, "bob~")
+
+
+def test_function_plugin():
+    class FakeQuery(query.Query):
+        def __init__(self, children, *args, **kwargs):
+            self.children = children
+            self.args = args
+            self.kwargs = kwargs
+            self.fieldname = None
+
+        def __hash__(self):
+            return hash(tuple(self.children)) ^ hash(self.args)
+
+        def __unicode__(self):
+            qs = "|".join(str(q) for q in self.children)
+            args = ",".join(self.args)
+            kwargs = ",".join("%s:%s" % item for item in self.kwargs.items())
+            return u("<%s %s %s>") % (qs, args, kwargs)
+
+        __str__ = __unicode__
+
+    def fuzzy(qs, prefix=0, maxdist=2):
+        prefix = int(prefix)
+        maxdist = int(maxdist)
+        return query.FuzzyTerm(qs[0].fieldname, qs[0].text,
+                               prefixlength=prefix, maxdist=maxdist)
+
+    fp = plugins.FunctionPlugin({"foo": FakeQuery, "fuzzy": fuzzy})
+    qp = default.QueryParser("f", None)
+    qp.add_plugin(fp)
+
+    def check(qstring, target):
+        q = qp.parse(u(qstring), normalize=False)
+        assert_equal(str(q), target)
+
+    check("alfa #foo charlie delta",
+          "(f:alfa AND <  > AND f:charlie AND f:delta)")
+
+    check("alfa #foo(charlie delta) echo",
+          "(f:alfa AND <f:charlie|f:delta  > AND f:echo)")
+
+    check("alfa #foo(charlie AND delta) echo",
+          "(f:alfa AND <(f:charlie AND f:delta)  > AND f:echo)")
+
+    check("alfa #foo[a] charlie delta",
+          "(f:alfa AND < a > AND f:charlie AND f:delta)")
+
+    check("alfa #foo[a, b](charlie delta) echo",
+          "(f:alfa AND <f:charlie|f:delta a,b > AND f:echo)")
+
+    check("alfa #foo[a,b,c=d](charlie AND delta) echo",
+          "(f:alfa AND <(f:charlie AND f:delta) a,b c:d> AND f:echo)")
+
+    check("alfa #foo[a,b,c=d]() (charlie AND delta)",
+          "(f:alfa AND < a,b c:d> AND ((f:charlie AND f:delta)))")
+
+    check("alfa #foo[a=1,b=2](charlie AND delta)^2.0 echo",
+          "(f:alfa AND <(f:charlie AND f:delta)  a:1,b:2,boost:2.0> AND f:echo)")
+
+    check("alfa #fuzzy[maxdist=2](bravo) charlie",
+          "(f:alfa AND f:bravo~2 AND f:charlie)")
+
+
+def test_sequence_plugin():
     qp = default.QueryParser("f", None)
     qp.remove_plugin_class(plugins.PhrasePlugin)
     qp.add_plugin(plugins.FuzzyTermPlugin())
-    qp.add_plugin(plugins.ComplexPhrasePlugin())
+    qp.add_plugin(plugins.SequencePlugin())
 
     q = qp.parse(u('alfa "bravo charlie~2 (delta OR echo)" foxtrot'))
     assert_equal(q.__unicode__(),
     assert_equal(q[3][1].__class__, query.TermRange)
     assert_equal(q[4].text, "echo")
 
+    q = qp.parse(u('alfa "bravo charlie~3"~2 delta'))
+    assert_equal(q[1].__class__, query.Sequence)
+    assert_equal(q[1].slop, 2)
+    assert_equal(q[1][1].__class__, query.FuzzyTerm)
+    assert_equal(q[1][1].maxdist, 3)
 
 
 

tests/test_parsing.py

     assert_equal(repr(p.process('"b c"')),
                  "<AndGroup <None:PhraseNode 'b c'~1>>")
 
+    q = p.parse('alfa "bravo charlie"~2 delta')
+    assert_equal(q[1].__class__, query.Phrase)
+    assert_equal(q[1].words, ["bravo", "charlie"])
+    assert_equal(q[1].slop, 2)
+
 
 def test_groups():
     p = default.QueryParser("t", None, [plugins.WhitespacePlugin(),
     assert_equal(q[1], query.Term("f", "B"))
 
 
-def test_fuzzy_plugin():
-    schema = fields.Schema(f=fields.TEXT)
-    qp = default.QueryParser("f", schema)
-    qp.add_plugin(plugins.FuzzyTermPlugin())
 
-    q = qp.parse("bob~")
-    assert_equal(q.__class__, query.FuzzyTerm)
-    assert_equal(q.fieldname, "f")
-    assert_equal(q.text, "bob")
-    assert_equal(q.maxdist, 1)
 
-    q = qp.parse("Alfa Bravo~ Charlie")
-    assert_equal(q.__class__, query.And)
-    assert_equal(q[0].__class__, query.Term)
-    assert_equal(q[0].text, "alfa")
-    assert_equal(q[1].__class__, query.FuzzyTerm)
-    assert_equal(q[1].fieldname, "f")
-    assert_equal(q[1].text, "bravo")
-    assert_equal(q[1].maxdist, 1)
-    assert_equal(q[2].__class__, query.Term)
-    assert_equal(q[2].text, "charlie")
 
-    q = qp.parse("Alfa Bravo~2 Charlie")
-    assert_equal(q.__class__, query.And)
-    assert_equal(q[0].__class__, query.Term)
-    assert_equal(q[0].text, "alfa")
-    assert_equal(q[1].__class__, query.FuzzyTerm)
-    assert_equal(q[1].fieldname, "f")
-    assert_equal(q[1].text, "bravo")
-    assert_equal(q[1].maxdist, 2)
-    assert_equal(q[2].__class__, query.Term)
-    assert_equal(q[2].text, "charlie")
 
-    qp = default.QueryParser("f", None)
-    q = qp.parse("'bob~'")
-    assert_equal(q.__class__, query.Term)
-    assert_equal(q.fieldname, "f")
-    assert_equal(q.text, "bob~")
 
 
-def test_function_plugin():
-    class FakeQuery(query.Query):
-        def __init__(self, children, *args, **kwargs):
-            self.children = children
-            self.args = args
-            self.kwargs = kwargs
-            self.fieldname = None
 
-        def __hash__(self):
-            return hash(tuple(self.children)) ^ hash(self.args)
 
-        def __unicode__(self):
-            qs = "|".join(str(q) for q in self.children)
-            args = ",".join(self.args)
-            kwargs = ",".join("%s:%s" % item for item in self.kwargs.items())
-            return u("<%s %s %s>") % (qs, args, kwargs)
 
-        __str__ = __unicode__
 
-    def fuzzy(qs, prefix=0, maxdist=2):
-        prefix = int(prefix)
-        maxdist = int(maxdist)
-        return query.FuzzyTerm(qs[0].fieldname, qs[0].text,
-                               prefixlength=prefix, maxdist=maxdist)
 
-    fp = plugins.FunctionPlugin({"foo": FakeQuery, "fuzzy": fuzzy})
-    qp = default.QueryParser("f", None)
-    qp.add_plugin(fp)
 
-    def check(qstring, target):
-        q = qp.parse(u(qstring), normalize=False)
-        assert_equal(str(q), target)
 
-    check("alfa #foo charlie delta",
-          "(f:alfa AND <  > AND f:charlie AND f:delta)")
 
-    check("alfa #foo(charlie delta) echo",
-          "(f:alfa AND <f:charlie|f:delta  > AND f:echo)")
-
-    check("alfa #foo(charlie AND delta) echo",
-          "(f:alfa AND <(f:charlie AND f:delta)  > AND f:echo)")
-
-    check("alfa #foo[a] charlie delta",
-          "(f:alfa AND < a > AND f:charlie AND f:delta)")
-
-    check("alfa #foo[a, b](charlie delta) echo",
-          "(f:alfa AND <f:charlie|f:delta a,b > AND f:echo)")
-
-    check("alfa #foo[a,b,c=d](charlie AND delta) echo",
-          "(f:alfa AND <(f:charlie AND f:delta) a,b c:d> AND f:echo)")
-
-    check("alfa #foo[a,b,c=d]() (charlie AND delta)",
-          "(f:alfa AND < a,b c:d> AND ((f:charlie AND f:delta)))")
-
-    check("alfa #foo[a=1,b=2](charlie AND delta)^2.0 echo",
-          "(f:alfa AND <(f:charlie AND f:delta)  a:1,b:2,boost:2.0> AND f:echo)")
-
-    check("alfa #fuzzy[maxdist=2](bravo) charlie",
-          "(f:alfa AND f:bravo~2 AND f:charlie)")
-
-