Issue #158 resolved

multitoken_query - strange defaults lead to strange results

Thomas Waldmann
created an issue

We (at MoinMoin) stumbled over strange whoosh behaviour that was easily explainable after we found multitoken_query. :)

I have seen you added this in whoosh 1.5 and defaulted it to "first" everywhere for compatibility reasons. While I can understand trying to be compatible, it looks rather like a bug to me that should be fixed (by default) and not kept for compatibility.

E.g. if one has a TEXT field and indexes "foo bar baz", it gets tokenized to "foo", "bar", "baz" and put into index.

If one does a query then for "foo bar", it'll tokenize that into "foo", "bar" and then throw away the "bar" because of multitoken_query="first" default and search only for "foo", embarrassing the user with strange search results.

I only discovered this by using the source. Afterwards I also found some docs about it, but IIRC I didn't see it in the tutorials or anywhere else except the FieldType docs (which one usually discovers rather late).

So, how could one improve this?

default to "and" (like when using multiple terms, they are also ANDed by default. "or" is usually stupid/annoying. "phrase" might also make sense, but maybe not as a default.)

If you only have one token, AND(token) is the same as "first" behaviour, so maybe this is good enough for compatibility? And if a user gives more than one token, they probably expect whoosh to make use of it. :)
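To make the compatibility argument concrete, here is a minimal toy model in plain Python. It is not whoosh code: `matches`, `search`, and the `mode` argument are hypothetical names that only mimic the "first", "and", and "or" behaviours described above, under the assumption that a document is just a set of tokens.

```python
# Toy model only: this is NOT whoosh code, just a sketch of the semantics
# discussed in this issue. All names here are hypothetical.

def matches(doc_tokens, query_tokens, mode):
    """Return True if a document matches the query tokens under `mode`."""
    doc = set(doc_tokens)
    if mode == "first":
        # Only the first query token is used; the rest are dropped.
        return query_tokens[0] in doc
    if mode == "and":
        return all(tok in doc for tok in query_tokens)
    if mode == "or":
        return any(tok in doc for tok in query_tokens)
    raise ValueError("unknown mode: %r" % (mode,))


# Two tiny "indexed documents", tokenized on whitespace.
docs = {
    "doc1": "foo bar baz".split(),
    "doc2": "foo qux".split(),
}


def search(query, mode):
    """Return the sorted names of documents matching a whitespace-split query."""
    tokens = query.split()
    return sorted(name for name, toks in docs.items()
                  if matches(toks, tokens, mode))


first_hits = search("foo bar", "first")   # "bar" is silently ignored
and_hits = search("foo bar", "and")       # both tokens are required
single_first = search("foo", "first")
single_and = search("foo", "and")

print(first_hits)                  # ['doc1', 'doc2']
print(and_hits)                    # ['doc1']
print(single_first == single_and)  # True
```

In this model a single-token query gives identical results under "first" and "and", so only multi-token queries would change behaviour if the default were switched.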

In any case (no matter whether you change the default or not) document it at a more visible place. tutorial and other "prose" parts of the docs, not just in the FieldType docs.

Comments (11)

  1. Matt Chaput repo owner

    I'm not sure what you mean by 'If one does a query then for "foo bar", it'll tokenize that into "foo", "bar" and then throw away the "bar"'.

    >>> from whoosh import fields, qparser
    >>> schema = fields.Schema(text=fields.TEXT)
    >>> parser = qparser.QueryParser("text", schema)
    >>> parser.parse(u"foo bar")
    And([Term('text', u'foo'), Term('text', u'bar')])
    

    Please let me know what you did, what you expected to happen, and what happened instead.

    Thanks!

  2. Thomas Waldmann reporter

    Hmm, I didn't try EXACTLY that in practice; I was making that up (assuming consistent behaviour). If everything defaults to multitoken_query="first", why does it behave differently here?

    What we have right now is a schema with this:

    name=TEXT(stored=True, multitoken_query="and", analyzer=item_name_analyzer())
    contenttype=TEXT(stored=True, multitoken_query="and", analyzer=MimeTokenizer())
    

    The item name gets tokenized like one would expect for wiki item names (using 2 different IntraWordFilters for index and query, as you show in your docs). Content Types get tokenized into "text", "plain", "charset=utf-8" for example.

    We started without the multitoken_query param and it showed that "first" behaviour, which we did not want. E.g. a search for "text/x-rst" gave all text items, not only the text/x-rst ones.
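The content-type symptom can be sketched the same way. This is a plain-Python illustration, not the MoinMoin or whoosh implementation: `mime_tokens` is a hypothetical stand-in for the MimeTokenizer (its splitting rule is an assumption chosen to reproduce the "text", "plain", "charset=utf-8" tokens mentioned above), and the item names are made up.

```python
import re


def mime_tokens(content_type):
    """Hypothetical stand-in for MimeTokenizer: split on '/' and ';'."""
    parts = re.split(r"[/;]", content_type)
    return [part.strip() for part in parts if part.strip()]


# Made-up items and their content types.
items = {
    "page.txt": "text/plain; charset=utf-8",
    "notes.rst": "text/x-rst; charset=utf-8",
    "logo.png": "image/png",
}
index = {name: set(mime_tokens(ct)) for name, ct in items.items()}


def search(query, mode):
    """Return sorted item names whose tokens contain the query tokens."""
    tokens = mime_tokens(query)
    if mode == "first":
        tokens = tokens[:1]  # everything after the first token is discarded
    return sorted(name for name, toks in index.items()
                  if all(tok in toks for tok in tokens))


first_hits = search("text/x-rst", "first")
and_hits = search("text/x-rst", "and")

print(first_hits)  # ['notes.rst', 'page.txt']
print(and_hits)    # ['notes.rst']
```

With "first", only the "text" token survives, so every text item matches; with "and", the "x-rst" token narrows the result to the one item that was meant.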

  3. Matt Chaput repo owner
    • changed status to open

    Ah, I understand. Yes, "first" is the worst choice, and not worth backwards compatibility. I'll try to change it so the default multitoken_query type is the parser's default grouping.

  4. Thomas Waldmann reporter
    • changed status to new

    I could reproduce it with your code, slightly modified so it uses the analyser, not the highlevel qp:

    >>> from whoosh import fields, qparser
    >>> schema = fields.Schema(text=fields.TEXT)
    >>> parser = qparser.QueryParser("text", schema)
    >>> # The "highlevel query parser" in action.
    >>> parser.parse(u"foo bar")
    And([Term('text', u'foo'), Term('text', u'bar')])

    >>> # Single quotes around it: the analyser in action. Unexpected result.
    >>> parser.parse(u"'foo bar'")
    Term('text', u'foo')

    >>> # multitoken_query="and" instead of the default "first". Expected result.
    >>> schema = fields.Schema(text=fields.TEXT(multitoken_query="and"))
    >>> parser = qparser.QueryParser("text", schema)
    >>> parser.parse(u"'foo bar'")
    And([Term('text', u'foo'), Term('text', u'bar')])
    