query correction gives bad query string

Issue #194 open
Thomas Waldmann
created an issue

Code:

{{{
with flaskg.storage.searcher(all_revs=history) as searcher:
    corrected = searcher.correct_query(q, query)
    if corrected.query != q:
        print "Original query string: %r" % query
        print "Original query: %r" % q
        print "Corrected query string: %r" % corrected.string
        print "Corrected query: %r" % corrected.query
}}}

Output:

{{{
Original query string: u'hardwa'
Original query: Or([Term('name_exact', u'hardwa'), Term('name', u'hardwa'), Term('content', u'hardwa')])
Corrected query string: u'hardwarehardware'
Corrected query: Or([Term('name_exact', u'hardwa'), Term('name', u'hardware'), Term('content', u'hardware')])
}}}

As you can see, the query string is corrected to "hardwarehardware" rather than the expected "hardware".

The corrected query is correct, though.

Comments (4)

  1. Thomas Waldmann reporter

    Works better with explicit (not default) field names:

    Original query string: u'name:hardwa OR content:mainboa'
    Original query: Or([Term(u'name', u'hardwa'), Term(u'content', u'mainboa')])
    Corrected query string: u'name:hardware OR content:mainboard'
    Corrected query: Or([Term(u'name', u'hardware'), Term(u'content', u'mainboard')])
  2. Matt Chaput repo owner
    • changed status to open

    Sorry it took so long for me to get to this... work :(

    I can see where query correction is interacting poorly with MultifieldParser, but I'm not sure what the correct behavior is. What if "hardwa" gets corrected to different things in different fields, e.g. "hardware" in the content field and "hard" in the title field? What should the query string be corrected to? One answer is that

    hardwa

    ...gets corrected to

    content:hardware title:hard

    ...but I don't think that's likely to be what the user intended. It also might be difficult to implement in the existing Whoosh architecture.

    I think the best workaround is to correct a version of the query parsed with a single-field QueryParser with the default field set to the field you care about, and then feed the corrected query back into the MultifieldParser.

    Untested code:

    from whoosh import qparser

    # Assumes `schema` is the index schema and `s` is an open Searcher
    qtext = u"hardwa"
    # The "real" parser
    qp = qparser.MultifieldParser(["name", "content"], schema)
    q = qp.parse(qtext)
    # A single-field parser
    qp1 = qparser.QueryParser("content", schema)
    q1 = qp1.parse(qtext)
    # Correct the single-field version
    corrected = s.correct_query(q1, qtext)
    if corrected.query != q1:
        print("Original query string: %r" % qtext)
        print("Original query: %r" % q)
        print("Corrected query string: %r" % corrected.string)
        print("Corrected query: %r" % corrected.query)
        # Use the multi-field parser to parse the corrected string
        q = qp.parse(corrected.string)
  3. Thomas Waldmann reporter

    I guess you meant:

    (content:hardware OR title:hard)

    And if "correction" is defined as picking the best suggestion on a per-field basis, then it sounds reasonable that the result could differ between fields. As it is an OR, it could still improve results. Of course, as with every "machine correction", there is no guarantee that it will improve results.
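To make the per-field idea concrete, here is a minimal sketch in plain Python. It does not use Whoosh and is not how Whoosh implements correction: the helper names `correct_per_field` and `to_query_string`, the toy per-field vocabularies, and the use of `difflib` as the "best suggestion" picker are all assumptions made for illustration only.

```python
import difflib


def correct_per_field(word, vocab_by_field, cutoff=0.6):
    """Return {field: best suggestion} for each field that has a close match.

    Hypothetical helper: difflib stands in for a real per-field corrector.
    """
    corrections = {}
    for field, vocab in vocab_by_field.items():
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        if matches:
            corrections[field] = matches[0]
    return corrections


def to_query_string(corrections):
    """Join the per-field corrections into a single OR group."""
    clauses = ["%s:%s" % (field, word)
               for field, word in sorted(corrections.items())]
    return "(" + " OR ".join(clauses) + ")"


# Toy vocabularies (assumed): each field knows different words.
vocab_by_field = {
    "content": ["hardware", "software"],
    "title": ["hard", "disk"],
}

corrections = correct_per_field("hardwa", vocab_by_field)
print(to_query_string(corrections))  # (content:hardware OR title:hard)
```

With these toy vocabularies the single input word expands into one clause per field, which is the shape of corrected string discussed above. Whether that expansion matches what the user intended is a separate question.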

    If it can't be fixed due to the architecture, then the issue should be documented, so people don't use MultifieldParser for correction.
