query correction gives bad query string

Thomas Waldmann created an issue


    # q is the parsed query, query is the original query string
    with flaskg.storage.searcher(all_revs=history) as searcher:
        corrected = searcher.correct_query(q, query)
        if corrected.query != q:
            print("Original query string: %r" % query)
            print("Original query: %r" % q)
            print("Corrected query string: %r" % corrected.string)
            print("Corrected query: %r" % corrected.query)


Original query string: u'hardwa'
Original query: Or([Term('name_exact', u'hardwa'), Term('name', u'hardwa'), Term('content', u'hardwa')])

Corrected query string: u'hardwarehardware'
Corrected query: Or([Term('name_exact', u'hardwa'), Term('name', u'hardware'), Term('content', u'hardware')])

As you can see, the query string is corrected to "hardwarehardware" rather than the expected "hardware".

The corrected query is correct, though.
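
One plausible explanation for the doubled word, sketched below, is that the multi-field parser produces one term per field for the same span of the query string, and the corrected string is rebuilt by emitting every corrected token in turn. This is only an illustration under that assumption, not Whoosh's actual code; `Token`, `naive_splice`, and `dedup_splice` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for a corrected token: its field, its character
# span in the original query string, and the replacement text.
Token = namedtuple("Token", "fieldname startchar endchar text")


def naive_splice(qstring, tokens):
    """Rebuild the string by emitting every corrected token in order."""
    out = []
    index = 0
    for t in sorted(tokens, key=lambda t: t.startchar):
        out.append(qstring[index:t.startchar])  # untouched text before the token
        out.append(t.text)                      # the replacement word
        index = t.endchar
    out.append(qstring[index:])
    return "".join(out)


def dedup_splice(qstring, tokens):
    """Same as naive_splice, but skip tokens whose span was already replaced."""
    out = []
    index = 0
    for t in sorted(tokens, key=lambda t: t.startchar):
        if t.startchar < index:
            continue  # another field's token already covered this span
        out.append(qstring[index:t.startchar])
        out.append(t.text)
        index = t.endchar
    out.append(qstring[index:])
    return "".join(out)


# Default-field case: both fields share the span 0..6 of u"hardwa".
shared = [Token("name", 0, 6, "hardware"), Token("content", 0, 6, "hardware")]
doubled = naive_splice("hardwa", shared)
single = dedup_splice("hardwa", shared)
```

With the two tokens sharing one span, the naive version emits the replacement twice, while skipping already-covered spans yields a single word. Skipping only picks one winner, so it does not address the case where fields disagree on the correction.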

Comments (4)

  1. Thomas Waldmann

    Works better with explicit (not default) field names:

    Original query string: u'name:hardwa OR content:mainboa'
    Original query: Or([Term(u'name', u'hardwa'), Term(u'content', u'mainboa')])
    Corrected query string: u'name:hardware OR content:mainboard'
    Corrected query: Or([Term(u'name', u'hardware'), Term(u'content', u'mainboard')])
  2. Matt Chaput
    • changed status to open

    Sorry it took so long for me to get to this... work :(

    I can see where query correction is interacting poorly with MultifieldParser, but I'm not sure what the correct behavior is. What if "hardwa" gets corrected to different things in different fields? E.g. "hardware" in the content field and "hard" in title? What should the query string be corrected to? One answer is

    hardwa

    ...gets corrected to

    content:hardware title:hard

    ...but I don't think that's likely to be what the user intended. It also might be difficult to implement in the existing Whoosh architecture.

    I think the best workaround is to correct a version of the query parsed with a single-field QueryParser with the default field set to the field you care about, and then feed the corrected query back into the MultifieldParser.

    Untested code:

    from whoosh import qparser

    # Assumes `schema` is the index schema and `s` is an open searcher
    qtext = u"hardwa"
    # The "real" parser
    qp = qparser.MultifieldParser(["name", "content"], schema)
    q = qp.parse(qtext)
    # A single-field parser
    qp1 = qparser.QueryParser("content", schema)
    q1 = qp1.parse(qtext)
    # Correct the single-field version
    corrected = s.correct_query(q1, qtext)
    if corrected.query != q1:
        print("Original query string: %r" % qtext)
        print("Original query: %r" % q)
        print("Corrected query string: %r" % corrected.string)
        print("Corrected query: %r" % corrected.query)
        # Use the multi-field parser to parse the corrected string
        q = qp.parse(corrected.string)
  3. Thomas Waldmann

    I guess you meant:

    (content:hardware OR title:hard)

    And if "correction" is defined as picking the best suggestion on a per-field basis, then it sounds reasonable that the result could differ between fields. As it is an OR, it could still improve results. Of course, as with every "machine correction", there is no guarantee that it will improve results.

    If it can't be fixed due to the architecture, then the issue should be documented, so people don't use MultifieldParser for correction.
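
The per-field form discussed in this thread can be sketched as a small helper that joins one corrected word per field into an explicit-field query string. This is a minimal sketch, assuming the per-field suggestions have already been obtained; `per_field_query_string` is a hypothetical helper, not part of Whoosh.

```python
def per_field_query_string(suggestions, operator="OR"):
    """Join per-field corrections into one explicit-field query string.

    `suggestions` is a list of (fieldname, corrected_word) pairs, one pair
    for each field the original word was searched in.
    """
    clauses = ["%s:%s" % (field, word) for field, word in suggestions]
    if len(clauses) == 1:
        return clauses[0]
    separator = " %s " % operator
    return "(%s)" % separator.join(clauses)


# The example from the thread: "hardwa" corrected differently per field.
corrected_string = per_field_query_string(
    [("content", "hardware"), ("title", "hard")]
)
```

Because every clause names its field explicitly, the resulting string can be parsed again without relying on the parser's default fields, which matches the observation above that explicit field names are corrected properly.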
