Inconsistent fuzzy search results?

Issue #467 new
Stephen Brown
created an issue

I have a database of user details, and I'm using whoosh for the search indexing.

The index contains a searchable 'text' field that combines a user's data, including full name, email address etc.

What I'm finding is that fuzzy search often returns no results for a term whose exact ('sharp') form does match, although sometimes it works.

In the quick unit test below, I search for two surnames, "Busschbach" and "Geldmacher". For each one I first run the sharp query, then a fuzzy query, "bussbach~2" (missing the middle 'ch') or "gelmacher~1" (missing the 'd' in 'geld'), and assert that num_fuzzy_results >= num_sharp_results. The sharp query returns 1 result in both cases.
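As a sanity check on the distances chosen above, here is a minimal sketch that computes the plain Levenshtein distance between each misspelling and the indexed surname. The helper `edit_distance` is my own illustration, not part of whoosh, and whoosh's internal matching may differ in detail (for example in how it handles transpositions or prefixes). It only confirms that both fuzzy terms are within the maximum distance I asked for.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions.

    Illustrative helper only; this is not whoosh's implementation.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (indexed surname, misspelled query term, max distance used in the query)
pairs = [("busschbach", "bussbach", 2), ("geldmacher", "gelmacher", 1)]
for indexed, typed, max_dist in pairs:
    d = edit_distance(indexed, typed)
    print(f"{typed!r} vs {indexed!r}: distance {d}, within ~{max_dist}: {d <= max_dist}")
```

Both pairs fall within their limits (distance 2 and distance 1), so the choice of maximum distance alone should not explain why one query matches and the other does not.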

I actually get the correct 1 result returned in the fuzzy case for "Busschbach", but no results for "Geldmacher". Other test cases fail in the same way.

Note that those two surnames are exactly specified in the 'surname_auto' field, and when I create a query parser for that field instead of 'text', I get the sharp result again, but this time no fuzzy results for either.

Actually, in the 'surname_auto' case, "geldmacher~0" returns the correct 1 result, but "geldmacher~0" in the 'text' case still returns no results. ("Busschbach" works for both here.)

Matt, could you point me to where in the whoosh library code I should look to track down this issue? Or maybe you have an idea why this might be occurring?

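import unittest

from whoosh import index
from whoosh.qparser import FuzzyTermPlugin, QueryParser

# WHOOSH_INDEX_PATH is defined elsewhere in the project.


class FuzzySearchTest(unittest.TestCase):  # class name is illustrative
    def setUp(self):
        self.ix = index.open_dir(WHOOSH_INDEX_PATH)
        self.searcher = self.ix.searcher()

    def tearDown(self):
        self.searcher.close()

    def test_fuzzy_search(self):
        qp = QueryParser("text", schema=self.ix.schema)
        # Enables the "term~N" fuzzy syntax in parsed queries.
        qp.add_plugin(FuzzyTermPlugin())

        # (sharp query, fuzzy query with a deliberate misspelling)
        test_names = [('busschbach', 'bussbach~2'), ('geldmacher', 'gelmacher~1')]

        for (tn_sharp, tn_fuzzy) in test_names:
            q = qp.parse(tn_sharp)
            sharp_results = self.searcher.search(q)
            num_sharp_results = len(sharp_results)

            q = qp.parse(tn_fuzzy)
            fuzzy_results = self.searcher.search(q)
            num_fuzzy_results = len(fuzzy_results)

            # The fuzzy query should match at least what the sharp query does.
            self.assertGreaterEqual(num_fuzzy_results, num_sharp_results)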
