Issue #174 resolved

ReaderCorrector can lose original word, even if it's in the index

Anonymous created an issue

I've hit a few situations where the ReaderCorrector doesn't include the original word in its suggestions, even if that word occurs (with comparatively high frequency) in the index.

For instance, with a given index and field, I'm calling {{{reader_corrector.suggest(u"lost")}}} and getting the suggestions {{{[u'most', u'loss', u'lose', u'lots', u'los']}}}.

Calling {{{reader_corrector._suggestions(u"lost", 1, 0, set())}}} gives

{{{
[((1, -123.0), u'lost'), ((1, -3.0), u'los'), ((1, -65.0), u'lose'),
 ((1, -101.0), u'loss'), ((1, -24.0), u'lots'), ((1, -23.0), u'lot'),
 ((1, -141.0), u'last'), ((1, -1.0), u'lest'), ((1, -71.0), u'list'),
 ((1, -216.0), u'cost'), ((1, -4.0), u'host'), ((1, -266.0), u'most'),
 ((1, -76.0), u'post')]
}}}

So I'd expect the suggestions to be [lost, most, cost, last, loss], i.e. the original word first, followed by the other candidates in order of decreasing frequency.

Looking at spellings.py, two things seem to be wrong:

i) spellings.py line 74 should maybe be using {{{xrange(0, maxdist+1)}}}, so that the original term (edit distance 0) is automatically first if it's in the index

ii) the lines

{{{
elif item < heap[0]:
    heapreplace(heap, item)
}}}

mean that if item is better than the best so far, the best is replaced with item. It should probably be the other way round: if item is better than the worst so far, replace the worst.
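To make point ii) concrete, here is a minimal sketch of keeping the k best candidates with {{{heapq}}}. It is not the Whoosh source: {{{best_k}}} is a hypothetical helper, and the only data used is the scored list quoted above. Since {{{heapq}}} is a min-heap, the sketch stores negated keys so that {{{heap[0]}}} is the worst kept candidate, which is the one that should be evicted.

```python
import heapq

# Scored candidates copied from the _suggestions() output above:
# ((edit distance, negated frequency), word); smaller tuples are better.
scored = [
    ((1, -123.0), u'lost'), ((1, -3.0), u'los'), ((1, -65.0), u'lose'),
    ((1, -101.0), u'loss'), ((1, -24.0), u'lots'), ((1, -23.0), u'lot'),
    ((1, -141.0), u'last'), ((1, -1.0), u'lest'), ((1, -71.0), u'list'),
    ((1, -216.0), u'cost'), ((1, -4.0), u'host'), ((1, -266.0), u'most'),
    ((1, -76.0), u'post'),
]


def best_k(items, k):
    """Keep the k smallest items (hypothetical helper, not Whoosh code)."""
    heap = []  # min-heap over negated keys, so heap[0] is the WORST kept item
    for (dist, score), word in items:
        entry = (-dist, -score, word)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry is better than the worst kept one: evict the worst.
            heapq.heapreplace(heap, entry)
    # Undo the negation and return the survivors best-first.
    return sorted(((-d, -s), w) for d, s, w in heap)


top = [word for _, word in best_k(scored, 5)]
```

On this data the sketch keeps most, cost, last, lost, loss, which is the same selection that {{{heapq.nsmallest(5, scored)}}} returns. Whether the original word should then be moved to the front or dropped is a separate API decision.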

Comments (4)

  1. Matt Chaput repo owner

    There was a definite bug in sorting the suggestions which I've fixed. But the intended API was for the original word to never appear in the suggested corrections (even if it appears in the field), so I fixed that as well. That seems more in line with what I'd expect of a spelling suggestion method, but if you have a use-case you want to argue I'm happy to listen.

  2. david_s

    Ah, my mistake was thinking of it as being for silently autocorrecting rather than suggesting alternatives - I'd been overwriting each word in the query with the top suggestion and assuming that anything that was a correctly spelt word would be the top suggestion for itself.

    I think this is a reasonable use case so it might be a nice option to have, but it's possible to work around anyway.
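One possible shape for that workaround, sketched under assumptions: {{{autocorrect}}}, {{{is_known}}} and {{{suggest}}} are hypothetical names, not part of Whoosh. With Whoosh, {{{is_known}}} would presumably wrap a term lookup against the index and {{{suggest}}} would wrap {{{reader_corrector.suggest}}}; here both are replaced by stand-ins so the example runs on its own.

```python
def autocorrect(words, is_known, suggest):
    """Replace only the words that are not already known terms.

    is_known(word) -> bool and suggest(word) -> list are supplied by the
    caller (hypothetical callables for this sketch).
    """
    corrected = []
    for word in words:
        if is_known(word):
            corrected.append(word)  # correctly spelt: keep it as typed
            continue
        suggestions = suggest(word)
        # Fall back to the original word when there is nothing to suggest.
        corrected.append(suggestions[0] if suggestions else word)
    return corrected


# Stand-ins for illustration only; a real caller would query the index.
vocabulary = {u"lost", u"keys"}
fake_suggestions = {u"kees": [u"keys", u"knees"]}

result = autocorrect(
    [u"lost", u"kees"],
    is_known=lambda w: w in vocabulary,
    suggest=lambda w: fake_suggestions.get(w, []),
)
```

Checking membership first means a correctly spelt word is never overwritten, regardless of whether the suggestion list includes the original word.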
