English Snowball stemmer returns special words as <str> instead of <unicode>

Issue #476 new
Matthijs van der Klip created an issue

I have found that the English Snowball stemmer tries to return special words like 'news', 'atlas', etc. untouched. However, instead of returning the original word, it returns the matching entry from the special words list, which is a <str> rather than the original (presumably <unicode>) word.

Original code in whoosh/lang/snowball/english.py:

if word in self.__special_words:
    return self.__special_words[word]

Should be changed to:

if word in self.__special_words:
    return word
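One caveat with returning `word` directly: the Snowball English algorithm's exception list also maps some words to a different stem (for example 'skis' to 'ski'). If `__special_words` in whoosh follows that list, which is an assumption here, returning the input unchanged would drop those mappings. A minimal sketch of an alternative that keeps the lookup but coerces the result to the text type; `SPECIAL_WORDS`, `text_type` and `stem_special` are hypothetical names for illustration, not whoosh's code:

```python
import sys

# Text type: `unicode` on Python 2, `str` on Python 3.
# The conditional is lazy, so `unicode` is never evaluated on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821

# Hypothetical excerpt of a special-words table (an assumption, not whoosh's
# actual table). Plain literals are <str> on Python 2.
SPECIAL_WORDS = {"skis": "ski", "news": "news", "atlas": "atlas"}


def stem_special(word):
    """Return the special-case stem as text, or None if `word` is not special."""
    if word not in SPECIAL_WORDS:
        return None
    # Coerce the table entry so callers always get the text type back,
    # while still honouring entries whose stem differs from the word.
    return text_type(SPECIAL_WORDS[word])
```

With this variant, `stem_special(u"news")` gives back text equal to the input, and `stem_special(u"skis")` still yields the shorter stem.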

I found this while combining the LanguageAnalyzer with a CharsetFilter. The latter raises an error for special words because it expects <unicode> text rather than <str>:

TypeError: expected a string or other character buffer object

As a temporary workaround, I put a custom UnicodeFilter between the LanguageAnalyzer and the CharsetFilter so that words are forced back to <unicode> before they reach the CharsetFilter.
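The workaround filter is not shown in this report, so the following is only a rough sketch of what such a filter might look like. It is written without importing whoosh so that it runs on its own: `FakeToken` is a hypothetical stand-in that models only the `.text` attribute of a token, and the `encoding` parameter is an assumption.

```python
class FakeToken(object):
    """Hypothetical stand-in for an analysis token; only `.text` is modelled."""

    def __init__(self, text):
        self.text = text


class UnicodeFilter(object):
    """Force token text back to the text type before later filters see it."""

    def __init__(self, encoding="ascii"):
        # Assumption: the special-word entries are plain ASCII byte strings.
        self.encoding = encoding

    def __call__(self, tokens):
        for t in tokens:
            # `bytes` is an alias for <str> on Python 2, so this branch
            # catches exactly the values the stemmer returns by mistake.
            if isinstance(t.text, bytes):
                t.text = t.text.decode(self.encoding)
            yield t
```

In a real analyzer chain the class would presumably subclass `whoosh.analysis.Filter` and be composed with the `|` operator between the LanguageAnalyzer and the CharsetFilter.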
