Issue #358 resolved

My analyzer raises AssertionError

Vladislav Polukhin
created an issue

I'm using Whoosh==2.5.2 and Python 2.7. Simple code:

# encoding: utf-8
from __future__ import unicode_literals
import re
from whoosh.analysis import RegexTokenizer, LanguageAnalyzer

unit_expression = re.compile(r'(?P<token>(\d(\.\d+)?)+\s+(кг|л))', re.UNICODE | re.IGNORECASE)
analyze = RegexTokenizer(unit_expression) | LanguageAnalyzer('ru')
print [t.text for t in analyze('Детский стиральный порошок, 1 кг')]

Traceback:

Traceback (most recent call last):
  File "/home/nuklea/workspace/project/whoosh_test.py", line 9, in <module>
    print [t.text for t in analyze('Детский стиральный порошок, 1 кг')]
  File "/home/nuklea/.virtualenvs/project/lib/python2.7/site-packages/whoosh/analysis/morph.py", line 135, in __call__
    for t in tokens:
  File "/home/nuklea/.virtualenvs/project/lib/python2.7/site-packages/whoosh/analysis/filters.py", line 288, in __call__
    for t in tokens:
  File "/home/nuklea/.virtualenvs/project/lib/python2.7/site-packages/whoosh/analysis/filters.py", line 220, in __call__
    for t in tokens:
  File "/home/nuklea/.virtualenvs/project/lib/python2.7/site-packages/whoosh/analysis/tokenizers.py", line 118, in __call__
    assert isinstance(value, text_type), "%s is not unicode" % repr(value)
AssertionError: <generator object __call__ at 0x1695eb0> is not unicode

Comments (5)

  1. Matt Chaput repo owner

    A tokenizer starts an analyzer chain, so it can only be the first element in the chain. LanguageAnalyzer already contains a tokenizer, so you can't add another one in front of it. You need to recreate the internals of the LanguageAnalyzer with your own tokenizer:

    from whoosh.analysis import RegexTokenizer, StopFilter, StemFilter
    
    my_tokenizer = RegexTokenizer(...)  # your own pattern here
    my_analyzer = my_tokenizer | StopFilter(lang="ru") | StemFilter(lang="ru")
    
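    For example, plugging in the unit_expression pattern from the report (untested sketch):

    my_analyzer = RegexTokenizer(unit_expression) | StopFilter(lang="ru") | StemFilter(lang="ru")
    # The tokenizer now emits only the "number + unit" matches, e.g. "1 кг",
    # instead of raising an AssertionError.
    print [t.text for t in my_analyzer('Детский стиральный порошок, 1 кг')]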

    I'll try to add a more descriptive error message :)

  2. Matt Chaput repo owner

    The default analyzer will cause two problems here:

    1. By default it removes "words" shorter than 2 characters, so "1" will not be indexed.
    2. It will index "20ml" as a single token instead of "20" and "ml".

    You need to:

    1. Set the minsize parameter of StopFilter to 1.
    2. Either change RegexTokenizer() to use a very clever pattern that separates runs of letters from runs of numerals, or use IntraWordFilter, which does this automatically (and other stuff) but is slower.

    e.g.:

    from whoosh import analysis

    ana = (analysis.RegexTokenizer()
           | analysis.LowercaseFilter()
           | analysis.IntraWordFilter()
           | analysis.StopFilter(lang="ru", minsize=1)
           | analysis.StemFilter(lang="ru")
           )
    
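    For instance (untested sketch):

    # IntraWordFilter splits the digit run from the letter run in "20ml",
    # and minsize=1 keeps single-character tokens such as "1".
    print [t.text for t in ana('порошок, 1 кг / 20ml')]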

    :)

  3. Matt Chaput repo owner

    On the other hand, if you wanted to be able to, for example, find items with a volume less than 8 L, or sort results by size, you would need to parse the numbers out of the original text and put them in a numeric field. You would want to have separate fields for volume, mass, etc. and convert values into a base unit (L, kg, etc.).

    If you wanted to get really fancy, you could even make a query parser extension that would automatically do unit recognition and conversion on the query, so the user could search for "washing > 100 g" and it would be converted to "title:washing mass:[0.1 to ]".
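
    A minimal sketch of that idea (the field names and the kg/L normalization here are illustrative, not something Whoosh provides):

    from whoosh import fields

    schema = fields.Schema(
        title=fields.TEXT(stored=True),
        mass=fields.NUMERIC(float),    # normalized to kg at index time
        volume=fields.NUMERIC(float),  # normalized to L at index time
    )

    # At index time you would parse "1 кг" out of the source text yourself
    # and store the normalized number, e.g. writer.add_document(title=..., mass=1.0).
    # A query like "mass greater than 100 g" then becomes the range mass:[0.1 to ].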

  4. Matt Chaput repo owner

    That's strange, this exact program works for me.

    from __future__ import unicode_literals
    
    from whoosh import fields
    from whoosh.analysis import *
    from whoosh.qparser import QueryParser
    
    ana = (RegexTokenizer()
           | LowercaseFilter()
           | IntraWordFilter()
           | StopFilter(lang="ru", minsize=1)
           | StemFilter(lang="ru"))
    schema = fields.Schema(name=fields.TEXT(analyzer=ana))
    
    parser = QueryParser('name', schema=schema)
    query = parser.parse('порошок "3 кг"')
    assert unicode(query) == '(name:порошок AND name:"3 кг")'
    
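    For completeness, a sketch of indexing and searching with this schema (assuming a temporary index directory; untested):

    import tempfile
    from whoosh import index

    ix = index.create_in(tempfile.mkdtemp(), schema)
    w = ix.writer()
    w.add_document(name='Детский стиральный порошок, 3 кг')
    w.commit()

    with ix.searcher() as s:
        print len(s.search(query))  # should be 1: the phrase "3 кг" matches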