Issue #229 new

Searching on certain keywords hangs

David Lv
created an issue

When I search on an index, it always hangs when the keyword is "duma", but it works when the keyword is "du".

{{{
#!python

from whoosh.index import create_in
from whoosh.fields import *
from whoosh.qparser import QueryParser
from whoosh import index
from whoosh.analysis import StandardAnalyzer
from whoosh.analysis import LowercaseFilter
from whoosh.analysis import NgramFilter

tokenizer = StandardAnalyzer() | NgramFilter(minsize=2, maxsize=2) | LowercaseFilter()
schema = Schema(zipfile=TEXT(stored=True, analyzer=tokenizer), path=TEXT(stored=True, analyzer=tokenizer), ext=TEXT)
indexdir = 'E:\kindle3\dvd\indexdir';

sindex = index.open_dir(indexdir);
searcher = sindex.searcher();

keyword = u'duma';  # when keyword is duma, it always hangs
query = QueryParser("path", schema).parse(keyword)
print "searching..."
results = searcher.search(query)
print results;
}}}

Comments (5)

  1. David Lv reporter

    It hangs at the line results = searcher.search(query).

    I debugged the source code and found an infinite loop in the add_matches() method in searching.py.

    I hope this can be fixed so I can use Whoosh in production.

  2. Thomas Waldmann
    searching...
    <Top 0 Results for And([Term('path', u'du'), Term('path', u'um'), Term('path', u'ma')]) runtime=0.000357151031494>
    

    This is the result with an empty index; it works (I added a create_in() call). So it seems to depend on the index contents somehow.

    Could you maybe produce a complete example that reproduces your issue?

    BTW, you don't need those trailing semicolons in Python.
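
    For reference, a sketch of the complete script described above (create_in() added, semicolons dropped); the indexdir path here is a placeholder rather than the original E:\... directory:

        import os

        from whoosh import index
        from whoosh.analysis import StandardAnalyzer, NgramFilter, LowercaseFilter
        from whoosh.fields import Schema, TEXT
        from whoosh.qparser import QueryParser

        tokenizer = StandardAnalyzer() | NgramFilter(minsize=2, maxsize=2) | LowercaseFilter()
        schema = Schema(zipfile=TEXT(stored=True, analyzer=tokenizer),
                        path=TEXT(stored=True, analyzer=tokenizer),
                        ext=TEXT)

        indexdir = "indexdir"  # placeholder path
        if not os.path.exists(indexdir):
            os.mkdir(indexdir)
        sindex = index.create_in(indexdir, schema)  # creates a fresh, empty index

        searcher = sindex.searcher()
        query = QueryParser("path", schema).parse(u"duma")
        print "searching..."
        results = searcher.search(query)
        print results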

  3. Matt Chaput repo owner

    What infinite loop did you find in add_matches? I can't reproduce this problem in a simple test case, e.g.:

        from whoosh import analysis, fields, qparser
        from whoosh.compat import u
        from whoosh.filedb.filestore import RamStorage
        from whoosh.util import permutations
    
        domain = u("alfa bravo charlie delta echo foxtrot").split()
    
        ana = analysis.NgramWordAnalyzer(2)
        schema = fields.Schema(path=fields.TEXT(stored=True, analyzer=ana))
    
        ix = RamStorage().create_index(schema)
        with ix.writer() as w:
            for ls in permutations(domain):
                w.add_document(path=" ".join(ls))
    
        with ix.searcher() as s:
            keyword = u("arli")
            q = qparser.QueryParser("path", schema).parse(keyword)
            print "searching..."
            results = s.search(q)
            print results
    

    (As an aside, NgramWordAnalyzer does what you want... your analyzer is doing extra work, since StandardAnalyzer already includes a lowercase filter.)
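
    A quick way to check that is to run both analyzer chains over a sample word; in this sketch both should produce the same bigrams, which shows the extra LowercaseFilter does nothing:

        from whoosh import analysis

        # The reporter's chain: StandardAnalyzer already lowercases,
        # so the trailing LowercaseFilter is redundant.
        custom = (analysis.StandardAnalyzer()
                  | analysis.NgramFilter(minsize=2, maxsize=2)
                  | analysis.LowercaseFilter())
        builtin = analysis.NgramWordAnalyzer(2)  # tokenize, lowercase, bigrams

        for ana in (custom, builtin):
            print [t.text for t in ana(u"Duma")]  # [u'du', u'um', u'ma'] both times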

    Can you give me more information to help track down the problem?

    Note that you're indexing two-letter ngrams, which means you'll have a relatively small number of terms, each with a gigantic posting list (the list of documents/positions at which that term occurs). The query string "duma" will give a query of AND("du", "um", "ma"), as in the output above. If your index is very large, this could be pretty slow.
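
    You can see that expansion by parsing the keyword against a bigram field; this small sketch matches the query Thomas printed above:

        from whoosh import analysis, fields, qparser

        ana = analysis.NgramWordAnalyzer(2)
        schema = fields.Schema(path=fields.TEXT(analyzer=ana))
        q = qparser.QueryParser("path", schema).parse(u"duma")
        print q  # And([Term('path', u'du'), Term('path', u'um'), Term('path', u'ma')])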

    For better results, do something like NgramWordAnalyzer(2, 5). The higher the second number (the maximum ngram size), the more terms have to be stored in the index, but the faster searches will be.
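
    As a sketch of the difference, a 2-to-5 ngram analyzer also emits the whole four-letter keyword as a single term, so "duma" can be matched directly (the exact token order shown is illustrative):

        from whoosh import analysis

        ana = analysis.NgramWordAnalyzer(2, 5)  # ngrams of 2 up to 5 letters per word
        print [t.text for t in ana(u"duma")]
        # e.g. [u'du', u'dum', u'duma', u'um', u'uma', u'ma']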
