Issue #332 new

Facets and Terms

Anonymous created an issue

When trying to use both sorting and terms, I get a KeyError when calling hit.matched_terms(). Is it not possible to use both sorting and terms?

results = searcher.search(query, sortedby='ID', reverse=True, terms=True)
for hit in results:
    print hit.matched_terms()

Comments (5)

  1. Matt Chaput repo owner

    This works for me. What version are you using?

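    from whoosh import fields, query
    from whoosh.compat import u
    from whoosh.filedb.filestore import RamStorage

    def test_sorted_result_terms():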
        schema = fields.Schema(id=fields.KEYWORD(sortable=True),
                               body=fields.TEXT)
        ix = RamStorage().create_index(schema)
        with ix.writer() as w:
            w.add_document(id=u("one"), body=u("alfa bravo charlie"))
            w.add_document(id=u("two"), body=u("bravo charlie delta"))
            w.add_document(id=u("three"), body=u("charlie delta echo"))
            w.add_document(id=u("four"), body=u("delta echo alfa"))
            w.add_document(id=u("five"), body=u("echo alfa charlie"))
    
        with ix.searcher() as s:
            q = query.Or([query.Term("body", "charlie"), query.Term("body", "alfa")])
            r = s.search(q, sortedby="id", reverse=True, terms=True)
    
            assert ([hit["id"] for hit in r]
                    == ["two", "three", "one", "four", "five"])
    
            assert ([hit.matched_terms() for hit in r]
                    == [[("body", "charlie")],
                        [("body", "charlie")],
                        [("body", "alfa"), ("body", "charlie")],
                        [("body", "alfa")],
                        [("body", "alfa"), ("body", "charlie")],
                        ])
    
  2. rholloway

    Still trying to track down issue. Running 2.4.1.

    A test script using your code above (and similar variations, such as adding a NUMERIC field, which is what I want to sort by) runs successfully. Running against what I have indexed on disk, I get a KeyError on line 1399 of searching.py.

    Running results.has_matched_terms() returns True. The key it fails on is "40784", and I am not sure where it comes from (it isn't the match on the NUMERIC sort field, anyway). Modifying searching.py to print self.results.docterms.keys() prints [417], so only one key is listed in there. Again, I am not sure what that references or what should be in there.

    My schema is a bit larger than below, but essentially

    Schema(vid=NUMERIC(stored=True,unique=True),entered=DATETIME(stored=True),name=TEXT(stored=True,analyzer=ana),...)

    is an example of types of fields. Want to sort by vid.

    code to test it is as simple as:

    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    ix = open_dir("index")
    with ix.searcher() as searcher:
      query = QueryParser("name", ix.schema).parse(u"whoosh")
      results = searcher.search(query, sortedby='vid', reverse=True, terms=True)
      print results.has_matched_terms()
      for r in results:
        print r.matched_terms()
    

    Crashes on first iteration of loop. Without trying to print matched terms, everything works fine. Without trying to sort, I can get matched terms no problem.

  3. rholloway

    Switching to 2.5.1 worked. However, searching the index now seems significantly slower when using sorting and limit.

    My previous workaround was to search the index with limit=None and do some Python post-processing on the results.

    # get results
    results = searcher.search(query, limit=None, terms=True)
    # sort
    results = sorted(results, key = lambda k: int(k['vid']), reverse=True)
    
    # return (limited) results
    return results[0:limit]
    

    My understanding is this should be roughly the equivalent of

    searcher.search(query, limit=limit, sortedby='vid', reverse=True, terms=True)
    

    and I initially thought the latter would be better (cleaner at least) and likely faster. However, that doesn't seem to be the case. I have a 10 second timeout, which the latter hits, while the first method responds in ~2 seconds.

    Is this supposed to be the case? Not too sure how the combination of limiting results on a sorted index should or does work (would think it would need all results first to sort anyways).
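
    On the general question of whether sorting with a limit needs all results first: every match has to be visited once, but a full sort is not required to get the top N. The sketch below illustrates that with the standard library's heapq module. It is a generic illustration only, not a description of how Whoosh implements sortedby and limit; the top_n_sorted helper and the sample rows are hypothetical.

```python
import heapq


def top_n_sorted(rows, key, limit, reverse=False):
    """Return the first `limit` rows in sort order without sorting every row.

    heapq keeps only `limit` candidates at a time while scanning `rows` once.
    """
    if reverse:
        return heapq.nlargest(limit, rows, key=key)
    return heapq.nsmallest(limit, rows, key=key)


# Hypothetical stand-ins for stored fields of search hits.
rows = [{"vid": "17"}, {"vid": "3"}, {"vid": "42"}, {"vid": "8"}]

# Top 2 by numeric vid, highest first.
top = top_n_sorted(rows, key=lambda r: int(r["vid"]), limit=2, reverse=True)

# The post-processing approach from the workaround: sort everything, then slice.
full = sorted(rows, key=lambda r: int(r["vid"]), reverse=True)[0:2]
```

    Both approaches return the same rows; the heap version just avoids ordering the rows that fall outside the limit.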

    In any case, the combination of sorting/limiting along with Terms does appear fixed in 2.5.1.

  4. Matt Chaput repo owner

    In version 2.4 the generation of sorting information was done at first search and cached on disk. In version 2.5 this was changed (I would say fixed) to be done at indexing time -- you need to add sortable=True to fields you want to be able to sort on, otherwise, the sorting info will still be generated at the first search but not cached (since the "proper" way to do it is in the index). I recommend you try adding sortable to your schema. I will try to backport the fix to 2.4 though.
