Issue #183 resolved

BM25F scores not always decreasing with number of irrelevant words

david_s
created an issue

If I search for a single keyword, I'd expect the BM25F scores of hits with the same number of keyword occurrences to decrease monotonically with the number of other words in the searched field. But this doesn't seem to be the case.
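To make the expectation concrete, here is a minimal sketch of the textbook single-field BM25 term score. It is not copied from Whoosh, whose implementation may differ in how idf is computed or how fields are combined; the function name, the idf value and the average length are illustrative assumptions. With B = 1.0 and the term frequency held fixed, the score falls strictly as the field gets longer. With B = 0.0, length is ignored.

```python
def bm25_term_score(idf, tf, field_len, avg_field_len, B=0.75, K1=1.2):
    """Textbook BM25 score of one term in one field (illustrative, not Whoosh's code)."""
    # Length normalisation: B=1.0 applies it fully, B=0.0 disables it.
    norm = (1.0 - B) + B * (field_len / avg_field_len)
    return idf * (tf * (K1 + 1.0)) / (tf + K1 * norm)


IDF = 2.0       # hypothetical inverse document frequency
AVG_LEN = 5.0   # hypothetical average field length

# Same term frequency, growing field length, full length normalisation.
scores_b1 = [bm25_term_score(IDF, 1.0, n, AVG_LEN, B=1.0) for n in range(1, 11)]

# Same inputs with length normalisation switched off.
scores_b0 = [bm25_term_score(IDF, 1.0, n, AVG_LEN, B=0.0) for n in range(1, 11)]

for n, s in zip(range(1, 11), scores_b1):
    print("length=%d score=%.4f" % (n, s))
```

If two documents with the same term frequency but different field lengths get identical scores under B = 1.0, the length fed into the formula is the likely suspect.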

My schema is

{{{
schema = Schema(url = ID(stored = True),
                category = TEXT(stored = True),
                title = TEXT(analyzer=StemmingAnalyzer(), stored = True),
                content = TEXT(analyzer=StemmingAnalyzer()),
                doc_id = ID(unique = True, stored = True),
                date = DATETIME(stored = True),
                client = TEXT(stored = True)
                )
}}}

and the code for the search is:

{{{
searcher = self.get_s whoosh.index.open_dir(indexdir, indexname = client_id).searcher(weighting=scoring.BM25F(B = 1.0))
cont_parser = QueryParser("title", self.get_schema(client))
parsed_cont_query = cont_parser.parse(querystring)
results = searcher.search(parsed_cont_query)
}}}

Searching for a single keyword and finding a bunch of results for which the keyword appears in 'title' exactly once, I frequently seem to get results with very different numbers of non-keywords getting exactly the same score.

If I'm not doing something obviously stupid and I'm not expecting the wrong behaviour, I'd quite like to debug by sticking some logging into the code that ultimately calls 'score' to find out what weight and length it's using for any given document, but I'm having trouble unravelling where that is.

Comments (5)

  1. Matt Chaput repo owner
    • changed status to open

    If you're using a recent version, the best place would be in whoosh/scoring.py, line 191 (WeightLengthScorer.score).

        # Untested code
        def score(self, matcher):
            # return self._score(matcher.weight(), self.dfl(matcher.id()))
            docnum = matcher.id()
            weight = matcher.weight()
            length = self.dfl(docnum)
            print "Scoring document %s: weight=%f length=%d" % (docnum, weight, length)
            score =  self._score(weight, length)
            print "  score=", score
            return score
    

    Unfortunately the scorer doesn't currently keep a reference to the searcher so it's not easy to convert the document number into something useful for debugging purposes. If you store the searcher you're using somewhere global, like a module attribute, then you can use it for debugging output:

    mymodule.searcher = searcher
    
        # Untested code
        def score(self, matcher):
            # return self._score(matcher.weight(), self.dfl(matcher.id()))
            docnum = matcher.id()
            weight = matcher.weight()
            length = self.dfl(docnum)
            print "Scoring document %s: weight=%f length=%d" % (docnum, weight, length)
            print "Title:", mymodule.searcher.stored_fields(docnum).get("title")
            score =  self._score(weight, length)
            print "  score=", score
            return score
    

    Thanks for your help!

  2. david_s reporter

    I've put in the following unsubtle debug code.

    In searching.py, Collector.pull_matches():

            while matcher.is_active():
                logging.debug("+++++++++++++++++")
                logging.debug("Matcher type is %s" % type(matcher))
                logging.debug("Matching against %s" % str(matcher.term()))
                logging.debug("ID: %s, weight %f" % (matcher.id(), matcher.weight()))
                logging.debug("Title: %s" % self.searcher.stored_fields(matcher.id()).get("title"))
    

    and

                if ((not allow or offsetid in allow)
                    and (not restrict or offsetid not in restrict)):
                    # Collect and yield this document
                    collect(id, offsetid)
                    if scorefn:
                        logging.debug("****************")
                        score = scorefn(matcher)
                        logging.debug("ID: %s, weight: %f, score: %f" % (matcher.id(), matcher.weight(), score))
                        logging.debug("Title: %s" % self.searcher.stored_fields(matcher.id()).get("title"))
                    else:
                        logging.debug("****************")
                        score = matcher.score()
                        logging.debug("ID: %s, weight: %f, score: %f" % (matcher.id(), matcher.weight(), score))
                        logging.debug("Title: %s" % self.searcher.stored_fields(matcher.id()).get("title"))
                    yield (score, offsetid)
    

    And in scoring.py WeightLengthScorer:

        def score(self, matcher):
            logging.debug("Weight %f, dfl %f" % (matcher.weight(), self.dfl(matcher.id())))
            return self._score(matcher.weight(), self.dfl(matcher.id()))
    

    Sample output is:

    2011-08-24 13:16:37,440 - Matcher type is <class 'whoosh.filedb.filepostings.FilePostingReader'>
    2011-08-24 13:16:37,440 - Matching against ('title', u'chip')
    2011-08-24 13:16:37,440 - ID: 270, weight 2.000000
    2011-08-24 13:16:37,441 - Title: Videos and demos
    2011-08-24 13:16:37,441 - ****************
    2011-08-24 13:16:37,441 - Weight 2.000000, dfl 3.000000
    2011-08-24 13:16:37,441 - ID: 270, weight: 2.000000, score: 8.387126
    2011-08-24 13:16:37,442 - Title: Videos and demos
    

    When I get my results back, the hit with hit['score'] = 8.387126 is right to have weight 2.0, but its dfl should not be 3.0.

    Unless I'm doing something silly, it seems like something's wrong with the ids that matcher.id() is using, or with the way that scorer.dfl() is using it... my guess without looking too hard would be that the scorer should be using something similar to scorer.dfl(matcher.id() + offset). Is that possible?

  3. david_s reporter

    I think that that might be the case, actually - including a line

                        logging.debug("Proper title (?): %s" % self.searcher.stored_fields(matcher.id() + offset).get("title"))
    

    gives the right text back.

    I don't understand the matchers and searchers well enough to see if there's a simple solution, though.
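The suspected mix-up can be sketched with a toy model. This is an assumption-laden illustration, not Whoosh code: the segment layout, titles and helper name are all hypothetical. It supposes an index made of several segments, each numbering its documents from 0, with a per-segment offset that turns a segment-local number into an index-wide one. Passing the local number to an index-wide lookup silently returns the length of a different document.

```python
from itertools import accumulate

# Hypothetical two-segment index; each inner list holds the titles of one segment.
segments = [
    ["alpha beta gamma", "delta", "epsilon zeta"],
    ["chip", "chip design notes for beginners"],
]

# offsets[i] is the index-wide number of the first document in segment i.
offsets = [0] + list(accumulate(len(seg) for seg in segments))[:-1]

# Index-wide view: all titles in one flat list.
all_titles = [title for seg in segments for title in seg]


def global_field_length(global_docnum):
    """Word count of the title, looked up by index-wide document number."""
    return len(all_titles[global_docnum].split())


# A matcher working on segment 1 reports a segment-local document number.
seg_index, local_docnum = 1, 1
offset = offsets[seg_index]

wrong_length = global_field_length(local_docnum)           # looks at "delta"
right_length = global_field_length(local_docnum + offset)  # looks at the real hit

print("without offset:", wrong_length)
print("with offset:", right_length)
```

In this toy case the score would be computed from a length of 1 instead of 5, which is the same kind of symptom as the log above: a correct weight paired with the length and title of an unrelated document.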
