BM25F scores not always decreasing with number of irrelevant words
If I search for a single keyword, I'd expect the BM25F scores of hits that contain the same number of occurrences of the keyword to decrease monotonically with the number of other words in the searched field. But this doesn't seem to be the case.
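To make that expectation concrete, here is a minimal sketch of the textbook BM25 per-term formula as I understand it. The function name `bm25_term` and the sample numbers are my own illustration, and I'm assuming Whoosh's BM25F reduces to something of this shape for a single-field query; I haven't confirmed that it is the exact code path used.

```python
def bm25_term(idf, tf, field_len, avg_field_len, B=0.75, K1=1.2):
    """Textbook BM25 contribution of one term to one document's score.

    idf           -- inverse document frequency of the term
    tf            -- occurrences of the term in this document's field
    field_len     -- number of words in this document's field
    avg_field_len -- average field length across the collection
    B, K1         -- free parameters (B controls length normalisation)
    """
    norm = (1 - B) + B * field_len / avg_field_len
    return idf * (tf * (K1 + 1)) / (tf + K1 * norm)


# Hypothetical numbers: the keyword occurs once in every title,
# and only the title length changes.
idf = 2.0
avg_len = 6.0
lengths = [3, 5, 8, 12]
scores = [
    bm25_term(idf, tf=1, field_len=n, avg_field_len=avg_len, B=1.0)
    for n in lengths
]

for n, s in zip(lengths, scores):
    print(n, round(s, 4))
```

With B = 1.0 and a fixed term frequency, each longer title should get a strictly lower score under this formula, which is why identical scores for titles of different lengths surprise me.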
My schema is
    from whoosh.analysis import StemmingAnalyzer
    from whoosh.fields import Schema, ID, TEXT, DATETIME

    schema = Schema(
        url=ID(stored=True),
        category=TEXT(stored=True),
        title=TEXT(analyzer=StemmingAnalyzer(), stored=True),
        content=TEXT(analyzer=StemmingAnalyzer()),
        doc_id=ID(unique=True, stored=True),
        date=DATETIME(stored=True),
        client=TEXT(stored=True),
    )
and the code for the search is:
    import whoosh.index
    from whoosh import scoring
    from whoosh.qparser import QueryParser

    searcher = whoosh.index.open_dir(indexdir, indexname=client_id).searcher(
        weighting=scoring.BM25F(B=1.0))
    cont_parser = QueryParser("title", self.get_schema(client))
    parsed_cont_query = cont_parser.parse(querystring)
    results = searcher.search(parsed_cont_query)
When I search for a single keyword and look at the results in which the keyword appears in 'title' exactly once, I frequently see results with very different numbers of non-keyword words getting exactly the same score.
If I'm not doing something obviously stupid and I'm not expecting the wrong behaviour, I'd like to debug this by adding some logging to the code that ultimately calls 'score', to find out what weight and length it uses for any given document, but I'm having trouble working out where that is.