Phrase query matching partial result

Issue #486 new
Laurent Tramoy
created an issue

Hi,

My query is a simple phrase query, "python library", and I want to find only the exact matches of this phrase, without counting stray occurrences of "python" or "library" on their own. Here is my code:

from whoosh import fields, scoring, analysis, query
from whoosh.filedb.filestore import FileStorage


def search(text, q):
    storage = FileStorage("tests/index")
    # This regex is the same as the default, except that it does not split on
    # dashes.
    regex_expr = '\\w+((\\.?|-?)\\w+)*'
    analyzer = analysis.StandardAnalyzer(expression=regex_expr, stoplist=[])
    schema = fields.Schema(
        authors=fields.KEYWORD(commas=True, stored=True),
        description=fields.TEXT(analyzer=analyzer, stored=True)
    )
    index = storage.create_index(schema, indexname="usages")
    w = index.writer()
    w.add_document(authors='unknown', description=text)
    w.commit()
    searcher = index.searcher(weighting=scoring.Frequency)
    return searcher.search(q, terms=True)

If the two terms don't appear next to each other, there is no hit, as expected:

text1 = "bla bla library bla bla bla python"
q = query.Phrase("description", ["python", "library"])
search(text1, q)
# returns <Top 0 Results for Phrase('description', ['python', 'library'], slop=1, boost=1.000000) runtime=0.000455269000667613>

And if they do, we have a hit:

text2 = "bla bla python library bla bla"
q = query.Phrase("description", ["python", "library"])
search(text2, q)
# returns <Top 1 Results for Phrase('description', ['python', 'library'], slop=1, boost=1.000000) runtime=0.000455269000667613>

So far, nothing surprising. But my problem is the partial matches when both the full phrase and the isolated terms appear in the same document:

text3 = "bla bla python" + " bla "*100 + "bla bla python library" + " bla "*100 + "library" 
res = search(text3, q)[0]
res.highlights("description")
# returns 'bla bla <b class="match term0">python</b> bla  bla  bla  bla  bla...bla  bla bla bla <b class="match term0">python</b> <b class="match term1">library</b> bla  bla  bla  bla  bla...bla  bla  bla  bla <b class="match term1">library</b>'
res.score()
# returns 4.0

I guess this behavior is expected, but is there a way to highlight and score only the occurrences where the terms are right next to each other?
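
To make the desired result concrete, here is a minimal sketch in plain Python of what I would like to get for text3. It does not use the Whoosh API at all: `tokenize` and `phrase_positions` are hypothetical helpers written only for illustration, and the lowercasing is my assumption about what the analyzer does to the tokens.

```python
import re

# Same token pattern as the analyzer above (does not split on dashes).
TOKEN_RE = re.compile(r"\w+((\.?|-?)\w+)*")


def tokenize(text):
    """Hypothetical helper: split text into lowercased tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def phrase_positions(tokens, phrase):
    """Hypothetical helper: return the start index of every place where
    the tokens of `phrase` appear consecutively, in order."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


text3 = ("bla bla python" + " bla " * 100 + "bla bla python library"
         + " bla " * 100 + "library")
tokens = tokenize(text3)
hits = phrase_positions(tokens, ["python", "library"])

# Each term occurs twice on its own, but the phrase occurs only once.
print(tokens.count("python"), tokens.count("library"), len(hits))
```

In other words, I would expect a single highlighted fragment and a score that reflects one phrase occurrence, not the four individual term occurrences.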

I believe I read the doc thoroughly, as well as the previous issues, so I apologize if this was already answered somewhere.

Thanks
