Analyzer fails if data is stored as a list of unicode values

Issue #321 (status: invalid), created by Ashutosh Singla
from whoosh.analysis import (IntraWordFilter, LowercaseFilter, MultiFilter,
                             RegexTokenizer)

def item_name_analyzer():
    """
    Analyzer behaviour:

    Input: u"some item name", u"SomeItem/SubItem", u"GSOC2011"

    Output: u"some", u"item", u"name"; u"Some", u"Item", u"Sub", u"Item"; u"GSOC", u"2011"
    """
    iwf = MultiFilter(index=IntraWordFilter(mergewords=True, mergenums=True),
                      query=IntraWordFilter(mergewords=False, mergenums=False)
                     )
    analyzer = RegexTokenizer(r"\S+") | iwf | LowercaseFilter()
    return analyzer

# Add two documents to the index
writer.add_document(title=u"MyDocument", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=[u"MyDocument"], content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")

With the item_name_analyzer defined above, a search for "title":u"Document" hits the document whose title is u"MyDocument", whereas the document whose title is [u"MyDocument"] is not found.
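
The asymmetry follows from the rule described in the first comment below: a string is run through the analyzer, while a list is taken as ready-made terms. The following is a toy model of that rule, not Whoosh code. `Token`, `toy_analyzer` and `index_terms` are made-up stand-ins, and the toy analyzer only roughly imitates item_name_analyzer (for example, it omits the merged-word terms).

```python
import re
from collections import namedtuple

# Made-up stand-in for an analyzer token; only .text is used here.
Token = namedtuple("Token", "text")


def toy_analyzer(value, **kwargs):
    """Rough imitation of item_name_analyzer: split on whitespace,
    then on case and digit boundaries, then lowercase."""
    for word in re.findall(r"\S+", value):
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", word):
            yield Token(part.lower())


def index_terms(value, analyzer):
    """Return the terms that would be indexed for one field value."""
    if isinstance(value, (list, tuple)):
        # Assumed rule: a list or tuple is treated as pre-analyzed tokens.
        return list(value)
    # A plain string goes through the analyzer.
    return [token.text for token in analyzer(value, mode="index")]


query_terms = [token.text for token in toy_analyzer(u"Document", mode="query")]

as_string = index_terms(u"MyDocument", toy_analyzer)
as_list = index_terms([u"MyDocument"], toy_analyzer)

print(query_terms[0] in as_string)  # the string title can match
print(query_terms[0] in as_list)    # the list title cannot
```

In this model the list title is indexed as the single unsplit, mixed-case term, so the lowercased query term has nothing to match.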

Comments (5)

  1. Matt Chaput repo owner

    If you pass a string to add_document, it is tokenized using the field's analyzer. If you pass a list or tuple, it is interpreted as a list of pre-analyzed tokens, and the analyzer is never run.

  2. Thomas Waldmann

    Hmm, isn't it a bit strange to make it depend on 1-or-n whether the analyzer is run?

    I had somehow also expected that when you give a list, it would just do the same for each element.

  3. Matt Chaput repo owner

    Someone wanted to be able to bypass the analyzer for certain fields/documents where they already had the exact tokens they wanted. In retrospect, I should have made it so you had to wrap the list in some kind of marker object instead of checking for a list.

  4. Thomas Waldmann

    Hmm, I tried feeding the tokens as a list.

    But then I run into the next issue: the field has stored=True, but I only give tokens, so I can't get back the original names as would usually be possible with stored=True.
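
One possible way around the stored-value problem, assuming Whoosh's `_stored_<fieldname>` keyword for add_document (which overrides the value stored for a field) behaves as its indexing guide describes: pass the tokens under the field name and the original string under `_stored_<fieldname>`. The helper `fields_with_original` below is hypothetical and only builds the keyword arguments. Treat it as a sketch that has not been checked against any particular Whoosh version.

```python
def fields_with_original(fieldname, original, tokens):
    """Build add_document keyword arguments that index `tokens` for
    `fieldname` but ask for `original` to be the stored value.

    Hypothetical helper; relies on the `_stored_<fieldname>` convention.
    """
    return {
        fieldname: list(tokens),
        "_stored_" + fieldname: original,
    }


fields = fields_with_original(u"title", u"MyDocument", [u"my", u"document"])
# Intended use with an open writer (not run here):
#     writer.add_document(path=u"/b", **fields)
print(sorted(fields))
```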

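
The marker-object design mentioned in comment 3 could look roughly like the sketch below. It illustrates the idea only and is not Whoosh's API: `PreAnalyzed`, `terms_for` and the `analyze` callable are all hypothetical names. Under this design a plain list is analyzed element by element, which is the behaviour comment 2 expected, and only an explicitly wrapped list bypasses the analyzer.

```python
class PreAnalyzed(object):
    """Hypothetical marker: wraps tokens that must bypass the analyzer."""

    def __init__(self, tokens):
        self.tokens = list(tokens)


def terms_for(value, analyze):
    """Return index terms for `value`; `analyze` maps a string to a list of terms."""
    if isinstance(value, PreAnalyzed):
        # Explicit opt-out: use the caller's tokens untouched.
        return list(value.tokens)
    if isinstance(value, (list, tuple)):
        # A plain list means several values, each analyzed in turn.
        return [term for item in value for term in analyze(item)]
    return list(analyze(value))


def analyze(text):
    """Toy analyzer for the demo: lowercase and split on whitespace."""
    return text.lower().split()


print(terms_for([u"My Document", u"Other Name"], analyze))
print(terms_for(PreAnalyzed([u"MyDocument"]), analyze))
```

Making the bypass explicit means the type of the value no longer silently decides whether the analyzer runs.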