Issue #355 resolved

Inconsistency in parsing unicode and str types in fields.NUMERIC

Michele Orrù
created an issue

The attached code demostrated the inconstistency. Appending the irc log of my conversation with Thomas as a reference.

<          maker > unicode() and str() give different outputs /
<          maker > apparently datetime is safe against this test, but fields.NUMERIC not (
<          maker > (but I have to write more tests to assert that) 
< ThomasWaldmann > hmm? if it is NUMERIC, how is (u)"maker" valid?
<          maker > ThomasWaldmann u'maker' is not valid, but 'maker' yes
< ThomasWaldmann > but that's not numeric
<          maker > I would have expected to have the same result with unicode and strings
< ThomasWaldmann > are you using py?
<          maker > nop
<          maker > .. bit
< ThomasWaldmann > well, guess you have to dig into the code to find out why. but maybe try if it happens also with more practical usecases, not only when giving totally invalid input.
<          maker > ThomasWaldmann mine is a practical case
<          maker > I have a reduce(operator.or_, Term(field, query) for field in fields)
< ThomasWaldmann > how can "maker" be a valid term for searching a NUMERIC field?
<          maker > that's a query term; if I just want to search through all possible fields, I shouldn't care about its type. Moreover, str() behaves differently, which imo makes it a bug either way
< ThomasWaldmann > yes, if it is invalid, it should just never match

Comments (4)

  1. Matt Chaput repo owner

    There's a philosophical question about whether a malformed query should (a) raise an exception or (b) just match nothing. If (a), then the bug is that b"maker" should raise an error. If (b) the bug is that u"maker" should just match nothing.

    Normally I'd favor (a). However, here we can't raise an exception at the logical place (constructing the query) because at that point we don't know the field description (queries are independent of any search or index). We can only error when you search, which feels like it would be confusing, so I'd probably agree that (b) is the better way in this case.

    Unfortunately the patch doesn't make them match "nothing", it makes them match 0, which could lead to very confusing results where the user doesn't notice erroneous input and then gets very strange unexpected hits.

    Please let me look at this today and see if I can think of something better :)

  2. Matt Chaput repo owner

    I think the right thing here is to check for a ValueError whenever a query calls to_bytes(). Then the query can return a NullMatcher so the query won't match anything.

  3. Log in to comment