Crashes due to incorrect "use after free" in IntersectionMatcher.replace()

Issue #181 resolved
Wouter Bolsterlee
created an issue

Whoosh crashes on some queries during querying, while other queries on the same index work fine. In both cases the query is just a simple single term search on a NGRAMWORDS field, e.g. "apple" leads to a crash while "orange" does not. The traceback is below:

{{{ Traceback (most recent call last): File "./scrap.py", line 16, in <module> results = searcher.search(q, 30) File "whoosh/searching.py", line 667, in search return collector.search(self, q, allow=filter, restrict=mask) File "whoosh/searching.py", line 967, in search self.add_matches(q, offset, scorefn) File "whoosh/searching.py", line 981, in add_matches for score, offsetid in self.pull_matches(q, offset, scorefn): File "whoosh/searching.py", line 1033, in pull_matches matcher = matcher.replace(minscore or 0) File "whoosh/matching.py", line 890, in replace return IntersectionMatcher(a, b).replace(minquality) File "whoosh/matching.py", line 1194, in replace b = b.replace(minquality - a.max_quality() if minquality else 0) AttributeError: 'NullMatcher' object has no attribute 'max_quality' }}}

I am not familiar with whoosh's code base, but nevertheless tried to investigate this crasher, and I think I sort of understand what happens. If not, pardon my ignorance. Here are my observations:

  • I'm experiencing this crash on a test index I built using MultiSegmentWriter, without optimizing the index afterwards. In my particular case there are 8 segments.
  • My "or" query on a NGRAMWORDS(queryor=True) field translates into a UnionMatcher. While evaluating a query causing crashers, this matcher is converted into a IntersectionMatcher by the code in UnionMatcher.replace(), as can be seen in the penultimate backtrace line above. This conversion/optimisation does not happen to the queries that do not cause crashes, it seems.
  • In IntersectionMatcher's constructor the expression "a = a.replace(minquality - b_max_quality if minquality else 0)" results in "a" to be replace by a NullMatcher. I verified this by adding print statements for debugging, but I don't understand the code well enough to see how this is supposed to work exactly.
  • One line further down in IntersectionMatcher's constructor, exactly the same type of .replace() is performed on the second matcher (b).
  • However, this line accesses a.max_quality(), but since "a" was just optimized into a NullMatcher, this throws the error shown above. In other words, this is basically a "use after free" under some conditions.

I've cooked up a patch that seems to fix the problem for me. The patch changes the following:

  • Save the max_quality() value into a local variable and use that value everywhere in this method. I've done this for both matchers for consistency and code legibility, and this also avoids the overhead of (possibly) calling max_quality() twice on both Matcher objects, since it is also used in an if statement in the same function.
  • While making these changes, it occurred to me that the "a_active" and "b_active" variables assigned at the top of the function are only used once (new values are assigned and used further down), so I removed those, avoiding needless assignments.

The patch has been attached to this report. Feedback is highly appreciated, since I'm not familiar at all with this code base (as stated before).

One last note, at the risk of asking a stupid question, since I do not understand at all how the matching and scoring code works: is it really intended that a.replace() uses b.max_quality() and that b.replace() uses a.max_quality(), i.e. a uses b, and b uses a? (If not, a patch would be just swap letters a and b). An explanation about how this works would be highly appreciated.

Thanks for the amazing work you've done on whoosh. Please keep up the good work!

Comments (14)

  1. Matt Chaput repo owner

    Was calling a.max_quality() even though a might have just been replaced with NullMatcher(). Fixes issue #181. Best solution might have been to just add implement max_quality() on NullMatcher, which I've added here, but I also tried to make the logic clearer/more robust in IntersectionMatcher.replace().

    e94950411d48

  2. Wouter Bolsterlee reporter

    Fwiw, my patch had an obvious typo: b_max_quality was assigned the value of a.max_quality() instead of b.max_quality()... Anyway, good to see you've fixed it.

    Another thing though. I'm still interested in the logic behind the max_quality stuff (the final question in my original report). Some explanation will improve my understanding of the code. Thanks in advance.

  3. Wouter Bolsterlee reporter

    Here's a trivial patch:

    diff -r e94950411d48 src/whoosh/matching.py
    --- a/src/whoosh/matching.py	Tue Aug 23 13:07:22 2011 -0400
    +++ b/src/whoosh/matching.py	Tue Aug 23 20:31:16 2011 +0200
    @@ -1196,7 +1196,7 @@
             if minquality:
                 a_max = a.max_quality()
                 b_max = b.max_quality()
    -            if minquality and a_max + b_max < minquality:
    +            if a_max + b_max < minquality:
                     # If the combined quality of the sub-matchers can't contribute,
                     # return an inactive matcher
                     return NullMatcher()
    
  4. Matt Chaput repo owner

    Thanks :)

    The idea here is that at various points during the search, the collector asks the matcher tree to return a more efficient version of itself if possible. It passes minquality, which is the minimum score a document would have to get to be added to the top N results. (For "quality", just think score. Quality used to be an approximation, but now that's fixed so it's the same as the score.)

    Each matcher has a max_quality() method which returns the maximum score the matcher can possibly produce. If a matcher's max_quality() value is less-than-or-equal-to minquality, we know the matcher can't contribute any more documents to the top N, so it might as well convert itself to a NullMatcher and be eliminated from the tree.

    For an IntersectionMatcher (which implements an AND relation between two sub-matchers), we know that the sum of the max quality of the two sub-matchers must be greater than minquality to contribute to the top N. So the minquality required of each sub-matcher is the parent's minquality minus the potential contribution (the max quality) of the other sub-matcher.

  5. Wouter Bolsterlee reporter

    Thanks for the explanation - that helped.

    Another comment: since a NullMatcher, by design, does not have any state, wouldn't it be better if the matching module instantiated a single NullMatcher upon import, which can act as a singleton instance anytime a NullMatcher is returned from a replace() method. That would save the object construction overhead and slightly lower memory use, but I'm unsure these code paths are actually in the critical path and hence worth the effort.

  6. Matt Chaput repo owner

    Replaced NullMatcher with a singleton. Although instantiation of NullMatchers isn't exactly on a hot path (it can only happen during replace()), I always meant to get around to this because it makes sense, and will add some tiny speed improvement, so why not? Fixes issue #181.

    Fixed braindead test in test_results.

    16e341cbb827

  7. Wouter Bolsterlee reporter
    • changed status to new

    Now that NullMatcher is a singleton, I think the instance should be returned from all code that used to instantiate NullMatcher instances. However, matching.py is still full of lines like these:

    return NullMatcher()
    

    Shouldn't that be

    return NullMatcher
    

    instead?

  8. Matt Chaput repo owner

    Yes, but since NullMatcher() is a no-op, it has no practical performance impact and theoretically leaves the option of switching to a different implementation. I might do a sweep someday to change them all to clear instance references, but it's not that important to me.

  9. Log in to comment