Performance degradation with Whoosh 2.5.1

Issue #345 resolved
pombredanne NA created an issue

I test-upgraded an application that uses Whoosh from 2.4.1 to 2.5.1, and the test runs (which are pretty long) now take about 100 minutes, up from 60 minutes, which is an increase of about 50% in run time. The only changes between these two runs are the new Whoosh version and a minor code change to deal with Whoosh module changes (spans moved to query.spans). I am not sure what the reason is, and I will try to dig a bit to find out whether this happens during indexing, query building or searching. I am using fairly large and long queries assembled in reasonably complex SpanNear trees, with or without slop.

Comments (12)

  1. pombredanne NA reporter

    @mchaput I am not using any sorting of the results. Let me come up with some simple sample test code that highlights the problem.

  2. pombredanne NA reporter

    See attached test script whoosh25perf.py with code that does about the same as my real setup.

    Run this in a Whoosh checkout for each version.

    With Whoosh 2.4.1, I get about 95 secs consistently:

        $ python whoosh25perf.py
        95.8299999237

    With Whoosh 2.5.1, I get about 120 secs consistently:

        $ python whoosh25perf.py
        121.996000051

    This is a ~30% increase in run time. The problem is more pronounced in my full setup, which has a greater variety of queries and docs and creates and discards more indexes: there I get a ~50% increase in run time.
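For readers who want to reproduce this kind of comparison, here is a minimal timing sketch. It is not the attached whoosh25perf.py; `time_call` and `percent_increase` are hypothetical helper names, and the two timings are the rounded figures quoted above, which work out to a slowdown of roughly 27%.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn.

    Hypothetical helper: taking the minimum reduces noise from other processes.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def percent_increase(old, new):
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


# Rounded timings reported in this thread for Whoosh 2.4.1 and 2.5.1.
slowdown = percent_increase(95.83, 122.00)
print("2.5.1 is %.1f%% slower than 2.4.1" % slowdown)
```

To compare two Whoosh versions, the same workload function would be passed to `time_call` once in each checkout and the two results fed to `percent_increase`.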

  3. Matt Chaput repo owner

    Improved performance when creating tons of SpanNear queries. Fixes issue #345.

    The main performance difference in the test code was actually due to a fix: in Whoosh 2.5, the Phrase query was fixed to call to_bytes() on the phrase words. When making phrases 100s or 1000s of words long, the difference was very noticeable.

    Made two fixes to improve the performance:

    1. Fixed logic in Phrase.matcher() to fail early on a missing word, instead of calling to_bytes() on all words and THEN checking if they're missing. This made a huge difference in the test code because it was mostly searching for nonexistent words.

    2. Added a new SpanNear2 query type and changed Phrase to use it instead of SpanNear. SpanNear2 has slightly less overhead.

    These two changes make the test code actually run faster than in 2.4.
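The fail-early idea in fix 1 can be illustrated with a toy sketch. This is not Whoosh's actual Phrase.matcher() code: the function names, the `calls` counter, and the set standing in for the index's term dictionary are all assumptions made for illustration.

```python
def to_bytes(word, calls):
    """Stand-in for the per-word encoding step; records each call in `calls`."""
    calls.append(word)
    return word.encode("utf-8")


def encode_all_then_check(words, index_terms, calls):
    """Old-style logic: encode every word first, then look for missing ones."""
    encoded = [to_bytes(w, calls) for w in words]
    if any(b not in index_terms for b in encoded):
        return None  # the phrase cannot match
    return encoded


def encode_fail_early(words, index_terms, calls):
    """New-style logic: stop at the first word that is not in the index."""
    encoded = []
    for w in words:
        b = to_bytes(w, calls)
        if b not in index_terms:
            return None  # bail out without touching the remaining words
        encoded.append(b)
    return encoded


# A 1000-word phrase whose first word is absent from the (toy) index.
index_terms = {b"alpha", b"beta"}
words = ["missing"] + ["alpha"] * 999

slow_calls, fast_calls = [], []
encode_all_then_check(words, index_terms, slow_calls)
encode_fail_early(words, index_terms, fast_calls)
print(len(slow_calls), len(fast_calls))
```

With phrases hundreds or thousands of words long that mostly contain words missing from the index, the early return skips nearly all of the per-word work, which is consistent with the difference described above.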

    → <<cset 40cafca477f5>>

  4. Matt Chaput repo owner

    It should work the same, but in your test code I didn't bother... I just changed Phrase to use SpanNear2 instead of SpanNear and that made the difference.

  5. Matt Chaput repo owner

    (Or, I guess I should have said, that made "a" difference... the main fix was failing fast if one of the terms in the phrase wasn't in the index.)

  6. pombredanne NA reporter

    Using 2.5.2, I can confirm a significant speedup for my long tests compared with 2.4.1: run times are between 14% and 16% shorter. Thank you very much, my Lord!
