Issue #220 resolved

performance for complex search

Thomas Waldmann
created an issue

Made with code from: https://bitbucket.org/thomaswaldmann/python-search-benchmark/changeset/65300ace8c17

Machine: Thinkpad X300, 2x 1.2GHz, 4GB RAM, SSD

{{{
$ python bench.py
Params:
DOC_COUNT: 3000 WORD_LEN: 10
EXTRA_FIELD_COUNT: 10 EXTRA_FIELD_LEN: 100

Benchmarking: xappy 0.5 / xapian 1.2.5
Indexing takes 9.2s (326.5/s)
Searching takes 1.0s (2904.5/s)
Complex Searching takes 0.6s (463.3/s)

Benchmarking: xodb 0.4.17 / xapian 1.2.5
Indexing takes 8.9s (335.5/s)
Searching takes 2.8s (1069.1/s)
Complex Searching takes 1.0s (308.4/s)

Benchmarking: whoosh 2.3.2
Indexing takes 10.6s (283.7/s)
Searching takes 2.6s (1174.1/s)
Complex Searching takes 6.0s (49.7/s)
}}}

Considering that whoosh is pure Python, it is rather fast compared to the xapian (C++) based code, except for complex searching, where it is considerably slower (49.7/s versus 308.4/s and 463.3/s).
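To put a number on the gap, the complex-search rates from the run above can be divided by the whoosh rate. This is only arithmetic on the figures already reported; the helper name `slowdown_factors` is made up for illustration.

```python
# Complex-search rates (queries per second) copied from the benchmark output above.
complex_rates = {
    "xappy 0.5": 463.3,
    "xodb 0.4.17": 308.4,
    "whoosh 2.3.2": 49.7,
}


def slowdown_factors(rates, baseline):
    """Return how many times faster each engine is than the baseline engine."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items() if name != baseline}


factors = slowdown_factors(complex_rates, "whoosh 2.3.2")
for name, factor in sorted(factors.items()):
    print(f"{name}: {factor:.1f}x the whoosh complex-search rate")
```

On these numbers the xapian-based bindings come out roughly 6x to 9x faster on complex searches, while plain searching is within about 2.5x.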

Is whoosh perhaps missing some clever optimization that xapian performs?

IIRC, the performance difference is only visible when searching with a result size limit (as the benchmark does), not when there is no limit.

Can whoosh be improved for complex searches?

Comments (2)

  1. Matt Chaput repo owner
    • changed status to open

    One minor point: I think you should change the benchmark code so that it runs randomized "complex" queries instead of the same one over and over. Xapian might just be caching the result. (And I could game the benchmark by doing the same ;)

    I'll take a look to see if there's a reason for the performance difference other than "Looping in Python is slooooow".
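A minimal sketch of what randomized "complex" queries could look like, assuming a complex query is a conjunction of a few indexed words. The shape of the real benchmark's complex query is not shown in this thread, so the `AND` syntax and the names `make_vocabulary` and `random_complex_query` are hypothetical.

```python
import random
import string


def make_vocabulary(size, word_len, rng):
    """Build `size` distinct random lowercase words of length `word_len`."""
    words = set()
    while len(words) < size:
        words.add("".join(rng.choice(string.ascii_lowercase) for _ in range(word_len)))
    return sorted(words)  # sorted so the result depends only on the seed


def random_complex_query(vocabulary, rng, terms=3):
    """Join `terms` distinct random words with AND (assumed query syntax)."""
    return " AND ".join(rng.sample(vocabulary, terms))


rng = random.Random(42)
vocabulary = make_vocabulary(size=200, word_len=10, rng=rng)
queries = [random_complex_query(vocabulary, rng) for _ in range(5)]
for query in queries:
    print(query)
```

Because each query draws different terms, an engine that caches the result of a repeated query gets no advantage from it.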

  2. Matt Chaput repo owner

    Here are the results I get on a desktop machine using randomized complex searches:

    Params:
    DOC_COUNT: 3000 WORD_LEN: 10
    EXTRA_FIELD_COUNT: 10 EXTRA_FIELD_LEN: 100
    
    Benchmarking: xappy 0.5 / xapian 1.2.3
    Indexing takes 7.6s (393.5/s)
    Searching takes 0.8s (3703.7/s)
    Complex Searching takes 0.5s (604.8/s)
    
    Benchmarking: whoosh 2.5.0
    Indexing takes 5.8s (520.3/s)
    Searching takes 1.1s (2840.9/s)
    Complex Searching takes 2.9s (103.3/s)
    

    (Of course I just realized the benchmark script should be modified to run the same random complex searches for all engines, but it probably won't make a big difference.)
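One way to run the same random complex searches against every engine is to generate the query list once from a seeded generator and replay it. This is a sketch under assumptions, not the benchmark script's code: `build_query_set`, `time_engines`, and the idea that each engine is exposed as a plain callable taking one query are all hypothetical.

```python
import random
import time


def build_query_set(vocabulary, count, terms=3, seed=12345):
    """Generate a reproducible list of multi-term queries from a fixed seed."""
    rng = random.Random(seed)
    return [tuple(rng.sample(vocabulary, terms)) for _ in range(count)]


def time_engines(engines, queries):
    """Replay the identical query list against each engine; return queries/sec."""
    rates = {}
    for name, search in engines.items():
        start = time.perf_counter()
        for query in queries:
            search(query)
        elapsed = time.perf_counter() - start
        rates[name] = len(queries) / elapsed if elapsed > 0 else float("inf")
    return rates


# Stand-in engines that only record what they were asked to search for.
vocabulary = [f"word{i:04d}" for i in range(500)]
queries = build_query_set(vocabulary, count=100)
seen = {"engine_a": [], "engine_b": []}
engines = {name: log.append for name, log in seen.items()}
rates = time_engines(engines, queries)
```

Fixing the seed also makes separate runs comparable, since every run issues the same queries in the same order.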

    Looking at the intersection matching code, I don't see any room for optimization. I don't know that a 6x difference between C and Python on this kind of tight loop is exactly unexpected anyway.
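For readers unfamiliar with the kind of loop in question, here is a generic two-pointer intersection of sorted document-ID lists. It is an illustration of the general technique only, not Whoosh's actual matcher code; the function name `intersect_sorted` is made up.

```python
def intersect_sorted(a, b):
    """Return the IDs present in both sorted lists, using two cursors."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Both postings lists contain this document.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Advance whichever cursor is behind.
            i += 1
        else:
            j += 1
    return result


print(intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 8, 9]))
```

Every iteration performs a few comparisons and index operations, which is where interpreter overhead adds up in pure Python compared with compiled code.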

    So unfortunately, unless I think of something brilliant, the answer to the original question is probably "no, I can't see how Whoosh can improve complex searches right now."
