Slow complex OR searches on indexed Wikipedia

Issue #469 new
tim.dettmers@gmail.com
created an issue

In my research I use Whoosh as the search backend for training a reinforcement learning algorithm that is supposed to optimize searches. The idea is to start from a question, which the algorithm then gradually rewrites.

Running these full questions as OR queries against an indexed Wikipedia is extremely slow compared to other IR engines.

With Whoosh I get about 16s for 32 queries, or 0.5s per query. With Xapian I get roughly 0.1s per query, and with Lucene around 0.05s. Even 0.05s is too slow for my application, but the differences are marked and make Whoosh unusable in this case. I tried many differently indexed versions of Wikipedia, with and without storage and with and without phrases, but I get essentially the same results.
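The per-query figures are total wall-clock time divided by the number of queries (16s / 32 = 0.5s). A minimal sketch of such a timing harness follows; `time_queries` and `dummy_search` are hypothetical names used only for illustration, and `dummy_search` is a placeholder for whichever engine call is being measured, not a Whoosh, Xapian or Lucene API.

```python
import time


def time_queries(run_query, queries):
    """Run every query once and return (total_seconds, seconds_per_query)."""
    start = time.perf_counter()
    for q in queries:
        run_query(q)
    total = time.perf_counter() - start
    return total, total / len(queries)


def dummy_search(query):
    # Hypothetical placeholder: a real benchmark would call the engine here.
    return [term for term in query.split() if len(term) > 3]


# A batch of 32 queries, matching the batch size used in the report.
queries = ["who wrote the origin of species"] * 32
total, per_query = time_queries(dummy_search, queries)
print(f"{total:.4f}s for {len(queries)} queries, {per_query:.6f}s per query")
```

Timing the whole batch rather than single calls averages out per-call noise, which matters when individual queries take only a few milliseconds.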

What is apparent is that Whoosh is as fast as other search engines at finding the actual documents, but scoring is very slow. If I do an in-memory sparse matrix-vector multiplication with numpy, I get 0.06s per query; the matrix is already in memory, but it also covers the entire Wikipedia. Smaller matrix-vector products over only the relevant documents usually take less than 0.01s, which is closer to the performance I need.
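To make the comparison concrete: scoring an OR query can be viewed as a sparse matrix-vector product between a term-document weight matrix and the query vector, and restricting it to documents that contain at least one query term gives the smaller product. The following is a minimal pure-Python sketch of that idea, with dicts standing in for the sparse matrix. It is not the numpy code that was timed, and the toy index, the weights and the name `score_or_query` are made up for illustration.

```python
from collections import defaultdict

# Toy inverted index: term -> {doc_id: weight}. All weights are invented.
postings = {
    "whoosh": {0: 1.0, 2: 0.5},
    "search": {0: 0.5, 1: 2.0},
    "slow": {2: 1.5},
}


def score_or_query(postings, query_weights):
    """Score only documents that contain at least one query term.

    Equivalent to a sparse matrix-vector product restricted to the
    rows (documents) touched by the query's terms.
    """
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Terms missing from the index contribute nothing.
        for doc_id, d_weight in postings.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = score_or_query(postings, {"whoosh": 1.0, "slow": 2.0, "missing": 1.0})
print(ranked)
```

Document 1 never appears in the result because it contains none of the query terms, so no work is spent scoring it.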

I admit that this is quite a special case, but it highlights that Whoosh has some performance issues with complex queries, which might be useful for you to know. Ranking in particular seems to be slow.
