Issue #393 new

VarBytesColumn performance improvement (using stored offsets)

Christian Jacobsen
created an issue

The VarBytesColumn currently stores (and loads) the length of each byte string in the column in the passed dbfile. Offsets, however, are not stored; they are instead computed from the stored lengths when the VarBytesColumn is loaded. When a VarBytesColumn contains a largish number of byte strings (say, hundreds of thousands to millions), the performance penalty of computing the offsets becomes very noticeable.
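
For illustration, here is a minimal sketch of the kind of offset-reconstruction loop this implies (the names and structure are hypothetical, not Whoosh's actual VarBytesColumn code):

```python
# Hypothetical sketch: rebuilding offsets from stored lengths at load time.
def offsets_from_lengths(lengths):
    """Return the starting offset of each byte string, given the lengths."""
    offsets = []
    base = 0
    for length in lengths:
        offsets.append(base)
        base += length
    return offsets

# With millions of entries, a pure-Python loop like this runs every time
# the column is loaded, which is where the slowdown appears.
```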

The conditions under which I saw the poor performance with VarBytesColumn were as follows: I had a large number of segments of around 6 MB each, with about 2 million documents in the index. Since queries generally do not span multiple segments, retrieval of results had been very quick. At one point, however, some segments were merged into a 380 MB file, which presumably now contained the bulk of the data, and query speed suddenly decreased by a factor of around 10. I narrowed this down to the loop that computes offsets from lengths, which was consuming the majority of the runtime for my queries.

Storing precomputed offsets can, for a VarBytesColumn with a large number of strings, dramatically reduce the time it takes to load the column data. The offsets use only a relatively small amount of storage (for 2 million strings, assuming 4-byte longs, this is about 8 MB). Since GrowableArray (used for the offsets) will use unsigned types and segments are limited to 2 GB, it is easily possible to store all required offsets without danger of overflow.
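
As a rough illustration of the trade-off (the names and file layout here are hypothetical, not the actual patch), the offsets can be computed once at write time and read back with a single bulk operation:

```python
# Hypothetical sketch using the stdlib array module as a stand-in for
# Whoosh's GrowableArray; "I" is an unsigned int, typically 4 bytes.
from array import array

def write_offsets(dbfile, lengths):
    # Compute offsets once, at write time, and store them alongside the lengths.
    offsets = array("I", [0] * len(lengths))
    base = 0
    for i, length in enumerate(lengths):
        offsets[i] = base
        base += length
    dbfile.write(offsets.tobytes())  # roughly 8 MB for 2 million entries

def read_offsets(dbfile, count):
    # Load all offsets in one bulk read instead of a per-entry Python loop.
    offsets = array("I")
    offsets.frombytes(dbfile.read(count * offsets.itemsize))
    return offsets
```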

I have a proposed patch: https://bitbucket.org/christianjacobsen/whoosh/branches/compare/christianjacobsen/whoosh:f2aa2bc78f13..mchaput/whoosh:999cd5fb0d11#diff

This will need some work, but it can be used to demonstrate the performance increase. It is currently written to be backwards compatible for reading (i.e., it can load data from current indices) but does not write a backwards-compatible index. This backwards compatibility may or may not be a desirable goal; if it is, it should possibly be achieved in a better way. My very unscientific testing shows that a query on an index without stored offsets takes around 1.264224 seconds with 2 million rows in the column, whereas once optimize() has been run on the index to store the precomputed offsets, the same query takes around 0.060601 seconds.
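
Something along these lines can reproduce the kind of before/after comparison described above (a sketch only; the index directory, field name, and query text are placeholders, not the actual index or query I tested):

```python
# Hypothetical timing sketch; "indexdir", "content", and "example" are
# placeholders, not the actual index or query from this report.
import time
from whoosh import index
from whoosh.qparser import QueryParser

ix = index.open_dir("indexdir")
query = QueryParser("content", schema=ix.schema).parse("example")

def time_query():
    # Each searcher() call opens a fresh reader, so it sees the latest segments.
    with ix.searcher() as searcher:
        t0 = time.time()
        searcher.search(query)
        return time.time() - t0

print("before optimize:", time_query())
ix.optimize()  # rewrites segments; with the patch, offsets are stored
print("after optimize:", time_query())
```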
