Whoosh NG technical notes
The next version of Whoosh will be rewritten to store the index information in a pluggable key-value database. This will make the implementation simpler and allow the user to optionally plug in a fast backend (e.g. LMDB or LevelDB) to improve performance.
It's not yet clear what the most efficient way to store an inverted index in a key-value database is. This is a scratch list of ideas.
- Store all information in one database or separate information types in separate databases?
- Some dbs allow multiple "sub-databases". Should Whoosh use this functionality and simulate it in other dbs using key prefixes?
- Term info as "-field:term" -> "<weight><docfreq><uid><extra>"
- Document info as "_<docid>" -> "<field lengths>"
- Stored fields as "[<docid>" -> "<pickle>"
- Vector as "]<docid>field" -> "<uid><terminfo>..."
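As a sketch, the key layouts above could be built with helpers like these. The one-byte prefixes come from the list; the fixed-width docid packing and the helper names are my assumptions, not Whoosh's actual encoding:

```python
import struct

def term_key(field, term):
    # "-field:term" -> term info
    return b"-" + field.encode() + b":" + term.encode()

def doc_key(docid):
    # "_<docid>" -> field lengths
    return b"_" + struct.pack(">I", docid)

def stored_key(docid):
    # "[<docid>" -> pickled stored fields
    return b"[" + struct.pack(">I", docid)

def vector_key(docid, field):
    # "]<docid>field" -> vector data
    return b"]" + struct.pack(">I", docid) + field.encode()
```

Because each information type gets a distinct prefix byte, all keys of one type cluster together under sorted key iteration, which simulates sub-databases in backends that don't support them.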
Have to think about the overhead of a large number of key/value pairs in the database. Storing every posting as a separate pair would be really bad.
- Store blocks of N postings e.g. "+<term><first docid>" -> "<docid><weight>..."
- Store positions separately?
- While indexing, buffer posting list for every term in memory (!) until it has N postings, then add that block to the database (to avoid constant reading and rewriting of posting blocks from/to the db).
- If a field is marked unique, include a posting with the term info and don't buffer.
- To read all blocks, use sorted key iteration.
- To skip to a certain document, use sorted key positioning.
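The buffering scheme above might look something like this sketch. The `PostingWriter` class, the `(docid, weight)` posting format, and the dict-like `db` are illustrative assumptions; only the "+<term><first docid>" key layout and the buffer-until-N-postings behavior come from the notes:

```python
import struct
from collections import defaultdict

BLOCK_SIZE = 128  # N postings per block; an arbitrary choice here

class PostingWriter:
    def __init__(self, db, blocksize=BLOCK_SIZE):
        self.db = db                      # any mapping with sorted keys
        self.blocksize = blocksize
        self.buffers = defaultdict(list)  # term -> [(docid, weight), ...]

    def add(self, term, docid, weight):
        buf = self.buffers[term]
        buf.append((docid, weight))
        if len(buf) >= self.blocksize:
            self._flush(term)

    def _flush(self, term):
        buf = self.buffers.pop(term, [])
        if not buf:
            return
        # key sorts by term, then by the first docid in the block
        key = b"+" + term.encode() + struct.pack(">I", buf[0][0])
        value = b"".join(struct.pack(">If", d, w) for d, w in buf)
        self.db[key] = value

    def close(self):
        for term in list(self.buffers):
            self._flush(term)
```

Since block keys sort by term and then by first docid, reading a term's postings is a sorted range scan, and skipping to a document is a sorted key positioning, as described above.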
Field lengths:
- Currently written per segment, cached in memory.
- How to store them in the db? Writing a single "field" -> "<huge array>" pair to the db might be bad.
- Store duplicated length with every doc ID in the posting lists? (Xapian used to do this but changed it because it wasted so much space.)
- Store as one pair per document/field, e.g. "|<docid>field" -> "<length>"? Is that too many pairs?
- Or store them in the posting lists? Xapian does this.
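The "one pair per document/field" option could be sketched like this; the "|" prefix and the fixed-width packing are illustrative assumptions:

```python
import struct

def length_key(docid, field):
    # "|<docid>field" -> field length
    return b"|" + struct.pack(">I", docid) + field.encode()

def write_length(db, docid, field, length):
    db[length_key(docid, field)] = struct.pack(">I", length)

def read_length(db, docid, field, default=0):
    v = db.get(length_key(docid, field))
    return struct.unpack(">I", v)[0] if v is not None else default
```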
Spelling:
- In the old world it was straightforward to store a per-segment word graph and merge the graphs to get spelling results.
- In the new world, Whoosh will probably abandon the FST entirely and search the sorted term keys directly, using efficiency tricks from Aspell.
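A minimal sketch of searching sorted term keys for suggestions: the pruning here is just first-letter matching, which is one of the crudest Aspell-style tricks; a real implementation would prune much more aggressively:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(sorted_terms, word, maxdist=1):
    # sorted_terms stands in for a sorted key scan of the term database
    out = []
    for term in sorted_terms:
        if term[:1] != word[:1]:
            continue  # crude pruning: only scan terms with the same first letter
        if edit_distance(term, word) <= maxdist:
            out.append(term)
    return out
```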
Deletion:
- Currently, deleting a document simply adds the document number to a "deleted documents" set; nothing is actually removed on disk, because the file structures are designed to be "write-once" for speed. The deleted set is used to remove deleted documents from results. Eventually, as segments merge, the deleted documents are filtered out.
- In the new KV store, the per-document information about the deleted document can be deleted efficiently, but the document would still be referenced in the postings, which would be too expensive to rewrite, so there will still need to be a deleted list that filters results.
- How to remove deleted documents from the postings? Can't rely on merging to do it automatically. Will probably require an "index.compact()" method that rewrites all postings.
- Replace document values "in place" instead of delete-and-append? Would require versioning in posting lists.
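The two deletion mechanisms described above can be sketched as follows. Postings are modeled as simple `(docid, weight)` lists rather than Whoosh's real format, and `compact` is a stand-in for the hypothetical `index.compact()` method:

```python
def filter_postings(postings, deleted):
    # search-time filtering: cheap, but deleted docs still occupy space
    return [(d, w) for d, w in postings if d not in deleted]

def compact(posting_lists, deleted):
    # offline rewrite: returns new posting lists with deletions
    # permanently removed, at the cost of rewriting everything
    return {term: filter_postings(plist, deleted)
            for term, plist in posting_lists.items()}
```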
Document IDs:
- Absolute document numbers instead of segment-relative docnums.
- Use 32-bit IDs and reassign IDs on compaction? Or use 64-bit IDs and make IDs permanent?
- Allow user to assign arbitrary doc ID strings?
Required features of the key-value database interface:
- Keys and values are variable-length bytestrings.
- Sorted key/key-range access.
Optional or nice-to-have features:
- May have an interface for optionally taking advantage of more efficient mappings provided by the implementation, for example an integer-to-bytes mapping or mappings with fixed key/value sizes.
- May have an interface for storing multiple databases. Implementations that actually support multiple databases could provide the interface directly, while other implementations could automatically prefix keys with a database identifier.
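A hypothetical sketch of the minimal backend interface implied above: bytestring keys and values plus sorted range scans, with sub-databases simulated by key prefixes for backends that lack them natively. The class and method names are assumptions, not Whoosh's actual API:

```python
from bisect import bisect_left, insort

class MemoryKV:
    """Toy in-memory backend that keeps keys sorted for range access."""

    def __init__(self):
        self._keys = []
        self._map = {}

    def put(self, key, value):
        if key not in self._map:
            insort(self._keys, key)
        self._map[key] = value

    def get(self, key):
        return self._map[key]

    def key_range(self, start, end):
        # yield keys k with start <= k < end, in sorted order
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i]
            i += 1

class PrefixSubDB:
    """Simulates a sub-database by prefixing every key."""

    def __init__(self, kv, prefix):
        self.kv, self.prefix = kv, prefix

    def put(self, key, value):
        self.kv.put(self.prefix + key, value)

    def get(self, key):
        return self.kv.get(self.prefix + key)
```

Backends that natively support multiple databases (e.g. LMDB's named databases) could provide the sub-database interface directly instead of prefixing.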
Here are some quick benchmarks based on approx. 100 000 short keys and values (the keys are the words from the UNIX dict file; the values are the MD5 hash of each word).
- Index: the time to add all key/value pairs to the database (in random order).
- Open: the time to open a reader.
- Search: the time to look up the values of all keys (in random order).
- Contains: the time to check whether the database contains each key (in random order).
- All Keys: the time to iterate over all keys in the database (in order).
- Range: the time to iterate over the keys in a fixed range (in order).
- Missing: the time to look up 10 000 random keys that aren't in the database.
              Index     Open      Search    Contains  All Keys  Range     Missing
test.OHash    2.239817  0.000731  2.459640  2.331673  0.378434  0.311319  0.272103
test.Sqlite   0.873848  0.185237  3.085652  2.935084  0.121910  0.137674  0.409620
test.Tokyo    0.190685  0.003975  0.302323  0.310911  0.000030  0.097651  0.106678
test.LMDB     0.113563  0.002233  0.122503  0.150806  0.015622  0.059349  0.093369
test.LevelDB  0.310141  0.015839  0.465431  0.470122  0.026774  0.058462  0.139914
test.Bsd      1.278952  0.000001  0.936407  1.059523  0.264198  0.148929  0.203436
One million short random keys and values:
              Index      Open      Search     Contains   All Keys  Range     Missing
test.Sqlite   12.948117  1.892798  34.996831  31.551146  1.221176  1.250766  0.438312
test.Tokyo    20.020066  0.000209  21.272469  20.365950  0.000031  1.083051  0.153967
test.LMDB     1.643633   0.000176  1.620265   1.410010   0.162899  0.551481  0.111052
test.LevelDB  3.417837   0.018466  5.510011   4.437665   0.282070  0.615950  0.156464
test.Bsd      15.686237  0.000192  11.928028  12.165802  2.837713  1.463350  0.225478
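For reference, the rough shape of a timing loop for a few of the columns above; the dict-like store interface and the `bench` helper are stand-ins, not the actual benchmark code:

```python
import time

def bench(store, pairs, probes):
    times = {}

    t0 = time.perf_counter()
    for k, v in pairs:                 # "Index": add all pairs
        store[k] = v
    times["index"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    for k, _ in probes:                # "Search": look up values
        store[k]
    times["search"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    for k, _ in probes:                # "Contains": membership checks
        k in store
    times["contains"] = time.perf_counter() - t0

    return times
```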
Note: Kyoto Cabinet was not tested because I couldn't find a Python binding for it.
The need for a new pure-Python backend
In the above performance comparison, "OHash" is Whoosh's current on-disk lookup implementation, based on CDB. Even though the CDB storage algorithm is very fast in C, in Python it's much slower than the native alternatives.
SQLite has acceptable indexing performance (though much slower than Tokyo Cabinet, LMDB, or LevelDB) but the lookup time is terrible.
Python doesn't reliably ship with a usable on-disk key-value store.
- The bsddb module is no longer included in the standard library as of Python 3.
- The dbm-style interfaces included on UNIX (e.g. gdbm) are not available on Windows and don't have transactions or sorted keys.
- SQLite can be made to act like a K-V store but it is SLOW (see above).
So, I'll probably create a pure-Python key-value store for use as the default store.
I've done a lot of research trying to find a fast on-disk KV storage method (e.g. skip lists, B+ trees, COLAs), but I haven't found a way to store everything efficiently on disk that's fast enough in Python. So, the new store will almost certainly have to keep all keys in memory to be usable. Hopefully this won't be a huge burden since memory is so plentiful these days, and people storing very large indexes can use the SQLite backend or install a native database library.
Of course, the most important aspect of writing a new KV store will be picking a cool name for the project. I'm open to suggestions, for now I'm using KiKi (key key... get it?).
This is a preliminary benchmark based on simply writing the key/value pairs to disk in a Bitcask-like format and reading the keys into memory. The actual implementation will probably use a Log-Structured Merge (LSM) tree algorithm, similar to how LevelDB and Whoosh's segment merging work, because it's conceptually simple. However, the multiple rewrites required by the merges will make the actual indexing numbers slower.
           Index     Open      Search    Contains  All Keys  Range     Missing
test.KiKi  0.404039  0.073141  0.169897  0.133468  0.057818  0.068044  0.111820
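The Bitcask-like layout described above, values in an append-only log on disk and an in-memory dict mapping each key to its (offset, length), could look roughly like this. The record format is an assumption, and reopening an existing log (which would require rescanning it to rebuild the dict) is omitted:

```python
import os
import struct

class LogKV:
    def __init__(self, path):
        self.offsets = {}            # key -> (value offset, value length)
        self.f = open(path, "ab+")

    def put(self, key, value):
        # record: 2-byte key length, 4-byte value length, key, value
        rec = struct.pack(">HI", len(key), len(value)) + key + value
        self.f.seek(0, os.SEEK_END)
        pos = self.f.tell()
        self.f.write(rec)
        self.offsets[key] = (pos + 6 + len(key), len(value))

    def get(self, key):
        off, length = self.offsets[key]
        self.f.seek(off)
        return self.f.read(length)

    def close(self):
        self.f.close()
```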
Whoosh should ship with "connectors" for various backends. The user would then need to install the actual database and whatever Python library is necessary to use it (e.g. the lmdb package).
In-process databases (these have all been tested):
- Tokyo Cabinet
- BSDDB B-tree
- Google AppEngine datastore
Supporting other on-disk hashes
You could theoretically simulate sorted key access on a store that doesn't support it (e.g. gdbm) by simply sorting the keys in memory.
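A sketch of that simulation, assuming any dict-like store; a real implementation would cache the sorted key list instead of rebuilding it on every call:

```python
from bisect import bisect_left

def sorted_range(store, start, end):
    # yield keys k with start <= k < end, in sorted order
    keys = sorted(store.keys())       # O(n log n) per call; cache in practice
    i = bisect_left(keys, start)
    while i < len(keys) and keys[i] < end:
        yield keys[i]
        i += 1
```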
However, I don't know how transactions (changes becoming visible to readers all at once or not at all) could be simulated. Stores without transactions might still be usable if you accept a single reader/writer process, that is, no other threads or processes can read the index, or they will get an inconsistent view during updates.
Multiprocessing
This may be the biggest question mark. Currently the multiprocessing writer starts N processes to index documents separately, then (optionally) uses the merge machinery to merge the separate segments at the end. If the new world keeps all information in a single database instead of in separate segments, how does this work?
Other possible changes
- Abandon the custom "format" system and hardwire postings/vectors to store (optional) freq, positions, characters, and/or arbitrary user payloads.