Whoosh NG technical notes

Overview

The next version of Whoosh will be rewritten to store the index information in a pluggable Key-Value database. This will make the implementation simpler and allow the user to optionally plug in a fast backend (e.g. LMDB or LevelDB) to improve performance.

Index semantics

It's not exactly clear what the most efficient way to store an inverted index in a key-value database is. This is a scratch list of ideas.

Sub-databases

  • Store all information in one database or separate information types in separate databases?
  • Some databases allow multiple "sub-databases". Should Whoosh use this functionality where it exists and simulate it in other databases using key prefixes? (A sketch of the prefix approach follows this list.)
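
As a rough illustration of the prefix option, here is a minimal sketch of a sub-database simulated with a one-byte key prefix. The put/get/key_range methods on the underlying store are assumed for the example, not an existing Whoosh API.

    class PrefixSubDatabase:
        """Simulates a named sub-database on top of a flat key-value store
        by prefixing every key with a single identifying byte."""

        def __init__(self, db, prefix):
            self._db = db          # underlying flat store (hypothetical interface)
            self._prefix = prefix  # one byte, e.g. b"t" for terms

        def put(self, key, value):
            self._db.put(self._prefix + key, value)

        def get(self, key):
            return self._db.get(self._prefix + key)

        def keys(self):
            # Sorted iteration over only this sub-database's keys: scan the
            # half-open range [prefix, next byte) and strip the prefix.
            upper = bytes([self._prefix[0] + 1])
            for k in self._db.key_range(self._prefix, upper):
                yield k[len(self._prefix):]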

Terms

  • Term info as "-field:term" -> "<weight><docfreq><uid><extra>"

Per-document info

  • Document info as "_<docid>" -> "<field lengths>"
  • Stored fields as "[<docid>" -> "<pickle>"
  • Vector as "]<docid>field" -> "<uid><terminfo>..." (a sketch of building these keys follows this list)
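
Purely for illustration, the key layouts above might be built by small helpers like these. The one-character prefixes come from the lists above; packing doc IDs as big-endian integers with struct is an assumption, not a committed format.

    import struct

    def term_key(fieldname, termbytes):
        # "-field:term" -> term info
        return b"-" + fieldname.encode("utf-8") + b":" + termbytes

    def docinfo_key(docid):
        # "_<docid>" -> field lengths; big-endian so keys sort in doc ID order
        return b"_" + struct.pack(">I", docid)

    def stored_key(docid):
        # "[<docid>" -> pickled stored fields
        return b"[" + struct.pack(">I", docid)

    def vector_key(docid, fieldname):
        # "]<docid>field" -> vector data
        return b"]" + struct.pack(">I", docid) + fieldname.encode("utf-8")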

Postings

Have to think about the overhead of a large number of key/value pairs in the database. Storing every posting as a separate pair would be really bad.

Idea:

  • Store blocks of N postings e.g. "+<term><first docid>" -> "<docid><weight>..."
  • Store positions separately?
  • While indexing, buffer the posting list for every term in memory (!) until it has N postings, then add that block to the database, to avoid constantly reading and rewriting posting blocks from/to the db (see the sketch after this list).
  • If a field is marked unique, include a posting with the term info and don't buffer.
  • To read all blocks, use sorted key iteration.
  • To skip to a certain document, use sorted key positioning.
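
A minimal sketch of the buffering idea under those assumptions: the "+<term><first docid>" block key from the list above, a store with a put(key, value) method, and an arbitrary block size and <docid><weight> encoding chosen just for the example.

    import struct
    from collections import defaultdict

    BLOCK_SIZE = 128  # N postings per block; an arbitrary choice for the sketch

    class PostingBuffer:
        """Buffers postings per term in memory and writes a block to the store
        once a term has accumulated BLOCK_SIZE postings, so blocks never have
        to be read back and rewritten during indexing."""

        def __init__(self, db):
            self._db = db  # hypothetical store with a put(key, value) method
            self._buffers = defaultdict(list)  # term key -> [(docid, weight), ...]

        def add_posting(self, termkey, docid, weight):
            buf = self._buffers[termkey]
            buf.append((docid, weight))
            if len(buf) >= BLOCK_SIZE:
                self._flush(termkey, buf)
                buf.clear()

        def _flush(self, termkey, postings):
            # Key is "+<term><first docid>" so blocks for a term sort by doc ID
            # and can be scanned or skipped with sorted key iteration.
            key = b"+" + termkey + struct.pack(">I", postings[0][0])
            value = b"".join(struct.pack(">If", docid, weight)
                             for docid, weight in postings)
            self._db.put(key, value)

        def close(self):
            # Flush any partially filled blocks at the end of indexing.
            for termkey, buf in self._buffers.items():
                if buf:
                    self._flush(termkey, buf)
            self._buffers.clear()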

Field lengths

  • Currently written per segment, cached in memory
  • How to store in db? Writing a "field" -> "<huge array>" pair to the db might be bad.
  • Store duplicated length with every doc ID in the posting lists? (Xapian used to do this but changed it because it wasted so much space.)

Sort keys

  • Store as one pair per document/field, eg. "|<docid>field" -> "key"? Is that too many pairs?
  • Or store them in the posting lists? Xapian does this.

Spelling

  • In the old world it was straightforward to store a per-segment word graph and merge the graphs to get spelling results.
  • In the new world, Whoosh will probably abandon the FST entirely and search the sorted term keys directly, using efficiency tricks from Aspell (a simplified sketch follows this list).
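
A deliberately naive sketch of the sorted-key approach, which ranks candidates with difflib instead of reproducing Aspell's tricks. It assumes the "-field:term" key layout from above and a hypothetical key_range(lo, hi) sorted iterator on the store.

    from difflib import SequenceMatcher

    def suggest(db, fieldname, word, prefix_len=2, limit=5):
        # Candidate terms: every term in the field sharing a short prefix with
        # the misspelled word, found by scanning the sorted "-field:term" keys
        # in a prefix range.
        prefix = (b"-" + fieldname.encode("utf-8") + b":"
                  + word[:prefix_len].encode("utf-8"))
        hi = prefix + b"\xff" * 8  # crude upper bound covering the prefix range
        scored = []
        for key in db.key_range(prefix, hi):
            term = key.split(b":", 1)[1].decode("utf-8")
            scored.append((SequenceMatcher(None, word, term).ratio(), term))
        scored.sort(reverse=True)
        return [term for _, term in scored[:limit]]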

Deleting

  • Currently, deleting a document simply adds the document number to the "deleted documents" set rather than removing anything on disk, because the file structures are designed to be "write-once" for speed. The deleted set is used to filter deleted documents out of results. Eventually, as segments merge, the deleted documents are dropped.
  • In the new KV store, the per-document information for a deleted document can be deleted efficiently, but the document would still be referenced in the postings, which would be too expensive to rewrite, so there will still need to be a deleted list that filters results.
  • How to remove deleted documents from the postings? Can't rely on merging to do it automatically, so this will probably require an "index.compact()" method that rewrites all postings (both ideas are sketched after this list).
  • Replace document values "in place" instead of delete-and-append? Would require versioning in posting lists.
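
A sketch of both halves of that plan, assuming the posting-block layout above and a hypothetical store with key_range/get/put/delete: filtering deleted doc IDs out at read time, and an explicit compact pass that rewrites the posting blocks.

    import struct

    def filtered_postings(postings, deleted):
        # Read-time filtering: skip postings whose doc ID is in the deleted set.
        for docid, weight in postings:
            if docid not in deleted:
                yield (docid, weight)

    def compact_postings(db, deleted):
        # A rough compact() pass: rewrite every "+..." posting block without the
        # deleted documents. The <first docid> baked into the key is left alone;
        # it remains a valid lower bound for skipping.
        lo, hi = b"+", b","  # "," is the byte after "+", so this spans all posting keys
        for key in list(db.key_range(lo, hi)):
            value = db.get(key)
            pairs = [struct.unpack_from(">If", value, i)
                     for i in range(0, len(value), 8)]
            kept = [(d, w) for d, w in pairs if d not in deleted]
            if not kept:
                db.delete(key)
            elif len(kept) != len(pairs):
                db.put(key, b"".join(struct.pack(">If", d, w) for d, w in kept))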

Doc IDs

  • Absolute document numbers instead of segment-relative docnums.
  • Use 32-bit IDs and reassign IDs on compaction? Or use 64-bit IDs and make IDs permanent?
  • Allow user to assign arbitrary doc ID strings?

Key-value store

Requirements

  • Keys and values are variable-length bytestrings.
  • Sorted key/key-range access
  • Transactions

Optional or nice-to-have features:

  • May have an interface for optionally taking advantage of more efficient mappings provided by the implementation, for example an integer-to-bytes mapping or mappings with fixed key/value sizes.
  • May have an interface for storing multiple databases. Implementations that actually support multiple databases could provide the interface directly, while other implementations could automatically prefix keys with a database identifier. (A possible overall interface is sketched after this list.)
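
Putting the required and optional features together, a backend interface might look something like this. It is only a sketch; the method names are illustrative, not a committed API.

    from abc import ABC, abstractmethod

    class KVStore(ABC):
        """Hypothetical minimal interface a backend connector would implement."""

        @abstractmethod
        def put(self, key, value):
            ...

        @abstractmethod
        def get(self, key):
            ...

        @abstractmethod
        def delete(self, key):
            ...

        @abstractmethod
        def key_range(self, lo, hi):
            """Iterate keys in sorted order over the half-open range [lo, hi)."""

        @abstractmethod
        def transaction(self):
            """Context manager: changes become visible atomically on exit."""

        # Optional feature
        def sub_database(self, name):
            # Backends with native sub-databases would override this; others
            # could fall back to the key-prefix wrapper sketched earlier.
            raise NotImplementedError("no native sub-database support")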

Performance comparison

Here are some quick benchmarks based on approx. 100 000 short keys and values (the keys are the words from the UNIX dict file; the values are the MD5 hash of each word). All times are in seconds.

The columns are:

  • Index: time to add all key/value pairs to the database (in random order).
  • Open: time to open a reader.
  • Search: time to look up the values of all keys (in random order).
  • Contains: time to check whether the database contains each key (in random order).
  • All Keys: time to iterate over all keys in the database (in order).
  • Range: time to iterate over keys (in order) in a fixed range.
  • Missing: time to look up 10 000 random keys that aren't in the database.

              Index      Open       Search     Contains   All Keys   Range      Missing
  test.OHash   2.239817   0.000731   2.459640   2.331673   0.378434   0.311319   0.272103
 test.Sqlite   0.873848   0.185237   3.085652   2.935084   0.121910   0.137674   0.409620
  test.Tokyo   0.190685   0.003975   0.302323   0.310911   0.000030   0.097651   0.106678
   test.LMDB   0.113563   0.002233   0.122503   0.150806   0.015622   0.059349   0.093369
test.LevelDB   0.310141   0.015839   0.465431   0.470122   0.026774   0.058462   0.139914
    test.Bsd   1.278952   0.000001   0.936407   1.059523   0.264198   0.148929   0.203436

One million short random keys and values:

              Index      Open       Search     Contains   All Keys   Range      Missing
 test.Sqlite  12.948117   1.892798  34.996831  31.551146   1.221176   1.250766   0.438312
  test.Tokyo  20.020066   0.000209  21.272469  20.365950   0.000031   1.083051   0.153967
   test.LMDB   1.643633   0.000176   1.620265   1.410010   0.162899   0.551481   0.111052
test.LevelDB   3.417837   0.018466   5.510011   4.437665   0.282070   0.615950   0.156464
    test.Bsd  15.686237   0.000192  11.928028  12.165802   2.837713   1.463350   0.225478

Note: Kyoto Cabinet was not tested because I couldn't find a Python binding for it.

The need for a new pure-Python backend

In the above performance comparison, "OHash" is Whoosh's current on-disk lookup implementation, based on CDB. Even though the CDB storage algorithm is very fast in C, in Python it's much slower than the native alternatives.

SQLite has acceptable indexing performance (though much slower than Tokyo Cabinet, LMDB, or LevelDB) but the lookup time is terrible.

Python doesn't reliably ship with a usable on-disk key-value store.

  • BSDDB is no longer included in the standard library as of Python 3.
  • The dbm-style database interfaces included on UNIX (e.g. GDBM) are not available on Windows and don't have transactions or sorted keys.
  • SQLite can be made to act like a K-V store but it is SLOW (see above).

So, I'll probably create a pure-Python key-value store for use as the default store.

I've done a lot of research trying to find a fast on-disk KV storage method (e.g. skip lists, B+ trees, COLAs), but I haven't found a way to store everything efficiently on disk that's fast enough in Python. So, the new store will almost certainly have to keep all keys in memory to be usable. Hopefully this won't be a huge burden since memory is so plentiful these days, and people storing very large indexes can use the SQLite backend or install a native database library.

Of course, the most important aspect of writing a new KV store will be picking a cool name for the project. I'm open to suggestions, for now I'm using KiKi (key key... get it?).

This is a preliminary benchmark based on simply writing the key/value pairs to disk in a BitCask-like format and reading the keys into memory. The actual implementation will probably use a Log-Structured Merge (LSM) tree algorithm, similar to how LevelDB and Whoosh's segment merging work, because it's conceptually simple. However, the multiple rewrites required by the merges will make real indexing slower than these numbers.

           Index      Open       Search     Contains   All Keys   Range      Missing
test.KiKi   0.404039   0.073141   0.169897   0.133468   0.057818   0.068044   0.111820
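
To make the BitCask-like approach concrete, here is a toy sketch (not the real KiKi code): values are appended to a single data file, every key lives in an in-memory dict of offsets, and sorted access simply sorts the in-memory keys. A real store would also rescan the data file on open to rebuild that dict.

    import os
    import struct

    class MiniKiKi:
        """Toy BitCask-like store: append-only data file plus in-memory key index."""

        def __init__(self, path):
            self._offsets = {}  # key -> (value offset, value length)
            self._f = open(path, "ab+")

        def put(self, key, value):
            self._f.seek(0, os.SEEK_END)
            start = self._f.tell()
            # Record layout: <key length><value length><key bytes><value bytes>
            self._f.write(struct.pack(">II", len(key), len(value)) + key + value)
            self._offsets[key] = (start + 8 + len(key), len(value))

        def get(self, key):
            offset, length = self._offsets[key]
            self._f.seek(offset)
            return self._f.read(length)

        def key_range(self, lo, hi):
            # All keys are in memory, so range iteration is just a sort.
            for key in sorted(k for k in self._offsets if lo <= k < hi):
                yield key

        def close(self):
            self._f.close()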

Available backends

Whoosh should ship with "connectors" for various backends. The user would then need to install the actual database and whatever Python library is necessary to use it (e.g. the lmdb package).

In-process databases (these have all been tested):

  • LMDB
  • LevelDB
  • Tokyo Cabinet
  • BSDDB B-tree
  • SQLite

Client-server databases:

  • Redis
  • Google AppEngine datastore
  • Cassandra
  • MongoDB

Supporting other on-disk hashes

You could theoretically simulate sorted key access on a store that doesn't support it (e.g. GDBM) by simply sorting the keys in memory.

However, I don't know how transactions (having changes show up to readers all at once or not at all) could be simulated. Maybe you could use stores that don't support transactions if you accept that you can only have a single reader/writer process (that is, no other threads or processes can read the index, or they would get an inconsistent view during updates).
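
A sketch of the sorted-key simulation over Python's dbm module; as noted above, it provides no transactional guarantees, so it would only be safe with a single reader/writer process.

    import dbm

    class SortedDbmWrapper:
        """Simulates sorted key-range access over a plain dbm-style store
        (which only offers unordered key access) by sorting its keys in memory."""

        def __init__(self, path):
            self._db = dbm.open(path, "c")

        def put(self, key, value):
            self._db[key] = value

        def get(self, key):
            return self._db[key]

        def key_range(self, lo, hi):
            # dbm has no ordered iteration, so pull all the keys and sort them.
            for key in sorted(self._db.keys()):
                if lo <= key < hi:
                    yield key

        def close(self):
            self._db.close()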

Multiprocessor indexing

This may be the biggest question mark. Currently, Whoosh starts N processes to index documents separately, then (optionally) uses the merge machinery to merge the separate segments at the end. If the new world keeps all information in a single database instead of in separate segments, how does this work?

Other possible changes

  • Abandon the custom "format" system and hardwire postings/vectors to store (optional) freq, positions, characters, and/or arbitrary user payloads.
