reindex() required when using data['_id'] in index

Issue #7 resolved
Anton Shestakov
created an issue

Suppose you have more than one document type, distinguished by a '_type' property. All of those documents are free-form, with no required properties except '_type'. You need to make separate and meaningful indexes for those document types. Separation is trivial and is covered in the tutorial under "Tables, collections...?", but when there is no other property to rely on, the only way to make the index meaningful is to use data['_id'] as the key. There are gotchas, of course, but let's think as a user who has never seen CodernityDB before.

Basically, an index for documents of type 'x' will just filter out documents of other types and use data['_id'] as the key. Seems logical so far.

#!/usr/bin/env python

from CodernityDB.database import Database
from CodernityDB.hash_index import HashIndex


class XIndex(HashIndex):
    def __init__(self, *args, **kwargs):
        kwargs['key_format'] = '32s'
        super(XIndex, self).__init__(*args, **kwargs)

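    def make_key_value(self, data):
        # Index only documents of type 'x'; returning None (implicitly)
        # leaves any other document out of this index.
        if data.get('_type', None) == 'x':
            return data['_id'], None

    def make_key(self, key):
        # Lookups pass the '_id' itself, so the key is used unchanged.
        return key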


def main():
    db = Database('/tmp/test')
    db.create()
    db.add_index(XIndex(db.path, 'x'))

    first = db.insert({'_type': 'x', 'data': 'yes, please'})
    second = db.insert({'_type': 'x', 'value': 'totally not data'})
    extra = db.insert({'_type': 'y'})

    # reindexing required
    db.reindex()

    print 'items in x before update:', db.count(db.all, 'x')

    # using our custom index
    doc = db.get('x', second['_id'], with_doc=True)['doc']
    doc['updated'] = True
    db.update(doc)

    print 'items in x after update:', db.count(db.all, 'x')


if __name__ == '__main__':
    main()

In the code, db.reindex() is required before using our index (otherwise it's empty). This is unexpected, but may be understandable, because the documents don't have a unique id yet when they are inserted. On the other hand, after updating a document of type 'x', it's gone from our index without any error. This is even more unexpected because (a) it was there just before the update and (b) it isn't a new document, it already has a unique '_id' attribute. Why can't the user rely on the very mechanism that maintains the uniqueness and addressability of documents?

There is probably a reason for this, but it's an issue nevertheless. Or I could have done this completely wrong. And I'm sorry if this is explained somewhere in the documentation; I may have overlooked it.

Comments (4)

  1. codernity repo owner

    Hey,

    Thank you for reporting.

    But I can't find a real use case for that usage. If you know the _id value, you can query the id index. Querying a "secondary index" by _id is generally a bad idea.

    All records in "secondary" indexes have the _id value stored (that's how with_doc works). If you want, for example, to count all records with a given _type, then just create an index that returns that _type. Indexing a secondary index by _id value is probably not the best decision. Can you please explain your use case for that?

    "after updating a document of type 'x' it's gone from our index without any error."

    All errors from the make_key and make_key_value functions of secondary indexes are ignored, because otherwise a broken index function would block the database completely. The update causes exactly the same error as you get on insert: the data has no _id value at that stage.

    -- Jedrzej

  2. Anton Shestakov reporter

    I thought about it, but yeah, there's probably no use case for that.

    But I have an idea, though: instead of ignoring all errors from make_key and make_key_value, how about turning them into warnings? It would definitely help users debug their indexes. The warnings module can be used to turn warnings into errors (to hopefully crash unit tests). What do you think?

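
The warning idea from the thread could look roughly like the sketch below. It is a toy model and not CodernityDB's actual code: IndexFunctionWarning, call_index_function and x_key are hypothetical names made up for illustration, and x_key only mimics XIndex.make_key_value from the report. The sketch assumes, as described in the reply above, that the index function receives data without '_id' and that the resulting exception must not block the database.

```python
import warnings


class IndexFunctionWarning(UserWarning):
    """Hypothetical category for failures inside a user-written index function."""


def x_key(data):
    # Same idea as XIndex.make_key_value: raises KeyError when '_id' is missing.
    if data.get('_type') == 'x':
        return data['_id'], None


def call_index_function(func, data):
    """Toy stand-in for the database calling a secondary index function.

    Instead of silently ignoring the exception, report it as a warning and
    carry on, so a broken index function still cannot block the database.
    """
    try:
        return func(data)
    except Exception as exc:
        warnings.warn('index function %s failed: %r' % (func.__name__, exc),
                      IndexFunctionWarning, stacklevel=2)
        return None


# Default behaviour: the call survives, but the failure is now visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = call_index_function(x_key, {'_type': 'x'})  # no '_id' here

# In a test suite, the same warning can be promoted to an exception.
with warnings.catch_warnings():
    warnings.simplefilter('error', IndexFunctionWarning)
    try:
        call_index_function(x_key, {'_type': 'x'})
        raised = False
    except IndexFunctionWarning:
        raised = True
```

With the default filter the document is simply left out of the index, as it is today, but the user gets a message pointing at the failing function. Using a dedicated warning category lets a test suite turn only these warnings into errors without affecting unrelated ones.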