Issue #14 invalid

id_stor not greater than 4G

Massimiliano Ciancio
created an issue

I'm testing CodernityDB on Debian 7 with ext4, using a massive insert of 50,000,000 items, each one with some data and a set of about 500 values.

When id_stor reaches the size of 4G the script stops with the following error:

File "/usr/local/lib/python2.7/dist-packages/CodernityDB/hash_index.py", line 690, in insert 0)) struct.error: 'I' format requires 0 <= number <= 4294967295


I changed hash_lim to 16^7, key_format to 'Q', and entry_line_format to '<32s{key}QQcQ'.

Where is the problem? Why does it insist on 'I' when I put 'Q' everywhere?

The script is the following (it's the one from the examples, modified):

    #!/usr/bin/env python

    from CodernityDB.database import Database
    from CodernityDB.hash_index import HashIndex
    from random import random
    from time import time


    class WithXIndex(HashIndex):

        def __init__(self, *args, **kwargs):
            kwargs['key_format'] = 'Q'
            kwargs['hash_lim'] = 0xfffffff
            kwargs['entry_line_format'] = '<32s{key}QQcQ'
            super(WithXIndex, self).__init__(*args, **kwargs)

        def make_key_value(self, data):
            a_val = data.get("x")
            if a_val is not None:
                return a_val, None
            return None

        def make_key(self, key):
            return key


    def create():
        db = Database('/tmp/tut2')
        db.create()
        x_ind = WithXIndex(db.path, 'x')
        db.add_index(x_ind)

        for x in xrange(50000000):
            values = set()
            for i in xrange(int(1000 * random())):
                z = int(random() * 1000000)
                values.add(z)
            db.insert(dict(x=x, y=2 * x + 1, values=values))


    if __name__ == '__main__':
        create()

Comments (7)

  1. Natalia Zon

    Just came here to report the same bug and then I saw this. I think changing your custom index won't solve this problem; it seems that something's going on with the hashing algorithm which is being calculated for each entry. For me it crashes at the exact same number, 4294967295, which also seems to be the maximum size of the id_stor file, in bytes. Could anyone explain this?

  2. codernity repo owner

    The exception that you have is from the id index. It's the main index in CodernityDB, always created.

    So you have 2 choices:

    - http://labs.codernity.com/codernitydb/database_indexes.html#sharding-in-indexes
    - or create an id index different from the default one.

    Below you will find examples of how to do it:

    from CodernityDB.hash_index import UniqueHashIndex
    from CodernityDB.database import Database
    from CodernityDB.sharded_hash import ShardedUniqueHashIndex
    
    
    class MyBigIDIndex(UniqueHashIndex):
    
        def __init__(self, *args, **kwargs):
            kwargs['key_format'] = '<32s8sQIcQ'
            super(MyBigIDIndex, self).__init__(*args, **kwargs)
    
    
    class CustomIdSharded(ShardedUniqueHashIndex):
    
        custom_header = 'from CodernityDB.sharded_hash import ShardedUniqueHashIndex'
    
        def __init__(self, *args, **kwargs):
            kwargs['sh_nums'] = 10
            super(CustomIdSharded, self).__init__(*args, **kwargs)
    
    
    def create_1():
        db = Database('/tmp/db1')
        db.create(with_id_index=False)
        db.add_index(MyBigIDIndex(db.path, 'id'))
        return db
    
    
    def create_2():
        db = Database('/tmp/db2')
        db.create(with_id_index=False)
        db.add_index(UniqueHashIndex(db.path, 'id', key_format="<32s8sQQcQ"))
        return db
    
    
    def create_3():
        db = Database('/tmp/db3')
        db.create(with_id_index=False)
        db.add_index(CustomIdSharded(db.path, 'id'))
        return db
    
    if __name__ == '__main__':
        db = create_3()
        for x in xrange(10 ** 8):
            db.insert(dict(x=x))
    

    What is better, Sharded or Q? It depends. On index creation you can define entry_line_format / key_format (they differ for the id and non-id indexes, but the difference is not important at this point). entry_line_format describes an index record in the following way:

    1. mark for little-endian encoding: <
    2. document id format: 32s
    3. index key format: {key}, which will be replaced with c or, if defined, with the value from the key_format parameter
    4. start of a record in storage, format: I
    5. size of a record in storage, format: I
    6. status, format: c (you probably do not want to change it)
    7. next record (in case of conflicts), format: I

    So as you see, you can specify storage-related things and record-related things. What does it mean? Well, imagine the following situations: if you store a lot of small objects, 4G of storage may be enough, so you don't need to change field 4, but you might need to change the 7th field; in other words, the index won't fit in 4G but the storage will. If you store a lot of big objects, then you probably need to change fields 4 and 7 (both index and storage are 4G+).

    You don't need to change the 5th field unless you want to store single objects bigger than 4G.
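    Purely as an illustration (the FMT_* names below are just labels for this sketch, not CodernityDB API), those scenarios map to entry_line_format strings roughly like this:

    # field 4 = start of record in storage, field 5 = size of record, field 7 = next record
    FMT_DEFAULT = '<32s{key}IIcI'       # index and storage both fit in 4G
    FMT_BIG_INDEX = '<32s{key}IIcQ'     # index grows past 4G, storage stays under: change field 7
    FMT_BIG_BOTH = '<32s{key}QIcQ'      # index and storage both grow past 4G: change fields 4 and 7
    FMT_HUGE_OBJECTS = '<32s{key}QQcQ'  # additionally, single objects bigger than 4G: change field 5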

    So back to Sharding vs Q. Sharding allows you to use I values where Q would be needed. In fact it creates several indexes and tries to distribute keys across them.
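    One reason to prefer keeping I is per-entry size; a quick sketch with struct.calcsize ('<32s8sIIcI' is the 32-bit id-index layout implied by the examples above, '<32s8sQQcQ' is the all-Q variant from create_2):

    import struct

    struct.calcsize('<32s8sIIcI')   # 53 bytes per id index entry with 32-bit offsets
    struct.calcsize('<32s8sQQcQ')   # 65 bytes per id index entry with 64-bit offsets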

    Please ask there if you have more questions about this use case.

    I'm also changing the type of this report to invalid because it's not a bug.

  3. codernity repo owner

    Natalia Zon

    As I wrote above, all you need is to change the id index. 4294967295 is just the max value for the I ctype. In my previous post you will find the answer to this "problem".

    > it seems that something's going on with the hashing algorithm which is being calculated for each entry

    That might be a possible optimization with a big number of records, but CodernityDB handles key duplicates without problems (http://en.wikipedia.org/wiki/Hash_table#Separate_chaining). The default size of the hash function is 0xfffff (~1M). Be careful with adjusting that value: making it bigger will allocate more space on disk (if the FS doesn't support files with holes).

  4. codernity repo owner

    As I wrote before, it's not a bug; it's the default behaviour, which can be changed by the user. Any suggestions are welcome, and obviously further discussion too.

  5. Natalia Zon

    Thanks. I should probably use sharding though, since my data is 'naturally' distributed into coherent sets. Is it possible to define which key values should belong to which shard, or is it distributed automatically? Anyway, I'll look more into the docs today. My first idea was just to create a separate database for every 'shard' of my data. Since my data is distributed in a tree-like structure with 92 categories at the first layer, that would give me 92 DBs. But perhaps sharding is a more elegant solution. I'm totally new to CodernityDB, but I already found it to be very useful, thanks for creating it :)

  6. codernity repo owner

    Natalia Zon: In your situation, if you can easily specify different "categories" for the objects in your database, you can freely choose between "sharding" and separate databases.

    They would be pretty much the same thing in your situation. Use whatever suits you better.

    > Is it possible to define which key values should belong to which shard, or is it distributed automatically?

    Yes, it's possible. It's nothing very fancy, just Python ;-) The whole "logic" is probably here:

        def create_key(self):
            h = uuid.UUID(int=getrandbits(128), version=4).hex
            trg = self.last_used + 1
            if trg >= self.sh_nums:
                trg = 0
            self.last_used = trg
            h = '%02x%30s' % (trg, h[2:])
            return h
    

    It does round-robin over all defined shards, and adds a prefix to the key so that the correct shard can be found later. Change this logic to select the "category" based on the key, and it's done. Key create / shard select: https://bitbucket.org/codernity/codernitydb/src/tip/CodernityDB/sharded_hash.py?at=default#cl-44 then on get: https://bitbucket.org/codernity/codernitydb/src/tip/CodernityDB/sharded_hash.py?at=default#cl-90

    So shard is selected from key.
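    As a minimal sketch of that mapping (shard_for_key is a hypothetical helper, not part of CodernityDB; it just reverses the '%02x' prefix that create_key writes):

        def shard_for_key(key):
            # the first two hex characters of the key carry the shard number
            return int(key[:2], 16)

        shard_for_key('03' + 30 * 'a')  # -> 3: this key belongs to shard number 3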

    > I'm totally new to CodernityDB, but I already found it to be very useful, thanks for creating it :)

    Thanks for good words, feel free to ask questions and/or suggestions.
