_rev not guaranteed to be unique

Issue #1 resolved
Anonymous created an issue

On update, _rev is reassigned with a random 4 byte number. This is to prevent older versions of the data from being updated. However, there is a slight chance that _rev will be assigned the same 4 bytes as its previous _rev. If you try to update an older version of the data, it will succeed.

The chance of this happening is 1 / (65536 * 65536) = 1 / 4294967296. However, if you have multiple version of the data open, the collisions will increase cumulatively. If you have 10 older versions open, the risk of _rev collision will be 10 times as likely.

Expected behaviour

_rev should be guaranteed unique, older versions of data should not be able to be updated.

Comments (4)

  1. Codernity repo owner


    This is to prevent older versions of the data from being updated.

    It's to check IF the data can be updated (check is performed only for current and new value). There is no way to update "older version of data" in database (by design). But if you by data mean to "overwrite" latest data with older, just skip then please my comments about old data access in DB. Having correct value of _rev field will allow you to "overwrite" db data with the current one that you send.

    So the possibility to allow to update data with _rev number because of collision is 1 / 4294967296, and it doesn't matter how many old versions exists (you can always update only latest data, there is no access to older at all). Obviously it's possible (bday paradox makes it even more frequent). Adding such unique _rev generation mechanism into default implementation might be not that good idea, because that DB was designed to "small" projects etc.

    Overwrite latest data with previous one can occur only when both have the same _rev, but JUST the current + incomming one, not incomming one vs all in DB. So the error chance is always the same (for single operation and considering only 2 existing data copies). Even in multithread environmens you still have to hit collision between "current" and incomming one (write lock on index is performed during update).

    Also please note that tomorrow morning (15.11 <12 CEST) we will publish update. It has fixes for multithread environments.

  2. Martin McNickle


    Thanks for the reply.

    I realised I made a mistake with the maths, this issue happens much more often that I indicated before. In my test case I had written:

        b['data'] = 'z'
        print "Should not reach here, but will succeed roughly 1 / (65536 * 65536) times"
    except RevConflict:
        print "Should always be a conflict"

    when in fact, the probability of the first print statement being incorrect printed is actually just 1 / 65536.

    I think the situation in the test case could be a common one. Two "editors" are working on the same document. Editor A saves first, Editor B tries to save afterwards but has their update rejected, because he wasn't working on the most recent version. This is the correct behaviour. However, there is actually a 1 / 65536 chance of Editor B's changes overwriting Editor A's changes in this scenario, due to the fact that _rev is randomly generated.

    I can appreciate that CodernityDB is aimed at small systems, but for me, having my code do the wrong thing at random isn't something I could accept. The main problem is that there is no way to detect that the code has done the wrong thing. If I'm relying on the correct updating behaviour to enforce data integrity, then I could end up with inconsistent data without realising.

    I do believe there is a solution though that addresses your need for small, fixed-width meta-data, and my need as a developer to be confident that my code is behaving correctly. If you really need _rev to be fixed-width you could make it a simple incrementing number with a maximum value (say 65536). Every time a document is updated, the _rev is incremented. A consequence of this is that a document could only be updated 65536 times.

    This has several advantages for the developer.

    1. I am in a position to say whether or not the limit on the number of updates will affect me; I may be creating a system that will be very unlikely to hit that hard limit.

    2. The update behaviour is now deterministic and well defined. As the _rev is guaranteed to be unique, there is no chance, that Editor B's changes will be applied over the top of Editor's A changes.

    3. If I hit the hard limit, I'm in a position to be able to detect that and create a strategy to deal with it:


        b['data'] = 'z'
    except RevConflict:
        # do what I want to do in the case of a conflict
    except MaxNumRevsReached:
        # Deal with the fact that this document has reached its
        # maximum number of updates

    I hope you can see how this is better for the developer. With a well-known, deterministic limit, I can plan for it and still maintain data integrity. With the other random scheme, I can't detect when the code does something wrong, and I cannot be sure about the integrity of the data.


    In my initial report I should have been more precise with my language when I said:

    However, if you have multiple version of the data open, the collisions will increase cumulatively.

    It should have read:

    However, if you have multiple different version of the data open, the collisions will increase cumulatively.

    This was the birthday paradox situation you mentioned, and I agree, it's more of a theoretical situation. However, with the realisiation that the collisions happen much more frequently, the situation I outline above with Editor A and Editor B, is not theoretical. It could happen even on a modest size system.

  3. Codernity repo owner

    Hello again,

    As we discussed on IRC it's different "usage" that we originally addressed: - generate random _rev to avoid "faulty" update by 3rd party (it was required) vs - generate sequence _rev

    But after your post I see the point there. What we can do now is that we can easily (probably) mix those 2 options (again as we discussed on IRC)

    We will add counter mechanism to current _rev one. On overflow we will reset counter and start from 0 again, but having random value on rest of bytes should be enough to "fix" the problem that you mention, because the duplicate _revs would be then possible only with more than 64k live objects of single data (or 64k updates while the first original object is still in the memory).

    I hope it will "fix" the problem that you mention.

    Thanks again.

  4. Log in to comment