Issue #6211 wontfix

Use short node IDs in your links (BB-7489)

Martin Geisler
created an issue

I believe this changed when support for Git was added: the generated URLs that reference changesets use the full 40 character hexadecimal node ID.

I understand that it's nice to provide the full IDs: that way you don't need to implement logic to handle collisions any better than the underlying tools, i.e., you don't need to handle them on the assumption that SHA-1 is a secure hash function.


  • The long links are cumbersome: instead of a handy short node ID (12 characters for Mercurial) that will fit into most search boxes without clipping, I'm giving a 40 character string. That string is way to big for me to handle as anything but an opaque blob of bytes.

    With a short node ID I can glance at it in an email and quickly see if it looks like it's the same as the one shown in my browsers address bar. I can copy a small node ID by hand in full — with a long ID I need to decide when to give up and when my prefix is long enough. I can paste a short ID into most tools without it overflowing the input field. Overall short IDs are more convenient and handy.

  • The long links are almost always longer than what I'm comfortable pasting into text. The fixed text ( take up 30 characters, so with 40 characters for the node ID we're already at 70 and then you need to add the username and repository.

    The result is a 80+ character link which flows very badly in a text file or an email. I end up with lines longer than 80 characters which means that my diffs display badly in a standard width terminal.

  • The long links are mostly unnecessary. There are no collisions in the 275k changesets in the OpenOffice repository among the first 10 hexadecimal characters — there is a single collision among the first 9 characters.

    This means that short links will work well in practice for almost all repositories.

  • You should actually handle collisions anyway: today I can shorten the Bitbucket URLs and they still work:

    However, I get a "404 Not Found" if the prefix becomes ambiguous:

    That ought to be handled better: present the user with a list of matching changesets

Comments (11)

  1. Dylan Etkin

    Hi Martin,

    We actually used to hand out shortened links and we still do support shorter hashes. However we wanted to be unambiguous with the links we hand out. The complete hash is used in lots of places.

    I am afraid we are going to keep the full hash in the links.



  2. Martin Geisler reporter

    Hi Dylan,

    I know you used to have nice and short links and that they're still supported: I always trim my Bitbucket links to a reasonable size before I paste them somewhere.

    Since long links are ugly in plain text (developers deal with plain text links all day long in bug reports, emails, forum messages, IRC, Facebook posts, ...), and since collisions are extremely rare (I found no collisions with 10 character IDs among 275K changesets), then I think it makes a lot of sense to generate pretty links by default.

    The "gain" or "advantage" of long links is that they are unambigious. But it is a tiny advantage for you and a disadvantage for the user.

    I call it a tiny advantage for Bitbucket since you've been generating tons of short links in the first many years of Bitbucket's life and so you still need to handle these links in a good way in the future. This means that you ought to report collisions in those prefixes and tell the user about them instead of just presenting a 404 Not Found page.

  3. Angel Ezquerra

    I think using 40 character hashes for links just "to be unambiguous" is unnecessary.

    Imagine that you had the whole population of Earth (lets round up to 10 billion people) signed for BitBucket. Let's imagine that each and every one of them did 1000 commits per day to the same repository. Let's imagine that we kept this up for 1 million years. How much of the 20 digit hash space would be exhausted?

    10e9 people * 1000 commits/day * 366 days * 1e6 years / 16^20 = 0.003

    That is, even under those clearly ridiculous assumptions you would have only covered 3% of the 20 digit hash space. Increase that to 25 and it becomes 0.3e6%!

    So you do not really need the whole 40 digit hash to be unambiguous. 25 digits would be really enough. Even 15 would work under slightly less crazy assumptions (100 commits per day from 10.000 contributors per repo for 1000 years).

  4. Steve Losh

    Angel, you're calculating the odds of a specific hash being used twice. What's actually relevant here is the probabililty of a collision of any two hashes. This is called the birthday paradox/problem/attack:

    Using your numbers, we should expect a collision to happen (90% probability) in about six hours:

    Obviously your numbers are super high so we shouldn't actually see collisions that often. But still, it's a lot more common than colliding with a specific hash.

  5. Angel Ezquerra

    Steve Losh its funny that as I was writing my previous message I was calculating the odds that someone would mention the Birthday Paradox :-)

    I was too lazy to actually calculate the collision odds, that's why I was very careful to only speak about the percentage of the hash space that would be consumed in my ridiculous hypothetical scenario. Obviously once half the hash space is consumed the odds of a single new generated hash colliding with an existing one would be 50%!

    However, and as you well note, the numbers are so ridiculously small even for a crazy scenario like this one that for the BitBucket use case my case stands that using a 20 digit hash would really be enough.

  6. Angel Ezquerra

    Steve Losh 's message made me think about about aggregate probabilities. This post is intended as a set of fun thought experiments, so please bear with me :-)

    If I am not mistaken the aggregate collision probability after generating N hashes on a single repo would be:

    P(collision single repo) = 0 + 1/(16^hash_size) + 2/(16^hash_size)  + ... + (N-1)/(16^hash_size)

    i.e. the chance the the first one would collide (i.e. 0) plus the chance of the second commit colliding (1 / number_of_hashes) plus the chance of the third one colliding (2 * 1 / number_of_hashes), etc.

    That is:

    p = (1/16^hash_size) * sum(0,N, n-1) = (1/16^hash_size) * (n - 1) * n / 2

    In python:

    p = lambda n, hash_size: (n-1) * n / 2 / (16**hash_size)

    Now let's use that to see how different hash sizes would fare:

    • With 40 digit hashes you could have 100.000 committers making a single commit per second for 15 billion years (a little above the estimated age of the Universe) to a single repo and you'd have a 0.077% chance of ever having had a single collision in that repo.

    • With 30 digit hashes you could have 10.000 committers committing once per second during 100.000 years and you'd have a 0.038% chance of having a single collision in that repo.

    • With 25 digit hashes you'd have to be a bit more reasonable. You could get 10.000 people making one commit per minute (to the same repo!) for 10.000 years to get a 0.1% chance of a single collision in that repo.

    Now, these numbers ignore the actual limitations of mercurial and git. None of them can or are likely to ever support repos with the number of revisions that the above scenarios would produce. So let's limit to something a bit more reasonable, such as 10 million revisions (for the record, the biggest mercurial repo I know has less than 1 million revisions). In that case the collision probability for that single repo for different hashes would be:

    Hash_size -> Collision probability (%)
    20        ->         4.0e-9 %
    15        ->         0.0043 %
    12        ->        17.8 %

    So folks, if you have such a huge repo, do not use 12 digit hashes! Even with a 1 million revision repo the probability would be 0.18%.

    Now, BitBucket does not have a single repo. Since the collision probability in different repos is independent, the probability of having M repos without a collision is P(no collision on single repo) ^ M = (1 - P(collision on single repo)) ^ M.

    p_collision_total = 1.0 - (1.0 - p(average_repo_size) / 16**hash_size)

    So the key pieces of info are:

    • how big is the average repo in bitbucket (in terms of commits)
    • how many repos does BitBucket have (or expect to ever have).

    For a hash size of 25, any unreasonable assumption I could come up with for these two key values at the BitBucket scale gave a 0.0% chance of collision according to Python! The reason is that the numbers are so small that they go beyond the precision of python floats! Even assuming a trillion (10e12!) repos of 10 million revisions each gave 0.0% collision probability.

    For a hash size of 20, making a more reasonable set of assumptions, such as 100 million repos of 100.000 revisions on average the probability would give a 4.1 in a million (4.1e-5) chance of collision.

    On the other hand, against my original intuition, 15 digit hashes would definitely not be totally safe at the BitBucket scale. Under the conditions above (100 million repos of 100.000 revisions on average) you'd get a 35.2 % chance of collision.

    To sum up, the BitBucket guys were right in that using short (and by short I mean 15 or less digit) hashes is not safe at the BitBucket scale. However, hashes of 25 or more digits are 100% safe (0.0 percent chance of collision). 20 digit hashes would probably be OK as well.

    Who said probability could not be fun? :-D

    EDIT: fix table markdown

  7. Martin Geisler reporter

    This is a typical example of what I see as a problem with the long URLs: a developer has pasted links to Bitbucket in our bug tracker and included the full node IDs.

    The links looks bad since they're wrapped around and to copy the ID (so that I can use it on the command line, in TortoiseHg, etc) I now need to select the last 40 characters. I find it easier to select just the final 12 characters.

  8. Martin Geisler reporter

    Angel Ezquerra I believe your computations are wrong: you don't add the probabilities, you multiply them. This is described well in the Wikipedia article on the birthday paradox.

    I came up with the following the compute the probability of a collision between 1,000,000 changesets when using 12 character (48 bit) hashes:

    >>> import operator
    >>> mul = lambda seq: reduce(operator.mul, seq, 1)
    >>> 1 - mul([1 - float(i)/2**48 for i in range(1, 1000000)])

    That is a 0.17% probability of finding one (or more) collisions. Computing the probability for 10 million commits gives 16%.

    In any case: the point is not whether or not Bitbucket can avoid dealing with collisions. They must deal with them if they want to support prefix URLs and since they've generated tons of those URLs in the past I don't think they can drop support for them.

  9. Angel Ezquerra

    Martin Geisler Maybe I am missing something but I think that you cannot multiply the probabilities because we are not looking for the probability that all hashes collide. If we did the probability would be 0 because the first hash cannot collide!

    The numbers I gave made two assumptions:

    1. The probability of collision of a hash is independent
    2. The collisions are mutually exclusive (i.e. no more than one hash can collide at the same time).

    I think #1 is fair given the way hashes are generated. I think assumption #2 makes my calculation a worst case scenario because given two non mutually exclusive events A and B:

    P(A or B) = P(A) + P(B) - P(A and B).

    P(A or B) is maximized when P(A and B) = 0 (i.e. they are mutually exclusive).

    In our case this means that we assume that a repo will not have more than a single collision (because as soon as there is a collision the repo is broken and we stop).

    With that said:

    • The first hash has 0 probability of collision
    • The second hash can only collide with 1 hash, i.e. it has 1/total_hashes = 16^hash_size
    • The third hash could collide with the first or the second hash, i.e. it has 2/total_hashes collision probability
    • The forth has 3/total_hashes and so on.

    The probability of having a single collision after having created N hashes is the probability of the 1st hash colliding OR the second colliding OR the third, and so on. Assuming the collisions are independent but mutually exclusive. Using the formulas above the total probability of any one of them happening is the sum of their probabilities as I calculated.

    We could do the actual calculation removing the mutual exclusivity requirement, and the probabilities would look even better.

  10. Log in to comment