Pull requests

#90 Merged
Source repository: schlamar (branch: default)
Destination repository: hpk42 (branch: default)

Limit PYTHONHASHSEED to 1024 on Windows to prevent possible MemoryErrors.

Author
  Marc Schlaich
Reviewers
Description
No description

Comments (19)

  1. Marc Schlaich author

    Also, as I mentioned in an earlier comment, to be sure that the full range of hash seeds is being tested under Windows, I would consider choosing random with some probability. For example, 50% of the time it could choose random, and 50% of the time it could choose something between 1 and 1024. This would provide the benefits of both full test coverage, as well as reproducibility (for at least some of the integers).

    I'm strictly against such unpredictable behavior. I would say either random or limit the maximum. holger krekel, what do you think?

    One more comment. Sorry for not checking myself, but is the hash seed generated only once per Tox run for all Python versions running the tests? I seem to remember the original patch generating one number globally so that the seed would be the same across all versions. If that's the case, then doing a version check here might not make sense because the version constraint might be satisfied for some versions under test but not others.

    You are right. If you start tox with Python 3.4 and run a test on py27, it will pass a seed in range(1, 4294967295). Thanks for pointing that out. I'm removing the second commit.
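
A rough sketch of the capped-seed logic under discussion (hypothetical names and structure; this is not tox's actual code):

```python
import random
import sys

MAX_SEED = 4294967295    # largest value CPython accepts for PYTHONHASHSEED
WINDOWS_MAX_SEED = 1024  # cap proposed in this PR to avoid MemoryErrors

def make_hashseed(platform=sys.platform):
    """Pick a PYTHONHASHSEED value, capped on Windows (hypothetical sketch)."""
    upper = WINDOWS_MAX_SEED if platform == "win32" else MAX_SEED
    return random.randint(1, upper)
```

The point of the per-environment discussion below is where such a function gets called: once globally, or once per test environment.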

  2. Chris Jerdonek

    What about setting a new hash seed on a per-python/env basis?

    I think that would be okay.

    For example, 50% of the time it could choose random, and 50% of the time it could choose something between 1 and 1024.

    I'm strictly against such unpredictable behavior.

    But really it's no less predictable than choosing randomly from [1, 1024]. It's just choosing randomly from a different set, namely [1, 1024] + ['random'], and with a modified distribution. The problem with choosing from [1, 1024] is that if there is a failure with 1025, say, the tester would never catch it using Tox. And the problem with always using random is that seed-dependent failures would never be reproducible from test output, which would be an unnecessary inconvenience.

  3. Marc Schlaich author

    The seed is used as salt for a hash function, so choosing only from [1, 1024] doesn't significantly restrict our coverage. It is effectively the same as picking 1024 random values from [1, 4294967295]. So you would need to run hundreds of tests to see even a theoretical difference (and thousands in practice, I guess).

    To be safe we can increase the limit to something like 1024 * 1024 = 1,048,576 (2^20). So there is no statistical difference as long as you don't run millions of tests.
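
The statistical claim here can be made concrete with a tiny calculation (the function name and the uniformity assumption are mine, not from the PR):

```python
def miss_probability(p, runs):
    """Chance that `runs` independent uniform seed draws all miss a bug
    triggered by a fraction `p` of seeds. Assuming the seed subset is
    drawn uniformly, the per-draw trigger rate is the same whether each
    seed comes from [1, 1024] or from [1, 4294967295], so this formula
    applies to both."""
    return (1 - p) ** runs

# e.g. a bug hit by 1% of seeds still survives 100 runs roughly 37% of the time
```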

  4. Chris Jerdonek

    It is quite a significant restriction. Even with the larger limit, you would still be limiting yourself to only 0.02% of all possible hash seeds that could occur in production. I understand that in practice, it would be highly unlikely for a hash seed dependent bug not to manifest in such a subset, but I think a test runner shouldn't impose such a restriction when it isn't necessary. Allowing both explicit values and random to occur with some probability provides the best of both worlds. What is the downside?

  5. holger krekel repo owner

    If you have a test suite that is highly dependent on hashseed salting (maybe for crypto things) you should control it yourself anyway and/or make sure you can run against py34. The downside of 'random' is that we cannot show the value through a regular tox run, making it harder to repeat a failing test run for a particular hash seed. If you hit a "one in a million" situation you want to be able to repeat it. Also note that with Python 3.4 and up we are not going to impose a restriction on hashseeds.

  6. Marc Schlaich author

    It is quite a significant restriction.

    You are ignoring the fact that hash functions create uniformly distributed values. If you run 10 tests, it makes no theoretical (!) difference whether you run them against [0, 10] or pick 10 random values from [1, 4294967295].

    hash seed dependent bug

    Is that actually a thing? Can you reference a practical case where this is relevant?

    IMO the only practical impact of setting the PYTHONHASHSEED is that dict creation time and dict ordering is predictable and repeatable:

    $ PYTHONHASHSEED=2 python -c "print (dict((str(i), i) for i in range(10)))"
    {'8': 8, '9': 9, '2': 2, '3': 3, '0': 0, '1': 1, '6': 6, '7': 7, '4': 4, '5': 5}
    
    $ PYTHONHASHSEED=3 python -c "print (dict((str(i), i) for i in range(10)))"
    {'6': 6, '7': 7, '4': 4, '5': 5, '2': 2, '3': 3, '0': 0, '1': 1, '8': 8, '9': 9}
    
    $ PYTHONHASHSEED=2 python -c "print (dict((str(i), i) for i in range(10)))"
    {'8': 8, '9': 9, '2': 2, '3': 3, '0': 0, '1': 1, '6': 6, '7': 7, '4': 4, '5': 5}
    

    If you have a test suite that is highly dependent on hashseed salting (maybe for crypto things)

    If anyone is using Python's hash for crypto stuff, they should burn in hell. =) We are not talking about cryptographic hash functions here. Cryptographic hash functions need to be predictable; they are not salted. This only affects Python's internal hash function used for dictionary keys and similar things.

    Also note that with Python3.4 and up we are not going to impose a restriction on hashseeds.

    Actually not, as noted above:

    You are right. If you start tox with Python 3.4 and run a test on py27, it will pass a seed in [1, 4294967295]. Thanks for pointing that out. I'm removing the second commit.

  7. Marc Schlaich author

    Is that actually a thing? Can you reference a practical case where this is relevant?

    The only thing I can think of is if someone relies on dict ordering. And this is already covered by using two different seeds.

  8. holger krekel repo owner

    OK, I think Marc Schlaich came up with convincing arguments. What is left is moving the HASHSEED logic into a per-environment thing, basically making the decision on hashseed values in _makeenvconfig(). There is no need to have it globally anyway (available independently of testenv).

  9. holger krekel repo owner

    Marc, I merged this because I thought the patch was good enough and I didn't see further activity on this PR. I am happy to merge another PR, or do you think it was bad to have merged this PR and that it should rather be backed out?

  10. Chris Jerdonek

    It's strange. The e-mail notifications of the follow-up to my last comment didn't show up in my inbox (until Holger's comment an hour ago), so I never got a chance to respond to Marc's questions. Since this is already merged, is there anything left to comment on?

  11. Chris Jerdonek

    The downside of 'random' is that we cannot show the value through a regular tox run, making it harder to repeat a failing test run for a particular hash seed.

    The reason I suggested choosing random some percentage of the time for Windows with Python < 3.4 is that otherwise, some test failures might never occur when running Tox. (The probability of choosing random can be set to something small, like 10% of the time.) Yes, the downside is that such a failure wouldn't be reproducible. But it is better for the user to have a chance of being alerted to a problem (albeit with incomplete information) than never to be alerted at all.

    Marc wanted a practical case. Say you have code that loops over the keys of a dictionary, and the code works fine if the keys aren't sorted. However, there is a bug in the code that occurs if the keys happen to be in sorted order (e.g. an if block doesn't execute in this case so a variable never gets set). If 7 items are involved, then the bug will only occur for one out of the 5,040 possible orderings. If the hash seed is limited to less than 1,024, then at most 1,024 of these orderings will be represented, which is likely to leave out the case where the bug occurs. This is just an example to illustrate, but I hope you get the idea from it.
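
A minimal Python sketch of this scenario (the function is hypothetical; insertion order of a modern dict stands in here for hash-seed-dependent iteration order):

```python
import itertools

def pick_marker(d):
    """Return the first key that breaks ascending order while iterating.
    Bug: if the keys happen to iterate in fully sorted order, no such key
    exists and the function silently returns None."""
    prev = None
    for key in d:
        if prev is not None and key < prev:
            return key
        prev = key
    return None  # latent bug path: reached only for the sorted ordering

# Exactly 1 of the 5040 orderings of 7 keys (the sorted one) hits the bug:
misses = sum(
    1
    for perm in itertools.permutations("abcdefg")
    if pick_marker(dict.fromkeys(perm)) is None
)
print(misses)  # -> 1
```

With a seed range capped at 1024, at most 1024 of those 5040 orderings can ever occur, so a run may never sample the one bad ordering.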

  12. Marc Schlaich author

    Marc wanted a practical case.

    No, this one is not convincing. If someone relies on the ordering of dicts, there is exactly one ordering where everything works as expected and every other one fails. So you need just a second run with a different seed to spot this bug.

    still think it should rather backed out?

    Only the second commit needs to be backed out, the one which limits it to Python < 3.4, because this issue can still appear on Python 3.4 if tox passes a high seed to a Python 2.7/3.3 tox environment.

  13. Chris Jerdonek

    Marc, you misunderstand me. The scenario I described is one where the code works correctly if the keys aren't sorted, so that the bug manifests only in the case where the keys are sorted. That is the opposite of how you interpreted it.

  14. Marc Schlaich author

    Just to make this even more clear: the only bug where dict ordering should be an issue is if someone falsely thinks dicts are ordered, but the code works as expected because they are using Python 2, where the default seed is constant. So just a single run with a non-default seed should surface this bug.

    Plus, if you really have some function which could behave differently based on how the dictionary is iterated, you have to manually check all possibilities and not rely on PYTHONHASHSEED someday hitting the specific configuration which fails in your code. So for your example, one unit test should just run all permutations of your 7-element dictionary against your code.
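
Marc's suggestion of exhaustively testing orderings could be sketched as a unit test like this (the `process` function is a hypothetical stand-in; dict insertion order again stands in for hash-dependent order):

```python
import itertools
import unittest

def process(d):
    """Stand-in for code under test that iterates over a dict."""
    return sorted(d.values())

class AllOrderingsTest(unittest.TestCase):
    def test_every_permutation(self):
        items = [(str(i), i) for i in range(7)]
        expected = list(range(7))
        # 7! = 5040 insertion orders, covering every possible iteration order
        for perm in itertools.permutations(items):
            self.assertEqual(process(dict(perm)), expected)

# run with: python -m unittest <this module>
```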

  15. Chris Jerdonek

    The only bug where dict ordering should be an issue is if someone falsely thinks dicts are ordered, but the code works as expected because they are using Python 2, where the default seed is constant. So just a single run with a non-default seed should surface this bug.

    Again, this is false, and I can provide a counterexample if necessary.

    Plus, if you really have some function which could behave differently based on how the dictionary is iterated, you have to manually check all possibilities and not rely on PYTHONHASHSEED someday hitting the specific configuration which fails in your code. So for your example, one unit test should just run all permutations of your 7-element dictionary against your code.

    Tox is a tool to help the user, and it should help users find bugs whether or not they adhere to a certain standard of writing tests. Some users might not even be aware they have code that depends on iteration order and PYTHONHASHSEED, etc. In my mind, that was one of the reasons for varying PYTHONHASHSEED in the first place.