Make the output of METADATA files reproducible.

#74 Declined
Repository: lambyuk, Branch: strict-resource-testing
Repository: pypa, Branch: default
Author: lambyuk
Reviewers: (none listed)
Description

Whilst working on the Reproducible Builds effort [0], we noticed that wheel generates nondeterministic output.

For example, in python-pip's urllib3-1.15.1.dist-info/METADATA, a Requires-Dist header has two version constraints and they appear in a nondeterministic order:

-Requires-Dist: PySocks (>=1.5.6,<2.0); extra == 'socks'
+Requires-Dist: PySocks (<2.0,>=1.5.6); extra == 'socks'

[0] https://reproducible-builds.org/
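
As a rough illustration of the kind of change involved, the sketch below sorts the version constraints before joining them, so the header no longer depends on the iteration order of the container that holds them. The function name `format_requires_dist` and its parameters are hypothetical and chosen only for this example; this is not wheel's actual code.

```python
def format_requires_dist(name, specifiers, marker=None):
    """Render one Requires-Dist header with its constraints in sorted order.

    Hypothetical helper for illustration only; not part of wheel's API.
    """
    # sorted() yields the same order no matter how the constraints were
    # stored (for example in a set, whose iteration order may vary per run).
    constraints = ",".join(sorted(specifiers))
    header = "Requires-Dist: " + name
    if constraints:
        header += " (" + constraints + ")"
    if marker:
        header += "; " + marker
    return header


line = format_requires_dist("PySocks", {">=1.5.6", "<2.0"}, "extra == 'socks'")
print(line)
```

Plain string sorting is used here only because it is simple and stable; any fixed ordering would serve the same purpose.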

Comments (5)

  1. Frederik Rietdijk

    Have you tried setting PYTHONHASHSEED=0? In Nixpkgs I've recently introduced some patches to build the interpreter and packages reproducibly. I've built this package 5 times and every build produced the same output.

    Edit: actually, I built 1.20 of urllib3.

  2. lambyuk author

    Thanks for your reply. I think I have two distinct points in return:

    First, getting the same result from 5 builds doesn't mean the package is reproducible: the language does not specify the output order to be deterministic, so relying on it would be undefined behaviour. Also, consider that non-CPython runtimes may (and do) have different dict implementations. :)

    Second, a practical point: we (Debian) can't set PYTHONHASHSEED=0 everywhere and for every package build. That would be far too invasive and probably even impossible due to the way our packaging works. :(
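
To make the first point concrete, here is a small sketch that joins the same two constraints under several PYTHONHASHSEED values, once by iterating a set directly and once after sorting. The helper name `run_with_seed` and the two snippet constants are made up for this example, and it uses a set rather than a dict. The unsorted order may or may not differ between seeds on a given interpreter, which is exactly why a handful of identical builds proves little; the sorted order does not depend on the seed.

```python
import os
import subprocess
import sys

# Code run in a child interpreter: join the constraints straight from a set,
# or after sorting them first.
RAW = "print(','.join({'>=1.5.6', '<2.0'}))"
SORTED = "print(','.join(sorted({'>=1.5.6', '<2.0'})))"


def run_with_seed(code, seed):
    """Run `code` in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# Print seed, unsorted join, sorted join. Only the last column is guaranteed
# to be identical on every row.
for seed in range(4):
    print(seed, run_with_seed(RAW, seed), run_with_seed(SORTED, seed))
```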

  3. Frederik Rietdijk

    A patch to fix the timestamp in the bytecode along with PYTHONHASHSEED gives us deterministic builds of the interpreter and packages.

    We build each wheel and package for each Python interpreter version. I haven't checked the non-CPython interpreters yet.

    Anyway, I can understand that you might not be able to set PYTHONHASHSEED, but I do not see how you're then going to guarantee reproducible builds without 'fixing' many packages. E.g., in CPython 3.x sets are used a lot. Do you intend in such situations to patch the code to use a list instead?

  4. lambyuk author

    > Do you intend in such situations to patch the code to use a list instead?

    Only when they result in non-deterministic output in build files. We have been doing this for Python 2.x already with great success (and actually not that many patches…)