No support for Unicode MANIFEST files.

Ryan Bourgeois created an issue

Unicode characters in MANIFEST files break distribute when doing a bdist_egg:

https://gist.github.com/3200738

The package in question, Pyramid, has a SOURCES.txt that apparently has some non-ascii characters in it.

Comments (35)

  1. Lennart Regebro

    I guess it would be possible to support UTF-8 in MANIFEST/SOURCES.txt. Neither Distutils nor Setuptools define the encoding of these files, and Distribute assumes ASCII. Of course, having non-ascii filenames is asking for trouble. Pyramid has it because it has tests making sure that the trouble is handled properly.

    (Note: Unicode is not UTF-8.)

  2. Toshio Kuratomi

    There are a few things wrong with this. I've figured out an initial problem, but I've run into a second one that I suspect is in the surrogateescape error handler changes, though I'm not sure.

    With distribute-0.6.29:

    $ export LC_ALL=C
    $ export LANG=C
    $ python3 setup.py sdist
    ERROR: test_manifest_is_read_with_utf8_encoding (setuptools.tests.test_sdist.TestSdistTest)

    Traceback (most recent call last):
      File "/srv/git/python-setuptools/python3-python-setuptools-0.6.29-1.fc16/build/src/setuptools/tests/test_sdist.py", line 208, in test_manifest_is_read_with_utf8_encoding
        open(filename, 'w').close()
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 13: ordinal not in range(128)

    This seems like it just needs to convert to a byte string on python3:

    if sys.version_info >= (3,):
        open(filename.encode('utf-8'), 'w').close()
    else:
        open(filename, 'w').close()

    However, once that's fixed, the unittest fails. Instrumenting the code a bit shows this:

    cmd.filelist.files => ['setup.py', 'sdist_test/__init__.py', 'sdist_test/a.txt', 'sdist_test/b.txt', 'sdist_test/sm\udcc3\udcb6rbr\udcc3\udcb6d.py', 'sdist_test.egg-info/PKG-INFO', 'sdist_test.egg-info/SOURCES.txt', 'sdist_test.egg-info/dependency_links.txt', 'sdist_test.egg-info/top_level.txt', 'sdist_test.egg-info/SOURCES.txt']

    Note that our filename with utf-8 characters in it is a py3 str with surrogateescaped chars in it. If this test case is still about reading things in as utf-8, then it should be correctly decoded from utf-8.
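
    As an illustration of the note above, here is a minimal sketch of how the surrogateescaped form arises and how it can be turned back into a properly decoded str. It assumes the filename bytes on disk are UTF-8 and that the C locale decodes them as ASCII with the surrogateescape error handler; the variable names are made up for the example and are not taken from distribute.

```python
# Raw filename bytes as stored on disk: "smörbröd.py" encoded as UTF-8.
raw = "sm\u00f6rbr\u00f6d.py".encode("utf-8")

# Decoding as ASCII with surrogateescape maps every undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
escaped = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler restores the original bytes, which can
# then be decoded as UTF-8 to get the real text.
recovered = escaped.encode("ascii", "surrogateescape").decode("utf-8")

print(ascii(escaped))
print(recovered)
```

    The escaped value is the same string that shows up in cmd.filelist.files above, so the information needed to decode the name correctly is still there.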

    I've also copied the sdist_test directory to make sure that the filename is correctly written:

    $ LC_ALL=C ls -b
    __init__.py  a.txt  b.txt  c.rst  sm\303\266rbr\303\266d.py

    ls -b shows the octal values of the non-ascii bytes: 0303 0266 => '\xc3\xb6' => u'\xf6' => ö
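
    The chain from octal escapes to ö can be checked directly in Python 3. This is only a small sanity check of the arithmetic, not code from the test suite; the variable names are arbitrary.

```python
# Bytes literal written with the same octal escapes that `ls -b` printed.
octal_bytes = b"sm\303\266rbr\303\266d.py"

# The same bytes written with hex escapes.
hex_bytes = b"sm\xc3\xb6rbr\xc3\xb6d.py"

# Decoding the two-byte sequence C3 B6 as UTF-8 yields U+00F6 (ö).
decoded = octal_bytes.decode("utf-8")

print(octal_bytes == hex_bytes)
print(decoded, hex(ord(decoded[2])))
```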

  3. Toshio Kuratomi

    Small update to my last comment -- SOURCES.txt in that comment was found by copying the unittest's self.temp_dir.

    Instrumenting setuptools/command/sdist.py's read_manifest() shows that the value in the sdist command's self.manifest is incorrect. At that point, it's:

    b'sdist_test.egg-info/top_level.txtsdist_test/sm\xf6rbr\xf6d.py\n'

    Which could be the unicode code point or the latin-1 encoding of ö.
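
    To make the ambiguity concrete: the single byte 0xF6 is the latin-1 encoding of ö and also has the same numeric value as its code point, but it is not valid UTF-8. A short sketch with illustrative names:

```python
raw = b"sm\xf6rbr\xf6d.py"

# Interpreted as latin-1, every byte maps to the code point of the same value.
as_latin1 = raw.decode("latin-1")

# Interpreted as UTF-8, 0xF6 is not a valid start byte, so decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The UTF-8 encoding of the same name uses two bytes per ö instead.
as_utf8_bytes = as_latin1.encode("utf-8")

print(as_latin1, valid_utf8, as_utf8_bytes)
```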

  4. Toshio Kuratomi

    I've browsed through distutils.filelist.findall(), distutils.filelist.FileList, and the sdist and egg_info commands. What I think is happening is that when the sdist command is run, it reads in the files it's going to package up and uses surrogateescape to add the files to the filelist. When the filelist is written out to the SOURCES.txt file, distribute transforms surrogateescaped strs that are valid utf-8 into utf-8 prior to writing the file. Although this is what is written to the SOURCES.txt file, the in-memory filelist still contains the surrogateescaped version of the filename.

    So here's the question -- should the unittest be changed to re-read the SOURCES.txt from the newly created dist? Or should the code be changed to reflect what was written to the SOURCES.txt file once that is written out?

  5. Arfrever Frehtes Taifersar Arahesis

    There are additional problems occurring with Python 3.1.

    Errors with a UTF-8 locale:

    ======================================================================
    ERROR: test_sdist_with_latin1_encoded_filename (setuptools.tests.test_sdist.TestSdistTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/distribute-0.6.30/build/src/setuptools/tests/test_sdist.py", line 291, in test_sdist_with_latin1_encoded_filename
        cmd.run()
      File "/tmp/distribute-0.6.30/build/src/setuptools/command/sdist.py", line 161, in run
        self.make_distribution()
      File "/usr/lib64/python3.1/distutils/command/sdist.py", line 436, in make_distribution
        file = self.make_archive(base_name, fmt, base_dir=base_dir)
      File "/usr/lib64/python3.1/distutils/cmd.py", line 372, in make_archive
        dry_run=self.dry_run)
      File "/usr/lib64/python3.1/distutils/archive_util.py", line 180, in make_archive
        filename = func(base_name, base_dir, **kwargs)
      File "/usr/lib64/python3.1/distutils/archive_util.py", line 56, in make_tarball
        tar.add(base_dir)
      File "/usr/lib64/python3.1/tarfile.py", line 1965, in add
        self.add(os.path.join(name, f), os.path.join(arcname, f), recursive, exclude)
      File "/usr/lib64/python3.1/tarfile.py", line 1965, in add
        self.add(os.path.join(name, f), os.path.join(arcname, f), recursive, exclude)
      File "/usr/lib64/python3.1/tarfile.py", line 1958, in add
        self.addfile(tarinfo, f)
      File "/usr/lib64/python3.1/tarfile.py", line 1981, in addfile
        buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
      File "/usr/lib64/python3.1/tarfile.py", line 986, in tobuf
        return self.create_gnu_header(info, encoding, errors)
      File "/usr/lib64/python3.1/tarfile.py", line 1017, in create_gnu_header
        return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
      File "/usr/lib64/python3.1/tarfile.py", line 1095, in _create_header
        stn(info.get("name", ""), 100, encoding, errors),
      File "/usr/lib64/python3.1/tarfile.py", line 177, in stn
        s = s.encode(encoding, errors)
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 28: surrogates not allowed
    
    ----------------------------------------------------------------------
    

    Errors with C locale:

    ======================================================================
    ERROR: test_manifest_is_read_with_utf8_encoding (setuptools.tests.test_sdist.TestSdistTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/distribute-0.6.30/build/src/setuptools/tests/test_sdist.py", line 208, in test_manifest_is_read_with_utf8_encoding
        open(filename, 'w').close()
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 13: ordinal not in range(128)
    
    ======================================================================
    ERROR: test_sdist_with_latin1_encoded_filename (setuptools.tests.test_sdist.TestSdistTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/distribute-0.6.30/build/src/setuptools/tests/test_sdist.py", line 291, in test_sdist_with_latin1_encoded_filename
        cmd.run()
      File "/tmp/distribute-0.6.30/build/src/setuptools/command/sdist.py", line 161, in run
        self.make_distribution()
      File "/usr/lib64/python3.1/distutils/command/sdist.py", line 436, in make_distribution
        file = self.make_archive(base_name, fmt, base_dir=base_dir)
      File "/usr/lib64/python3.1/distutils/cmd.py", line 372, in make_archive
        dry_run=self.dry_run)
      File "/usr/lib64/python3.1/distutils/archive_util.py", line 180, in make_archive
        filename = func(base_name, base_dir, **kwargs)
      File "/usr/lib64/python3.1/distutils/archive_util.py", line 56, in make_tarball
        tar.add(base_dir)
      File "/usr/lib64/python3.1/tarfile.py", line 1965, in add
        self.add(os.path.join(name, f), os.path.join(arcname, f), recursive, exclude)
      File "/usr/lib64/python3.1/tarfile.py", line 1965, in add
        self.add(os.path.join(name, f), os.path.join(arcname, f), recursive, exclude)
      File "/usr/lib64/python3.1/tarfile.py", line 1958, in add
        self.addfile(tarinfo, f)
      File "/usr/lib64/python3.1/tarfile.py", line 1981, in addfile
        buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
      File "/usr/lib64/python3.1/tarfile.py", line 986, in tobuf
        return self.create_gnu_header(info, encoding, errors)
      File "/usr/lib64/python3.1/tarfile.py", line 1017, in create_gnu_header
        return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
      File "/usr/lib64/python3.1/tarfile.py", line 1095, in _create_header
        stn(info.get("name", ""), 100, encoding, errors),
      File "/usr/lib64/python3.1/tarfile.py", line 177, in stn
        s = s.encode(encoding, errors)
    UnicodeEncodeError: 'ascii' codec can't encode character '\udcf6' in position 28: ordinal not in range(128)
    
    ======================================================================
    ERROR: test_sdist_with_utf8_encoded_filename (setuptools.tests.test_sdist.TestSdistTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/distribute-0.6.30/build/src/setuptools/tests/test_sdist.py", line 266, in test_sdist_with_utf8_encoded_filename
        cmd.run()
      File "/tmp/distribute-0.6.30/build/src/setuptools/command/sdist.py", line 161, in run
        self.make_distribution()
      File "/usr/lib64/python3.1/distutils/command/sdist.py", line 436, in make_distribution
        file = self.make_archive(base_name, fmt, base_dir=base_dir)
      File "/usr/lib64/python3.1/distutils/cmd.py", line 372, in make_archive
        dry_run=self.dry_run)
      File "/usr/lib64/python3.1/distutils/archive_util.py", line 180, in make_archive
        filename = func(base_name, base_dir, **kwargs)
      File "/usr/lib64/python3.1/distutils/archive_util.py", line 56, in make_tarball
        tar.add(base_dir)
      File "/usr/lib64/python3.1/tarfile.py", line 1965, in add
        self.add(os.path.join(name, f), os.path.join(arcname, f), recursive, exclude)
      File "/usr/lib64/python3.1/tarfile.py", line 1965, in add
        self.add(os.path.join(name, f), os.path.join(arcname, f), recursive, exclude)
      File "/usr/lib64/python3.1/tarfile.py", line 1958, in add
        self.addfile(tarinfo, f)
      File "/usr/lib64/python3.1/tarfile.py", line 1981, in addfile
        buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
      File "/usr/lib64/python3.1/tarfile.py", line 986, in tobuf
        return self.create_gnu_header(info, encoding, errors)
      File "/usr/lib64/python3.1/tarfile.py", line 1017, in create_gnu_header
        return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
      File "/usr/lib64/python3.1/tarfile.py", line 1095, in _create_header
        stn(info.get("name", ""), 100, encoding, errors),
      File "/usr/lib64/python3.1/tarfile.py", line 177, in stn
        s = s.encode(encoding, errors)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 28-29: ordinal not in range(128)
    
    ----------------------------------------------------------------------
    
  6. Stefan H. Holek

    LANG=C will not work, as Python 3 bases its codec choices on the environment. It's a bit like saying "I forced it to use ASCII and then it blew up on the UTF-8 characters". ;-)

    Python 3.1 tarfile not liking surrogates is something else though. Let's see what I can do.

  7. Stefan H. Holek

    I was able to produce a similar error with the zipfile module under Python 3.3 (LANG=en_US.UTF-8):

    Traceback (most recent call last):
      File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 392, in _encodeFilenameFlags
        return self.filename.encode('ascii'), self.flag_bits
    UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 21: ordinal not in range(128)
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "setup.py", line 20, in <module>
        'setuptools',
      File "/usr/local/python3.3/lib/python3.3/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 917, in run_commands
        self.run_command(cmd)
      File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 936, in run_command
        cmd_obj.run()
      File "/home/stefan/sandbox/setuptools-git/lib/python3.3/site-packages/distribute-0.6.30-py3.3.egg/setuptools/command/sdist.py", line 161, in run
        self.make_distribution()
      File "/usr/local/python3.3/lib/python3.3/distutils/command/sdist.py", line 447, in make_distribution
        file = self.make_archive(base_name, fmt, base_dir=base_dir)
      File "/usr/local/python3.3/lib/python3.3/distutils/cmd.py", line 370, in make_archive
        dry_run=self.dry_run)
      File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 178, in make_archive
        filename = func(base_name, base_dir, **kwargs)
      File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 118, in make_zipfile
        zip.write(path, path)
      File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 1328, in write
        self.fp.write(zinfo.FileHeader())
      File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 382, in FileHeader
        filename, flag_bits = self._encodeFilenameFlags()
      File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 394, in _encodeFilenameFlags
        return self.filename.encode('utf-8'), self.flag_bits | 0x800
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 21: surrogates not allowed
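
    Both the tarfile and the zipfile tracebacks end in the same kind of failure: a str containing a lone surrogate is encoded with the strict error handler. The sketch below reproduces just that step outside of distutils. The helper try_encode is hypothetical, and the sample name assumes a latin-1 byte 0xF6 that was surrogateescaped on the way in.

```python
# A filename whose latin-1 byte 0xF6 was surrogateescaped to U+DCF6.
name = "sdist_test/sm\udcf6rbr\udcf6d.py"

def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or the error reason if encoding fails."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError as exc:
        return exc.reason

utf8_strict = try_encode(name, "utf-8")
ascii_strict = try_encode(name, "ascii")

# With surrogateescape the original byte comes back instead of an error.
utf8_escaped = try_encode(name, "utf-8", "surrogateescape")

print(utf8_strict)
print(ascii_strict)
print(utf8_escaped)
```

    This only shows why the strict encoders fail. Whether the archive modules should be handed such names at all is the open question in this thread.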
    
  8. Toshio Kuratomi

    If I'm reading this feature correctly, there's nothing that prevents LANG=C from working for this code. The feature is to specifically allow utf-8 characters in the manifest. So you read the manifest as bytes. You turn the manifest into str type. You operate on it. If you write it back to a manifest file, you turn it into utf-8 again. Then you write it as a sequence of bytes. In this portion, the locale is not involved.
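
    A minimal sketch of that locale-independent round trip, assuming one filename per line and UTF-8 as the manifest encoding. The helpers dump_manifest and load_manifest are hypothetical names for this example and are not distribute's API.

```python
import os
import tempfile

def dump_manifest(path, filenames):
    """Write filenames as UTF-8 bytes, one per line, ignoring the locale."""
    data = "\n".join(filenames) + "\n"
    with open(path, "wb") as fh:
        fh.write(data.encode("utf-8"))

def load_manifest(path):
    """Read the manifest as bytes and decode it explicitly as UTF-8."""
    with open(path, "rb") as fh:
        data = fh.read()
    return [line for line in data.decode("utf-8").split("\n") if line]

names = ["setup.py", "sdist_test/sm\u00f6rbr\u00f6d.py"]

with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "SOURCES.txt")
    dump_manifest(manifest, names)
    round_tripped = load_manifest(manifest)

print(round_tripped == names)
```

    Because the file is opened in binary mode and the codec is named explicitly, LANG and LC_ALL play no part in reading or writing the manifest content here.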

    The locale does become involved (on *nix) when you create a manifest from the files that are on the filesystem. Distutils reads in some of the filenames from the filesystem. It uses the locale setting to decode the bytes that make up the filenames. If the filenames are undecodable using the locale, the filenames are decoded to str using surrogateescape'd representations of the unknown bytes. The question then becomes -- what does distribute specify it should do with those filenames?

    From reading the two initial comments to this issue, it looks like distribute needs to transform the entries into valid utf-8 before writing them to the manifest. This would satisfy "No package would ever be portable if local encodings were allowed in metadata." In the case where the filenames were not utf-8, distribute can toss an error, mangle the filenames (which will cause problems), attempt to store the invalid bytes escaped within the utf-8 string, or attempt to convert to utf-8 and toss an error if it is invalid. Looking at the results of the test_sdist_with_latin1_encoded_filename() unittest, it appears that distribute does none of these. Instead, it adds the raw byte sequence from the filesystem to the manifest. If this is the proper behaviour, it should probably be documented that the manifest's filelist is not utf-8 encoded metadata but bytes instead.

    Now for the unittest failure -- the reading, writing, and creation of manifest files works just fine in a C locale. The str representations use surrogateescape when reading and the precise bytes are written to disk when writing. The problem is how the representation is used in between those two points. In the unittest, a comparison is made between the filename and the filelist.files entry. The filename contains the vanilla str representation. The filelist contains the surrogateescaped version of the same. The unittest fails. But in reality, the two entries should match.

    The question is what should be adjusted to make that so.

  9. Toshio Kuratomi

    Still had some tracebacks in the unittests in the C locale. Here's a patch that applies to your branch to solve those.

    Changes are:

    setuptools/command/egg_info.py

    1) Better checking of whether a string is valid utf-8. The additional check is that surrogateescaped chars in the string (put there if the locale didn't know how to interpret those characters) are decodable as utf-8.

    2) Makes sure the manifest's filelist contain only decoded strings. These will be easier to compare than the surrogateescaped versions.

    setuptools/command/sdist.py

    3) Turn the decoded string back into a str with surrogateescaped chars if needed for the current locale. This is because we're passing the values into a stdlib function that can't handle unicode chars that aren't encodable in the current locale.

    setuptools/tests/test_sdist.py

    4) Easy fix for a traceback in the unittest due to needing to translate a str filename to bytes without aid from the locale.
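
    The patch itself is not reproduced in this thread, so the following is only a rough sketch of the idea behind changes 1) to 3), under the assumption that UTF-8 is the target encoding. The function names decode_escaped and escape_for are hypothetical.

```python
def decode_escaped(name):
    """Changes 1 and 2: return a fully decoded str if the bytes hidden
    behind any surrogateescaped chars are valid UTF-8, otherwise None."""
    try:
        raw = name.encode("utf-8", "surrogateescape")
        return raw.decode("utf-8")
    except UnicodeError:
        return None

def escape_for(name, encoding):
    """Change 3: re-express a decoded name so it can be handed to code that
    encodes with the given locale encoding and surrogateescape."""
    raw = name.encode("utf-8", "surrogateescape")
    return raw.decode(encoding, "surrogateescape")

utf8_name = decode_escaped("sm\udcc3\udcb6rbr\udcc3\udcb6d.py")
latin1_name = decode_escaped("sm\udcf6rbr\udcf6d.py")
for_c_locale = escape_for("sm\u00f6rbr\u00f6d.py", "ascii")

print(utf8_name, latin1_name, ascii(for_c_locale))
```

    With names normalized by something like decode_escaped, the filelist entries compare equal to the plain str filename used in the unittest, which is the mismatch described in the earlier comment.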

  10. Toshio Kuratomi

    Interesting strategy but not one I'd recommend in general. It's the same strategy as python2 (where str represented both text and bytes). Here we have to deal with it as the python3 stdlib is both producing (via distutils) and expecting to consume (in the os.path API) the mixed decodable and undecodable bytes.

  11. Stefan H. Holek

    Toshio Kuratomi: Thanks for the patch, it helped me understand the issues much more clearly. I am reluctant to go this far just yet though.

    What I did instead is back out the use of surrogateescape in favor of skipping filenames that cannot be en- or decoded as expected. This avoids UnicodeErrors at the expense of files that would break other parts of the tool-chain anyway (zipfile).
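
    The skipping approach could look roughly like the sketch below. It is not the code that was committed; encodable and filter_filenames are hypothetical helpers, and UTF-8 is assumed as the encoding to test against.

```python
def encodable(name, encoding="utf-8"):
    """True if name can be encoded strictly, i.e. has no lone surrogates."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

def filter_filenames(names, encoding="utf-8"):
    """Split names into those that are kept and those that are skipped."""
    kept, skipped = [], []
    for name in names:
        (kept if encodable(name, encoding) else skipped).append(name)
    return kept, skipped

kept, skipped = filter_filenames([
    "setup.py",
    "sdist_test/sm\u00f6rbr\u00f6d.py",   # decoded correctly
    "sdist_test/sm\udcf6rbr\udcf6d.py",   # latin-1 byte left as a surrogate
])

print(kept)
print([ascii(name) for name in skipped])
```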

    Note that tests still fail unless the locale is UTF-8. I see little point in obfuscating them even more just so they pass if LANG=C (or what-have-you). Tests must work on Windows, Mac, and Linux under Python 2 and 3, which is tricky enough as it is. ;-)

  12. Toshio Kuratomi

    I suppose if it's just code that's used to create a manifest from files on disk, that isn't too bad. You'll still have to make sure that LANG=C works with code that reads the manifest and possibly also code that writes a manifest from an in-memory representation. The first case can be changed by the developer of the module to satisfy the requirements (and if they enforce utf-8 filenames for modules on those developers it's actually a backdoor feature :-). The latter two cases are more legitimate cases for generic code running on a system so they would need to be avoided.
