Issue #193 open

Better handling of Unicode filenames on Python 2

philip_thiem
created an issue

Edit: Following b0a2fcc (Pull Request 43), some tests in Python 2 began to fail. This ticket starts with the failure, then goes into discussion about a better approach.

ERROR: test_sdist_with_utf8_encoded_filename (setuptools.tests.test_sdist.TestSdistTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\development\setuptools\setuptools\tests\test_sdist.py", line 340, in test_sdist_with_utf8_encoded_filename
    cmd.run()
  File "C:\development\setuptools\setuptools\command\sdist.py", line 92, in run
    self.run_command('egg_info')
  File "c:\python27\lib\distutils\cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "c:\python27\lib\distutils\dist.py", line 972, in run_command
    cmd_obj.run()
  File "C:\development\setuptools\setuptools\command\egg_info.py", line 163, in run
    self.find_sources()
  File "C:\development\setuptools\setuptools\command\egg_info.py", line 187, in find_sources
    mm.run()
  File "C:\development\setuptools\setuptools\command\egg_info.py", line 253, in run
    self.write_manifest()
  File "C:\development\setuptools\setuptools\command\egg_info.py", line 276, in write_manifest
    "writing manifest file '%s'" % self.manifest)
  File "c:\python27\lib\distutils\cmd.py", line 349, in execute
    util.execute(func, args, msg, dry_run=self.dry_run)
  File "c:\python27\lib\distutils\util.py", line 324, in execute
    func(*args)
  File "C:\development\setuptools\setuptools\command\egg_info.py", line 309, in write_file
    contents = contents.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)

This started occurring for me recently on python 2.x, when encoding a str type in python 2.x first an decoding is performed using the default encoding (as opposed to the preferred coding) before encoding to utf-8. If skipping the encode as was fixed in b0a2fcc was causing a bug, then we need to make sure that content is decoded to unicode before passing it to the function.

Comments (20)

  1. philip_thiem reporter

    I have several thoughts floating around, so not sure I have a good solid answer. Perhaps the answer is "temporarily, while we consider the ramifications." At first I thought this was going to be potentially further reaching because write_file grepped in several locations within command.egg_info. It seems that there are two. First on line 123 within the egg_info class:

        def write_file(self, what, filename, data):
            """Write `data` to `filename` (if not a dry run) after announcing it
    
            `what` is used in a log message to identify what is being written
            to the file.
            """
            log.info("writing %s to %s", what, filename)
            if sys.version_info >= (3,):
                data = data.encode("utf-8")
            if not self.dry_run:
                f = open(filename, 'wb')
                f.write(data)
                f.close()
    

    and the second on line 302:

    def write_file(filename, contents):
        """Create a file with the specified name and write 'contents' (a
        sequence of strings without line terminators) to it.
        """
        contents = "\n".join(contents)
        contents = contents.encode("utf-8")
        f = open(filename, "wb")        # always write POSIX-style manifest
        f.write(contents)
        f.close()
    

    I suppose the two functions were the same save for the log.info and dry_run check. The methods write_requirements, write_toplevel_names, write_arg, write_entries, all use the class function eventually. However, it seems that only write_manifest uses the module level write_file after all.

    Looking at the original versions of the two functions (with the py3 check and encode) I'm assuming that with py3 all the data/filelist contents should already be unicode because bytes don't have encode. If the commit remains in some form, do we need to review contents written by both functions?

    The first thing I'm not certain on, are the files that write_files output expected to be in utf-8 under python 2.x? If the first is yes, then the commit should remain. A separate issue would be whether or not setuptools should be moving that direction as py2.7 will live till 2020.

    Both the base distutil sdist and FileList takes whatever os.listdir returns. I'm feeling strongly that the files in the filelist are not necessarily unicode in python 2.x. If the manifest are in utf-8 then we must decode any encoded strings (as I've done). On the flip side, may there be unicode names in the filelist? The ways this could have happened which I see include:

    • Our call to os.listdir is being given a unicode path
    • Some hook (like maybe svn) is injecting unicode paths
    • Some third party hook

    If this behavior will be supported (at least one individual must be getting unicode filenames) then either we need to decode all unicode names in write_manifest to then output in the file system encoding, OR decode the filesystem names (as Pull Request 50 did) before outputting using whichever encoding. Either way, maybe a note needs to be added to the documentation.

    Added: I suppose my thoughts are:

    • If unicode is not permitted in the filelist:
      • try to find out why unicode if getting into the list.
      • If the output will not be in utf-8 drop the commit. If unicode in the filelist is supported, and PR 50 will not be needed.
      • If the output will be in utf-8, keep the commit, and modify PR 50 to unconditionally.
    • Personally, I tend to favor being more generous in what I accept and would prefer permitted unicode in the filelist at the offender's peril. Then:
      • If the output will not be in utf-8, drop both commit and the PR. Instead apply a version of the commit so that on PY2, unicode objects are encoded to the filesystem encoding.
      • If the output will be in utf-8 and unicode permitted in the file list my PR50 should be ok and not cause any weird side effects with other text files.
  2. Jason R. Coombs

    If the commit remains in some form, do we need to review contents written by both functions?

    Perhaps, but as of yet the issue hasn't been raised except for the module-level write_file, so I'd like to limit the scope to that one.

    Are the files that write_files output expected to be in utf-8 under python 2.x?

    Like many implementations that rely on Python 2 to silently encode/decode, the answer is not well defined. I believe this gives us freedom to set the expectation, specifically to make the file forward-compatible with the explict expectation in Python 3 that the contents are UTF-8 encoded and that the filenames can contain Unicode.

    Thanks so much for your analysis. It has really helped me understand the problem.

    Due to the potential impact of this change, I'm going to back out the original commit just to unblock releases of other less impactful changes. Then, we can stage your PR along with the always-decode patch for an upcoming release on its own (so it can be separately announced, tested, and easily backed out if necessary).

  3. Jason R. Coombs

    I've taken one step toward removing the complexities of the code paths. Next, I want to investigate whether the FileList.append method can handle the decoding on Python 2, such that it always contains text in .files. I tried this, but ran into some problems, so I'm shelving the work for now.

  4. philip_thiem reporter

    Thinking about it there is fair amount of duplication of the encoding stuff across the board. Definitely trying to abstract some of this stuff conceptually would probably make some of this easier to follow.

  5. Log in to comment