Antoine Pitrou / pathlib
Issue #22 resolved

Initializing Path with a unicode string raises an exception

Ralph Heinkel created an issue

Initializing Path with a unicode string raises an exception:

>>> import pathlib
>>> pathlib.Path(u'/etc/fstab')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pathlib.py", line 917, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/local/lib/python2.7/site-packages/pathlib.py", line 595, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/local/lib/python2.7/site-packages/pathlib.py", line 587, in _parse_args
    % type(a))
TypeError: argument should be a path or str object, not <type 'unicode'>

Comments (4)

  1. grainednoise

    I have looked at the changeset, and I feel the solution you chose is a step in the wrong direction. Using non-unicode filenames, particularly on Windows, can best be described as an unexploded bomb in your code. Sure, you may be lucky, but if it does explode you'll be in a world of pain.

    The biggest problem here is directory listings, i.e. any function calling os.listdir(). If you call it with a non-unicode string under Windows on a directory containing files whose names have non-ASCII characters in them, you can get one of two results:

    • If the unicode character is present in the Windows code page for your language settings (the most common one being CP-1252, in western Europe at least), it will be automatically converted to the corresponding byte value. For instance, the Euro symbol u"\u20ac" will become b"\x80". Ugly as this may be, it still kind of works, as functions like open() and os.listdir() happily accept these code-page dependent strings. But getting an unambiguous unicode representation back from those isn't possible unless you make assumptions about the code page, and you shouldn't. However, if I read things correctly, with the changes in the above changeset, pathlib will break here as the path contains non-ASCII characters.

    • If the unicode character is not found in said code page, Windows will substitute a "?" for it. This might not be immediately obvious (and is, indeed, not detected by pathlib), but the result is not a valid name, and any API call using that name will fail.

    I haven't tested anything under Linux or OS X. Hopefully both behave in a more consistent manner (like always returning UTF-8), but this is by no means a given. The only way we can be reasonably sure to avoid this mess is to use unicode filenames throughout.
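
The two code-page outcomes described in the comment above can be imitated with plain string encoding. This is only a sketch: Python's own cp1252 codec stands in for the conversion Windows performs, no Windows API is called, and the variable names and the choice of CP-1251 as a second code page are assumptions made for illustration.

```python
# Illustration only: str.encode() imitates the code-page conversion;
# it does not reproduce what os.listdir() does internally on Windows.
EURO = u"\u20ac"  # Euro sign, present in CP-1252
CJK = u"\u4e2d"   # a character that CP-1252 cannot represent

# Case 1: the character exists in the code page and becomes one byte.
euro_bytes = EURO.encode("cp1252")             # == b"\x80"

# Case 2: the character is missing and is replaced with "?".
lossy_bytes = CJK.encode("cp1252", "replace")  # == b"?"

# Decoding is ambiguous: the same byte is a different character
# depending on which code page you assume.
as_cp1252 = euro_bytes.decode("cp1252")        # Euro sign again
as_cp1251 = euro_bytes.decode("cp1251")        # Cyrillic letter U+0402

print(as_cp1252 == EURO, as_cp1251 == EURO)
```

The second case loses the original name entirely, which is why a "?" that comes back from a byte-string listing cannot be mapped to the real file.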

  2. Antoine Pitrou repo owner

    You are right on all counts, but this is a general problem with Python 2. pathlib has been designed from the start for Python 3, where file paths are treated much more sensibly. It works under 2.7, but not as well as it does under Python 3 (hence the wording: """Python 3.2 or later is recommended, but pathlib is also usable with Python 2.7""").
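
As a small illustration of the Python 3 behaviour mentioned in the reply above, the sketch below builds paths from ordinary text strings, including a non-ASCII name. It assumes Python 3 and uses the pure path classes so that it never touches the filesystem; the directory and file names are made up for the example.

```python
from pathlib import PurePosixPath, PureWindowsPath

# On Python 3 every str is unicode, so the path from the report
# is accepted without any special handling.
posix = PurePosixPath("/etc/fstab")

# A hypothetical file name containing the Euro sign stays text
# throughout; no code-page conversion is involved.
euro = PureWindowsPath("C:/data") / "price_\u20ac.txt"

print(str(posix))
print(euro.name == "price_\u20ac.txt")
```

Pure paths only manipulate the path string, so this says nothing about how a particular filesystem stores the name; it only shows that the text is carried through unchanged.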
