UnicodeDecodeError when displaying files with accents

Issue #36 invalid
rsmith31415 created an issue

There is a small issue displaying files whose names contain accents, but it only happens on Windows.

I tested this by creating a new file called hello:

>>> sarge.get_both('bash -c "cd Desktop/Test\ folder; ls"')
(u'hello\r\n', u'')

As you can see, everything works as expected. Then I changed the filename to héllo and ran the same command. This is the result:

>>> sarge.get_both('bash -c "cd Desktop/Test\ folder; ls"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\envs\sarge-env\lib\site-packages\sarge\__init__.py", line 1474, in get_both
    return p.stdout.text, p.stderr.text
  File "C:\Anaconda\envs\sarge-env\lib\site-packages\sarge\__init__.py", line 272, in text
    return self.bytes.decode(self.encoding)
  File "C:\Anaconda\envs\sarge-env\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
>>>

Can you reproduce this issue?

Comments (5)

  1. Vinay Sajip repo owner

    This is not a bug.

    If you know that an encoding XXX other than UTF-8 is used, you can use p = run(cmd, stdout=Capture(encoding=XXX), stderr=Capture(encoding=XXX)) and then access p.stdout.text or p.stderr.text.
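
The suggestion above amounts to decoding the captured bytes with the encoding that actually produced them. A minimal sketch using plain bytes rather than sarge itself; the sample bytes and the choice of cp1252 are assumptions based on the 0xe9 byte in the traceback, and a real Windows console may use a different code page:

```python
# Bytes as `ls` might emit them for "héllo" under a Windows ANSI code page.
# Assumption: cp1252, where 0xe9 is "é"; the real console encoding may differ.
raw = b"h\xe9llo\r\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xe9 opens a multi-byte UTF-8 sequence, but "l" is not a continuation byte.
    utf8_ok = False

# Decoding with the matching encoding recovers the accented character.
text = raw.decode("cp1252")

print(utf8_ok, ascii(text))
```

With sarge, the same encoding name would be passed as Capture(encoding=...) in the way described above.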

  2. rsmith31415 reporter

    Unfortunately, in my case, the user is expected to use bash commands to navigate through directories and files, so it is not possible to set the encoding on the fly. Is there an alternative?

    UPDATE:

    Actually, I think there is something else going on. cmd doesn't display Unicode characters and instead shows something like "h?llo", which is fine as far as I'm concerned. However, sarge.run actually works:

    >>> sarge.run('bash -c "ls"')
    h?llo
    <sarge.Pipeline object at 0x014A7E50>
    >>> sarge.run('bash -c "ls"').stderr
    h?llo
    >>> sarge.run('bash -c "ls"').stdout
    h?llo
    >>>
    

    (although I'm not sure why stderr is the same as stdout). Only get_both and its friends (get_stdout and get_stderr) fail in this situation.

  3. rsmith31415 reporter

    @vinay.sajip I think sarge.run works without raising an exception, unlike sarge.get_both, because sarge.get_both tries to decode the output as UTF-8, as you described in your answer. However, wouldn't this be solved by simply using decode(self.encoding, 'replace') in your function (__init__.py:267)?

        @property
        def text(self):
            """
            All the bytes in the capture queue, decoded as text.
            """
            return self.bytes.decode(self.encoding)
    

    That way you get something like this:

    >>> 'héllo'.decode('utf8', 'replace')
    u'h\xe9llo'
    

    which is more user-friendly.
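
What 'replace' yields depends on the bytes being decoded. A short Python 3 sketch, assuming the captured bytes are the single-byte form b"h\xe9llo" implied by the traceback: the invalid byte becomes U+FFFD rather than é, whereas bytes that are already valid UTF-8 decode to 'h\xe9llo' with or without 'replace'.

```python
# Assumption: these are the bytes behind the traceback (0xe9 at position 1).
legacy_bytes = b"h\xe9llo"
# The same name encoded as UTF-8, for comparison.
utf8_bytes = "h\u00e9llo".encode("utf-8")

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
replaced = legacy_bytes.decode("utf-8", "replace")
# Valid UTF-8 decodes unchanged, so "replace" has no effect here.
clean = utf8_bytes.decode("utf-8", "replace")

print(ascii(replaced))
print(ascii(clean))
```

So 'replace' avoids the exception, but for non-UTF-8 input it loses the accented character instead of recovering it.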

  4. Vinay Sajip repo owner

    Yes, but you can get the bytes just as easily, using the bytes property, and then you can decode them as you see fit. Any other default decoding in the text method is bound to trip some other people up, just as the current default has tripped you up. So: just get the bytes and do as you like with them. Consider the text property as just a very basic convenience. Alternatively, subclass Capture to do what you want and pass that instead of a vanilla Capture instance.
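
One way to "get the bytes and do as you like with them" is to try a few candidate encodings in order. The sketch below is only an illustration: decode_with_fallback is a hypothetical helper, not part of sarge, and the candidate list (UTF-8, then cp1252) is an assumption that suits this report rather than a general rule.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first encoding that succeeds (hypothetical helper)."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: fall back to replacement characters instead of raising.
    return data.decode(encodings[0], errors="replace")


# Single-byte (cp1252-style) input fails UTF-8 and is decoded by the second candidate.
print(ascii(decode_with_fallback(b"h\xe9llo\r\n")))
# Valid UTF-8 input is decoded by the first candidate.
print(ascii(decode_with_fallback("h\u00e9llo".encode("utf-8"))))
```

With sarge, such a helper would be applied to the bytes property mentioned above, for example decode_with_fallback(p.stdout.bytes), leaving the text property untouched.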
