UnicodeDecodeError when displaying files with accents

Issue #36 invalid

rsmith31415 created an issue 2015-09-24

There is a small issue displaying filenames that contain accents, but it only happens on Windows.

I tested this by creating a new file called hello:

>>> sarge.get_both('bash -c "cd Desktop/Test\ folder; ls"')
(u'hello\r\n', u'')

As you can see, everything works as expected. Then I changed the filename to héllo and ran the same command. This is the result:

>>> sarge.get_both('bash -c "cd Desktop/Test\ folder; ls"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda\envs\sarge-env\lib\site-packages\sarge\__init__.py", line 1474, in get_both
    return p.stdout.text, p.stderr.text
  File "C:\Anaconda\envs\sarge-env\lib\site-packages\sarge\__init__.py", line 272, in text
    return self.bytes.decode(self.encoding)
  File "C:\Anaconda\envs\sarge-env\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
>>>

Can you reproduce this issue?

Comments (5)

Vinay Sajip repo owner
- changed status to invalid
This is not a bug.

If you know that an encoding XXX other than UTF-8 is used, you can use p = run(cmd, stdout=Capture(encoding=XXX), stderr=Capture(encoding=XXX)) and then access p.stdout.text or p.stderr.text.
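
A minimal sketch of why the decode fails and how an explicit encoding avoids it. It assumes the console emitted the filename as cp1252 (the thread does not confirm the actual code page), and run_with_encoding is a hypothetical helper name that only wraps the call quoted above:

```python
# Bytes a Windows console program might emit for "héllo" (cp1252 is an assumption).
raw = b"h\xe9llo\r\n"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# In UTF-8, 0xe9 opens a multi-byte sequence, but the next byte ('l') is not a
# continuation byte, so decoding fails as in the traceback above.
utf8_result = try_decode(raw, "utf-8")

# In cp1252, 0xe9 is simply the character é.
cp1252_result = try_decode(raw, "cp1252")


def run_with_encoding(cmd, encoding):
    """Hypothetical helper: capture both streams with an explicit encoding."""
    from sarge import run, Capture  # imported lazily; requires sarge to be installed

    p = run(cmd, stdout=Capture(encoding=encoding), stderr=Capture(encoding=encoding))
    return p.stdout.text, p.stderr.text
```

With sarge installed, run_with_encoding('bash -c "ls"', 'cp1252') would be called in place of sarge.get_both, provided cp1252 really is the encoding in use.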
- 2015-09-24T05:03:22+00:00
rsmith31415 reporter
Unfortunately, in my case, the user is expected to use bash commands to navigate through directories and files, so it is not possible to set the encoding on the fly. Is there an alternative?

UPDATE:

Actually, I think there is something else going on. cmd doesn't display Unicode characters and instead shows something like "h?llo", which is fine as far as I'm concerned. However, sarge.run actually works:
```
>>> sarge.run('bash -c "ls"')
h?llo
<sarge.Pipeline object at 0x014A7E50>
>>> sarge.run('bash -c "ls"').stderr
h?llo
>>> sarge.run('bash -c "ls"').stdout
h?llo
>>>
```
(although I'm not sure why stderr is the same as stdout). It is only get_both and its friends (get_stdout and get_stderr) that fail in this situation.
- 2015-09-24T05:08:33+00:00
rsmith31415 reporter
@vinay.sajip I think the reason that sarge.run works without raising an exception unlike sarge.get_both is that sarge.get_both is trying to decode as you described in your answer (using utf8). However, wouldn't this be solved by simply using decode(self.encoding, 'replace') in your function (__init__.py:267):
```
    @property
    def text(self):
        """
        All the bytes in the capture queue, decoded as text.
        """
        return self.bytes.decode(self.encoding)
```
That way you get something like this:
```
>>> 'héllo'.decode('utf8', 'replace')
u'h\xe9llo'
```
which is more user-friendly.
- 2015-09-26T05:41:41+00:00
Vinay Sajip repo owner
Yes, but you can get the bytes just as easily, using the bytes property, and then you can decode them as you see fit. Any other default decoding in the text method is bound to trip some other people up, just as the current default has tripped you up. So: just get the bytes and do as you like with them. Consider the text property as just a very basic convenience. Alternatively, subclass Capture to do what you want and pass that instead of a vanilla Capture instance.
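
Both suggestions can be sketched roughly as follows. The names lenient_text, make_lenient_capture_class and LenientCapture are hypothetical, chosen for illustration; the subclass relies only on the bytes and encoding attributes that appear in the traceback above:

```python
def lenient_text(raw, encoding="utf-8"):
    """Decode captured bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, "replace")


def make_lenient_capture_class():
    """Build a Capture subclass whose text property never raises on bad bytes.

    sarge is imported lazily so the rest of this sketch runs without it.
    """
    from sarge import Capture

    class LenientCapture(Capture):
        @property
        def text(self):
            # Same as the stock property, but with the 'replace' error handler.
            return self.bytes.decode(self.encoding, "replace")

    return LenientCapture


# The byte 0xe9 is not valid UTF-8 here, so it becomes the replacement character.
sample = lenient_text(b"h\xe9llo")
```

With the first option you would call lenient_text(p.stdout.bytes) after run(cmd, stdout=Capture()); with the second, pass an instance of the subclass as stdout (and stderr) instead of a plain Capture.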
- 2015-09-26T10:23:41+00:00
rsmith31415 reporter
Thanks. Both alternatives help a lot.
- 2015-09-26T23:54:12+00:00

Assignee: –

Type: bug

Priority: minor

Status: invalid

Votes: 0

Watchers: 2