- changed status to invalid
UnicodeDecodeError when displaying files with accents
There is a small issue when displaying files whose names contain accents, but it only happens on Windows.
I tested this by creating a new file called hello:
>>> sarge.get_both('bash -c "cd Desktop/Test\ folder; ls"')
(u'hello\r\n', u'')
As you can see, everything works as expected. Then I changed the filename to héllo and ran the same command. This is the result:
>>> sarge.get_both('bash -c "cd Desktop/Test\ folder; ls"')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\envs\sarge-env\lib\site-packages\sarge\__init__.py", line 1474, in get_both
return p.stdout.text, p.stderr.text
File "C:\Anaconda\envs\sarge-env\lib\site-packages\sarge\__init__.py", line 272, in text
return self.bytes.decode(self.encoding)
File "C:\Anaconda\envs\sarge-env\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
>>>
Can you reproduce this issue?
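For context, the failure can be reproduced without sarge or Windows. The byte 0xE9 is 'é' in the cp1252 code page (cp1252 is an assumption here; the actual console code page may differ), and it is not a valid UTF-8 sequence. A minimal Python 3 sketch (the traceback above is from Python 2):

```python
# 'héllo' as a Windows console using code page 1252 would emit it
raw = 'héllo'.encode('cp1252')
print(raw)  # b'h\xe9llo'

# sarge decodes captured output as UTF-8 by default; 0xE9 starts a
# multi-byte UTF-8 sequence that the following 'l' (0x6C) cannot continue
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xe9 in position 1: ...
```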
Comments (5)
-
repo owner -
reporter Unfortunately, in my case, the user is expected to use bash commands to navigate through directories and files, so it is not possible to set the encoding on the fly. Is there an alternative?
UPDATE:
Actually, I think there is something else going on. cmd doesn't display Unicode characters and instead shows something like "h?llo", which is okay as far as I'm concerned. However, sarge.run actually works:
>>> sarge.run('bash -c "ls"')
h?llo
<sarge.Pipeline object at 0x014A7E50>
>>> sarge.run('bash -c "ls"').stderr
h?llo
>>> sarge.run('bash -c "ls"').stdout
h?llo
>>>
(although I'm not sure why stderr is the same as stdout). It is only get_stdout and its friends (get_both and get_stderr) that fail in this situation.
-
reporter @vinay.sajip I think the reason that sarge.run works without raising an exception, unlike sarge.get_both, is that sarge.get_both is trying to decode as you described in your answer (using utf8). However, wouldn't this be solved by simply using decode(self.encoding, 'replace') in your function (__init__.py:267):
@property
def text(self):
    """
    All the bytes in the capture queue, decoded as text.
    """
    return self.bytes.decode(self.encoding)
That way you get something like this:
>>> 'héllo'.decode('utf8', 'replace')
u'h\xe9llo'
which is more user-friendly.
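In Python 3 terms (the snippet above is Python 2), errors='replace' substitutes U+FFFD for each undecodable byte instead of raising:

```python
raw = b'h\xe9llo'  # cp1252 bytes for 'héllo'

# strict decoding (the default) would raise UnicodeDecodeError here;
# 'replace' swaps the bad byte for the U+FFFD replacement character
print(raw.decode('utf-8', 'replace'))  # 'h\ufffdllo'
```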
-
repo owner Yes, but you can get the bytes just as easily, using the bytes property, and then you can decode them as you see fit. Any other default decoding in the text method is bound to trip some other people up, just as the current default has tripped you up. So: just get the bytes and do as you like with them. Consider the text property as just a very basic convenience. Alternatively, subclass Capture to do what you want and pass that instead of a vanilla Capture instance.
-
reporter Thanks. Both alternatives help a lot.
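The two alternatives the owner suggests can be sketched without a live sarge install. The class below is a minimal stand-in, not sarge's real Capture (whose constructor and internals differ); only the overridden text property is the point:

```python
class CaptureStandIn:
    """Minimal stand-in for sarge.Capture: raw bytes plus an encoding."""
    def __init__(self, data=b'', encoding='utf-8'):
        self._data = data
        self.encoding = encoding

    @property
    def bytes(self):
        return self._data

    @property
    def text(self):
        # strict decoding, like sarge's default text property
        return self.bytes.decode(self.encoding)


class LenientCapture(CaptureStandIn):
    """Subclass whose text property never raises on undecodable bytes."""
    @property
    def text(self):
        return self.bytes.decode(self.encoding, 'replace')


cap = LenientCapture(b'h\xe9llo')
print(cap.text)                    # 'h\ufffdllo' instead of an exception
print(cap.bytes.decode('cp1252'))  # or decode the bytes yourself: 'héllo'
```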
This is not a bug.
If you know that an encoding XXX other than UTF-8 is used, you can use
p = run(cmd, stdout=Capture(encoding=XXX), stderr=Capture(encoding=XXX))
and then access p.stdout.text or p.stderr.text.
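What the Capture(encoding=XXX) fix does under the hood: the captured bytes get decoded with the code page the console actually uses instead of UTF-8. A sketch, assuming the console emits cp1252:

```python
captured = b'h\xe9llo\r\n'  # what the Windows child process actually wrote

# decoding with the console's real code page succeeds
print(captured.decode('cp1252'))  # 'héllo\r\n'

# whereas the UTF-8 default is exactly what raised in the traceback above:
# captured.decode('utf-8')  # UnicodeDecodeError: invalid continuation byte
```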