py.process.cmdexec fails if the out/err contains non-ascii characters

Issue #130 new
Antonio Cuni
created an issue

Consider the following file trouble.py, which just outputs a non-ascii character to stdout: {{{

# -*- encoding: utf-8 -*-
trouble = u'à'
print trouble.encode('utf-8')
}}}

py.process.cmdexec fails if we try to run it: {{{

import py
py.process.cmdexec('python trouble.py')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antocuni/pypy/misc/py-trunk/py/_process/cmdexec.py", line 32, in cmdexec
    raise ExecutionFailed(status, status, cmd, out, err)
py.process.cmdexec.Error: ExecutionFailed: 1 python trouble.py
Traceback (most recent call last):
  File "trouble.py", line 3, in <module>
    print trouble
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 0: ordinal not in range(128)
}}}

This happens because cmdexec tries to decode stderr and stdout using sys.stdout.encoding or sys.getdefaultencoding(), which do not necessarily match the output encoding of the program.

Real life use-case: a pypy test invokes gcc, which uses utf-8 characters for quotes. The test fails because of the exception inside cmdexec, even if the test itself completely ignores the stderr.
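
To make the failure mode concrete, here is a minimal sketch (not the cmdexec code itself) of what happens when bytes written as UTF-8 are decoded with a guessed encoding. The helper name try_decode is hypothetical, and using 'ascii' and 'iso-8859-15' as the parent's guesses is an assumption made for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: try_decode is a hypothetical helper, not part of pylib.

# The bytes a child process emits when it prints u'à' encoded as UTF-8.
raw = u"\xe0".encode("utf-8")


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# A strict wrong guess raises: 0xc3 is not valid ASCII.
ascii_text, ascii_error = try_decode(raw, "ascii")

# The right guess recovers the original character.
utf8_text, utf8_error = try_decode(raw, "utf-8")

# A permissive wrong guess does not raise; it silently yields two
# unrelated characters (u'\xc3' followed by a no-break space).
latin_text, latin_error = try_decode(raw, "iso-8859-15")
```

A strict wrong guess raises, while a permissive wrong guess silently produces garbled text, so neither kind of guess can be relied on to match what the child actually wrote.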

Comments (6)

  1. Ronny Pfannschmidt

    It's a bit troubling that the exception does not match the example program.

    On closer inspection with the example, it turns out that subprocess claims ascii encoding. I suppose it needs to fall back to utf8/latin1 on ascii.

  2. Holger Krekel repo owner

    Is there a safe way to determine the output encoding of the invoked program (in this case GCC)?

    Maybe it makes sense to use subprocess.call() instead of cmdexec, since on python2 subprocess will return str objects. In this case we wouldn't change anything in pylib. I lean towards the latter solution.

  3. Amaury Forgeot d'Arc

    I have the same issue on Windows. Why not use the encoding of sys.stdout? After all, this is where the program output goes normally, and it is supposed to be readable there.

    Here is the patch that I use to run pypy tests, it's needed because the name of the temporary directory depends on the name of the branch, gotten with some "svn info" command.

    Index: ../py/_process/cmdexec.py
    ===================================================================
    --- ../py/_process/cmdexec.py   (revision 78105)
    +++ ../py/_process/cmdexec.py   (working copy)
    @@ -21,10 +21,7 @@
                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
         out, err = process.communicate()
         if sys.version_info[0] < 3: # on py3 we get unicode strings, on py2 not
    -        try:
    -            default_encoding = sys.getdefaultencoding() # jython may not have it
    -        except AttributeError:
    -            default_encoding = sys.stdout.encoding or 'UTF-8'
    +        default_encoding = sys.stdout.encoding or 'UTF-8'
             out = unicode(out, process.stdout.encoding or default_encoding)
             err = unicode(err, process.stderr.encoding or default_encoding)
         status = process.poll()
    
  4. Antonio Cuni reporter

    I don't think you can safely use sys.stdout.encoding. On my machine, sys.stdout.encoding == 'ISO-8859-15', but trouble.py still outputs utf-8.

    I agree that it's probably a bug in trouble.py, but I still think that cmdexec should not crash just because of that. E.g., what happens if I call cmdexec('cat /boot/vmlinuz')? Should I expect it to be utf-8 or latin-15? :-)

  5. pchambon

    I also encountered nasty UnicodeDecodeErrors when simply running the pytest tests. The "--lsof" option, which was expected to fail on Windows, gave a cp1252 error string, breaking the whole execution.

    > C:\Users\Prolifik\Desktop\pytest\.tox\py27\Scripts\py.test.EXE --lsof -rfsxX --junitxml=C:\Users\Prolifik\Desktop\pytest\.tox\py27\log\junit-py27.xml
    process.stderr.encoding -> None 
    default_encoding -> ascii
    STDERR TO BE CONVERTED: "'lsof' n'est pas reconnu en tant que commande interne\nou externe, un programme ex\x82cutable ou un fichier de commandes.\n"
    

    Is there a consensus on a tolerant out/err decoding here? Using sys.stderr, or other sys encoding-related getters, or even a mere fallback to decode("ascii", "ignore") in the worst case? I can have a look at a PR once it's settled.

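
The tolerant decoding asked about in the last comment could look roughly like the sketch below. It is not a proposed patch to cmdexec: the names tolerant_decode and run_tolerant are hypothetical, the candidate order (UTF-8 first, then the locale's preferred encoding) is an assumption, and the last resort uses "replace" rather than "ignore" so that undecodable bytes stay visible as U+FFFD instead of disappearing.

```python
import locale
import subprocess
import sys


def tolerant_decode(data, candidates=None):
    """Decode bytes to text without ever raising (hypothetical helper)."""
    if candidates is None:
        # Assumed order: UTF-8 is strict enough to reject most non-UTF-8
        # input, so it is tried before the locale's preferred encoding.
        candidates = ["utf-8", locale.getpreferredencoding(False)]
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: never fail, mark undecodable bytes with U+FFFD.
    return data.decode("ascii", "replace")


def run_tolerant(args):
    """Run a command and return (status, out, err) as text (hypothetical)."""
    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()  # raw bytes, no decoding done yet
    return process.returncode, tolerant_decode(out), tolerant_decode(err)
```

With this sketch, the stderr bytes quoted above decode to "exécutable" if "cp850" is passed as a candidate (that the console used cp850 is a guess; byte 0x82 is "é" in that code page), and to "ex\ufffdcutable" if only "ascii" is tried. A caller that ignores stderr, as in the gcc use-case, would no longer see an exception from the decoding step either way.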