test_cmdline fails on Python 2.7 with UnicodeDecodeError

Issue #1492 resolved
Nikolay Orlyuk
created an issue
======================================================================                       
ERROR: test_L_opt (test_cmdline.CmdLineTest)                                                 
----------------------------------------------------------------------                       
Traceback (most recent call last):                                                                                                                                                        
  File "/var/tmp/paludis/build/dev-python-Pygments-2.3.1/work/PYTHON_ABIS/2.7/Pygments-2.3.1/tests/test_cmdline.py", line 149, in test_L_opt
    o = self.check_success('-L')                                                                                                                                                          
  File "/var/tmp/paludis/build/dev-python-Pygments-2.3.1/work/PYTHON_ABIS/2.7/Pygments-2.3.1/tests/test_cmdline.py", line 64, in check_success
    code, out, err = run_cmdline(*cmdline, **kwds)                                                                                                                                        
  File "/var/tmp/paludis/build/dev-python-Pygments-2.3.1/work/PYTHON_ABIS/2.7/Pygments-2.3.1/tests/test_cmdline.py", line 56, in run_cmdline
    out, err = stdout_buffer.getvalue().decode('utf-8'), \                                   
  File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue                               
    self.buf += ''.join(self.buflist)                                                                                                                                                     
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)

----------------------------------------------------------------------                       
Ran 2041 tests in 12.717s                                                                    

(full log)

On Python 2.7, test_cmdline uses the native StringIO. A similar error can be reproduced with

repr(''.join(['\xc2', u'']))

That is, it looks like an implicit unicode conversion triggered by mixed writes (unicode and byte strings) to stdout in pygments.cmdline.
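To make the suspected mechanism concrete: when Python 2 joins a byte string with a unicode string, it first decodes the byte string with the default ASCII codec. A minimal sketch of that step written out explicitly (it behaves the same on Python 2 and Python 3; the variable names are only for illustration):

```python
# -*- coding: utf-8 -*-
# Sketch: the explicit equivalent of the coercion Python 2 applies when
# a byte string is joined with a unicode string.
raw = b'\xc2'  # a lone non-ASCII byte, as in the one-liner above

try:
    # On Python 2, ''.join(['\xc2', u'']) performs this decode implicitly.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The byte 0xc2 is outside the ASCII range, so the decode fails with the same exception type as in the traceback.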

Comments (12)

  1. Nikolay Orlyuk reporter

    As mentioned in the full log: Python 2.7.15, Pygments 2.3.1.

    #!/bin/bash
    set -ex
    
    python --version
    locale -a
    locale
    
    PNV=Pygments-2.3.1
    
    wget -qO - https://files.pythonhosted.org/packages/source/P/Pygments/${PNV}.tar.gz | tar xz
    
    make -C "$PNV" test
    

    Output

    zsh% ./bug1492.sh 
    + python --version
    Python 2.7.15
    + locale -a
    C
    POSIX
    en_GB.utf8
    en_US.utf8
    fr_FR.utf8
    uk_UA.utf8
    + locale
    LANG=en_GB.utf8
    LC_CTYPE=en_US.UTF-8
    LC_NUMERIC=C
    LC_TIME=C
    LC_COLLATE=C
    LC_MONETARY=C
    LC_MESSAGES=C
    LC_PAPER=C
    LC_NAME=C
    LC_ADDRESS=C
    LC_TELEPHONE=C
    LC_MEASUREMENT=C
    LC_IDENTIFICATION=C
    LC_ALL=
    + PNV=Pygments-2.3.1
    + wget -qO - https://files.pythonhosted.org/packages/source/P/Pygments/Pygments-2.3.1.tar.gz
    + tar xz
    + make -C Pygments-2.3.1 test
    make: Entering directory '/tmp/ws/Pygments-2.3.1'
    Pygments 2.3.1 test suite running (Python 2.7.15)...
    .............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................SS.S.......S.....SS.S.......S......................E...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
    .............................................
    ======================================================================
    ERROR: test_L_opt (test_cmdline.CmdLineTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/ws/Pygments-2.3.1/tests/test_cmdline.py", line 149, in test_L_opt
        o = self.check_success('-L')
      File "/tmp/ws/Pygments-2.3.1/tests/test_cmdline.py", line 64, in check_success
        code, out, err = run_cmdline(*cmdline, **kwds)
      File "/tmp/ws/Pygments-2.3.1/tests/test_cmdline.py", line 56, in run_cmdline
        out, err = stdout_buffer.getvalue().decode('utf-8'), \
      File "/usr/x86_64-pc-linux-gnu/lib/python2.7/StringIO.py", line 271, in getvalue
        self.buf += ''.join(self.buflist)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 15: ordinal not in range(128)
    
    ----------------------------------------------------------------------
    Ran 2041 tests in 13.234s
    
    FAILED (SKIP=8, errors=1)
    make: *** [Makefile:53: test] Error 1
    make: Leaving directory '/tmp/ws/Pygments-2.3.1'
    
  2. Nikolay Orlyuk reporter

    It works on Alpine Linux:

    PNV=Pygments-2.3.1
    docker run --rm --network=host -i alpine << EOS
    apk add make python py2-nose
    adduser -D builder
    su - builder
    cd /tmp
    wget -qO - https://files.pythonhosted.org/packages/source/P/Pygments/${PNV}.tar.gz | tar xz
    make -C "${PNV}" test
    EOS
    

    The number of tests is the same (2041), but there are no errors.

  3. Nikolay Orlyuk reporter

    The same holds for the Exherbo Docker image (Python built from source):

    PNV=Pygments-2.3.1
    
    docker run --rm --network=host -i exherbo/exherbo_ci bash << EOS
    set -ex
    echo '*/* python_abis: -* 2.7 build_options: -recommended_tests' >> /etc/paludis/options.conf
    chgrp paludisbuild /dev/tty
    cave sync
    cave resolve -zx repository/python
    cave resolve -zx python:2.7 nose
    cd /tmp
    wget -qO - https://files.pythonhosted.org/packages/source/P/Pygments/${PNV}.tar.gz | tar xz
    make -C "${PNV}" test
    EOS
    

    But when I changed the default optimization flags from -march=x86-64 -mtune=generic -pipe -O2 to -march=native -mtune=native -pipe -O2 (GCC 8.2.0), it started to fail in the same way.

    PNV=Pygments-2.3.1
    
    docker run --rm --network=host -i exherbo/exherbo_ci bash << EOS
    set -ex
    echo '*/* python_abis: -* 2.7 build_options: -recommended_tests' >> /etc/paludis/options.conf
    sed -i -r 's/(-(march|mtune)=)[^ ]*/\1native/g' /etc/paludis/bashrc
    chgrp paludisbuild /dev/tty
    cave sync
    cave resolve -zx repository/python
    cave resolve -zx python:2.7 nose
    cd /tmp
    wget -qO - https://files.pythonhosted.org/packages/source/P/Pygments/${PNV}.tar.gz | tar xz
    make -C "${PNV}" test
    EOS
    

    Diff in effective options (dumped with -Q --help=target --help=optimizers):

    --- /tmp/gcc-opts-generic.txt   2019-01-16 22:11:12.550604066 +0000
    +++ /tmp/gcc-opts-native.txt    2019-01-16 22:11:27.833868200 +0000
    @@ -12 +12 @@
    -  -mabm                            [disabled]
    +  -mabm                            [enabled]
    @@ -15,2 +15,2 @@
    -  -madx                            [disabled]
    -  -maes                            [disabled]
    +  -madx                            [enabled]
    +  -maes                            [enabled]
    @@ -24 +24 @@
    -  -march=                          x86-64
    +  -march=                          skylake
    @@ -26,4 +26,4 @@
    -  -mavx                            [disabled]
    -  -mavx2                           [disabled]
    -  -mavx256-split-unaligned-load    [enabled]
    -  -mavx256-split-unaligned-store   [enabled]
    +  -mavx                            [enabled]
    +  -mavx2                           [enabled]
    +  -mavx256-split-unaligned-load    [disabled]
    +  -mavx256-split-unaligned-store   [disabled]
    @@ -46,2 +46,2 @@
    -  -mbmi                            [disabled]
    -  -mbmi2                           [disabled]
    +  -mbmi                            [enabled]
    +  -mbmi2                           [enabled]
    @@ -52 +52 @@
    -  -mclflushopt                     [disabled]
    +  -mclflushopt                     [enabled]
    @@ -58 +58 @@
    -  -mcx16                           [disabled]
    +  -mcx16                           [enabled]
    @@ -61 +61 @@
    -  -mf16c                           [disabled]
    +  -mf16c                           [enabled]
    @@ -64 +64 @@
    -  -mfma                            [disabled]
    +  -mfma                            [enabled]
    @@ -70 +70 @@
    -  -mfsgsbase                       [disabled]
    +  -mfsgsbase                       [enabled]
    @@ -78 +78 @@
    -  -mhle                            [disabled]
    +  -mhle                            [enabled]
    @@ -92 +92 @@
    -  -mlzcnt                          [disabled]
    +  -mlzcnt                          [enabled]
    @@ -97 +97 @@
    -  -mmovbe                          [disabled]
    +  -mmovbe                          [enabled]
    @@ -109 +109 @@
    -  -mno-sse4                        [enabled]
    +  -mno-sse4                        [disabled]
    @@ -115 +115 @@
    -  -mpclmul                         [disabled]
    +  -mpclmul                         [enabled]
    @@ -119 +119 @@
    -  -mpopcnt                         [disabled]
    +  -mpopcnt                         [enabled]
    @@ -124 +124 @@
    -  -mprfchw                         [disabled]
    +  -mprfchw                         [enabled]
    @@ -127,2 +127,2 @@
    -  -mrdrnd                          [disabled]
    -  -mrdseed                         [disabled]
    +  -mrdrnd                          [enabled]
    +  -mrdseed                         [enabled]
    @@ -135,3 +135,3 @@
    -  -mrtm                            [disabled]
    -  -msahf                           [disabled]
    -  -msgx                            [disabled]
    +  -mrtm                            [enabled]
    +  -msahf                           [enabled]
    +  -msgx                            [enabled]
    @@ -145,4 +145,4 @@
    -  -msse3                           [disabled]
    -  -msse4                           [disabled]
    -  -msse4.1                         [disabled]
    -  -msse4.2                         [disabled]
    +  -msse3                           [enabled]
    +  -msse4                           [enabled]
    +  -msse4.1                         [enabled]
    +  -msse4.2                         [enabled]
    @@ -152 +152 @@
    -  -mssse3                          [disabled]
    +  -mssse3                          [enabled]
    @@ -165 +165 @@
    -  -mtune=                          generic
    +  -mtune=                          skylake
    @@ -175,4 +175,4 @@
    -  -mxsave                          [disabled]
    -  -mxsavec                         [disabled]
    -  -mxsaveopt                       [disabled]
    -  -mxsaves                         [disabled]
    +  -mxsave                          [enabled]
    +  -mxsavec                         [enabled]
    +  -mxsaveopt                       [enabled]
    +  -mxsaves                         [enabled]
    

    But I guess this issue is related either to Python or to GCC 8.2.0. The problem is that I don't have a minimal example that reproduces it yet (repr(''.join(['\xc2', u''])) fails for both).

  4. Anteru

    On Ubuntu 18.10, it should be compiled with the system GCC which is 8.2 IIRC, but Python reports itself as 2.7.15+. Maybe they have some patch applied which makes it work?

  5. Nikolay Orlyuk reporter

    Sorry, I can no longer reproduce the issue inside the container. Note the strange error message on line 19037 of the container log I referenced before. I have no idea how it got there.

    After rebuilding part of my host system, including Python 2.7.15, with x86_64/generic optimization, I still get the issue. So I guess the issue is not with the optimization flags.

    I'll try to debug it a bit further. I will check my assumption that sys.stdout or sys.stderr is being fed a mix of unicode and str on my system when I run this test.

  6. Nikolay Orlyuk reporter

    OK, I was able to narrow the issue down. My initial idea is confirmed: the output mixes unicode with byte strings that contain non-ASCII characters. See the test output and the corresponding patch I applied to expose the issue.

    So with Python 2.7.15 on my machine, the following code fails:

    from pygments import StringIO
    buf = StringIO()
    buf.write('The Arduino\xc2\xae language style')
    buf.write(u'unicode')
    buf.getvalue()
    

    The same happens with Python 2.7.15 on Alpine Linux:

    zsh% docker run --rm --network=host -it alpine                 
    / # apk add python2
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
    (1/10) Installing libbz2 (1.0.6-r6)
    (2/10) Installing expat (2.2.5-r0)
    (3/10) Installing libffi (3.2.1-r4)
    (4/10) Installing gdbm (1.13-r1)
    (5/10) Installing ncurses-terminfo-base (6.1_p20180818-r1)
    (6/10) Installing ncurses-terminfo (6.1_p20180818-r1)
    (7/10) Installing ncurses-libs (6.1_p20180818-r1)
    (8/10) Installing readline (7.0.003-r0)
    (9/10) Installing sqlite-libs (3.25.3-r0)
    (10/10) Installing python2 (2.7.15-r1)
    Executing busybox-1.28.4-r2.trigger
    OK: 51 MiB in 23 packages
    / # python2
    Python 2.7.15 (default, Aug 16 2018, 14:17:09) 
    [GCC 6.4.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from StringIO import StringIO
    >>> buf = StringIO()
    >>> buf.write('The Arduino\xc2\xae language style')
    >>> buf.write(u'unicode')
    >>> buf.getvalue()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/StringIO.py", line 271, in getvalue
        self.buf += ''.join(self.buflist)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 11: ordinal not in range(128)
    

    You can see in the output I printed before the error that buflist contains:

    • u'* coconut_python, coconut_py, coconut_python3, coconut_py3:\n coconut_python (filenames *.py_template)'
    • ' The Arduino\xc2\xae language style. This style is designed to highlight the Arduino source code, so exepect the best results with it.'
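One way to avoid the failure with a buflist like the one above is to decode byte chunks explicitly before joining them. The following is only a sketch under assumptions: join_as_text is a hypothetical helper (not part of Pygments or the StringIO module), the chunks are shortened versions of the two entries listed above, and UTF-8 is assumed as the encoding of the byte strings.

```python
# -*- coding: utf-8 -*-
def join_as_text(chunks, encoding='utf-8'):
    """Join a mix of byte and text chunks into one text string.

    Hypothetical helper, not part of Pygments or the StringIO module.
    """
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    parts = []
    for chunk in chunks:
        if not isinstance(chunk, text_type):
            # Decode bytes explicitly instead of relying on the
            # implicit ASCII coercion that raises UnicodeDecodeError.
            chunk = chunk.decode(encoding)
        parts.append(chunk)
    return u''.join(parts)


# Shortened stand-ins for the two buflist entries shown above.
buflist = [
    u'* coconut_python (filenames *.py_template)',
    b' The Arduino\xc2\xae language style.',
]
joined = join_as_text(buflist)
print(repr(joined))
```

With the explicit decode, the two bytes \xc2\xae become the single character U+00AE and the join succeeds on both Python 2 and Python 3.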
  7. Nikolay Orlyuk reporter

    Indeed, PEP 257 (docstrings) says:

    For consistency, always use """triple double quotes""" around docstrings. Use r"""raw triple double quotes""" if you use any backslashes in your docstrings. For Unicode docstrings, use u"""Unicode triple-quoted strings""".

    In pygments/styles/arduino.py:

    class ArduinoStyle(Style):
        """
        The Arduino® language style. This style is designed to highlight the
        Arduino source code, so exepect the best results with it.
        """
    
  8. Nikolay Orlyuk reporter

    And the only unicode strings come from coconut (installed on my host system), so I have a Docker script for you to reproduce it :)

    ./bug1492-docker.sh

    #!/bin/bash
    
    PNV=Pygments-2.3.1
    
    docker run --rm --network=host -i alpine << EOS
    apk add make python py2-nose py2-pip
    pip install coconut  # <-- missing part of the puzzle
    adduser -D builder
    su - builder
    cd /tmp
    wget -qO - https://files.pythonhosted.org/packages/source/P/Pygments/${PNV}.tar.gz | tar xz
    make -C "${PNV}" test
    EOS
    

    Whew! So I can use -march=native on my host after all.
