UTF-8 BOM causes encode errors with pygmentize command

Issue #851 resolved
Eric Knibbe
created an issue

The pygmentize command should remove the UTF-8 BOM when sending to terminal or a file. For example, pygmentize tests/examplefiles/BOM.js prints the file's contents without issue, but pygmentize tests/examplefiles/BOM.js | less returns

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 13: ordinal not in range(128)

Also, cat tests/examplefiles/BOM.js | pygmentize -l js returns

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

Specifying UTF-8 as the encoding still causes the BOM to appear as a character:

pygmentize -O encoding='utf-8' tests/examplefiles/BOM.js | less returns

<U+FEFF>/* There is a BOM at the beginning of this file. */

Occurs with Python 2.7.3.
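
The behaviour above can be reproduced outside pygmentize. The following is a minimal sketch in Python 3 (not the 2.7.3 from the report, and not the actual Pygments fix); the sample bytes and the helper name ascii_encodable are made up for illustration. It shows that decoding with 'utf-8' keeps the BOM as the character U+FEFF, that the 'ascii' codec rejects both the raw bytes and the decoded text, and that the standard 'utf-8-sig' codec drops a leading BOM.

```python
import codecs

# Hypothetical stand-in for the bytes of a file that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"/* There is a BOM at the beginning of this file. */\n"

# Decoding with plain 'utf-8' keeps the BOM as the character U+FEFF.
with_bom = raw.decode("utf-8")

# Decoding with 'utf-8-sig' strips a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

# Decoding the raw bytes as ASCII fails on the first BOM byte (0xef).
try:
    raw.decode("ascii")
    decode_error_position = None
except UnicodeDecodeError as exc:
    decode_error_position = exc.start


def ascii_encodable(text):
    """Return True if text can be encoded with the 'ascii' codec."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

Here ascii_encodable(with_bom) is False while ascii_encodable(without_bom) is True, which matches the report: the output only breaks once the BOM character reaches an ASCII-only output stream.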

Comments (2)

  1. amarquesbra

    To add another data point:

    # less foo.xml
    *** Error while highlighting:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 44: ordinal not in range(128)
       (file "/usr/lib64/python2.7/codecs.py", line 351, in write)

    For reference, this is on openSUSE.

    Rendering an XML file with n lines shows only these:

    <?xml version="1.0" encoding="UTF-8"?>
    <ListingDataFeed xmlns:xs="http://www.w3.org/2001/XMLSchema">
    foo.xml (END)

    The last line of the following excerpt is what triggers the bug:

    # more foo.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <ListingDataFeed xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <Title>Comercial loja - Bairro Setor Marista em Goiânia</Title>
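
The failure in this comment does not involve a BOM: any non-ASCII character hits the same 'ascii' codec limit. A minimal Python 3 sketch, assuming only the Title line quoted above (the variable names are illustrative), shows which character the codec rejects and that encoding to UTF-8 succeeds:

```python
# The line from the excerpt; the "â" is written as an escape (U+00E2)
# so the example does not depend on the source file's encoding.
line = "<Title>Comercial loja - Bairro Setor Marista em Goi\u00e2nia</Title>"

# The 'ascii' codec cannot represent U+00E2, so encoding raises.
try:
    line.encode("ascii")
    failing_char = None
except UnicodeEncodeError as exc:
    failing_char = line[exc.start]

# Encoding to UTF-8 works and produces the two-byte sequence for "â".
encoded = line.encode("utf-8")
```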