Couldn't parse Python file with cp1252 encoding from xlwt

Issue #431 resolved
Murat Knecht
created an issue

Coverage choked on a file from the xlwt library.

Couldn't parse 'venv/lib/python2.7/site-packages/xlwt/BIFFRecords.py' as Python source: ''charmap' codec can't decode byte 0x9d in position 68292: character maps to <undefined>' at line 0

To reproduce, install these dependencies in a virtualenv named env in the working directory:

argparse==1.2.1
coverage==4.0
wsgiref==0.1.2
xlwt==0.7.5

Then run this (which is the boiled-down version of what coverage does, afaict):

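from coverage import phystokens

# Read the raw bytes of the file that coverage chokes on.
with open("./env/lib/python2.7/site-packages/xlwt/BIFFRecords.py", "rb") as f:
    raw = f.read()

# Detect the source encoding from the raw bytes.
enc = phystokens._source_encoding_py2(raw)
print("encoding: {}".format(enc))

# Decode to Unicode, then compile the Unicode text.
uni = raw.decode(enc, "replace")
phystokens.compile_unicode(uni, "<string>", "exec")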

When using compile directly on raw, it works.

Possibly related to #157.

Comments (6)

  1. Ned Batchelder repo owner

    Somehow, utf8 is getting mixed into this:

    >>> b"\x93hi\x94".decode("cp1252").encode("utf8")
    '\xe2\x80\x9chi\xe2\x80\x9d'
    

    The xlwt code has curly quotes in its docstrings (\x93 and \x94 in cp1252). Encoded as utf8, the closing quote becomes \xe2\x80\x9d, and that \x9d byte is then being decoded as cp1252, which has no character at \x9d.
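
The byte-level round trip described above can be checked in isolation. This is a minimal sketch that uses only the built-in codecs; the variable names are illustrative and it does not involve coverage or xlwt at all.

```python
# Curly quotes as cp1252 bytes, like the ones in the xlwt docstrings.
raw = b"\x93hi\x94"

# Decode with the declared encoding, then encode the text as UTF-8.
text = raw.decode("cp1252")
as_utf8 = text.encode("utf8")

# The closing quote (U+201D) ends in byte 0x9d once encoded as UTF-8.
has_9d = b"\x9d" in as_utf8

# Decoding those UTF-8 bytes as cp1252 fails: cp1252 leaves 0x9d undefined.
try:
    as_utf8.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(has_9d)
print(error)
```

The error raised here has the same "character maps to <undefined>" wording as the one in the report, which fits the idea that UTF-8 bytes are being decoded as cp1252 somewhere along the way.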

  2. Murat Knecht reporter

    It looks like a compile bug in that it re-encodes the already Unicode source with the encoding specified in the header … which does not make sense. Nevertheless, most combinations of loading the file and dumping it into compile work, so coverage might want to use a workaround.
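
One possible shape for such a workaround, assuming the coding declaration in the header is what triggers the re-encoding, is to neutralize that declaration in the already-decoded text before calling compile. The helper below is hypothetical (its name and the replacement comment are made up for illustration) and is not confirmed to be what coverage does; the regular expression is the one given in PEP 263.

```python
import re

# PEP 263 pattern for a coding declaration (valid on the first two lines only).
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def neutralize_coding_declaration(source):
    """Hypothetical helper: replace a coding declaration in decoded source.

    The declaration line is swapped for a plain comment, so line numbers
    in the compiled code stay the same.
    """
    lines = source.splitlines(True)
    for i, line in enumerate(lines[:2]):
        if CODING_RE.match(line):
            # Preserve the original line ending, whatever it was.
            ending = line[len(line.rstrip("\r\n")):]
            lines[i] = "# (declaration removed)" + ending
    return "".join(lines)


# Already-decoded source that still carries a cp1252 declaration.
source = u"# -*- coding: cp1252 -*-\nx = u'\u201chi\u201d'\n"
cleaned = neutralize_coding_declaration(source)

# Compile the Unicode text with no declaration left to act on.
code = compile(cleaned, "<string>", "exec")
namespace = {}
exec(code, namespace)
print(repr(namespace["x"]))
```

The sketch simplifies PEP 263 slightly: it rewrites any matching comment on the first two lines without checking that the first line is itself a comment or blank. Since a matching line is always a comment, replacing it does not change the program.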
