1. Ned Batchelder
  2. coverage.py
Issue #303 resolved

UnicodeDecodeError

douglasbsf
created an issue
Traceback (most recent call last):
  File "/usr/local/bin/coverage", line 8, in <module>
    load_entry_point('coverage==3.7.1', 'console_scripts', 'coverage')()
  File "/Library/Python/2.7/site-packages/coverage/cmdline.py", line 721, in main
    status = CoverageScript().command_line(argv)
  File "/Library/Python/2.7/site-packages/coverage/cmdline.py", line 461, in command_line
    **report_args)
  File "/Library/Python/2.7/site-packages/coverage/control.py", line 662, in html_report
    return reporter.report(morfs)
  File "/Library/Python/2.7/site-packages/coverage/html.py", line 113, in report
    self.report_files(self.html_file, morfs, self.config.html_dir)
  File "/Library/Python/2.7/site-packages/coverage/report.py", line 84, in report_files
    report_fn(cu, self.coverage._analyze(cu))
  File "/Library/Python/2.7/site-packages/coverage/html.py", line 253, in html_file
    html = html.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3418: ordinal not in range(128)

Comments (18)

  1. Ned Batchelder repo owner

    Can you please provide more detail? In particular, a reproducible test case would be appreciated. My guess is that you have a non-ASCII character in a comment in your source file, and that if you add a coding comment to the top, coverage will work fine:

    # -*- coding: utf8 -*-
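
The coding comment suggested above follows PEP 263. As a rough sketch of checking whether a file already carries one, the helper below looks for the declaration in the first two lines. The name `declared_encoding` is hypothetical and is not part of coverage.py, and the check is simplified: it does not verify that line 1 is itself a comment when the declaration sits on line 2.

```python
import re

# Pattern taken from PEP 263: a comment that names the source encoding.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_bytes):
    """Return the encoding named in the first two lines, or None."""
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

A file read in binary mode (`open(path, "rb").read()`) can be passed straight in. A result of `None` means Python 2 will treat the source as ASCII.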
    
  2. douglasbsf reporter

    Another detail: I'm running Django unit tests like this: coverage run --source='.' manage.py test client.tests. With a simple program it works fine, like this: coverage run myprogram.py. The problem occurs only when I call "coverage html".

  3. Robert Sussland

    I am having the same issue. There are no non-ASCII characters in any of the Python source files; however, the test suite I am running coverage on downloads files with unicode file names and processes files containing unicode characters. I cannot send you my code, as it pulls data from an on-site database. The stack trace doesn't show which file triggered the error:

      File "/Users/rsussland/pop/bin/coverage", line 9, in <module>
        load_entry_point('coverage==4.0a0', 'console_scripts', 'coverage')()
      File "/Users/rsussland/pop/lib/python2.7/site-packages/coverage-4.0a0-py2.7-macosx-10.6-intel.egg/coverage/cmdline.py", line 747, in main
        status = CoverageScript().command_line(argv)
      File "/Users/rsussland/pop/lib/python2.7/site-packages/coverage-4.0a0-py2.7-macosx-10.6-intel.egg/coverage/cmdline.py", line 467, in command_line
        **report_args)
      File "/Users/rsussland/pop/lib/python2.7/site-packages/coverage-4.0a0-py2.7-macosx-10.6-intel.egg/coverage/control.py", line 679, in html_report
        return reporter.report(morfs)
      File "/Users/rsussland/pop/lib/python2.7/site-packages/coverage-4.0a0-py2.7-macosx-10.6-intel.egg/coverage/html.py", line 109, in report
        self.report_files(self.html_file, morfs, self.config.html_dir)
      File "/Users/rsussland/pop/lib/python2.7/site-packages/coverage-4.0a0-py2.7-macosx-10.6-intel.egg/coverage/report.py", line 81, in report_files
        report_fn(cu, self.coverage._analyze(cu))
      File "/Users/rsussland/pop/lib/python2.7/site-packages/coverage-4.0a0-py2.7-macosx-10.6-intel.egg/coverage/html.py", line 241, in html_file
        html = html.decode(encoding)
      File "/Users/rsussland/pop/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 231299: invalid start byte
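
The two tracebacks in this thread fail in different ways, and a small sketch can reproduce both messages. The sample strings are made up for illustration; the actual files behind the reports are unknown. Byte 0xc3 is a typical UTF-8 lead byte that the ascii codec rejects, while 0xf6 is never a valid UTF-8 start byte but is "ö" in Latin-1, so one plausible (unconfirmed) reading of the second traceback is a Latin-1 file being decoded as UTF-8.

```python
# Illustrative data only: the same kind of bytes named in the two tracebacks.
utf8_bytes = u"caf\u00e9".encode("utf-8")        # contains 0xc3 0xa9
latin1_bytes = u"sch\u00f6n".encode("latin-1")   # contains 0xf6


def try_decode(data, encoding):
    """Return the decoded text, or the error reason if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason
```

Calling `try_decode(utf8_bytes, "ascii")` gives the "ordinal not in range(128)" reason from the first report, and `try_decode(latin1_bytes, "utf-8")` gives the "invalid start byte" reason from the second.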
    
  4. Ian Cordasco

    The problem with ignoring errors when encoding unicode as ascii is that you're going to lose data. What data is being used, I'm not certain, but Robert Sussland could you please put your traceback in a fenced code block, i.e., precede it with three backticks and a newline, and follow it by a newline and three backticks. That will make it much easier to read.

  5. Ian Cordasco

    So I'm looking at this code, and I'm wondering why decode is even called. It seems to be a special case for Python 2.6 and 2.7 but the string API in those versions is different than Python 3. You can do 'foo bar bogus'.encode('ascii') with confidence on Python 2. I'm not sure how well it will work with xmlcharrefreplace, but I suspect that is possibly the only problem. I'm going to investigate a bit.
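
To make the trade-off between the error handlers concrete, here is a minimal sketch using a made-up unicode string. It only shows the standard codec behaviour and is not a claim about what html.py does: `ignore` silently drops the character, while `xmlcharrefreplace` keeps it as a numeric character reference that a browser can still render.

```python
text = u"caf\u00e9"  # illustrative sample, not taken from the thread

# The default "strict" handler refuses to encode the non-ASCII character.
strict_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("ascii", "ignore")             # character is lost
escaped = text.encode("ascii", "xmlcharrefreplace")  # character kept as &#233;
```

Note that `xmlcharrefreplace` applies only when encoding unicode text; it is not available as a decoding error handler.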

  6. Robert Sussland

    I'm not familiar enough with this code to determine whether the policy is to work with code points or bytestrings. Unless you import unicode_literals, 'foo bar bogus' is already an ascii-encoded byte string, so calling encode on it will first decode it with the ascii codec and then re-encode it with the same codec. If you intend to work with code points, then you do need to decode, but I don't see any of the standard patterns for doing that in the html.py file. For example, you are calling with open instead of with codecs.open, so your other strings are already bytestrings rather than code points. In that case there is no need to decode at all, but care must be taken when combining bytestrings of different encodings (they are all ascii byte strings).
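
The bytestring versus code point distinction can be sketched as follows. This uses `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3, as a stand-in for the `codecs.open` pattern mentioned above; the helper name `read_both_ways` is hypothetical and the temporary file exists only for the demonstration.

```python
import io
import os
import tempfile


def read_both_ways(raw_bytes, encoding):
    """Write raw_bytes to a temporary file, then read it back as bytes and as text."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw_bytes)
        # Binary mode: the result is an undecoded bytestring.
        with io.open(path, "rb") as handle:
            as_bytes = handle.read()
        # Text mode with an explicit encoding: the result is decoded code points.
        with io.open(path, "r", encoding=encoding) as handle:
            as_text = handle.read()
    finally:
        os.remove(path)
    return as_bytes, as_text
```

For a UTF-8 file containing "é", the bytestring is one element longer than the text, because the character occupies two bytes but is a single code point.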

  7. Ned Batchelder repo owner

    Robert Sussland The line of code in question is dealing with the HTML version of your source files. I find it very hard to believe that you have no non-ascii characters in your source file. Perhaps in a comment? A curly apostrophe? The data you download for your tests doesn't matter, that isn't part of the HTML report.

    In Python 2.7, do this:

    open("mysource.py").read().decode('ascii')
    

    Does it succeed, or raise an exception? Also, it looks like your file is really large? Can you share any of the code with me?
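
Because the decode check above only reports success or failure, a slightly more detailed sketch can show where the offending bytes are. The function name `find_non_ascii` is hypothetical; it takes the raw contents of a file (read in binary mode) and reports 1-based line numbers and 0-based byte columns.

```python
def find_non_ascii(source_bytes):
    """Return (line_number, column, byte_value) for every byte above 0x7f."""
    hits = []
    for line_number, line in enumerate(source_bytes.splitlines(), start=1):
        # bytearray yields integers on both Python 2 and Python 3.
        for column, byte in enumerate(bytearray(line)):
            if byte > 0x7F:
                hits.append((line_number, column, byte))
    return hits
```

Running it over each measured source file, for example `find_non_ascii(open("mysource.py", "rb").read())`, should point at any stray curly apostrophe or accented character in a comment.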

  8. Ned Batchelder repo owner

    Sorry, I see that you say the file isn't in the error message. At the very least, I can add some information there so these problems are easier to diagnose while we decide on an approach.
