Chokes on source files with non-utf-8 encoding

Issue #157 resolved
Wolfgang Schnerring created an issue

If you have Python source files that are, e.g., latin-1 encoded, the HTML reporter will die like this:

{{{
    coverage.main()
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 657, in main
    status = CoverageScript().command_line(argv)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 549, in command_line
    , **report_args)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 599, in html_report
    , config=self.config)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 83, in report
    self.report_files(self.html_file, morfs, config, config.html_dir)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 86, in report_files
    report_fn(cu, self.coverage._analyze(cu))
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 198, in html_file
    self.write_html(html_path, html)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 103, in write_html
    write_encoded(fname, html, 'ascii', 'xmlcharrefreplace')
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/", line 137, in write_encoded
    f.write(text.decode('utf8'))
  File "/usr/local/python2.6/lib/python2.6/encodings/", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 14451: invalid continuation byte
}}}

The workaround is simple, of course: change the file's encoding and declaration (and you should be using utf-8 anyway, if anything). But still, I wonder whether this could be handled more gracefully, with an error message that says what's going on.
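
For reference, the failure can be reproduced without coverage at all. The snippet below is a minimal sketch in Python 3 (the traceback above is from Python 2.6). The sample source text and the names `latin1_source`, `error`, and `decoded` are made up for illustration. It shows that a latin-1 encoded `ä` is the single byte 0xe4, which cannot be decoded as UTF-8 when an ordinary ASCII character follows it.

```python
# Hypothetical sample: a source file saved as latin-1, with a matching declaration.
latin1_source = "# -*- coding: latin-1 -*-\nname = 'M\u00e4dchen'\n".encode("latin-1")

# In latin-1, the character U+00E4 is stored as the single byte 0xe4.
assert b"\xe4" in latin1_source

# Decoding those bytes as UTF-8 ignores the declaration and fails.
try:
    latin1_source.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the declared encoding succeeds.
decoded = latin1_source.decode("latin-1")
```

The byte 0xe4 looks like the start of a three-byte UTF-8 sequence, so the decoder expects a continuation byte next and rejects the plain ASCII character it finds instead.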

Comments (9)

  1. Kirit Sælensminde

    We've recently seen this error on what appears to be a properly Unicode encoded file :(

    It would be really great if we could at least get the file name that was being processed when the error is thrown. I'd be happy to look into how to do that.

    What do you think the right approach would be? Change the exception type to one that includes that in the error, or try to annotate the existing exception in some way?

  2. Ned Batchelder repo owner

    @Kirit: the problem isn't bad encodings, it's any encoding other than utf-8. Is that your situation? If you think you have a new scenario, attach a file demonstrating the problem.

    The right way to fix the problem is to use the encoding declaration at the top of the file when reading the source.

  3. Kirit Sælensminde

    Hi Ned. The file is UTF-8 and has the encoding declaration at the beginning -- or at least, the file that we think it is. I agree totally that the file needs to be fixed and UTF-8 is the way to go.

    What I'm hoping to do for you though is to get the full file pathname that causes the problem into the exception in some way so that it's clear when the error happens which file needs fixing. I.e. the error might read:

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 14451: invalid continuation byte in file '/home/kirit/Projects/foo/bar/'

  4. Ned Batchelder repo owner

    This is now fixed in f7acbcfe9ca9.

    Kirit: I never wanted you to have to "fix" your source code. If Python accepts it, coverage.py should accept it. I hope you'll find it works better now.
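
The two ideas discussed in this thread, honoring the encoding declaration when reading source and naming the offending file when decoding fails, can be sketched as follows. This is not the code from f7acbcfe9ca9. It is a Python 3 illustration built on the standard library's `tokenize.detect_encoding`, and the names `read_source` and `SourceDecodeError` are hypothetical.

```python
import tokenize


class SourceDecodeError(Exception):
    """Hypothetical error type that names the file that failed to decode."""


def read_source(filename):
    """Read Python source text using the file's own encoding declaration."""
    with open(filename, "rb") as f:
        try:
            # detect_encoding looks for a BOM or a coding declaration in the
            # first two lines and falls back to utf-8 if it finds neither.
            encoding, _ = tokenize.detect_encoding(f.readline)
        except SyntaxError as exc:
            # Raised for an invalid or undecodable declaration.
            raise SourceDecodeError("%s in file %r" % (exc, filename)) from exc
        f.seek(0)
        data = f.read()
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Keep the original message, but say which file needs attention.
        raise SourceDecodeError("%s in file %r" % (exc, filename)) from exc
```

With this approach a latin-1 file with a correct declaration reads cleanly, while a file whose bytes do not match its declared (or default) encoding produces an error that includes the path.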
