Chokes on source files with non-utf-8 encoding

Wolfgang Schnerring created an issue

If you have Python source files that are encoded in something other than utf-8 (e.g. latin-1), the HTML reporter will die like this:

    coverage.main()
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/cmdline.py", line 657, in main
    status = CoverageScript().command_line(argv)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/cmdline.py", line 549, in command_line
    directory=options.directory, **report_args)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/control.py", line 599, in html_report
    reporter.report(morfs, config=self.config)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/html.py", line 83, in report
    self.report_files(self.html_file, morfs, config, config.html_dir)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/report.py", line 86, in report_files
    report_fn(cu, self.coverage._analyze(cu))
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/html.py", line 198, in html_file
    self.write_html(html_path, html)
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/html.py", line 103, in write_html
    write_encoded(fname, html, 'ascii', 'xmlcharrefreplace')
  File "/var/cache/eggs/coverage-3.5.1-py2.6-linux-x86_64.egg/coverage/backward.py", line 137, in write_encoded
    f.write(text.decode('utf8'))
  File "/usr/local/python2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 14451: invalid continuation byte

The workaround is simple, of course: change the file's encoding and its encoding declaration (and you should be using utf-8 anyway). Still, I wonder whether this could be handled more gracefully, with an error message that says what's going on.

Comments (9)

  1. Kirit Sælensminde

    We've recently seen this error on what appears to be a properly Unicode encoded file :(

    It would be really great if we could at least get the file name that was being processed when the error is thrown. I'd be happy to look into how to do that.

    What do you think the right approach would be? Change the exception type to one that includes that in the error, or try to annotate the existing exception in some way?

  2. Ned Batchelder

    kirit: the problem isn't bad encodings, it's any encoding other than utf-8. Is that your situation? If you think you have a new scenario, attach a file demonstrating the problem.

    The right way to fix the problem is to use the encoding declaration at the top of the file when reading the source.

  3. Kirit Sælensminde

    Hi Ned. The file is UTF-8 and has the encoding declaration at the beginning -- or at least, the file that we think it is. I agree totally that the file needs to be fixed and UTF-8 is the way to go.

    What I'm hoping to do for you though is to get the full file pathname that causes the problem into the exception in some way so that it's clear when the error happens which file needs fixing. I.e. the error might read:

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 14451: invalid continuation byte in file '/home/kirit/Projects/foo/bar/baz.py'
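
The two suggestions in this thread (decode the source using its own encoding declaration, and name the offending file when decoding fails) can be sketched together as follows. This is only a minimal illustration, not coverage.py's actual code or fix: it assumes Python 3 (the traceback above is from Python 2.6), and `read_python_source` and `SourceDecodeError` are hypothetical names invented for the example. `tokenize.detect_encoding` is the standard-library helper that honors a BOM or a PEP 263 `coding:` comment.

```python
import tokenize


class SourceDecodeError(Exception):
    """Hypothetical wrapper that names the file that failed to decode."""


def read_python_source(path):
    """Return the text of a Python source file, decoded per its declaration.

    Hypothetical helper for illustration; not part of coverage.py.
    """
    try:
        with open(path, "rb") as f:
            # detect_encoding inspects at most the first two lines for a
            # BOM or a "coding:" comment and falls back to utf-8.
            encoding, _ = tokenize.detect_encoding(f.readline)
            f.seek(0)
            return f.read().decode(encoding)
    except (UnicodeDecodeError, SyntaxError) as exc:
        # Keep the original message, but append the file name so it is
        # clear which file needs fixing.
        raise SourceDecodeError("%s in file %r" % (exc, path)) from exc
```

With something like this, a latin-1 file that declares `# -*- coding: latin-1 -*-` would be read correctly, while an undeclared non-utf-8 file would still fail, but with an error that ends in the file's path.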
    