gettext target can't handle non-ascii files

masklinn created an issue

When attempting to use a template with non-ASCII characters (in my case, a language switcher in layout.html), the export blows up with the following error:

# Sphinx version: 1.2b1
# Python version: 2.7.3
# Docutils version: 0.10 release
# Jinja2 version: 2.6
Traceback (most recent call last):
  File "lib/python2.7/site-packages/sphinx/cmdline.py", line 247, in main
    app.build(force_all, filenames)
  File "lib/python2.7/site-packages/sphinx/application.py", line 211, in build
    self.builder.build_update()
  File "lib/python2.7/site-packages/sphinx/builders/__init__.py", line 211, in build_update
    'out of date' % len(to_build))
  File "lib/python2.7/site-packages/sphinx/builders/gettext.py", line 150, in build
    self._extract_from_template()
  File "lib/python2.7/site-packages/sphinx/builders/gettext.py", line 145, in _extract_from_template
    for line, meth, msg in extract_translations(context):
  File "lib/python2.7/site-packages/jinja2/ext.py", line 209, in _extract
    source = self.environment.parse(source)
  File "lib/python2.7/site-packages/jinja2/environment.py", line 391, in parse
    return self._parse(source, name, filename)
  File "lib/python2.7/site-packages/jinja2/environment.py", line 398, in _parse
    return Parser(self, source, name, _encode_filename(filename)).parse()
  File "lib/python2.7/site-packages/jinja2/parser.py", line 32, in __init__
    self.stream = environment._tokenize(source, name, filename, state)
  File "lib/python2.7/site-packages/jinja2/environment.py", line 429, in _tokenize
    source = self.preprocess(source, name, filename)
  File "lib/python2.7/site-packages/jinja2/environment.py", line 423, in preprocess
    self.iter_extensions(), unicode(source))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 292: ordinal not in range(128)

Looking at the code, a "TODO: encoding" comment is left at gettext.py:144. Jinja2 then blows up when it coerces the template's source, still a byte string at that point, to unicode with the default ASCII codec.
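To make the failure mode concrete, here is a minimal sketch of what the implicit coercion amounts to. It does not use Sphinx or Jinja2; the helper name decode_like_implicit_coercion and the sample string are made up for illustration, and the explicit .decode("ascii") stands in for what unicode(source) does to a byte string on Python 2.

```python
# Bytes of a UTF-8 encoded template fragment (illustrative sample, not the
# real layout.html). The character U+00E7 is encoded as the bytes 0xc3 0xa7.
raw = u"Fran\u00e7ais".encode("utf-8")


def decode_like_implicit_coercion(data):
    """Stand-in for Python 2's unicode(bytestring): decode with the ASCII codec."""
    return data.decode("ascii")


try:
    decode_like_implicit_coercion(raw)
    error = None
except UnicodeDecodeError as exc:
    # Same class of error as in the traceback: the first non-ASCII byte
    # (0xc3 here) cannot be decoded by the ASCII codec.
    error = exc

print(error)
```

Any template containing a multi-byte UTF-8 sequence should hit the same error at its first non-ASCII byte.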

I would suggest assuming UTF-8 rather than ASCII, even if the encoding is not made configurable: Jinja2's template loaders all default to UTF-8, and the Unicode section of its documentation specifically notes:

We recommend utf-8 as Encoding for Python modules and templates as it’s possible to represent every Unicode character in utf-8 and because it’s backwards compatible to ASCII. For Jinja2 the default encoding of templates is assumed to be utf-8.

(emphasis mine)
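A rough sketch of the suggested behaviour, assuming the builder reads each template file itself before handing the source to the extractor. The function read_template_source and the demo file are hypothetical and are not Sphinx's actual code; the point is only that the bytes are decoded as UTF-8 explicitly, so that the text passed on to extract_translations is already unicode.

```python
import io
import os
import tempfile


def read_template_source(path, encoding="utf-8"):
    """Return the template source as text, decoded explicitly.

    Hypothetical helper: the encoding defaults to UTF-8, matching the
    default that Jinja2 assumes for templates.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def demo():
    """Write a small UTF-8 template to a temp file and read it back as text."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(u"<a href='/fr/'>Fran\u00e7ais</a>".encode("utf-8"))
        return read_template_source(path)
    finally:
        os.remove(path)


source = demo()
```

With the source decoded up front, the later coercion to unicode has nothing left to decode, so the ASCII default never comes into play.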

Comments (5)

  1. pcav

    Thanks for this. Is there a roadmap or estimated release date for the fix? BTW, is there any plan to add the language switcher mentioned above as a standard feature? Thanks.
