Issues

Issue #37 resolved

html_error_template doesn't handle non-ascii characters in source code

guest
created an issue

Max Ischenko reported this here:

http://groups.google.com/group/pylons-discuss/browse_frm/thread/22e2af6ba7bbf778

I've attached a patch to convert the source code to unicode via the source code's magic encoding comment (like Myghty does -- in fact I pretty much ripped all of dairiki's code for this). In the rare case that the magic encoding comment doesn't describe the non-ascii characters correctly, the 'replace' error handler is used to add garbage in its place

But before we commit it, I think I would like to see an addition to the Template class.

Pylons currently does this in its exception handler to render the exception:

return mako.exceptions.html_error_template().render()[610:-16]

With this patch, and Max's example, Pylons now causes a UnicodeEncodeError here:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-9: ordinal not in range(128)

What's needed for Pylons to render the html_error_template correctly (including the non-ascii characters showing up in the browser), it now has to do this:

return mako.exceptions.html_error_template().render_unicode().encode(sys.getdefaultencoding(), 'htmlentityreplace')[610:-16]

Mike I may have asked you this before, but is there a reason why mako Templates don't have an 'encoding_errors' parameter to go along with 'output_encoding'?

If it did, we could have html_error_template set encoding_errors='htmlentitiyreplace'. Then anyone needing an html_error_template wouldn't have to be concerned with unicode at all -- they could simply render() (like Pylons is doing now) and always get back a sys.getdefaultencoding() encoded string with html entitys for non-defaultencoding characters. That is what Myghty did:

http://www.myghty.org/trac/changeset/2120

http://www.myghty.org/trac/changeset/2121

I remember arguing this implementation a little with dairiki when he did this, but it seemed to work out pretty well

Comments (3)

  1. Mike Bayer repo owner

    with regards to the patch, why would we go through all that trouble to parse the magic encoding comment when we are receiving a CompileException or SyntaxException ? in both cases, we have already lexed the template and we have already converted the source to unicode. this encoding could be passed along to CompileException and SyntaxException. also "parser" is not defined, and the whole BOM thing if we want to support that (which im not thrilled about, but I can be swayed) should just be part of Lexer.

    the second part of the question, putting encoding_errors as a parameter to Template/TemplateLookup, i would say its not a big deal, the only reason its not there is because there already is a way to convert to unicode explicitly (more than one way to do it...but not a big deal)

  2. Mike Bayer repo owner

    ok so in [changeset:259] im propigating the source of the original template as a unicode object to the `LexerException`/`CompileException`, also `ModuleInfo` can get at the template source and its encoding from the generated module so also returns source as a unicode object. added tests for every version of this issue i could think of (memory based, file based, etc).

  3. Log in to comment