allow usage of non-ascii bytestring literals in templates

Anonymous created an issue

The Mako template parser has a problem, or a weirdness, depending on your view. Basically, it is not possible to compile any template that contains non-ASCII characters inside the ${} code. The problem traces back to the inability of Python's built-in compiler to compile unicode source containing non-ASCII characters. Fixing it would need some kind of encoding juggling inside ast.py (the 'parse' function?) as well as adding a #-*- prefix to the code being compiled there. Alas, I haven't been able to fix this myself (mysterious body-snatcher exceptions pop out), nor do I have enough time to work on it, but I'm sure you get the idea.
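For illustration, here is a minimal sketch of how an encoding declaration changes what the parser accepts. It uses Python 3 and the standard `ast` module rather than Mako's own ast.py, and ISO-8859-2 is an assumption chosen only so the bytes are not valid UTF-8; the byte strings are made up for the example.

```python
import ast

# "ł" (U+0142) is the single byte 0xB3 in ISO-8859-2.
declared = b"# -*- coding: iso-8859-2 -*-\nf('\xb3')\n"
undeclared = b"f('\xb3')\n"

# With the declaration, the parser decodes the literal correctly.
tree = ast.parse(declared)
literals = [node.value for node in ast.walk(tree) if isinstance(node, ast.Constant)]

# Without it, Python 3 assumes UTF-8 and rejects the stray byte.
try:
    ast.parse(undeclared)
    failed_without_declaration = False
except (SyntaxError, UnicodeDecodeError):
    failed_without_declaration = True

print(literals, failed_without_declaration)
```

The same byte source parses or fails depending only on the first line, which is why the declaration has to travel with the code handed to the parser.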

To replicate the problem, just compile "${f('\u0142')}" as a mako template.

I should add that the problem is serious, at least for us, and a showstopper for Mako adoption in our project.

  1. Michael Bayer repo owner

    I've added backslash replacing for non-ASCII characters to code sent for AST parsing within expressions, Python code blocks, and control lines in [changeset:189]. Check the unit tests added in that changeset to get the idea. Note that using non-ASCII characters anywhere in a template requires that the template's encoding be specified at the top via a "magic encoding comment".
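As a rough illustration of backslash replacing in general (not the code from the changeset), the sketch below uses Python's standard `backslashreplace` error handler. The helper name `escape_non_ascii` is hypothetical, and the example is written for Python 3, where a plain string literal interprets `\u` escapes the way a Python 2 `u''` literal does.

```python
def escape_non_ascii(text):
    """Replace each non-ASCII character with a backslash escape (hypothetical helper)."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


expression = "f(u'\u0142')"            # the source text contains a real "ł"
escaped = escape_non_ascii(expression)  # now pure ASCII: f(u'\u0142') spelled out

# The escaped text is safe to hand to the parser; evaluating it restores "ł"
# because the escape sits inside a unicode literal.
value = eval(escaped, {"f": lambda x: x})
print(escaped, value == "\u0142")
```

The escape only round-trips inside a literal that understands `\u`; inside a Python 2 bytestring literal it stays as six literal characters.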

    I'm afraid it's still wrong. Test case:

    import mako.template
    t = u"#-*- encoding:utf-8\n${f('\u0142')}".encode('utf-8')
    te = mako.template.Template(t)
    te.render_unicode(f=lambda x:x)

    returns u'\\u0142', should return u'\u0142' (tested on svn rev 190).

    I'm sorry, I don't understand at this point. Test case:

    import mako.template

    t = u"#-*- encoding:utf-8\n${f('\u0142')}".encode('utf-8')
    te = mako.template.Template(t)
    print te.code
    f = lambda x:x

    assert f('\u0142') == te.render_unicode(f=f)
    print repr(unicode(f('\u0142')))
    print repr(te.render_unicode(f=lambda x:x))

    generated code (if you believe this is incorrect, tell me what it should say - note that all expressions are expected to be str()-able or unicode expressions since they get passed to unicode() unconditionally - use context.write() to bypass this):

    from mako import runtime, filters, cache
    UNDEFINED = runtime.UNDEFINED
    _magic_number = 1
    _modified_time = 1169479868.3539629
    _template_filename=None
    _template_uri='memory:0x63f30'
    _template_cache=cache.Cache(__name__, _modified_time)
    _exports = []

    def render_body(context,pageargs):
        locals = dict(pageargs=pageargs)
        f = context.get('f', UNDEFINED)
        # SOURCE LINE 2
        context.write(unicode(f('\u0142')))
        return ''

    program output - assertion case passes:

    u'\\u0142'
    u'\\u0142'

    Also observe the unit tests added within the changeset, which embed literal multibyte expressions that come out identical to the original.

    Also, try out the attached patch. It breaks all the current unit tests, but I think it's what you are looking for: it basically passes the string straight through and adds the "coding" comment to the top of the generated file. It seems I would essentially have to throw out the whole way Mako does unicode and rewrite it to take this approach.

    OK, it was using cStringIO. This one passes most tests. Again, the basic idea is just spitting out the generated module in the same encoding as what was given. Not sure if it's working all the way, though. I know what you're looking for: the total "straight through" approach without using u"" at all. Not sure if I can get this working totally.

    I guess I introduced confusion with '\u0142' which should actually be u'\u0142' - a subtle but important difference :)

    Now, this assertion should hold, but doesn't:

    assert f(u'\u0142') == te.render_unicode(f=f)

    where te = Template(u"#-*- encoding:utf-8\n${f('\u0142')}".encode('utf-8'))
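To make the difference concrete, here is a small sketch using Python 3 stand-ins for the two Python 2 literals (an assumption for illustration, since Python 3 has no `u''`/`''` split): a `str` literal plays the role of `u'\u0142'`, and a bytes value holding a literal backslash plays the role of `'\u0142'`.

```python
# Python 3 stand-ins for the two Python 2 literals discussed above.
unicode_literal = "\u0142"    # like Python 2's u'\u0142': one character
byte_literal = b"\\u0142"     # like Python 2's '\u0142': six ASCII bytes


def identity(x):
    return x


expected = identity(unicode_literal)
rendered = identity(byte_literal).decode("ascii")  # what unicode() would yield

print(len(expected), len(rendered))
print(expected == rendered)
```

The bytestring form carries the escape as six separate characters, so it can never compare equal to the single character "ł", which is why the assertion above fails.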

    I'm currently reviewing your code and the patch attached and looking for a way to implement what I want. Will keep you updated.

    Ultimately, to make everyone no longer notice that you have to say `u'foo'` and not `'foo'`, we have to make it so that generated modules are in the same encoding as the source file. A lot of weird problems arise when you do this, including that the AST parsing needs to be passed bytestrings instead of unicode objects, which then breaks other stuff, and so on. I don't think it's high priority now, since I'd prefer people to just use unicode objects.
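A very rough sketch of the "same encoding as the source file" idea, assuming Python 3: the generated module is emitted as bytes in the template's encoding with a coding comment on top, then compiled as-is. The helper `build_module`, the simplified `render_body` signature, and the choice of ISO-8859-2 are all hypothetical and do not reflect Mako's actual code generator.

```python
def build_module(expression, encoding):
    """Return generated module source as bytes in `encoding` (hypothetical helper)."""
    source = (
        "# -*- coding: %s -*-\n"
        "def render_body(f):\n"
        "    return str(%s)\n"
    ) % (encoding, expression)
    return source.encode(encoding)


# The expression is passed straight through, with no backslash escaping.
module_bytes = build_module("f('\u0142')", "iso-8859-2")

namespace = {}
exec(compile(module_bytes, "<generated>", "exec"), namespace)
result = namespace["render_body"](lambda x: x)
print(result == "\u0142")
```

Here the compiler, not the template engine, decodes the literal, so the author's bytes survive unchanged into the generated module.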

