Issue #31 new

polib mutilates escape sequences

qx0monster
created an issue

I've just updated to polib 1.0.1, after sticking around with an 0.5.x version for a long time. Great job, D-J! Sorry to re-raise this old issue, today with a slightly different (and hopefully more convincing) phrasing.

polib mutilates valid escape sequences.

To wit, here is a simple test case:

{{{

!python

bash> cat t.po

msgid "" msgstr ""

msgid "unicode: \u00ae; octal: \141; hex: \x61; control: \b \f \v \a" msgstr "" bash> python Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information.

import polib polib.pofile("t.po").save() quit() bash> cat t.po

msgid "" msgstr ""

msgid "unicode: \u00ae; octal: \141; hex: \x61; control: \b \f \v \a" msgstr "" bash> }}}

All escape sequences unknown to polib (ie. outside of \t, \r and \n) get an additional '\' in front of them. This is particular problematic for us in the case of unicode escapes, as they are frequently used to enter hard-to-type characters into msgid's and msgstr's (like the "Registered" character in the sample).

The problem arises as polib unescapes strings on reads (which removes some '\', but leaves them with unknown sequences like '\u...') and escapes on writes/stringify (which unconditionally prefixes unknown escape seq's with another '\').

I thought a lot about it, but to keep a long story short my resolution is to have polib leave unknown escape sequences untouched. We've ran long with this patch in several projects with good results. I probably add more of my considerations as a separate comment.

Here is the pull request: https://bitbucket.org/izi/polib/pull-request/8/removed-escaping-unescaping-of-unknown

Comments (11)

  1. qx0monster reporter

    To add a bit more rational to this issue:

    • Compatibility with gettext tools: The gettext parser supports (a) octal numbers of the form '\ooo', (b) hex numbers of the form '\x...' and (c) some additional control seq's like '\b', '\f', '\v' and '\a' (while most of the latter are warned about as "not suitable" for internationalization efforts).
    • The gettext parser is strict, in that it does not allow unknown escape sequences (it bombs). It also seems to replace escape sequences with their binary equivalent, e.g. '\x61' is replaced with 'a'. Leaving octal, hex and the other allowed control seq's untouched seems like a good interop measure, for people using both polib and gettext tools.
    • Unicode support: gettext tools don't support unicode escapes (Funnily, there is a code comment in po-lex.c, saying "FIXME: \u and \U are not handled"! http://git.savannah.gnu.org/cgit/gettext.git/tree/gettext-tools/src/po-lex.c?id=b75a50bc5f30666fb36469a90245e1d7856d672c#n851). But as I said support for unicode escapes is crucial for us, and I would appreciate if polib was getting ahead of gettext in this respect. Being just lenient with unknown escape seq's would do the job.
  2. David Jean Louis repo owner

    Hi, thanks for the explanations, it makes perfect sense indeed. That said, doing this by default will totally break backwards compatibility, so I'm +1 on this if it's an explicit option. Regards.

  3. qx0monster reporter

    Fine with me. Where would this option be set? On the POFile/_BaseFile object? Will you add it to the params of the pofile() factory?

  4. David Jean Louis repo owner

    I think this should be done in polib.pofile and polib.mofile, but also in polib.POFile and polib.MOFile classes, as it's done for the check_for_duplicates option.

  5. Jakub Wilk

    The correct solution is to fix the parser to decode all escape sequences (and possibly signal an error on unknown ones).

    AFAICS it is true that with qx0monster's patch polib indeed round-trips properly. It's just the Python objects that are supposed to represent the PO file contents don't make sense. For example, for this file:

    msgid ""
    msgstr ""
    
    msgid "\\n"
    msgstr ""
    

    I get:

    >>> polib.pofile('t.po')[0].msgid[-1].isspace()
    True
    

    which is wrong: the msgid didn't contain any whitespace.

  6. qx0monster reporter

    Sorry, I'm out of this. I will not be able to provide a patch as discussed, so please don't wait for me. Close or proceed as you see fit.

  7. Log in to comment