Issue #31 new

polib mutilates escape sequences

created an issue

I've just updated to polib 1.0.1, after sticking with a 0.5.x version for a long time. Great job, D-J! Sorry to re-raise this old issue, this time with a slightly different (and hopefully more convincing) phrasing.

polib mutilates valid escape sequences.

To wit, here is a simple test case:



{{{
bash> cat t.po
msgid ""
msgstr ""

msgid "unicode: \u00ae; octal: \141; hex: \x61; control: \b \f \v \a"
msgstr ""
bash> python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import polib
>>> polib.pofile("t.po").save()
>>> quit()
bash> cat t.po
msgid ""
msgstr ""

msgid "unicode: \\u00ae; octal: \\141; hex: \\x61; control: \\b \\f \\v \\a"
msgstr ""
bash>
}}}

All escape sequences unknown to polib (i.e. anything other than \t, \r and \n) get an additional '\' in front of them. This is particularly problematic for us in the case of unicode escapes, as they are frequently used to enter hard-to-type characters into msgids and msgstrs (like the "Registered" character in the sample).

The problem arises because polib unescapes strings on read (which decodes the sequences it knows, but leaves unknown sequences like '\u...' with their '\' intact) and escapes them on write/stringify (which unconditionally prefixes every remaining '\', including those of unknown escape sequences, with another '\').
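To make the asymmetry concrete, here is a deliberately simplified Python model of the read/write behaviour described above. It is not polib's actual code: the function names `unescape_on_read` and `escape_on_write` and the exact set of known escapes are assumptions made for illustration.

```python
# Simplified model of the behaviour described in this issue.
# NOT polib's real implementation; names and escape set are assumptions.

def unescape_on_read(s):
    # Decode only the escapes the model knows about; anything else
    # (for example \u00ae) keeps its backslash.
    return (s.replace('\\t', '\t')
             .replace('\\r', '\r')
             .replace('\\n', '\n'))


def escape_on_write(s):
    # Double every backslash unconditionally, then re-encode the
    # known control characters.
    return (s.replace('\\', '\\\\')
             .replace('\t', '\\t')
             .replace('\r', '\\r')
             .replace('\n', '\\n'))


raw = r'unicode: \u00ae; tab: \t'   # text as it appears between the quotes
in_memory = unescape_on_read(raw)   # \t is decoded, \u00ae is left alone
written = escape_on_write(in_memory)
print(written)                      # unicode: \\u00ae; tab: \t
```

The known escape `\t` survives the round trip, while the unknown `\u00ae` comes back with a doubled backslash.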

I thought a lot about it, but to keep a long story short, my resolution is to have polib leave unknown escape sequences untouched. We've been running this patch in several projects for a long time, with good results. I will probably add more of my considerations in a separate comment.
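As a rough idea of what "leave unknown escape sequences untouched" could mean in code, here is a minimal sketch. It is not the patch from the pull request: `lenient_unescape`, `lenient_escape` and the two lookup tables are hypothetical names, and the set of known escapes is an assumption.

```python
import re

# Hypothetical helpers: neither polib's API nor the actual patch.
_DECODE = {'t': '\t', 'r': '\r', 'n': '\n', '"': '"', '\\': '\\'}
_ENCODE = {'\t': '\\t', '\r': '\\r', '\n': '\\n', '"': '\\"'}


def lenient_unescape(s):
    # Single pass over "\<char>" pairs: decode the known ones and
    # keep every unknown sequence verbatim.
    return re.sub(r'\\(.)',
                  lambda m: _DECODE.get(m.group(1), m.group(0)),
                  s, flags=re.S)


def lenient_escape(s):
    def repl(m):
        ch = m.group(0)
        if ch != '\\':
            return _ENCODE[ch]
        # Double a backslash only when leaving it alone would form a
        # known escape on the next read (or dangle at the end).
        nxt = s[m.end():m.end() + 1]
        must_double = nxt == '' or nxt in _DECODE or nxt in _ENCODE
        return '\\\\' if must_double else '\\'
    return re.sub(r'[\\\t\r\n"]', repl, s)


raw = r'unicode: \u00ae; octal: \141; hex: \x61; control: \b \f \v \a; tab: \t'
print(lenient_escape(lenient_unescape(raw)) == raw)
```

One limit of this sketch: a file that spells a literal backslash before an unknown letter as `\\u` would be written back as `\u`, since both forms map to the same in-memory string.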

Here is the pull request:

Comments (11)

  1. qx0monster reporter

    To add a bit more rationale to this issue:

    • Compatibility with gettext tools: The gettext parser supports (a) octal numbers of the form '\ooo', (b) hex numbers of the form '\x...' and (c) some additional control seq's like '\b', '\f', '\v' and '\a' (while most of the latter are warned about as "not suitable" for internationalization efforts).
    • The gettext parser is strict, in that it does not allow unknown escape sequences (it bombs). It also seems to replace escape sequences with their binary equivalent, e.g. '\x61' is replaced with 'a'. Leaving octal, hex and the other allowed control seq's untouched seems like a good interop measure, for people using both polib and gettext tools.
    • Unicode support: gettext tools don't support unicode escapes. (Funnily, there is a code comment in po-lex.c saying "FIXME: \u and \U are not handled"!) But as I said, support for unicode escapes is crucial for us, and I would appreciate it if polib got ahead of gettext in this respect. Just being lenient with unknown escape seq's would do the job.
  2. David Jean Louis repo owner

    Hi, thanks for the explanations, it makes perfect sense indeed. That said, doing this by default will totally break backwards compatibility, so I'm +1 on this if it's an explicit option. Regards.

  3. Jakub Wilk

    The correct solution is to fix the parser to decode all escape sequences (and possibly signal an error on unknown ones).

    AFAICS it is true that polib round-trips properly with qx0monster's patch. It's just that the Python objects that are supposed to represent the PO file contents don't make sense. For example, for this file:

    msgid ""
    msgstr ""
    msgid "\\n"
    msgstr ""

    I get:

    >>> polib.pofile('t.po')[0].msgid[-1].isspace()
    True

    which is wrong: the msgid didn't contain any whitespace.
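A minimal sketch of the stricter approach suggested in comment 3: decode every escape in a single left-to-right pass and signal an error on unknown ones. The function name `strict_unescape`, the escape table, the limit of two hex digits, and the choice to map octal/hex values to code points with `chr` are all assumptions for illustration; this is neither polib's nor gettext's implementation.

```python
import re

# Hypothetical strict decoder; table and error policy are assumptions.
_SIMPLE = {'n': '\n', 't': '\t', 'r': '\r', 'b': '\b', 'f': '\f',
           'v': '\v', 'a': '\a', '\\': '\\', '"': '"'}

# One backslash followed by: 1-3 octal digits, or x + 1-2 hex digits,
# or any single character (possibly none at the end of the string).
_ESCAPE = re.compile(r'\\(?:([0-7]{1,3})|x([0-9A-Fa-f]{1,2})|(.?))', re.S)


def strict_unescape(s):
    def repl(m):
        octal, hexa, simple = m.groups()
        if octal is not None:
            return chr(int(octal, 8))
        if hexa is not None:
            return chr(int(hexa, 16))
        if simple in _SIMPLE:
            return _SIMPLE[simple]
        raise ValueError('unknown escape sequence: %r' % m.group(0))
    return _ESCAPE.sub(repl, s)


decoded = strict_unescape(r'\\n')       # file text: backslash, backslash, n
print(decoded[-1].isspace())            # False
print(strict_unescape(r'octal: \141; hex: \x61'))  # octal: a; hex: a
```

Because each `\\` is consumed as one unit, the `n` that follows it is never mistaken for the start of a newline escape. In this sketch `\u00ae` raises `ValueError`; supporting it would mean adding a `\uXXXX` branch rather than passing it through.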
