Issue #6 resolved

__str__() methods are returning unicode instead of str

Anonymous created an issue

Hello,

I was looking at this issue in django-rosetta, which uses polib: http://code.google.com/p/django-rosetta/issues/detail?id=75

I tracked it down to a bug in polib that causes various __str__() methods to return unicode instead of str. Calling str() on such an object raises an exception when str() tries to encode the result using the default ascii codec.

This bug is caused by polib's assumption that codecs.open() returns a generator of str objects. In fact, it generates unicode objects when you give codecs.open() an encoding parameter (this is the case in Python 2.5). The unicode type then gets propagated through string formatting and joining until it's returned by a __str__() method.
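To make the behaviour described above easy to check, here is a minimal sketch. The temporary file, the sample msgid line and the helper names (`read_lines`, `demo`, `text_type`) are made up for illustration and are not part of polib. `type(u"")` is `unicode` on Python 2 and `str` on Python 3, so the same check works on either version.

```python
import codecs
import os
import tempfile

# The text type on either major version:
# `unicode` on Python 2, `str` on Python 3.
text_type = type(u"")


def read_lines(path, encoding):
    """Return the lines that codecs.open() yields for `path`."""
    fhandle = codecs.open(path, "r", encoding=encoding)
    try:
        return list(fhandle)
    finally:
        fhandle.close()


def demo():
    """Write a one-line sample file as raw bytes, then read it back."""
    fd, path = tempfile.mkstemp(suffix=".po")
    try:
        with os.fdopen(fd, "wb") as raw:
            raw.write(u'msgid "caf\u00e9"\n'.encode("utf-8"))
        return read_lines(path, "utf-8")
    finally:
        os.remove(path)


lines = demo()
# Every line is already decoded text, not an encoded byte string.
all_text = all(isinstance(line, text_type) for line in lines)
```

On Python 2 this means every line is a unicode object, which is what ends up inside the strings built by the __str__() methods.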

The quick fix would be to encode the string as it comes out of codecs.open(). Something like this:

(Ignore the line numbers, the version I'm using is the one shipped with django-rosetta.)

{{{
#!python
--- polib.py.old 2010-06-15 10:31:03.000000000 +0100
+++ polib.py 2010-06-15 12:20:40.000000000 +0100
@@ -1110,6 +1110,7 @@
         """
         i, lastlen = 1, 0
         for line in self.fhandle:
+            line = line.encode("utf8")
             line = line.strip()
             if line == '':
                 i = i+1
}}}

Rick S

Comments (9)

  1. David Jean Louis repo owner

    Hi,

    The patch you are proposing is not OK, for several reasons:

    • polib treats all strings as unicode internally, so the encode() must be done as late as possible,
    • the po file's encoding is not always "utf-8".

    But you are right, the __str__() method is not bulletproof... Can you tell me if the problem is solved with the attached patch?

    Regards,

    -- David

  2. Anonymous

    Surprise surprise, I just found out that the patch breaks file saving. The file object returned by codecs.open() with an encoding is different from the object returned by open(). Both read() and write() methods are expecting unicode. Rosetta's download feature writes to StringIO instead, which is why my patch worked there.
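A minimal sketch of why encoding early clashes with the codecs file object: the writer returned by codecs.open() with an encoding takes unicode text and does the encoding itself. The helper names (`write_text`, `roundtrip`), the sample line and the choice of iso-8859-1 are assumptions for illustration only, not polib code.

```python
import codecs
import os
import tempfile


def write_text(path, text, encoding):
    """Write unicode `text` to `path`; the codecs writer does the encoding."""
    fhandle = codecs.open(path, "w", encoding=encoding)
    try:
        fhandle.write(text)
    finally:
        fhandle.close()


def roundtrip(text, encoding):
    """Write `text` through codecs.open() and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix=".po")
    os.close(fd)
    try:
        write_text(path, text, encoding)
        with open(path, "rb") as raw:
            return raw.read()
    finally:
        os.remove(path)


text = u'msgstr "caf\u00e9"\n'
data = roundtrip(text, "iso-8859-1")
# The bytes on disk are the text encoded with the file's encoding,
# which is not necessarily utf-8.
matches_latin1 = data == text.encode("iso-8859-1")
```

So anything handed to write() should still be unicode at that point; a line that was already encoded to utf-8 on the way in no longer fits.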

    If polib treats strings as unicode (which is nice), perhaps it would be worth the effort to prefix all string literals with u? That would make it more obvious what we're dealing with. More importantly, __str__() should definitely return str objects. So maybe rename all __str__() to __unicode__()?
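A rough sketch of what that could look like. `Entry` here is a hypothetical stand-in, not polib's actual class, and the `encoding` attribute is an assumption about where the po file's declared encoding would be stored. The text is built in __unicode__() and only encoded at the last moment in __str__().

```python
import sys

PY2 = sys.version_info[0] == 2


class Entry(object):
    """Hypothetical stand-in for a po entry; not polib's actual class."""

    def __init__(self, msgid, encoding="utf-8"):
        self.msgid = msgid        # always kept as unicode text internally
        self.encoding = encoding  # encoding declared by the po file (assumed)

    def __unicode__(self):
        # All formatting happens on text, with u-prefixed literals.
        return u'msgid "%s"' % self.msgid

    if PY2:
        def __str__(self):
            # Encode as late as possible, using the file's own encoding.
            return self.__unicode__().encode(self.encoding)
    else:
        # On Python 3, str is already the text type.
        __str__ = __unicode__


entry = Entry(u"caf\u00e9", encoding="iso-8859-1")
as_str = str(entry)
```

With this layout str(entry) always gets a real str back, and unicode(entry) on Python 2 gives the unencoded text.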

    Rick

  3. David Jean Louis repo owner

    Argh... Bitbucket doesn't send me email updates (maybe because you're reporting as anonymous, not sure).

    Anyway, sorry for the delay. So: does the patch solve the bug?

    Thanks !

    -- David

  4. Rick Shi
    • changed status to open

    Hi, I was too lazy to create an account...

    Your patch is giving me this error: "__str__() takes exactly 1 argument (2 given)". I tried to debug it, but I get a Python crash (!!) if I try to print an entry. The exception is thrown in _BaseFile.__str__(). Can you take another look? Sorry that I don't have much time to spend on it.

    Rick
