Unicode type conversion conflicts with MySQLs own unicode client conversion, due to unknown mysql setting

Issue #340 resolved
Former user created an issue

There is something wrong with Unicode column support in MySQL:

Current test/sql/testtypes.py pass without errors, however, if you change test string to contain Polish national characters like this:

rawdata = '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'

tests keep passing on Postgres but start failing on MySQL (both 4 and 5), with following tracebacks:

======================================================================
ERROR: testbasic (__main__.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test\sql\testtypes.py", line 144, in testbasic
    self.echo(repr(x['unicode_data']('unicode_data')))
  File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__
    return self.__parent._get_col(self.__row, key)
  File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col
    return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect)
  File "./lib/sqlalchemy\types.py", line 187, in convert_result_value
    return value.decode(dialect.encoding)
  File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data

======================================================================
ERROR: testengineparam (__main__.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test\sql\testtypes.py", line 163, in testengineparam
    self.echo(repr(x['unicode_data']('unicode_data')))
  File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__
    return self.__parent._get_col(self.__row, key)
  File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col
    return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect)
  File "./lib/sqlalchemy\types.py", line 187, in convert_result_value
    return value.decode(dialect.encoding)
  File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data

----------------------------------------------------------------------

I've tested this a little more, and it seems completely wrong characters end up in MySQL database - possibly result of doubled attempt at utf8 encoding (once by sqlalchemy and once by mysql client). Unfortunately i do not understand sqlalchemy internals enough to completely diagnose and fix the problem.

Anyone willing to help?

Comments (13)

  1. Mike Bayer repo owner

    youre almost certainly using the "convert_unicode" flag in your create_engine() call. this flag is not compatible with the Unicode type, since it will produce exactly the behavior youve observed (i.e. two client layers both encoding on the way in).

    using a regular MySQL URL:

    patched unittest:

    Index: test/sql/testtypes.py
    ===================================================================
    --- test/sql/testtypes.py       (revision 1985)
    +++ test/sql/testtypes.py       (working copy)
    @@ -136,7 +136,8 @@
         def tearDownAll(self):
             unicode_table.drop()
         def testbasic(self):
    -        rawdata = 'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9. Elle disait: \xc2\xab S\xe2\x80\x99il vous pla\xc3\xaet\xe2\x80\xa6 dessine-moi un mouton! \xc2\xbb\n'
    +        rawdata = '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'
    +#        rawdata = 'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9. Elle disait: \xc2\xab S\xe2\x80\x99il vous pla\xc3\xaet\xe2\x80\xa6 dessine-moi un mouton! \xc2\xbb\n'
             unicodedata = rawdata.decode('utf-8')
             unicode_table.insert().execute(unicode_data=unicodedata, plain_data=rawdata)
             x = unicode_table.select().execute().fetchone()
    @@ -154,7 +155,8 @@
             prev_unicode = db.engine.dialect.convert_unicode
             try:
                 db.engine.dialect.convert_unicode = True
    -            rawdata = 'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9. Elle disait: \xc2\xab S\xe2\x80\x99il vous pla\xc3\xaet\xe2\x80\xa6 dessine-moi un mouton! \xc2\xbb\n'
    +            rawdata = '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'
    +#            rawdata = 'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9. Elle disait: \xc2\xab S\xe2\x80\x99il vous pla\xc3\xaet\xe2\x80\xa6 dessine-moi un mouton! \xc2\xbb\n'
                 unicodedata = rawdata.decode('utf-8')
                 unicode_table.insert().execute(unicode_data=unicodedata, plain_data=rawdata)
                 x = unicode_table.select().execute().fetchone()
    

    output:

    python test/sql/testtypes.py --db mysql --verbose --log-info=sqlalchemy.engine  UnicodeTest
    INFO:sqlalchemy.engine.base.Engine.0x..70:
    CREATE TABLE unicode_table (
            id INTEGER NOT NULL AUTO_INCREMENT, 
            unicode_data VARCHAR(250), 
            plain_data VARCHAR(250), 
            PRIMARY KEY (id)
    )
    
    
    INFO:sqlalchemy.engine.base.Engine.0x..70:None
    INFO:sqlalchemy.engine.base.Engine.0x..70:COMMIT
    testbasic (__main__.UnicodeTest) ... INFO:sqlalchemy.engine.base.Engine.0x..70:INSERT INTO unicode_table (unicode_data, plain_data) VALUES (%s, %s)
    INFO:sqlalchemy.engine.base.Engine.0x..70:['\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86']('\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86',)
    INFO:sqlalchemy.engine.base.Engine.0x..70:COMMIT
    INFO:sqlalchemy.engine.base.Engine.0x..70:SELECT unicode_table.id, unicode_table.unicode_data, unicode_table.plain_data 
    FROM unicode_table
    INFO:sqlalchemy.engine.base.Engine.0x..70:[(__main__.UnicodeTest) ... INFO:sqlalchemy.engine.base.Engine.0x..70:INSERT INTO unicode_table (unicode_data, plain_data) VALUES (%s, %s)
    INFO:sqlalchemy.engine.base.Engine.0x..70:['\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86', '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'](]
    u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106'
    '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'
    ok
    testengineparam)
    INFO:sqlalchemy.engine.base.Engine.0x..70:COMMIT
    INFO:sqlalchemy.engine.base.Engine.0x..70:SELECT unicode_table.id, unicode_table.unicode_data, unicode_table.plain_data 
    FROM unicode_table
    INFO:sqlalchemy.engine.base.Engine.0x..70:[]
    u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106'
    u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106'
    ok
    INFO:sqlalchemy.engine.base.Engine.0x..70:
    DROP TABLE unicode_table
    INFO:sqlalchemy.engine.base.Engine.0x..70:None
    INFO:sqlalchemy.engine.base.Engine.0x..70:COMMIT
    
    ----------------------------------------------------------------------
    Ran 2 tests in 0.160s
    
    OK
    
  2. Former user Account Deleted

    This is what i get using the patched unittest:

    C:\svn\sqlalchemy>python test/sql/testtypes.py --dburi=mysql://test:test@devel/test_mszumocki --verbose --log-info=sqlalchemy.engine  UnicodeTest
    INFO:sqlalchemy.engine.base.Engine.0x..10:
    CREATE TABLE unicode_table (
            id INTEGER NOT NULL AUTO_INCREMENT,
            unicode_data VARCHAR(250),
            plain_data VARCHAR(250),
            PRIMARY KEY (id)
    )
    
    
    INFO:sqlalchemy.engine.base.Engine.0x..10:None
    INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT
    testbasic (__main__.UnicodeTest) ... INFO:sqlalchemy.engine.base.Engine.0x..10:INSERT INTO unicode_table (unicode_data, plain_data) VALUES (%s, %s)
    INFO:sqlalchemy.engine.base.Engine.0x..10:['\xc4\x85\x
    c4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86']('\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86',)
    INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT
    INFO:sqlalchemy.engine.base.Engine.0x..10:SELECT unicode_table.id, unicode_table.unicode_data, unicode_table.plain_data
    FROM unicode_table
    INFO:sqlalchemy.engine.base.Engine.0x..10:[(__main__.UnicodeTest) ... INFO:sqlalchemy.engine.base.Engine.0x..10:INSERT INTO unicode_table (unicode_data, plain_data) VALUES (%s, %s)
    INFO:sqlalchemy.engine.base.Engine.0x..10:['\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86', '\xc4\x85\x
    c4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'](]
    ERROR
    testengineparam)
    INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT
    INFO:sqlalchemy.engine.base.Engine.0x..10:SELECT unicode_table.id, unicode_table.unicode_data, unicode_table.plain_data
    FROM unicode_table
    INFO:sqlalchemy.engine.base.Engine.0x..10:[TABLE unicode_table
    INFO:sqlalchemy.engine.base.Engine.0x..10:None
    INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT
    
    ======================================================================
    ERROR: testbasic (__main__.UnicodeTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test/sql/testtypes.py", line 144, in testbasic
        self.echo(repr(x['unicode_data'](]
    ERROR
    INFO:sqlalchemy.engine.base.Engine.0x..10:
    DROP)))
      File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__
        return self.__parent._get_col(self.__row, key)
      File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col
        return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect)
      File "./lib/sqlalchemy\types.py", line 187, in convert_result_value
        return value.decode(dialect.encoding)
      File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data
    
    ======================================================================
    ERROR: testengineparam (__main__.UnicodeTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test/sql/testtypes.py", line 163, in testengineparam
        self.echo(repr(x['unicode_data']('unicode_data')))
      File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__
        return self.__parent._get_col(self.__row, key)
      File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col
        return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect)
      File "./lib/sqlalchemy\types.py", line 187, in convert_result_value
        return value.decode(dialect.encoding)
      File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data
    
    ----------------------------------------------------------------------
    Ran 2 tests in 0.079s
    
    FAILED (errors=2)
    

    Python 2.4 #60, MySQLdb 1.2.0 Any idea what might be different with my enviromnent?

  3. Mike Bayer repo owner

    MySQL client library is using utf8 ? if so, then yes thats the same as using that setting. this would imply you dont have to use the Unicode datatype.

    the only transparent solution i can see to this would be an extra flag on the mysql dialect that sends a message to the Unicode and String types not to do any conversion.

  4. Former user Account Deleted

    I meant MySQL server is set up to use utf8 as default character set/collation. After some more testing i don't think that matters though.

    What would i need to do to implement the solution you mentioned?

  5. Mike Bayer repo owner

    it would involve changing at least Unicode, and possibly String, in types.py to call an extra flag on the dialect "should_encode" or something like that, before it does type conversion (at least incoming).

    but this solution is not very transparent anyway. also, I would very much prefer to understand exactly why your database is doing what it is....can you do a test with plain mysqldb code (i.e. dbapi code) and see if it is in fact doing unicode conversion already ?

    im assuming simply not using Unicode types is not an option ?

  6. Former user Account Deleted

    This is going to be long so have patience please ;)

    One mistake on my part earlier: patched test failed on MySQL 4 only, it worked fine on MySQL 5 with utf8 default charset. However, when i've modified testtypes.py not to delete the temporary table in UnicodeTest:

        def tearDownAll(self):
            #unicode_table.drop()
            pass
    

    i noticed that the unicode_table actually contains incorrect characters in unicode_data field.

    I tested it both with mysql GUI admin and following simple dbapi program:

    #!/usr/bin/env python
    import MySQLdb as mysql
    conn = mysql.connect(host='bazy', user='test', passwd='test', db='test_mszumocki')#, charset='utf8')
    cursor = conn.cursor()
    try:
        cursor.execute('select unicode_data from unicode_table')
        for row in cursor.fetchone():
            print repr(row)
            if isinstance(row, unicode): print repr(row.encode('utf8'))
    finally:
        cursor.close()
    conn.close()
    

    Results from the above program (this is what actually gets put into database):

    u'\xc4\u2026\xc4\u2021\xc3\xb3\xc5\u201a\xc5\u201e\xc3\xb3\xc5\x81\xc5\u0192\xc3\u201c\xc4\u201e\xc5\u0161\xc4\u2020'
    '\xc3\x84\xe2\x80\xa6\xc3\x84\xe2\x80\xa1\xc3\x83\xc2\xb3\xc3\x85\xe2\x80\x9a\xc3\x85\xe2\x80\x9e\xc3\x83\xc2\xb3\xc3
    \x85\xc2\x81\xc3\x85\xc6\x92\xc3\x83\xe2\x80\x9c\xc3\x84\xe2\x80\x9e\xc3\x85\xc5\xa1\xc3\x84\xe2\x80\xa0'
    

    Results from the above program when i commented out the "charset='utf8'" part:

    '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'
    

    This looks like the correct data. Only, it come from incorrect one that's sitting in the database.

    Results from testtypes.py UnicodeTest.test_engine_param (this is what is supposed to be in database, and what testcase thinks is in database):

    u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106'
    u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106'
    

    So whatever way and whenever character mangling happens, it's symmetrical on insert and on select.

    This might be relevant: http://sourceforge.net/tracker/index.php?func=detail&aid=1521274&group_id=22307&atid=374932

    If it is relevant (i'm not quite sure myself), it's not fixed in newest MySQLdb (1.2.2b2).

    Any hints on what to try next?

  7. Former user Account Deleted

    With new MySQLDb 1.2.2, everything seems to be working ok, as long as you connect to mysql with use_unicode=True and charset=utf8.

    I believe this ticket can be closed now.

  8. Log in to comment