- changed status to wontfix
Unicode type conversion conflicts with MySQLs own unicode client conversion, due to unknown mysql setting
There is something wrong with Unicode column support in MySQL:
Current test/sql/testtypes.py pass without errors, however, if you change test string to contain Polish national characters like this:
rawdata = '\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'
tests keep passing on Postgres but start failing on MySQL (both 4 and 5), with following tracebacks:
======================================================================
ERROR: testbasic (__main__.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test\sql\testtypes.py", line 144, in testbasic
self.echo(repr(x['unicode_data']('unicode_data')))
File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__
return self.__parent._get_col(self.__row, key)
File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col
return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect)
File "./lib/sqlalchemy\types.py", line 187, in convert_result_value
return value.decode(dialect.encoding)
File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data
======================================================================
ERROR: testengineparam (__main__.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test\sql\testtypes.py", line 163, in testengineparam
self.echo(repr(x['unicode_data']('unicode_data')))
File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__
return self.__parent._get_col(self.__row, key)
File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col
return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect)
File "./lib/sqlalchemy\types.py", line 187, in convert_result_value
return value.decode(dialect.encoding)
File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data
----------------------------------------------------------------------
I've tested this a little more, and it seems completely wrong characters end up in MySQL database - possibly result of doubled attempt at utf8 encoding (once by sqlalchemy and once by mysql client). Unfortunately i do not understand sqlalchemy internals enough to completely diagnose and fix the problem.
Anyone willing to help?
Comments (13)
-
repo owner -
Account Deleted This is what i get using the patched unittest:
C:\svn\sqlalchemy>python test/sql/testtypes.py --dburi=mysql://test:test@devel/test_mszumocki --verbose --log-info=sqlalchemy.engine UnicodeTest INFO:sqlalchemy.engine.base.Engine.0x..10: CREATE TABLE unicode_table ( id INTEGER NOT NULL AUTO_INCREMENT, unicode_data VARCHAR(250), plain_data VARCHAR(250), PRIMARY KEY (id) ) INFO:sqlalchemy.engine.base.Engine.0x..10:None INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT testbasic (__main__.UnicodeTest) ... INFO:sqlalchemy.engine.base.Engine.0x..10:INSERT INTO unicode_table (unicode_data, plain_data) VALUES (%s, %s) INFO:sqlalchemy.engine.base.Engine.0x..10:['\xc4\x85\x c4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86']('\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86',) INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT INFO:sqlalchemy.engine.base.Engine.0x..10:SELECT unicode_table.id, unicode_table.unicode_data, unicode_table.plain_data FROM unicode_table INFO:sqlalchemy.engine.base.Engine.0x..10:[(__main__.UnicodeTest) ... INFO:sqlalchemy.engine.base.Engine.0x..10:INSERT INTO unicode_table (unicode_data, plain_data) VALUES (%s, %s) INFO:sqlalchemy.engine.base.Engine.0x..10:['\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86', '\xc4\x85\x c4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'](] ERROR testengineparam) INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT INFO:sqlalchemy.engine.base.Engine.0x..10:SELECT unicode_table.id, unicode_table.unicode_data, unicode_table.plain_data FROM unicode_table INFO:sqlalchemy.engine.base.Engine.0x..10:[TABLE unicode_table INFO:sqlalchemy.engine.base.Engine.0x..10:None INFO:sqlalchemy.engine.base.Engine.0x..10:COMMIT ====================================================================== ERROR: testbasic (__main__.UnicodeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "test/sql/testtypes.py", line 144, in testbasic self.echo(repr(x['unicode_data'](] ERROR INFO:sqlalchemy.engine.base.Engine.0x..10: DROP))) File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__ return self.__parent._get_col(self.__row, key) File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect) File "./lib/sqlalchemy\types.py", line 187, in convert_result_value return value.decode(dialect.encoding) File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data ====================================================================== ERROR: testengineparam (__main__.UnicodeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "test/sql/testtypes.py", line 163, in testengineparam self.echo(repr(x['unicode_data']('unicode_data'))) File "./lib/sqlalchemy\engine\base.py", line 718, in __getitem__ return self.__parent._get_col(self.__row, key) File "./lib/sqlalchemy\engine\base.py", line 625, in _get_col return rec[0](0).dialect_impl(self.dialect).convert_result_value(row[rec[1](rec[1)], self.dialect) File "./lib/sqlalchemy\types.py", line 187, in convert_result_value return value.decode(dialect.encoding) File "C:\Python24\lib\encodings\utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode bytes in position 12-13: invalid data ---------------------------------------------------------------------- Ran 2 tests in 0.079s FAILED (errors=2)
Python 2.4
#60, MySQLdb 1.2.0 Any idea what might be different with my enviromnent? -
Account Deleted Note - MySQL is set up to use utf8 as default character set. Might that be it?
-
repo owner MySQL client library is using utf8 ? if so, then yes thats the same as using that setting. this would imply you dont have to use the Unicode datatype.
the only transparent solution i can see to this would be an extra flag on the mysql dialect that sends a message to the Unicode and String types not to do any conversion.
-
Account Deleted I meant MySQL server is set up to use utf8 as default character set/collation. After some more testing i don't think that matters though.
What would i need to do to implement the solution you mentioned?
-
repo owner it would involve changing at least Unicode, and possibly String, in types.py to call an extra flag on the dialect "should_encode" or something like that, before it does type conversion (at least incoming).
but this solution is not very transparent anyway. also, I would very much prefer to understand exactly why your database is doing what it is....can you do a test with plain mysqldb code (i.e. dbapi code) and see if it is in fact doing unicode conversion already ?
im assuming simply not using Unicode types is not an option ?
-
repo owner - changed status to open
-
repo owner - removed status
-
repo owner - changed title to Unicode type conversion conflicts with MySQLs own unicode client conversion, due to unknown mysql setting
-
Account Deleted This is going to be long so have patience please ;)
One mistake on my part earlier: patched test failed on MySQL 4 only, it worked fine on MySQL 5 with utf8 default charset. However, when i've modified testtypes.py not to delete the temporary table in UnicodeTest:
def tearDownAll(self): #unicode_table.drop() pass
i noticed that the unicode_table actually contains incorrect characters in unicode_data field.
I tested it both with mysql GUI admin and following simple dbapi program:
#!/usr/bin/env python import MySQLdb as mysql conn = mysql.connect(host='bazy', user='test', passwd='test', db='test_mszumocki')#, charset='utf8') cursor = conn.cursor() try: cursor.execute('select unicode_data from unicode_table') for row in cursor.fetchone(): print repr(row) if isinstance(row, unicode): print repr(row.encode('utf8')) finally: cursor.close() conn.close()
Results from the above program (this is what actually gets put into database):
u'\xc4\u2026\xc4\u2021\xc3\xb3\xc5\u201a\xc5\u201e\xc3\xb3\xc5\x81\xc5\u0192\xc3\u201c\xc4\u201e\xc5\u0161\xc4\u2020' '\xc3\x84\xe2\x80\xa6\xc3\x84\xe2\x80\xa1\xc3\x83\xc2\xb3\xc3\x85\xe2\x80\x9a\xc3\x85\xe2\x80\x9e\xc3\x83\xc2\xb3\xc3 \x85\xc2\x81\xc3\x85\xc6\x92\xc3\x83\xe2\x80\x9c\xc3\x84\xe2\x80\x9e\xc3\x85\xc5\xa1\xc3\x84\xe2\x80\xa0'
Results from the above program when i commented out the "charset='utf8'" part:
'\xc4\x85\xc4\x87\xc3\xb3\xc5\x82\xc5\x84\xc3\xb3\xc5\x81\xc5\x83\xc3\x93\xc4\x84\xc5\x9a\xc4\x86'
This looks like the correct data. Only, it come from incorrect one that's sitting in the database.
Results from testtypes.py UnicodeTest.test_engine_param (this is what is supposed to be in database, and what testcase thinks is in database):
u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106' u'\u0105\u0107\xf3\u0142\u0144\xf3\u0141\u0143\xd3\u0104\u015a\u0106'
So whatever way and whenever character mangling happens, it's symmetrical on insert and on select.
This might be relevant: http://sourceforge.net/tracker/index.php?func=detail&aid=1521274&group_id=22307&atid=374932
If it is relevant (i'm not quite sure myself), it's not fixed in newest MySQLdb (1.2.2b2).
Any hints on what to try next?
-
repo owner just an update, Shannon Behrens has submitted a bug report to MySQLdb:
http://sourceforge.net/tracker/index.php?func=detail&aid=1592353&group_id=22307&atid=374932
link to the SA mailing list discussions:
http://www.mail-archive.com/sqlalchemy@googlegroups.com/msg00366.html http://groups.google.com/group/sqlalchemy/browse_thread/thread/28fbd1cc8e66e821/#
-
Account Deleted With new MySQLDb 1.2.2, everything seems to be working ok, as long as you connect to mysql with use_unicode=True and charset=utf8.
I believe this ticket can be closed now.
-
repo owner - changed status to resolved
- Log in to comment
youre almost certainly using the "convert_unicode" flag in your create_engine() call. this flag is not compatible with the Unicode type, since it will produce exactly the behavior youve observed (i.e. two client layers both encoding on the way in).
using a regular MySQL URL:
patched unittest:
output: