- changed milestone to 0.5.xx
Error when there are accented chars in the fieldnames using mssql
Hi,
Another problem appeared working against the Microsoft Dynamics NAV database (MSSQL). The error raises whenever a field with special chars like accented vocals (typical in spanish) are involved. I suppose that sqlalchemy thinks that the fieldnames are encoded in utf8, but it looks as if they are encoded in unicode. The collation property of the database is Modern_Spanish_CI_AS. When I have used ADO I must write the fieldname encoded in unicode.
C:\Python26\lib\site-packages\sqlalchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\databases\mssql.py:878: DeprecationWarning: object.__new__() takes no parameters
  return super(MSSQLDialect, cls).__new__(cls, *args, **kwargs)
Traceback (most recent call last):
  File "exportOrdersFromNAV2.py", line 328, in <module>
    ).filter(PurchaseHeader.documentType==1).one()
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\orm\query.py", line 1061, in one
    ret = list(self[0:2])
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\orm\query.py", line 961, in __getitem__
    return list(res)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\orm\query.py", line 1101, in __iter__
    return self._execute_and_instances(context)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\orm\query.py", line 1104, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, mapper=self._mapper_zero_or_none(), _state=self._refresh_state)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\orm\session.py", line 753, in execute
    clause, params or {})
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\engine\base.py", line 819, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\engine\base.py", line 869, in _execute_clauseelement
    return self.__execute_context(context)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\engine\base.py", line 896, in __execute_context
    return context.get_result_proxy()
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\engine\default.py", line 266, in get_result_proxy
    return base.ResultProxy(self)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\engine\base.py", line 1410, in __init__
    self._init_metadata()
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\engine\base.py", line 1444, in _init_metadata
    colname = item[0].decode(self.dialect.encoding)
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 29-31: invalid data
The value of item in line 1444 in sqlalchemy/engine/base.py is:
('Bizak$Purchase Header_Tipolog\xeda de Abono', <type 'int'>, None, 10, None, None, False)
The table Purchase Header has a field called Tipología de Abono; note the accented 'í'. And self.dialect.encoding is:
'utf8'
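The mismatch is reproducible in plain Python. This is a hypothetical sketch assuming the driver returns latin-1 bytes, which the byte \xed for 'í' in the value above suggests:

```python
# Hypothetical reproduction of the failure in _init_metadata: the driver
# hands back the column name as latin-1 bytes (\xed is 'í' in latin-1),
# while the dialect tries to decode them as utf-8.
raw = b'Bizak$Purchase Header_Tipolog\xeda de Abono'

try:
    raw.decode('utf-8')              # what colname = item[0].decode(...) does
except UnicodeDecodeError as exc:
    print('utf-8 decode failed:', exc.reason)

print(raw.decode('latin-1'))         # decodes cleanly to ...Tipología de Abono
```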
I hope all this can help. I used the latest Mercurial trunk to test this.
Thank you.
Comments (14)
-
repo owner -
repo owner - changed component to mssql
- assigned issue to
-
The unicode schema tests, although enabled, all fail on MSSQL with pyodbc.
-
repo owner OK so, someone needs to provide a working round trip using pyodbc alone that properly handles multibyte characters. Once we see that it's possible with pyodbc, an approach can be built for SQLAlchemy.
-
repo owner hey michael -
what happens if you apply this patch ?
Index: lib/sqlalchemy/databases/mssql.py
===================================================================
--- lib/sqlalchemy/databases/mssql.py (revision 5532)
+++ lib/sqlalchemy/databases/mssql.py (working copy)
@@ -1199,7 +1199,7 @@
     supports_sane_multi_rowcount = False
     # PyODBC unicode is broken on UCS-4 builds
     supports_unicode = sys.maxunicode == 65535
-    supports_unicode_statements = supports_unicode
+    supports_unicode_statements = False
     execution_ctx_cls = MSSQLExecutionContext_pyodbc

     def __init__(self, **params):
All this does is cause SQLA to encode the string into the configured encoding before passing the statement along.
-
zzzeek: with that change the table names get created all messed up. For instance the table name Unitéble2 gets created as UnitÃ©ble2. This causes the INSERTs and so forth to be unable to locate the table. It just seems like we have a mismatch somewhere.
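That kind of mangling is the classic UTF-8-bytes-read-as-latin-1 round trip; a minimal sketch of the suspected mismatch (the table name is taken from the test above):

```python
# Sketch of the suspected mismatch: SQLAlchemy encodes the identifier as
# utf-8, but the driver/server then interprets those bytes as latin-1.
name = u'Unitéble2'
mangled = name.encode('utf-8').decode('latin-1')
print(mangled)  # 'UnitÃ©ble2' -- the é has been split into two characters
```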
I've been working with the UnicodeSchemaTest trying to pin down the problem. Notice the issue here when I run that test, the acute in u'méil' gets stored as \xe9 (which is standard), but in the order by it's showing up as \x82. Ideas?
(Pdb) print tt1.select(order_by=desc(u'méil'))
SELECT unitable1.[m\xe9il], unitable1.[\u6e2c\u8a66]
FROM unitable1 ORDER BY m\x82il DESC
-
It turns out the \xe9 / \x82 issue was just that the ORDER BY was actually using a different character, so the information was correct. That doesn't solve the problem, though, so it wasn't related to this issue. Here's a stack trace in case it's useful:
======================================================================
ERROR: test_reflect (__main__.UnicodeSchemaTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test\sql\unicode.py", line 103, in test_reflect
    self.assert_(tt1.select(order_by=desc(u'méil')).execute().fetchall() ==
  File "S:\sqlalchemy.git\lib\sqlalchemy\sql\expression.py", line 1127, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1204, in _execute_clauseelement
    return connection._execute_clauseelement(elem, multiparams, params)
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1279, in _execute_clauseelement
    return proxy.execute(self, super(ProxyConnection, self).execute, elem, *(multiparams or []), **(params or {}))
  File "S:\sqlalchemy.git\test\testlib\assertsql.py", line 261, in execute
    result = execute(clauseelement, *multiparams, **params)
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 819, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 869, in _execute_clauseelement
    return self.__execute_context(context)
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 896, in __execute_context
    return context.get_result_proxy()
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\default.py", line 266, in get_result_proxy
    return base.ResultProxy(self)
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1410, in __init__
    self._init_metadata()
  File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1444, in _init_metadata
    colname = item[0].decode(self.dialect.encoding)
  File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
-
repo owner OK, you need to break this test down into smaller steps. That specific error occurs when we get a result set and there are multibyte column names in cursor.description. That's actually fairly promising. It's likely that cursor.description in this case already contains Python unicode strings, or perhaps they are double-encoded in some way. We can make that decode() conditional if need be.
-
Just for the record. We discussed this on IRC. The problem is with how pyodbc is returning the cursor.description. It's always returning it as latin-1, so the default conversion to utf-8 is failing. We discussed possibly adding a dialect option to support description_encoding where it could be set to 'latin-1' for the pyodbc dialect.
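The proposed option could work roughly like this. This is only an illustrative sketch; the class and helper names are hypothetical, not SQLAlchemy's actual implementation:

```python
class PyODBCDialectSketch:
    # Illustrative stand-in for the real dialect class.
    encoding = 'utf-8'                 # encoding used for statements/data
    description_encoding = 'latin-1'   # pyodbc returns column names as latin-1


def decode_column_name(dialect, raw_name):
    """Decode a name from cursor.description, preferring the
    description-specific encoding when the dialect defines one."""
    if isinstance(raw_name, str):      # already a unicode string
        return raw_name
    enc = getattr(dialect, 'description_encoding', None) or dialect.encoding
    return raw_name.decode(enc)


print(decode_column_name(PyODBCDialectSketch, b'Tipolog\xeda de Abono'))
```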
-
Account Deleted Replying to empty:
Just for the record. We discussed this on IRC. The problem is with how pyodbc is returning the cursor.description. It's always returning it as latin-1, so the default conversion to utf-8 is failing. We discussed possibly adding a dialect option to support description_encoding where it could be set to 'latin-1' for the pyodbc dialect.
Just to confirm, I was using the pyodbc dialect when the exception appeared.
-
How does this look?
-
I corrected this in dfd80ba089c0d0637f54cbd6b21332d5f5115999. If the implementation is not right let me know.
-
Account Deleted Replying to empty:
I corrected this in dfd80ba089c0d0637f54cbd6b21332d5f5115999. If the implementation is not right let me know.
I confirm that now it works. Thank you.
-
repo owner - removed milestone
Removing milestone: 0.5.0 (automated comment)
Since you haven't posted a full round trip test case, I don't know if your issue is related to reflection, table configuration, encoding configuration, specific DBAPI in use (not specified here), or other issues.
SQLA supports unicode strings as identifier names. The proper approach for most DBAPIs is to encode the unicode string into the specified encoding before sending. This is heavily dependent on what the DBAPI allows. pymssql in particular has absolutely no support for unicode identifiers or data.
I'm not familiar with the level of our support for MSSQL overall, although the unicode schema tests are currently enabled for this platform, which would suggest they are working.
In any case, if you are building the Table manually, you must use a Python unicode object for each multibyte identifier name, not an encoded string.
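For example, defining the table by hand with unicode identifiers (the table and column names here are taken from this report; the snippet only builds metadata, it does not connect to a database):

```python
from sqlalchemy import Table, Column, Integer, MetaData

metadata = MetaData()

# The accented identifier is given as a unicode object, not an encoded string.
purchase_header = Table(
    u'Bizak$Purchase Header', metadata,
    Column(u'Tipología de Abono', Integer),
)

print(purchase_header.c[u'Tipología de Abono'].name)
```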
If the issue is dependent on reflection, it may be that the MSSQL dialect is not decoding incoming schema identifiers to unicode objects. If this is the case, I'd suggest not using reflection for this particular table as a workaround, and this ticket can be made specific to reflection.