Error when there are accented chars in the fieldnames using mssql

Issue #1264 resolved
Former user created an issue

Hi,

Another problem appeared working against the Microsoft Dynamics NAV database (MSSQL). The error raises whenever a field with special chars like accented vocals (typical in spanish) are involved. I suppose that sqlalchemy thinks that the fieldnames are encoded in utf8, but it looks as if they are encoded in unicode. The collation property of the database is Modern_Spanish_CI_AS. When I have used ADO I must write the fieldname encoded in unicode.

C:\Python26\lib\site-packages\sqlalchemy-0.5.0rc5dev-py2.6.egg\sqlalchemy\databa
ses\mssql.py:878: DeprecationWarning: object.__new__() takes no parameters
  return super(MSSQLDialect, cls).__new__(cls, *args, **kwargs)
Traceback (most recent call last):
  File "exportOrdersFromNAV2.py", line 328, in <module>
    ).filter(PurchaseHeader.documentType==1).one()
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\orm\query.py", line 1061, in one
    ret = list(self[0:2](0:2))
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\orm\query.py", line 961, in __getitem__
    return list(res)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\orm\query.py", line 1101, in __iter__
    return self._execute_and_instances(context)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\orm\query.py", line 1104, in _execute_and_instances
    result = self.session.execute(querycontext.statement, params=self._params, m
apper=self._mapper_zero_or_none(), _state=self._refresh_state)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\orm\session.py", line 753, in execute
    clause, params or {})
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\engine\base.py", line 819, in execute
    return Connection.executors[c](c)(self, object, multiparams, params)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\engine\base.py", line 869, in _execute_clauseelement
    return self.__execute_context(context)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\engine\base.py", line 896, in __execute_context
    return context.get_result_proxy()
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\engine\default.py", line 266, in get_result_proxy
    return base.ResultProxy(self)
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\engine\base.py", line 1410, in __init__
    self._init_metadata()
  File "c:\python26\lib\site-packages\SQLAlchemy-0.5.0rc5dev-py2.6.egg\sqlalchem
y\engine\base.py", line 1444, in _init_metadata
    colname = item[0](0).decode(self.dialect.encoding)
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 29-31: invalid d
ata

The value of item in line 1444 in sqlalchemy/engine/base.py is:

('Bizak$Purchase Header_Tipolog\xeda de Abono', <type 'int'>, None, 10, None, None, False)

The table Header has a field called de Abono, note the accented 'i' in the first word. And self.dialect.encoding is:

'utf8'

I hope all this can help. I used the last mercurial trunk to test this.

Thank you.

Comments (14)

  1. Mike Bayer repo owner

    Since you haven't posted a full round trip test case, I don't know if your issue is related to reflection, table configuration, encoding configuration, specific DBAPI in use (not specified here), or other issues.

    SQLA supports unicode strings as identifier names. The proper approach for most DBAPIs is to encode the unicode string into the specified encoding before sending. This is heavily dependent on what the DBAPI allows. pymssql in particular has absolute zero support for unicode identifiers or data.

    I'm not familiar with the level of our support for MSSQL overall although the unicode schema tests are currently enabled for this platform which would suggest they are working.

    In any case, if you are building the Table manually, you must use a Python unicode object for each multibyte identifier name, not an encoded string.

    If the issue is dependent on reflection, it may be that the MSSQL dialect is not decoding incoming schema identifiers to unicode objects. If this is the case, I'd suggest not using reflection for this particular table as a workaround, and this ticket can be made specific to reflection.

  2. Mike Bayer repo owner

    OK so, someone needs to provide for us a working round trip using pyodbc alone that properly understands multibyte characters. Once we see that its possible with pyodbc, an approach can be built for SQLAlchemy.

  3. Mike Bayer repo owner

    hey michael -

    what happens if you apply this patch ?

    Index: lib/sqlalchemy/databases/mssql.py
    ===================================================================
    --- lib/sqlalchemy/databases/mssql.py   (revision 5532)
    +++ lib/sqlalchemy/databases/mssql.py   (working copy)
    @@ -1199,7 +1199,7 @@
         supports_sane_multi_rowcount = False
         # PyODBC unicode is broken on UCS-4 builds
         supports_unicode = sys.maxunicode == 65535
    -    supports_unicode_statements = supports_unicode
    +    supports_unicode_statements = False
         execution_ctx_cls = MSSQLExecutionContext_pyodbc
    
         def __init__(self, **params):
    

    All this does is cause SQLA to encode the string into the configured encoding before passing the statement along.

  4. Michael Trier

    zzzeek: with that change the table names get created all messed up. For instance the table name Unitéble2 gets created as Unitéble2. This causes the INSERTS and so forth to be unable to locate the table. It just seems like we have a mismatch somewhere.

    I've been working with the UnicodeSchemaTest trying to pin down the problem. Notice the issue here when I run that test, the acute in u'méil' gets stored as \xe9 (which is standard), but in the order by it's showing up as \x82. Ideas?

    (Pdb) print tt1.select(order_by=desc(u'méil'))
    SELECT unitable1.[m\xe9il](m\xe9il), unitable1.[\u6e2c\u8a66](\u6e2c\u8a66)
    FROM unitable1 ORDER BY m\x82il DESC
    
  5. Michael Trier

    It turns out with the xe9 / x82 issue it was that the order by was actually using a different character. So the information was correct. It doesn't solve the problem though, so that wasn't related to the issue. Here's a stack trace if it's useful at all:

    ======================================================================
    ERROR: test_reflect (__main__.UnicodeSchemaTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test\sql\unicode.py", line 103, in test_reflect
        self.assert_(tt1.select(order_by=desc(u'méil')).execute().fetchall() ==
      File "S:\sqlalchemy.git\lib\sqlalchemy\sql\expression.py", line 1127, in execute
        return e._execute_clauseelement(self, multiparams, params)
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1204, in _execute_clauseelement
        return connection._execute_clauseelement(elem, multiparams, params)
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1279, in _execute_clauseelement
        return proxy.execute(self, super(ProxyConnection, self).execute, elem, *(multiparams or [**(params or {}))
      File "S:\sqlalchemy.git\test\testlib\assertsql.py", line 261, in execute
        result = execute(clauseelement, *multiparams, **params)
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 819, in execute
        return Connection.executors[c](]),)(self, object, multiparams, params)
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 869, in _execute_clauseelement
        return self.__execute_context(context)
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 896, in __execute_context
        return context.get_result_proxy()
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\default.py", line 266, in get_result_proxy
        return base.ResultProxy(self)
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1410, in __init__
        self._init_metadata()
      File "S:\sqlalchemy.git\lib\sqlalchemy\engine\base.py", line 1444, in _init_metadata
        colname = item[0](0).decode(self.dialect.encoding)
      File "C:\Python25\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
    
  6. Mike Bayer repo owner

    OK you need to break this test down into smaller steps. that specific error is occurring when we get a result set and there's multibyte column names in the cursor.description. that's actually fairly promising. Its likely that cursor.description in this case already contains python unicode strings or perhaps they are double-encoded in some way. we can make that decode() conditional if need be.

  7. Michael Trier

    Just for the record. We discussed this on IRC. The problem is with how pyodbc is returning the cursor.description. It's always returning it as latin-1, so the default conversion to utf-8 is failing. We discussed possibly adding a dialect option to support description_encoding where it could be set to 'latin-1' for the pyodbc dialect.

  8. Former user Account Deleted

    Replying to empty:

    Just for the record. We discussed this on IRC. The problem is with how pyodbc is returning the cursor.description. It's always returning it as latin-1, so the default conversion to utf-8 is failing. We discussed possibly adding a dialect option to support description_encoding where it could be set to 'latin-1' for the pyodbc dialect.

    Just to confirm, I was using the pyodbc dialect when the exception appeared.

  9. Log in to comment