sqlalchemy-monetdb, python2 and unicode

Issue #3284 resolved
Former user created an issue

Hi!

I'm about to release sqlalchemy-monetdb 0.9 but I have one final bug that I just don't can't manage to squash for a while now. I just don't understand where the problem lies, if it is a problem with sqlalchemy-monetdb (0.9) or python-monetdb.

problem is where you have a utf-8 encoded unicode string in a source file which is inserted into the database using sqlalchemy. Somehow the fetched results differ. Only happens with Python2, not python3. Probably it is a very stupid misunderstanding of how utf-8 encoded strings are handled somewhere.

see test output here:

https://travis-ci.org/gijzelaerr/sqlalchemy-monetdb/jobs/44208224

If you don't have a simple suggestion then never mind, I probably need to dive into python2 and unicode handling again.

Comments (4)

  1. Mike Bayer repo owner

    well when passing non-ascii data to the database, the first thing that has to happen is that the string must be first constructed as a Python unicode object, e.g. u'data'. Next, either the DBAPI driver, or SQLAlchemy, needs to do the "right thing" with this string such that monetdb receives the data in the expected format. Typically, the DBAPI has to encode the string into bytes, using something like string.encode('utf-8'). It also has to do the "decode" on the other end; as data is received from the database to be sent to Python, something like "bytes".decode('utf-8') has to be called.

    The encoding used is usually a configurable setting with the database; the client libraries as well as the backend server and the schema itself may all contain directives which refer to the character encoding to be used. For example, when using the MySQLdb driver, if the URL specifies "encoding=utf8&use_unicode=1", MySQLdb will encode and decode strings back and forth from Python unicode to bytes using utf-8. if the URL instead specifies "use_unicode=0", then MySQLdb passes through data unchanged, and SQLAlchemy instead takes on the job of doing the encode/decode, which occurs within the String type.

    Once the desired encoding/decoding is set up, then the database itself needs to receive the data and persist it in such a way that this information will be preserved. On a database like Oracle or SQL server for example, special wide-encoded datatypes like NVARCHAR must be used in order to preserve this information. Postgresql and MySQL have server-side encodings set up as well.

    Anyway, here, the first step is to figure out how to get Python unicode in and out of monetdb symmetrically, using the DBAPI directly. If the encode/decode has to be done outside of the driver, or within it, and if special settings need to happen with monetdb on the client or server side, would be figured out. once you have unicode/non-ascii round trips working with a short test script, it's mostly a matter of configuration of the SQLAlchemy dialect to conform to this behavior; it has options for pretty much every possible contingency.

  2. Former user Account Deleted reporter

    Hi Mike,

    Thanks for this great answer.

    I've made the python-monetdb library in such a way that all unicode is encoded as utf-8 and all string are decoded to unicode. This is symetric:

    https://gist.github.com/gijzelaerr/9fde6c6d40c9c3303045

    I never thought about supporting different encoding schema's, partially because the python DB API doesn't mention anything about this and nobody had this as a feature request for the last 5 years. maybe an idea for future releases, but also my time is finite ;)

    So, python-monetdb encodes to utf-8 and decodes strings to unicode, expecting it to be utf-8. How do I correctly configure the monetdb driver to match this behavior?

    Thanks again,

    • Gijs
  3. Former user Account Deleted reporter

    actually I found it :) I need to set supports_unicode_statements

    Also the python-monetdb API has a small bug which is fixed now. Thx for the push in the right direction.

    class MonetDialect(default.DefaultDialect):
        supports_unicode_statements = True
    
  4. Log in to comment