sqlalchemy-monetdb, python2 and unicode

Mike Bayer repo owner

well when passing non-ascii data to the database, the first thing that has to happen is that the string must be first constructed as a Python unicode object, e.g. u'data'. Next, either the DBAPI driver, or SQLAlchemy, needs to do the "right thing" with this string such that monetdb receives the data in the expected format. Typically, the DBAPI has to encode the string into bytes, using something like string.encode('utf-8'). It also has to do the "decode" on the other end; as data is received from the database to be sent to Python, something like "bytes".decode('utf-8') has to be called.

The encoding used is usually a configurable setting with the database; the client libraries as well as the backend server and the schema itself may all contain directives which refer to the character encoding to be used. For example, when using the MySQLdb driver, if the URL specifies "encoding=utf8&use_unicode=1", MySQLdb will encode and decode strings back and forth from Python unicode to bytes using utf-8. if the URL instead specifies "use_unicode=0", then MySQLdb passes through data unchanged, and SQLAlchemy instead takes on the job of doing the encode/decode, which occurs within the String type.

Once the desired encoding/decoding is set up, then the database itself needs to receive the data and persist it in such a way that this information will be preserved. On a database like Oracle or SQL server for example, special wide-encoded datatypes like NVARCHAR must be used in order to preserve this information. Postgresql and MySQL have server-side encodings set up as well.

Anyway, here, the first step is to figure out how to get Python unicode in and out of monetdb symmetrically, using the DBAPI directly. If the encode/decode has to be done outside of the driver, or within it, and if special settings need to happen with monetdb on the client or server side, would be figured out. once you have unicode/non-ascii round trips working with a short test script, it's mostly a matter of configuration of the SQLAlchemy dialect to conform to this behavior; it has options for pretty much every possible contingency.

2015-01-05T16:50:20+00:00

Former user Account Deleted reporter

Hi Mike,

Thanks for this great answer.

I've made the python-monetdb library in such a way that all unicode is encoded as utf-8 and all string are decoded to unicode. This is symetric:

https://gist.github.com/gijzelaerr/9fde6c6d40c9c3303045

I never thought about supporting different encoding schema's, partially because the python DB API doesn't mention anything about this and nobody had this as a feature request for the last 5 years. maybe an idea for future releases, but also my time is finite ;)

So, python-monetdb encodes to utf-8 and decodes strings to unicode, expecting it to be utf-8. How do I correctly configure the monetdb driver to match this behavior?

Thanks again,

Gijs

2015-01-07T13:21:15+00:00

Former user Account Deleted reporter

actually I found it :) I need to set supports_unicode_statements

Also the python-monetdb API has a small bug which is fixed now. Thx for the push in the right direction.

class MonetDialect(default.DefaultDialect):
    supports_unicode_statements = True

2015-01-07T15:21:33+00:00

Former user Account Deleted reporter

changed status to resolved

this issue has been resolved

2015-01-07T15:21:59+00:00

Comments (4)