unicode issue

Issue #1638 resolved
Former user created an issue

Unicode is a very confusing subject to many programmers, myself included, so if I say anything stupid please treat me gently! I've been trying to get to grips with Unicode over the last few days, partly because I know it's long overdue and partly because it's at the heart of a bug in my code.

The error I'm getting is a UnicodeDecodeError, because the ascii codec can't decode certain bytes. I'm dealing with some strings that I know contain non-ascii characters, but somewhere along the line SQLA is trying to convert these strings to unicode objects using the ascii codec. The problem occurs for me when calling session.execute(text(...)).

The offending code seems to be lines 152-155 of sqlalchemy.engine.default (0.5.6):

            if not dialect.supports_unicode_statements:
                self.statement = unicode(compiled).encode(self.dialect.encoding)
            else:
                self.statement = unicode(compiled)

The conversion to unicode takes place without specifying a codec, which looks to me like a huge trap waiting to happen. Furthermore, if the dialect does not support unicode statements (as is the case with Postgres, which is what I'm using), the conversion seems pointless, although it looks like an intentional decision whose reasoning just isn't clear to me.

As a fix, I would suggest that the unicode conversion take place using the codec specified by the dialect - something along these lines (untested, just to sketch the idea):
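            if not dialect.supports_unicode_statements:
                # decode via the dialect's configured codec rather than the
                # implicit ascii default, then re-encode for the DBAPI
                self.statement = unicode(str(compiled), self.dialect.encoding).encode(self.dialect.encoding)
            else:
                self.statement = unicode(str(compiled), self.dialect.encoding)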

That said, it seems that the PG dialect's default encoding is UTF-8, and personally I'd prefer to use something else, for example latin-1. If I specify latin-1 when I call create_engine, is this passed to the PG dialect object somewhere (presumably in the constructor, around line 45 of sqlalchemy.engine.default)? If it isn't being passed through to dialect.encoding, it would make sense to me to ensure the encoding specified with create_engine is the one that actually gets used.
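For reference, this is the kind of configuration I have in mind (I'm assuming the encoding keyword to create_engine is what ends up as dialect.encoding, and the connection URL here is just a placeholder):

    engine = create_engine('postgres://user:pass@localhost/mydb', encoding='latin-1')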

Comments (6)

  1. Mike Bayer repo owner

    Unless you can illustrate an actual problematic behavior, there's no bug to be seen here. If you can at least attach a stack trace, that would indicate the source operation which is leading to this error.

    The call to unicode(compiled) is against a Compiled object, which deals with a string that is already a Python unicode object. Additionally, that line of code only deals with the SQL construct, including table and column names, but not the actual values in your database - those are all moved in via bind parameters. The exception is if you are passing a text() construct in which you are hardcoding non-ascii characters in a plain Python bytestring (i.e. "foo" instead of u"foo") - you should always use u"foo" for any literal value in your source code that contains non-ascii characters.

    Other than the text() issue, the other possibility is that your database contains mis-encoded strings which you are attempting to convert to utf-8, although the stack trace wouldn't lead to that particular line. But that you say "a huge trap waiting to happen" implies that you didn't actually arrive at that line of code for the error message.

    Please attach a lot more detail, including at least stack traces, and preferably some code which reproduces the issue you're seeing.

  2. Mike Bayer repo owner

    Here, let me show you what you're most likely doing:

    # coding: utf-8
    
    from sqlalchemy import *
    
    engine = create_engine('sqlite:///:memory:', echo=True)
    
    # correct
    print engine.execute(text(u"select 'drôle de petit voix m’a réveillé.'")).scalar().encode('utf-8')
    
    # incorrect! raises the line of code you cite
    print engine.execute(text("select 'drôle de petit voix m’a réveillé.'")).scalar().encode('utf-8')
    

    The second call reproduces the error. Always use u"foo" for Python literals that contain non-ascii characters.

  3. Former user Account Deleted
    • removed status
    • changed status to open

    Replying to zzzeek:

    But that you say "a huge trap waiting to happen" implies that you didn't actually arrive at that line of code for the error message.

    Sorry for the confusion there - those lines of code were at the root of the stack trace.

    The problem with your example is that you know what encoding you want to use. My code is taking text from an input that could be in any encoding, storing this away in the DB, and later pulling the information out and giving it back to the user and letting them deal with the encoding issue.

    (In case you suggest I try storing it in a binary format or something like that: it is still useful for me to be able to perform text comparisons in my queries, though these comparisons only ever use ASCII values, and whatever encoding the text is in, it's one that supports the ASCII subset.)

  4. Mike Bayer repo owner

    If you are storing text in an unknown encoding, you are basically storing binary data of unknown format. Textual comparisons here are not possible, since the "text" cannot be extracted from the binary information unless the encoding is known (or has been accurately guessed). Relational databases don't include the capability to deal with text that contains unknown binary values embedded within ASCII. Without the encoding, neither the client DBAPI nor the database server itself can treat the data as text - both sides of the conversation are always configured with an encoding, whether it's implicitly ASCII/UTF-8 or has been explicitly configured on the client and server.
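    To illustrate with plain Python (nothing SQLAlchemy-specific here), the same bytes come out as entirely different "text" depending on which codec you assume:

        raw = 'r\xc3\xa9veill\xc3\xa9'          # utf-8 encoded bytes for u'réveillé'
        print repr(raw.decode('utf-8'))         # u'r\xe9veill\xe9' - the intended text
        print repr(raw.decode('latin-1'))       # u'r\xc3\xa9veill\xc3\xa9' - mojibake, yet no error is raised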

    Per the example above, none of this has any relevance at all to the text() construct, assuming the usage of bind parameters for data, which is strongly recommended in all cases. Embedding the data within the SQL statement itself, especially data sent over from end users with no knowledge of what it contains, is a frowned-upon practice, since that practice is at the core of SQL injection attacks.
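    For example, here's a sketch of the bind-parameter approach (the table and column names are made up purely for illustration):

        # coding: utf-8
        from sqlalchemy import create_engine, text

        engine = create_engine('sqlite:///:memory:')
        engine.execute(text("CREATE TABLE notes (txt VARCHAR)"))

        user_input = u"drôle de petit voix m’a réveillé."

        # the value travels separately as a bind parameter; the SQL string itself
        # stays plain ascii and is never decoded against the data
        engine.execute(text("INSERT INTO notes (txt) VALUES (:txt)"), txt=user_input)
        print engine.execute(text("SELECT txt FROM notes")).scalar().encode('utf-8')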
