No nvarchar support in mssql.py

Issue #298 resolved
Former user created an issue

mssql.py doesn't support nvarchar fields. I've attached a patch that adds an MSUnicode type and maps it to nvarchar. The patch also corrects an indentation error (a tab instead of spaces).

Comments (27)

  1. paj
    • changed status to open
    • removed status

    Unicode columns are still created as VARCHAR. The problem is that Unicode is a TypeDecorator around a String type, and that TypeDecorator.dialect_impl gets the database type for the type that it wraps (i.e. String, not Unicode).

    Fixing this could be interesting! One option is to make Unicode a completely separate type from String. This could raise issues for cross-database portability. MS-SQL has different database types for string and unicode fields. Postgres has a single string type, and it is a database option whether your strings are unicode.

    It has been noted that NVARCHAR fields are UCS-2 only. Still, this is preferable to using VARCHAR with utf-8 encoding, as other applications (not using SQLAlchemy) may access the data. In any case, most Python builds are UCS-2 only.
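    As a side note (not part of the patch), the kind of build you are on can be checked with `sys.maxunicode`, which distinguishes narrow (UCS-2) from wide (UCS-4) Python builds:

```python
import sys

# On a "narrow" (UCS-2) build sys.maxunicode is 0xFFFF; on a "wide"
# (UCS-4) build, and on all modern Pythons, it is 0x10FFFF.
build = "narrow (UCS-2)" if sys.maxunicode == 0xFFFF else "wide (UCS-4)"
print(build)
```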

  2. Mike Bayer repo owner

    Why not have MSUnicode override the "impl" class member to point to an NVARCHAR type defined in MS-SQL? Then MSUnicode doesn't wrap String anymore (not directly, at least).

  3. paj

    I should just clarify: the patch uses the approach Michael suggested on the mailing list: "another thing we might do if you want Unicode to act database-neutral is add a step at line 85 of types.py to ask the dialect first for a type_descriptor of 'Unicode', then if not found to ask for the 'impl' version which is normally String. The MSSQL dialect would return NVARCHAR for the 'Unicode' type and 'String' would not be involved. All other dialects would return None when given 'Unicode'."
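    A rough standalone sketch of that lookup order (the class and method names here are illustrative, not SQLAlchemy's actual API): ask the dialect for a "Unicode" descriptor first, and fall back to the impl type's descriptor only if the dialect returns None.

```python
class Dialect:
    # maps generic type names to dialect-specific column types;
    # a plain dialect knows nothing special about Unicode
    colspecs = {'String': 'VARCHAR'}

    def type_descriptor(self, name):
        return self.colspecs.get(name)

class MSSQLDialect(Dialect):
    # MS-SQL answers for "Unicode" directly, so String is never consulted
    colspecs = {'String': 'VARCHAR', 'Unicode': 'NVARCHAR'}

def resolve_unicode(dialect):
    # ask for "Unicode" first; fall back to the impl type, String
    return (dialect.type_descriptor('Unicode')
            or dialect.type_descriptor('String'))

print(resolve_unicode(MSSQLDialect()))  # NVARCHAR
print(resolve_unicode(Dialect()))       # VARCHAR
```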

  4. paj

    OK, this patch also makes it actually store unicode as unicode, rather than as a utf-8 encoded string. If you want this as a separate patch, just ask. MSSQL now passes the Unicode unit tests!

  5. Mike Bayer repo owner

    Is there a condition where encoding would be sent to the dialect as None? I see that we are checking for that now; DefaultDialect has encoding set to "utf-8" by default.

  6. Mike Bayer repo owner

    OK, the current version of the patch, which is completely different from what I proposed, is in changeset:2199.

    • The MissingTypeError is not the way to go; the adapt_type() method does need to return the type that was given in the case that no adapted type is located. This covers a completely custom user-defined type which has no dialect-specific version (unit test: testtypes/OverrideTest).
    • The TypeDecorator instead just checks whether the returned type object is the same object as itself (meaning it was not overridden). Otherwise, it runs the adapt step on the returned type and returns that type.
    • mssql has separate MSNVarchar and MSUnicode types. MSNVarchar is analogous to using MSNVarchar without unicode conversion; MSUnicode is analogous to using MSNVarchar with unicode conversion.
    • Reflecting an mssql table with NVARCHAR will give you MSNVarchar as the return type, i.e. it won't do any unicode conversion unless the dialect has "convert_unicode=True" turned on. This is the same behavior as all the other dialects (unicode is never converted unless you either specified Unicode types specifically for those columns you want, or you have convert_unicode turned on).
    • Since MS-SQL isn't going to do unicode conversion by default, I removed the "encoding=None" default setting, particularly because I don't like the latency introduced into the Unicode type by having to check whether the encoding is None. If there's a real reason to have encoding=None on the dialect, I'll reconsider (but I would think convert_unicode=False, and not using Unicode, would be enough if unicode conversion isn't desired).
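    The first two bullets can be sketched in miniature; the class and method names mirror the description above, but the code is illustrative, not the actual SQLAlchemy internals:

```python
class Dialect:
    # dialect-specific overrides, keyed by generic type class
    overrides = {}

    def adapt_type(self, typeobj):
        # return the given type unchanged when no adapted type is
        # located (this covers custom user-defined types)
        return self.overrides.get(type(typeobj), typeobj)

class TypeDecorator:
    impl = None

    def dialect_impl(self, dialect):
        adapted = dialect.adapt_type(self)
        if adapted is self:
            # the dialect did not override this decorator; adapt the
            # wrapped impl type instead
            return dialect.adapt_type(self.impl)
        # otherwise run the adapt step on the returned type as well
        return dialect.adapt_type(adapted)

class String: pass

class Unicode(TypeDecorator):
    impl = String()

class MSNVarchar: pass

class MSSQLDialect(Dialect):
    overrides = {Unicode: MSNVarchar()}

u = Unicode()
print(type(u.dialect_impl(MSSQLDialect())).__name__)  # MSNVarchar
print(type(u.dialect_impl(Dialect())).__name__)       # String
```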

    I'm going to mark this fixed, but I haven't tested on an actual MS-SQL database, so reopen if I screwed something up.

  7. paj
    • changed status to open
    • removed status

    It is creating the column as NVARCHAR, but the data inserted is utf-8 encoded, as opposed to being actual unicode. When you select the data, you get a unicode object that contains utf-8 encoded data. The reason for the "encoding = None" option was to avoid this problem.

    Here's an example of it happening:

        users_table.insert().execute(user_id=u'hello\u1234')
        print users_table.select().execute().fetchone()
        # (u'hello\xe1\u02c6\xb4',)
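    The garbled result is consistent with utf-8 bytes being written and then decoded with a single-byte codec on the way back. A minimal sketch of that round trip (the cp1252 codec here is an assumption, chosen because it reproduces exactly the output shown):

```python
original = u'hello\u1234'
utf8_bytes = original.encode('utf-8')  # b'hello\xe1\x88\xb4'
# decoding those utf-8 bytes as if they were cp1252 text yields
# the mojibake seen in the fetched row
mojibake = utf8_bytes.decode('cp1252')
print(repr(mojibake))                  # 'hello\xe1\u02c6\xb4'
```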

  8. Mike Bayer repo owner

    by "actual unicode", i assume you mean a python "unicode" object. that would imply that the DBAPI youre working with handles python "unicode" objects directly, and does not want an encoded string (and also returns python "unicode" objects). therefore SA should not be doing the encode() step.

    If that is the case, the MS-SQL code would be changed to this:

    class MSNVarchar(MSString):
        def get_col_spec(self):
            return "NVARCHAR(%(length)s)" % {'length': self.length}
        def convert_bind_param(self, value, dialect):
            # pass the value through unchanged; the DBAPI handles
            # unicode objects directly
            return value
        def convert_result_value(self, value, dialect):
            return value
    
    class MSUnicode(sqltypes.Unicode):
        impl = MSNVarchar
        def convert_bind_param(self, value, dialect):
            # no encode()/decode() step; rely on the DBAPI
            return value
        def convert_result_value(self, value, dialect):
            return value
    

    If that's so, I also need to know whether both DBAPIs in use behave this way.

    If you can show me completely straight DBAPI code that does the right thing, that would be informative.

  9. paj

    Absolutely, what works for me is passing adodbapi a Python Unicode object as a bind parameter. And it does return a Python Unicode object from queries when appropriate. The code you suggested works perfectly.

    Offhand, I don't know how pymssql works. I will investigate this and get back to you.

    Many thanks for looking into this.

  10. paj

    OK, it appears pymssql does need the encoding; at least, passing it unicode bind parameters fails. But then the values end up in the database encoded, rather than as proper unicode. Ooooh, getting reliable MSSQL unicode support is going to be fun!

    I will keep you posted.

  11. Mike Bayer repo owner

    So, does the current checked-in code work with pymssql? If not, we need to have these types check which dialect is in use (or, maybe better, provide different versions based on the dialect in use).

  12. paj

    The current code is, IMHO, broken for pymssql, but I think it's fine for most people's needs, and probably as good as we can easily get, bearing in mind the comment at the top of mssql.py: "pymssql has problems with binary and unicode data that this module does NOT work around". In any case, this patch doesn't make pymssql any worse :-)

  13. paj

    I've just realised one of these changes introduced an unintended side effect: it made the default encoding None. Here is a patch to revert that part.
