Query builder does not quote mixed case table and column names in Postgresql

Issue #155 resolved
Former user created an issue

(original reporter: creiht) In Postgresql if your table has mixed case name, you have to quote them in the query.

An example might be:

select * from "Ticket" where "Ticket"."SpecialValue" = 12;

SQL Alchemy currently does not quote the names, so the query generated would look like:

select * from Ticket where Ticket.SpecialValue = 12;

and Postgres will turn that into:

select * from ticket where ticket.specialvalue = 12;

I would expect that a similar problem could happen with other engines, but I don't know the mixed case naming conventions of the other databases.

Comments (57)

  1. Former user Account Deleted

    (original author: creiht) I have attached a diff that is my first stab/hack at fixing this that is mostly a proof of concept. Hopefully this will give you an idea of what I am trying to accomplish. A better solution would be to create a more generic solution, perhaps having a function in ansisql.ANSICompiler that would quote strings that by default does nothing but could be overwritten in each engine. Let me know what direction you would like to go with this, and I will implement it.

  2. Mike Bayer repo owner
    • assigned issue to

    the patch is not bad. I would make the check for "islower" more inline, i.e. if the column/table is already lowercase, dont go through all the effort re-joining the string together, etc.

    its ok as a postgres thing for now and can later move up to more of a "template method" within ANSISQLCompiler.

    also, you have to insure all the unittests run, like this (assuming you have a local PG database called "test" with "scott"/"tiger":

       python test/alltests.py --db postgres
    

    or

       python test/alltests.py --dburl <some postgres url>
    

    there should also be unittests added for this specific thing, probably in test/select.py.

  3. Former user Account Deleted

    (original author: creiht) I have attached my second attempt at the solution to this problem. I ran into another case where a string needed to be quoted, and found that it was going to be easier to make a more generalized solution to the problem.

    This solution defines a function called {{{_quote_string()}}}, and if an engine needs to do quotes, it overrides that function. The {{{_quote_string()}}} is then called at each point a string would need to be quoted.

    There are probably also other areas that need to be quoted, and I plan on catching them once I can set up a local instance of Postgres, and I will write a set of tests for each case.

    If this proves to be a performance problem I see the following possibilities for optimization:

    • Add an option at the table level that will turn string quoting on or off and then we can bypass it completely if it is turned off.
    • When a Table class is created, generate and store a quoted version of the string for each column which would then be used in place of the {{{_quote_string()}}} function.

    Let me know what you think.

  4. Mike Bayer repo owner
    • changed milestone to 0.3.0

    this is not bad, i might adjust _quote_string to be something more like process_identifier() or something like that.

  5. Mike Bayer repo owner

    im going to look into adding "quote=True" to Table so that users can manually identify if they want quoting on a table identifier. id rather not have anything with quoting in by default.

  6. Former user Account Deleted

    Is this patch suited to handling quoted identifiers other than those that are a legal unquoted identifier when coerced to a single case? I believe all of the identifiers in the statement below are legal SQL.

    CREATE TABLE "WorstCase" ( id SERIAL PRIMARY KEY, lowercase varchar(25), UPPERCASE varchar(25), MixedCase varchar(25), "ASC" varchar(25), "desc" varchar(25), "Union" varchar(25), "MixedCase" varchar(25), "With Space" varchar(25), "1st" varchar(25), "Similar..?" varchar(25), "Similar..#" varchar(25), ":symbol" varchar(25), "inner""quote" varchar(25) );

  7. Former user Account Deleted

    May be beneficial to upgrade a table.schema to an object with its own quotable name. Table.fullname can not be quoted directly for tables that belong to a schema other than the default.

    Some things to keep in mind concerning delimiters. Both the beginning and ending delimiter should be configurable. The double quote character is the delimiter defined in the SQL specification, but some databases use an alternative character, ie MySQL uses the backtick "`". Some databases, eg MySQL, will accept different delimiters depending on the mode that the backend is configured to run in. In some instances database will operate in a mode that transposes the standard quoting characters for literals and identifiers; single quote identifiers, double quote literals. MSQL will accept the paired delimiters [].

  8. Former user Account Deleted

    ansisql.py.diff doesn't apply cleanly. attaching an updated patch but visit_column() needs some attention.

  9. Former user Account Deleted

    Seems that there is need of "_prepare_identifier" also in ANSISchemaGenerator.visit_primary_key_constraint() (lib/sqlalchemy/ansisql.py)

    @@ -662,16 +673,16 @@
             if len(constraint) == 0:
                 return
             self.append(", \n")
    -        self.append("\tPRIMARY KEY (%s)" % string.join([for c in constraint](c.name),', '))
    +        self.append("\tPRIMARY KEY (%s)" % string.join([for c in constraint](self._prepare_identifier(c)),'
    , '))
    
  10. Former user Account Deleted

    One of the things that causes some failed tests is the distinction between name and fullname on tables. With _prepare_identifier() we've lost this distinction. Should we separate it back out and create _prepare_name() and _prepare_fullname() instead? Could we unify the concept further and create the ability to get a properly quoted tablename.columnname when needed, instead of having to ask for a quoted table name and then a quoted column name?

  11. Former user Account Deleted

    I'm now pretty happy with the above patch. Please take a look and comment. Still need to learn how to write a few tests for it. And it needs more testing/inspection to find places where quoting may be needed but is not yet being done.

  12. Former user Account Deleted
    • removed status
    • changed status to open

    It would be nice if reflected columns were quoted. The obvious way is something like

    self.quote = not(self.name == self.name.lower())
    

    in the Column constructor.

    I've implemented this here:

    http://go2.ie/sqlalchemy-quote-column-names.diff

    The patch is a bit big due to having to fix some test cases that used a non-quoted mixed case column name. Should these sorts of columns be supported? They don't really work that well with the ORM since Python attribute names are case-sensitive.

  13. Mike Bayer repo owner

    so let me get this straight....

    you want to define a table like this:

    t = Table('sometable', meta, 
       Column('SomeColumn', Integer, quote=True)
    )
    

    and then the column would be accessed like this:

    t.c.somecolumn
    

    and calling

    t.c.SomeColumn
    

    which is the actual name of the column, would raise an error ? i think thats pretty counterintuitive.

    There is a feature which i think satisfies what you really want, which is this:

    t = Table('sometable', meta,
        Column('SomeColumn', Integer, quote=True, key='somecolumn')
    )
    

    then you can explicitly say, "id like to reference column X using the name Y". and you know explicit is better than implicit ;)

    i dont push the "key" parameter so much since there have been a lot of mapper bugs that come up and a lack of unit tests using "key". but ultimately that should all be fixed anyway. i bet theres not much buggy behavior left anyway.

  14. Former user Account Deleted

    No not at all :)

    What I would like is to have consistent case sensitivity at table and model level, in conjunction with autoload. At the moment it doesn't work unless you explicity set quote = True on each column after reflecting the table.

    My use case is this:

    ############# foo.sql
    create table "tblFoo"
    (
        "iFooID" serial primary key,
        "sTitle" text not null
    );
    
    ==============================================
    
    ############# foo.py
    import sqlalchemy
    
    metaData = sqlalchemy.BoundMetaData('postgres:///something')
    tblFoo = sqlalchemy.Table('tblFoo', metaData, quote = True)
    
    class Foo:
        def __init__(self, sTitle):
            self.sTitle = sTitle
    
    sqlalchemy.Mapper(Foo, tblFoo)
    session = sqlalchemy.create_session()
    oFoo = Foo('Hello World')
    session.save(oFoo)
    session.flush()
    
    ==============================================
    

    If I do this at the moment, I have to do the following:

    for column in tblFoo.columns:
        if column.name != column.name.lower():
            column.quote = True
    

    Then the above example will work.

  15. Former user Account Deleted

    Some further information:

    You won't see this issue unless you use autoload.

    When you create a table in PostgreSQL without quoting, table and column names get converted to lower case. When information_schema.py reads the column names, it creates Column objects with the lowercase column names. Hence the model in the above example gets attributes ifooid and stitle rather than iFooID and sTitle.

    This hurts my eyes :)

    Yes, I know Hungarian notation is evil etc. I'm stuck with it :(

  16. Former user Account Deleted

    Arg, the above should be

    tblFoo = sqlalchemy.Table('tblFoo', metaData, autoload = True, quote = True)
    
  17. Mike Bayer repo owner

    oh, i read your patch incorrectly, it confused me that you changed all the column identifiers in one of the unittests to be all lowercase.

    we might go with your idea except that the "quote" flag is already revealing its limitation, that its setting an attribute on an entity that has no allegiance to any particular database. the same Table could be used on a sqlite, oracle, mysql database, etc. by sticking a "quote this column if theres mixed case" is a band-aid; what if the column name is a reserved word on just MySQL ? then someones going to want quoting upon reflection for that too. then the same table on Postgres has quoting set on some random column for no apparent reason. similarly for mixed case, only postgres even has the quote-mixed-case requirement in the first place.

    we started with "quote=True" because it was the quickest way to allow any quoting at all to work, so that such a table could even be used. but the better solution is that each database dialect would know exactly what identifiers require quoting, and would use those rules at query compilation time instead of checking a quote=True flag. The Table itself would again have no idea about quoting flags (well, probably best to leave the flag available in case its needed).

    i guess what im saying is that the quote flag is an interim hack, pending some contributors doing the work of identifying all (or at least some) of the quotable conditions for each database, and im a little antsy building too much on top of it. id rather modify the recent svn checkin to check "quote=True" plus some additional dialect-specific method, then put your "check the case convention" rule in the postgres dialect (one of the first patches on this ticket did it that way). as a bonus id also like postgres to cache the result of this check somehow so that it isnt doing thousands of string comparisons on every compilation.

  18. Former user Account Deleted

    Yeah, my first attempt at fixing it was in information_schema.py....

    I will look into the reserved keywords thing and see if I can come up with a better patch.

    FWIW the key option you mentioned would actually solve the problem, but then I lose the DRY-ness of autoload :/ autoload rocks :)

  19. Former user Account Deleted

    zzzeek, I think a few use cases would help me better understand where to take this.

    The coupling between Table/Column and quoting shouldn't be a problem for the autoload=True use case. So perhaps a sufficient measure would be to add quote=True during reflection. This would satisfy the caching desire.

    For cases where a user is CREATEing the tables they of course have the freedom to choose what gets quoted. So again the coupling shouldn't cause problems. And requiring the user to specify the extra flag rather than guessing for them is not forcing too much extra work on them.

    I fully agree that the quoting is not really a property of the identifier or the identified object alone. The database figures in there too. If you can give a couple examples of where the coupling causes problems for people who need quoting, that should help me find a solution.

  20. Former user Account Deleted

    I can't quite formulate a convincing example, but there may be some problem if a user wants to use a table defined for one dialect where an identifier is legal to be used unquoted attempts to use that in a dialect where the identifier is reserved. Maybe in migrating between databases? That might be solved by introducing a requote() method that could find the minimal quoting needed when recasting the table against another dialect.

  21. Mike Bayer repo owner

    all we need here is a function inside of each dialect, give it a string, and it determines the question, "does this identifier need to be quoted" ? thats the work here that im looking for.

    i just added the method in changeset:1802, to the base class in ansisql.py. just add a _requires_quotes method to all the subclasses of ANSIIdentifierPreparer in each dialect. if we get that working, and get reflection going (with a little unit test too), I can drill in to try to optimize the performance of this.

    the quote=True flag is definitely not a big deal. however, there is a tradition in this code, which adhering to has always ultimately proved to be far more necessary than it first seems, which is that the Dialect is solely responsible for all generation of SQL strings that get sent to a database; as few string-generation concepts as possible should be stored in schema/SQL expression objects. by having reflection turn on "quote=True" on the Table/Column, youre just breaking that paradigm a tiny bit, adding a very small inconsistency to the code (and yes, maybe that mythical case where one DB wants quoting and the other doesnt).

  22. Former user Account Deleted

    I'll start researching and working on these methods tomorrow. I think I saw a list of reserved words per database in the code somewhere already.

    I know you've made it clear to me multiple times that you don't think the case folding stuff is necessary. But again the natural case of identifiers is part of the quoting rules. in postgres the natural identifier case is lower (which breaks the sql spec) so it works well with the custom of lowercasing identifiers in sqlalchemy in places. But other dbs that follow sql use upper as the natural case. When you pass in a query the lexer looks for quotes to determine if the case needs to be folded before matching identifiers. When you reflect you will retrieve the case stored by the db which you can test against legal characters to determine if it needs quoting. But if you fold the reflected names to lower for the python identifiers (as I think you are doing now to avoid typing uppercase all over) you lose important information. You are asking to get that back without the extra information given by quotes. I think you will either have to store the proper reflected name with case, or the quote= parameter. In postgresql IDENTIFIER is not equal to "IDENTIFIER". And in any db that follows the sql standard identifier is not equal to "identifier". Having said all that, the simple cases functioning properly will be enough to enable my work. :-)

  23. Former user Account Deleted

    Here is a little test I'm using to understand various db's quoting behavior:

    1) create a table. for mysql one of the MixedCase columns needs to be omitted and the quoting character changed.
    CREATE TABLE "WorstCase2" (
        id INTEGER PRIMARY KEY,
        lowercase INTEGER,
        UPPERCASE INTEGER,
        MixedCase INTEGER,
        "lowered" INTEGER,
        "UPPERED" INTEGER,
        "ASC" INTEGER,
        "desc" INTEGER,
        "Union" INTEGER,
        "MixedCase" INTEGER,
        "MixedCaseD" INTEGER,
        "With Space" INTEGER,
        "1st" INTEGER,
        "Similar..?" INTEGER,
        "Similar..#" INTEGER,
        ":symbol" INTEGER,
        "inner""quote" INTEGER
    );
    
    2) describe the table in the db's syntax
    postgresql: \d "WorstCase2"
    firebird: SHOW TABLE "WorstCase2";
    
    3) do any of the following selects cause errors.
    SELECT LOWERED FROM "WorstCase2";
    SELECT uppered FROM "WorstCase2";
    SELECT mixedcased FROM "WorstCase2";
    

    MySQL reports unquoted column names in the case in which they were originally created. Postgresql folds to lower. Firebird folds to upper.

    MySQL executes all three selects. Postgresql should fail on the second and third select. Firebird fails on the first and third select.

  24. Former user Account Deleted

    So far quoting rules boil down to: * a list of illegal identifiers that must be quoted in a specific version of a specific database * characters that are allowed as the initial character of an unquoted identifier * characters that are allowed in unquoted identifiers

    I think after taking a good shot at getting it right it still might end up being an iterative process. Fixing things as problems are reported. Because the docs aren't always very clear.

    Since missing _require_quotes() won't negatively affect other databases, should I focus on postgresql and wait for users of other databases to fill in code for their databases? Or is it worth making an attempt at all the supported databases for them to use as a starting point?

    Do you want to handle differences in reserved words between versions of the same database? How?

  25. Mike Bayer repo owner

    yup, that is exactly the way to go about this. you work out the framework that allows the problem to be solved, and address individual cases as they come up....trying to do it all at once is overwhelming.

    so. just do postgresql for now. im going to go in and probably have the base dialect cache everything it finds as well, so it isnt hitting _require_quotes too much after a table is first used. then when people report issues with other DBs (like mysql, which has issues already), we can address those too and theres already a place to put it.

    as far as reserved words, we should try to support the union of all DB versions at once (i.e., just a list of all the words). probably a module level util.Set() that contains a listing of words so we can quickly check.

  26. Former user Account Deleted

    I still need a little feedback on how you see databases that return unquoted delimiters in uppercase affecting this. Firebird is such a database. I've written up the _require_quotes() function for Firebird in the above patch to help you see where I think the problem will come in.

  27. Mike Bayer repo owner

    ok, i ask firebird to reflect a column, and it says FOO. so, whats the problem exactly ? that youre not sure if its "FOO" or just FOO ?

    also, dont all these _require_quotes methods need to tell if the identifier name is using MixedCase ?

  28. Former user Account Deleted

    The problem is this: you ask firebird for the column name. firebird gives it to you in natural case which is upper. you set table.name = value.lower(). now you ask if this identifier needs to be quoted. lowercase characters are not legal in an unqouted identifier in firebird so the method says yes. you ask firebird for "foo" which doesn't match "FOO" or FOO.

    reserved_words = ['analyse', 'analyze', 'and', 'any', 'array', 'as', 'asc', 'asymmetric', 'authorization', 'between', 'binary', 'both', 'case', 'cast', 'check', 'collate', 'column', 'constraint', 'create', 'cross', 'current_date', 'current_role', 'current_time', 'current_timestamp', 'current_user', 'default', 'deferrable', 'desc', 'distinct', 'do', 'else', 'end', 'except', 'false', 'for', 'foreign', 'freeze', 'from', 'full', 'grant', 'group', 'having', 'ilike', 'in', 'initially', 'inner', 'intersect', 'into', 'is', 'isnull', 'join', 'leading', 'left', 'like', 'limit', 'localtime', 'localtimestamp', 'natural', 'new', 'not', 'notnull', 'null', 'off', 'offset', 'old', 'on', 'only', 'or', 'order', 'outer', 'overlaps', 'placing', 'primary', 'references', 'right', 'select', 'session_user', 'similar', 'some', 'symmetric', 'table', 'then', 'to', 'trailing', 'true', 'union', 'unique', 'user', 'using', 'verbose', 'when', 'where']('all',)
    legal_characters = string.ascii_lowercase + string.digits + '_$'
    
    class PGIdentifierPreparer(ansisql.ANSIIdentifierPreparer):
        def _fold_identifier_case(self, value):
            return value.lower()
        def _requires_quotes(self, value):
            retval = bool(len([for x in str(value) if x not in legal_characters](x)))
            if not retval and (value[0](0).isdigit() or value in reserved_words):
                retval = True
            return retval
    

    Notice that legal_characters only contains lowercase because uppercase characters are not legal in an unquoted identifier in postgresql. It is unneccesary to look for mixed case; checking for illegal characters is enough.

    (Can't give you an updated diff right now because I'm working through the patch for #71 too and I have both in the code file. I need a better way to work on multiple issues and keep the changesets separate.)

  29. Former user Account Deleted

    if the quote true flag causes to much coupling between the Table and the generated SQL, would a Table/Column.natural_case=True flag be separate enough? this would signal that the identifier originally came in the natural case of the database and need not be quoted. It could be converted to lower for use in python and converted back to natural case before sending to the db (unnecesary of course) or before it is checked as a reserved word. We would not be storing a directive for the SQL, just storing (in a database agnostic manner) extra information that would otherwise be lost about the identifier.

  30. Mike Bayer repo owner

    yes, i like natural_case. because it says something about the identifier itself and its verbiage, but doesnt get into how the database should treat it. its also a nice flag to signal the "case folding" stuff. i agree the firebird problem sucks but this should fix it.

  31. Mike Bayer repo owner

    OK, if you check out changeset:1810 youll see how i did this (added some caching layer to the ansisql.py so the calculations are only done once per table/column). going to leave this ticket open since we got to get all the other DB's working, see if reflection actually works, etc. Also, i think id like to see if this newer idea with the "natural_case" flag can eliminate the need for "quote=True" in most if not all cases (need to play with this some more in general).

    nice job and i think we're getting close.

  32. Former user Account Deleted

    Yes natural_case seems to be much more useful. I think we should be able to remove quote=True all together now if it is desired.

    The only thing that could make in necessary is hinted at here:

    http://www.firebirdsql.org/manual/qsg10-firebird-sql.html

    If there exists a mythical database in which neither

    identifier == "identifier"

    nor

    identifier == "IDENTIFIER"

    are true. I haven't found a database like this yet. maybe oracle.

  33. Former user Account Deleted

    The fact that we prefer lowercase names in python and yet firebird requires you to ask for upper case names during reflection is screwing with my head. Above is my first try with that. I think quoted firebird reserved words will not be lowercased into python with this patch but I'm not sure. We should determin if that behavior is desirable.

  34. Log in to comment