Query builder does not quote mixed case table and column names in Postgresql

Former user Account Deleted

attached postgres.py.diff

(original author: creiht) Initial try at fixing this

2006-04-10T23:10:00+00:00

Former user Account Deleted

(original author: creiht) I have attached a diff that is my first stab/hack at fixing this that is mostly a proof of concept. Hopefully this will give you an idea of what I am trying to accomplish. A better solution would be to create a more generic solution, perhaps having a function in ansisql.ANSICompiler that would quote strings that by default does nothing but could be overwritten in each engine. Let me know what direction you would like to go with this, and I will implement it.

2006-04-10T23:13:16+00:00

Mike Bayer repo owner

assigned issue to

the patch is not bad. I would make the check for "islower" more inline, i.e. if the column/table is already lowercase, dont go through all the effort re-joining the string together, etc.

its ok as a postgres thing for now and can later move up to more of a "template method" within ANSISQLCompiler.

also, you have to insure all the unittests run, like this (assuming you have a local PG database called "test" with "scott"/"tiger":

   python test/alltests.py --db postgres

or

   python test/alltests.py --dburl <some postgres url>

there should also be unittests added for this specific thing, probably in test/select.py.

2006-04-11T12:49:37+00:00

Former user Account Deleted

attached ansisql.py.diff

(original author: creiht) 2nd try

2006-04-11T22:08:58+00:00

Former user Account Deleted

attached postgres.py.2.diff

(original author: creiht) Try 2

2006-04-11T22:09:28+00:00

Former user Account Deleted

(original author: creiht) I have attached my second attempt at the solution to this problem. I ran into another case where a string needed to be quoted, and found that it was going to be easier to make a more generalized solution to the problem.

This solution defines a function called {{{_quote_string()}}}, and if an engine needs to do quotes, it overrides that function. The {{{_quote_string()}}} is then called at each point a string would need to be quoted.

There are probably also other areas that need to be quoted, and I plan on catching them once I can set up a local instance of Postgres, and I will write a set of tests for each case.

If this proves to be a performance problem I see the following possibilities for optimization:

Add an option at the table level that will turn string quoting on or off and then we can bypass it completely if it is turned off.
When a Table class is created, generate and store a quoted version of the string for each column which would then be used in place of the {{{_quote_string()}}} function.

Let me know what you think.

2006-04-11T22:20:08+00:00

Mike Bayer repo owner

changed milestone to 0.3.0

this is not bad, i might adjust _quote_string to be something more like process_identifier() or something like that.

2006-07-08T21:10:27+00:00

Mike Bayer repo owner

im going to look into adding "quote=True" to Table so that users can manually identify if they want quoting on a table identifier. id rather not have anything with quoting in by default.

2006-07-17T20:35:38+00:00

Mike Bayer repo owner

marked as critical

also note we are blocking ~~#240~~

2006-07-17T20:36:53+00:00

Former user Account Deleted

Is this patch suited to handling quoted identifiers other than those that are a legal unquoted identifier when coerced to a single case? I believe all of the identifiers in the statement below are legal SQL.

CREATE TABLE "WorstCase" ( id SERIAL PRIMARY KEY, lowercase varchar(25), UPPERCASE varchar(25), MixedCase varchar(25), "ASC" varchar(25), "desc" varchar(25), "Union" varchar(25), "MixedCase" varchar(25), "With Space" varchar(25), "1st" varchar(25), "Similar..?" varchar(25), "Similar..#" varchar(25), ":symbol" varchar(25), "inner""quote" varchar(25) );

2006-07-31T11:40:06+00:00

Former user Account Deleted

attached delimited_identifiers.diff

alternative implementation requires quoting of identifiers when describing schema

2006-07-31T15:40:30+00:00

Former user Account Deleted

May be beneficial to upgrade a table.schema to an object with its own quotable name. Table.fullname can not be quoted directly for tables that belong to a schema other than the default.

Some things to keep in mind concerning delimiters. Both the beginning and ending delimiter should be configurable. The double quote character is the delimiter defined in the SQL specification, but some databases use an alternative character, ie MySQL uses the backtick "`". Some databases, eg MySQL, will accept different delimiters depending on the mode that the backend is configured to run in. In some instances database will operate in a mode that transposes the standard quoting characters for literals and identifiers; single quote identifiers, double quote literals. MSQL will accept the paired delimiters [].

2006-08-01T15:16:33+00:00

Former user Account Deleted

ansisql.py.diff doesn't apply cleanly. attaching an updated patch but visit_column() needs some attention.

2006-08-05T15:01:29+00:00

Former user Account Deleted

attached updated.diff

original patch updated to apply to current svn

2006-08-05T15:02:23+00:00

Former user Account Deleted

attached delimited_identifiers2.diff

a patch that makes use of quoted=True

2006-08-06T17:17:27+00:00

Former user Account Deleted

attached delimited_identifiers2.2.diff

updated patch. makes use of quote=True. lightly tested with sqlite, postgresql and mysql.

2006-08-07T00:34:10+00:00

Former user Account Deleted

attached delim_id_firebird.diff

same patch in delimited_identifiers2.2.diff for firebird

2006-08-07T10:19:18+00:00

Former user Account Deleted

uploaded patch for firebird, with some changes to make it work for reflection merged from ~~#240~~.

2006-08-07T10:21:28+00:00

Former user Account Deleted

Seems that there is need of "_prepare_identifier" also in ANSISchemaGenerator.visit_primary_key_constraint() (lib/sqlalchemy/ansisql.py)

@@ -662,16 +673,16 @@
         if len(constraint) == 0:
             return
         self.append(", \n")
-        self.append("\tPRIMARY KEY (%s)" % string.join([for c in constraint](c.name),', '))
+        self.append("\tPRIMARY KEY (%s)" % string.join([for c in constraint](self._prepare_identifier(c)),'
, '))

2006-08-07T10:38:06+00:00

Former user Account Deleted

One of the things that causes some failed tests is the distinction between name and fullname on tables. With _prepare_identifier() we've lost this distinction. Should we separate it back out and create _prepare_name() and _prepare_fullname() instead? Could we unify the concept further and create the ability to get a properly quoted tablename.columnname when needed, instead of having to ask for a quoted table name and then a quoted column name?

2006-08-07T14:03:18+00:00

Former user Account Deleted

attached delimited_identifiers3.diff

better DRYer patch

2006-08-08T15:24:05+00:00

Former user Account Deleted

I'm now pretty happy with the above patch. Please take a look and comment. Still need to learn how to write a few tests for it. And it needs more testing/inspection to find places where quoting may be needed but is not yet being done.

2006-08-08T15:26:22+00:00

Mike Bayer repo owner

changed status to resolved

after some more iterations of the attached patch i think we have a decent solution working, resulting from changeset:1785 changeset:1787 changeset:1788. rudimentary docs for quote=True added as well.

2006-08-12T15:04:09+00:00

Former user Account Deleted

removed status
changed status to open

It would be nice if reflected columns were quoted. The obvious way is something like

self.quote = not(self.name == self.name.lower())

in the Column constructor.

I've implemented this here:

http://go2.ie/sqlalchemy-quote-column-names.diff

The patch is a bit big due to having to fix some test cases that used a non-quoted mixed case column name. Should these sorts of columns be supported? They don't really work that well with the ORM since Python attribute names are case-sensitive.

2006-08-14T08:23:22+00:00

Mike Bayer repo owner

changed component to sql
changed milestone to blue sky
marked as minor

so let me get this straight....

you want to define a table like this:

t = Table('sometable', meta, 
   Column('SomeColumn', Integer, quote=True)
)

and then the column would be accessed like this:

t.c.somecolumn

and calling

t.c.SomeColumn

which is the actual name of the column, would raise an error ? i think thats pretty counterintuitive.

There is a feature which i think satisfies what you really want, which is this:

t = Table('sometable', meta,
    Column('SomeColumn', Integer, quote=True, key='somecolumn')
)

then you can explicitly say, "id like to reference column X using the name Y". and you know explicit is better than implicit ;)

i dont push the "key" parameter so much since there have been a lot of mapper bugs that come up and a lack of unit tests using "key". but ultimately that should all be fixed anyway. i bet theres not much buggy behavior left anyway.

2006-08-14T11:19:13+00:00

Former user Account Deleted

No not at all :)

What I would like is to have consistent case sensitivity at table and model level, in conjunction with autoload. At the moment it doesn't work unless you explicity set quote = True on each column after reflecting the table.

My use case is this:

############# foo.sql

create table "tblFoo"
(
    "iFooID" serial primary key,
    "sTitle" text not null
);

==============================================

############# foo.py

import sqlalchemy

metaData = sqlalchemy.BoundMetaData('postgres:///something')
tblFoo = sqlalchemy.Table('tblFoo', metaData, quote = True)

class Foo:
    def __init__(self, sTitle):
        self.sTitle = sTitle

sqlalchemy.Mapper(Foo, tblFoo)
session = sqlalchemy.create_session()
oFoo = Foo('Hello World')
session.save(oFoo)
session.flush()

==============================================

If I do this at the moment, I have to do the following:

for column in tblFoo.columns:
    if column.name != column.name.lower():
        column.quote = True

Then the above example will work.

2006-08-14T13:25:18+00:00

Former user Account Deleted

Some further information:

You won't see this issue unless you use autoload.

When you create a table in PostgreSQL without quoting, table and column names get converted to lower case. When information_schema.py reads the column names, it creates Column objects with the lowercase column names. Hence the model in the above example gets attributes ifooid and stitle rather than iFooID and sTitle.

This hurts my eyes :)

Yes, I know Hungarian notation is evil etc. I'm stuck with it :(

2006-08-14T13:30:30+00:00

Former user Account Deleted

Arg, the above should be

tblFoo = sqlalchemy.Table('tblFoo', metaData, autoload = True, quote = True)

2006-08-14T13:35:43+00:00

Mike Bayer repo owner

oh, i read your patch incorrectly, it confused me that you changed all the column identifiers in one of the unittests to be all lowercase.

we might go with your idea except that the "quote" flag is already revealing its limitation, that its setting an attribute on an entity that has no allegiance to any particular database. the same Table could be used on a sqlite, oracle, mysql database, etc. by sticking a "quote this column if theres mixed case" is a band-aid; what if the column name is a reserved word on just MySQL ? then someones going to want quoting upon reflection for that too. then the same table on Postgres has quoting set on some random column for no apparent reason. similarly for mixed case, only postgres even has the quote-mixed-case requirement in the first place.

we started with "quote=True" because it was the quickest way to allow any quoting at all to work, so that such a table could even be used. but the better solution is that each database dialect would know exactly what identifiers require quoting, and would use those rules at query compilation time instead of checking a quote=True flag. The Table itself would again have no idea about quoting flags (well, probably best to leave the flag available in case its needed).

i guess what im saying is that the quote flag is an interim hack, pending some contributors doing the work of identifying all (or at least some) of the quotable conditions for each database, and im a little antsy building too much on top of it. id rather modify the recent svn checkin to check "quote=True" plus some additional dialect-specific method, then put your "check the case convention" rule in the postgres dialect (one of the first patches on this ticket did it that way). as a bonus id also like postgres to cache the result of this check somehow so that it isnt doing thousands of string comparisons on every compilation.

2006-08-14T13:43:49+00:00

Former user Account Deleted

Yeah, my first attempt at fixing it was in information_schema.py....

I will look into the reserved keywords thing and see if I can come up with a better patch.

FWIW the key option you mentioned would actually solve the problem, but then I lose the DRY-ness of autoload :/ autoload rocks :)

2006-08-14T13:52:30+00:00

Mike Bayer repo owner

marked as major

2006-08-15T16:18:46+00:00

Mike Bayer repo owner

changed milestone to 0.3.0

2006-08-15T16:18:46+00:00

Former user Account Deleted

zzzeek, I think a few use cases would help me better understand where to take this.

The coupling between Table/Column and quoting shouldn't be a problem for the autoload=True use case. So perhaps a sufficient measure would be to add quote=True during reflection. This would satisfy the caching desire.

For cases where a user is CREATEing the tables they of course have the freedom to choose what gets quoted. So again the coupling shouldn't cause problems. And requiring the user to specify the extra flag rather than guessing for them is not forcing too much extra work on them.

I fully agree that the quoting is not really a property of the identifier or the identified object alone. The database figures in there too. If you can give a couple examples of where the coupling causes problems for people who need quoting, that should help me find a solution.

2006-08-15T18:21:16+00:00

Former user Account Deleted

I can't quite formulate a convincing example, but there may be some problem if a user wants to use a table defined for one dialect where an identifier is legal to be used unquoted attempts to use that in a dialect where the identifier is reserved. Maybe in migrating between databases? That might be solved by introducing a requote() method that could find the minimal quoting needed when recasting the table against another dialect.

2006-08-15T19:50:48+00:00

Mike Bayer repo owner

all we need here is a function inside of each dialect, give it a string, and it determines the question, "does this identifier need to be quoted" ? thats the work here that im looking for.

i just added the method in changeset:1802, to the base class in ansisql.py. just add a _requires_quotes method to all the subclasses of ANSIIdentifierPreparer in each dialect. if we get that working, and get reflection going (with a little unit test too), I can drill in to try to optimize the performance of this.

the quote=True flag is definitely not a big deal. however, there is a tradition in this code, which adhering to has always ultimately proved to be far more necessary than it first seems, which is that the Dialect is solely responsible for all generation of SQL strings that get sent to a database; as few string-generation concepts as possible should be stored in schema/SQL expression objects. by having reflection turn on "quote=True" on the Table/Column, youre just breaking that paradigm a tiny bit, adding a very small inconsistency to the code (and yes, maybe that mythical case where one DB wants quoting and the other doesnt).

2006-08-15T20:44:25+00:00

Former user Account Deleted

I'll start researching and working on these methods tomorrow. I think I saw a list of reserved words per database in the code somewhere already.

I know you've made it clear to me multiple times that you don't think the case folding stuff is necessary. But again the natural case of identifiers is part of the quoting rules. in postgres the natural identifier case is lower (which breaks the sql spec) so it works well with the custom of lowercasing identifiers in sqlalchemy in places. But other dbs that follow sql use upper as the natural case. When you pass in a query the lexer looks for quotes to determine if the case needs to be folded before matching identifiers. When you reflect you will retrieve the case stored by the db which you can test against legal characters to determine if it needs quoting. But if you fold the reflected names to lower for the python identifiers (as I think you are doing now to avoid typing uppercase all over) you lose important information. You are asking to get that back without the extra information given by quotes. I think you will either have to store the proper reflected name with case, or the quote= parameter. In postgresql IDENTIFIER is not equal to "IDENTIFIER". And in any db that follows the sql standard identifier is not equal to "identifier". Having said all that, the simple cases functioning properly will be enough to enable my work. :-)

2006-08-16T00:52:36+00:00

Former user Account Deleted

Here is a little test I'm using to understand various db's quoting behavior:

1) create a table. for mysql one of the MixedCase columns needs to be omitted and the quoting character changed.
CREATE TABLE "WorstCase2" (
    id INTEGER PRIMARY KEY,
    lowercase INTEGER,
    UPPERCASE INTEGER,
    MixedCase INTEGER,
    "lowered" INTEGER,
    "UPPERED" INTEGER,
    "ASC" INTEGER,
    "desc" INTEGER,
    "Union" INTEGER,
    "MixedCase" INTEGER,
    "MixedCaseD" INTEGER,
    "With Space" INTEGER,
    "1st" INTEGER,
    "Similar..?" INTEGER,
    "Similar..#" INTEGER,
    ":symbol" INTEGER,
    "inner""quote" INTEGER
);

2) describe the table in the db's syntax
postgresql: \d "WorstCase2"
firebird: SHOW TABLE "WorstCase2";

3) do any of the following selects cause errors.
SELECT LOWERED FROM "WorstCase2";
SELECT uppered FROM "WorstCase2";
SELECT mixedcased FROM "WorstCase2";

MySQL reports unquoted column names in the case in which they were originally created. Postgresql folds to lower. Firebird folds to upper.

MySQL executes all three selects. Postgresql should fail on the second and third select. Firebird fails on the first and third select.

2006-08-16T12:51:35+00:00

Former user Account Deleted

attached require_quotes1.diff

a first look at quoting rules for firebird, mysql and postgresql

2006-08-16T18:47:30+00:00

Former user Account Deleted

So far quoting rules boil down to: * a list of illegal identifiers that must be quoted in a specific version of a specific database * characters that are allowed as the initial character of an unquoted identifier * characters that are allowed in unquoted identifiers

I think after taking a good shot at getting it right it still might end up being an iterative process. Fixing things as problems are reported. Because the docs aren't always very clear.

Since missing _require_quotes() won't negatively affect other databases, should I focus on postgresql and wait for users of other databases to fill in code for their databases? Or is it worth making an attempt at all the supported databases for them to use as a starting point?

Do you want to handle differences in reserved words between versions of the same database? How?

2006-08-16T19:00:46+00:00

Mike Bayer repo owner

yup, that is exactly the way to go about this. you work out the framework that allows the problem to be solved, and address individual cases as they come up....trying to do it all at once is overwhelming.

so. just do postgresql for now. im going to go in and probably have the base dialect cache everything it finds as well, so it isnt hitting _require_quotes too much after a table is first used. then when people report issues with other DBs (like mysql, which has issues already), we can address those too and theres already a place to put it.

as far as reserved words, we should try to support the union of all DB versions at once (i.e., just a list of all the words). probably a module level util.Set() that contains a listing of words so we can quickly check.

2006-08-16T20:20:21+00:00

Former user Account Deleted

I still need a little feedback on how you see databases that return unquoted delimiters in uppercase affecting this. Firebird is such a database. I've written up the _require_quotes() function for Firebird in the above patch to help you see where I think the problem will come in.

2006-08-17T00:44:54+00:00

Mike Bayer repo owner

ok, i ask firebird to reflect a column, and it says FOO. so, whats the problem exactly ? that youre not sure if its "FOO" or just FOO ?

also, dont all these _require_quotes methods need to tell if the identifier name is using MixedCase ?

2006-08-17T11:48:03+00:00

Former user Account Deleted

The problem is this: you ask firebird for the column name. firebird gives it to you in natural case which is upper. you set table.name = value.lower(). now you ask if this identifier needs to be quoted. lowercase characters are not legal in an unqouted identifier in firebird so the method says yes. you ask firebird for "foo" which doesn't match "FOO" or FOO.

reserved_words = ['analyse', 'analyze', 'and', 'any', 'array', 'as', 'asc', 'asymmetric', 'authorization', 'between', 'binary', 'both', 'case', 'cast', 'check', 'collate', 'column', 'constraint', 'create', 'cross', 'current_date', 'current_role', 'current_time', 'current_timestamp', 'current_user', 'default', 'deferrable', 'desc', 'distinct', 'do', 'else', 'end', 'except', 'false', 'for', 'foreign', 'freeze', 'from', 'full', 'grant', 'group', 'having', 'ilike', 'in', 'initially', 'inner', 'intersect', 'into', 'is', 'isnull', 'join', 'leading', 'left', 'like', 'limit', 'localtime', 'localtimestamp', 'natural', 'new', 'not', 'notnull', 'null', 'off', 'offset', 'old', 'on', 'only', 'or', 'order', 'outer', 'overlaps', 'placing', 'primary', 'references', 'right', 'select', 'session_user', 'similar', 'some', 'symmetric', 'table', 'then', 'to', 'trailing', 'true', 'union', 'unique', 'user', 'using', 'verbose', 'when', 'where']('all',)
legal_characters = string.ascii_lowercase + string.digits + '_$'

class PGIdentifierPreparer(ansisql.ANSIIdentifierPreparer):
    def _fold_identifier_case(self, value):
        return value.lower()
    def _requires_quotes(self, value):
        retval = bool(len([for x in str(value) if x not in legal_characters](x)))
        if not retval and (value[0](0).isdigit() or value in reserved_words):
            retval = True
        return retval

Notice that legal_characters only contains lowercase because uppercase characters are not legal in an unquoted identifier in postgresql. It is unneccesary to look for mixed case; checking for illegal characters is enough.

(Can't give you an updated diff right now because I'm working through the patch for ~~#71~~ too and I have both in the code file. I need a better way to work on multiple issues and keep the changesets separate.)

2006-08-17T12:04:03+00:00

Former user Account Deleted

if the quote true flag causes to much coupling between the Table and the generated SQL, would a Table/Column.natural_case=True flag be separate enough? this would signal that the identifier originally came in the natural case of the database and need not be quoted. It could be converted to lower for use in python and converted back to natural case before sending to the db (unnecesary of course) or before it is checked as a reserved word. We would not be storing a directive for the SQL, just storing (in a database agnostic manner) extra information that would otherwise be lost about the identifier.

2006-08-17T12:09:36+00:00

Mike Bayer repo owner

yes, i like natural_case. because it says something about the identifier itself and its verbiage, but doesnt get into how the database should treat it. its also a nice flag to signal the "case folding" stuff. i agree the firebird problem sucks but this should fix it.

2006-08-17T13:09:44+00:00

Former user Account Deleted

attached case_and_reflection1.diff

address case and reflection for postgresql

2006-08-17T18:32:31+00:00

Former user Account Deleted

the above patch addresses this issue as well as the issue presented in ~~#71~~ (and ~~#60~~ ?).

please comment

2006-08-17T18:36:23+00:00

Former user Account Deleted

attached case_and_reflection2.diff

corrected patch. makes reflecting quoted identifiers work in postgresql

2006-08-20T15:01:57+00:00

Mike Bayer repo owner

OK, if you check out changeset:1810 youll see how i did this (added some caching layer to the ansisql.py so the calculations are only done once per table/column). going to leave this ticket open since we got to get all the other DB's working, see if reflection actually works, etc. Also, i think id like to see if this newer idea with the "natural_case" flag can eliminate the need for "quote=True" in most if not all cases (need to play with this some more in general).

nice job and i think we're getting close.

2006-08-20T20:42:02+00:00

Former user Account Deleted

Yes natural_case seems to be much more useful. I think we should be able to remove quote=True all together now if it is desired.

The only thing that could make in necessary is hinted at here:

http://www.firebirdsql.org/manual/qsg10-firebird-sql.html

If there exists a mythical database in which neither

identifier == "identifier"

nor

identifier == "IDENTIFIER"

are true. I haven't found a database like this yet. maybe oracle.

2006-08-20T23:51:47+00:00

Former user Account Deleted

attached firebird_quoted_reflection1.diff

first stab at properly reflecting quoted identifiers from firebird

2006-08-22T01:22:32+00:00

Former user Account Deleted

The fact that we prefer lowercase names in python and yet firebird requires you to ask for upper case names during reflection is screwing with my head. Above is my first try with that. I think quoted firebird reserved words will not be lowercased into python with this patch but I'm not sure. We should determin if that behavior is desirable.