ColumnElement with label() is incorrectly referenced in group_by/order_by producing invalid query

Issue #3385 resolved
Andrey Semenov created an issue

https://gist.github.com/SantjagoCorkez/db207a7b533d1d6f05ae

When providing a ColumnElement itself into .group_by()/.order_by() the query compiler does not automatically quote the reference to that producing a query that in some circumstances becomes invalid (for example in case there are columns with the same name as the ColumnElement's label within the query).

This became an issue at 1.0.0 (and not resolved in 1.0.1)

Comments (19)

  1. Mike Bayer repo owner

    Wow....can you please provide the full SQL being rendered? at least?

    The change here is http://docs.sqlalchemy.org/en/rel_1_0/changelog/migration_10.html#order-by-and-group-by-are-special-cases. It is by design, so the "state" column being expressed as a single identifier is expected, I also see something about a column called "Order.state" in this fragment of code, but I'm hoping you can appreciate those of us who fix bugs need to have full context, thanks.

  2. Mike Bayer repo owner

    a simpler form of this query works on PG without fail, using a table that has a column called "state":

    SELECT count(t.foob) AS count_1, max(t.state) AS x, CASE WHEN (t.state = %(state_1)s) THEN %(param_1)s WHEN (t.state = %(state_2)s) THEN %(param_2)s END AS state 
    FROM t 
    WHERE t.state > %(state_3)s GROUP BY state
    

    What postgresql version is this?

  3. Andrey Semenov reporter

    The request rendered is in a docstring in the gist provided. Or you need a fully compiled request with params substituted?

  4. Mike Bayer repo owner

    I've taken the time to unwrap your example into a proper test case where this now reproduces. Here is a test case:

    from sqlalchemy import *
    
    e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
    
    m = MetaData()
    order_logs = Table(
        'order_logs', m,
        Column('id', Integer, primary_key=True),
        Column('state', Integer),
        Column('order_id', Integer)
    )
    orders = Table(
        'orders', m,
        Column('id', Integer, primary_key=True),
        Column('state', Integer),
    )
    
    
    expr = case(
        [
            (order_logs.c.state.in_([1, 2]), 'x'),
            (order_logs.c.state.in_([3, 4]), 'y'),
        ],
        else_='z'
    ).label('state')
    
    
    expr2 = func.sum(
        case(
            [
                (orders.c.state.in_([1, 2]), 1),
                (orders.c.state.in_([3, 4]), 2),
            ],
            else_=3
        )
    ).label("key")
    
    
    m.create_all(e)
    
    
    stmt = select(
        [
            func.count(order_logs.c.id).label("count"),
            expr,
            expr2
        ]
    ).where(order_logs.c.order_id == orders.c.id)\
        .group_by(expr)
    
    e.execute(stmt)
    
  5. Mike Bayer repo owner

    Postgresql is definitely being weird on this one. for this one (notice I spelled "state" as "statee"):

    SELECT count(order_logs.id) AS count, CASE WHEN (order_logs.statee IN (%(state_1)s, %(state_2)s)) THEN %(param_1)s WHEN (order_logs.statee IN (%(state_3)s, %(state_4)s)) THEN %(param_2)s ELSE %(param_3)s END AS state 
    FROM order_logs, orders 
    WHERE order_logs.order_id = orders.id GROUP BY state
    

    the error is: (psycopg2.ProgrammingError) column "order_logs.statee" must appear in the GROUP BY clause or be used in an aggregate function.

    But! If you take away the "orders" table:

    SELECT count(order_logs.id) AS count, CASE WHEN (order_logs.statee IN (%(state_1)s, %(state_2)s)) THEN %(param_1)s WHEN (order_logs.statee IN (%(state_3)s, %(state_4)s)) THEN %(param_2)s ELSE %(param_3)s END AS state 
    FROM order_logs GROUP BY state
    

    now it's fine! I'm not understanding that at the moment.

  6. Andrey Semenov reporter

    This was rendered with 0.9.x

    SELECT count(order_logs.id) AS count, CASE WHEN (order_logs.state IN (1003, 1202, 1203, 1200, 1201)) THEN 'refuses' WHEN (order_logs.state IN (1100)) THEN 'errors' WHEN (order_logs.state IN (1001)) THEN 'no_answer' WHEN (order_logs.state IN (1002)) THEN 'call_later' WHEN (order_logs.state IN (1999)) THEN 'success' ELSE 'unknown' END AS state, sum(CASE WHEN (orders.state IN (1003, 1202, 1203, 1200, 1201, 1100)) THEN 1 ELSE 0 END) AS refused, order_logs.ts_spawn - (order_logs.ts_spawn - -10800) %% 3600 AS raw_key, order_logs.ts_spawn - (order_logs.ts_spawn - -10800) %% 3600 AS key 
    FROM order_logs, orders 
    WHERE order_logs.order_id = orders.id AND order_logs.state IN (1999, 1001, 1100, 1002, 1201, 1200, 1203, 1202, 1003) AND order_logs.ts_spawn >= 1429822800 AND order_logs.ts_spawn < 1429909200 AND ((order_logs.flags & 1) > 0 OR order_logs.operator_id IN (42, 46, 120, 133)) GROUP BY CASE WHEN (order_logs.state IN (1003, 1202, 1203, 1200, 1201)) THEN 'refuses' WHEN (order_logs.state IN (1100)) THEN 'errors' WHEN (order_logs.state IN (1001)) THEN 'no_answer' WHEN (order_logs.state IN (1002)) THEN 'call_later' WHEN (order_logs.state IN (1999)) THEN 'success' ELSE 'unknown' END, "raw_key", "key"
    

    And this is rendered with 1.0.x

    SELECT count(order_logs.id) AS count, CASE WHEN (order_logs.state IN (1003, 1202, 1203, 1200, 1201)) THEN 'refuses' WHEN (order_logs.state IN (1100)) THEN 'errors' WHEN (order_logs.state IN (1001)) THEN 'no_answer' WHEN (order_logs.state IN (1002)) THEN 'call_later' WHEN
    (order_logs.state IN (1999)) THEN 'success' ELSE 'unknown' END AS state, sum(CASE WHEN (orders.state IN (1003, 1202, 1203, 1200, 1201, 1100)) THEN 1 ELSE 0 END) AS refused, order_logs.ts_spawn - (order_logs.ts_spawn - -10800) %% 3600 AS raw_key, order_logs.ts_spawn - (ord
    er_logs.ts_spawn - -10800) %% 3600 AS key
    FROM order_logs, orders
    WHERE order_logs.order_id = orders.id AND order_logs.state IN (1999, 1001, 1100, 1002, 1201, 1200, 1203, 1202, 1003) AND order_logs.ts_spawn >= 1429736400 AND order_logs.ts_spawn < 1429822800 AND ((order_logs.flags & 1) > 0 OR order_logs.operator_id IN (42, 46, 120, 133))
     GROUP BY state, "raw_key", "key"
    
  7. Mike Bayer repo owner

    However. In this case, while this behavior is definitely different, I think the current behavior is better. Keeping in mind that your issue goes away if you just group_by() the original element, and not the label, e.g. group_by(state_column.element) for now, but I will add a new method unlabeled() to all column elements to make this easier, let's review the feature.

    The change is simply that if you say group_by() and pass an object that is actually a Label object, it is intuitive that this is referring to that label as expressed in the columns clause:

    from sqlalchemy import table, column, select, func
    
    t1 = table('t', column('a'), column('b'))
    
    expr = func.foo(t.c.a)
    
    labeled_expr = expr.label('foobar')
    
    # SELECT b, foo(a) AS foobar FROM t GROUP BY foo(a)
    stmt1 = select([t1.c.b, labeled_expr]).group_by(expr)
    
    # SELECT b, foo(a) AS foobar FROM t GROUP BY foobar
    stmt2 = select([t1.c.b, labeled_expr]).group_by(labeled_expr)
    

    So workaround for 1.0.1 is to group_by(state_column.element), and I will add a convenience method .unlabeled() to force a labeled expression to render without any label. Let me know if that works for you.

  8. Mike Bayer repo owner

    you know what nevermind, I think I'm just going to disable this feature again for now. The migration notes don't make it clear that this applies to label() objects also.

  9. Andrey Semenov reporter

    Wow. Maybe it's better to update docs? I've just updated the code at around the mentioned at the gist to make the label do not confuse PG's query compiler (renamed label to 'state_mark') and that provided a more clear and, I guess more lightweight for Query Compiler and Planner, query:

    SELECT count(order_logs.id) AS count, CASE WHEN (order_logs.state IN (1003, 1202, 1203, 1200, 1201)) THEN 'refuses' WHEN (order_logs.state IN (1100)) THEN 'errors' WHEN (order_logs.state IN (1001)) THEN 'no_answer' WHEN (order_logs.state IN (1002)) THEN 'call_later' WHEN (order_logs.state IN (1999)) THEN 'success' ELSE 'unknown' END AS state_mark, sum(CASE WHEN (orders.state IN (1003, 1202, 1203, 1200, 1201, 1100)) THEN 1 ELSE 0 END) AS refused, order_logs.ts_spawn - (order_logs.ts_spawn - -10800) %% 3600 AS raw_key, order_logs.ts_spawn - (order_logs.ts_spawn - -10800) %% 3600 AS key
    FROM order_logs, orders
    WHERE order_logs.order_id = orders.id AND order_logs.state IN (1999, 1001, 1100, 1002, 1201, 1200, 1203, 1202, 1003) AND order_logs.ts_spawn >= 1429736400 AND order_logs.ts_spawn < 1429822800 AND ((order_logs.flags & 1) > 0 OR order_logs.operator_id IN (42, 46, 120, 133)) GROUP BY state_mark, raw_key, key
    
  10. Mike Bayer repo owner

    I'm really torn. I'm looking into undoing the feature, and I put a lot of work into getting these labels to render. But I already had to adjust for SQL server, and now PG. I dunno :). just this one spot is impacting you right?

  11. Mike Bayer repo owner

    also, its more GROUP BY that is problematic, rather than ORDER BY, but making it one way in ORDER BY and another in GROUP BY seems like its getting even more confusing.

  12. Mike Bayer repo owner

    OK. So I think, first off it's easier to keep this working by default the way it is in 0.9, because then ill get less of these regressions. Also, what I don't like about it right now is that it takes place for PG, MySQL, not for MSSQL, etc., so there's implicit changing stuff going on here.

    So now my thinking is, make it an option, for when people want to order by / group by the label, but don't want to type out text("some label"):

    This would be new features on Label:

    expr = expression.label("foo")
    
    stmt = select([expr]).group_by(expr.asref())
    
    expr = expression.label("foo", asref=True)
    stmt = select([expr]).group_by(expr)
    
  13. Mike Bayer repo owner

    wow. OK. 0.9 already does the label thing for ORDER BY fully, just not GROUP BY. So, we already do it differently for ORDER BY, even though the implementation in 1.0 is totally different, which is why I thought this was new. OK. Well, until my next realization, seems like I just will switch off this behavior in GROUP BY and we're done.

  14. Mike Bayer repo owner
    • Fixed a regression that was incorrectly fixed in 1.0.0b4 (hence becoming two regressions); reports that SELECT statements would GROUP BY a label name and fail was misconstrued that certain backends such as SQL Server should not be emitting ORDER BY or GROUP BY on a simple label name at all; when in fact, we had forgotten that 0.9 was already emitting ORDER BY on a simple label name for all backends, as described in :ref:migration_1068, as 1.0 had rewritten this logic as part of 🎫2992.

    In 1.0.2, the bug is fixed both that SQL Server, Firebird and others will again emit ORDER BY on a simple label name when passed a :class:.Label construct that is expressed in the columns clause, and no backend will emit GROUP BY on a simple label name in this case, as even Postgresql can't reliably do GROUP BY on a simple name in every case. fixes #3338, fixes #3385

    → <<cset f9275198c304>>

  15. Log in to comment