refine distinct + order_by "add primary cols" logic for non wrapped selectable

Issue #3641 resolved
Mehdi Gmira created an issue

I encountred a bug with the postgres specific "DISTINCT ON" in SQLAlchemy The real SA query I'm using is more complicated than this, but it sums up to :

query = db.session.query(c1, c2, c3, c4).distinct(c1, c2).order_by(c1, c2, c3)

when i compile the query by doing:

query.statement.compile(dialect=postgres.dialect())

The actual query is

SELECT DISTINCT ON (c1, c2) c1, c2, c3, c4, c1, c2, c3 FROM .... ORDER BY c1, c2, c3

As you can see, the columns c1, c2, c3 are repeated. After some investigation, it seems like this is caused by the order_by() part of the query (removing order by removes the repeated columns). The weird thing is that when actually running the query (query.all()), I only get the column that I requested, and not the repeated ones as in the compiled statement. This would have been fine with me. But the problem is it's preventing me from using a union(), because SA thinks I'm requesting more columns on one side of union().

I think It's a bug, but maybe I'm just doing something wrong.

Comments (19)

  1. Mike Bayer repo owner

    hi there -

    thanks for reporting. Unfortuntately I cannot produce this result (see below). Please provide an mvce case. Thanks!

    (hint: I'm guessing that the "c1" "c2" "c3" are not the same objects sent to order_by() as those sent to query()).

    from sqlalchemy import Integer, Column
    from sqlalchemy.orm import Session
    from sqlalchemy.ext.declarative import declarative_base
    
    Base = declarative_base()
    
    
    class A(Base):
        __tablename__ = 'a'
        id = Column(Integer, primary_key=True)
        x = Column(Integer)
        y = Column(Integer)
        z = Column(Integer)
    
    
    s = Session()
    
    q = s.query(A.x, A.y, A.z).distinct(A.x, A.y).order_by(A.x, A.y, A.z)
    
    from sqlalchemy.dialects import postgresql
    print q.statement.compile(dialect=postgresql.dialect())
    

    output:

    SELECT DISTINCT ON (a.x, a.y) a.x, a.y, a.z 
    FROM a ORDER BY a.x, a.y, a.z
    
  2. Mehdi Gmira reporter

    Sorry, You're right. I just found out that SA has nothing to do with this. I was lost if a bunch of subqueries :), and chose the wrong one in the order_by clause.

    Have a good day, and thank you for the great work you're doing on SQLAlchemy :)

  3. Mehdi Gmira reporter

    Hey,

    I realized there is actually a problem with the distinct on implementation.

    Here's a code sample

    engine = create_engine('postgresql://**')
    Session = sessionmaker()
    Session.configure(bind=engine)
    session = Session()
    
    class Activity(Base):
        __tablename__ = 'activity'
    
        id = Column(Integer, primary_key=True)
        name = Column(String)
        last_name = Column(String)
        day = Column(Integer)
        action = Column(String)
    
    sub = (session.query(Activity.name, Activity.last_name, Activity.day, Activity.action)
           .filter(Activity.name == 'test').subquery())
    

    Here's a dumb query:

    q = (session.query(sub.c.name, sub.c.last_name,
                       sub.c.day, sub.c.action).distinct(sub.c.name, sub.c.last_name)
         .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))
    

    As expected, this produces:

    SELECT DISTINCT ON (anon_1.name, anon_1.last_name) anon_1.name, anon_1.last_name, anon_1.day, anon_1.action 
    FROM (SELECT activity.name AS name, activity.last_name AS last_name, activity.day AS day, activity.action AS action 
    FROM activity 
    WHERE activity.name = %(name_1)s) AS anon_1 ORDER BY anon_1.name, anon_1.last_name, anon_1.day DESC
    

    But this doesn't work:

    q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                       sub.c.day.label('d'), sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
         .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))
    

    It produces

    SELECT DISTINCT ON (anon_1.name, anon_1.last_name) anon_1.name AS n, anon_1.last_name AS l, anon_1.day AS d, anon_1.action AS a, anon_1.name, anon_1.last_name, anon_1.day 
    FROM (SELECT activity.name AS name, activity.last_name AS last_name, activity.day AS day, activity.action AS action 
    FROM activity 
    WHERE activity.name = %(name_1)s) AS anon_1 ORDER BY anon_1.name, anon_1.last_name, anon_1.day DESC
    
  4. Mike Bayer repo owner

    this should be unnecessary, no ORM tests fail except one assert SQL that can be changed:

    diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
    index 08600c3..4266e68 100644
    --- a/lib/sqlalchemy/orm/query.py
    +++ b/lib/sqlalchemy/orm/query.py
    @@ -3319,7 +3319,7 @@ class Query(object):
             if not context.order_by:
                 context.order_by = None
    
    -        if self._distinct and context.order_by:
    +        if False and self._distinct and context.order_by:
                 order_by_col_expr = list(
                     chain(*[
                         sql_util.unwrap_order_by(o)
    

    it might be worth looking back into history to see if when query._compile_context() was split into "compound" and "simple", the add cols logic here should never have been applied to the "simple" case as this is specifically used to allow eager subquery wrapping to work.

    since a behavior change, this needs to be at a major version boundary so hopefully can get this into 1.1.

  5. Mike Bayer repo owner

    certainly at the very least, the assumption of this logic is back when we didn't return individual columns so it at least shouldn't take effect when we arent returning full entities.

  6. Mike Bayer repo owner

    here's the test:

    #!
    
    py.test test/orm/test_query.py -k DistinctTest --db postgresql
    
    
    ProgrammingError: (psycopg2.ProgrammingError) for SELECT DISTINCT, ORDER BY expressions must appear in select list
    LINE 2: ...ddresses ON users.id = addresses.user_id ORDER BY addresses....
                                                                 ^
     [SQL: 'SELECT DISTINCT users.id AS users_id, users.name AS users_name \nFROM users JOIN addresses ON users.id = addresses.user_id ORDER BY addresses.email_address DESC']
    
  7. Mike Bayer repo owner

    alrighty we'll just make it smarter:

    diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
    index 08600c3..3c0f67a 100644
    --- a/lib/sqlalchemy/orm/query.py
    +++ b/lib/sqlalchemy/orm/query.py
    @@ -3254,12 +3254,11 @@ class Query(object):
             # then append eager joins onto that
    
             if context.order_by:
    -            order_by_col_expr = list(
    -                chain(*[
    -                    sql_util.unwrap_order_by(o)
    -                    for o in context.order_by
    -                ])
    -            )
    +            order_by_col_expr = \
    +                sql_util.expand_column_list_from_order_by(
    +                    context.primary_columns,
    +                    context.order_by
    +                )
             else:
                 context.order_by = None
                 order_by_col_expr = []
    @@ -3320,14 +3319,11 @@ class Query(object):
                 context.order_by = None
    
             if self._distinct and context.order_by:
    -            order_by_col_expr = list(
    -                chain(*[
    -                    sql_util.unwrap_order_by(o)
    -                    for o in context.order_by
    -                ])
    -            )
    -            context.primary_columns += order_by_col_expr
    -
    +            context.primary_columns += \
    +                sql_util.expand_column_list_from_order_by(
    +                    context.primary_columns,
    +                    context.order_by
    +                )
             context.froms += tuple(context.eager_joins.values())
    
             statement = sql.select(
    diff --git a/lib/sqlalchemy/sql/util.py b/lib/sqlalchemy/sql/util.py
    index 1dcf0ee..0e8c77c 100644
    --- a/lib/sqlalchemy/sql/util.py
    +++ b/lib/sqlalchemy/sql/util.py
    @@ -176,6 +176,27 @@ def unwrap_order_by(clause):
         return result
    
    
    +def expand_column_list_from_order_by(collist, order_by):
    +    """Given the columns clause and ORDER BY of a selectable,
    +    return a list of column expressions that can be added to the collist
    +    corresponding to the ORDER BY, without repeating those already
    +    in the collist.
    +
    +    """
    +    cols_already_present = set([
    +        col.element if col._order_by_label_element is not None
    +        else col for col in collist
    +    ])
    +    return [
    +        col for col in
    +        chain(*[
    +            unwrap_order_by(o)
    +            for o in order_by
    +        ])
    +        if col not in cols_already_present
    +    ]
    +
    +
     def clause_is_present(clause, search):
         """Given a target clause and a second to search within, return True
         if the target is plainly present in the search without any
    
  8. Mike Bayer repo owner

    you know actually there is no problem with the behavior as is, since those columns aren't returned by the Query. It's only ugly in the string, and if you use query.statement in a Core sense.

  9. Mike Bayer repo owner
    • A refinement to the logic which adds columns to the resulting SQL when :meth:.Query.distinct is combined with :meth:.Query.order_by such that columns which are already present will not be added a second time, even if they are labeled with a different name. Regardless of this change, the extra columns added to the SQL have never been returned in the final result, so this change only impacts the string form of the statement as well as its behavior when used in a Core execution context. Additionally, columns are no longer added when the DISTINCT ON format is used, provided the query is not wrapped inside a subquery due to joined eager loading. fixes #3641

    → <<cset ff3be95620b6>>

  10. Mehdi Gmira reporter

    I'm not sure I understood what you mean by "the extra columns added to the SQL have never been returned in the final result" in the results. If you run

    q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                       sub.c.day.label('d'), sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
         .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc())).all()
    

    You do get more columns than expected. Plus if you wanted to use this query in a union sql statment, you'd get an error.

  11. Mike Bayer repo owner

    You do get more columns than expected.

    not at all, because these columns are not added as entities to the Query. Here's the full test, which I will run against 0.9 where we add all the extra columns:

    from sqlalchemy import *
    from sqlalchemy.orm import *
    from sqlalchemy.ext.declarative import declarative_base
    
    Base = declarative_base()
    
    
    class Activity(Base):
        __tablename__ = 'activity'
    
        id = Column(Integer, primary_key=True)
        name = Column(String)
        last_name = Column(String)
        day = Column(Integer)
        action = Column(String)
    
    engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)
    
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
    
    session = Session(engine)
    
    session.add(Activity(name='test'))
    
    sub = (session.query(Activity.name, Activity.last_name, Activity.day, Activity.action)
           .filter(Activity.name == 'test').subquery())
    
    
    q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                       sub.c.day.label('d'),
                       sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
         .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))
    
    for row in q.all():
        print row
        print len(row)
    

    relevant output:

    SELECT DISTINCT ON (anon_1.name, anon_1.last_name) anon_1.name AS n, anon_1.last_name AS l, anon_1.day AS d, anon_1.action AS a, anon_1.name AS anon_1_name, anon_1.last_name AS anon_1_last_name, anon_1.day AS anon_1_day 
    FROM (SELECT activity.name AS name, activity.last_name AS last_name, activity.day AS day, activity.action AS action 
    FROM activity 
    WHERE activity.name = %(name_1)s) AS anon_1 ORDER BY anon_1.name, anon_1.last_name, anon_1.day DESC
    2016-02-10 09:58:05,081 INFO sqlalchemy.engine.base.Engine {'name_1': 'test'}
    (u'test', None, None, None)
    4
    

    see? eight columns in the SQL, only four in the result.

    Plus if you wanted to use this query in a union sql statment, you'd get an error.

    two defenses:

    1. as stated, when using PG's DISTINCT ON, the whole behavior is now disabled for a simple outer query

    2. It is illegal from a SQL perspective to use DISTINCT and ORDER BY when all the columns in the ORDER BY are not present in the DISTINCT; PG again will reject such a query explicitly. So your UNION with an inner query that has DISTINCT + ORDER BY plus columns missing in the columns clause is not possible anyway, unless you're using a DB like MySQL or SQLite which seems to return random results.

  12. Mehdi Gmira reporter

    see? eight columns in the SQL, only four in the result.

    Yes, you're right. However, this doesn't work (7 columns returned)

    from sqlalchemy import *
    from sqlalchemy.orm import *
    from sqlalchemy.ext.declarative import declarative_base
    
    Base = declarative_base()
    
    
    class Activity(Base):
        __tablename__ = 'activity'
    
        id = Column(Integer, primary_key=True)
        name = Column(String)
        last_name = Column(String)
        day = Column(Integer)
        action = Column(String)
    
    engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)
    
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)
    
    session = Session(engine)
    
    session.add(Activity(name='test'))
    
    sub = (session.query(Activity.name, Activity.last_name, Activity.day, Activity.action)
           .filter(Activity.name == 'test').subquery())
    
    
    q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                       sub.c.day.label('d'),
                       sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
         .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))
    
    for row in session.query(q.subquery()).all():
        print row
        print len(row)
    

    But maybe that's what you had in mind when you wrote

    provided the query is not wrapped inside a subquery due to joined eager loading.

    Also

    It is illegal from a SQL perspective to use DISTINCT and ORDER BY when all the columns in the ORDER BY are not present in the DISTINCT; PG again will reject such a query explicitly. So your UNION with an inner query that has DISTINCT + ORDER BY plus columns missing in the columns clause is not possible anyway, unless you're using a DB like MySQL or SQLite which seems to return random results.

    I assume you're talking about "distinct" rather than "distinct on". Because something like

    select distinct on (a,b) a,b,c order by a,b,d
    

    is allowed by postgresql.

  13. Mike Bayer repo owner

    yup, "distinct". Short answer, if you give Query() legal SQL in the first place none of this behavior takes effect as of 1.1.

  14. Log in to comment