consider rolling back / refining automatic columns clause logic with orm query.distinct().order_by()

Issue #3518 duplicate
Thijs Damsma created an issue

When a distinct query is ordered, the columns on which are sorted are added to the SELECT statement, see example code:

from sqlalchemy.orm.session import sessionmaker
from sqlalchemy.dialects import postgresql
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer

Base = declarative_base()


class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    att1 = Column(Integer)
    att2 = Column(Integer)


Session = sessionmaker()
sess = Session()

q = sess.query(User.att2).distinct(User.att1)

print(str(q.statement.compile(dialect=postgresql.dialect())))

as expected, only user.att2 will be returned:

SELECT DISTINCT ON ("user".att1) "user".att2 
FROM "user"

Now with an order_by statement:

print(str(q.order_by(User.att1, User.id).statement.compile(
    dialect=postgresql.dialect())))

not as expected, both user.att2 and user.id will be returned

SELECT DISTINCT ON ("user".att1) "user".att2, "user".att1, "user".id 
FROM "user" ORDER BY "user".att1, "user".id

The intended result is:

SELECT DISTINCT ON ("user".att1) "user".att2
FROM "user" ORDER BY "user".att1, "user".id

Comments (9)

  1. Mike Bayer repo owner

    so the logic making that happen is here:

    diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
    index 5d08cbc..675d7a1 100644
    --- a/lib/sqlalchemy/orm/query.py
    +++ b/lib/sqlalchemy/orm/query.py
    @@ -3084,7 +3084,7 @@ class Query(object):
             if not context.order_by:
                 context.order_by = None
    
    -        if self._distinct and context.order_by:
    +        if False: #self._distinct and context.order_by:
                 order_by_col_expr = list(
                     chain(*[
                         sql_util.unwrap_order_by(o)
    

    If I comment it out as above, here is the test that fails (only on PG):

    #!
    
    DistinctTest.test_joined _____________________________________________________________________
    Traceback (most recent call last):
      File "/Users/classic/dev/sqlalchemy/test/orm/test_query.py", line 2659, in test_joined
        assert [User(id=7), User(id=9), User(id=8)] == q.all()
    
    ...
    
    ProgrammingError: (psycopg2.ProgrammingError) for SELECT DISTINCT, ORDER BY expressions must appear in select list
    LINE 2: ...ddresses ON users.id = addresses.user_id ORDER BY addresses....
                                                                 ^
     [SQL: 'SELECT DISTINCT users.id AS users_id, users.name AS users_name \nFROM users JOIN addresses ON users.id = addresses.user_id ORDER BY addresses.email_address DESC']
    

    If I run your actual query against Postgresql, with or without the above logic, I get this error:

    With the behavior intact:

    #!
    
    sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) SELECT DISTINCT ON expressions must match initial ORDER BY expressions
    LINE 1: SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2, "...
                                ^
     [SQL: 'SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2, "user".id AS user_id \nFROM "user" ORDER BY "user".att2, "user".id']
    

    with the behavior commented out:

    #!
    
    
    sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) SELECT DISTINCT ON expressions must match initial ORDER BY expressions
    LINE 1: SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2 
                                ^
     [SQL: 'SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2 \nFROM "user" ORDER BY "user".att2, "user".id']
    

    So, the query you're looking for is not valid SQL in any case.

    What we need to know is - what is the valid SQL that you are looking for? Let's work on making that happen before we decide what's broken here.

  2. Thijs Damsma reporter
    • edited description
    • Edited mistake, I meant to order by "user".att1 instead of "user".att2.
    • Added intended result
  3. Mike Bayer repo owner

    So this logic is a "safety" so that when someone does add things to the ORDER BY that arent accounted for with the DISTINCT, they are added to the SELECT list. You'll note that the column that's being added is not actually returned in the result:

    from sqlalchemy.orm.session import Session
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy import Column, Integer, create_engine
    
    Base = declarative_base()
    
    
    class User(Base):
        __tablename__ = 'user'
        id = Column(Integer, primary_key=True)
        att1 = Column(Integer)
        att2 = Column(Integer)
    
    e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
    sess = Session(e)
    sess.add(User(id=1, att1=2, att2=3))
    
    q = sess.query(User.att2).distinct(User.att1).order_by(User.att1, User.id)
    
    Base.metadata.drop_all(e)
    Base.metadata.create_all(e)
    
    row = q.all()[0]
    assert row.keys() == ['att2']
    

    that would be why this was never noticed before. We can turn down the rules for ordering here to try to mimic PG some more, or take the logic out entirely for non-entity queries, but if you are looking for very specific SQL of an exact string form, Core is more appropriate for that.

  4. Mike Bayer repo owner

    I'm not a fan of this behavior in any case. The use case we test for is where there's a JOIN to a related entity. it's this:

            q = sess.query(User).join('addresses').distinct(). \
                order_by(desc(Address.email_address))
    
            assert [User(id=7), User(id=9), User(id=8)] == q.all()
    

    But that query above doesn't even make any sense. We want the users to be distinct from the join, OK, but then we want to ORDER BY the email address in the join, so of course we have to put that in the SELECT list, which means, we're no longer DISTINCT on the User. We only get back unique User objects because of the de-duping at the entity level. This whole behavior has legacy 0.3 written all over it :).

    Consider deprecating it but this would be a hard jump for any app relying upon it. 1.2 for now but this might keep getting pushed.

  5. Mike Bayer repo owner

    this is resolved w #3641 which is for exactly this case, where PG's "DISTINCT ON" is involved. original test in the description works as expected now.

  6. Log in to comment