consider rolling back / refining automatic columns clause logic with orm query.distinct().order_by()

Issue #3518 duplicate

Thijs Damsma created an issue 2015-08-28

When a distinct query is ordered, the columns on which are sorted are added to the SELECT statement, see example code:

from sqlalchemy.orm.session import sessionmaker
from sqlalchemy.dialects import postgresql
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer

Base = declarative_base()


class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    att1 = Column(Integer)
    att2 = Column(Integer)


Session = sessionmaker()
sess = Session()

q = sess.query(User.att2).distinct(User.att1)

print(str(q.statement.compile(dialect=postgresql.dialect())))

as expected, only user.att2 will be returned:

SELECT DISTINCT ON ("user".att1) "user".att2 
FROM "user"

Now with an order_by statement:

print(str(q.order_by(User.att1, User.id).statement.compile(
    dialect=postgresql.dialect())))

not as expected, both user.att2 and user.id will be returned

SELECT DISTINCT ON ("user".att1) "user".att2, "user".att1, "user".id 
FROM "user" ORDER BY "user".att1, "user".id

The intended result is:

SELECT DISTINCT ON ("user".att1) "user".att2
FROM "user" ORDER BY "user".att1, "user".id

Comments (9)

Mike Bayer repo owner

so the logic making that happen is here:

diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
index 5d08cbc..675d7a1 100644
--- a/lib/sqlalchemy/orm/query.py
+++ b/lib/sqlalchemy/orm/query.py
@@ -3084,7 +3084,7 @@ class Query(object):
         if not context.order_by:
             context.order_by = None

-        if self._distinct and context.order_by:
+        if False: #self._distinct and context.order_by:
             order_by_col_expr = list(
                 chain(*[
                     sql_util.unwrap_order_by(o)

If I comment it out as above, here is the test that fails (only on PG):

#!

DistinctTest.test_joined _____________________________________________________________________
Traceback (most recent call last):
  File "/Users/classic/dev/sqlalchemy/test/orm/test_query.py", line 2659, in test_joined
    assert [User(id=7), User(id=9), User(id=8)] == q.all()

...

ProgrammingError: (psycopg2.ProgrammingError) for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 2: ...ddresses ON users.id = addresses.user_id ORDER BY addresses....
                                                             ^
 [SQL: 'SELECT DISTINCT users.id AS users_id, users.name AS users_name \nFROM users JOIN addresses ON users.id = addresses.user_id ORDER BY addresses.email_address DESC']

If I run your actual query against Postgresql, with or without the above logic, I get this error:

With the behavior intact:

#!

sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2, "...
                            ^
 [SQL: 'SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2, "user".id AS user_id \nFROM "user" ORDER BY "user".att2, "user".id']

with the behavior commented out:

#!


sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2 
                            ^
 [SQL: 'SELECT DISTINCT ON ("user".att1) "user".att2 AS user_att2 \nFROM "user" ORDER BY "user".att2, "user".id']

So, the query you're looking for is not valid SQL in any case.

What we need to know is - what is the valid SQL that you are looking for? Let's work on making that happen before we decide what's broken here.

2015-08-28T16:03:02+00:00

Thijs Damsma reporter
- edited description
- Edited mistake, I meant to order by "user".att1 instead of "user".att2.
- Added intended result
- 2015-08-28T18:34:14+00:00

Mike Bayer repo owner

So this logic is a "safety" so that when someone does add things to the ORDER BY that arent accounted for with the DISTINCT, they are added to the SELECT list. You'll note that the column that's being added is not actually returned in the result:

from sqlalchemy.orm.session import Session
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, create_engine

Base = declarative_base()


class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    att1 = Column(Integer)
    att2 = Column(Integer)

e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
sess = Session(e)
sess.add(User(id=1, att1=2, att2=3))

q = sess.query(User.att2).distinct(User.att1).order_by(User.att1, User.id)

Base.metadata.drop_all(e)
Base.metadata.create_all(e)

row = q.all()[0]
assert row.keys() == ['att2']

that would be why this was never noticed before. We can turn down the rules for ordering here to try to mimic PG some more, or take the logic out entirely for non-entity queries, but if you are looking for very specific SQL of an exact string form, Core is more appropriate for that.

2015-08-28T22:00:37+00:00

Mike Bayer repo owner
- add a note about DISTINCT in query w/ ORDER BY, references ~~#3518~~
→ <<cset 6d0c0994e9a0>>
- 2015-08-31T17:13:43+00:00
Mike Bayer repo owner
- add a note about DISTINCT in query w/ ORDER BY, references ~~#3518~~
(cherry picked from commit 6d0c0994e9a08cf5d149af0314970d5f6e25b159)

→ <<cset 2d48b9de5ad5>>
- 2015-08-31T17:13:43+00:00
Mike Bayer repo owner
- add a note about DISTINCT in query w/ ORDER BY, references ~~#3518~~
(cherry picked from commit 6d0c0994e9a08cf5d149af0314970d5f6e25b159)

→ <<cset 5c8d1f33cd10>>
- 2015-08-31T17:13:43+00:00
Mike Bayer repo owner
- marked as major
- changed title to consider rolling back / refining automatic columns clause logic with orm query.distinct().order_by()
- changed milestone to 1.2
- changed component to orm
- marked as proposal
I'm not a fan of this behavior in any case. The use case we test for is where there's a JOIN to a related entity. it's this:
```
        q = sess.query(User).join('addresses').distinct(). \
            order_by(desc(Address.email_address))

        assert [User(id=7), User(id=9), User(id=8)] == q.all()
```
But that query above doesn't even make any sense. We want the users to be distinct from the join, OK, but then we want to ORDER BY the email address in the join, so of course we have to put that in the SELECT list, which means, we're no longer DISTINCT on the User. We only get back unique User objects because of the de-duping at the entity level. This whole behavior has legacy 0.3 written all over it :).

Consider deprecating it but this would be a hard jump for any app relying upon it. 1.2 for now but this might keep getting pushed.
- 2015-08-31T17:23:55+00:00
Mike Bayer repo owner
- changed status to duplicate
Duplicate of ~~#3641~~.
- 2016-03-08T15:28:25+00:00
Mike Bayer repo owner
this is resolved w ~~#3641~~ which is for exactly this case, where PG's "DISTINCT ON" is involved. original test in the description works as expected now.
- 2016-03-08T15:29:01+00:00
Log in to comment

Assignee: –

Type: proposal

Priority: major

Status: duplicate

Component: orm

Milestone: 1.2

Votes: 0

Watchers: 2