refine distinct + order_by "add primary cols" logic for non wrapped selectable

Issue #3641 resolved

Mehdi Gmira created an issue 2016-02-01

I encountred a bug with the postgres specific "DISTINCT ON" in SQLAlchemy The real SA query I'm using is more complicated than this, but it sums up to :

query = db.session.query(c1, c2, c3, c4).distinct(c1, c2).order_by(c1, c2, c3)

when i compile the query by doing:

query.statement.compile(dialect=postgres.dialect())

The actual query is

SELECT DISTINCT ON (c1, c2) c1, c2, c3, c4, c1, c2, c3 FROM .... ORDER BY c1, c2, c3

As you can see, the columns c1, c2, c3 are repeated. After some investigation, it seems like this is caused by the order_by() part of the query (removing order by removes the repeated columns). The weird thing is that when actually running the query (query.all()), I only get the column that I requested, and not the repeated ones as in the compiled statement. This would have been fine with me. But the problem is it's preventing me from using a union(), because SA thinks I'm requesting more columns on one side of union().

I think It's a bug, but maybe I'm just doing something wrong.

Comments (19)

Mike Bayer repo owner

hi there -

thanks for reporting. Unfortuntately I cannot produce this result (see below). Please provide an mvce case. Thanks!

(hint: I'm guessing that the "c1" "c2" "c3" are not the same objects sent to order_by() as those sent to query()).

from sqlalchemy import Integer, Column
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class A(Base):
    __tablename__ = 'a'
    id = Column(Integer, primary_key=True)
    x = Column(Integer)
    y = Column(Integer)
    z = Column(Integer)


s = Session()

q = s.query(A.x, A.y, A.z).distinct(A.x, A.y).order_by(A.x, A.y, A.z)

from sqlalchemy.dialects import postgresql
print q.statement.compile(dialect=postgresql.dialect())

output:

SELECT DISTINCT ON (a.x, a.y) a.x, a.y, a.z 
FROM a ORDER BY a.x, a.y, a.z

2016-02-01T15:35:01+00:00

Mike Bayer repo owner
- removed milestone
- 2016-02-01T15:35:35+00:00
Mehdi Gmira reporter
Sorry, You're right. I just found out that SA has nothing to do with this. I was lost if a bunch of subqueries :), and chose the wrong one in the order_by clause.

Have a good day, and thank you for the great work you're doing on SQLAlchemy :)
- 2016-02-01T16:21:19+00:00
Mehdi Gmira reporter
- changed status to closed
Not a bug
- 2016-02-01T16:22:03+00:00
Mehdi Gmira reporter
- changed status to open
Reproduced bug
- 2016-02-05T13:03:07+00:00

Mehdi Gmira reporter

Hey,

I realized there is actually a problem with the distinct on implementation.

Here's a code sample

engine = create_engine('postgresql://**')
Session = sessionmaker()
Session.configure(bind=engine)
session = Session()

class Activity(Base):
    __tablename__ = 'activity'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    last_name = Column(String)
    day = Column(Integer)
    action = Column(String)

sub = (session.query(Activity.name, Activity.last_name, Activity.day, Activity.action)
       .filter(Activity.name == 'test').subquery())

Here's a dumb query:

q = (session.query(sub.c.name, sub.c.last_name,
                   sub.c.day, sub.c.action).distinct(sub.c.name, sub.c.last_name)
     .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))

As expected, this produces:

SELECT DISTINCT ON (anon_1.name, anon_1.last_name) anon_1.name, anon_1.last_name, anon_1.day, anon_1.action 
FROM (SELECT activity.name AS name, activity.last_name AS last_name, activity.day AS day, activity.action AS action 
FROM activity 
WHERE activity.name = %(name_1)s) AS anon_1 ORDER BY anon_1.name, anon_1.last_name, anon_1.day DESC

But this doesn't work:

q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                   sub.c.day.label('d'), sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
     .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))

It produces

SELECT DISTINCT ON (anon_1.name, anon_1.last_name) anon_1.name AS n, anon_1.last_name AS l, anon_1.day AS d, anon_1.action AS a, anon_1.name, anon_1.last_name, anon_1.day 
FROM (SELECT activity.name AS name, activity.last_name AS last_name, activity.day AS day, activity.action AS action 
FROM activity 
WHERE activity.name = %(name_1)s) AS anon_1 ORDER BY anon_1.name, anon_1.last_name, anon_1.day DESC

2016-02-05T13:08:16+00:00

Mike Bayer repo owner
- changed milestone to 1.1
- changed component to orm
- changed title to remove distinct + order_by "add primary cols" logic for non wrapped selectable
this should be unnecessary, no ORM tests fail except one assert SQL that can be changed:
```
diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
index 08600c3..4266e68 100644
--- a/lib/sqlalchemy/orm/query.py
+++ b/lib/sqlalchemy/orm/query.py
@@ -3319,7 +3319,7 @@ class Query(object):
         if not context.order_by:
             context.order_by = None

-        if self._distinct and context.order_by:
+        if False and self._distinct and context.order_by:
             order_by_col_expr = list(
                 chain(*[
                     sql_util.unwrap_order_by(o)
```
it might be worth looking back into history to see if when query._compile_context() was split into "compound" and "simple", the add cols logic here should never have been applied to the "simple" case as this is specifically used to allow eager subquery wrapping to work.

since a behavior change, this needs to be at a major version boundary so hopefully can get this into 1.1.
- 2016-02-05T22:45:25+00:00
Mike Bayer repo owner
certainly at the very least, the assumption of this logic is back when we didn't return individual columns so it at least shouldn't take effect when we arent returning full entities.
- 2016-02-05T22:46:46+00:00

Mike Bayer repo owner

here's the test:

#!

py.test test/orm/test_query.py -k DistinctTest --db postgresql


ProgrammingError: (psycopg2.ProgrammingError) for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 2: ...ddresses ON users.id = addresses.user_id ORDER BY addresses....
                                                             ^
 [SQL: 'SELECT DISTINCT users.id AS users_id, users.name AS users_name \nFROM users JOIN addresses ON users.id = addresses.user_id ORDER BY addresses.email_address DESC']

2016-02-09T21:19:31+00:00

Mike Bayer repo owner

alrighty we'll just make it smarter:

diff --git a/lib/sqlalchemy/orm/query.py b/lib/sqlalchemy/orm/query.py
index 08600c3..3c0f67a 100644
--- a/lib/sqlalchemy/orm/query.py
+++ b/lib/sqlalchemy/orm/query.py
@@ -3254,12 +3254,11 @@ class Query(object):
         # then append eager joins onto that

         if context.order_by:
-            order_by_col_expr = list(
-                chain(*[
-                    sql_util.unwrap_order_by(o)
-                    for o in context.order_by
-                ])
-            )
+            order_by_col_expr = \
+                sql_util.expand_column_list_from_order_by(
+                    context.primary_columns,
+                    context.order_by
+                )
         else:
             context.order_by = None
             order_by_col_expr = []
@@ -3320,14 +3319,11 @@ class Query(object):
             context.order_by = None

         if self._distinct and context.order_by:
-            order_by_col_expr = list(
-                chain(*[
-                    sql_util.unwrap_order_by(o)
-                    for o in context.order_by
-                ])
-            )
-            context.primary_columns += order_by_col_expr
-
+            context.primary_columns += \
+                sql_util.expand_column_list_from_order_by(
+                    context.primary_columns,
+                    context.order_by
+                )
         context.froms += tuple(context.eager_joins.values())

         statement = sql.select(
diff --git a/lib/sqlalchemy/sql/util.py b/lib/sqlalchemy/sql/util.py
index 1dcf0ee..0e8c77c 100644
--- a/lib/sqlalchemy/sql/util.py
+++ b/lib/sqlalchemy/sql/util.py
@@ -176,6 +176,27 @@ def unwrap_order_by(clause):
     return result


+def expand_column_list_from_order_by(collist, order_by):
+    """Given the columns clause and ORDER BY of a selectable,
+    return a list of column expressions that can be added to the collist
+    corresponding to the ORDER BY, without repeating those already
+    in the collist.
+
+    """
+    cols_already_present = set([
+        col.element if col._order_by_label_element is not None
+        else col for col in collist
+    ])
+    return [
+        col for col in
+        chain(*[
+            unwrap_order_by(o)
+            for o in order_by
+        ])
+        if col not in cols_already_present
+    ]
+
+
 def clause_is_present(clause, search):
     """Given a target clause and a second to search within, return True
     if the target is plainly present in the search without any

2016-02-09T21:46:03+00:00

Mike Bayer repo owner
you know actually there is no problem with the behavior as is, since those columns aren't returned by the Query. It's only ugly in the string, and if you use query.statement in a Core sense.
- 2016-02-09T22:11:36+00:00
Mike Bayer repo owner
- changed status to resolved
- A refinement to the logic which adds columns to the resulting SQL when :meth:.Query.distinct is combined with :meth:.Query.order_by such that columns which are already present will not be added a second time, even if they are labeled with a different name. Regardless of this change, the extra columns added to the SQL have never been returned in the final result, so this change only impacts the string form of the statement as well as its behavior when used in a Core execution context. Additionally, columns are no longer added when the DISTINCT ON format is used, provided the query is not wrapped inside a subquery due to joined eager loading. fixes ~~#3641~~
→ <<cset ff3be95620b6>>
- 2016-02-09T22:50:31+00:00
Mike Bayer repo owner
- changed title to refine distinct + order_by "add primary cols" logic for non wrapped selectable
- 2016-02-09T22:51:11+00:00
Mehdi Gmira reporter
I'm not sure I understood what you mean by "the extra columns added to the SQL have never been returned in the final result" in the results. If you run
```
q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                   sub.c.day.label('d'), sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
     .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc())).all()
```
You do get more columns than expected. Plus if you wanted to use this query in a union sql statment, you'd get an error.
- 2016-02-10T11:11:25+00:00

Mike Bayer repo owner

You do get more columns than expected.

not at all, because these columns are not added as entities to the Query. Here's the full test, which I will run against 0.9 where we add all the extra columns:

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Activity(Base):
    __tablename__ = 'activity'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    last_name = Column(String)
    day = Column(Integer)
    action = Column(String)

engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)

Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)

session = Session(engine)

session.add(Activity(name='test'))

sub = (session.query(Activity.name, Activity.last_name, Activity.day, Activity.action)
       .filter(Activity.name == 'test').subquery())


q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                   sub.c.day.label('d'),
                   sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
     .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))

for row in q.all():
    print row
    print len(row)

relevant output:

SELECT DISTINCT ON (anon_1.name, anon_1.last_name) anon_1.name AS n, anon_1.last_name AS l, anon_1.day AS d, anon_1.action AS a, anon_1.name AS anon_1_name, anon_1.last_name AS anon_1_last_name, anon_1.day AS anon_1_day 
FROM (SELECT activity.name AS name, activity.last_name AS last_name, activity.day AS day, activity.action AS action 
FROM activity 
WHERE activity.name = %(name_1)s) AS anon_1 ORDER BY anon_1.name, anon_1.last_name, anon_1.day DESC
2016-02-10 09:58:05,081 INFO sqlalchemy.engine.base.Engine {'name_1': 'test'}
(u'test', None, None, None)
4

see? eight columns in the SQL, only four in the result.

Plus if you wanted to use this query in a union sql statment, you'd get an error.

two defenses:

as stated, when using PG's DISTINCT ON, the whole behavior is now disabled for a simple outer query
It is illegal from a SQL perspective to use DISTINCT and ORDER BY when all the columns in the ORDER BY are not present in the DISTINCT; PG again will reject such a query explicitly. So your UNION with an inner query that has DISTINCT + ORDER BY plus columns missing in the columns clause is not possible anyway, unless you're using a DB like MySQL or SQLite which seems to return random results.

2016-02-10T15:07:34+00:00

Mehdi Gmira reporter

see? eight columns in the SQL, only four in the result.

Yes, you're right. However, this doesn't work (7 columns returned)

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Activity(Base):
    __tablename__ = 'activity'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    last_name = Column(String)
    day = Column(Integer)
    action = Column(String)

engine = create_engine('postgresql://scott:tiger@localhost/test', echo=True)

Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)

session = Session(engine)

session.add(Activity(name='test'))

sub = (session.query(Activity.name, Activity.last_name, Activity.day, Activity.action)
       .filter(Activity.name == 'test').subquery())


q = (session.query(sub.c.name.label('n'), sub.c.last_name.label('l'),
                   sub.c.day.label('d'),
                   sub.c.action.label('a')).distinct(sub.c.name, sub.c.last_name)
     .order_by(sub.c.name, sub.c.last_name, sub.c.day.desc()))

for row in session.query(q.subquery()).all():
    print row
    print len(row)

But maybe that's what you had in mind when you wrote

provided the query is not wrapped inside a subquery due to joined eager loading.

Also

It is illegal from a SQL perspective to use DISTINCT and ORDER BY when all the columns in the ORDER BY are not present in the DISTINCT; PG again will reject such a query explicitly. So your UNION with an inner query that has DISTINCT + ORDER BY plus columns missing in the columns clause is not possible anyway, unless you're using a DB like MySQL or SQLite which seems to return random results.

I assume you're talking about "distinct" rather than "distinct on". Because something like

select distinct on (a,b) a,b,c order by a,b,d

is allowed by postgresql.

2016-02-10T15:55:40+00:00

Mike Bayer repo owner
yup, "distinct". Short answer, if you give Query() legal SQL in the first place none of this behavior takes effect as of 1.1.
- 2016-02-10T16:11:42+00:00
Mehdi Gmira reporter
Ok got it! :)
- 2016-02-10T16:17:17+00:00
Mike Bayer repo owner
Issue ~~#3518~~ was marked as a duplicate of this issue.
- 2016-03-08T15:28:25+00:00
Log in to comment

Assignee: –

Type: bug

Priority: major

Status: resolved

Component: orm

Milestone: 1.1

Votes: 0

Watchers: 3