warn when LIMIT/OFFSET used w/o ORDER BY but w/ subquery eager loading?

Issue #3064 wontfix
Jack Zhou created an issue

Consider the following configuration:

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True)
    path = Column(String(500), nullable=False)
    depth = Column(Integer, nullable=False)
    children = relationship("Node", viewonly=True,
                            primaryjoin=and_(
                                remote(foreign(path)).like(path.concat(".%")),
                                remote(depth) == depth + 1))

And the query:

db.add(Node(path="foo", depth=0))
db.add(Node(path="foo.baz", depth=1))
db.flush()
db.expunge_all()
print(db.query(Node).first().children)
db.expunge_all()
print(db.query(Node).options(subqueryload(Node.children)).first().children)

The output of this is not deterministic. Depending on external factors (such as whether an unused import is present), the second print statement may or may not be an empty list. I've tracked it down to incorrectly rendered SQL. On "correct" runs, the rendered SQL looks like this:

SELECT node.id AS node_id,
       node.path AS node_path,
       node.depth AS node_depth,
       anon_1.node_path AS anon_1_node_path,
       anon_1.node_depth AS anon_1_node_depth
FROM
  (SELECT DISTINCT node.path AS node_path,
                   node.depth AS node_depth
   FROM node LIMIT 1) AS anon_1
JOIN node ON node.path LIKE (anon_1.node_path || '.%')
AND node.depth = anon_1.node_depth + 1
ORDER BY anon_1.node_path,
         anon_1.node_depth;

On incorrect runs, the rendered SQL looks like this:

SELECT node.id AS node_id,
       node.path AS node_path,
       node.depth AS node_depth,
       anon_1.node_depth AS anon_1_node_depth,
       anon_1.node_path AS anon_1_node_path
FROM
  (SELECT DISTINCT node.depth AS node_depth,
                   node.path AS node_path
   FROM node LIMIT 1) AS anon_1
JOIN node ON node.path LIKE (anon_1.node_path || '.%')
AND node.depth = anon_1.node_depth + 1
ORDER BY anon_1.node_depth,
         anon_1.node_path;

Note the changed order of path and depth. The fundamental problem lies in anon_1. I may be wrong, but I've inferred from this that SQLAlchemy is incorrectly considering depth to be a foreign key when it is not marked as such.

Comments (19)

  1. Mike Bayer repo owner

    wow, OK the whole idea that SQLA can even do, in any case, "table.somecol = table.somecol" and keep the column straight on both sides, even though its the same column, was enormously complicated to work out - that capability only appeared in 0.8. Such a join condition seemed crazy for most of the time I've been developing this thing, but then we had folks presenting us with composite-key joins where such a thing had to occur.

    So here with the materialized path thing, which is also something I'm amazed relationship() can do as it's way outside of its comfort zone, we're now throwing in this "depth = depth" thing - wow! I'd almost be more concerned if it did actually work.

    that's an intro to how outside the usual boundaries that join condition is. Now let me actually look at it and try to understand even what's going on...

  2. Jack Zhou reporter

    Heh, sorry for springing it onto you like this. Rest assured that I completely understand how absurdly complicated this is -- I wouldn't even know where to begin implementing something like this myself! Perhaps a user-friendly error message or a warning in the docs saying this configuration is not supported would suffice, but if you evaluate that query with .all() it does actually appear to work; it only breaks when evaluated with .first() (or anything involving LIMIT, I believe), so I thought it wouldn't be too far-fetched to think it could work for .first() as well.

    Thank you for all the hard work you've put into this amazing library, by the way. IMO SQLAlchemy is one of the best libraries out there, and not just among ORMs and not just in Python, but across all libraries ever written for all languages. Really appreciate it.

  3. Mike Bayer repo owner

    actually this is working just about completely. The problem with first() is that you aren't giving it any order by. It's not really valid from a SQL perspective to ever use LIMIT or similar without ORDER BY - tables and select statements don't have any deterministic ordering. They of course have natural ordering, but without ORDER BY you really can't rely upon getting that ordering back. On a database like Oracle, you really never know when natural ordering is going to just vanish, such as if you emit a SELECT statement, then place that same SELECT statement as a nested subquery as we do here, it might choose different rows when it's nested. The more "typical" databases like PG, MySQL and SQLite we tend to see natural ordering a lot more. But in fact this "non-deterministic" issue is exactly what we're seeing here. If using subquery eager loading, you really should never count on the SELECT returning the same natural ordering by itself as it does when placed as a subquery.

    If I run this on PG, which is my "go-to" DB for queries like this, the inner subquery's ordering is being affected by the ORDER BY we have on the outside query. This is not actually per any SQL specific behavior, it's just how PG's query engine is working it out. If I run the same thing on SQLite (ha and mysql), and I get the exact same query with "depth" first and "path" second, I do get the row back, the inner subquery is still producing natural ordering:

    PG:

    SELECT node.id AS node_id, node.path AS node_path, node.depth AS node_depth, anon_1.node_depth AS anon_1_node_depth, anon_1.node_path AS anon_1_node_path 
    FROM (SELECT DISTINCT node.depth AS node_depth, node.path AS node_path 
    FROM node 
     LIMIT %(param_1)s) AS anon_1 JOIN node ON node.path LIKE (anon_1.node_path || %(param_2)s) AND node.depth = anon_1.node_depth + %(param_3)s ORDER BY anon_1.node_depth, anon_1.node_path
    {'param_1': 1, 'param_3': 1, 'param_2': '.%'}
    Col ('node_id', 'node_path', 'node_depth', 'anon_1_node_depth', 'anon_1_node_path')
    

    SQLite:

    SELECT node.id AS node_id, node.path AS node_path, node.depth AS node_depth, anon_1.node_depth AS anon_1_node_depth, anon_1.node_path AS anon_1_node_path 
    FROM (SELECT DISTINCT node.depth AS node_depth, node.path AS node_path 
    FROM node
     LIMIT ? OFFSET ?) AS anon_1 JOIN node ON node.path LIKE (anon_1.node_path || ?) AND node.depth = anon_1.node_depth + ? ORDER BY anon_1.node_depth, anon_1.node_path
    (1, 0, '.%', 1)
    Col ('node_id', 'node_path', 'node_depth', 'anon_1_node_depth', 'anon_1_node_path')
    Row (2, u'foo.baz', 1, 0, u'foo')
    

    MySQL:

    SELECT node.id AS node_id, node.path AS node_path, node.depth AS node_depth, anon_1.node_depth AS anon_1_node_depth, anon_1.node_path AS anon_1_node_path 
    FROM (SELECT DISTINCT node.depth AS node_depth, node.path AS node_path 
    FROM node 
     LIMIT %s) AS anon_1 INNER JOIN node ON node.path LIKE (concat(anon_1.node_path, %s)) AND node.depth = anon_1.node_depth + %s ORDER BY anon_1.node_depth, anon_1.node_path
    (1, '.%', 1)
    Col ('node_id', 'node_path', 'node_depth', 'anon_1_node_depth', 'anon_1_node_path')
    Row (2L, 'foo.baz', 1L, 0L, 'foo')
    

    All three you can see are doing "depth, path", that ordering is based on Python dictionary/set ordering which ultimately is based on the Python set Node.children.property.local_columns - subqueryloader doesn't care about the ordering of these columns, it's only used to ensure the rows are grouped together to the parent, which in itself is just so that we can use itertools.groupby to form the collections.

    if you just put order_by(Node.id) on your query I get the same result back every time.

    now, where would the "warning" be here - any SELECT where you requested LIMIT / OFFSET and didn't do an ORDER BY, that's pretty generic and we could warn for that in the regular old core compiler. But that seems a little heavyhanded.

  4. Mike Bayer repo owner

    also, I would Iove to get a really good workup of materialized path here and add it to examples/. Would you be interested in providing content for that?

  5. Jack Zhou reporter

    By "workup" do you mean a complete example module that demonstrates the usage of materialized paths? My particular use case is not much more complicated than what I've shown here (I have some polymorphism thrown in), but if that's what you're looking for I'd be glad to contribute.

  6. Mike Bayer repo owner

    Ah well something simple, I think here the "depth" you have you can get that from the constructor and splitting the "path" on the dot and counting, right? Examples look like this: http://docs.sqlalchemy.org/en/rel_0_9/_modules/examples/join_conditions/cast.html or this: http://docs.sqlalchemy.org/en/rel_0_9/_modules/examples/association/basic_association.html and just illustrate some novel kind of mapping and some usage examples. I sort of assume a lot of people figure out how relationship() is used and such by looking at these, maybe not.

  7. Jack Zhou reporter

    Yea, depth is purely stored for convenience's sake. I'll come up with something and submit a pull request.

  8. Mike Bayer repo owner

    OK I just went through the whole process of adding this warning and I realized it's going to usually do more harm than good. The majority of cases where we'll see this is where someone is saying query(Entity).filter_by(id=X).first(). That is, the query only returns one row - so first() is safe to call. The case of emitting query(Entity).<insufficient criteria>.first() is just wrong in general, we can't really save users from doing a wrong query; for us to try and guess which of these are "wrong" is probably further than is appropriate for us to go.

  9. Jack Zhou reporter

    Understandable. Perhaps a note in the docs might be more appropriate? It's not obvious that one always needs add a deterministic ORDER BY in order to ensure correct behavior.

  10. Mike Bayer repo owner

    if you can send me a PR for the docs where subquery eager loading is discussed I can pull that in. (probably in loading.rst).

  11. Jack Zhou reporter

    I submitted pull request #26. I also wonder if it wouldn't be appropriate to issue the warning when subqueryload gets rows back it didn't expect. For example, if the main query has objects with ids 1, 2, and 3, but the query issued by subqueryload returns a row with id 4, you know something is wrong. Of course, this would only catch the case where the subqueryloaded relationship for object 4 is not empty, but it's better than nothing. What do you think?

  12. Log in to comment