limit/offset used with a subqueryload creates unpredictable results

Issue #3650 closed
jvanasco created an issue

This affects subqueryload, not joinedload.

It is related to https://bitbucket.org/zzzeek/sqlalchemy/issues/3064/warn-when-limit-offset-used-w-o-order-by

Assume a simple relationship of Foo to Bar:

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import subqueryload
import sqlalchemy.orm

Base = declarative_base()

class Bar(Base):
    __tablename__ = 'bar'
    id = Column(Integer, primary_key=True)

class Foo(Base):
    __tablename__ = 'foo'
    id = Column(Integer, primary_key=True)
    id_bar = Column(Integer, ForeignKey("bar.id"), nullable=False)

    bar = sqlalchemy.orm.relationship("Bar")

The following query will yield unpredictable results for the 'bar' relationship, even when an id_bar is present:

foos = s.query(Foo)\
        .options(subqueryload('bar'))\
        .limit(20)\
        .offset(10)\
        .all()

The issue is that the subqueryload logic generates a SELECT like the following (reformatted for emphasis):

SELECT bar.* FROM
    (SELECT DISTINCT foo.id_bar AS foo_id_bar
     FROM foo
     LIMIT :limit OFFSET :offset) AS anon_1
JOIN bar ON bar.id = anon_1.foo_id_bar
ORDER BY anon_1.foo_id_bar;

Because that inner select for the subquery performs a DISTINCT, the LIMIT and OFFSET apply to the list of distinct values -- not to the overall row set.
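
To see the order-of-operations problem in miniature, here is a toy illustration in plain Python (not SQLAlchemy code) of the two SQL shapes:

# id_bar values for five Foo rows -- only two distinct values
rows = [1, 1, 1, 2, 2]

# what the generated SQL does: DISTINCT first, then OFFSET/LIMIT.
# an offset of 2 pages past both distinct values, so nothing comes back:
print(sorted(set(rows))[2:2 + 10])      # []

# what pagination needs: OFFSET/LIMIT first, then DISTINCT.
# the page still has rows, and their distinct values survive:
print(sorted(set(rows[2:2 + 10])))      # [1, 2]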

One example scenario that will fail: the offset is greater than the total number of distinct values. If every foo is assigned an id_bar from only 2 values (1, 2), paginating with an offset of 2 or greater will always come back empty, and no bar relation will be set.

# 1. make a bunch of Bars.  Let's do 20!
for i in range(0, 20):
    b = Bar(id=i)
    s.add(b)
s.commit()

# 2. okay, now make 100 Foos... but only use 2 `bars`.
for i in range(0, 100):
    if i < 50:
        id_bar = 1
    else:
        id_bar = 2
    f = Foo(id=i, id_bar=id_bar)
    s.add(f)
s.commit()

# 3. re-run the select from above, and examine the results:
foos = s.query(Foo)\
        .options(subqueryload('bar'))\
        .limit(20)\
        .offset(10)\
        .all()

for f in foos:
    print "--"
    print "examining foo.%s" % f.id
    print f.id_bar
    print f.bar

# you'll notice that `bar` will be missing (None) even when id_bar is set

Another example scenario is when there is a lot of repetition within the dataset, e.g. if there are fewer unique Bars than Foos, you'll encounter errors as you paginate closer to the end. I haven't been able to construct a reliable dataset to reproduce this, but it happens in my production dataset when I paginate about 80% of the way through.

To complicate things further, this partially becomes the other issue -- there is no implicit order_by on the inner query unless an order_by is explicitly applied to the outer query.

order_by(Foo.id.asc()) will create this inner query:

(SELECT DISTINCT foo.id_bar AS foo_id_bar, foo.id AS foo_id 
 FROM foo ORDER BY foo.id ASC
 LIMIT ? OFFSET ?)

However, with no order_by, the result is this inner query:

(SELECT DISTINCT foo.id_bar AS foo_id_bar 
 FROM foo
 LIMIT ? OFFSET ?)

There is no reason for the database to guarantee that these results correspond to the same ids returned by the previous query (the non-subqueryload 'master' query that sqlalchemy just issued).

The other ticket discussed a more specific use-case with depth, but I'll address the larger, less detailed concept: in the absence of an explicit order_by, sqlalchemy appears to issue an implicit order_by on the primary key -- however, it does not extend that implicit order_by to the inner query.

The fixes I suggest are:

  1. migrate this into 2 queries -- an inner query that paginates and an outer query that applies DISTINCT (or GROUP BY, which can be much faster for this in certain scenarios); i.e. turn the first query below into the second (see the sketch after this list):

    • SELECT DISTINCT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset
    • SELECT DISTINCT foo_id_bar FROM (SELECT foo.id_bar AS foo_id_bar FROM foo LIMIT :limit OFFSET :offset) AS foo
  2. in the absence of an order_by on a query that uses subqueryload, apply the same implicit primary-key ordering to the inner query that was used on the outer query.
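
To make suggestion 1 concrete, here is a rough sketch of the wrapped query using SQLAlchemy Core -- illustrative only, since the real change would live inside the subqueryload strategy:

from sqlalchemy import select

foo_t = Foo.__table__

# inner query: paginate the raw, non-distinct id_bar values first
paged = select([foo_t.c.id_bar]).limit(20).offset(10).alias("paged")

# outer query: apply DISTINCT over only the paginated rows
distinct_page = select([paged.c.id_bar]).distinct()

# renders roughly as:
# SELECT DISTINCT paged.id_bar FROM
#   (SELECT foo.id_bar FROM foo LIMIT :l OFFSET :o) AS paged
print(distinct_page)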

Comments (15)

  1. Mike Bayer repo owner

    in the absence of an explicit order_by, sqlalchemy appears to issue an implicit order_by on the primary key

    from the subquery. we have to maintain the ordering delivered by it so that we can populate each Foo's collection using itertools.groupby.
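
    (For context: itertools.groupby only groups adjacent rows with equal keys, so the subquery's rows must arrive sorted by the parent id. A toy illustration, not SQLAlchemy internals:)

    import itertools

    # (foo_id, bar_value) rows as they might arrive *unordered*:
    rows = [(1, 'a'), (2, 'b'), (1, 'c')]

    for foo_id, group in itertools.groupby(rows, key=lambda r: r[0]):
        print(foo_id, [g[1] for g in group])
    # yields three groups -- 1:['a'], 2:['b'], 1:['c'] -- so foo_id=1 is
    # split in two and its collection would be populated incompletely;
    # sorting the rows by foo_id first keeps each group contiguous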

    The other ticket talked about a more specific use-case with depth

    I'm seeing no difference whatsoever between this issue and #3064

    migrate this into 2 queries -- an inner that paginates and an outer that applies distinct (or group_by, which can be much faster for this in certain scenarios)

    I'm not understanding this. Now there's two queries that have LIMIT/OFFSET and no ORDER BY. You simply cannot get the same values back each time if you use LIMIT/OFFSET without ORDER BY, but also I don't understand how this new query is used or what purpose it would serve. Especially considering we just emitted that same query as the primary query anyway?

    in the absence of an order_by on a query that uses subqueryload, apply the same implicit primary-key ordering to the inner query that was used on the outer query.

    What would that accomplish, except working by accident more often on certain backends? What's wrong with the documentation fix in #3064? Why are you needing to use LIMIT/OFFSET without ORDER BY?

  2. Mike Bayer repo owner

    note also there's another style of loading that we've never implemented, which I think SQLAlchemy-Utils has tried, which is subquery-IN loading. That would actually perform much better than subquery eager loading. Basically, you emit the subquery eager load, but the inner query is just an IN containing all the primary keys of the objects we've loaded. It would just have to break things up into several queries to keep the size of the IN clauses down to about 500. From a SQL perspective, though, it's a lot faster, especially for smaller result sets.
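
    A hand-rolled sketch of that idea against the Foo/Bar models above (the helper and structure here are illustrative, not a SQLAlchemy API):

    def chunked(seq, size=500):
        # keep each IN clause down to ~500 parameters
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    foos = s.query(Foo).order_by(Foo.id).limit(20).offset(10).all()

    # load the related rows with plain IN queries keyed off the parents we
    # already have, instead of re-running the parent query as a subquery
    bars_by_id = {}
    for chunk in chunked(sorted({f.id_bar for f in foos})):
        for bar in s.query(Bar).filter(Bar.id.in_(chunk)):
            bars_by_id[bar.id] = bar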

  3. jvanasco reporter

    The issue with DISTINCT is independent of the order_by. (Incidentally, I never paginate without an order_by; I just noticed this while putting together a rough test case.) The problem with the DISTINCT happening at the same level as the LIMIT/OFFSET occurs regardless of the ordering.

    I'll make a better test script tonight and share it tomorrow.

    The subquery-IN loading would be a lot nicer. I think MySQL's query caching probably helps the current implementation, but Postgres can be rough. I actually avoid a lot of the relationship machinery and do direct 'IN' loading by ids when I can.

  4. Mike Bayer repo owner

    OK, I just scrolled up to see what you might be referring to, and so far I just see queries with LIMIT/OFFSET and no ORDER BY, which are documented as not working. So let me know what the DISTINCT thing is specifically -- e.g., a failure even when ORDER BY is present.

  5. jvanasco reporter

    OK. I redid the model and selection to more closely mimic my production dataset.

    The issue would be better described as "ordering fails when sorted by a non-unique column". The fix, however, is simple: migrate the DISTINCT selection into a query that wraps the paginated selection.

    Apologies that this can take a while to run; I had to create an arbitrarily large dataset to trigger this condition. I added a toggle to persist the dataset in Postgres, so subsequent runs can go faster as select-only. With a fresh set of eyes, you may be able to think of a better test case.

    In any case:

    • Many more Foos than Bars
    • Each Foo will have 3 Bars, AND the bars are not necessarily unique -- i.e. id_foo + id_bar is not a unique constraint
    • Random distribution of Foo-to-Bar assignments

    I identified the problematic ids in this dataset, so you can see them more easily.

    The model describes an association table, foo_2_bar. (the payload columns are irrelevant, just there as placeholders).

    The example query searches for the Foo2Bar rows that correspond to a given id_foo.

    Although I supply an order_by, we run into issues because of the pagination, as I am ordering by id_bar (a non-unique column).

    Here is the current SQL for the problematic id 9885:

    SELECT bar.id AS bar_id, bar.payload AS bar_payload, anon_1.foo_2_bar_id_bar AS anon_1_foo_2_bar_id_bar
    FROM (SELECT DISTINCT foo_2_bar.id_bar AS foo_2_bar_id_bar
          FROM foo_2_bar
          WHERE foo_2_bar.id_foo = 9885
          ORDER BY foo_2_bar.id_bar ASC
          LIMIT 10 OFFSET 2) AS anon_1
    JOIN bar ON bar.id = anon_1.foo_2_bar_id_bar
    ORDER BY anon_1.foo_2_bar_id_bar;
    

    It returns no rows because of the inner query:

    SELECT DISTINCT foo_2_bar.id_bar AS foo_2_bar_id_bar 
    FROM foo_2_bar 
    WHERE foo_2_bar.id_foo = 9885 ORDER BY foo_2_bar.id_bar ASC
    LIMIT 10 OFFSET 2;
    
     foo_2_bar_id_bar 
    ------------------
    (0 rows)
    

    If we remove the limit/offset:

    SELECT DISTINCT foo_2_bar.id_bar AS foo_2_bar_id_bar 
    FROM foo_2_bar 
    WHERE foo_2_bar.id_foo = 9885 ORDER BY foo_2_bar.id_bar ASC;
    
     foo_2_bar_id_bar 
    ------------------
                  268
                  494
    

    And if we remove the distinct:

    SELECT foo_2_bar.id_bar AS foo_2_bar_id_bar 
    FROM foo_2_bar 
    WHERE foo_2_bar.id_foo = 9885 ORDER BY foo_2_bar.id_bar ASC;
    
     foo_2_bar_id_bar 
    ------------------
                  268
                  494
                  494
    
    SELECT foo_2_bar.id_bar AS foo_2_bar_id_bar 
    FROM foo_2_bar 
    WHERE foo_2_bar.id_foo = 9885 ORDER BY foo_2_bar.id_bar ASC
    LIMIT 10 OFFSET 2;
    
     foo_2_bar_id_bar 
    ------------------
                  494
    

    Since this table's column is non-unique, the DISTINCT wipes out the result. (This behavior occurs whether or not the clause is ordered.)

    Looking at all the problematic ids, you can see this is caused by the non-distinct nature of the data:

    SELECT id_foo, id_bar, count(id_bar)
    FROM foo_2_bar WHERE id_foo in (85,369,898,1022,1036,1041,1301,1399,1407,1535,1550,1605,1655,1662,1729,1785,1855,1864,2032,2502,2585,2650,2992,3590,3716,3842,4080,4421,4495,4505,4721,4749,5074,5197,5255,5285,5345,5605,5925,5947,6049,6210,6228,6298,6389,6708,6733,6750,6789,6985,6998,8273,8840,8924,9254,9310,9414,9461,9624,9653,9668,9873,9885,9972)
    GROUP BY id_foo, id_bar ORDER BY id_foo ASC;
    

    In order to get the correct 'distinct' list, we need to apply DISTINCT onto the paginated, non-distinct results:

    SELECT DISTINCT foo_2_bar_id_bar FROM (SELECT foo_2_bar.id_bar AS foo_2_bar_id_bar 
                                           FROM foo_2_bar 
                                           WHERE foo_2_bar.id_foo = 9885 ORDER BY foo_2_bar.id_bar ASC
                                           LIMIT 10 OFFSET 2
                                           ) AS subq;
     foo_2_bar_id_bar 
    ------------------
                  494
    
  6. Mike Bayer repo owner

    you aren't ORDER BY'ing enough. Your ORDER BY is still not deterministic and the results that are returned by just the query(Foo2Bar) will not be the same under all conditions.

  7. Mike Bayer repo owner

    http://docs.sqlalchemy.org/en/rel_1_0/orm/loading_relationships.html#the-importance-of-ordering

    emphasis added:

    The Importance of Ordering

    A query which makes use of subqueryload() in conjunction with a limiting modifier such as Query.first(), Query.limit(), or Query.offset() should always include Query.order_by() against unique column(s) such as the primary key, so that the additional queries emitted by subqueryload() include the same ordering as used by the parent query. Without it, there is a chance that the inner query could return the wrong rows:
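
    Applied to the query at the top of this issue, the documented advice looks like this (sketch):

    foos = s.query(Foo)\
            .options(subqueryload('bar'))\
            .order_by(Foo.id)\
            .limit(20)\
            .offset(10)\
            .all()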

  8. jvanasco reporter

    How did I not see the sub-select was not deterministic?!?

    Thanks for setting me straight.

    I found this case in some old code that pre-dated those docs (and this behavior).

    It seems to be a bit of a regression pegged to this merge in 0.8/0.9:

    https://bitbucket.org/zzzeek/sqlalchemy/issues/2836/supply-distinct-within-subq-eager-load
    https://github.com/zzzeek/sqlalchemy/pull/33
    

    Everything worked fine in 0.7 when this was written, but it looks like everything "just so happened to work" because PG is very consistent about otherwise non-deterministic ordering within a transaction. Unfortunately, the changelog didn't tip me off to this consequence.

    My idea of trying to wrap the 'distinct' introduced to strategies.py in #2836 would only minimize, but not solve, this issue.

    Would it make sense at all to (slightly) revisit #3064, and augment the logic introduced in #2836 to ensure the primary keys are extended to the order_by? E.g. if there is a primary key on the table, append its columns to the _order_by. That would effectively fix the determinism problem, be a nice convenience, and not affect code which properly declares its ordering. (I do prefer the rejected notion from #3064 of raising an error on the improper subqueryload, though.)
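
    Written out by hand, the suggested behavior would be equivalent to the user doing this (assuming the Foo2Bar mapping has a surrogate primary key id -- illustrative only):

    q = s.query(Foo2Bar)\
         .options(subqueryload('bar'))\
         .filter(Foo2Bar.id_foo == 9885)\
         .order_by(Foo2Bar.id_bar.asc(), Foo2Bar.id.asc())\
         .limit(10)\
         .offset(2)
    # ordering by (id_bar, id) is deterministic; id_bar alone is not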

  9. Mike Bayer repo owner

    the tricky part is determining if the order_by() is indeed against a unique set of columns, but assuming we can detect that, I'd rather emit a warning. We don't win any friends by modifying the main query that Query() produces.

  10. Mike Bayer repo owner

    so....I don't really want to get into trying to guess if the ORDER BY is unique, because even emitting a warning is going to annoy people who are order_by'ing columns that are unique but have no formal indicators of such (e.g. they have a unique constraint but aren't mapping it in Python, or an unconstrained, yet unique, column).

  11. jvanasco reporter

    Just to counter that -- the docs didn't note this requirement until about 20 months ago, and the behavior was largely created by the DISTINCT improvement a little over 2 years ago. There should really be a warning emitted, or at least an advisory notice in the changelog, simply because subqueries that worked in 0.7 may completely break in 0.8/0.9/1.0, and looping through the changelog would not suggest why.

    The 0.9 changelog only mentions this bit about the DISTINCT being applied:

    http://docs.sqlalchemy.org/en/rel_1_1/changelog/migration_09.html#change-2836

    That bit does not note that existing code will break if non-unique column(s) are used for the order_by. I know that the existing code would have been somewhat faulty, but it would have run and passed tests unless the ordering was examined. Under the current behavior, non-unique values will cause no relationships to load (instead of a potentially incorrect relationship) -- so code that accesses 'foo2bar.bar' will end up raising an exception rather than seeing potentially mis-ordered data. Unlike the variant of the issue with first() noted in the docs [ http://docs.sqlalchemy.org/en/latest/faq/ormconfiguration.html#why-is-order-by-required-with-limit-especially-with-subqueryload ], the relationship will be None and not lazy-loaded on access.

  12. jvanasco reporter

    Perhaps just update the changelog doc?

    http://docs.sqlalchemy.org/en/rel_1_1/changelog/migration_09.html#change-2836

    Amend the above paragraph to add:

    A side effect of this new feature is that calls to subqueryload must now include deterministic column(s) in an order_by if LIMIT or OFFSET are used. This can often be accomplished by adding the primary key or another unique column(s).

    The deterministic sort is required because the DISTINCT will eliminate duplicate rows, affecting the result set if there are non-unique values, and it can cause objects to load null relationships where real ones exist.

    While previous versions of SQLAlchemy often populated these relationships correctly, that was because several database backends implement a repeatable natural sort; this behavior is not consistent across all backends.

    Queries written for the SqlAlchemy 0.8 series (or earlier) which invoke subqueryload should be audited and updated if needed.

    See also: http://docs.sqlalchemy.org/en/rel_1_0/orm/loading_relationships.html#the-importance-of-ordering
    See also: http://docs.sqlalchemy.org/en/rel_1_0/faq/ormconfiguration.html#faq-subqueryload-limit-sort

    and also reference it in the 0.8 area: http://docs.sqlalchemy.org/en/rel_1_1/changelog/changelog_08.html#change-0.8.3

  13. Mike Bayer repo owner

    sure. though I'd want to word it to make clear that it is much more important to have a deterministic ORDER BY than it was before; I don't want it to seem like subqueryload created this problem. send off a PR!
