`contains_eager` for subqueryload

Issue #3684 wontfix
jmagnusson
created an issue

As we all know contains_eager and joinedload queries can be very expensive to run as the amount of columns and rows in the result set start to grow. subqueryload helps a lot in these cases, but unfortunately there's no contains_eager version for it which could allow filtering of the related collections. There is kind of a way to get around this by using the DisjointEagerLoading recipe. But could this perhaps be achieved in a more generic way which could make it into the library?

One idea I have is to pass in a new option sa.orm.enable_eagerloads_filtering() to .options(), like this:

import sqlalchemy as sa
import sqlalchemy.orm  # noqa
from sqlalchemy.ext import declarative

Base = declarative.declarative_base()
Session = sa.orm.sessionmaker()


class Product(Base):
    __tablename__ = 'product'
    id = sa.Column(sa.Integer, primary_key=True)
    artno = sa.Column(sa.Text)
    name = sa.Column(sa.Text)


class Variant(Base):
    __tablename__ = 'variant'
    id = sa.Column(sa.Integer, primary_key=True)
    color = sa.Column(sa.Text)
    product_id = sa.Column(sa.Integer, sa.ForeignKey(Product.id))
    product = sa.orm.relationship(Product, backref='variants')


class SKU(Base):
    __tablename__ = 'sku'
    id = sa.Column(sa.Integer, primary_key=True)
    size = sa.Column(sa.Text)
    variant_id = sa.Column(sa.Integer, sa.ForeignKey(Variant.id))
    variant = sa.orm.relationship(Variant, backref='skus')


product = Product(name='T-shirt', artno='12345')
variant = Variant(product=product, color='Black')
sku2 = SKU(variant=variant, size='S')
sku = SKU(variant=variant, size='M')
sku = SKU(variant=variant, size='L')


engine = sa.create_engine('sqlite://', echo=True)
Session.configure(bind=engine)
Base.metadata.create_all(bind=engine)
session = Session()
session.add_all([sku, sku2])
session.flush()

query = (
    session.query(Product)
    .outerjoin(Product.variants)
    .outerjoin(Variant.skus)
    .filter(Product.name == 'T-shirt')
    .filter(Variant.color == 'Black')
    .filter(SKU.size == 'M')
    .options(
        sa.orm.enable_eagerloads_filtering(),  # <-- This triggers filtering
        sa.orm.subqueryload('variants')
        .subqueryload('skus')
    )
)

Imagine that the commented line would change the behavior of how subqueryloads generates it's queries, and perhaps this could even be a replacement to contains_eager if both subqueryload and joinedload can be changed in this way. I imagine the generated queries to look something like what follows. Note how the inner queries look almost exactly the same.

Main query

SELECT
    product.id,
    product.artno,
    product.name
FROM product
WHERE product.id IN (
    SELECT DISTINCT product.id
    FROM product
    LEFT OUTER JOIN variant ON
        variant.product_id = product.id
    LEFT OUTER JOIN sku ON
        sku.variant_id = variant.id
    WHERE
        product.name = 'T-shirt' AND
        variant.color = 'Black' AND
        sku.size = 'M'
)

Variant subquery

SELECT
    variant.id,
    variant.product_id,
    variant.color
FROM variant
WHERE variant.id IN (
    SELECT DISTINCT variant.id
    FROM product
    LEFT OUTER JOIN variant ON
        variant.product_id = product.id
    LEFT OUTER JOIN sku ON
        sku.variant_id = variant.id
    WHERE
        product.name = 'T-shirt' AND
        variant.color = 'Black' AND
        sku.size = 'M'
)

SKU subquery

SELECT
    sku.id,
    sku.variant_id,
    sku.size
FROM sku
WHERE sku.id IN (
    SELECT DISTINCT sku.id
    FROM product
    LEFT OUTER JOIN variant ON
        variant.product_id = product.id
    LEFT OUTER JOIN sku ON
        sku.variant_id = variant.id
    WHERE
        product.name = 'T-shirt' AND
        variant.color = 'Black' AND
        sku.size = 'M'
)

SQLAlchemy then bakes in the SKU and Variant results into the main query results like it normally would.

I'm not very familiar at all with the internals of SQLAlchemy so I can't say anything about the feasibility of this method. Could it be achieved? And is it a good idea?

Comments (7)

  1. Michael Bayer repo owner

    That proposal can't work because the "main query" emitted in SQL looks nothing like what's being asked for by Query. The contract of Query is that you always get exactly the join() / outerjoin() / filter() you ask for, regardless of eager loading. It looks like you're having the eager loading directives actually remove meaning from the outermost query and that doesn't make any sense. There's no IN or DISTINCT or anything else like that being requested in the main query.

    if the goal is, "do the thing that contains_eager() accomplishes except with subqueries", you should use a new query for that. We can make the recipe in disjointeagerloading easier to use, but definitely, the simplest way for any user to understand the feature of, "I'd like to emit query A, then query B, and have full control over query B", is that they write query B. You can use query A as the foundation for query B.

  2. Michael Bayer repo owner

    I'd imagine this would best be:

    query = (
        session.query(Product)
        .outerjoin(Product.variants)
        .outerjoin(Variant.skus)
        .filter(Product.name == 'T-shirt')
        .filter(Variant.color == 'Black')
        .filter(SKU.size == 'M')
        .options(
            sa.orm.contains_eager('variants', separate=True)
            .contains_eager('skus', separate=True)
        )
    )
    

    you then get essentially the same query three times with a different set of columns. The spirit of "contains_eager()" is, this is exactly the query we want.

  3. jmagnusson reporter

    Perhaps a distinct keyword could be passed in to contains_eager as well? I'd imagine that would be cheaper than having python iterate through all the duplicate rows one might end up with (like the "variants" query would in this case if there'd be thousands of SKUs).

  4. Michael Bayer repo owner

    it's not a good idea to put distinct() on an open-ended set of columns. the existing subqueryload has distinct() on the inner query where we only have primary key values. so repurposing contains_eager() to work for second/third/etc. loads while maintaining WYSIWYG is not ideal.

  5. Michael Bayer repo owner

    really, this goes back to the fundamental point that the use case of, "I want to load X's filtered by Y, and also have their collections filtered to Y", is strange. I've been writing production applications with SQLAlchemy for eight years and have never wanted to do that. I usually encourage people who have the "load some Xs and some related Ys" do do it the normal way, which here would be "sess.query(Product, Variant, Sku)" ...

  6. Michael Bayer repo owner

    I don't have plans to work on / support a feature like this, the use case of "filtered collections" is not a first-class use case and to automate it further will lead to burdensome, complicated features on top of the ORM.

  7. Log in to comment