Subquery loading off of an "already seen" row

Issue #3717 duplicate
Former user created an issue

Given the following relations:

  • Company has many employees and a parent company
  • Employee has an employer

And the following instances:

  • Apple w/ employees Tim Cook & Jony Ive
  • Beats w/ employees Dr. Dre & Jimmy Iovine

The following query does not load everything:

employees = session.query(Employee).options(
    joinedload('employer').joinedload('parent').subqueryload('employees')
).filter().all()

It appears that it finds all employees and parents properly, but since all the companies get loaded into the session via the joined load, it seems to stop processing at this point and doesn't actually perform the subqueryload of employees. This example may seem contrived since it's circling back around through the relationships, but it's a simplified version of something that w/ filters applied is a very real query in our system.

Below is some code to run to show the problem. Just change the connection string to be able to connect to your database, the code will populate a few records for you.

import logging
from contextlib import contextmanager
from sqlalchemy import (
    Column,
    ForeignKey,
    Integer,
    MetaData,
    String,
    Table,
    create_engine,
    inspect,
)
from sqlalchemy.orm import (
    mapper,
    relationship,
    sessionmaker,
    subqueryload,
    joinedload,
)

class Model(object):
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)
class Company(Model): pass
class Employee(Model): pass

engine = create_engine(
    'mysql+pymysql://root:password@localhost/test?charset=utf8mb4',
    convert_unicode=True,
)
metadata = MetaData(bind=engine)
Session = sessionmaker(bind=engine)

company_table = Table(
    'companies', metadata,
    Column('id', Integer, primary_key=True, autoincrement=True),
    Column('name', String(255)),
    Column('parent_id', Integer, ForeignKey('companies.id')),
)
employee_table = Table(
    'employees', metadata,
    Column('id', Integer, primary_key=True, autoincrement=True),
    Column('name', String(255)),
    Column('employer_id', Integer, ForeignKey('companies.id')),
)
company_mapper = mapper(Company, company_table, properties={
    'parent': relationship(
        Company,
        uselist=False,
        remote_side=[company_table.columns.id],
    ),
    'employees': relationship(
        Employee,
        uselist=True,
        back_populates='employer',
    ),
})

employee_mapper = mapper(Employee, employee_table, properties={
    'employer': relationship(
        Company,
        uselist=False,
        back_populates='employees',
    )
})

connection = engine.connect()
metadata.create_all(connection)
connection.close()

@contextmanager
def session_manager():
    session = Session()
    try: yield session; session.commit()
    except: session.rollback(); raise
    finally: session.close()

def _ensure_loaded(obj, key_path):
    components = key_path.split('.')
    for i, key in enumerate(components):
        if key in inspect(obj).unloaded:
            raise AttributeError('Attribute {} not loaded during query'.format(
                '.'.join(components[:i+1])
            ))
        obj = getattr(obj, key)

        if obj is None:
            break

models = [
    Company(id=1, name='Apple'),
    Company(id=2, name='Beats', parent_id=1),
    Employee(id=1, name='Tim Cook', employer_id=1),
    Employee(id=2, name='Jony Ive', employer_id=1),
    Employee(id=3, name='Dr. Dre', employer_id=2),
    Employee(id=4, name='Jimmy Iovine', employer_id=2),
]

for model in models:
    with session_manager() as session:
        try:
            session.add(model)
            session.commit()
        except Exception as e: session.close()

logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

with session_manager() as session:
    employees = session.query(Employee).options(
        joinedload('employer').joinedload('parent').subqueryload('employees')
    ).filter(Employee.id == 3).all()

    for employee in employees:
        _ensure_loaded(employee, 'employer.parent.employees')

# this block of code fails presumably because the parent employer is loaded into the session during
# the first joinedload of the employer which doesn't trigger the subqueryload for parent.employees
# for the company that actually does have a parent company.
with session_manager() as session:
    employees = session.query(Employee).options(
        joinedload('employer').joinedload('parent').subqueryload('employees')
    ).filter().all()

    for employee in employees:
        _ensure_loaded(employee, 'employer.parent.employees')

Comments (10)

  1. Mike Bayer repo owner

    I think you meant to map "company_id" and not "employer_id" (or vice versa), as it stands the program has None for all .employer attributes because the setup refers to a non-mapped name "company_id".

    otherwise this is kind of a similar thing to #3431, though more generally since it occurs for collections also.

  2. Mike Bayer repo owner

    it's not totally clear this should be changed (though probably), but in any case your analysis is correct, so it would follow that to get the load you want you'd need to do this:

    with session_manager() as session:
        employees = session.query(Employee).options(
            joinedload('employer').joinedload('parent').subqueryload('employees'),
            defaultload('employer').subqueryload('employees')
        ).order_by(Employee.id).all()
    

    otherwise, it's ambiguous when we come across a "Company" object, if it is to be loaded from two different contexts in the same result set, only one wins out. If one loader path set up joinedload for employer.employees and another subqueryload, it depends on the ORDER BY how that would work out.

  3. Former user Account Deleted reporter

    @zzzeek yes, sorry for the mixup with the employer_id, I've fixed the example for clarity.

    And your suggestion for the load is actually how I've worked around the problem, though repeated the joinedload, so this, which is effectively the same:

    with session_manager() as session:
        employees = session.query(Employee).options(
            joinedload('employer').joinedload('parent').subqueryload('employees'),
            joinedload('employer').subqueryload('employees')
        ).order_by(Employee.id).all()
    

    The issue with doing this is that you're loading data unnecessarily just to ensure the subqueryload is satisfied.

    The sequence joinedload('employer').joinedload('parent').subqueryload('employees') says to me, load all employers and and their parent company, and I need the employees of all parent companies. In our example, that's the employees of Apple when looking up employees Dr. Dre or Jimmy Iovine (since their employer has a parent company of Apple).

    Adding the sequence joinedload('employer').subqueryload('employees') says to load the employer's employees as well. That means loading all of Beats employees even though that relation may never be used.

    Again, I know this example feels a little contrived, but with our schema & the models/relations that we have set up, it's very real.

    --

    You mentioned ORDER BY as something that affects this query, but I'm not certain that seems relevant here (though in our code base, we have applied that properly). The joinedload will always load the employer and the parent as part of the initial query, so that will always load those values into the result set first. So it's just a matter of how & what data's being returned in the result set whether that value will "seen" before reaching the subqueryload. But now re-reading your comment, I think you may have been suggesting something different. I'm going to leave this little comment in case it furthers the conversation in any way, though.

  4. Log in to comment