load_only on joined parent row causes issuing queries for already fetched columns

Issue #3822 resolved
Marcin Barczyński created an issue

Consider the following example:

from sqlalchemy.engine import create_engine
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import aliased, contains_eager, relationship
from sqlalchemy.orm import sessionmaker


Base = declarative_base()
class Node(Base):
    __tablename__ = 'node'

    id = Column(Integer, primary_key=True)
    parent_id = Column(ForeignKey('node.id'))
    parent = relationship('Node', remote_side=[id])
    name = Column(String)


engine = create_engine("postgresql://test:test@localhost/test", echo=True)
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)

session_class = sessionmaker(bind=engine)
session1 = session_class()

node1 = Node(id=1, parent=None, name='str1')
node2 = Node(id=2, parent=None, name='str2')
node3 = Node(id=3, parent=node2, name='str3')
node1.parent = node2

session1.add(node1)
session1.add(node2)
session1.add(node3)
session1.commit()

session2 = session_class()
ParentNode = aliased(Node)
query = session2.query(Node).\
    outerjoin(ParentNode, Node.parent).\
    options(contains_eager(Node.parent, alias=ParentNode).load_only(ParentNode.id, ParentNode.parent_id))

for row in query.order_by(Node.id):
    print row.id, row.name

Here are the queries emitted by SQLAlchemy:

2016-10-13 12:14:38,778 INFO sqlalchemy.engine.base.Engine SELECT node_1.id AS node_1_id, node_1.parent_id AS node_1_parent_id, node.id AS node_id, node.parent_id AS node_parent_id, node.name AS node_name 
FROM node LEFT OUTER JOIN node AS node_1 ON node_1.id = node.parent_id ORDER BY node.id
2016-10-13 12:14:38,778 INFO sqlalchemy.engine.base.Engine {}
2016-10-13 12:14:38,780 INFO sqlalchemy.engine.base.Engine SELECT node.name AS node_name 
FROM node 
WHERE node.id = %(param_1)s
2016-10-13 12:14:38,780 INFO sqlalchemy.engine.base.Engine {'param_1': 2}

Despite the fact that the first query fetches all necessary columns, an additional query is issued for name of the second node. Note that order_by is crucial here - without it everything works as expected.

Comments (8)

  1. Mike Bayer repo owner

    self-referential joinedload + special options + order by is a variant of the "which loader works on the object first, second" issue, which we see in #3431 #3811 #3681. But this has to do with load_only so I'm going to guess DeferredLoader is trying to set the attribute as "lazyload" but without given the chance to see that it's already populated. But this one a little trickier than #3431 because this works with the column-level population system which doesn't use conditionals at the row level as it is very performance critical.

  2. Mike Bayer repo owner

    the completely naive approach works, which is:

    diff --git a/lib/sqlalchemy/orm/loading.py b/lib/sqlalchemy/orm/loading.py
    index d457f3c..122897c 100644
    --- a/lib/sqlalchemy/orm/loading.py
    +++ b/lib/sqlalchemy/orm/loading.py
    @@ -508,6 +508,8 @@ def _populate_full(
             for key, populator in populators["delayed"]:
                 populator(state, dict_, row)
         else:
    +        for key, getter in populators["quick"]:
    +            dict_[key] = getter(row)
             # have already seen rows with this identity.
             for key, populator in populators["existing"]:
                 populator(state, dict_, row)
    

    and I begin to start considering that the concept of "isnew" in terms of this kind of issue is what's really broken. the row might not be new, but the set of columns we are looking within is. this would alter the approach for #3431 as well.

  3. Mike Bayer repo owner

    here is a very scary way to make 'isnew" act that way, in reality I think a second state variable is more appropriate:

    diff --git a/lib/sqlalchemy/orm/loading.py b/lib/sqlalchemy/orm/loading.py
    index d457f3c..a220329 100644
    --- a/lib/sqlalchemy/orm/loading.py
    +++ b/lib/sqlalchemy/orm/loading.py
    @@ -330,9 +330,8 @@ def _instance_processor(
                     context, path, mapper, result, adapter, populators)
    
         propagate_options = context.propagate_options
    -    if propagate_options:
    -        load_path = context.query._current_path + path \
    -            if context.query._current_path.path else path
    +    load_path = context.query._current_path + path \
    +        if context.query._current_path.path else path
    
         session_identity_map = context.session.identity_map
    
    @@ -426,9 +425,12 @@ def _instance_processor(
                 # full population routines.  Objects here are either
                 # just created, or we are doing a populate_existing
    
    -            if isnew and propagate_options:
    +            if isnew:
                     state.load_options = propagate_options
                     state.load_path = load_path
    +            else:
    +                if state.load_path != load_path:
    +                    isnew = True
    
                 _populate_full(
                     context, row, state, dict_, isnew,
    
  4. Mike Bayer repo owner

    Memoize load_path in all cases, run quick populators for path change

    Adds a new variant to the "isnew" state within entity loading for isnew=False, but the load path is new. This is to address the use case of an entity appearing in multiple places in the row in a more generalized way than the fixes in [ticket:3431], [ticket:3811] in that loading.py will be able to tell the populator that this row is not "isnew" but is a "new" path for the entity. For the moment, the new information is only being applied to the use of "quick" populators so that simple column loads can take place on top of a deferred loader from elsewhere in the row.

    As part of this change, state.load_path() will now always be populated with the "path" that was in effect when this state was originally loaded, which for multi-path loads of the same entity is still non-deterministic. Ideally there'd be some kind of "here's all the paths that loaded this state and how" type of data structure though it's not clear if that could be done while maintaining performance.

    Fixes: #3822 Change-Id: Ib915365353dfcca09e15c24001a8581113b97d5e

    → <<cset c02675b407b8>>

  5. Log in to comment