Eagerload + many-to-many relationship leads to problems due to outer joins

Issue #1643 resolved
Former user created an issue

Eagerload uses LEFT OUTER JOIN to join the additional fields that are required to eagerly load objects, which makes perfect sense. However, in many-to-many relationships the needed fields are an additional join away (through the association table), ie:

table_A JOIN table_B ON <join condition A-B> JOIN table_C ON <join condition B-C>

When eagerload for B is specified on A, SQL Alchemy converts both joins to outer joins, ie. this meta-SQL would become:

table_A LEFT OUTER JOIN table_B ON <join condition A-B> LEFT OUTER JOIN table_C ON <join condition B-C>

Which returns wrong results, since the second outer join adds additional rows (those that don't match the second join condition).

It seems to me that the correct behavior would be to have just the first join an outer join, ie:

table_A LEFT OUTER JOIN table_B ON <join condition A-B> JOIN table_C ON <join condition B-C>

This is both logical (as ''eagerloadness'' shouldn't affect the relation between B and C) and gives expected results when typed in SQL editor.

I know there's a new "innerjoin" parameter in 0.6, but it is my understanding that it would convert both joins to inner joins, which would not help here (the join field can be NULL and this would lead to wrong results).

Notice also that this behavior is only unmasked when using uselist=False, since the null rows inserted by the second outer join seem to be removed by SQLA from the resulting list when not using this option.

The attached code should very clearly demonstrate the problem - specifying eagerload leads to incorrect result. This is with SQL 0.5.6.

Comments (8)

  1. Mike Bayer repo owner

    The issue is unfortunately not as simple as you describe. A join from A->B->C, where A->B is outer and B->C is inner, will return no A rows for which a B->C does not exist - i.e. the single "inner" join disqualifies the join chain from being "outer".

    Here is a demonstration:

    from sqlalchemy import *
    
    engine = create_engine('sqlite:///:memory:', echo=True)
    
    m = MetaData()
    a = Table('a', m, Column('id', Integer, primary_key=True))
    b = Table('b', m, 
                Column('aid', Integer, ForeignKey('a.id'), primary_key=True), 
                Column('cid', Integer, ForeignKey('c.id'), primary_key=True)
            )
    c = Table('c', m, Column('id', Integer, primary_key=True))
    
    m.create_all(engine)
    
    engine.execute(a.insert(), {'id':1}, {'id':2}, {'id':3})
    
    # passes
    assert engine.execute(
        select([a](a)).select_from(a.outerjoin(b).outerjoin(c))
    ).fetchall() == [(2,), (3,)]((1,),)
    
    # fails
    assert engine.execute(
        select([a](a)).select_from(a.outerjoin(b).join(c))
    ).fetchall() == [(2,), (3,)]((1,),)
    

    so the fact is, whether you use the "innerjoin=True" flag, or the fix you prescribe is implemented:

    Index: lib/sqlalchemy/orm/util.py
    ===================================================================
    --- lib/sqlalchemy/orm/util.py  (revision 6545)
    +++ lib/sqlalchemy/orm/util.py  (working copy)
    @@ -409,6 +409,7 @@
    
                     if sj is not None:
                         left = sql.join(left, secondary, pj, isouter)
    +                    isouter = False
                         onclause = sj
                     else:
                         onclause = pj
    

    the test fails to work properly for a Person object with no child rows:

            john = Person(u'John Smith')
            s.add(john)
            s.commit()
    
            s.query(Person).one()
            s.expunge_all()
    
            s.query(Person).options(eagerload('important_address')).one()
    
    
    
    2009-12-17 11:33:47,971 INFO sqlalchemy.engine.base.Engine.0x...2090 SELECT 
    anon_1.persons_id AS anon_1_persons_id, anon_1.persons_name AS anon_1_persons_name, 
    contact_addresses_1.id AS contact_addresses_1_id, contact_addresses_1.address AS 
    contact_addresses_1_address, contact_addresses_1.important AS 
    contact_addresses_1_important 
    FROM (SELECT persons.id AS persons_id, persons.name AS persons_name 
    FROM persons 
     LIMIT 2 OFFSET 0) AS anon_1 LEFT OUTER JOIN person_contact_addresses AS 
    person_contact_addresses_1 ON anon_1.persons_id = person_contact_addresses_1.person_id 
    JOIN contact_addresses AS contact_addresses_1 ON 
    person_contact_addresses_1.contact_address_id = contact_addresses_1.id AND 
    contact_addresses_1.important = ?
    2009-12-17 11:33:47,972 INFO sqlalchemy.engine.base.Engine.0x...2090 [True](True)
    2009-12-17 11:33:47,972 DEBUG sqlalchemy.engine.base.Engine.0x...2090 Col 
    ('anon_1_persons_id', 'anon_1_persons_name', 'contact_addresses_1_id', 
    'contact_addresses_1_address', 'contact_addresses_1_important')
    Traceback (most recent call last):
      File "eager_outer_join.py", line 59, in <module>
        s.query(Person).options(eagerload('important_address')).one()
      File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/orm/query.py", line 1276, in one
        raise orm_exc.NoResultFound("No row was found for one()")
    sqlalchemy.orm.exc.NoResultFound: No row was found for one()
    

    in this case you'll need to use some other method to get your "important" ContactAddress - possibly a secondary mapper to a subquery of person_contact_address + contact_address which is then related to Person via a single join condition.

  2. Former user Account Deleted

    You're absolutely right, the "fix" as written fails to work in such a case. The problem was that I didn't write down correctly what was in my head; what I really meant was along the line of this:

    table_A LEFT OUTER JOIN (table_B JOIN table_C ON <join condition B-C>) ON <join condition A-C>
    

    Such a join produces correct results (I've tested it), but I guess it might be rather trickier to implement due to aliasing issues. However, I'm sure you'll agree that the current behaviour is definetely a bug - specifying eagerload changes the values of returned results, which is definetely not apropriate.

    An easier fix might be to go along the lines of code that is executed when uselist=False is not specified. I've noticed that in this case SQLA removes the null row(s) (probably ''after'' the results are fetched) so the effective result is correct. Eg. the test case without uselist=False will pass OK (of course, with a slighlt modification as the returned values are now lists).

  3. Mike Bayer repo owner
    • changed milestone to 0.6.0

    yeah the fix you have there is basically to map to a subquery. That syntax is not accepted on all platforms including MySQL so its not platform-independent, hence not really worth invoking automatically.

    as far as uselist=False, it is not in any way intended to be used with a join condition that returns multiple rows. having it pick an arbitrary row is just as much a "bug" as picking the first one in all cases. What it really should do is raise an error if a second row comes in. unfortunately I don't know that I'll have time to play with this for 0.6.0 which means it might have to be a warning introduced later on.

  4. Former user Account Deleted

    Unfortunately, I'm on a deadline for the application I'm writing, otherwise I would try to change SQLA to implement this.

    Can you give me some pointers on how to implement "secondary mapper to a subquery of person_contact_address + contact_address which is then related to Person via a single join condition"? I can't figure out how to do this by going through the documentation...

  5. Former user Account Deleted

    as far as uselist=False, it is not in any way intended to be used with a join condition that returns multiple rows. having it pick an arbitrary row is just as much a "bug" as picking the first one in all cases. What it really should do is raise an error if a second row comes in. unfortunately I don't know that I'll have time to play with this for 0.6.0 which means it might have to be a warning introduced later on.

    That's true, but I'm not talking about picking 1 row from 2 valid ones; just filtering the NULL ones which then leaves the correct one. Again, when there's no uselist=False this kind of filtering seems to take place...

  6. Mike Bayer repo owner

    Replying to guest:

    Unfortunately, I'm on a deadline for the application I'm writing, otherwise I would try to change SQLA to implement this.

    Can you give me some pointers on how to implement "secondary mapper to a subquery of person_contact_address + contact_address which is then related to Person via a single join condition"? I can't figure out how to do this by going through the documentation...

    Here you are. I think you'll see it makes more sense relationally and also produces the subquery that most databases other than PG and SQLite are expecting to see.

    from sqlalchemy import *
    from sqlalchemy.orm import *
    
    db_engine=create_engine('sqlite://', echo='debug')
    metadata = MetaData()
    
    
    class Person(object):
        def __init__(self, name):
            self.name = name
    
    persons = Table('persons',
                    metadata,
                    Column('id', Integer, primary_key=True),
                    Column('name', Unicode(50)))
    
    
    class ContactAddress(object):
        def __init__(self, address, important):
            self.address = address
            self.important = important
    
    contact_addresses = Table('contact_addresses',
                              metadata,
                              Column('id', Integer, primary_key=True),
                              Column('address', Unicode(50)),
                              Column('important', Boolean))
    
    person_contact_addresses = Table('person_contact_addresses',
                         metadata,
                         Column('person_id', Integer, 
                            ForeignKey('persons.id'), primary_key=True),
                         Column('contact_address_id', Integer,
                          ForeignKey('contact_addresses.id'), primary_key=True))
    
    mapper(Person, persons, properties={
        'addresses': relation(ContactAddress, 
                            secondary=person_contact_addresses,
                             backref='parties')})
    
    mapper(ContactAddress, contact_addresses)
    
    contact_joined_to_assoc = contact_addresses.join(
                            person_contact_addresses, 
                            and_(
                                person_contact_addresses.c.contact_address_id==
                                    contact_addresses.c.id, 
                                    contact_addresses.c.important==True
                            )
                        )
    
    contact_joined_to_assoc_mapper = mapper(ContactAddress,
                                            contact_joined_to_assoc,
                                            non_primary=True
                                            )
    
    class_mapper(Person).add_property(
                            'important_address', 
                            relation(contact_joined_to_assoc_mapper,
                                viewonly=True, uselist=False)
                        )
    
    if __name__ == '__main__':
    
        metadata.create_all(db_engine)
        s=sessionmaker(bind=db_engine)()
    
        ed = Person(u'Ed')
        jane = Person(u'Jane Smith')
        john = Person(u'John Smith')
        addr1 = ContactAddress(u'Main Street #1', False)
        addr2 = ContactAddress(u'Main Street #2', True)
        john.addresses=[addr2](addr1,)
        jane.addresses=[addr1](addr1)
        s.add_all([jane, ed](john,))
        s.commit()
    
        s.expunge_all()
    
        people = s.query(Person).\
                    order_by(Person.name).\
                    options(eagerload('important_address')).\
                    all()
    
        ed, jane, john = people
    
        assert ed.important_address is None
        assert jane.important_address is None
        assert john.important_address.address == u'Main Street #2'
    
  7. Mike Bayer repo owner

    the "filtering" takes place when uselist=True because collections never have NULL rows appended to them, so the rows are skipped. But its still relationally "wrong" for the eager condition to be receiving lots of NULL rows due to a less-than-ideal LEFT OUTER JOIN chain. Applying the second join as a subquery is still the way to go. This is not something SQLA is going to do automatically - its far too great of a guess for it to be making. As far as "eager doesn't return the same as lazy automatically", that is always going to be a condition of more elaborate join conditions on relations.

    The abovementioned warning for redundant rows sent in conjunction with uselist=False is added in 0683cc486d07fa1d1a381ffe647842be15c455f4.

  8. Log in to comment