Eagerload + many-to-many relationship leads to problems due to outer joins

Issue #1643 resolved

Former user created an issue 2009-12-17

Eagerload uses LEFT OUTER JOIN to join the additional fields that are required to eagerly load objects, which makes perfect sense. However, in many-to-many relationships the needed fields are an additional join away (through the association table), ie:

table_A JOIN table_B ON <join condition A-B> JOIN table_C ON <join condition B-C>

When eagerload for B is specified on A, SQL Alchemy converts both joins to outer joins, ie. this meta-SQL would become:

table_A LEFT OUTER JOIN table_B ON <join condition A-B> LEFT OUTER JOIN table_C ON <join condition B-C>

Which returns wrong results, since the second outer join adds additional rows (those that don't match the second join condition).

It seems to me that the correct behavior would be to have just the first join an outer join, ie:

table_A LEFT OUTER JOIN table_B ON <join condition A-B> JOIN table_C ON <join condition B-C>

This is both logical (as ''eagerloadness'' shouldn't affect the relation between B and C) and gives expected results when typed in SQL editor.

I know there's a new "innerjoin" parameter in 0.6, but it is my understanding that it would convert both joins to inner joins, which would not help here (the join field can be NULL and this would lead to wrong results).

Notice also that this behavior is only unmasked when using uselist=False, since the null rows inserted by the second outer join seem to be removed by SQLA from the resulting list when not using this option.

The attached code should very clearly demonstrate the problem - specifying eagerload leads to incorrect result. This is with SQL 0.5.6.

eager_outer_join.py

Comments (8)

Mike Bayer repo owner

The issue is unfortunately not as simple as you describe. A join from A->B->C, where A->B is outer and B->C is inner, will return no A rows for which a B->C does not exist - i.e. the single "inner" join disqualifies the join chain from being "outer".

Here is a demonstration:

from sqlalchemy import *

engine = create_engine('sqlite:///:memory:', echo=True)

m = MetaData()
a = Table('a', m, Column('id', Integer, primary_key=True))
b = Table('b', m, 
            Column('aid', Integer, ForeignKey('a.id'), primary_key=True), 
            Column('cid', Integer, ForeignKey('c.id'), primary_key=True)
        )
c = Table('c', m, Column('id', Integer, primary_key=True))

m.create_all(engine)

engine.execute(a.insert(), {'id':1}, {'id':2}, {'id':3})

# passes
assert engine.execute(
    select([a](a)).select_from(a.outerjoin(b).outerjoin(c))
).fetchall() == [(2,), (3,)]((1,),)

# fails
assert engine.execute(
    select([a](a)).select_from(a.outerjoin(b).join(c))
).fetchall() == [(2,), (3,)]((1,),)

so the fact is, whether you use the "innerjoin=True" flag, or the fix you prescribe is implemented:

Index: lib/sqlalchemy/orm/util.py
===================================================================
--- lib/sqlalchemy/orm/util.py  (revision 6545)
+++ lib/sqlalchemy/orm/util.py  (working copy)
@@ -409,6 +409,7 @@

                 if sj is not None:
                     left = sql.join(left, secondary, pj, isouter)
+                    isouter = False
                     onclause = sj
                 else:
                     onclause = pj

the test fails to work properly for a Person object with no child rows:

        john = Person(u'John Smith')
        s.add(john)
        s.commit()

        s.query(Person).one()
        s.expunge_all()

        s.query(Person).options(eagerload('important_address')).one()



2009-12-17 11:33:47,971 INFO sqlalchemy.engine.base.Engine.0x...2090 SELECT 
anon_1.persons_id AS anon_1_persons_id, anon_1.persons_name AS anon_1_persons_name, 
contact_addresses_1.id AS contact_addresses_1_id, contact_addresses_1.address AS 
contact_addresses_1_address, contact_addresses_1.important AS 
contact_addresses_1_important 
FROM (SELECT persons.id AS persons_id, persons.name AS persons_name 
FROM persons 
 LIMIT 2 OFFSET 0) AS anon_1 LEFT OUTER JOIN person_contact_addresses AS 
person_contact_addresses_1 ON anon_1.persons_id = person_contact_addresses_1.person_id 
JOIN contact_addresses AS contact_addresses_1 ON 
person_contact_addresses_1.contact_address_id = contact_addresses_1.id AND 
contact_addresses_1.important = ?
2009-12-17 11:33:47,972 INFO sqlalchemy.engine.base.Engine.0x...2090 [True](True)
2009-12-17 11:33:47,972 DEBUG sqlalchemy.engine.base.Engine.0x...2090 Col 
('anon_1_persons_id', 'anon_1_persons_name', 'contact_addresses_1_id', 
'contact_addresses_1_address', 'contact_addresses_1_important')
Traceback (most recent call last):
  File "eager_outer_join.py", line 59, in <module>
    s.query(Person).options(eagerload('important_address')).one()
  File "/Users/classic/dev/sqlalchemy/lib/sqlalchemy/orm/query.py", line 1276, in one
    raise orm_exc.NoResultFound("No row was found for one()")
sqlalchemy.orm.exc.NoResultFound: No row was found for one()

in this case you'll need to use some other method to get your "important" ContactAddress - possibly a secondary mapper to a subquery of person_contact_address + contact_address which is then related to Person via a single join condition.

2009-12-17T11:42:11+00:00

Former user Account Deleted
You're absolutely right, the "fix" as written fails to work in such a case. The problem was that I didn't write down correctly what was in my head; what I really meant was along the line of this:
```
table_A LEFT OUTER JOIN (table_B JOIN table_C ON <join condition B-C>) ON <join condition A-C>
```
Such a join produces correct results (I've tested it), but I guess it might be rather trickier to implement due to aliasing issues. However, I'm sure you'll agree that the current behaviour is definetely a bug - specifying eagerload changes the values of returned results, which is definetely not apropriate.

An easier fix might be to go along the lines of code that is executed when uselist=False is not specified. I've noticed that in this case SQLA removes the null row(s) (probably ''after'' the results are fetched) so the effective result is correct. Eg. the test case without uselist=False will pass OK (of course, with a slighlt modification as the returned values are now lists).
- 2009-12-17T15:25:41+00:00
Mike Bayer repo owner
- changed milestone to 0.6.0
yeah the fix you have there is basically to map to a subquery. That syntax is not accepted on all platforms including MySQL so its not platform-independent, hence not really worth invoking automatically.

as far as uselist=False, it is not in any way intended to be used with a join condition that returns multiple rows. having it pick an arbitrary row is just as much a "bug" as picking the first one in all cases. What it really should do is raise an error if a second row comes in. unfortunately I don't know that I'll have time to play with this for 0.6.0 which means it might have to be a warning introduced later on.
- 2009-12-17T20:28:57+00:00
Former user Account Deleted
Unfortunately, I'm on a deadline for the application I'm writing, otherwise I would try to change SQLA to implement this.

Can you give me some pointers on how to implement "secondary mapper to a subquery of person_contact_address + contact_address which is then related to Person via a single join condition"? I can't figure out how to do this by going through the documentation...
- 2009-12-18T03:53:58+00:00
Former user Account Deleted
as far as uselist=False, it is not in any way intended to be used with a join condition that returns multiple rows. having it pick an arbitrary row is just as much a "bug" as picking the first one in all cases. What it really should do is raise an error if a second row comes in. unfortunately I don't know that I'll have time to play with this for 0.6.0 which means it might have to be a warning introduced later on.

That's true, but I'm not talking about picking 1 row from 2 valid ones; just filtering the NULL ones which then leaves the correct one. Again, when there's no uselist=False this kind of filtering seems to take place...
- 2009-12-18T03:57:40+00:00

Mike Bayer repo owner

Replying to guest:

Unfortunately, I'm on a deadline for the application I'm writing, otherwise I would try to change SQLA to implement this.

Can you give me some pointers on how to implement "secondary mapper to a subquery of person_contact_address + contact_address which is then related to Person via a single join condition"? I can't figure out how to do this by going through the documentation...

Here you are. I think you'll see it makes more sense relationally and also produces the subquery that most databases other than PG and SQLite are expecting to see.

from sqlalchemy import *
from sqlalchemy.orm import *

db_engine=create_engine('sqlite://', echo='debug')
metadata = MetaData()


class Person(object):
    def __init__(self, name):
        self.name = name

persons = Table('persons',
                metadata,
                Column('id', Integer, primary_key=True),
                Column('name', Unicode(50)))


class ContactAddress(object):
    def __init__(self, address, important):
        self.address = address
        self.important = important

contact_addresses = Table('contact_addresses',
                          metadata,
                          Column('id', Integer, primary_key=True),
                          Column('address', Unicode(50)),
                          Column('important', Boolean))

person_contact_addresses = Table('person_contact_addresses',
                     metadata,
                     Column('person_id', Integer, 
                        ForeignKey('persons.id'), primary_key=True),
                     Column('contact_address_id', Integer,
                      ForeignKey('contact_addresses.id'), primary_key=True))

mapper(Person, persons, properties={
    'addresses': relation(ContactAddress, 
                        secondary=person_contact_addresses,
                         backref='parties')})

mapper(ContactAddress, contact_addresses)

contact_joined_to_assoc = contact_addresses.join(
                        person_contact_addresses, 
                        and_(
                            person_contact_addresses.c.contact_address_id==
                                contact_addresses.c.id, 
                                contact_addresses.c.important==True
                        )
                    )

contact_joined_to_assoc_mapper = mapper(ContactAddress,
                                        contact_joined_to_assoc,
                                        non_primary=True
                                        )

class_mapper(Person).add_property(
                        'important_address', 
                        relation(contact_joined_to_assoc_mapper,
                            viewonly=True, uselist=False)
                    )

if __name__ == '__main__':

    metadata.create_all(db_engine)
    s=sessionmaker(bind=db_engine)()

    ed = Person(u'Ed')
    jane = Person(u'Jane Smith')
    john = Person(u'John Smith')
    addr1 = ContactAddress(u'Main Street #1', False)
    addr2 = ContactAddress(u'Main Street #2', True)
    john.addresses=[addr2](addr1,)
    jane.addresses=[addr1](addr1)
    s.add_all([jane, ed](john,))
    s.commit()

    s.expunge_all()

    people = s.query(Person).\
                order_by(Person.name).\
                options(eagerload('important_address')).\
                all()

    ed, jane, john = people

    assert ed.important_address is None
    assert jane.important_address is None
    assert john.important_address.address == u'Main Street #2'

2009-12-18T11:19:09+00:00

Mike Bayer repo owner
- changed status to resolved
the "filtering" takes place when uselist=True because collections never have NULL rows appended to them, so the rows are skipped. But its still relationally "wrong" for the eager condition to be receiving lots of NULL rows due to a less-than-ideal LEFT OUTER JOIN chain. Applying the second join as a subquery is still the way to go. This is not something SQLA is going to do automatically - its far too great of a guess for it to be making. As far as "eager doesn't return the same as lazy automatically", that is always going to be a condition of more elaborate join conditions on relations.

The abovementioned warning for redundant rows sent in conjunction with uselist=False is added in 0683cc486d07fa1d1a381ffe647842be15c455f4.
- 2009-12-18T12:29:33+00:00
Mike Bayer repo owner
- removed milestone
Removing milestone: 0.6.0 (automated comment)
- 2014-02-23T18:14:38+00:00
Log in to comment

Assignee: Mike Bayer

Type: bug

Priority: major

Status: resolved

Component: orm

Milestone: –

Votes: 0

Watchers: 0