Inefficient query when using association_tables

Issue #3752 closed
Mehdi Gmira created an issue

Hey,

When lazyloading a relationship that is based on an association table, the query generated by sqlalchemy is poorly constructed and really inefficient. Example:

from sqlalchemy import Column, DateTime, String, Integer, ForeignKey, func, Table, create_engine
from sqlalchemy.orm import relationship, backref, sessionmaker, joinedload
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

ods_legs_association_table = Table('students_sports', Base.metadata,
                                   Column('sport_id', Integer, ForeignKey('sport.id', ondelete='CASCADE')),
                                   Column('student_id', Integer, ForeignKey('student.id', ondelete='CASCADE'))
                                   )


class School(Base):
    __tablename__ = 'school'
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Sport(Base):
    __tablename__ = 'sport'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    school_id = Column(Integer, ForeignKey('school.id'))

    school = relationship('School', backref='sports')
    students = relationship("Student", secondary=ods_legs_association_table, backref="sports")


class Student(Base):
    __tablename__ = 'student'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    school_id = Column(Integer, ForeignKey('school.id'))

    school = relationship('School', backref='students')


engine = create_engine('postgresql://xxxx@localhost/test', echo=True)

Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)
Session = sessionmaker(engine)
session = Session()

school = School(name="Foo School")
students = [Student(name=letter) for letter in 'ABCD']
sports = [Sport(name='football'), Sport(name='tennis')]
for student in students:
    student.sports = sports
school.students = students
school.sports = sports
session.add(school)
session.commit()

school = session.query(School).options(joinedload('sports').joinedload('students'), joinedload('students')).one()

school.students[0].sports  # emits SQL statement

The sql statement is

SELECT sport.id AS sport_id, sport.name AS sport_name, sport.school_id AS sport_school_id 
FROM sport, students_sports 
WHERE %(param_1)s = students_sports.student_id AND sport.id = students_sports.sport_id

while it should be

SELECT sport.id AS sport_id, sport.name AS sport_name, sport.school_id AS sport_school_id 
FROM sport
JOIN students_sports ON sport.id = students_sport.sport_id
WHERE %(param_1)s = students_sports.student_id

This might work fine with a simple example like this, but i've had queries like this stuck for hours because postgres really doesn't like stuff like "FROM sport, students_sports"

Comments (13)

  1. Mike Bayer repo owner

    This might work fine with a simple example like this, but i've had queries like this stuck for hours because postgres really doesn't like stuff like "FROM sport, students_sports"

    Hi, and thanks again for nice easy test cases! Please provide evidence for "doesn't like stuff". Per EXPLAIN, Postgresql sees no difference at all between these two statements:

    test=# EXPLAIN SELECT sport.id AS sport_id, sport.name AS sport_name, sport.school_id AS sport_school_id 
    test-# FROM sport, students_sports 
    test-# WHERE 1 = students_sports.student_id AND sport.id = students_sports.sport_id;
                                     QUERY PLAN                                  
    -----------------------------------------------------------------------------
     Hash Join  (cost=38.39..65.00 rows=11 width=40)
       Hash Cond: (sport.id = students_sports.sport_id)
       ->  Seq Scan on sport  (cost=0.00..22.00 rows=1200 width=40)
       ->  Hash  (cost=38.25..38.25 rows=11 width=4)
             ->  Seq Scan on students_sports  (cost=0.00..38.25 rows=11 width=4)
                   Filter: (1 = student_id)
    (6 rows)
    
    test=# EXPLAIN SELECT sport.id AS sport_id, sport.name AS sport_name, sport.school_id AS sport_school_id 
    test-# FROM sport
    test-# JOIN students_sports ON sport.id = students_sports.sport_id
    test-# WHERE 1 = students_sports.student_id;
                                     QUERY PLAN                                  
    -----------------------------------------------------------------------------
     Hash Join  (cost=38.39..65.00 rows=11 width=40)
       Hash Cond: (sport.id = students_sports.sport_id)
       ->  Seq Scan on sport  (cost=0.00..22.00 rows=1200 width=40)
       ->  Hash  (cost=38.25..38.25 rows=11 width=4)
             ->  Seq Scan on students_sports  (cost=0.00..38.25 rows=11 width=4)
                   Filter: (1 = student_id)
    (6 rows)
    
  2. Mehdi Gmira reporter

    I'll try to come up with a test case. But the query that is sent to postgresql is one that tells to do a cartesian product between two tables (i.e all combinations of data), which is really bad. The fact that it does work sometimes is thanks to postgresql's smart query planner that rewrites the query. But the query planner behaves diffrently depending on the settings, stats, etc. For me, the query should be written correctly by SqlAlchemy (i.e Inner join), and should not rely on the query planner.

  3. Mike Bayer repo owner

    But the query that is sent to postgresql is one that tells to do a cartesian product between two tables (i.e all combinations of data), which is really bad.

    that's not what's illustrated here. The query has a WHERE clause which limits the rows to unique combinations between the two tables. It is identical to a JOIN and as evidence the Postgresql query profiler explicitly says this. Just because one uses JOIN doesn't mean a cartesian product is not possible, just "JOIN .. ON true", you'll get the same thing.

    For me, the query should be written correctly by SqlAlchemy (i.e Inner join), and should not rely on the query planner.

    All queries sent to Postgresql or any other system rely upon the query planner, all queries have a plan.

    it would be vastly complicated to get lazy loads to use JOIN in all cases and would destabilize the library for many months or years.

    A very long stackoverflow thread about the issue at http://stackoverflow.com/questions/1018822/inner-join-on-vs-where-clause has lots of debate about which is more "readable", but virtually nothing at all about any peformance differences across many databases. Here is another: http://dba.stackexchange.com/questions/3480/what-is-more-efficient-a-where-clause-or-a-join-with-million-plus-row-tables "For modern RDBMS there is no difference between "explicit JOIN" and "JOIN-in-the-WHERE" (if all JOINS are INNER) regards performance and query plan."

    You're asking for an incredibly huge and risky change and not providing evidence.

  4. Mehdi Gmira reporter

    Sorry, I didn't know that FROM X, Y WHERE ... was a common way to join. I thought It was a bug because it seems to me very unnatural. I do have a case where the inner join form is 10x faster that the other one. But that requires lots of data(~100k rows per table) and multiple joins (~8). So it is really hard to replicate. So you can close this as it is probably more related to postgres than sqlalchemy, since the "FROM X, Y WHERE ..." is fairly common.

  5. Drachenfels

    I will add that when I first time saw this query (in my own code, not here) I was confused as well and thought it's a bug as well. ;-)

  6. Mike Bayer repo owner

    I do have a case where the inner join form is 10x faster that the other one. But that requires lots of data(~100k rows per table) and multiple joins (~8).

    show me the SQL for those and also run EXPLAIN ANALYZE on each - PG should not be optimizing any differently if the two queries are truly equivalent. Also, is this query specifically from SQLA's lazy-load function ?

  7. Mike Bayer repo owner

    the query here is apparently a little confusing to some people but it's unlikely to have any performance implications vs. a JOIN.

  8. Mehdi Gmira reporter

    SQLA query from lazy-load :

    SELECT *
    FROM os_ls, l LEFT OUTER JOIN t AS t_1 ON t_1.id = l.t_id
      LEFT OUTER JOIN t_st AS t_st_1 ON t_st_1.id = t_1.st_or_id
      LEFT OUTER JOIN s AS s_1 ON s_1.id = t_st_1.s_id
      LEFT OUTER JOIN t_st AS t_st_2 ON t_st_2.id = t_1.st_d_id
      LEFT OUTER JOIN s AS s_2 ON s_2.id = t_st_2.s_id
      LEFT OUTER JOIN t_st AS t_st_3 ON t_st_3.id = l.st_or_id
      LEFT OUTER JOIN s AS s_3 ON s_3.id = t_st_3.s_id LEFT OUTER JOIN t_st AS t_st_4 ON t_st_4.id
     = l.st_d_id LEFT OUTER JOIN s AS s_4 ON s_4.id = t_st_4.s_id
     WHERE 146356 = os_ls.o_id AND l.id = os_ls.l_id ORDER BY l."order" ASC;
    

    Explain:

    QUERY PLAN
    Sort  (cost=51251.37..51251.38 rows=4 width=276)
      Sort Key: l."order"
      ->  Nested Loop Left Join  (cost=29850.04..51251.33 rows=4 width=276)
            ->  Nested Loop Left Join  (cost=29849.90..51250.65 rows=4 width=252)
                  ->  Hash Join  (cost=29849.48..51248.63 rows=4 width=224)
                        Hash Cond: (l.id = os_ls.l_id)
                        ->  Hash Left Join  (cost=23280.25..43499.25 rows=157348 width=224)
                              Hash Cond: (t_st_3.s_id = s_3.id)
                              ->  Hash Left Join  (cost=23277.34..41332.81 rows=157348 width=200)
                                    Hash Cond: (l.t_id = t_1.id)
                                    ->  Hash Right Join  (cost=6125.33..16612.78 rows=157348 width=52)
                                          Hash Cond: (t_st_3.id = l.st_or_id)
                                          ->  Seq Scan on t_st t_st_3  (cost=0.00..3568.25 rows=197225 width=28)
                                          ->  Hash  (cost=3236.48..3236.48 rows=157348 width=24)
                                                ->  Seq Scan on l  (cost=0.00..3236.48 rows=157348 width=24)
                                    ->  Hash  (cost=15955.01..15955.01 rows=35200 width=148)
                                          ->  Hash Left Join  (cost=8354.92..15955.01 rows=35200 width=148)
                                                Hash Cond: (t_st_2.s_id = s_2.id)
                                                ->  Hash Left Join  (cost=8352.01..15468.10 rows=35200 width=124)
                                                      Hash Cond: (t_st_1.s_id = s_1.id)
                                                      ->  Hash Right Join  (cost=8349.09..14981.19 rows=35200 width=100)
                                                            Hash Cond: (t_st_2.id = t_1.st_d_id)
                                                            ->  Seq Scan on t_st t_st_2  (cost=0.00..3568.25 rows=197225 width=28)
                                                            ->  Hash  (cost=7909.09..7909.09 rows=35200 width=72)
                                                                  ->  Hash Right Join  (cost=1277.00..7909.09 rows=35200 width=72)
                                                                        Hash Cond: (t_st_1.id = t_1.st_or_id)
                                                                        ->  Seq Scan on t_st t_st_1  (cost=0.00..3568.25 rows=197225 width=28)
                                                                        ->  Hash  (cost=837.00..837.00 rows=35200 width=44)
                                                                              ->  Seq Scan on t t_1  (cost=0.00..837.00 rows=35200 width=44)
                                                      ->  Hash  (cost=1.85..1.85 rows=85 width=24)
                                                            ->  Seq Scan on s s_1  (cost=0.00..1.85 rows=85 width=24)
                                                ->  Hash  (cost=1.85..1.85 rows=85 width=24)
                                                      ->  Seq Scan on s s_2  (cost=0.00..1.85 rows=85 width=24)
                              ->  Hash  (cost=1.85..1.85 rows=85 width=24)
                                    ->  Seq Scan on s s_3  (cost=0.00..1.85 rows=85 width=24)
                        ->  Hash  (cost=6569.18..6569.18 rows=4 width=4)
                              ->  Seq Scan on os_ls  (cost=0.00..6569.18 rows=4 width=4)
                                    Filter: (146356 = o_id)
                  ->  Index Scan using t_st_pkey on t_st t_st_4  (cost=0.42..0.49 rows=1 width=28)
                        Index Cond: (id = l.st_d_id)
            ->  Index Scan using s_pkey on s s_4  (cost=0.14..0.16 rows=1 width=24)
                  Index Cond: (id = t_st_4.s_id)
    

    The version with JOIN:

    SELECT *
    FROM l 
      JOIN os_ls ON l.id = os_ls.l_id
      LEFT OUTER JOIN t AS t_1 ON t_1.id = l.t_id
      LEFT OUTER JOIN t_st AS t_st_1 ON t_st_1.id = t_1.st_or_id
      LEFT OUTER JOIN s AS s_1 ON s_1.id = t_st_1.s_id
      LEFT OUTER JOIN t_st AS t_st_2 ON t_st_2.id = t_1.st_d_id
      LEFT OUTER JOIN s AS s_2 ON s_2.id = t_st_2.s_id
      LEFT OUTER JOIN t_st AS t_st_3 ON t_st_3.id = l.st_or_id
      LEFT OUTER JOIN s AS s_3 ON s_3.id = t_st_3.s_id LEFT OUTER JOIN t_st AS t_st_4 ON t_st_4.id
     = l.st_d_id LEFT OUTER JOIN s AS s_4 ON s_4.id = t_st_4.s_id
     WHERE 146356 = os_ls.o_id ORDER BY l."order" ASC;
    

    explain:

    QUERY PLAN
    Sort  (cost=6642.69..6642.70 rows=4 width=276)
      Sort Key: l."order"
      ->  Nested Loop Left Join  (cost=2.39..6642.65 rows=4 width=276)
            Join Filter: (s_4.id = t_st_4.s_id)
            ->  Nested Loop Left Join  (cost=2.39..6635.49 rows=4 width=252)
                  ->  Nested Loop Left Join  (cost=1.97..6633.47 rows=4 width=224)
                        Join Filter: (s_3.id = t_st_3.s_id)
                        ->  Nested Loop Left Join  (cost=1.97..6626.31 rows=4 width=200)
                              ->  Nested Loop Left Join  (cost=1.55..6624.29 rows=4 width=172)
                                    Join Filter: (s_2.id = t_st_2.s_id)
                                    ->  Nested Loop Left Join  (cost=1.55..6617.12 rows=4 width=148)
                                          ->  Nested Loop Left Join  (cost=1.13..6614.31 rows=4 width=120)
                                                Join Filter: (s_1.id = t_st_1.s_id)
                                                ->  Nested Loop Left Join  (cost=1.13..6607.15 rows=4 width=96)
                                                      ->  Nested Loop Left Join  (cost=0.71..6604.33 rows=4 width=68)
                                                            ->  Nested Loop  (cost=0.42..6602.98 rows=4 width=24)
                                                                  ->  Seq Scan on os_ls  (cost=0.00..6569.18 rows=4 width=4)
                                                                        Filter: (146356 = o_id)
                                                                  ->  Index Scan using l_pkey on l  (cost=0.42..8.44 rows=1 width=24)
                                                                        Index Cond: (id = os_ls.l_id)
                                                            ->  Index Scan using t_pkey on t t_1  (cost=0.29..0.33 rows=1 width=44)
                                                                  Index Cond: (id = l.t_id)
                                                      ->  Index Scan using t_st_pkey on t_st t_st_1  (cost=0.42..0.69 rows=1 width=28)
                                                            Index Cond: (id = t_1.st_or_id)
                                                ->  Materialize  (cost=0.00..2.27 rows=85 width=24)
                                                      ->  Seq Scan on s s_1  (cost=0.00..1.85 rows=85 width=24)
                                          ->  Index Scan using t_st_pkey on t_st t_st_2  (cost=0.42..0.69 rows=1 width=28)
                                                Index Cond: (id = t_1.st_d_id)
                                    ->  Materialize  (cost=0.00..2.27 rows=85 width=24)
                                          ->  Seq Scan on s s_2  (cost=0.00..1.85 rows=85 width=24)
                              ->  Index Scan using t_st_pkey on t_st t_st_3  (cost=0.42..0.49 rows=1 width=28)
                                    Index Cond: (id = l.st_or_id)
                        ->  Materialize  (cost=0.00..2.27 rows=85 width=24)
                              ->  Seq Scan on s s_3  (cost=0.00..1.85 rows=85 width=24)
                  ->  Index Scan using t_st_pkey on t_st t_st_4  (cost=0.42..0.49 rows=1 width=28)
                        Index Cond: (id = l.st_d_id)
            ->  Materialize  (cost=0.00..2.27 rows=85 width=24)
                  ->  Seq Scan on s s_4  (cost=0.00..1.85 rows=85 width=24)
    
  9. Mehdi Gmira reporter

    Ok I think I know what causes this. When the number of joins is higher than join_collapse_limit, postgres does not rearrange the joins. That why the syntax FROM X, Y does not work well in this case. Maybe this should come as a warning with the lazyloading function ?

  10. Mike Bayer repo owner

    that's an insane query and I can only imagine that the lazyload is against an entity that then points to a string of TEN "lazy='joined'" hardcoded relationships - I'd avoid using fixed lazy="joined" all over the place like that (I use it like, nearly never, and especially not with an OUTER join).

  11. Mehdi Gmira reporter

    I'm not using lazy="joined" all over the place. Just for very specific tables where it makes sense for the application. The entity does not have 10 lazy joins, it has 3 who have their own dependencies. And the left outer joins are generated by sqlalchemy, not me. The query is not insane at all and is very fast (within 50ms) when written with an INNER JOIN. Anyways, I think we said all that could be said on the subject.

  12. Mike Bayer repo owner

    those LEFT OUTER JOINs are almost definitely generated by the lazy="joined" strategy, and looking more closely it appears to be a cycle of self-referential joins with a "secondary" table in the middle. So this may be just one relationship, however it would likely have a setting of join_depth=4 or so, which also is a bad idea for a fixed setting especially when there's an association table involved (because you then get twice the joins).

    And the left outer joins are generated by sqlalchemy, not me.

    SQLAlchemy never generates a LEFT OUTER JOIN without being instructed to in some way, either via lazy="joined' or via with_polymorphic.

  13. Mehdi Gmira reporter

    Yes, as I said lazy="joined" is used (which explains the left outer join). And no there are definitely no cycles of self-referential joins and we do not use the join_depth settings. I'm not saying that SqlAchemy souldn't have generated all those joins. It is the right thing to do, since our models are configured that way. I'm just saying that postgresql has a join_collapse paramater which explain why the query generated by sqlalchemy was stuck.

  14. Log in to comment