Chained joinedload causes duplicate sqla objects, when run with pypy

Issue #4068 resolved
Ashley Chaloner
created an issue

This bug appears when running in pypy but not cpython. It appears when using sqlite or postgres backends (these are the only two I've tested with). The bug does not cause a stack trace, but results in duplicate database rows being returned in non-identical sqlalchemy objects.

Versions

  • pypy: Python 2.7.13 (5.8.0+dfsg-2~ppa2~ubuntu16.04, Jun 17 2017, 18:50:19) [PyPy 5.8.0 with GCC 5.4.0 20160609]
  • cffi==1.10.1
  • greenlet==0.4.12
  • readline==6.2.4.1
  • six==1.10.0
  • SQLAlchemy==1.1.13
  • (if using UUIDType: SQLAlchemy-Utils==0.32.16)

bash script for venv setup

virtualenv -p /usr/bin/pypy ~/test-venv
source ~/test-venv/bin/activate
pip install \
    cffi==1.10.1 \
    greenlet==0.4.12 \
    readline==6.2.4.1 \
    six==1.10.0 \
    SQLAlchemy==1.1.13

# if using UUIDType:
pip install SQLAlchemy-Utils==0.32.16

Python script to reproduce bug

from sqlalchemy import (
    Column,
    ForeignKey,
    Index,
    Table,
    create_engine
)
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import joinedload, relationship, sessionmaker
from sqlalchemy.types import Integer

Base = declarative_base()


class Base2(Base):

    __abstract__ = True
    attrlist = ["id"]

    # for easier debugging
    def __repr__(self):
        attrs = ["%s:%s" % (a, getattr(self, a)) for a in self.attrlist]
        return "<%s.%s object at 0x%016x: %s>" % (
            self.__class__.__module__, self.__class__.__name__,
            id(self), ", ".join(attrs))


class Scene(Base2):

    __tablename__ = "scene"

    id = Column(Integer, primary_key=True)


class ActScenes(Base2):

    __tablename__ = "act_scenes"
    attrlist = ["id", "act_id", "scene_id"]

    id = Column(Integer, primary_key=True)

    act_id = Column(
        Integer(),
        ForeignKey("act.id"),
        nullable=False)

    scene_id = Column(
        Integer(),
        ForeignKey("scene.id"),
        nullable=False)
    scene = relationship(Scene, lazy="joined")


class Act(Base2):

    __tablename__ = "act"

    id = Column(Integer, primary_key=True)

    scenes = relationship(ActScenes)


acts = Table(
    "acts", Base.metadata,
    Column("play_id", Integer(), ForeignKey("play.id")),
    Column("act_id", Integer(), ForeignKey("act.id"))
)
Index("ix_acts", acts.c.play_id, acts.c.act_id, unique=True)


class Play(Base2):

    __tablename__ = "play"

    id = Column(Integer, primary_key=True)

    acts = relationship("Act", secondary=acts)


def run_test(i):
    """Run one test, and return True if something failed."""

    engine = create_engine('sqlite:///:memory:')
    asessionmaker = sessionmaker()
    asessionmaker.configure(bind=engine)
    Base.metadata.create_all(engine)
    session = asessionmaker()

    play1 = Play()
    act = Act()
    act.scenes = [
        ActScenes(
            act_id=act.id,
            scene=Scene()
        )
        for _ in xrange(1000)
    ]
    play1.acts.append(act)

    scenecounts1 = len(play1.acts[0].scenes)

    session.add(play1)
    session.commit()

    # Comment out this block and watch the bug vanish.
    session.query(Play).options(
        joinedload(Play.acts).joinedload(Act.scenes)
    ).filter_by(id=play1.id).first()
    # End block.

    scenecounts1_again = len(play1.acts[0].scenes)
    if scenecounts1 != scenecounts1_again:
        print "Iteration {} failed".format(i)

        seen_scene_ids = set()
        for scene in act.scenes:
            if scene.id in seen_scene_ids:
                # import pdb; pdb.set_trace()
                print "Duplicate scene.id spotted: {}".format(scene.id)
                scenes_with_id = [s for s in act.scenes if s.id == scene.id]
                print "Scenes with this id:\n{}".format(
                    "\n".join(repr(s) for s in scenes_with_id))
            else:
                seen_scene_ids.add(scene.id)

    return scenecounts1 != scenecounts1_again


if __name__ == "__main__":
    N = 25
    results = [run_test(i) for i in xrange(N)]
    print "{} out of {} failed.".format(sum(results), N)

Sample output

(test-venv) user@host:~$ pypy test_chained_joinedload.py
Iteration 0 failed
Duplicate scene.id spotted: 185
Scenes with this id:
<__main__.ActScenes object at 0x00007f77de383948: id:185, act_id:1, scene_id:185>
<__main__.ActScenes object at 0x00007f77de9c4790: id:185, act_id:1, scene_id:185>
1 out of 25 failed.

Comments (17)

  1. Michael Bayer repo owner

    how are you testing against Postgresql? Are you adding new objects to the existing database on each run? I notice that you are querying for session.query(Play).first() without any order_by() which means it may not select the same Play each time. It still should not produce the behavior you are seeing, and on the SQLite backend should also not be a problem as you are creating a new DB each time, however it would be better if on the PG backend as well, the DB started clean on each run.

  2. Ashley Chaloner reporter

    how are you testing against Postgresql?

    Our application with the persistent db is fine, it's just one py.test test (which I've reduced to the above minimal script) that is flaky. For running tests with either sqlite or postgres, py.test runs a custom setup_module fn which picks up an environmental variable indicating which db backend to use. The Flask app gets configured with either an in-memory sqlite or connection details for an in-memory Postgres docker container (faster and definitely empty at the start of every run).

    Are you adding new objects to the existing database on each run?

    Yes, py.test starts with an empty database every time.

    I notice that you are querying for session.query(Play).first() without any order_by() which means it may not select the same Play each time. It still should not produce the behavior you are seeing, and on the SQLite backend should also not be a problem as you are creating a new DB each time, however it would be better if on the PG backend as well, the DB started clean on each run.

    I'm only adding one Play to the database, and also filtering by the primary key (.filter_by(id=play1.id)) so I think it should be ok.

    I'm also not storing/using the result of the call to session.query. Should this call affect the contents of play1?

  3. Michael Bayer repo owner

    all issues specific to pypy have to do with the fact that it does garbage collection differently than Python and exposes odd race conditions. in this case the identity map datastructure is confirmed as failing under pypy however I've yet to locate exactly where.

  4. Michael Bayer repo owner

    it's specific to the fact that there are objects in the session when you call session.commit(), these objects are "expired" and available for garbage collection, however they are still hanging around after the operation; they get in the way during the subsequent load operation. the eager loads expose lots of deduplication work to be done which possibly makes it easier to hit the issue.

  5. Michael Bayer repo owner

    OK the issue is the state objects being garbage collected are calling upon identity_map._fast_discard which is removing the wrong state object. the use of _fast_discard started in 3209a73b92e17dd351a50c41352791baeefcd846, however, the issue existed prior to that as well, it's just that the previous version used the normal discard() method which had a quick assertion in it so at least you'd see in the logs that there were a problem.

    this is a non-subtle bug in fact and that this has never been reported strongly suggests that pypy is not widely used with SQLAlchemy. this kind of issue will happen lots against code that has been working this way for over three years.

  6. Michael Bayer repo owner

    Always check that discarded state is the expected one

    Fixed race condition in ORM identity map which would cause objects to be inappropriately removed during a load operation, causing duplicate object identities to occur, particularly under joined eager loading which involves deduplication of objects. The issue is specific to garbage collection of weak references and is observed only under the Pypy interpreter.

    Change-Id: I9f6ae3fe5b078f26146af82b15d16f3a549a9032 Fixes: #4068

    → <<cset 37c5ed2ed948>>

  7. Michael Bayer repo owner

    Always check that discarded state is the expected one

    Fixed race condition in ORM identity map which would cause objects to be inappropriately removed during a load operation, causing duplicate object identities to occur, particularly under joined eager loading which involves deduplication of objects. The issue is specific to garbage collection of weak references and is observed only under the Pypy interpreter.

    Change-Id: I9f6ae3fe5b078f26146af82b15d16f3a549a9032 Fixes: #4068 (cherry picked from commit 37c5ed2ed948250d7244d1e6e17da37a27c24698)

    → <<cset a3965ea5ff19>>

  8. Log in to comment