Recursive lock attempt in orm/identity.py

Issue #2148 resolved
Former user created an issue

Our system started locking up lately after switch to SQLAlchemy 0.6.6 from 0.5.8. I thought it was the deadlock (#2087) fixed in 0.6.7 but upgrade to 0.6.7 didn't help. What happens appears to be WeakInstanceDict.remove() in identity.py causing _cleanup() weak ref callback to be invoked in orm/state.py, which in turn causes the recursive call to WeakInstanceDict.remove() and recursive mutex acquisition attempt.

File: "/opt/magrathea/frogstar/db.py", line 191, in __call__
  Session.commit()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/scoping.py", line 139, in do
  return getattr(self.registry(), name)(*args, **kwargs)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 614, in commit
  self.transaction.commit()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 395, in commit
  self._remove_snapshot()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 316, in _remove_snapshot
  instance_dict=self.session.identity_map)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 268, in expire_attributes
  dict_.pop(key, None)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 73, in _cleanup
  instance_dict.remove(self)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 142, in remove
  if dict.pop(self, state.key) is not state:
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 73, in _cleanup
  instance_dict.remove(self)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 140, in remove
  self._remove_mutex.acquire()

Comments (11)

  1. Mike Bayer repo owner
    • changed component to orm
    • changed milestone to 0.6.8
    • assigned issue to

    Well #2087 wasn't "fixed" so spectacularly in 0.6 since we can't reproduce it - 0.7 has a more aggressive fix, which is there because we are trying to hedge between the deadlock and the original issue which is #1891. The more aggressive fix is that the mutex is removed. We wanted people to run on 0.7 to see that they're not getting #1891 again, which is "dictionary changed size during iteration". We've done a bunch of research on the general problem of #1891 and there's not really a canonical "solution".

    The trace you have above is in fact not exactly the same thing as #2087 - its still the deadlock, but this appears like two InstanceState objects at play, first one gcs and results in a gc of a second and then the deadlock.

    anyway, what both of those bugs have in common is the reporters can't reproduce it for us, since it has a lot to do with using a particular amount of memory under particular situations, and then when we say "OK does this work?" they either have already worked around it some other way or they just vanish - hence we keep putting in bandaids that are just guesses.

    So here, i'll attach a patch that copies 0.7's identity module pretty much exactly, and when i pop it into 0.6 all tests pass. Or you could run 0.7. But I'd need to know that you're not getting any "dictionary changed size" issues when the mutex goes away.

    I've run 0.6 and 0.7 in production, 0.6 under pretty heavy load for over a year and never saw this issue. So let me know what you get with this. Not sure if i want to make a change in 0.6 again since 0.7 is where the newer dev is but will think about it if this is truly helpful.

  2. Former user Account Deleted

    Thanks, I've applied the patch. The deadlock was happening once every 3-4 days, so I guess it'll take some time to see whether any side effects will show up too. I think if everything is OK by the end of the week, we may consider the problem fixed. I'll let you know, and in case I forget, here's my contact info: Ivan Shvedunov <ivan4th at gmail dot com>

  3. Former user Account Deleted

    Well, got the following (just once so far), not sure whether it's related to the patch, but anyway:

      File "/opt/magrathea/grading/study.py", line 195, in accept
        assert self.version or isinstance(event, CourseVersionActivationEvent)
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/attributes.py", line 168, in __get__
        instance_dict(instance))
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/attributes.py", line 388, in get
        value = callable_(passive=passive)
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/strategies.py", line 645, in __call__
        return q._get(key, ident, passive=passive)
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/query.py", line 1896, in _get
        instance = self.session.identity_map.get(key)
      File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 143, in get
        state = dict.__getitem__(self, key)
    KeyError: (<class 'zarquon.docstorage.Version'>, (UUID('a8038201-d2ba-560c-ad48-083436b8bb62'),))
    
  4. Mike Bayer repo owner

    i need to examine if this is because i dropped in "identity" without adjusting the calling logic, or this is a bug that's also in 0.7. you guys aren't using a single Session in multiple threads, right ? everything local to a single thread right ?

  5. Mike Bayer repo owner

    yeah that was a bug in 0.7 too. the patch is updated please let me know how this one goes.

  6. Former user Account Deleted

    Thanks, applied the updated patch. Will see how it works. And no, we aren't using single Session in multiple threads. Our web app follows the session-per-request pattern.

  7. Log in to comment