Recursive lock attempt in orm/identity.py

Issue #2148 resolved

Former user created an issue 2011-04-24

Our system started locking up lately after switch to SQLAlchemy 0.6.6 from 0.5.8. I thought it was the deadlock (~~#2087~~) fixed in 0.6.7 but upgrade to 0.6.7 didn't help. What happens appears to be WeakInstanceDict.remove() in identity.py causing _cleanup() weak ref callback to be invoked in orm/state.py, which in turn causes the recursive call to WeakInstanceDict.remove() and recursive mutex acquisition attempt.

File: "/opt/magrathea/frogstar/db.py", line 191, in __call__
  Session.commit()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/scoping.py", line 139, in do
  return getattr(self.registry(), name)(*args, **kwargs)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 614, in commit
  self.transaction.commit()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 395, in commit
  self._remove_snapshot()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 316, in _remove_snapshot
  instance_dict=self.session.identity_map)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 268, in expire_attributes
  dict_.pop(key, None)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 73, in _cleanup
  instance_dict.remove(self)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 142, in remove
  if dict.pop(self, state.key) is not state:
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 73, in _cleanup
  instance_dict.remove(self)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 140, in remove
  self._remove_mutex.acquire()

2148.patch

Comments (11)

Mike Bayer repo owner
- changed component to orm
- changed milestone to 0.6.8
- assigned issue to
Well ~~#2087~~ wasn't "fixed" so spectacularly in 0.6 since we can't reproduce it - 0.7 has a more aggressive fix, which is there because we are trying to hedge between the deadlock and the original issue which is ~~#1891~~. The more aggressive fix is that the mutex is removed. We wanted people to run on 0.7 to see that they're not getting ~~#1891~~ again, which is "dictionary changed size during iteration". We've done a bunch of research on the general problem of ~~#1891~~ and there's not really a canonical "solution".

The trace you have above is in fact not exactly the same thing as ~~#2087~~ - its still the deadlock, but this appears like two InstanceState objects at play, first one gcs and results in a gc of a second and then the deadlock.

anyway, what both of those bugs have in common is the reporters can't reproduce it for us, since it has a lot to do with using a particular amount of memory under particular situations, and then when we say "OK does this work?" they either have already worked around it some other way or they just vanish - hence we keep putting in bandaids that are just guesses.

So here, i'll attach a patch that copies 0.7's identity module pretty much exactly, and when i pop it into 0.6 all tests pass. Or you could run 0.7. But I'd need to know that you're not getting any "dictionary changed size" issues when the mutex goes away.

I've run 0.6 and 0.7 in production, 0.6 under pretty heavy load for over a year and never saw this issue. So let me know what you get with this. Not sure if i want to make a change in 0.6 again since 0.7 is where the newer dev is but will think about it if this is truly helpful.
- 2011-04-24T21:12:45+00:00
Former user Account Deleted
Thanks, I've applied the patch. The deadlock was happening once every 3-4 days, so I guess it'll take some time to see whether any side effects will show up too. I think if everything is OK by the end of the week, we may consider the problem fixed. I'll let you know, and in case I forget, here's my contact info: Ivan Shvedunov <ivan4th at gmail dot com>
- 2011-04-25T07:14:48+00:00

Former user Account Deleted

Well, got the following (just once so far), not sure whether it's related to the patch, but anyway:

  File "/opt/magrathea/grading/study.py", line 195, in accept
    assert self.version or isinstance(event, CourseVersionActivationEvent)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/attributes.py", line 168, in __get__
    instance_dict(instance))
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/attributes.py", line 388, in get
    value = callable_(passive=passive)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/strategies.py", line 645, in __call__
    return q._get(key, ident, passive=passive)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/query.py", line 1896, in _get
    instance = self.session.identity_map.get(key)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 143, in get
    state = dict.__getitem__(self, key)
KeyError: (<class 'zarquon.docstorage.Version'>, (UUID('a8038201-d2ba-560c-ad48-083436b8bb62'),))

2011-04-25T07:32:19+00:00

Mike Bayer repo owner
i need to examine if this is because i dropped in "identity" without adjusting the calling logic, or this is a bug that's also in 0.7. you guys aren't using a single Session in multiple threads, right ? everything local to a single thread right ?
- 2011-04-25T10:10:03+00:00
Mike Bayer repo owner
yeah that was a bug in 0.7 too. the patch is updated please let me know how this one goes.
- 2011-04-25T11:37:24+00:00
Former user Account Deleted
Thanks, applied the updated patch. Will see how it works. And no, we aren't using single Session in multiple threads. Our web app follows the session-per-request pattern.
- 2011-04-25T14:35:45+00:00
Mike Bayer repo owner
OK if this works better for you I'll probably apply this to 0.6.
- 2011-04-25T14:51:14+00:00
Former user Account Deleted
Seems like the fix worked, I don't see any similar lockups anymore. Thanks.
- 2011-05-05T08:00:09+00:00
Mike Bayer repo owner
and no "dictionary changed" errors, great, we can put this in.
- 2011-05-05T09:53:14+00:00
Mike Bayer repo owner
- changed status to resolved
381da62d31235fef603ee070970a0f7decfa9b25
- 2011-05-07T13:19:22+00:00
Mike Bayer repo owner
- removed milestone
Removing milestone: 0.6.8 (automated comment)
- 2014-02-23T18:14:41+00:00
Log in to comment

Assignee: Mike Bayer

Type: bug

Priority: critical

Status: resolved

Component: orm

Milestone: –

Votes: 0

Watchers: 0