Recursive lock attempt in orm/identity.py
Our system started locking up lately after switch to SQLAlchemy 0.6.6 from 0.5.8. I thought it was the deadlock (#2087) fixed in 0.6.7 but upgrade to 0.6.7 didn't help. What happens appears to be WeakInstanceDict.remove() in identity.py causing _cleanup()
weak ref callback to be invoked in orm/state.py, which in turn causes the recursive call to WeakInstanceDict.remove() and
recursive mutex acquisition attempt.
File: "/opt/magrathea/frogstar/db.py", line 191, in __call__
Session.commit()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/scoping.py", line 139, in do
return getattr(self.registry(), name)(*args, **kwargs)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 614, in commit
self.transaction.commit()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 395, in commit
self._remove_snapshot()
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/session.py", line 316, in _remove_snapshot
instance_dict=self.session.identity_map)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 268, in expire_attributes
dict_.pop(key, None)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 73, in _cleanup
instance_dict.remove(self)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 142, in remove
if dict.pop(self, state.key) is not state:
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/state.py", line 73, in _cleanup
instance_dict.remove(self)
File: "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 140, in remove
self._remove_mutex.acquire()
Comments (11)
-
repo owner -
Account Deleted Thanks, I've applied the patch. The deadlock was happening once every 3-4 days, so I guess it'll take some time to see whether any side effects will show up too. I think if everything is OK by the end of the week, we may consider the problem fixed. I'll let you know, and in case I forget, here's my contact info: Ivan Shvedunov <ivan4th at gmail dot com>
-
Account Deleted Well, got the following (just once so far), not sure whether it's related to the patch, but anyway:
File "/opt/magrathea/grading/study.py", line 195, in accept assert self.version or isinstance(event, CourseVersionActivationEvent) File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/attributes.py", line 168, in __get__ instance_dict(instance)) File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/attributes.py", line 388, in get value = callable_(passive=passive) File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/strategies.py", line 645, in __call__ return q._get(key, ident, passive=passive) File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/query.py", line 1896, in _get instance = self.session.identity_map.get(key) File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.6.7-py2.5.egg/sqlalchemy/orm/identity.py", line 143, in get state = dict.__getitem__(self, key) KeyError: (<class 'zarquon.docstorage.Version'>, (UUID('a8038201-d2ba-560c-ad48-083436b8bb62'),))
-
repo owner i need to examine if this is because i dropped in "identity" without adjusting the calling logic, or this is a bug that's also in 0.7. you guys aren't using a single Session in multiple threads, right ? everything local to a single thread right ?
-
repo owner yeah that was a bug in 0.7 too. the patch is updated please let me know how this one goes.
-
Account Deleted Thanks, applied the updated patch. Will see how it works. And no, we aren't using single Session in multiple threads. Our web app follows the session-per-request pattern.
-
repo owner OK if this works better for you I'll probably apply this to 0.6.
-
Account Deleted Seems like the fix worked, I don't see any similar lockups anymore. Thanks.
-
repo owner and no "dictionary changed" errors, great, we can put this in.
-
repo owner - changed status to resolved
-
repo owner - removed milestone
Removing milestone: 0.6.8 (automated comment)
- Log in to comment
Well
#2087wasn't "fixed" so spectacularly in 0.6 since we can't reproduce it - 0.7 has a more aggressive fix, which is there because we are trying to hedge between the deadlock and the original issue which is#1891. The more aggressive fix is that the mutex is removed. We wanted people to run on 0.7 to see that they're not getting#1891again, which is "dictionary changed size during iteration". We've done a bunch of research on the general problem of#1891and there's not really a canonical "solution".The trace you have above is in fact not exactly the same thing as
#2087- its still the deadlock, but this appears like twoInstanceState
objects at play, first one gcs and results in a gc of a second and then the deadlock.anyway, what both of those bugs have in common is the reporters can't reproduce it for us, since it has a lot to do with using a particular amount of memory under particular situations, and then when we say "OK does this work?" they either have already worked around it some other way or they just vanish - hence we keep putting in bandaids that are just guesses.
So here, i'll attach a patch that copies 0.7's identity module pretty much exactly, and when i pop it into 0.6 all tests pass. Or you could run 0.7. But I'd need to know that you're not getting any "dictionary changed size" issues when the mutex goes away.
I've run 0.6 and 0.7 in production, 0.6 under pretty heavy load for over a year and never saw this issue. So let me know what you get with this. Not sure if i want to make a change in 0.6 again since 0.7 is where the newer dev is but will think about it if this is truly helpful.