handling redis lock timeouts

Issue #77 new
jvanasco
created an issue

I've run into a situation with the distributed_lock in the redis backend (ie, locks that redis manages).

I realized through a few Keys which are backed by create functions that take too long or fail, that a lock_timeout basically must be set – otherwise you can jam your application by setting an eternal lock for a key that is never set.

So i set the lock_timeout, and that solves one problem, but creates another – if the process ends up completing after the lock_timeout, a LockError is raised in the redis package's lock.py Lock.do_release method.

raise LockError("Cannot release a lock that's no longer owned")

For context, this is happening in a webapp and there can be a variable time on the lock_timeout as it can be caused due to a slow-connection, dropped connection, process killed by a long-process monitor or something else. Since this is only being used for caching, I opted to expire the redis lock using the shortest possible time.

A hopeful fix would have been to configure the dogpile redis backend to ignore this particular lock error. However, looking at the traceback and source, this looks to be really tricky.

dogpile/cache/region.py", line 651, in `get_or_create`
dogpile/core/dogpile.py", line 158, in `__enter__`
dogpile/core/dogpile.py", line 98, in `_enter`
dogpile/core/dogpile.py", line 153, in `_enter_create`
redis/lock.py", line 135, in `release` (raised on line 264)

In order to keep the value AND suppress the timeout, the config data in dogpile.cache would need to be used in dogpile.core. that is a lot of updating.

another approach might be catching this particular lock error in region, and then immediately getting the value again (which should not have an error). that too looks tricky though.

does anyone have suggestions for an elegant fix?

Comments (14)

  1. Michael Bayer repo owner

    well to get into the release() you'd wrap the RedisLock with some kind of wrapper. Any chance this should really be an upstream option in the redis package anyway?

  2. jvanasco reporter

    I'm asking them to consider a config option for suppressing LockRelease errors, if they're open then it should be easier to pass in that value via dogpile.cache.

    My thinking is that while it makes more sense to handle this within dogpile.core or other caching libraries... looking at the 'plugin architecture' of most python caching libraries, it will be a pain to implement a wrapper across the board.

  3. jvanasco reporter

    Coming back to this...

    I had filed a ticket upstream in redis.py https://github.com/andymccurdy/redis-py/issues/623

    The more I think on this, the more I think dogpile should be patched, as this error can happen on anything with a distributed lock. While this is only realized in redis now, there's the potential for Riak, Mongo and other distributed stores that people experiment on.

    I can generate a patch where the redis backend uses a wrapped class to handle this.

    Would it be a good idea to document a dogpile.cache.api object on this?

    Another approach (which I'm not in favor of, but just throwing out) would be extending the region to support a list of acceptable lock release error... and then patching core.dogpile and cache.region to reference them. that's messier though.

  4. jvanasco reporter

    Before passing judgement on it, let's call the a "lock release error" in a distributed environment a "situation" for a moment...

    This "situation" can happen from 2 events:

    1. The cloud DB is `flushed` or `reset` and no lock keys exist
    2. The lock `timed out`, because the generation function took too long
    

    The first event is unavoidable, and leads do massive amounts of lock release errors.

    The second event is awkward and hard to avoid. If you set arbitrarily long timeouts or none at all, you can jam workers indefinitely. A workable solution takes finding the right values and regions for timeouts, and balancing out bottlenecks.

    In both of these causing events, the computed data has already been generated and sent to the cloud store -- so the tool that requested the data could service it's request. Keep in mind, dogpile doesn't check if there is a lock before "creating" the data (doing so would be inefficient).

    So a new question: what sort of "situation" is this:

    1. Do we want to know about this situation ? (logging, alert callback?)
    2. Do we want to handle this situation in a specific way?  (raise an exception, alert callback?)
    

    If we raise an exception, we'd have to do quite a bit of work to handle this gracefully. Accessing the data would require a second call to the cache, which means the existing caching @decorators can't be used reliably; dealing with direct region get/get_create/multis is not easily handled either.

    The chances of encountering these release errors on a distributed store is pretty high. In practice, one just wouldn't care aside from seeing it in a log.

    The current design ensures we raise an exception (which is "informational" at best) in this situation, and makes it very awkward to catch that exception. While the creation function did generate usable data (and save it), it basically disappears.

    So the "situation" is either a common error that someone who hasn't encountered it yet should lean about, or an annoying inherency of the datastore that people eventually despise for forcing fatal exceptions that could have resulted in successful responses.

  5. Michael Bayer repo owner

    it sounds like you're just looking to have any number of processes just jump on the creation of the object if either of these events happens, e.g. the dogpile lock would just fail because it vanishes. Which seems a little weird, if a creator is taking too long, you want new processes to keep starting up new creations every N seconds, now it's even slower ?

  6. jvanasco reporter

    Yes and No.

    I'm not concerned with subsequent processes handling additional creations, because that will always happen regardless of the Exception bubbling up.

    I am concerned with the first process not raising an Exception after it has successfully generated (and set) a new value. The most important thing to me -- at least in the context of the @decorators -- is to allow this to "fail gracefully" in a production environment.

    Assume we're in a production environment and a distributed cache becomes invalidated. Every client that was midway through a get_or_create routine will raise an exception when it tries to release a no-longer-existent lock. All of these requests could be handled seamlessly, but the caching routine prioritizes an Exception over Information & User Experience. By raising an Exception, this situation essentially becomes fatal (it's damn near impossible to handle a caching decorator) -- but there is no actual value-add to the situation by bubbling this up over simply logging the event.

    Assuming we have a lock timeout, again there is no value-add to the situation over simply logging this error. Someone might want to raise exceptions in development to discover these situations, but for production and staging use there is no tangible benefit to raising an error. Dogpile has the target data, and we lose it with the raised Exception; properly handling this would require catching the error and requesting the data again from dogpile.

    In the worst-case scenario of constant load over a particular key, the lack of a lock means subsequent processes are instantiated every N seconds -- but only until the first value is created. This would make things potentially slower, but in a production environment this would happen whether or not the lock error was suppressed. If anything, this potentially avoids another error because someone would not have to request the same resource again -- potentially triggering another lock+creation cycle.

    In an ideal situation we would re-acquire a new lock if we're still "creating" and the lock disappeared -- but that would be complex outside of something like twisted or gevent. In a more ideal situation, we wouldn't be losing locks or have backends that are prone to losing locks -- but life isn't perfect.

  7. Michael Bayer repo owner

    Assume we're in a production environment and a distributed cache becomes invalidated. Every client that was midway through a get_or_create routine will raise an exception when it tries to release a no-longer-existent lock.

    IMO as it should, the cache is down totally, if the lock release exception is silently squashed then the "put" would fail too. if the cache is back, if i were engineering this to really handle this (which obviously right now, we don't), I'd want a retry system in place that will take that generated value and try the whole thing again - keeping in mind that another generator may have put a value in there already. because if the lock is gone, you have no idea what happened. This aspect may very well benefit from new logic inside the region code, but I haven't looked into this.

    All of these requests could be handled seamlessly, but the caching routine prioritizes an Exception over Information & User Experience.

    well you'd like it to gracefully handle that the locking system went down totally. If you want to squash the exception, you can use your custom lock thing on your end with a hook on ours, so the discussion here is only for informational purposes. But I'd look into more robust ways of handling that if I had to deal with that problem.

    In the worst-case scenario of constant load over a particular key, the lack of a lock means subsequent processes are instantiated every N seconds -- but only until the first value is created. This would make things potentially slower, but in a production environment this would happen whether or not the lock error was suppressed.

    right, you're willing to tolerate a dogpile situation temporarily.

    The thing is, getting this lock to work was so fucking hard, I've spent years on this process from back when I did it in myghty and beaker and then trying to generalize it on dogpile, also running beaker in a a few production systems for a couple of years and watching all the weird things that happened over there, when presented with the idea of, "what if we just make it that if the lock itself just vanishes, who cares?" , I have no idea what effects that will cause. When i dealt with this on a very loaded production system, basically restarting the cache was a recipe for doom. I hope to have the task of building up a redis cache environment someday so I can come up with a solution to this problem.

  8. jvanasco reporter

    IMO as it should, the cache is down totally, if the lock release exception is silently squashed then the "put" would fail too.

    Bubbling up an exception on the PUT makes sense - the action failed and the system is currently down.

    IIRC, the logic is currently:

    LOCK
    CREATE  # potentially takes a while
    PUT
    RELEASE
    

    On the RELEASE we have 4 candidates:

    • The backend went down sometime after the PUT, and is still down.
    • The backend went down after the LOCK and went up before the PUT.
    • The backend was flushed/expunged/reset after the LOCK.
    • The region has too short of a lock_timeout.

    if i were engineering this to really handle this (which obviously right now, we don't), I'd want a retry system in place that will take that generated value and try the whole thing again - keeping in mind that another generator may have put a value in there already. because if the lock is gone, you have no idea what happened. This aspect may very well benefit from new logic inside the region code, but I haven't looked into this.

    I fundamentally agree with this, but I think it can be messy to implement.

    Assuming a release error, there are 3 possible situations:

    • Another, newer, lock exists for this creation. IIRC, either in the dogpile or redis code the lock's value is a unique hash so we are assured the right process owns the lock.
    • A value exists in the cloud.
    • No lock, no value. yay!

    If there is a newer lock, do we just let it be -- and return the value?

    If there is an existing value, do we reset it? (even if the inner payload is unique, the cached metadata with expiretime could need a reset) The value could have been created via a a direct set instead of get_and_create.

    If you want to squash the exception, you can use your custom lock thing on your end with a hook on ours, so the discussion here is only for informational purposes.

    Well, this is a intermediary solution to a complex problem that will need a lot of thought and discussion.

    The thing is, getting this lock to work was so fucking hard, I've spent years on this process from back when I did it in myghty and beaker and then trying to generalize it on dogpile, also running beaker in a a few production systems for a couple of years and watching all the weird things that happened over there, when presented with the idea of, "what if we just make it that if the lock itself just vanishes, who cares?" , I have no idea what effects that will cause. When i dealt with this on a very loaded production system, basically restarting the cache was a recipe for doom. I hope to have the task of building up a redis cache environment someday so I can come up with a solution to this problem.

    "Restarting the cache" took down Facebook and handful of other big sites a few times. When I was at IAC, I discovered a long-present repeated downtime was from a caching layer failing and the site getting DDOS'd under certain conditions.

    So I fully appreciate and need the utility of the lock. My use of "who cares?" is not a comment on the utility of the lock, but identifying the stakeholders.

    If the locking system goes down (which is highly likely on a distributed backend like redis), what should we do? Depending on the environment and project, the stakeholders and solutions will be different.

    Redis is a great tool, with a lot of peculiarities. I think you'd understand a lot of my logged issues and ideas after a deep-dive into it on production. It's very stable -- but since it's entirely memory based, you become very concerned with size and performance. You have to set redis-based timeouts for keys and locks, otherwise you easily run out of memory or jam your applications.

  9. erowan1

    Hello,

    I am running into this issue using a nested dogpile cache_on_arguments decorator (one cache function calls another). The program is multi-threaded. Would using TLS by exposing the thread_local arg in the redis client lock in the redis backend config perhaps fix it?

    https://github.com/andymccurdy/redis-py/pull/492.

    The thread_local arg defaults to False. https://github.com/dystedium/redis-py/blob/b010f4fc14c0d016b268b45d4950d88d39be9227/redis/client.py#L481

    stacktrace {"error":{"data":{"error":{"traceback":"error=Cannot release a lock that's no longer owned, traceback is: (<class 'redis.exceptions.LockError'>: Cannot release a lock that's no longer owned [/Users/rowan/dev/eclipse/workspace/GX-Feeds/capture/netflow/handlers.py|geo_flights|310] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/dogpile.cache-0.6.1-py3.5.egg/dogpile/cache/region.py|decorate|1053] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/dogpile.cache-0.6.1-py3.5.egg/dogpile/cache/region.py|get_or_create|657] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/dogpile.cache-0.6.1-py3.5.egg/dogpile/lock.py|enter|154] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/dogpile.cache-0.6.1-py3.5.egg/dogpile/lock.py|_enter|94] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/dogpile.cache-0.6.1-py3.5.egg/dogpile/lock.py|_enter_create|149] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/redis-2.10.5-py3.5.egg/redis/lock.py|release|135] [/Users/rowan/dev/eclipse/workspace/GX-Feeds/env/lib/python3.5/site-packages/redis-2.10.5-py3.5.egg/redis/lock.py|do_release|264])"

    Cheers, Rowan

  10. jvanasco reporter

    It's hard to tell.

    I've gotten the same error from lock timeouts on the server:

     process A - gets lock, w/hash 'foo', starts to generate value
     process A - lock times out
     process B - gets lock, w/hash 'bar', starts to generate value
     process A - value generated, tries to release lock, fails because 'foo' != 'bar'
    

    It looks like the issue in that PR is talking about a multi-threaded process that shares locks (which may or may not be the case). You can try extending the length of the lock timeouts to something ridiculously long as a test.

    I ended up writing an alternative backend:

    https://github.com/jvanasco/dogpile_backend_redis_advanced

    It has a two features that might help you:

    1. You can write a custom wrapper for the lock, and have that either silently pass those errors and/or start logging them so you can figure out WTF is going on.

    2. You can change what the lock_prefix is -- ie, you can make the lock start with something custom like "LOCK-" so it is easier to filter on the redis client ( I was getting way too many false positives using the default prefix)

  11. erowan1

    I am going to split out the caching functions so they are not nested to see if that helps but using the redis TLS lock option sounds like a good addition to the dogpile redis backend. I'll take a look at your code if I am still in trouble. Thanks.

  12. Log in to comment