Using Pickle For Serialization Causes Memory Usage To Grow Indefinitely

Issue #103 resolved
Paul Brown
created an issue

I am caching a large list of unique objects using the RedisBackend and I am seeing memory usage increase each time the cache expires.

This issue appears to be the intended behavior of pickle:

"The Pickler instance keeps a memory of each of the lists dumped alive, so that if you later pickle a reference to the same list (or other mutable object) again, it can pickle a reference rather than a copy of the value. This is a feature.

By using the same Pickler instance to dump 10,000 unrelated lists, you simply grow the memo data structure beyond reason. So just don't do this!" - Guido van Rossum

So, we saw our memory usage grow because pickle.dumps was storing the new list in its memo each time the cache expired.
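The memo behavior Guido describes is easy to observe in isolation (this is a standalone illustration, not dogpile.cache code): when one Pickler is shared, the second dump of the same list is emitted as a tiny memo back-reference rather than a full copy, which is exactly the state that accumulates.

```python
import io
import pickle

big = list(range(1000))

# One shared Pickler: its memo keeps a reference to every object dumped.
buf = io.BytesIO()
pickler = pickle.Pickler(buf)
pickler.dump(big)
first_len = buf.tell()
pickler.dump(big)  # second dump is just a memo back-reference
second_len = buf.tell() - first_len

# The back-reference is far smaller than the full payload,
# proving the Pickler retained the first list in its memo.
assert second_len < first_len
```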

What would be the best way to solve this?

Since custom serialization is not possible yet, I was thinking about overriding the RedisBackend and adding a pickle.Pickler.clear_memo call to the set method.
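The general shape of that workaround could look like the sketch below. This is a hypothetical, standalone helper (the class name and wiring are invented for illustration; it is not dogpile.cache's actual API): a single long-lived Pickler whose memo is cleared after every dump so it can never grow.

```python
import io
import pickle

class ReusablePickler:
    """Hypothetical sketch: share one Pickler across dumps, but call
    clear_memo() after each dump so the memo does not accumulate a
    reference to every value ever serialized."""

    def __init__(self):
        self._buf = io.BytesIO()
        self._pickler = pickle.Pickler(self._buf)

    def dumps(self, obj):
        # Reset the buffer, serialize, then forget the dumped objects.
        self._buf.seek(0)
        self._buf.truncate()
        self._pickler.dump(obj)
        self._pickler.clear_memo()
        return self._buf.getvalue()
```

Because the memo is cleared each time, repeated dumps of the same object all produce complete, identical payloads instead of memo back-references.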

Comments (6)

  1. Michael Bayer repo owner

    dogpile.cache uses the pickle.dumps() and pickle.loads() functions; it does not create a Pickler directly, nor does it re-use a Pickler object.

    There is no use of Python's Pickler anywhere in dogpile.cache:

    $ grep -l "Pickler" `find dogpile -name "*.py"`

    Looking at the source of pickle.dumps() / loads() in Python itself:

    def dumps(obj, protocol=None):
        file = StringIO()
        Pickler(file, protocol).dump(obj)
        return file.getvalue()

    a new Pickler is created every time, so this is not indicative of the Python issue you refer to. There is no "global" storage of data that's been pickled.
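    This is straightforward to confirm: because each pickle.dumps() call constructs its own Pickler, repeated calls for the same object produce identical, self-contained payloads, with no memo carried over between calls.

    ```python
    import pickle

    big = list(range(1000))

    # Each call builds a fresh Pickler, so nothing is remembered
    # between calls: both payloads are complete and identical.
    a = pickle.dumps(big)
    b = pickle.dumps(big)
    assert a == b
    assert pickle.loads(a) == big
    ```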

    Without more detail here, I can't determine anything more about your issue.

  2. Paul Brown reporter

    It looks like the real issue ended up being due to distributed_locking causing a bunch more traffic to our redis server. Our memory usage problem was due to each request taking longer (due to the busy redis server) which eventually led to too many requests happening at the same time.

    Turning off distributed locking led to a lot less waiting on redis (and less traffic), and things are back to normal.

    While I was watching the output from MONITOR, I noticed it issuing a bunch of lock-related deletes too. But there were a massive number of lock-related requests in general. Since redis is single-threaded and deletes are blocking, that might be what caused distributed locking to make things seem so much busier.

  3. Paul Brown reporter

    Learned even more about this issue. We were just making way too many calls to the cache, and distributed_locking was just compounding the problem. Because overloaded redis + waiting on redis for locks = even more waiting. Once distributed_locking was turned off and the redis logs weren't full of lock-related messages, it was easy to see that we were making some unnecessary calls to the cache.
