1. Michael Bayer
  2. dogpile.cache
Issue #18 new

get serialization to be configurable with redis backend, others ?

Michael Bayer
repo owner created an issue

No description provided.

Comments (37)

  1. jvanasco

    "serialization, the API is a loads()/dumps() function call, probably passed as any object/module that has "loads"/"dumps" available - even a named tuple. Using ObjectsWithUpperCaseNames is way too zopey for this."

    that works. sorry about #21 -- didn't see this one ( was searching issues with wrong terms )

  2. jvanasco

    coming back to this.

    a few more thoughts on making this configurable/modular:

    • the ability to store/retrieve raw data in the client. ie, don't wrap it in api.CachedValue or pickle it. if i pass in a string that was encoded via msgpack, let it be. don't pickle a number or string -- that actually uses up a lot of space in the cache! these are all things that an advanced user would want to be able to turn off.

    • handling the time relevance/metadata as part of the deserialization. in order to give the best amount of control on the data storage/format, the serializer should be able to encode/omit metadata as it sees fit. however, that would require the serializes to then handle the verification of timeliness (or anything else)

    A usecase example is storing a username-to-id mapping in 2 different backends:

    Given: Jonathan == 10001

    • Storing in a dbm, I would need to store the time-relevance in the CachedValue object. The perfect solution is a pickled object that pairs the data with metadata.

    data raw: {'data': 10001, 'metadata': {'expires': '1444097265'}}
    data pickled :  (dp0\nS'data'\np1\nI10001\nsS'metadata'\np2\n(dp3\nS'expires'\np4\nS'1444097265'\np5\nss."
    

    • Storing in redis, the server already handles expiry and memory is a concern. The perfect solution is storing the raw value

    data raw: 10001
    

    The dbm solution doesn't work well for redis in production though, it takes up 78bytes to achieve the same goal that 5bytes could.

  3. jvanasco

    dogpile.cream is working on my staging site and on production web spiders. no errors.

    Spent an hour integrating the tests into my normal fork (https://bitbucket.org/jvanasco/dogpile.cache/commits/branch/feature-cachecontrol_region)

    I think I can pull this off with a few API changes:

    • dogpile.cache.api.CachedSerializer new implements loads/dumps. configured by region, can overwrite backend defaults. This should probably either only be Pickle or RawData.

    • dogpile.cache.api.CachedValueManager new. this has the actual logic for validating/invalidating data within the CacheRegion. (prototyping in region.py). This will allow end-users to access/enter raw data into a datastore instead of 'cache' objects that contain the timestamp and version..

    • The CacheRegion now proxies all the management of timestamps/etc into the new CachedValueManager. A custom class could implement validation on value that does not use timestamps (ie, time managed by server) or stores the data in another manner

  4. Darin Gordon

    Hey Mike and Jon

    I'm interested in using dogpile cache as the standard caching plugin for yosai (http://www.github.com/yosaiproject). In the project, objects are serialized using marshmallow and then encoded with msgpack/json/userdefined

    Within yosai's serialization module is a SerializationManager. In this SerializationManager, I add one layer of metadata to the message payload and then encode the message using whatever scheme it has been configured to use, such as MSGPack or JSON.

    Through the region configuration process, I'd like to inject a SerializationManager-like object into dogpile cache that would control serialization/deserialization.

    What do you think of me submitting a PR to add a serialization management layer to dogpile.cache?

    reference: https://github.com/YosaiProject/yosai/blob/master/yosai/serialize/serialize.py

    DG

  5. Michael Bayer reporter

    I was thinking of just adding a flag to the Redis backend where you can pass in a pickler / serializer object of your choice, or set it to None. The yosai thing sounds like it would be a dogpile cache ProxyBackend (and could be part of yosai).

  6. Michael Bayer reporter

    that is, I dont think trivial serialization like pickling, which is already intrinsically part of some client libraries and not in our control like that of all the memcached backends, should be a "middleware" component by default. this is a little bit of TMTOWTDI but just like the key mangler, having small injectables for rudimentary key/value mangling is valuable vs. requiring a mediation object in all case.

  7. jvanasco

    Darin Gordon

    So basically, I did that here. Feel free to play with my fork/branch

    https://bitbucket.org/zzzeek/dogpile.cache/pull-requests/39/fine-grained-support-for-cache/diff

    I had a shower-thought on it the other day about consolidating this further, but since forgot :/ It was a good one too.

    The basic premise is that I saw the need for 2 things to be split out from dogpile:

    1. Cache Serialization -- allowing every backend do specify a loads/dumps interface (defaults to cPickle)

    2. Cache Value Management -- dogpile doesn't cache raw values; it wraps a cached values into a new object with some metadata for timestamp/version expiry. if this is extended via a class, then developers can customize the format -- either omitting timestamp data (for example, redis and some other potential backends handle this server-side), or using some sort of nested field within the cached value.

    BUT

    What you're talking about could simply be used with a proxy backend as Mike noted. Mike recently accepted a patch where I illustrated using msgpack as a proxy backend.

    The caveat though, is that dogpile will cache: Pickle.dumps(CachedValue(msgpack.serialize(val)))

    In order to cache msgpack.serialize(value), an approach (like in my fork) is needed.

  8. Darin Gordon

    Pickling a well-serialized object is cringe worthy. Further, over the last few years, I haven't heard a single engineer recommend using pickle for serialization. This is just an assumption based on anecdotal evidence, but it seems that "the industry" has moved on. As Alex Gaynor said, pickles are for delis-- not software. So, why not just remove pickle entirely?

  9. jvanasco

    pickling a well-serialized object is cringe-worthy. it is also unwanted when dealing with raw data (such as caching a raw number vs a pickled python object containing a number and metdata).

    but dogpile is caching library, not a "well serialized object" caching library. it makes sense to support pickle, because often times you may want to cache an unserialized python object. since pickle is the historical default, it needs to be maintained for backwards compatibility.

  10. Darin Gordon

    I'll do as recommended and serialize using a customized ProxyBackend. I just don't want any pickling.

    Turning off pickling involves a straightforward update. A default of pickle=True would ensure others wouldn't be impacted by the change.

    Mike are you good with moving forward on this approach?

  11. Michael Bayer reporter

    Darin Gordon - we should let the loads/dumps functions be directly injectable, so you can stick json.loads/dumps in there if you wanted, or pickle with a lower protocol, whatever. None I guess would be the setting to turn off loads/dumps.

  12. jvanasco

    I don't believe this approach can work as intended.

    cache.region (the core logic file) wraps the "value" (which is serialized by any proxy backends) into an instance of the CachedValue object class, and that instance is passed into the backend (e.g. backends/redis.py)

    reference:

    If you turn off pickling in the backend, you're still stuck with a Python instance (CachedValue) being passed to the backend. That instance contains the data serialized by the custom proxy_backend AND some metadata that the current implementation of region.py requires on deserialization.

    reference:

    Just to be clear about this: backend/version.py (currently) always receives a Python object, so it requires a (de)serialization hook to work for most backends -- redis included. (There are 2 backends that don't require this -- the memory backend inherently doesn't need to pickle, and the memcached driver accepts python objects because it pickles them itself)

    The logic flow is like this:

    [raw value] -> [proxy backend munging] -> [region.py wraps in CachedValue w/metadata] -> [backend/driver.py serializes]

    Without restructuring the logic as I've proposed, the easiest way to achieve Darin Gordon's goals would be to re-implement the redis backend in a custom package that performs msgpack serialization on the backend value (instead of the pre-backend value munging that proxy_backend offers); that could also just fake the entire CachedValue wrapper or store the metadata within the msg_pack format.

  13. Michael Bayer reporter

    If you turn off pickling in the backend, you're still stuck with a Python instance (CachedValue) being passed to the backend. That instance contains the data serialized by the custom proxy_backend AND some metadata that the current implementation of region.py requires on deserialization.

    ah. go you, for reading my source code I never look at ! this is true.

    But still. CachedValueManager? Why do we need another object and why such a J2EE-like architecture? CachedValue is our payload. If i dont want to use pickle, then that indicates I want to do the work of writing some other kind of serialization for CachedValue. There is still a loads()/dumps() function. It's just the approach of allowing "None" is no longer necessary, because you need a serializer no matter what. If you want to serialize CachedValue using json, that's fine.

    This just places some extra burdens on the production of loads()/dumps() functions for those people who have some issue with pickling. They have to handle the CachedValue wrapper.

  14. jvanasco

    I have spent way too much time recently reverse-engineering your code!

    They way I implemented this, it's 100% backwards compatible. The backends use a serialization class, but they default to pickle or none (whatever is current). If you want to configure a region to use JSON instead, then you just pass that in during config.

    The reason why I came up with CachedValueManager is two-fold:

    1. What if I don't want to store a CachedValue object -- I just want to store the raw data? Backends like Redis support daemon-side cache invalidation and timeouts.

    2. What if I want to customize the cache management logic? The current CachedValue implementation is just a timestamp + version creation and check. This is a great "baseline" bit of logic for cache management, but people often have more elaborate needs. The core benefit/utility/brilliance of the dogpile.cache package is in the handling of the Lock mechanism. Using a "CachedValueManager", virtually any caching logic can be bootstrapped onto dogpile's locking system and caching infrastructure.

    The general goal was that this approach allows for the 80% of "basic needs" dogpile users to see no difference in usage, but the 20% of "advanced needs" users would be able to build out much of the custom work and edge-cases that are often requested.

  15. Michael Bayer reporter

    the backend "timeout" value is not compatible with the dogpile approach, because when the backend "times out", the value is gone - we can't return it to clients while the "new" value is generated.

    CacheRegion can be passed a CachedValue subclass as a parameter also.

  16. jvanasco
    • If the backend times out, then it's a cache miss. When using backends like Redis, the secondary check in dogpile is often a considerable waste of cache space. The performance hit of using dogpile instead of a direct redis client is negligible. The hit on data size is significant though -- and means running larger (or more) virtual servers to keep everything in memory.

    CachedValue is directly imported and used by the region (I don't think you got to writing/commiting the feature you're thinking of). All of the validation logic is performed within the CacheRegion as well. Are you possibly thinking about customizing with the should_cache_fn argument?

  17. Michael Bayer reporter

    I dont agree the mechanism of CacheRegion is so trivially swappable and also CachedValueManager is too much up-front interface / OOP for an anticipated use case. I propose a plugin for the CachedValue class, not using it at all is not an option; just use redis directly if you don't want the features of CacheRegion (e.g. dogpile protection). The entire point of CacheRegion is that it does not expose a so-called "cache miss", by returning the previously cached value when expiration is detected.

  18. Michael Bayer reporter

    But also, the use case you're looking for, "I don't want pickle", as always ; the RedisBackend accepts a loads() / dumps() pair, that's it. You need to determine what to do with the given CachedValue object.

  19. jvanasco
    1. My solution is far from fine -- I'll be the first to admit it. It's an early stab at solving this. The intent of my PR was to spark dialog and find a way that leads to the right solution as this is complex.

    2. Michael Bayer Another core benefit of dogpile (perhaps more important) is that it establishes the Lock on value creation (whether there is a miss or expiration) so the create function will only be called once. While the angle I propose would lessen the utility of still-using the old value, the Lock is still incredibly useful and important. The Python-based expiration is also never happening if someone is already using redis or riak as a backend with the server timeouts enabled (i saw a riak plugin, though I forget where).

    The idea of turning off the python-side cache management was also influenced by the fact that this would allow users to hot-swap backends and fine-tune for performance by only changing the dogpile config. redis / riak / memcached could be switched and profiled quickly. Depending on an application's usage, memory (and the ability to hold an entire data-set in working memory) may be much more important that speed and concurrency. It's not pleasant ripping out and re-implementing caching libraries when you need to test this (sadly know from experience ;/)

    1. The RedisBackend doesn't accept a loads()/dumps() pair. Everything is hardcoded into pulling pickle from dogpile.compat and then piping the CachedValue into it.

    So now I remembered by shower-thought--

    If we implement my idea for dogpile.cache.api.CachedSerializer that would be a good first step. CachedSerializer is just an object class that supports loads/dumps; the implementation I prototyped just has an instance of the class passed into the region during configuration (and defaults to classes that handle the current behavior). I chose an object instance, because that means passing in a single argument (instead of two) and I like nicely defined interfaces (pyramid's use of zope interface has grown on me).

    The reason why this would be good first step, is that it allows for undocumented/unsupported/offlabel munging of the data. A developer could repack the CachedValue data into an internal structure and rebuild it as needed... or fake the CachedValue as not having an expiry.

  20. Darin Gordon

    Michael Bayer Makes sense to keep the CacheRegion largely intact. It is the bread and butter of this project.

    Why should backends be concerned about load/dump (serialization)? Instead, a backend should expect cache-ready, serialized objects as input. Do we agree on this point?

    If we agree to above two points, then we're leading towards a CacheSerializer managing serialization / de-serialization, and consequently CacheValue creation. Am I correct?

  21. Michael Bayer reporter

    Why should backends be concerned about load/dump (serialization)?

    because, they already are. Here is the source of the most common client, python-memcached: https://github.com/linsomniac/python-memcached/blob/master/memcache.py#L928

    that particular memcached client, as it was first, set the stage for all the other ones. What is extremely interesting to note here is that they are taking advantage of the memcached protocol itself to store additional "flags" which indicate how the value is serialized; these flags are only available on the wire protocol and are not exposed by the client.

    So the memcached backends provide very elaborate, wire-protocol-bound, serialization functionality already, hardcoded to pickle.

    To be fair, while I don't know how well they document this, they are using those flags to indicate if the value was in fact a Python object to start with and is pickled, and if a plain string was passed, it is not pickled; so there is a way to affect whether or not this occurs.

    But the need for serialization itself as well as the best way to do it is a backend-specific affair. When dogpile was written, the memory, file and memcached backends were it. One backend needed no serialization (memory), one had none (file), and another already provided it (memcached). The job of serialization was clearly something that changes per-backend.

    If we removed calls to "pickle" from the file / redis backends, those backends will now fail to function correctly with the cache region. If we put pickling into the CacheRegion itself, it would need to be turned on by default, which means it is firing unnecessarily for the memory and memcached backends. The backends should be able to work with a region without additional flags added to the region itself, other than the backend itself. Behaviors that need to travel with the backend should be configured with it.

    The overarching theme of this thread is one of vague suggestions of "undocumented/unsupported/offlabel munging of the data" and "virtually any caching logic can be bootstrapped onto dogpile's locking system and caching infrastructure", I can't work with those. Adding small flags, sure. But rearchitecting the object model, that's a huge job, and one that has to be derived from a set of very specific use cases.

    Throughout this thread, I still don't know what either of you want to do. Not serializing isn't an option, and "Alex Gaynor said pickle is dead" is not really sufficient :). jvanasco 's use cases are even more obscure. Just building out a more exploded API and declaring "now it can do anything!" also is not enough for me to go on (and also seems kind of out of scope for, "we want better control over serialization").

  22. jvanasco

    First off - I think it's important to note that technology has changed a lot since the earlier dogpile releases. Memcached used to (in general) be the only option for caching daemons, now it is often considered the worst -- with redis, riak and some other systems being much faster and feature-filled. Processors and memory are also generally much higher, so the chance for blocking tends to be down quite a bit.

    Dogpile may have intended to be the caching system built around the dogpile problem, but it has grown to be much more. Many people are attracted to it because it defines a robust API with hot-swappable backends.

    Unfortunately, my application has grown to the point where dogpile has become a huge bottleneck for me. The manner in which dogpile wraps data into a CachedValue object makes my "mapping" region (name to id) about 18x larger, and most of my other objects 3-4x larger. Dogpile essentially creates a situation where I need a dogpile lock addressed, because my caching backend can only hold 1/6 the data and everything is expiring based on FIFO rules.

    I do not want to rip out every bit of dogpile in my app and replace with a direct redis client (I lose hot-swap options meaning I have to run a configured redis on every environment.). That means I can either maintain a forked dogpile (which I'd like to avoid, but it a better option) or figure out some patch/api change that will give me enough hooks to accomplish what I need. Being able to specify serializers for each backend does that -- as it creates a "hook" at the right spot.

    That being said, let me just clarify my suggestion:

  23. jvanasco

    Darin Gordon

    dogpile.cream was a first stab at an override. it works, but you should ignore it. I have a fork in a feature branch under dogpile.cache called feature-cachecontrol_region -- https://bitbucket.org/jvanasco/dogpile.cache/branch/feature-cachecontrol_region -- that is a much better place to look.

    There are 2 dogpile contexts, and Michael Bayer is primarily talking about one of them. Imagine there is a really expensive function that takes 5 minutes to generate a value. In the span of 15 seconds, 15 unique requests are made for that value, once each second. It could cripple a server to handle all 15.

    The first dogpile context is the "unprimed cache". If there is no value in the cache: the first request creates a "Lock" on creating the value. The lock keeps the next 14 people from generating a value, but makes them wait for a result.

    The second dogpile context is "stale data". If there is a value in the cache but it has expired, the first request creates a new value, while the next 14 requests use the stale data.

    "Stale data" is possible because the CachedValue is actually a custom tuple with 2 properties:

     value = actual_value
     metadata = {'ct': foo,  # cache timestamp
                         'v': 1,   # dogpile API version
                         }
    

    While redis can have a timeout on values, they are not required and a value may not necessarily have a timeout (keys can have no expiry). Memcached, dbm, memory, and other datastores don't have native timeouts either. Redis, Riak, and a few others "newer" daemons support this (increasingly popular) concept.

    In order to support the second context (and generally have timeouts on backends that do not have a native timeout), dogpile needs that expiry timestamp in the cached value.

    My big issue right now is just raw memory. This old Instagram post illustrates a similar issue [they used hashing to minimize the data stored based on prefixes, my concern is just on the wrapper], but goes into the concerns of server resources. http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value

    So my thought is this: if there is a documented loads/dumps hook in the backend, then a developer can deconstruct/rebuild a CachedValue as needed and that solves most problems.

    Custom serializers for redis might do this:

      class MsgpackSerializer(DogpileSerializer):
            """`object_hooks` would be needed on the msgpack serializer to deal with the CachedValue object"""
            def dumps(self, cached_value):
                  return msgpack.dumps(cached_value)
    
            def loads(self, repacked):
                  return msgpack.loads(repacked)
    
      class MsgpackRepackSerializer(DogpileSerializer):
            """recode the metadata into the data struct, then pull it out"""
            def dumps(self, cached_value):
                  repacked = cached_value.value
                  repacked._dogpile = cached_value.metadata
                  return msgpack.dumps(repacked)
    
            def loads(self, repacked):
                  unpacked = msgpack.loads(repacked)
                  metadata = repacked.pop('_dogpile')
                  cached_value = CachedValue(repacked, metadata)
                  return cached_value
    
      class RawDataSerializer(DogpileSerializer):
            """This defats the dogpile lock's Stale Data protections, but lets a very advanced user store lots of data.  Most users would never think of this, so it's generally safe."""
            def dumps(self, cached_value):
                  return cached_value.value
    
            def loads(self, repacked):
                  unpacked = repacked
                  cached_value = CachedValue(repacked, fake_timestamp())
                  return cached_value
    
  24. Michael Bayer reporter

    The manner in which dogpile wraps data into a CachedValue object makes my "mapping" region (name to id) about 18x larger, and most of my other objects 3-4x larger.

    If you have an approach where you'd like to store the information of the CachedValue in a different way, I'm still not understanding why a ProxyBackend is not acceptable. that's what your recipe at http://dogpilecache.readthedocs.org/en/latest/usage.html#encoding-decoding-data-into-another-format does. The "CachedValue makes my data 18x larger" phrase implies that you have very small data values.

    I've read your two latest comments, not taking into account the various proposals / code examples because I really need to see only the problem to be solved, and that's the only problem case I can see described. Does the recipe you posted no longer solve this problem?

    I'm not reading every word of these very long comments very carefully yet because I still don't have knowledge of the problem to be solved. The two problems so far I see are 1. dowwie doesnt want to use pickle because someone said "pickle was dead" and 2. you don't like the way CachedValue serializes because it takes up too much memory.

    I'm not seeing how "allow a pickle function and/or None to be passed to redis" doesn't solve problem #1 and how the recipe at http://dogpilecache.readthedocs.org/en/latest/usage.html#encoding-decoding-data-into-another-format doesn't solve problem #2 (and in fact, if you can turn off the pickle aspect of the backend, lets you do anything you want with the data, full blown). Putting aside questions of purity, where things should be, what cache backends are cool now, etc.

  25. jvanasco

    I'll try to be concise:

    1. proxy_backend does not work because it operates on the raw value - not the "CachedValue". The output of a proxy_backend is wrapped by region.py into a CachedValue via the _value method. That Python object is then passed into the actual backend. There exists no hook within the backends or Region._value to interact with the data.

    2. I am simply proposing that we allow a serialization option to be passed into all backends. It could be dumps/loads functions, but I suggest a class instance (and the code above illustrates an implementation) that implements a loads/dumps interface because that means only 1 argument. My objection was to simply disabling pickle, because the value received by the backend is an object.

    In terms of the other stuff. For a simple mapping of usernames to ids (which we use in URL resolution and is our most used function), a "raw value" of 10001 is a about 1/18 the size of that number wrapped into a CachedValue and pickled.

  26. Michael Bayer reporter

    That Python object is then passed into the actual backend.

    I'm not following - the ProxyBackend gets the CachedValue object. Just added a test for this to verify I'm not crazy. See the above cset. Did you mean something else?

  27. jvanasco

    I'm explaining this wrong, and am also bit confused/going-crazy-on-this after weeks of obsessing over the internals. The sheer number of tests/branches/approaches I have to get around the Pickling issue is intimidating.

    You are correct and proxy_backend does get a CachedValue instance -- not raw data. Operating on value.payload to strip the CachedValue does work. My tests were picking up the Pickling value on the backend.

    As my goal was to remove Pickle, I realized that MsgPack worked better as a "Serializer" than a "Proxy". Implementing msgpack as the storage format using a proxy required still required disabling pickling on the backend. As a custom serializer passed into backends though, it is all accomplished with a single hook.

  28. Log in to comment