Issue #33 resolved

About get_or_create_multi feature

Lx Yu
created an issue

I need a feature to get/create multiple keys in one call.

Take this code for example:

def get_order(order_id):
    return DBSession().query(Order).get(order_id)


def mget_order(order_ids):
    orders = DBSession().query(Order).filter(Order.id.in_(order_ids))
    return {order.id: order for order in orders}

The cache_on_arguments decorator works great for the first get_order function. But for the latter one, if I mget [1, 2, 3] and then [1, 2], they'll generate two separate cached values, and if a single element among the ids expires, the whole cached entry expires.

So I want to let the latter one make use of the first function's cache. Then mget [1, 2, 3, 4, 5] will directly use the cache generated by get, and if, say, [2, 3] have expired, we only need to regenerate [2, 3]. It'll be much faster and more efficient as the ids list grows larger.
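
In other words, roughly this behavior (a sketch with no locking at all, so it ignores the dogpile/thundering-herd part; the key format is just for illustration):

from dogpile.cache.api import NO_VALUE

def mget_order_sketch(order_ids):
    # one round trip for all keys; NO_VALUE marks misses
    keys = ["order:{}".format(i) for i in order_ids]
    result = dict(zip(order_ids, region.get_multi(keys)))

    # regenerate only the ids that missed or expired
    missing = [i for i in order_ids if result[i] is NO_VALUE]
    if missing:
        rows = DBSession().query(Order).filter(Order.id.in_(missing))
        fresh = {order.id: order for order in rows}
        region.set_multi(
            {"order:{}".format(i): o for i, o in fresh.items()})
        result.update(fresh)
    return result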

While going through the source code, I saw that cache_on_arguments uses a Lock from dogpile.core via the following code, which doesn't seem well suited to the multi-key case:

with Lock(
        self._mutex(key),
        gen_value,
        get_value,
        expiration_time,
        async_creator) as value:
    return value

So do you have any suggestions on how to accomplish this goal?

Comments (14)

  1. Lx Yu reporter

    Also, it's a little confusing to me that some code checks the value version while some does not:

    if value is NO_VALUE or \
        value.metadata['v'] != value_version or \
            (self._invalidated and
            value.metadata["ct"] < self._invalidated):
    

    This checks the version, while get_multi does not.

  2. Michael Bayer repo owner

    It might be possible, and as far as the lock, I think you'd use the async_creation_runner feature; this delivers the mutex held by the Lock to the actual creator function. The expiration times and dogpile locks are per-key, so the function here would have a list of locks which it would need to close out once it has retrieved all the values.
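
    For reference, this is roughly the documented async_creation_runner pattern: the runner receives the mutex from the Lock and is responsible for releasing it once the value is set (threaded example, adapted from the docs):

    import threading

    from dogpile.cache import make_region

    def async_creation_runner(cache, somekey, creator, mutex):
        def runner():
            try:
                value = creator()
                cache.set(somekey, value)
            finally:
                # the Lock hands us the mutex; we must release it
                mutex.release()

        thread = threading.Thread(target=runner)
        thread.start()

    region = make_region(
        async_creation_runner=async_creation_runner
    ).configure("dogpile.cache.memory")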

  3. Michael Bayer repo owner

    Oh, that. That's a separate issue; it's just a check in case we upgrade how dogpile.cache stores values in the cache, at which point we'd bump "value_version" to 2. So sure, we'd want to add the value_version check everywhere it's missing now, but that only matters if and when we bump "value_version". That logic has no meaning right now.

  4. Michael Bayer repo owner

    Try out that patch. I've only tested it in a non-concurrent case; I'd need to think pretty deeply about it to see if it actually maintains "dogpile" behavior. It might not, because it does the get_multi() outside of the mutex. The tricky thing here is that the mutexes have to remain per-key, because you want all kinds of combinations of get(x, y, z, ...) to coordinate on keys.
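
    In pseudocode, the shape being discussed is roughly this (a sketch of the idea only, not the actual patch):

    from dogpile.cache.api import NO_VALUE

    def get_or_create_multi_sketch(region, keys, creator):
        # 1. bulk read with no locks held - this is the part that may
        # weaken "dogpile" behavior, since another thread could be
        # regenerating a key while we read it here
        values = dict(zip(keys, region.get_multi(keys)))

        # 2. only the missing/expired keys need creation
        to_create = [k for k in keys if values[k] is NO_VALUE]
        if to_create:
            # 3. per-key mutexes, so get(x) and get_multi(x, y) still
            # coordinate on individual keys (acquisition order matters;
            # see the sorting discussion below)
            mutexes = [region._mutex(k) for k in to_create]
            for mutex in mutexes:
                mutex.acquire()
            try:
                created = creator(to_create)
                region.set_multi(dict(zip(to_create, created)))
                values.update(zip(to_create, created))
            finally:
                for mutex in mutexes:
                    mutex.release()

        return [values[k] for k in keys]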

  5. Michael Bayer repo owner

    silly test:

    from dogpile.cache import make_region
    from itertools import count

    counter = count()

    region = make_region().configure("dogpile.cache.memory")

    def creator(keys):
        print("creator:", keys)
        return [k + next(counter) for k in keys]

    values = region.get_or_create_multi([2, 5, 6, 7, 8], creator)
    print(values)

    region.delete(5)
    region.delete(8)

    values = region.get_or_create_multi([2, 5, 6, 7, 8], creator)
    print(values)
    
  6. Lx Yu reporter

    Thanks, I'll test the performance tomorrow.

    IMHO, we use get_multi mainly for its performance (one request instead of n requests), so performance is the first thing to consider, even at the cost of slightly less accuracy.

  7. Michael Bayer repo owner

    In fact we might need to sort the keys to prevent this.

    Consider one call against (3, 4) and another against (4, 3). If the per-key lock acquisitions for both go out simultaneously, you can deadlock: A holds 3's lock and waits for 4's, while B holds 4's and waits for 3's. I need to look more closely, maybe produce a test, to see if that happens here.
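
    That is, if both callers acquire the per-key mutexes in sorted key order, they agree on a global ordering and no cycle can form (sketch):

    def acquire_all(region, keys):
        # sort so that calls against (3, 4) and (4, 3) both take
        # key 3's lock first; a consistent global order prevents cycles
        mutexes = [region._mutex(key) for key in sorted(keys)]
        for mutex in mutexes:
            mutex.acquire()
        return mutexes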

  8. Lx Yu reporter

    Just tested the code, modified a little; you can check it here: https://gist.github.com/lxyu/5712915

    Changes:

    1. get_multi returns a dict, so I think get_or_create_multi should keep the same behavior.
    2. Added should_cache_fn support.
    3. Check keys_to_get before calling the creator.
    4. Added key sorting.

    Another consideration: the creator function differs from get_or_create's as well, in that it accepts the keys as arguments.

    So it becomes a little complex to add a namespace to the keys; see the following code for example.

    def _dont_cache_none(value):
        # assumed helper (referenced but not shown): skip caching misses
        return value is not None


    def get(order_id):

        def _creator():
            return DBSession().query(Order).get(order_id)

        _get_key = lambda x: "cache:order:{}".format(x)
        return region.get_or_create(
            key=_get_key(order_id),
            creator=_creator,
            should_cache_fn=_dont_cache_none)


    def mget(order_ids):

        def _creator(order_keys):
            ids = [_extract_key(k) for k in order_keys]
            orders = DBSession().query(Order).filter(Order.id.in_(ids)).all()
            return {_get_key(order.id): order for order in orders}

        _get_key = lambda x: "cache:order:{}".format(x)
        _extract_key = lambda x: int(x[x.rfind(':') + 1:])

        keys = [_get_key(i) for i in order_ids]
        orders = region.get_or_create_multi(
            keys=keys,
            creator=_creator,
            should_cache_fn=_dont_cache_none)

        return {i: orders[_get_key(i)] for i in order_ids}


    Any ideas on how to refine this situation? I'm also wondering whether this feature could be written as a multi decorator.
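
    For illustration, a decorator along these lines might work against the gist's dict-returning get_or_create_multi (rough, untested sketch; the names are made up):

    def cache_multi(region, prefix):
        def decorate(fn):
            _key = lambda id_: "{}:{}".format(prefix, id_)
            _id = lambda key: int(key[key.rfind(':') + 1:])

            def wrapper(ids):
                def creator(keys):
                    # hand plain ids to the function, then re-key
                    # its result with the namespaced cache keys
                    result = fn([_id(k) for k in keys])
                    return {_key(i): v for i, v in result.items()}

                cached = region.get_or_create_multi(
                    [_key(i) for i in ids], creator)
                return {i: cached[_key(i)] for i in ids}

            return wrapper
        return decorate

    @cache_multi(region, "cache:order")
    def mget_orders(order_ids):
        orders = DBSession().query(Order).filter(Order.id.in_(order_ids))
        return {order.id: order for order in orders}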

  9. Lx Yu reporter

    Yes, this resolved my problems.

    I actually implemented a decorator of my own and was just thinking about a pull request, then found out that your implementation is better than mine. Hah!

    Thank you very much!
