new idea for bake

Issue #3054 resolved
Mike Bayer repo owner created an issue

The issue with "bake" has always been the ability to derive an appropriate cache key for a query. Based on the observation, from other SQL products, that a Python code object provides a direct way of establishing a unique line number in a unique source file, I'd like to see if we can make use of lambdas to establish this as a key. Example:

def do_something(session):
    x1 = 5
    x2 = 10

    q = session.query(Foo).filter(Foo.x == bindparam('x1')).\
                 filter(Foo.y == bindparam('x2'))

    baked = Bake(lambda: q)

    return baked.params(x1=x1, x2=x2).all()

the above:

  1. receives the lambda within Bake. We use the code object present here to retrieve the source file and line number.

  2. we establish some kind of cache, perhaps within the namespace of the source module itself, that will store everything we need about this query keyed to the line number.

  3. Assuming the line number is not in the cache:

    a. we call the lambda and get the Query back. We scan this query for bound parameters and establish those as the arguments we'd use in params(). Note that this includes bound parameters that were created implicitly (see below).

    b. we do the various "bake" things that we do in the baked query recipe with this new Query object. We store what we need in the cache.

    c. we invoke the query with the params and return results.

  4. if the line number is in the cache:

    a. we get the Query that we've "baked" from this lambda.

    b. we apply the current session, invoke the query with the params and return results.

Note that we can pull out the bound parameters without calling bindparam() as well. And, if the whole Query is placed into the lambda, then we don't even need to build the query in order to get the cached version:

def do_something(session):
    x1 = 5
    x2 = 10

    baked = Bake(
        lambda: session.query(Foo).
            filter(Foo.x == 5).filter(Foo.y == 10)
    )

    return baked.params(x1, x2).all()

To support the use of params() without any explicit keys, the params() method will also record the order in which it encounters these bound parameters, allowing the values to be passed positionally in that order.
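Collecting bound parameters in encounter order can be illustrated with a toy expression tree; the classes below are hypothetical stand-ins, not SQLAlchemy's actual visitor system:

```python
# Toy illustration of collecting bound parameters in the order they are
# encountered, so values can later be bound positionally. ``BindParam``
# and ``And`` are stand-ins, not SQLAlchemy internals.

class BindParam(object):
    def __init__(self, key):
        self.key = key

class And(object):
    def __init__(self, *clauses):
        self.clauses = clauses

def collect_binds(node, out=None):
    # depth-first walk, recording each BindParam as it is encountered
    if out is None:
        out = []
    if isinstance(node, BindParam):
        out.append(node.key)
    elif isinstance(node, And):
        for clause in node.clauses:
            collect_binds(clause, out)
    return out

expr = And(BindParam('x1'), And(BindParam('x2'), BindParam('x3')))
order = collect_binds(expr)

# positional values are zipped against the recorded order
params = dict(zip(order, (5, 10, 15)))
```

Because the traversal order is deterministic for a fixed query structure, the positional values line up the same way on every call.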

Comments (25)

  1. Dobes Vandermeer

It's definitely a worthy goal to make it easier to cache the queries, but the line number approach would only work if the query was computed unconditionally; otherwise the key must include the result of any conditional tests used to determine what went into the query.

    I think maybe the first step is a safe and easy way to make a query re-usable and to re-use it.

    For example you could have a function/method that takes a query and a desired operation ("all", "first", "count") and returns a callable that takes all the bindparams as keyword arguments and returns the result of that method call.

    Initially people can just create all the variations of the query they need on startup and store them globally (or use a caching scheme of their own invention).

    I think people can easily manage these re-usable queries manually by default, but currently they are a bit of a trick to construct.
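That factory idea might look roughly like this; `FakeQuery` stands in for a real Query, and `reusable` is a hypothetical helper, not an existing API:

```python
# Toy sketch of a reusable-query factory: given a query-producing callable
# and an operation name ("all", "first", ...), return a callable that
# accepts bind parameters as keyword arguments and returns the result.
# ``FakeQuery`` is a stand-in for a real Query object.

class FakeQuery(object):
    def __init__(self):
        self._params = {}

    def params(self, **kw):
        self._params.update(kw)
        return self

    def all(self):
        return ['row-%s' % self._params.get('name')]

    def first(self):
        return self.all()[0]

def reusable(query_fn, operation):
    def go(**bindparams):
        q = query_fn().params(**bindparams)
        return getattr(q, operation)()
    return go

# the returned callable restricts usage to a single final operation
by_name = reusable(FakeQuery, 'first')
```

Since the callable fixes the final operation up front, calling filter() or join() afterwards simply isn't possible, which addresses the safety concern above.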

I did update the baked query recipe wiki page to add a precompile() method to Query that returns a query that can be re-used without recompiling the context, but it doesn't restrict the final operator, and calling filter() or join() again won't throw any exceptions, so it's not as safe as it could be.

    https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/BakedQuery

  2. Mike Bayer reporter

Well, that is the idea: the lambda is used only at the point that the query itself is ready to go and there are no conditionals. This is taken from SQL systems like Geniusql or Pony, where they are using Python list comprehensions to define SQL, based on the AST. Pony's author noted that because they have an AST, they can also cache the whole thing, because the AST is by definition fixed with the source (unless you're using eval/exec).

    Basically the challenge I always have with "Bake" is how do you come up with a cache identity for a given query without having to make up a string like query.bake("myquery").

If you do have a baked query, and you then call an operation like filter() or something, that would basically give you an un-baked query. So it's still "safe" but probably not what one wants. Here, in any case, there's some kind of a "Bake" object that would only have methods that apply to a pre-baked query.

  3. Dobes Vandermeer

Well, if the app is computing these on startup then you don't need a cache key, or they can construct their own. I'm just saying that if you have something like:

    qry = sess.query(MyEntity).filter(whatever)
    if order_ascending:
        qry = qry.order_by(some_column.asc())
    if offset is not None:
        qry = qry.offset(offset)
    if limit is not None:
        qry = qry.limit(limit)
    return qry.all()
    

    Then you cannot use the line number of qry.all() as the cache key for a baked version of the query - you have to include "limit is not None" and "offset is not None" and "order_ascending". Or structure the code differently.

    We use this pattern a lot in our code - maybe others do not. But we typically have a few variations of a query based on some dynamic condition so a line number based cache key wouldn't help us. It may help others.

The "query builder" pattern I posted on the wiki works well for us because we pass in whatever elements cause changes to the query, and thus we can safely calculate the cache key based on the arguments; the cache itself is attached to the function implementation. But it does require some care to use: one might mistakenly think they could pass in the offset itself rather than just the presence or absence of it, in which case you'd get a new cache entry for each offset value used.
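That pattern, with the cache keyed on the arguments that shape the query and attached to the decorated function itself, can be sketched in a few lines. The "baking" step is simulated here by counting builds, and `query_builder` is a hypothetical name, not the wiki recipe's actual code:

```python
import functools

# Sketch of a per-function cache keyed on the arguments that shape the
# query. The expensive build/"bake" step runs once per argument combo.

def query_builder(fn):
    cache = {}  # cache lives on the decorated function, not globally

    @functools.wraps(fn)
    def go(*args):
        if args not in cache:
            cache[args] = fn(*args)  # build (and "bake") once per combo
        return cache[args]

    go.cache = cache
    return go

builds = []

@query_builder
def make_query(has_limit, has_offset, order_ascending):
    builds.append((has_limit, has_offset, order_ascending))
    return "query(%r, %r, %r)" % (has_limit, has_offset, order_ascending)

make_query(True, False, True)
make_query(True, False, True)   # cache hit; not rebuilt
make_query(False, False, True)  # new argument combo; rebuilt
```

Note the caveat from the comment above applies directly: the arguments must be the *presence or absence* of a condition (booleans), never the raw offset or limit value, or the cache grows without bound.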

  4. Mike Bayer reporter

    I see what you mean as far as the conditionals there. Building up a "key" based on every call is the really hard problem I've been trying to avoid, because you still hit all kinds of cases where there's no simple way to figure out a "key". I'll have to look at your approach more closely.

  5. Mike Bayer reporter

    I think maybe the first step is a safe and easy way to make a query re-usable and to re-use it.

    can you be specific here if something beyond either your PrecompiledQuery or my BakedQuery is needed in this regard? I would guess that "subquery eager loading" might have issues due to how it works, but otherwise these two approaches work? There's about ten lines of code. The extension here of course will include these ten lines of code in some form that can be used in the way these recipes are using it now.

  6. Mike Bayer reporter

Also, as far as memoization: based on conversations with other users, I do think this is a big deal, and just asking users to work out the caching scheme is not enough.

Beyond the lambda idea, there is also something I might consider to be the "nuclear option", which is using stack frame inspection to get at the source code that produces a particular generative step. It's still questionable whether this will be straightforward to use or not.

  7. Dobes Vandermeer

    can you be specific here if something beyond either your PrecompiledQuery or my BakedQuery is needed in this regard?

    Not much more - I think one of the two, maybe returning a more restricted object after the "bake" or "precompile_context" step so people don't get too confused if they try to use it incorrectly.

    I think having the baked queries as a callable that just returns the desired result(s) could be neat in that you can leave them around in a module or class and call them like a function to fetch the results. The object identity of that callable is the cache key. It does require the user to come up with all the variations of the query in advance. The lambda approach also requires them to ensure a different lambda parameter for each variation of the query, though.

    # Imagining here that reusable_query automatically caches the context
    # internally and binds to the current session on use.
    class X(base):
        ...

    # assigned after the class exists, since X can't reference itself
    # from within its own class body
    X.by_name = reusable_query(X).filter(X.name == bindparam('name')).first


    X.by_name(...)
  8. Mike Bayer reporter

OK, I've made two examples of how this can be done without explicit cache keys. The first one uses frame inspection. While it allows you to build up a Query as we normally do, it is nuts, scary, and doesn't work consistently, so I'm not sure that's really the best way to go.

The next one is just less than pleasing day to day: you have to use lambdas to build up the query. But it works very simply and with better performance, since, remember, Query's methods themselves take up call counts, and if we're really trying to minimize call counts, we should be minimizing as much as possible, right? This does that very simply:

        def do_query_cached(filter_, join, order_by):
            sess = Session(e)
    
            baked = BakedQuery(lambda: sess.query(A))
    
            if filter_:
                baked.bake(lambda q: q.filter(A.id.in_([2, 3])))
    
            if join:
                baked.bake(lambda q: q.join(A.bs))
    
            if order_by:
                baked.bake(lambda q: q.order_by(A.id.desc()))
    
            return baked
    

It's a little unpleasant, right? I think perhaps this is a good thing, as building up a baked query requires special care to start with.
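One way such a BakedQuery could derive its cache key without any file/line bookkeeping is from the code objects of the lambdas handed to it: a lambda's code object is created once at compile time, so the same source lambdas yield the same key on every call, and each bake() step extends the key. A rough sketch under that assumption, not the recipe's actual implementation:

```python
# Sketch: cache key is the tuple of code objects of every lambda handed
# to the BakedQuery. Hypothetical stand-in, not the actual recipe.

_bakery = {}

class BakedQuery(object):
    def __init__(self, fn):
        self._steps = [fn]

    def bake(self, fn):
        self._steps.append(fn)
        return self

    def _key(self):
        # same source lambdas -> same code objects -> same key
        return tuple(step.__code__ for step in self._steps)

    def build(self):
        key = self._key()
        if key not in _bakery:
            q = self._steps[0]()
            for step in self._steps[1:]:
                q = step(q)  # apply each baked step exactly once
            _bakery[key] = q
        return _bakery[key]

built = []
def base():
    built.append('base')
    return ['query']  # a list stands in for a Query under construction

def run(join):
    bq = BakedQuery(lambda: base())
    if join:
        bq.bake(lambda q: q + ['join'])
    return bq.build()

run(True)
run(True)   # same lambdas, same code objects: cache hit, base() skipped
run(False)  # shorter step chain: different key, built separately
```

The conditionals are handled naturally: skipping a bake() call shortens the key tuple, so each combination of applied steps gets its own cache entry.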

  9. Mike Bayer reporter

    here's the output of the crazy version, counting calls:

    
    Run test with no cache
    total calls:  905892
    
    Run test with cache, but join() every time
    total calls:  499495
    
    Run test with cache + bake the join() step
    total calls:  383373
    
    Run test with cache + bake the join() and order_by() steps
    total calls:  366898
    

    here's the output of the simple version:

    Run test with no cache
    total calls:  915078
    
    Run test with cache
    total calls:  284420
    
  10. Dobes Vandermeer

Oh, interesting: you are doing a traceback every single time someone changes the query. I guess that does catch the conditionals for the most part (maybe single-line conditionals could trick the system, but that's probably not common).

    Passing a new lambda for each condition does look a bit painful - perhaps you could have one lambda or procedure that accepts a list of booleans and use the whole argument list as part of the cache key?

    So something like

    def make_query(filter_, join, order_by):
        q = sess.query(A)
        if filter_: q = q.filter(...)
        if join: q = q.join(...)
        if order_by: q = q.order_by(...)
        return q
    
    baked = BakedQuery(make_query)
    

Using BakedQuery as a decorator would work here, too. I think using just one lambda and passing in the flags seems less painful to me, especially because lambdas are so restrictive in Python compared to named functions.

But then, once you start down this road, you can require it to be a named global function, and you don't need to use the function file and line as the cache key any more; just store the cache in the BakedQuery itself, because there would be just one instance of this per logical query.

  11. Mike Bayer reporter

Yes, the recipe as is (in the third example) works almost that way; I think it's kind of like the recipe you provided. The initial callable you pass to BakedQuery will also accept "args", which can be made part of the cache key:

    def make_query(filter_, join, order_by):
        q = sess.query(A)
        if filter_: q = q.filter(...)
        if join: q = q.join(...)
        if order_by: q = q.order_by(...)
        baked = BakedQuery(lambda: q, args=(filter_, join, order_by))
        return baked
    
  12. Dobes Vandermeer

    I see - well that's more or less how we're doing things now and I'm happy enough with it. I think I mildly prefer my current decorator approach because the args are automatically gathered and used as part of the key and I have a separate cache per function rather than a global cache.

    This approach is still a lot of manual labor, though - you have to remember to use the lambdas or pass in the args and stuff like that. So it's not clear we have actually solved the problem we set out to solve - maybe this is no better than some sort of manual cache management. Might have to brew some examples to compare to.

    I wonder if anyone else might want to jump in on the thread, too ...

  13. Mike Bayer reporter

    OK I think I agree, there should be a decorator we can just stick on top of that. Should be addable.

  14. Mike Bayer reporter

    yup, here we go:

    @BakedQuery.baked
    def make_query_cached(filter_, join, order_by):
        q = sess.query(A)
        if filter_:
            q = q.filter(A.id.in_([2, 3]))
        if join:
            q = q.join(A.bs)
        if order_by:
            q = q.order_by(A.id.desc())
        return q
    

    see https://gist.github.com/zzzeek/43d10d34993126f074a5. output:

    
    $ python test.py
    Run test with no cache
    total calls:  866344
    
    Run test with cache
    total calls:  414820
    
  15. Mike Bayer reporter

In the interest of expediting 1.0, perhaps we should organize the lazyloader integration such that it is entirely an alternate strategy for the moment, and not turn it on by default.

  16. Mike Bayer reporter
    • Added a new extension suite :mod:`sqlalchemy.ext.baked`. This simple but unusual system allows for a dramatic savings in Python overhead for the construction and processing of ORM :class:`.Query` objects, from query construction up through rendering of a string SQL statement. fixes #3054

    → <<cset e3b46bd62405>>
