[PATCH] Optimize ORM speed

Issue #1592 resolved
Former user created an issue

(original reporter: ged) Continuing my speed optimization frenzy, here is a (very) experimental patch to optimize the ORM speed by using a custom ResultProxy which only returns dicts with whatever will be used by the ORM.

Once again, the results are encouraging: ~26% improvement on large queries ~11% improvement on very small queries

Remember that I am timing total time, including the time for the actual DBAPI work.

A few comments about the patch:

* The approach needs to be able to specify a custom ResultProxy class per query (or at least from within the ORM code path). I think this can also be useful outside the ORM, if the user wants to have their results returned in a specific way (I've needed that a couple of times and did it on top of the standard ResultProxy, which was not very efficient, but I didn't think of this solution at the time). Within the patch, I've only made the minimum modification needed so that it would work for my test case (attached). It would of course need to be completed, but I can't do that since I am a bit lost within that part of the code (the different ways to create a connection, the code path it goes through in the dialect, etc.). I am hoping for some help here.
* I know the custom ResultProxy approach is not compatible with the specific result proxies used by the various dialects, but I am sure it is possible to refactor the code to make that work (I haven't thought about it much yet).
* There are two parts of that patch which are ugly to my eyes:
    * the way I'm guessing what exactly will serve as a key in the dictionary. I am hoping there could be an official "contract" that, for example, the first element of the result_map is what is used by the ORM.
    * I have duplicated the functionality of _process_row in fetchall, because the overhead of the function call is significant. This is ugly, I know, but worth it IMO for such a critical part.
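The "fat-free" idea can be sketched roughly like this (a minimal illustration, not the actual patch; the class and parameter names are invented):

```python
# A result wrapper whose fetchall() inlines the per-row processing (no
# _process_row call per row) and returns plain dicts keyed by whatever the
# consumer -- in the patch's case, the ORM -- will actually look up.
class FatFreeResult:
    def __init__(self, cursor, keys, processors):
        self.cursor = cursor          # the underlying DBAPI cursor
        self.keys = keys              # dict keys the consumer will use
        self.processors = processors  # per-column converters, or None

    def fetchall(self):
        keys, procs = self.keys, self.processors
        return [
            dict((key, proc(raw) if proc else raw)
                 for key, proc, raw in zip(keys, procs, row))
            for row in self.cursor.fetchall()
        ]
```

With a processor of None meaning "pass the raw value through", the all-None case degenerates to plain dict construction per row.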

Comments (29)

  1. Mike Bayer repo owner

    What I don't like about this is that what we have here is a big source-code surprise: the ORM doesn't use most of the result object that's so strongly tested in the engine/ and sql/ test suites, and bugs fixed in orm/ related to result processing don't get expressed in sql/. It's the ORM starting to bypass the expression/engine system that it's built upon. Sure, if the ORM starts to just go down to the metal itself, it'll be faster. But internally it's a mess.

    Preferable would be to write an alternate ResultProxy, and all its subclasses, in C. That way it can have full test coverage, and pass the speed enhancements onto everyone. Or to implement some of these speed improvements (for example, not using itemgetter(), which the original RP doesn't use either) in the base ResultProxy.

  2. Former user Account Deleted

    (original author: ged) First, yes, translating to C would be best, but until someone can devote the resources to do so, this is already a good start. And even if you translate to C, having a fat-free algorithm might make a slight difference in C too.

    Secondly, I don't see what kind of surprise you are talking about. This is just an alternate ResultProxy, which does use all the otherwise fine machinery of the sql layer: SQL statement generation, bind processing, result processing, etc... It's just a fat-free one. Yes, this patch lacks unit tests, has many rough edges, etc. but I don't understand why you object the principle.

    Thirdly, as I said, I really think letting the user specify a custom ResultProxy would be useful, for example for creating a "RawResultProxy" which wouldn't do any processing. That could be useful for testing/timing purposes.

  3. Mike Bayer repo owner

    the ResultProxy is supposed to be the lowest layer of result that SQLAlchemy provides, as long as you intend to stay within the SQLA world of things. Otherwise, you can use your DBAPI cursor, in which case you get entirely platform-dependent behavior and top speed. What passing your own "ResultProxy-like" object through grants you is the illusion of platform-independent behavior. But it's not at all - it will break with Oracle binary results, it will break Oracle RETURNING clauses, it will break with PG server-side cursors, and depending on how much you screw with it you might break MS-SQL, which has some tricky things going on, etc.

    Riding on top of all of this is: what problem is trying to be solved here? SQLAlchemy's ResultProxy has been running just fine on extremely high-volume websites like Reddit. I've never had speed problems with resultproxy or the ORM; most speed problems are the queries themselves being too slow. The last time I tested it vs. Storm, it was much faster at fetching results, provided you didn't use Storm's C extension, in which case Storm was about 2x faster - which points to the futility of trying to compete with raw C in any case.

  4. Mike Bayer repo owner

    So here is my other thought on this. The main area your patch seems to be optimizing is the fact that a straight dict is much faster in Python than any __getitem__() style of object. That seems like something that could probably be arranged to occur in whatever callable is present via the _process_row class attribute. So wouldn't most of the speed improvements be available if we just made it a "public" thing to say:

    result = conn.execute(...)
    result.row_cls = fast_row_dict
    

    where fast_row_dict is a def that does the dict thing and loads the results into a plain dict. If it is to be limited to a specific set of columns, it could be like:

    result = conn.execute(...)
    result.row_cls = make_fast_row_dict(table.c.col1, table.c.col2, "foo")
    

    where make_fast_row_dict returns a callable that will pull just those given keys out of each raw row, returning a dict. We can certainly get the Query object to figure out ahead of time what columns to be pulled out, it would require some changes to the loader strategies.
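    A rough sketch of that shape (hedged: the mapping-style row access and the exact callable signature here are assumptions, not the eventual API):

```python
def make_fast_row_dict(*keys):
    # returns a callable that pulls just the given keys out of each raw row,
    # returning a plain dict; assumes rows support lookup by these keys
    def fast_row_dict(row):
        return dict((k, row[k]) for k in keys)
    return fast_row_dict

# hypothetical usage with mapping-style rows:
f = make_fast_row_dict("col1", "foo")
assert f({"col1": 1, "col2": 2, "foo": 3}) == {"col1": 1, "foo": 3}
```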

    The other optimization would be to restore the "conditional" calling of the "processor" func, like your patch does here and somewhat like resultproxy was doing before. This is assuming that itemgetter is slower than row[index], which I think is the case but I'm not sure. That is for a custom row maker:

    def make_a_row(row, callables, resultproxy):
       return dict(
             (key, callables[key](row[key]) if callable else row[key])
             for key in (table.c.col1, table.c.col2, "foo")
       )
    

    this change in the straight RowProxy would be along the lines of:

             try: 
    -            return self.__colfuncs[key][0](self.__row) 
    +            colfunc, index, type_ = self.__colfuncs[key]
    +            if colfunc:
    +                return colfunc(self.__row[index])
    +            else:
    +                return self.__row[index]
    

    so this is maintaining most of what you're doing but just not impacting ResultProxy so much, more of the row-making object. We probably would get a lot of speed just by turning RowProxy alone into the C object, too, and it wouldn't require that much code.
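    The itemgetter-vs-indexing question above is easy to measure locally with timeit (a generic micro-benchmark, not tied to SQLAlchemy's code; absolute numbers vary by machine and interpreter, so treat it as an experiment, not a verdict):

```python
import timeit

# compare operator.itemgetter dispatch against direct subscripting
setup = "from operator import itemgetter; row = tuple(range(10)); get = itemgetter(3)"
t_getter = timeit.timeit("get(row)", setup=setup, number=100000)
t_direct = timeit.timeit("row[3]", setup=setup, number=100000)
# both expressions produce the same value; only the dispatch cost differs
```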

  5. Mike Bayer repo owner

    the custom "row maker" would be more like:

    def make_a_row(row, callables, resultproxy):
       return dict(
             (key, colfunc(row[index]) if colfunc else row[index])
             for key, (colfunc, index, type_) in (
                 (k, callables[k]) for k in (table.c.col1, table.c.col2, "foo")
             )
       )
    
  6. Former user Account Deleted

    (original author: ged) Replying to zzzeek:

    the ResultProxy is supposed to be the lowest layer of result that SQLAlchemy provides, as long as you intend to stay within the SQLA world of things. Otherwise, you can use your DBAPI cursor, in which case you get entirely platform-dependent behavior and top speed. What passing your own "ResultProxy-like" object through grants you is the illusion of platform-independent behavior. But it's not at all - it will break with Oracle binary results, it will break Oracle RETURNING clauses, it will break with PG server-side cursors, and depending on how much you screw with it you might break MS-SQL, which has some tricky things going on, etc.

    Yes, it currently breaks a lot of things, I know that, but I am sure it is possible to reuse the basic ideas here and retain 100% compatibility with all the other stuff.

    Riding on top of all of this is, what problem is trying to be solved here ? SQLAlchemy's ResultProxy has been running just fine on extremely high volume websites like Reddit. I've never had speed problems with resultproxy or the ORM, most speed problems are the queries themselves being too slow.

    Well, in my opinion, more speed without loss of functionality and without unmaintainable code is always worth it. Sure, it might not be your top priority right now, because it is "good enough". Sure, there are other areas which would be possible to speed up. And sure, having a C extension would beat that into the dust. Yet, I see the C extension as something someone might do someday (I might even have a go at it myself if I stay unemployed long enough), and it might not be available on all platforms. What I am proposing is a speedup right now, one which would work for everyone. If you tell me the C extension is right around the corner, then yes, this is probably moot (I'm not even sure).

    The last time I tested it vs. Storm it was much faster at fetching results, provided you didn't use Storm's C extension in which case Storm was about 2x faster, which points to the futility of trying to compete with raw C in any case.

  7. Former user Account Deleted

    (original author: ged) Replying to zzzeek:

    the custom "row maker" would be more like:

    def make_a_row(row, callables, resultproxy):
       return dict(
             (key, colfunc(row[index]) if colfunc else row[index])
             for colfunc, index, type_ in (
                 callables[k] for k in (table.c.col1, table.c.col2, "foo")
             )
       )

    I don't understand that function... Maybe it's the lack of morning coffee (not that I drink any ;-))...

    Note that conditional expressions are a Python 2.5 addition, and AFAIK SA still supports Python 2.4.
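    On Python 2.4 the usual substitute for a conditional expression was the "cond and a or b" idiom, which has its own trap when the middle operand can be falsy (this exact gotcha comes up again later in the thread). A small illustration with an invented helper name:

```python
# The py2.4-compatible "and/or" idiom silently misbehaves when the
# processor's result is falsy:
def apply_processor(proc, value):
    return proc and proc(value) or value

assert apply_processor(None, 5) == 5             # no processor: raw value
assert apply_processor(lambda v: v + 1, 5) == 6  # processed value
# but a processor returning a falsy result falls through to the raw value:
assert apply_processor(lambda v: 0, 5) == 5      # expected 0, got 5!
```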

  8. Former user Account Deleted

    (original author: ged) Replying to zzzeek:

    So here is my other thought on this. The main area your patch seems to be optimizing is the fact that a straight dict is much faster in Python than any __getitem__() style of object.

    Hmm, that's one point, indeed. In fact, there are four significant time savers here:

    * built-in dict vs __getitem__ (i.e. lack of a function call per column)
    * lack of a function call per row
    * faster initialization (most significant on small queries, but it also has a small impact on larger queries). Many apps do a lot of small queries, so it is worth looking at this too.
    * conditional call of the processor (see below).
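    The dict-vs-__getitem__ saver is easy to demonstrate in isolation (RowLike is an invented stand-in here, not SQLAlchemy's RowProxy):

```python
# A plain dict lookup dispatches in C; a proxy object pays a Python-level
# __getitem__ frame on every single column access.
class RowLike:
    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):  # one Python frame per lookup
        return self._data[key]

plain = {"field0": "abc"}
proxied = RowLike(plain)
assert plain["field0"] == proxied["field0"]  # same result, different cost
```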

    That seems like something that could probably be arranged to occur in whatever callable is present via the _process_row class attribute. So wouldn't most of the speed improvements be available if we just made it a "public" thing to say:

    result = conn.execute(...)
    result.row_cls = fast_row_dict
    

    where fast_row_dict is a def that does the dict thing and loads the results into a plain dict. If it is to limit to a specific set of columns, it could be like:

    result = conn.execute(...)
    result.row_cls = make_fast_row_dict(table.c.col1, table.c.col2, "foo")
    

    where make_fast_row_dict returns a callable that will pull just those given keys out of each raw row, returning a dict.

    Yes, this is all in the spirit of what I am trying to achieve. I would personally find it more elegant to pass an argument somewhere (to conn.execute perhaps?), than to monkey_patch the result object, but that's your call here.

    Also, I've been working a bit on a second version of the patch with my code within a _process_rows (notice the plural) method and that makes it suddenly more compatible with all the oddball ResultProxies with only minor changes to some of them.

    The default _process_rows would simply be:

        def _process_rows(self, rows):
            process_row = self._process_row
            return [process_row(self, row) for row in rows]
    

    We can certainly get the Query object to figure out ahead of time what columns to be pulled out,

    Yes, that's what I was hoping for.

    it would require some changes to the loader strategies.

    Really? Isn't a cleaned-up variation on what I have done possible?

    The other optimization would be to restore the "conditional" calling of the "processor" func, like your patch does here and somewhat like resultproxy was doing before. This is assuming that itemgetter is slower than row[index], which I think is the case but I'm not sure.

    AFAIR, itemgetter is indeed slower than row[index] because it does a function call. It is faster than lambda row: row[index] though, because it is done in C. Somehow I didn't think of it in my first optimization pass. Good catch. I will make the change to the current "generic" RowProxy and see how it compares.

    that is for a custom row maker:

    def make_a_row(row, callables, resultproxy):
       return dict(
             (key, callables[key](row[key]) if callable else row[key])
             for key in (table.c.col1, table.c.col2, "foo")
       )
    

    this change in the straight RowProxy would be along the lines of:

             try: 
    -            return self.__colfuncs[key][0](self.__row) 
    +            colfunc, index, type_ = self.__colfuncs[key]
    +            if colfunc:
    +                return colfunc(self.__row[index])
    +            else:
    +                return self.__row[index]
    

    so this is maintaining most of what you're doing but just not impacting ResultProxy so much, more of the row-making object. We probably would get a lot of speed just by turning RowProxy alone into the C object, too, and it wouldn't require that much code.

    I've never done a Python C extension so I am not sure what amount of work it involves. As I said, I will probably have a go at it in a few weeks if I am still unemployed by then.

  9. Former user Account Deleted

    (original author: ged) Here is a totally different approach I tried, which, if accepted, would render the previous patch almost useless.

    This is only a proof-of-concept patch and breaks almost everything except my specific test case, BUT in the most used case (i.e. simple column loaders) it provides a tremendous speedup to the ORM. The principle is that accessing the RowProxy as a tuple is currently way faster (more than 3x) than as a map, so I tried to make the ORM take advantage of that for simple column loaders.

    The approach is probably not usable for more complex loaders but I don't think it matters. We could just call the "simpleloaders" then the more complex ones, unchanged from now.

    A few notes from my experimentation:

    * the conditional function call in RowProxy that we discussed doesn't make any significant speed difference for the __getitem__ use case, but it does produce a ~6% speedup for the __iter__ case, and it makes _init_metadata slightly shorter since we can store the processors directly and don't need to remember explicitly whether it's an "itemgetter" or a "colfunc". This patch includes those changes too.
    * here are the kinds of speedups (as usual, this is total time) this approach can lead to:
        * for small queries: ~12% (about the same as the FastDict approach)
        * for medium queries: ~30%
        * for large queries: ~123%!
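    The core of the tuple-vs-map difference can be sketched generically (all names here are invented for illustration):

```python
# When the ORM can compute the column positions ahead of time, a simple
# column loader indexes the raw row directly instead of going through the
# name-to-index mapping on every access.
def load_keyed(row_mapping, keys):
    # mapping path: one keyed lookup (plus its bookkeeping) per column
    return tuple(row_mapping[k] for k in keys)

def load_positional(raw_row, indexes):
    # tuple path: direct positional access, indexes precomputed per query
    return tuple(raw_row[i] for i in indexes)

raw = (1, "ed", "ed@example.com")
mapping = {"id": 1, "name": "ed", "email": "ed@example.com"}
assert load_keyed(mapping, ("id", "name")) == load_positional(raw, (0, 1))
```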

  10. Mike Bayer repo owner

    Replying to ged:

    Yes, this is all in the spirit of what I am trying to achieve. I would personally find it more elegant to pass an argument somewhere (to conn.execute perhaps?), than to monkey_patch the result object, but that's your call here.

    conn.execute is holy ground. All the execute methods, with the unfortunate exception of Session.execute, have an argument signature that is strictly about "statement", "args". Ideally Session.execute would be the same. Tacking on a custom RowProxy callable to a result strikes me as a very "transparent" approach that people can understand immediately, and it doesn't clutter up the normal API with weird keyword arguments nobody ever uses or weird method calls. Because to be totally honest this feature has "pet feature" written all over it. Nobody's going to care much about it other than the ORM internals and us.

    the only other approach I'd favor would be something along the lines of:

    connection.options(row_dict=MyRowDict).execute()
    

    This might be necessary only if we see that we really have to send some adjustments into ResultProxy._initialize.

    the theme here though is that I'd really like to avoid more ResultProxy subclasses, because that is where the whole thing becomes not worth it. Haven't read all your other ideas yet.

  11. Mike Bayer repo owner

    Replying to ged:

    Here is a totally different approach I tried, which, if accepted, would render the previous patch almost useless.

    It's interesting. I'd have the contract of create_row_processor be:

    (new_execute, key, col, existing_execute)

    mapper chooses new_execute if present otherwise uses key/col with the dict thing.

    I also think "_process_rows" is an interesting idea.

    Also you can't do this in rowproxy:

    return func and func(self.__row[index]) or self.__row[index]
    

    func(self.__row[index]) may be False/None, and it's also an unnecessary evaluation (yes, I know foo if bar else bat is only py2.5)

    Lastly, I really need to keep 0.6 dev moving along and get it out the door. This is all good stuff, but I may have to focus on getting 0.6 polished and stable for beta releases during my big blocks of time (i.e. weekends). I spent last weekend on the previous resultproxy speedup (which was great).

  12. Former user Account Deleted

    (original author: ged) Replying to zzzeek:

    Also you can't do this in rowproxy:

        return func and func(self.__row[index]) or self.__row[index]

    func(self.__row[index]) may be False/None and it's also an unnecessary evaluation

    Doh! I fell for the usual gotcha.

    lastly I really need to keep 0.6 dev moving along and get it out the door. this is all good stuff but I may have to focus on getting 0.6 polished and stable for beta releases during my big blocks of time (i.e. weekends).

    Sad news, though I perfectly understand. That will dilute the impressive total speedup for one release that I was trying to achieve ;-). FWIW, with the latest patch, for large Unicode queries it's a 3.45x total speedup (compared to before my first patch), 1.63x for medium ones, and 1.32x for small ones.

    I spent last weekend on the previous resultproxy speedup (which was great).

    FWIW, the speedup figures were a bit misleading in that the most impressive ones are for the __iter__ use case, which is not that useful (I didn't realize it myself until yesterday). The __getitem__ use case only got a much more modest improvement: ~23% on largish queries, which explains why the ORM improvements weren't in line with those of the SQL layer.

  13. Former user Account Deleted

    (original author: ged) ticket #1323 showed me there is another code path worth optimizing. Here is another proof-of-concept patch which gives more than a 100% speedup for that use case (on quite large queries: 100,000 records -- untested on smaller ones, but it should provide a nice boost too).

  14. Former user Account Deleted

    (original author: ged) And another patch for whenever you'll tackle some ORM optimization. One little change which provides a small but appreciable speedup (a few %) is to not use **kwargs in very hot places like populate_state and in each of the populators.
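    The **kwargs cost is easy to see in isolation (function names invented, in the spirit of populate_state and the populators; numbers vary locally):

```python
import timeit

# Each **kw call packs the keyword arguments into a fresh dict, which adds
# up in per-row hot paths; positional arguments avoid that per-call dict.
def populate_kw(**kw):
    return kw["state"], kw["dict_"]

def populate_pos(state, dict_):
    return state, dict_

# same result either way; only the calling convention differs
assert populate_kw(state=1, dict_=None) == populate_pos(1, None)

t_kw = timeit.timeit(lambda: populate_kw(state=1, dict_=None), number=50000)
t_pos = timeit.timeit(lambda: populate_pos(1, None), number=50000)
```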

  15. Mike Bayer repo owner

    Replying to ged:

    ticket #1323 showed me there is another code path worth optimizing. Here is another proof of concept patch which give more than 100% speedups for that use case (on quite large queries: 100000 records -- untested on smaller ones, but should provide a nice boost too).

    Can you explain the savings on this patch? Is it just that the do-nothing "proc()" function is taken out of the loop? Do your tests ensure that each column on the NamedTuple is actually pulled via __getattr__? __getattr__ in my experience is much slower than pulling straight out of __dict__, and the NamedTuple implementation was based on speed tests. From what I can see here, I would think things would be slower, i.e. remove one proc() but then add __getattr__.

  16. Former user Account Deleted

    (original author: ged) You didn't read the patch entirely: I populate the dict within __getattr__. __getattr__ is indeed much slower. The trick here is to not pay the price of the dict mapping if it's not used, and as a trade-off pay the price of one __getattr__ call in case we do use it, which seems to be mostly unnoticeable with 5 fields. But as this patch was very proof-of-concept-ish, I didn't measure the exact overhead of that __getattr__ call where it would be maximal, i.e. with 2 fields.

    Now, as for the savings of this patch, there are:

    * the saved call for each column
    * not iterating on the row at all (I give it directly to the namedtuple), i.e. lots of row[column] lookups saved, AND the faster RowProxy.__iter__ code path is used within the namedtuple instead of the more expensive RowProxy.__getitem__
    * the saved dictionary creation in the namedtuple

    My timings were for a straight tuple usage, but the savings should be already important for the attribute usage.

    Btw: I wonder how this whole thing would compare to properties, as done at:

    http://docs.python.org/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields
    http://code.activestate.com/recipes/500261/

    If it's fast enough, the approach could also be used in RowProxy.
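    In that spirit, the property approach from those links can be sketched as a tuple subclass whose named fields are properties bound to fixed positions, so attribute access goes through a C-level itemgetter instead of __getattr__ (a generic sketch, not the patch itself):

```python
from operator import itemgetter

def make_row_tuple(*fields):
    # each named field becomes a property wrapping itemgetter(position)
    attrs = dict((name, property(itemgetter(i)))
                 for i, name in enumerate(fields))
    return type("RowTuple", (tuple,), attrs)

Row = make_row_tuple("id", "name")
r = Row((1, "alice"))
assert r == (1, "alice") and r.name == "alice"  # tuple and attribute access
```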

  17. Mike Bayer repo owner

    Replying to ged:

    You didn't read the patch entirely: I populate the dict within the getattr.

    oh, ok. Here's what I'd like to see here: mostly what you have, but let's not even use proc() at all from _ColumnElement - have it just return None. then we check for "None" before running a proc(), or if all None then do the optimized row populate.

  18. Former user Account Deleted

    (original author: ged) Wow, the property approach is indeed much faster. Here is a little patch which saves 32% on the getattr case, by using the property approach, and by getting rid of the generator in there. Here are the timings on my machine:

    generator suppression: 3.42 to 3.13 property approach for NamedTuple: 3.13 to 2.54

  19. Former user Account Deleted

    (original author: ged) argl... I didn't know that dynamically generated classes weren't picklable (without hacks which are not realistically applicable here)... Well, at least I have learned something today :). I'll commit the generator change anyway.
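    The pickling problem is easy to reproduce: pickle stores an instance's class by reference (module plus class name), and a class built at runtime with type() isn't importable under the name it reports:

```python
import pickle
from operator import itemgetter

# "GeneratedRow" is not bound anywhere importable, so pickling an
# instance fails when pickle tries to look the class up by name
Generated = type("GeneratedRow", (tuple,), {"id": property(itemgetter(0))})
row = Generated((1,))
try:
    pickle.dumps(row)
    picklable = True
except Exception:  # PicklingError: attribute lookup fails
    picklable = False
assert not picklable
```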

  20. Former user Account Deleted

    (original author: ged) Replying to zzzeek:

    the tuples need to be pickleable. so can't go with that.

    Is it really unacceptable for the tuples to not be picklable directly? I.e., to require the user to do tuple(result) if they want to pickle it?

  21. Mike Bayer repo owner

    Replying to ged:

    Replying to zzzeek:

    the tuples need to be pickleable. so can't go with that.

    Is it really unacceptable for the tuples to not be picklable directly? Ie, to require the user to do tuple(result) if he wants to pickle it?

    It's not acceptable. Plenty of users complained about RowProxy, and over here the NamedTuple un-pickleability in 0.5 drives me crazy; it is quite frankly ridiculous that a simple tuple-like object cannot be pickled.

  22. Former user Account Deleted

    (original author: ged) Replying to zzzeek:

    Replying to ged:

    You didn't read the patch entirely: I populate the dict within the getattr.

    oh, ok. Here's what I'd like to see here: mostly what you have, but let's not even use proc() at all from _ColumnElement - have it just return None. then we check for "None" before running a proc(),

    Checking whether the proc is None for each column is actually slower than the current situation, because in that case you can't use a list comprehension, so I guess we need to keep the process function.

    or if all None then do the optimized row populate.

    Just for the fun of it, I tried to progressively strip useless processing all the way down to nothing. In fact, in my test case, no processing at all is necessary, so the overhead is annoying. Yes, of course, this is a particular case, but not that exceptional given that Unicode columns do not systematically require processing anymore.

    Test table has 1 integer pk, 10 Unicode fields, 100000 records. Timings are done using:

    [(f0, f1) for f0, f1, f2, f3, f4 in session.query(Test.field0, Test.field1, Test.field2, Test.field3, Test.field4)]
    
    • 2.3 current speed
    • 1.91 no dict creation in NamedTuple
    • 1.8 by skipping the "proc" call
    • 1.06 by passing row directly to NamedTuple -- this only works if there are the same number of columns, in the same order, in the result as in the ORM query, which is the case when the ORM builds the query itself but not necessarily the case if the user passes their own result to .instances()
    • 0.93 by using a new "extract" method on RowProxy which returns a list instead of an iterator, and no "vals=list(vals)" in NamedTuple
    • 0.78 by using tuple(row.extract()) instead of NamedTuple
    • 0.75 by returning row.extract() directly (ie returning a list instead of a tuple)
    • 0.71 by removing the test for debug in RowProxy
    • 0.55 by returning the raw row in RowProxy if all processors are None (precomputed in ResultProxy)
    • 0.41 by returning the raw result from the _fetchall_impl in fetchall if all processors are None

    And this is pretty much the same speed as the test case with the raw dbapi (or so says the profiler).

    A few conclusions:

    • It might be a good idea to add a fast path in ResultProxy when all processors are None.
    • It might be a good idea to offer some way to tell what kind of result you want, so that people who are not interested in the __getattr__ behavior do not have to pay a high price for it (if we cannot optimize it to be fast in all cases).
    • It is most likely a good idea to have a different path when we know that all columns within the result will be used and are in the correct order.
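    The all-processors-None fast path can be sketched as follows (invented names; this mirrors the 0.55/0.41 steps in the timing list above):

```python
# Precompute "are there any processors at all?" once per result set, and
# hand the raw DBAPI rows back untouched when the answer is no.
def process_rows(raw_rows, processors):
    if not any(processors):  # computed once, not per row
        return raw_rows      # fast path: zero per-row/per-column work
    return [
        tuple(proc(val) if proc else val
              for proc, val in zip(processors, row))
        for row in raw_rows
    ]

rows = [(1, "abc"), (2, "def")]
assert process_rows(rows, [None, None]) is rows                  # untouched
assert process_rows(rows, [None, str.upper]) == [(1, "ABC"), (2, "DEF")]
```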

  23. Former user Account Deleted

    (original author: ged) I couldn't disagree more...

    Of all the attached patches, only fastdict_resultproxy.patch and kwargs_are_inefficient.patch are irrelevant now.

    The approach used in "orm_uses_col_indexes.patch" still produces a significant speedup (even with the C extension -- of course, the gain is much smaller in that case).

    Some kind of NamedTuple optimization is also still needed.

    And the approach in querying_columns_only_opt.patch also produces a significant speedup in the query(C.col1, C.col2) case.

    It would be wasteful to lose track of these by closing the ticket. On the other hand, I agree this ticket is a bit of a mess due to the experimental nature of my work (sorry about that), so maybe creating new tickets for each of the individual issues at hand would be preferable to simply reopening this ticket?

  24. Mike Bayer repo owner

    Please create individual tickets with specific features and target them at 0.6.xx; otherwise it's too difficult to track status.
