use Cython for C extensions?
see attachments
Comments (37)
-
reporter -
reporter - assigned issue to
-
reporter - changed watchers to klaussfreire@gmail.com
-
Is "no C extension" no extension at all, or just no patch?
If it's no extension at all, it makes it difficult to compare. I also don't know about the test's representativeness; it doesn't seem to do much with the objects, just fetch them.
-
reporter "no C extension" means I removed the file "cinstrumented.so" alone, so the other C extensions remained in place.
-
reporter so yes, what would be good here would be a test like orm2010.py that has a heavy emphasis on attribute access. orm2010.py is heavy on the persistence side, though persistence makes a lot of use of getters too.
-
Um... I've been having trouble reproducing the function call excess problem with test_orm, and I noticed it doesn't use declarative, so only relationships use instrumented attributes.
However, when using declarative, even basic columns do.
So, I guess what's needed here is a test using declarative. Is orm_2010 that? I can't find it in test/aaa_profiling...
Edit: Never mind... it's not aaa_profiling but perf...
-
reporter using declarative vs. mapper() has no effect on how columns are ultimately mapped. If the table has a column ".id", it is instrumented in the same way whether or not declarative is used.
-
Then I'm running the tests the wrong way...
I remove or restore ./build/lib.blah/sqlalchemy/cinstrumented.so, then run ./sqla_nose.py test.aaa_profiling.test_orm or python ./test/perf/orm2010.py
-
Oh... I know... I have cinstrumented installed in python, if it's not there it loads it from /usr/local...
I'll have to build a virtualenv.
-
Alright, virtualenv in place,
Unpatched (no cinstrumented):
(env)claudiofreire@klaumpp:~/src/SQLAlchemy-0.8.1> PYTHONPATH=$(pwd)/build/lib.linux-x86_64-2.7/ python test/perf/orm2010read.py
/home/claudiofreire/src/SQLAlchemy-0.8.1/build/lib.linux-x86_64-2.7/sqlalchemy/types.py:307: SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
  d[coltype] = rp = d['impl'].result_processor(dialect, coltype)
SQLA Version: 0.8.1
Total calls 706284
Total cpu seconds: 1.94
Total execute calls: 101
Total executemany calls: 0
sh: runsnake: command not found
Patched (with cinstrumented):
(env)claudiofreire@klaumpp:~/src/SQLAlchemy-0.8.1> PYTHONPATH=$(pwd)/build/lib.linux-x86_64-2.7/ python test/perf/orm2010read.py
/home/claudiofreire/src/SQLAlchemy-0.8.1/build/lib.linux-x86_64-2.7/sqlalchemy/types.py:307: SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
  d[coltype] = rp = d['impl'].result_processor(dialect, coltype)
SQLA Version: 0.8.1
Total calls 646370
Total cpu seconds: 1.69
Total execute calls: 101
Total executemany calls: 0
sh: runsnake: command not found
Will attach orm2010read.py, with emphasis on read-only stuff.
-
- attached orm2010read.py
Version of test/perf/orm2010.py with emphasis on read-only querying
-
reporter OK...can you tell me about this:
basic_dict = PyObject_GetAttrString(instance, "__dict__");
if (basic_dict == NULL) {
    /* No problem, we'll fall back to the generic implementation anyway */
    PyErr_Clear();
}
what it seems like we're doing here, is seeing if {{{__dict__}}} is there, and if so, then great, we can skip using {{{instance_dict()}}}. That's not actually how {{{instance_dict()}}} is supposed to work - it's supposed to be consulted unconditionally. The whole {{{instance_dict()}}} thing is really unfortunate in that I think there is exactly one application on the planet that actually needs it to access something other than {{{__dict__}}}, but the fact is that in the default case it is actually an {{{attrgetter}}} for {{{__dict__}}} and should be extremely fast. Am I correct about that step, and can that be removed, hopefully simplifying the code? Overall it's a little weird that we're replacing literally six lines of Python code with 426 lines of C.
-
No, it's not that at all. What it's doing is trying to get the {{{__dict__}}}, for comparison against the {{{instance_dict()}}} return value. Caching (which is what accelerates things) is only enabled when the {{{__dict__}}} is the same dict {{{instance_dict()}}} returns, and {{{self.impl.get(instance_state(instance), dict_)}}} is successful (ie: no errors). If both happen, then from that point onwards, all attribute accesses of that kind (ie: {{{Entity.id}}} for all instances of Entity) will shortcut directly to {{{instance.__dict__[self.key]}}}, as long as {{{globals()['instance_dict']}}} hasn't changed.
PS: The comment there, when it says it will fall back, refers to instances that have no {{{__dict__}}} (ie: if they have slots). AFAIK, mapped instances cannot easily omit it, but there's no reason to assume that in that code.
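As a rough pure-Python sketch of that caching scheme (hypothetical names, not SQLAlchemy's actual classes; {{{instance_dict}}} is simplified to a plain {{{__dict__}}} getter):

```python
# Hypothetical sketch of the caching shortcut described above -- NOT
# SQLAlchemy's actual code.  The C version caches the instance_dict()
# function object; while it is unchanged and the instance's __dict__
# is the dict instance_dict() returns, reads go straight to
# instance.__dict__[key] with no Python-level calls.

def instance_dict(instance):          # stand-in for the real hook
    return instance.__dict__

class CachingAttribute:
    def __init__(self, key):
        self.key = key
        self.cached_instance_dict = None   # mirrors self->cached_instance_dict

    def __set__(self, instance, value):
        instance.__dict__[self.key] = value

    def __get__(self, instance, owner):
        if instance is None:
            return self
        if self.cached_instance_dict is instance_dict:
            # fast path: straight dict subscript
            try:
                return instance.__dict__[self.key]
            except KeyError:
                pass
        # generic path: consult instance_dict() unconditionally
        dict_ = instance_dict(instance)
        try:
            value = dict_[self.key]
        except KeyError:
            raise AttributeError(self.key)
        # enable the shortcut only after a successful generic lookup
        if dict_ is instance.__dict__:
            self.cached_instance_dict = instance_dict
        return value

class Entity:
    id = CachingAttribute("id")

e = Entity()
e.id = 7
print(e.id)  # generic path; caching kicks in
print(e.id)  # fast path
```

In pure Python the shortcut saves nothing, of course; the point of the C version is that the fast path runs without creating a Python frame at all.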
-
reporter so the speed savings are:

- the comparison of {{{self->cached_instance_dict == instance_dict}}} and such are straight C pointer comparisons, so this is extremely fast, then
- {{{rv = PyObject_GetItem(basic_dict, key);}}} allows us to quickly get a value or NULL without the overhead of either "key in dict" or catching an expensive {{{KeyError}}}

is that right?
also did you run any regular timeit's? I'm assuming we'd see similar percentages.
-
-
Right. Well, {{{key in dict_}}} then {{{return dict_[key]}}} does the lookup twice, and catching the KeyError has to set up a try/except block, which is what's expensive about it (kinda like a stack bookmark CPython's VM can roll back to).
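A quick stdlib micro-benchmark sketch of those three lookup styles (absolute numbers are machine-dependent; the relative ordering is the point):

```python
import timeit

d = {"a": 1}

def double_lookup(d=d):
    # "key in dict" then the subscript: two hash lookups
    if "a" in d:
        return d["a"]

def try_except(d=d):
    # cheap on the hit path, expensive on a miss: the raised
    # KeyError must be created and caught
    try:
        return d["missing"]
    except KeyError:
        return None

def single_get(d=d):
    # one lookup, no exception machinery -- closest in spirit to
    # PyObject_GetItem returning a value or NULL
    return d.get("a")

for fn in (double_lookup, try_except, single_get):
    print(fn.__name__, timeit.timeit(fn, number=100_000))
```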
But the real savings here come from:
- {{{Entity.id}}} goes from CPython's VM (LOAD_ATTR) directly to the C code; there's no frame setup or bytecode involved (that's why the call doesn't show up in profiles). With these ultra-cheap functions, frame setup can be 90% of the cost.
- {{{PyBlah_Bleh}}} calls are all native. True, they're not immensely quicker than the equivalent VM bytecodes, but the fact that they're just doing simple dict ops and non-property attribute access makes the VM itself a hindrance. {{{Entity.id}}} was about a dozen opcodes, i.e. a dozen trips through the huge switch block that interprets the bytecode. Now it's a pair of native calls. That's way faster in comparison.
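The opcode count is easy to check with the stdlib {{{dis}}} module (a small illustrative sketch; the exact opcodes vary across CPython versions):

```python
import dis

class Entity:
    pass

e = Entity()
e.id = 42

def read_attr():
    # one attribute read: the VM executes several bytecodes per access,
    # each one a trip through the interpreter's dispatch loop
    return e.id

dis.dis(read_attr)
```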
While those two wouldn't normally matter much in the overall picture, the fact that properly optimized CPU-bound SQLAlchemy apps tend to be all about reading attributes (in my experience) changes things.
I didn't try the timeit. I will as soon as I get back to my SA devel machine (ie: the office).
-
BTW: Hold on a bit if 426 lines of C look intimidating (I'd think so), since I still have to try Pyrex. I've got to check whether Pyrex can replace the other extensions too, for it would be pointless to just introduce it for this one and not the others.
-
Ok, timeit.
This is the test:
>>> def test():
...     for _ in xrange(10):
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...         _ = rc.hash
...
Yeah, I did it on the console, and yeah, I made sure both runs contained the same number of rc.hash lines.
With cinstrumented:
>>> timeit.timeit(test)
156.2858281135559
Without cinstrumented:
>>> timeit.timeit(test)
188.22997498512268
Unsurprising.
PS: I did not include a query in the timeit function because I'm working with a remote DB ATM, and it'd overshadow attribute access. I don't have a local db handy.
-
Account Deleted (original author: ged) Replying to claudiofreire:
BTW: Hold on a bit if 426 lines of C look intimidating (I'd think so), since I still have to try Pyrex. I've got to check whether Pyrex can replace the other extensions too, for it would be pointless to just introduce it for this one and not the others.
FWIW, I discovered Cython just after I finished implementing the initial C extensions (but before they got committed), so I did try to generate them using Cython. Back then, I could not get Cython code (never tried Pyrex) to be nearly as fast as the pure C extensions, so I dropped the idea since the pure C extensions were already written. It was faster than pure Python, but not fast enough (IMO) to be worth it. It is certainly worth trying again: if the speed improvement is close enough to the pure C extensions, Cython code is indeed much more maintainable.
-
Well, it certainly is tricky to get optimal performance, but I've never failed to achieve C speeds with Pyrex. I know Cython tries to do type inference to speed things up; in Pyrex you have to annotate types. I've found explicit annotation more predictable than relying on type inference, which is why I prefer Pyrex. I just need careful cdefs and everything flies.
When there is a need for native C code, you can always include C code from Pyrex. I've in fact done that to include C++ code, when C++'s template power was required. The benefit over pure C extensions is that you avoid all that CPython boilerplate.
So it's a matter of how much needs to be pure C and how much can Pyrex replace, I don't think it will make a difference in speed (ie: it will be just as fast).
-
So, I finally got back to this. Pyrex has no Py3 support, so Cython's the only choice.
cprocessors.int_to_bool becomes a tad slower, going from 0.84s (py2) to 0.94s (py3) for 10M iterations. Is that in line with your previous results?
The same code with cython and py2 stays at 0.84s, so I'm wondering if the slowness is a py3 thing instead of a cython thing.
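For reference, the processor being timed is tiny; a pure-Python equivalent is roughly the following (a sketch, the exact SQLAlchemy source may differ slightly):

```python
def int_to_boolean(value):
    # Sketch of what an int-to-bool result processor does: NULLs pass
    # through, any other value collapses to True/False.  For a body
    # this trivial, the C/Cython win is skipping the Python frame
    # setup entirely, not the body itself.
    if value is None:
        return None
    return bool(value)

print(int_to_boolean(1), int_to_boolean(0), int_to_boolean(None))
```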
-
reporter - changed milestone to blue sky
this issue seems to be dead for now, tabling out to blue sky.
-
Ehm... without an answer to the above?
-
reporter What's the question? How long 10M iterations of int_to_bool take on my system? How would that be of use without something to compare it to on the identical system?
-
No, it had been said that at some point someone tried Cython, and the performance hit was unacceptable. The question was whether the performance hit mentioned above was equally unacceptable.
-
reporter It's not a big deal. But also, re: the attribute access thing, if the speed savings are just that it caches the getting of various attributes, why can't that be done in pure Python?
-
The savings are in the avoidance of creating and initializing python frames, which is the main source of function call overhead in python. Any function call in python will initialize a frame, some will also create one (there's a free list to avoid that), so that's why it cannot be done in python.
Attribute access in python is very very fast, so the overhead of initializing a frame for such access is noticeable.
The caching is there only as a tool to prevent calling python code (which, again, pays the full cost of initializing a frame, no matter how fast that function is inside).
I could find no way to achieve that in pure python. Usually, I use operator.attrgetter to create fast attribute-getting functions, but this descriptor does a little bit more than just accessing the attribute, so that's not viable here.
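For comparison, here is the {{{operator.attrgetter}}} approach mentioned above (a sketch with hypothetical names; it handles plain attribute fetches without a Python frame per call, but can't carry the descriptor's extra bookkeeping):

```python
from operator import attrgetter

class Record:
    def __init__(self, hash_):
        self.hash = hash_

# attrgetter is implemented in C: calling it does not create a
# Python frame, which is why it's the usual fast-path trick.
get_hash = attrgetter("hash")

rc = Record(0xdeadbeef)
print(get_hash(rc))
```

The limitation is that attrgetter can only do the raw fetch; it cannot also run the caching/validation logic the instrumented descriptor needs around that fetch.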
-
Account Deleted (original author: ged) Replying to claudiofreire:
So, I finally got back to this. Pyrex has no Py3 support, so Cython's the only choice.
cprocessors.int_to_bool becomes a tad slower, going from 0.84s (py2) to 0.94s (py3) for 10M iterations. Is that in line with your previous results?
The same code with cython and py2 stays at 0.84s, so I'm wondering if the slowness is a py3 thing instead of a cython thing.
IIRC, the offender(s) (ie too slow) was not in the type processors, but rather in the rowproxy special methods (getitem and possibly getattr).
-
Yes, and I attached a patch that adds a C extension for it, but there was a (founded) concern that it would be hard to maintain, so I suggested using Cython. And since having just one extension done in Cython seemed backwards, it was also suggested to move them all to Cython, for ease of maintenance.
So I started with cprocessors, and noting there was a (slight) performance decrease, I wasn't sure whether to proceed.
-
reporter Unfortunately I can't give you a guarantee that I'll accept a patch which rewrites all the C extensions in Cython. I'd need to see it, need to see in a bigger sense how the performance differences are (testing one or two functions in isolation is not that informative), need to understand how it works, how to maintain it, what kinds of build/installation issues are going to be introduced, where it might re-introduce mistakes that have been fixed in the existing C exts (relatively unlikely) or where it introduces all new mistakes (very likely). I'm curious to see how they'd look but as I'm not the one playing with Cython right now I don't have too clear of a view how much code we'd introduce and how well it runs.
Overall I'm not too enthused about a rewrite of the C exts at this point because we've got many years of stability with the current ones, they at least don't have any dependencies, and adding many more if any seems like a wasted effort - Pypy is a funded project going full steam ahead on making such C extensions unnecessary, and they are very interested most specifically in optimizing for the Django and SQLAlchemy projects.
-
Alright, close then.
-
reporter - changed title to use Cython for C extensions?
OK, I was curious at least what some of the code would look like, I'm leaving it open as something to look into down the road, depending on future need. The complaints we get about speed right now are limited mostly to folks trying to insert millions of rows with the ORM.
-
- attached proc.pyx
Incomplete Cython-based cprocessors
-
I attached the WIP cprocessors (proc.pyx) so you get an idea of what it looks like.
-
reporter It's not terrible; it's like typesafe Python. If you want to branch somewhere and update setup.py to use Cython for .pyx files, I can try building it.
-
reporter current pypy 2.1 doubles the speed of the test script, which I've updated in 7f9becf3fbbdb33d2 to have heavy writing and reading, with a 50% improvement, whereas current C extensions only improve it about 6%. see https://gist.github.com/zzzeek/6143700.
-
reporter - changed status to wontfix
I think we're sticking with our hand-written C extensions for now - they're very well tested, and certainly the existence of PyPy as a viable option makes it hard to justify reworking the whole approach.
Running this against 0.8, only one of the aaa_profiling tests falls out of range here. Here's a summary of most of the ORM related profiling tests. The orm_2010.py script has the best savings as it works with a large number of objects, but even then only 2%. So far it doesn't look too dramatic.