use Cython for C extensions?

Issue #2720 resolved
Mike Bayer repo owner created an issue

see attachments

Comments (37)

  1. Mike Bayer reporter

    Running this against 0.8, only one of the aaa_profiling tests falls out of range here. Here's a summary of most of the ORM-related profiling tests. The orm_2010.py script has the best savings, as it works with a large number of objects, but even then only 2%. So far it doesn't look too dramatic.

    C extension, test_orm
    
    Pstats calls: 16991 
    Pstats calls: 113793 
    Pstats calls: 19165 
    Pstats calls: 1190 
    Pstats calls: 122 
    Pstats calls: 18
    
    no C extension, test_orm
    
    Pstats calls: 17987 
    Pstats calls: 114819 
    Pstats calls: 19165 
    Pstats calls: 1190 
    Pstats calls: 122 
    Pstats calls: 18
    
    
    C extension, zoomark_orm:
    
    Pstats calls: 6140 
    Pstats calls: 390 
    Pstats calls: 6389 
    Pstats calls: 19003 
    Pstats calls: 1049 
    Pstats calls: 2671
    
    no C extension, zoomark_orm:
    
    Pstats calls: 6104 
    Pstats calls: 390 
    Pstats calls: 6397 
    Pstats calls: 19024 
    Pstats calls: 1049 
    Pstats calls: 2668
    
    C extension, orm_2010.py
    
    Total calls 4055180
    Total cpu seconds: 4.41
    Total execute calls: 11201
    Total executemany calls: 101
    
    no C extension, orm_2010.py
    
    Total calls 4137179
    Total cpu seconds: 4.46
    Total execute calls: 11201
    Total executemany calls: 101
    
  2. claudiofreire NA

    Is "no C extension" no extension at all, or just no patch?

    If it's no extension at all, it makes it difficult to compare. I also don't know about the test's representativeness; it doesn't seem to do much with the objects, just fetch them.

  3. Mike Bayer reporter

    "no C extension" means I removed the file "cinstrumented.so" alone, so the other C extensions remained in place.

  4. Mike Bayer reporter

    so yes, what would be good here would be a test like orm2010.py that has a heavy emphasis on attribute access. orm2010.py is heavy on the persistence side, though persistence makes a lot of use of getters too.

  5. claudiofreire NA

    Um... I've been having trouble reproducing the function call excess problem with test_orm, and I noticed it doesn't use declarative, so only relationships use instrumented attributes.

    However, when using declarative, even basic columns do.

    So, I guess what's needed here is a test using declarative. Is orm_2010 that? I can't find it in test/aaa_profiling...

    Edit: Never mind... it's not aaa_profiling but perf...

  6. Mike Bayer reporter

    using declarative vs. mapper() has no effect on how columns are ultimately mapped. If the table has a column ".id", it is instrumented in the same way whether or not declarative is used.
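
    (For illustration, a minimal classical-mapping sketch, assuming the 0.8-era API, showing that a mapped column is instrumented with or without declarative:)

    from sqlalchemy import Table, Column, Integer, MetaData
    from sqlalchemy.orm import mapper

    metadata = MetaData()
    user_table = Table("user", metadata,
                       Column("id", Integer, primary_key=True))

    class User(object):
        pass

    mapper(User, user_table)   # classical mapping, no declarative involved
    print(type(User.id))       # InstrumentedAttribute, same as with declarative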

  7. claudiofreire NA

    Then I'm running the tests the wrong way...

    I remove or restore ./build/lib.blah/sqlalchemy/cinstrumented.so and run ./sqla_nose.py test.aaa_profiling.test_orm or python ./test/perf/orm2010.py

  8. claudiofreire NA

    Oh... I know... I have cinstrumented installed in python; if it's not there, it loads it from /usr/local...

    I'll have to build a virtualenv.

  9. claudiofreire NA

    Alright, virtualenv in place:

    Unpatched (no cinstrumented):

    (env)claudiofreire@klaumpp:~/src/SQLAlchemy-0.8.1> PYTHONPATH=$(pwd)/build/lib.linux-x86_64-2.7/ python test/perf/orm2010read.py
    /home/claudiofreire/src/SQLAlchemy-0.8.1/build/lib.linux-x86_64-2.7/sqlalchemy/types.py:307: SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
      d[coltype] = rp = d['impl'].result_processor(dialect, coltype)
    SQLA Version: 0.8.1
    Total calls 706284
    Total cpu seconds: 1.94
    Total execute calls: 101
    Total executemany calls: 0
    sh: runsnake: command not found
    

    Patched (with cinstrumented):

    (env)claudiofreire@klaumpp:~/src/SQLAlchemy-0.8.1> PYTHONPATH=$(pwd)/build/lib.linux-x86_64-2.7/ python test/perf/orm2010read.py
    /home/claudiofreire/src/SQLAlchemy-0.8.1/build/lib.linux-x86_64-2.7/sqlalchemy/types.py:307: SAWarning: Dialect sqlite+pysqlite does *not* support Decimal objects natively, and SQLAlchemy must convert from floating point - rounding errors and other issues may occur. Please consider storing Decimal numbers as strings or integers on this platform for lossless storage.
      d[coltype] = rp = d['impl'].result_processor(dialect, coltype)
    SQLA Version: 0.8.1
    Total calls 646370
    Total cpu seconds: 1.69
    Total execute calls: 101
    Total executemany calls: 0
    sh: runsnake: command not found
    

    Will attach orm2010read.py, with emphasis on read-only stuff.

  10. Mike Bayer reporter

    OK...can you tell me about this:

      basic_dict = PyObject_GetAttrString(instance, "__dict__");
      if (basic_dict == NULL) {
          /* No problem, we'll fall back to the generic implementation anyway */
          PyErr_Clear();
      }
    

    What it seems like we're doing here is seeing if __dict__ is there, and if so, then great, we can skip using instance_dict(). That's not actually how instance_dict() is supposed to work - it's supposed to be consulted unconditionally. The whole instance_dict() thing is really unfortunate in that I think there is exactly one application on the planet that actually needs it to access something other than __dict__, but the fact is that in the default case it is actually an attrgetter for __dict__ and should be extremely fast. Am I correct about that step, and can it be removed, hopefully simplifying the code? Overall it's a little weird that we're replacing literally six lines of Python code with 426 lines of C.
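
    (For reference, the pure-Python hot path under discussion looks roughly like the following sketch; the stubs are stand-ins added so it runs on its own, while in SQLAlchemy instance_dict and the impl are the existing instrumentation machinery:)

    def instance_dict(instance):
        # default case: effectively an attrgetter for __dict__
        return instance.__dict__

    class ImplStub(object):
        # stand-in for the real attribute impl
        def __init__(self, key):
            self.key = key
        def get(self, instance, dict_):
            return dict_.get(self.key)

    class InstrumentedAttributeSketch(object):
        def __init__(self, key):
            self.key = key
            self.impl = ImplStub(key)
        def __get__(self, instance, owner=None):
            if instance is None:
                return self
            dict_ = instance_dict(instance)
            if self.key in dict_:
                return dict_[self.key]
            return self.impl.get(instance, dict_)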

  11. claudiofreire NA

    No, it's not that at all. What it's doing is trying to get the {{{__dict__}}}, for comparison against the instance_dict() return value. Caching (which is what accelerates things) is only enabled when:

    {{{instance.__dict__}}} is the same dict returned by {{{instance_dict(instance)}}},
    
    and
    
    {{{self.impl.get(instance_state(instance), dict_)}}} is successful (ie: no errors).
    
    If both happen, then from that point onwards, all attribute accesses of that kind (ie: {{{Entity.id}}} for all instances of Entity) will shortcut directly to {{{instance.__dict__[self.key]}}}, as long as {{{globals()['instance_dict']}}} hasn't changed.
    
    PS: What the comment refers to when it says it will fall back applies when instances have no {{{__dict__}}} (ie: if they use slots). AFAIK, mapped instances cannot easily omit it, but there's no reason to assume that in this code.
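
    (A rough Python rendering of the cached C fast path just described, building on the descriptor sketch shown under comment 10; the _MISSING sentinel and the validation step are approximations, and the real C code additionally re-checks that the global instance_dict hasn't changed:)

    _MISSING = object()   # illustrative sentinel

    class CachedAttributeSketch(InstrumentedAttributeSketch):
        cached = False    # flips to True once the shortcut is validated

        def __get__(self, instance, owner=None):
            if instance is None:
                return self
            basic_dict = getattr(instance, "__dict__", None)  # None with __slots__
            if self.cached and basic_dict is not None:
                value = basic_dict.get(self.key, _MISSING)
                if value is not _MISSING:
                    return value                     # the shortcut described above
            dict_ = instance_dict(instance)
            value = self.impl.get(instance, dict_)   # generic path
            if basic_dict is dict_:
                self.cached = True    # __dict__ matched instance_dict(): cache
            return value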
    
  12. Mike Bayer reporter

    so the speed savings are:

    1. the comparison of self->cached_instance_dict == instance_dict and such are straight C references, so this is extremely fast, then

    2. rv = PyObject_GetItem(basic_dict, key); allows us to quickly get a value or NULL without the overhead of either "key in dict" or catching an expensive KeyError

    is that right?

    also did you run any regular timeit's? I'm assuming we'd see similar percentages.

  13. claudiofreire NA

    Right. Well, {{{key in dict_}}} then {{{return dict_[key]}}} does the lookup twice, and catching the KeyError has to set up a try/except block, which is what is expensive about it (kinda like a stack bookmark CPython's VM can roll back to).
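
    (To make the comparison concrete, a small illustrative timeit, not taken from the patch; d.get() is the closest pure-Python analogue to PyObject_GetItem returning NULL:)

    import timeit

    d = {"id": 1}

    def double_lookup(key):
        if key in d:           # first hash lookup
            return d[key]      # second hash lookup

    def eafp(key):
        try:                   # sets up an exception-handling block
            return d[key]
        except KeyError:       # expensive only when actually raised
            return None

    def single_get(key):
        return d.get(key)      # one lookup, at C level

    for fn in (double_lookup, eafp, single_get):
        print(fn.__name__, timeit.timeit(lambda: fn("id")))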

    But the real savings here come from:

    • {{{Entity.id}}} goes from CPython's VM (LOAD_ATTR) directly to the C code; there's no frame setup or bytecode involved (that's why the call doesn't show up in profiles). With these ultra-cheap functions, frame setup can be 90% of the cost.
    • {{{PyBlah_Bleh}}} calls are all native. While, true, they're not immensely quicker than the equivalent VM bytecodes, the fact that they're just doing simple dict ops and non-property attribute access makes the VM itself a hindrance. {{{Entity.id}}} was about a dozen opcodes, and that's a dozen loops over the huge switch block that interprets the bytecode. Now it's a pair of native calls, which is way faster in comparison. (A bytecode sketch follows this list.)
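
    (A quick, illustrative way to see the opcode path in question:)

    import dis

    class Entity(object):
        pass

    e = Entity()
    e.id = 7          # plain attribute, no descriptor

    def read(obj):
        return obj.id

    dis.dis(read)
    # Roughly: LOAD_FAST obj; LOAD_ATTR id; RETURN_VALUE. With a plain
    # attribute everything stays inside LOAD_ATTR; with a Python-level
    # descriptor, LOAD_ATTR must call __get__ and pay a frame per access.
    # A C descriptor keeps that work inside LOAD_ATTR's native call.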

    While those two wouldn't normally matter much in the overall picture, the fact that properly optimized CPU-bound SQLAlchemy apps tend to be all about reading attributes (in my experience) changes things.

    I didn't try the timeit. I will as soon as I get back to my SA devel machine (ie: the office).

  14. claudiofreire NA

    BTW: Hold on a bit if 426 lines of C look intimidating (I'd think so), since I still have to try Pyrex. I've got to check whether Pyrex can replace the other extensions too, for it would be pointless to just introduce it for this one and not the others.

  15. claudiofreire NA

    Ok, timeit.

    This is the test:

    >>> def test():
    ...    for _ in xrange(10):
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...        _ = rc.hash
    ...
    

    Yeah, I did it on the console, and yeah, I made sure both runs contained the same number of rc.hash lines.

    With cinstrumented:

    >>> timeit.timeit(test)
    156.2858281135559
    

    Without cinstrumented:

    >>> timeit.timeit(test)
    188.22997498512268
    

    Unsurprising.

    PS: I did not include a query in the timeit function because I'm working with a remote DB ATM, and it'd overshadow attribute access. I don't have a local db handy.

  16. Former user Account Deleted

    (original author: ged) Replying to claudiofreire:

    BTW: Hold on a bit if 426 lines of C look intimidating (I'd think so), since I still have to try Pyrex. I've got to check whether Pyrex can replace the other extensions too, for it would be pointless to just introduce it for this one and not the others.

    FWIW, I discovered Cython just after I finished implementing the initial C extensions (but before they got committed), so I did try to generate them using Cython. Back then, I could not get Cython code (I never tried Pyrex) to be nearly as fast as the pure C extensions, so I dropped the idea since the pure C extensions were already written. It was faster than pure Python but not fast enough (IMO) to be worth it. It is certainly worth trying again, as if the speed comes close enough to the pure C extensions, Cython code is indeed much more maintainable.

  17. claudiofreire NA

    Well, it certainly is tricky to get optimal performance, but I've never failed to achieve C speeds with Pyrex. I know Cython tries to do type inference to speed things up, whereas in Pyrex you have to annotate types. I've found explicit annotation more predictable than relying on type inference, which is why I prefer Pyrex. I just need careful cdefs and everything flies.

    When there is a need for native C code, you can always include C code from Pyrex. I've in fact done that to include C++ code, when C++'s template power was required. The benefit over pure C extensions is that you avoid all that CPython boilerplate.

    So it's a matter of how much needs to be pure C and how much Pyrex can replace; I don't think it will make a difference in speed (ie: it will be just as fast).

  18. claudiofreire NA

    So, I finally got back to this. Pyrex has no Py3 support, so Cython's the only choice.

    cprocessors.int_to_bool becomes a tad slower, going from 0.84s (py2) to 0.94s (py3) for 10M iterations. Is that in line with your previous results?

    The same code with cython and py2 stays at 0.84s, so I'm wondering if the slowness is a py3 thing instead of a cython thing.
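
    (A harness along these lines would reproduce that kind of measurement; the import assumes the compiled module is reachable under the name mentioned above, so adjust the path to wherever the extension is built:)

    import timeit

    setup = "from sqlalchemy.cprocessors import int_to_bool"
    print(timeit.timeit("int_to_bool(1)", setup=setup, number=10000000))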

  19. Mike Bayer reporter

    What's the question? How long 10M iterations of int_to_bool take on my system? How would that be of use without something to compare it to on the identical system?

  20. claudiofreire NA

    No, it had been said that at some point someone tried Cython, and the performance hit was unacceptable. The question was whether the performance hit mentioned above was equally unacceptable.

  21. Mike Bayer reporter

    it's not a big deal. but also, re: the attribute access thing - if the speed savings are just that it caches the getting of various attributes, why can't that be done in pure Python?

  22. claudiofreire NA

    The savings are in the avoidance of creating and initializing python frames, which is the main source of function-call overhead in python. Any function call in python will initialize a frame, and some will also create one (there's a free list to avoid that), which is why this cannot be done in pure python.

    Attribute access in python is very very fast, so the overhead of initializing a frame for such access is noticeable.

    The caching is there only as a tool to prevent calling python code (which, again, pays the full cost of initializing a frame, no matter how fast that function is inside).

    I could find no way to achieve that in pure python. Usually, I use operator.attrgetter to create fast attribute-getting functions, but this descriptor does a little bit more than just accessing the attribute, so that's not viable here.
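
    (Illustrating the attrgetter point: the C-level callable avoids the per-call frame that an equivalent Python function pays, but it cannot carry the descriptor's extra logic:)

    from operator import attrgetter
    import timeit

    class Point(object):
        def __init__(self):
            self.x = 1

    p = Point()
    get_x = attrgetter("x")    # C-level callable: no Python frame per call

    def get_x_py(obj):         # Python function: frame setup on every call
        return obj.x

    print("attrgetter:", timeit.timeit(lambda: get_x(p)))
    print("python fn :", timeit.timeit(lambda: get_x_py(p)))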

  23. Former user Account Deleted

    (original author: ged) Replying to claudiofreire:

    So, I finally got back to this. Pyrex has no Py3 support, so Cython's the only choice.

    cprocessors.int_to_bool becomes a tad slower, going from 0.84s (py2) to 0.94s (py3) for 10M iterations. Is that in line with your previous results?

    The same code with cython and py2 stays at 0.84s, so I'm wondering if the slowness is a py3 thing instead of a cython thing.

    IIRC, the offenders (ie: too slow) were not in the type processors, but rather in the RowProxy special methods (getitem and possibly getattr).

  24. claudiofreire NA

    Yes, and I attached a patch that adds a C extension for it, but there was a (founded) concern that it would be hard to maintain, so I suggested using Cython. And since having just one extension done in Cython seemed backwards, it was also suggested to move them all to Cython, for ease of maintenance.

    So I started with cprocessors, and noting there was a (slight) performance decrease, I wasn't sure whether to proceed.

  25. Mike Bayer reporter

    Unfortunately I can't give you a guarantee that I'll accept a patch which rewrites all the C extensions in Cython. I'd need to see it, need to see in a bigger sense what the performance differences are (testing one or two functions in isolation is not that informative), need to understand how it works, how to maintain it, what kinds of build/installation issues are going to be introduced, where it might re-introduce mistakes that have been fixed in the existing C exts (relatively unlikely) or where it introduces all-new mistakes (very likely). I'm curious to see how they'd look, but as I'm not the one playing with Cython right now I don't have too clear a view of how much code we'd introduce and how well it runs.

    Overall I'm not too enthused about a rewrite of the C exts at this point because we've got many years of stability with the current ones, they at least don't have any dependencies, and adding more dependencies seems like a wasted effort - PyPy is a funded project going full steam ahead on making such C extensions unnecessary, and they are very interested most specifically in optimizing for the Django and SQLAlchemy projects.

  26. Mike Bayer reporter

    OK, I was curious at least what some of the code would look like, I'm leaving it open as something to look into down the road, depending on future need. The complaints we get about speed right now are limited mostly to folks trying to insert millions of rows with the ORM.

  27. Mike Bayer reporter

    it's not terrible, it's like typesafe python. if you want to branch somewhere and update setup.py to use Cython for the .pyx files, I can try building it.
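
    (A minimal sketch of what such a setup.py change might look like; the paths and package layout here are hypothetical, and Cython.Build.cythonize is assumed to be available:)

    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        name="SQLAlchemy",
        ext_modules=cythonize("lib/sqlalchemy/*.pyx"),  # hypothetical path
    )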

  28. Mike Bayer reporter

    I think we're sticking with our hand-written C extensions for now - they're very well tested, and certainly the existence of PyPy as a viable option makes it hard to justify reworking the whole approach.
