Better integration with Cython

Issue #62 new
Jacob Hinkle created an issue

I have been using Cython for a little while now, and I really like it for writing quick functions in Python with very good performance. Cython lets you annotate your Python code with static types, then generates C code which is compiled. Whenever the annotations/inference fail, the generated C simply falls back to calling the Python interpreter, so the worst case is that your code runs as regular Python. For me, I've seen 2-3 orders of magnitude speedups (yes, up to a thousand times) over writing naive Python loops through arrays, and I have even beaten numpy/scipy implementations of a few functions by specializing to 3D or to a particular dtype, etc.
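To give a feel for the kind of annotation I mean, here is a minimal sketch (not PyCA-specific; the function and array are made up) of a def function whose loops compile to C once the types are declared:

# sum3d.pyx -- minimal sketch: the typed memoryview and cdef'd indices let
# the triple loop compile to plain C instead of per-element interpreter calls
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sum3d(float[:, :, :] vol):
    cdef Py_ssize_t i, j, k
    cdef double total = 0.0
    for i in range(vol.shape[0]):
        for j in range(vol.shape[1]):
            for k in range(vol.shape[2]):
                total += vol[i, j, k]
    return total

Called on a numpy float32 array this runs at C speed; strip the annotations and the same loop is ordinary (slow) Python.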

My point is that Cython plus PyCUDA make an ideal extension language for PyCA, but they need a little help to be fully useful. In particular, Cython needs to know in detail about the memory layout of an Image3D, Field3D, Vec3Df, Vec3Di, etc. There is a standard for this called the buffer interface, defined in PEP 3118, which Cython and other libraries follow for this sort of low-level description of static types.
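As a rough sketch of why that matters: once an object exposes its storage through the buffer interface, Cython can borrow it as a typed memoryview with no copying. Assuming Image3D's host data were reachable as a buffer-compatible object (e.g. through asnp(), which is my guess at the mechanism), an in-place operation would look roughly like:

def scale_inplace(im, float alpha):
    # zero-copy view of the image's host memory via the buffer interface
    cdef float[:, :, :] data = im.asnp()
    cdef Py_ssize_t i, j, k
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                data[i, j, k] *= alpha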

I don't yet know how to get SWIG to spit out an interface that can be cimport'ed by Cython, but that is the first step.
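One possible stopgap, until SWIG cooperates, would be a small hand-written .pxd that declares the underlying C++ classes directly. The header path and accessor names below are guesses at PyCA's layout, not something SWIG produces:

# catypes.pxd -- hypothetical hand-written declarations; names are guesses
cdef extern from "Image3D.h":
    cdef cppclass Image3D:
        float* get()            # assumed accessor for the raw voxel pointer
        unsigned int nVox()     # assumed accessor for the total voxel count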

As a second step, I will probably write some Python decorators that would let us quickly write new PyCA operations that get dispatched to either a Cython for loop or a PyCUDA kernel launch. For example, I envision something like the following:

# some pycuda stuff to load a kernel so that it's callable directly
foo_kernel = barmod.get_function("foo_kernel")

# type aliases (ctypedef) to easily annotate functions and local variables
ctypedef catypes.Image3D pyca_im
ctypedef catypes.Field3D pyca_field

# this sets up a little boilerplate to dispatch to and run foo_kernel if we
# detect we're running on the GPU. It can also enforce a convention for
# passing images and fields to a kernel, wherein all the grid info is passed
# in a standard way.
@pyca_dispatch(foo_kernel)
def foo(pyca_im out, pyca_im inp, pyca_field v, int n, float dt):
    # a pretty standard cython function which can loop over out.asnp(),
    # i.e. some python code with cython sprinkled in
    pass

My hope is to make it super simple to operate on PyCA types in a really fast way. The decorator can also handle a lot of the annoying tasks for us. For instance, we commonly implement something in C++ before doing it in CUDA. Now we'd do that in foo() and write @pyca_dispatch(None). The @pyca_dispatch decorator would then know that there's no CUDA version available yet, and if a user tried to run foo with GPU types, they'd get a useful error message (see the sketch below). All this without us having to put stubs all over the place in PyCA itself.
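To be concrete, here is a minimal sketch of what the pure-Python side of pyca_dispatch might look like. The memType() query, the MEM_DEVICE constant, and the launch_pyca_kernel() helper are placeholders I made up, not existing PyCA API:

import functools

MEM_DEVICE = 1          # placeholder for PyCA's device-memory flag
FUNCTION_REGISTRY = []  # (wrapper, gpu_kernel) pairs; handy for testing later

def pyca_dispatch(gpu_kernel):
    def decorator(cpu_func):
        @functools.wraps(cpu_func)
        def wrapper(*args, **kwargs):
            on_gpu = any(getattr(a, "memType", lambda: None)() == MEM_DEVICE
                         for a in args)
            if not on_gpu:
                return cpu_func(*args, **kwargs)       # the cython loop
            if gpu_kernel is None:
                raise NotImplementedError(
                    "%s has no CUDA version yet; call it with host-memory "
                    "images/fields" % cpu_func.__name__)
            # hypothetical helper: pack device pointers plus grid info in a
            # standard order, then launch the pycuda kernel
            return launch_pyca_kernel(gpu_kernel, *args, **kwargs)
        FUNCTION_REGISTRY.append((wrapper, gpu_kernel))
        return wrapper
    return decorator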

When we're satisfied with the CPU version, we implement a kernel and change the dispatch line to point to it and bang: we're off to the races.
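For reference, the pycuda side is only a handful of lines; this is roughly where the barmod/foo_kernel lines in the earlier snippet would come from (the kernel body here is a throwaway placeholder, not a real PyCA operation):

import pycuda.autoinit              # creates a CUDA context on import
from pycuda.compiler import SourceModule

barmod = SourceModule("""
__global__ void foo_kernel(float *out, const float *in, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * dt;   // placeholder computation
}
""")
foo_kernel = barmod.get_function("foo_kernel")
# the decorator line then becomes @pyca_dispatch(foo_kernel)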

One final thing this approach will allow is automated CPU/GPU testing. The dispatch decorator can do all sorts of things under the hood, such as keeping a global list of known functions that take CPU or GPU PyCA types as arguments. Then we can automate the process of comparing the CPU and GPU versions in nose or pytest or whatever.
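As a sketch of what that could look like with pytest, using the FUNCTION_REGISTRY from the decorator sketch above (make_test_args() and copy_to_host() are hypothetical helpers that would build matching host/device inputs and pull results back to the CPU):

import numpy as np
import pytest

@pytest.mark.parametrize("func,gpu_kernel", FUNCTION_REGISTRY)
def test_cpu_gpu_agree(func, gpu_kernel):
    if gpu_kernel is None:
        pytest.skip("no CUDA version yet")
    host_args, dev_args = make_test_args(func)   # hypothetical input builder
    cpu_out = func(*host_args)    # dispatches to the cython loop
    gpu_out = func(*dev_args)     # dispatches to the pycuda kernel
    np.testing.assert_allclose(cpu_out, copy_to_host(gpu_out), rtol=1e-5)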
