GPU Maven Plugin
The GPU Maven Plugin compiles Java code with hand-selected Java kernels to CUDA that can run on NVIDIA GPUs of compute capability 2.0 or higher. It encapsulates the build process so that GPU code is as easy to build with Maven as ordinary Java code. The plugin relies on the NVIDIA CUDA SDK, which must be installed separately.
The plugin source includes forks of Rootbeer1 and Soot plus bug repairs. Their author was so attached to his command-line tools and idiosyncratic build conventions that I couldn't wait for him any more.
How it works
Users write ordinary Java that designates code for the GPU by enclosing it in a class that implements the Rootbeer "Kernel" interface. They use an ordinary Java compiler to compile it into a jar of bytecode for the Java virtual machine. This plugin then turns that jar into a new one containing CUDA kernels that run on the GPU on command from the non-kernel part of the program. Only kernels get converted to CUDA; the rest of the program remains Java bytecode.
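A minimal sketch of what such a kernel class might look like. The Kernel interface below is a local stand-in declared only so the example compiles on its own; it assumes the Rootbeer interface has a single gpuMethod(), and in a real project it would be imported from the Rootbeer runtime instead. SquareKernel and KernelSketch are hypothetical names, and the kernels are run sequentially on the CPU here rather than launched on a GPU.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for Rootbeer's Kernel interface (assumption: one method, gpuMethod()).
interface Kernel {
    void gpuMethod();
}

// Hypothetical kernel: each instance squares one element of a shared array.
class SquareKernel implements Kernel {
    private final int[] data;
    private final int index;

    SquareKernel(int[] data, int index) {
        this.data = data;
        this.index = index;
    }

    @Override
    public void gpuMethod() {
        // This body is the part that would be translated to CUDA.
        data[index] = data[index] * data[index];
    }
}

class KernelSketch {
    // Creates one kernel per element and runs them all.
    static int[] squareAll(int[] input) {
        int[] data = input.clone();
        List<Kernel> kernels = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            kernels.add(new SquareKernel(data, i));
        }
        // In a GPU build the runtime would launch these in parallel;
        // this sketch simply calls them one after another on the CPU.
        for (Kernel kernel : kernels) {
            kernel.gpuMethod();
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squareAll(new int[] {1, 2, 3, 4})));
    }
}
```

The non-kernel code (squareAll here) stays ordinary Java; only the gpuMethod() bodies are candidates for conversion.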
Bytecode is a stack-based format that is good for execution but not for code analysis and translation. So Rootbeer uses Soot to find Kernel classes in the jar, locate their dependencies and translate them to Jimple, a 3-address format that Rootbeer then translates into CUDA-compatible C++ source code. Finally, the NVIDIA tool chain compiles the generated source to CUDA binaries and links them into a binary kernel that the original Java can launch on the GPU. This plugin handles all of this automatically, so the build process looks like an ordinary Java compile to its users.
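To illustrate the difference between the two formats, here is a small method with its stack-based bytecode and an equivalent 3-address form shown in comments. The bytecode is what javap -c typically prints for this method; the 3-address lines are written in the style of Jimple with approximate notation, not copied from Soot output. ThreeAddressSketch is a hypothetical name.

```java
class ThreeAddressSketch {
    static int f(int a, int b) {
        return a + b * 2;
    }

    // Stack-based bytecode for f: operands are implicit, living on a stack.
    //   iload_0      push a
    //   iload_1      push b
    //   iconst_2     push the constant 2
    //   imul         pop two values, push b * 2
    //   iadd         pop two values, push a + (b * 2)
    //   ireturn      pop and return the result
    //
    // 3-address form in the style of Jimple (approximate): every
    // intermediate value has a name, which makes analysis easier.
    //   $i0 = b * 2;
    //   $i1 = a + $i0;
    //   return $i1;

    public static void main(String[] args) {
        System.out.println(f(3, 4));
    }
}
```

Because each statement names its operands and result explicitly, the 3-address form maps almost line for line onto C-like source, which is presumably why it is the starting point for generating CUDA code.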
How to use it
These sub-modules contain example applications and poms that show how to prepare code to run on the GPU. The gpu-rootbeer/doc folder provides further details.
gpu-mandelbrot: A fast and feature-rich Mandelbrot generator that uses multithreading (CPU-based) to speed computation of amazing Mandelbrot pictures.
gpu-mandelbrot-gpu: A modification of the above that moves the CPU threads to the GPU to compare the performance of CPU and GPU threads. However, this analysis has not been completed because of the complexity of the code. When complete, this test may be useful for quantifying data transfer costs. Mandelbrot tasks incur little data transfer overhead, but each task is heavily compute-intensive.
gpu-timings: Several common algorithms instrumented to compare CPU-only versus GPU performance across a broad variety of task counts and data sizes. Average computes the average of arrays of various sizes, SumSq computes the sum of their squares, and IntMatrixMultiply and DoubleMatrixMultiply multiply integer and double matrices of various sizes.
Is it worth it?
It depends on your application, in particular on the number of GPU tasks and the amount of work each one does in parallel. There are also as-yet-unmeasured costs for transferring data to and from the GPU. And of course, it helps only for CPU-bound apps.
For example, the gpu-timings/Average application computes the average of large arrays by subdividing each array into chunks, summing each chunk in parallel with independent GPU threads and combining the sums into an average when the tasks are done. The measurements show that converting hand-designated Java kernels to GPU/CUDA becomes beneficial at about a thousand threads (10^3), each summing ten thousand values (10^4) in parallel. The improvement is 2x-4x at those levels and grows to 37x for 10^5 tasks and 10^6 values. The improvement for larger sizes has not yet been measured because of a timeout fault that occurs when huge data sets are loaded into the GPU. Some graphs of the performance improvements are in the downloads section.
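The chunk-and-combine scheme described above can be sketched as follows. This is not the gpu-timings source: ChunkSum and ChunkedAverage are hypothetical names, Runnable stands in for the Rootbeer Kernel interface so the sketch runs on a plain JVM, and the chunks are summed on CPU threads through a parallel stream rather than on a GPU.

```java
import java.util.ArrayList;
import java.util.List;

// One task: sums the half-open range [from, to) of the array.
// In a GPU build this role would be played by a Kernel implementation.
class ChunkSum implements Runnable {
    private final double[] values;
    private final int from;
    private final int to;
    double partialSum;

    ChunkSum(double[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    public void run() {
        double sum = 0;
        for (int i = from; i < to; i++) {
            sum += values[i];
        }
        partialSum = sum;
    }
}

class ChunkedAverage {
    // Splits the array into at most taskCount chunks, sums the chunks
    // independently, then combines the partial sums into one average.
    static double average(double[] values, int taskCount) {
        if (values.length == 0) {
            throw new IllegalArgumentException("cannot average an empty array");
        }
        int tasks = Math.max(1, Math.min(taskCount, values.length));
        int chunkSize = (values.length + tasks - 1) / tasks; // ceiling division

        List<ChunkSum> work = new ArrayList<>();
        for (int from = 0; from < values.length; from += chunkSize) {
            work.add(new ChunkSum(values, from, Math.min(from + chunkSize, values.length)));
        }

        // Each chunk is independent, so the tasks can run in parallel.
        work.parallelStream().forEach(ChunkSum::run);

        double total = 0;
        for (ChunkSum chunk : work) {
            total += chunk.partialSum;
        }
        return total / values.length;
    }

    public static void main(String[] args) {
        double[] values = new double[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 10;
        }
        System.out.println(average(values, 1000));
    }
}
```

The two parameters that the measurements vary, task count and values per task, correspond here to taskCount and chunkSize.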
This project has a parent module (gpu-maven-plugin) with two child GPU modules (gpu-mandelbrot-gpu and gpu-timings). Such projects trigger a bug in which the second child module (gpu-timings) throws a RuntimeException when mvn install is launched in the parent instead of the child. The workaround is to repeat 'mvn clean install' in the failing child (gpu-timings in this case) as instructed in the exception message.
Running mvn install in the parent module should build all children successfully. It fails because Rootbeer and Soot use singleton classes heavily, and those singletons hang onto state from the first GPU build, which pollutes subsequent ones in the same Maven run. Such deep structural flaws are nasty to fix, so use the workaround until I get to the bottom of this. Other workarounds will be added to the wiki as required.
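A toy illustration of how singleton state can leak from one build into the next when both run in the same JVM. BuildState and SingletonPollutionSketch are hypothetical and are not classes from Rootbeer or Soot; the kernel class names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical singleton that accumulates state during a build.
final class BuildState {
    private static final BuildState INSTANCE = new BuildState();
    private final List<String> kernelClasses = new ArrayList<>();

    private BuildState() {
    }

    static BuildState get() {
        return INSTANCE;
    }

    void register(String className) {
        kernelClasses.add(className);
    }

    List<String> kernelClasses() {
        return new ArrayList<>(kernelClasses);
    }

    // Clearing the state between builds is what a real fix would need.
    void reset() {
        kernelClasses.clear();
    }
}

class SingletonPollutionSketch {
    // Simulates building one module inside the current JVM and
    // returns the kernel classes the build believes it must process.
    static List<String> buildModule(String... kernelClassNames) {
        for (String name : kernelClassNames) {
            BuildState.get().register(name);
        }
        return BuildState.get().kernelClasses();
    }

    public static void main(String[] args) {
        // The first child module sees only its own kernel.
        System.out.println(buildModule("MandelbrotKernel"));
        // The second child, built in the same JVM, also sees the first
        // child's kernel because the singleton was never cleared.
        System.out.println(buildModule("AverageKernel"));
    }
}
```

Running the failing child on its own starts a fresh JVM with fresh singletons, which would explain why the workaround succeeds.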