Use same kernel launcher infrastructure for threading and GPUs

There is a great deal of commonality (and a few differences) between the code used for launching kernels with pthreads, OpenMP, cusp/CUDA and OpenCL. It would be good to unify the kernel launching code across all of these, instead of maintaining separate code paths for cusp/CUDA, OpenCL and the threading backends.
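As a rough illustration of what a shared launcher could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: `KernelLauncher`, `SerialLauncher`, `ThreadLauncher`, `RangeKernel` and `axpy` are hypothetical names, not existing code, and `std::thread` stands in for the real pthreads/OpenMP backends. A CUDA or OpenCL backend would implement the same interface but is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical names throughout; none of these exist in the current code base.

// A kernel is written once, as a function over a half-open index range.
using RangeKernel = std::function<void(std::size_t begin, std::size_t end)>;

// Common interface that each backend (threads, OpenMP, CUDA, OpenCL) would implement.
struct KernelLauncher {
    virtual ~KernelLauncher() = default;
    virtual void launch(std::size_t n, const RangeKernel& kernel) = 0;
};

// Backend 1: run the whole range on the calling thread.
struct SerialLauncher : KernelLauncher {
    void launch(std::size_t n, const RangeKernel& kernel) override {
        kernel(0, n);
    }
};

// Backend 2: split the range into contiguous chunks, one std::thread per chunk.
struct ThreadLauncher : KernelLauncher {
    explicit ThreadLauncher(std::size_t num_threads) : num_threads_(num_threads) {
        assert(num_threads_ > 0);
    }

    void launch(std::size_t n, const RangeKernel& kernel) override {
        // Ceiling division so that every index is covered.
        const std::size_t chunk = (n + num_threads_ - 1) / num_threads_;
        std::vector<std::thread> workers;
        for (std::size_t begin = 0; begin < n; begin += chunk) {
            const std::size_t end = std::min(n, begin + chunk);
            workers.emplace_back([&kernel, begin, end] { kernel(begin, end); });
        }
        for (auto& w : workers) w.join();
    }

private:
    std::size_t num_threads_;
};

// The operation y = a*x + y is written once and runs on whichever launcher is passed in.
void axpy(KernelLauncher& launcher, double a,
          const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    launcher.launch(x.size(), [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) y[i] += a * x[i];
    });
}
```

The point of the sketch is that `axpy` contains no backend-specific code; only the launcher differs. Device backends would also need to handle memory transfers, which is one of the "few differences" and is left out here.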

At the same time, setting up a unified system to manage kernel fusion would be fantastic.
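For reference, kernel fusion here means combining several element-wise operations into a single pass over the data, so intermediate results are not written back to memory between kernels. A minimal sketch of the idea, assuming hypothetical helpers `fuse` and `apply_elementwise` (neither exists in the current code) and CPU-only execution:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: combine two element-wise operations into one,
// applying f first and then g.
template <typename F, typename G>
auto fuse(F f, G g) {
    return [f, g](double v) { return g(f(v)); };
}

// Hypothetical helper: one pass over the data, applying op to each element in place.
template <typename Op>
void apply_elementwise(std::vector<double>& data, Op op) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = op(data[i]);
}
```

With these, `apply_elementwise(v, fuse(scale, shift))` makes one pass over `v`, where calling `apply_elementwise` twice would make two. A unified launcher could apply the same composition before dispatching to any backend.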

Allowing different parts of the vector to be handled by CPU and GPU kernels would be nice. It is silly to use only one of the two compute engines to perform operations when both are available.
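One possible shape for this, sketched under stated assumptions: the vector is split at a single index, the CPU takes the front part and the GPU takes the rest, and the two run concurrently. `Device`, `split_point` and `launch_hybrid` are hypothetical names, the "GPU" is just a callable here, and choosing the split fraction (for example from measured throughput) is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical names throughout; this is a sketch, not existing code.

// A kernel over a half-open index range [begin, end).
using RangeKernel = std::function<void(std::size_t begin, std::size_t end)>;

// A "device" is anything that can run a kernel over a sub-range.
using Device =
    std::function<void(std::size_t begin, std::size_t end, const RangeKernel& kernel)>;

// Index at which the vector is split: [0, mid) goes to the CPU, [mid, n) to the GPU.
std::size_t split_point(std::size_t n, double gpu_fraction) {
    assert(gpu_fraction >= 0.0 && gpu_fraction <= 1.0);
    const auto gpu_count =
        static_cast<std::size_t>(static_cast<double>(n) * gpu_fraction);
    return n - gpu_count;
}

// Run both parts concurrently: the GPU part is driven from a helper thread,
// the CPU part runs on the calling thread.
void launch_hybrid(std::size_t n, double gpu_fraction,
                   const Device& cpu, const Device& gpu,
                   const RangeKernel& kernel) {
    const std::size_t mid = split_point(n, gpu_fraction);
    std::thread gpu_worker([&] { gpu(mid, n, kernel); });
    cpu(0, mid, kernel);
    gpu_worker.join();
}
```

A real implementation would also have to keep the GPU part of the vector resident in device memory and deal with exceptions thrown while the helper thread is running; both are ignored in this sketch.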

