fused kernel: V^T*A*V without storing A*V
I'm thinking about a memory-efficient Jacobi-Davidson variant. Currently we need to store both the subspace V (a densemat with dozens or hundreds of columns) and AV, just because we need to construct V^T*A*V. Assuming that the communication buffers of V are filled and V doesn't change, we could compute V^T*A*V without MPI communication and save about 50% of the memory usage (if V is the dominant memory consumer).
This is particularly interesting for GPUs, but also for KNL to increase the MCDRAM utilization.
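To make the idea concrete, here is a minimal sketch (hypothetical names, plain C, CSR storage, row-major densemat; not GHOST code) of how V^T*A*V can be accumulated row by row without ever materializing A*V:

```c
/* Hypothetical sketch, not the GHOST API: accumulate M = V^T * A * V for an
 * n x n CSR matrix A and a row-major tall-skinny V (n x k) without storing
 * the intermediate A*V.  Row i of A*V lives only in a small temporary and is
 * folded into the k x k result via a rank-1 update. */
enum { KMAX = 64 }; /* assumed upper bound on the block size k for this sketch */

void fused_vtav(int n, int k,
                const int *rowptr, const int *col, const double *val,
                const double *V, double *M /* k x k, row-major */)
{
    double avi[KMAX]; /* row i of A*V */
    for (int i = 0; i < k * k; i++) M[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < k; c++) avi[c] = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)  /* avi = (A*V)(i,:) */
            for (int c = 0; c < k; c++)
                avi[c] += val[j] * V[col[j] * k + c];
        for (int r = 0; r < k; r++)                      /* M += V(i,:)^T * avi */
            for (int c = 0; c < k; c++)
                M[r * k + c] += V[i * k + r] * avi[c];
    }
}
```

Only k doubles of temporary storage per row are needed instead of an n x k buffer for A*V; with k in the dozens-to-hundreds range, blocking and vectorizing the rank-1 update is where kernel specialization would come in.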
Comments (8)

reporter 
Do you require this as a full matrix-matrix-matrix multiplication, resulting in a small, square matrix, or just column-wise dot products of V, resulting in a vector of dot products?
The second case can be implemented as a special case of the SpMMV kernel.
The first case is more similar to a TSMTTSM/GEMM with an SpMV in the middle and should be implemented as a separate kernel. Does this operation occupy a large share of the runtime? The parameter range (dozens to hundreds of columns) spans a wide range of computational intensities, which makes it more challenging to achieve high efficiency everywhere.
In both cases this fused operation is certainly feasible, and I would like to implement it.
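For comparison, the second case (column-wise dot products, i.e. only the diagonal of V^T*A*V) reduces to the following, again a hypothetical sketch with the same CSR/row-major conventions as above, not GHOST code:

```c
/* Hypothetical sketch: d[c] = V(:,c)^T * A * V(:,c) for each column c, i.e.
 * only the diagonal of V^T * A * V -- the case that fits an augmented SpMMV.
 * A per-row temporary holds row i of A*V, as in the full fused kernel. */
enum { KMAX_DOT = 64 }; /* assumed bound on k for this sketch */

void fused_col_dots(int n, int k,
                    const int *rowptr, const int *col, const double *val,
                    const double *V, double *d /* length k */)
{
    double avi[KMAX_DOT]; /* row i of A*V */
    for (int c = 0; c < k; c++) d[c] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < k; c++) avi[c] = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            for (int c = 0; c < k; c++)
                avi[c] += val[j] * V[col[j] * k + c];
        for (int c = 0; c < k; c++)  /* only matching columns, no outer product */
            d[c] += V[i * k + c] * avi[c];
    }
}
```

The difference to the first case is only the last inner loop: a length-k accumulation instead of a k x k rank-1 update, hence the much lower computational intensity.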


 changed status to open

reporter First case, yes. I think it's a good kernel to offer because many numerical schemes project the sparse matrix onto some subspace (i.e. compute V^T*A*V) and then do something with that small, dense matrix. In my particular use case, B is typically very sparse and cheap to apply.

Add new kernel tsmtspmtsm, plain CPU version is functional, re #309
→ <<cset b5fb7ecb2307>>

Merge branch tsmtspmtsm into devel, re #309
→ <<cset 3c0d5d2f6718>>

The fused kernel's signature is
ghost_error ghost_tsmtspmtsm(ghost_densemat x, ghost_densemat v, ghost_densemat w, ghost_sparsemat A, void* alpha, void* beta, int reduce);
It computes x = alpha * v^T*A*w + beta*x. Mixed precision is supported. The reduce argument works the same way as for tsmttsm; a similar option could be implemented to avoid unnecessary MPI communication for v.
Currently only basic CPU and CUDA variants are implemented, without specialization and with limited optimization.
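As a semantic reference (not the GHOST implementation, and ignoring the reduce/MPI aspect), the operation x = alpha * v^T*A*w + beta*x can be spelled out as follows, assuming double precision, CSR storage, and row-major densemats:

```c
/* Reference semantics of the fused kernel, x = alpha * v^T * A * w + beta * x,
 * for an n x n CSR matrix A, v (n x kv) and w (n x kw) row-major, x (kv x kw).
 * This only documents what the tuned kernel computes; it is not GHOST code. */
enum { KWMAX = 64 }; /* assumed bound on kw for this sketch */

void ref_tsmtspmtsm(int n, int kv, int kw,
                    const int *rowptr, const int *col, const double *val,
                    const double *v, const double *w,
                    double alpha, double beta, double *x /* kv x kw */)
{
    double awi[KWMAX]; /* row i of A*w */
    for (int i = 0; i < kv * kw; i++) x[i] *= beta;
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < kw; c++) awi[c] = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)  /* awi = (A*w)(i,:) */
            for (int c = 0; c < kw; c++)
                awi[c] += val[j] * w[col[j] * kw + c];
        for (int r = 0; r < kv; r++)                     /* x += alpha * v(i,:)^T * awi */
            for (int c = 0; c < kw; c++)
                x[r * kw + c] += alpha * v[i * kv + r] * awi[c];
    }
}
```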
@te42kyfo, what do you think about this type of kernel? Does it make sense on GPUs and/or PHI? It would save up to 50% memory in Jacobi-Davidson, so we could fit more into the HBM or onto the GPU.