fused kernel: V^T*A*V without storing A*V

Issue #309 open
Jonas Thies created an issue

I'm thinking about a memory efficient Jacobi-Davidson variant, currently we need to store the subspace V (densemat with dozens or hundreds of columns) and Av=AV just because we need to construct V^TAV. Assuming that the communication buffers of V are filled and V doesn't Change we could just do the V^TA*V without MPI communication and save about 50% memory usage (if V is the dominant Memory consumer).

This is particularly interesting for GPUs but also for KNL to increase the MCDRAM utilization

Comments (8)

  1. Jonas Thies reporter

    @te42kyfo , what do you think about this type of kernel, does that make sense on GPUs and/or PHI? It would save up to 50% memory in Jacobi-Davidson so we could fit more into the HBM/onto the GPU.

  2. Dominik Ernst

    Do you require this as a full matrix-matrix-matrix multiplication, resulting in a small, square matrix, or just columnnwise dot products of V, resulting in a vector of dot products.

    The second case can be implemented as a special case of the SpMMV kernel.

    The first case is more similar to a TSMTTSM/GEMM with a SpMV in the middle, and should be implemented in a separate kernel. Does this operation occupy a large share of runtime? The parameter range (dozens to hundreds) spans a large range of computational intensities, which makes it more challenging to achieve high efficiencies everywhere.

    In both cases, this fused operation is certainly feasable, and I would like to implement it.

  3. Jonas Thies reporter

    First case, yes. I think it's a good kernel to offer because many numerical schemes Project the sparse Matrix onto some subspace (i.e. compute V^TAV) and then do something with that small, dense Matrix. In my particular use case, B is typically very sparse and cheap to apply.

  4. Dominik Ernst

    The fused Kernel's signature is

        ghost_error ghost_tsmtspmtsm(ghost_densemat x, ghost_densemat v, ghost_densemat w, ghost_sparsemat A, void* alpha, void* beta, int reduce);

    It computes x = alpha * v^TAw + beta*x. Mixed precision is supported. The reduce argument works the same way as for tsmttsm. A similar option can be implemented to avoid unnecessary MPI communication for v.

    Currently only a basic CPU and CUDA variant is implemented, without specialization and limited optimization.

  5. Log in to comment