Hola SpMV: Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU


Hola SpMV provides an efficient sparse matrix vector multiplication for NVIDIA GPUs. It takes a CSR matrix (M) and a vector (a) as input and computes b=M*a. Hola SpMV performs load balancing between thread blocks. For this load balancing it requires an additional GPU buffer. This buffer must be allocated before running SpMV and will be filled as a first step during SpMV. The code is currently under development and will be updated in terms of useability, readability and performance.

Recent updates

  • 3.11.2018 first integration of Hola SpMVT
  • 3.11.2018 CMake support
  • 3.11.2018 fix for loading symmetric and hermitian matrices
  • 27.6.2017 Initial upload of naive SpMV, naive SpMVT and Hola SpMV

Expected updates

  • Tuned parameters for different GPU generations
  • Performance optimizations for small matrices

Build and usage

The code is split into source code and header files. The important code is found in source/ The other classes and headers are mostly helpers to load matrices, convert data, compute ground thruth results etc.

Hola SpMV is built on CUDA. You need to install the CUDA Toolkit and have an NVIDIA GPU with compute capability 3.0 or higher in your system. It has been developed with CUDA Toolkit 8.0 and optimized for Maxwell and Pascal GPUs.

Hola SpMV uses the reduction found in cub. Before building, make sure to clone cub in the deps/. Under Windows you can use the provided pull.bat.

Currently, build files are included for Visual Studio 2015 under build/vs2015 and CMake files for building and running under Linux/Windows. When using CMake, ensure that you build device code for your GPU generation (see CMake options).

The executeable loads an mtx file (supplied as the first command argument), converts it to CSR, stores the converted matrix as a binary file for reuse. It then computes the ground truth SpMV on the CPU before running naiveSpMV and holaSpMV. Hola SpMV requires an additional buffer for load balancing, which is to be executed on the GPU before launching SpMV. To query the size of the buffer, call the HolaSpMV function with a nullptr.

mtx matrices can be downloaded from SuiteSparse. Hola has been evaluated on all reasonably sized SuiteSparse matrices.


Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU

When you use the code in a scientific work, please cite our paper

Globally Homogeneous, Locally Adaptive Sparse Matrix-Vector Multiplication on the GPU
Markus Steinberger, Rhaleb Zayer and Hans-Peter Seidel
Proceedings of the International Conference on Supercomputing, 2017

     author = {Steinberger, Markus and Zayer, Rhaleb and Seidel, Hans-Peter},
     title = {Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU},
     booktitle = {Proceedings of the International Conference on Supercomputing},
     series = {ICS '17},
     year = {2017},
     isbn = {978-1-4503-5020-4},
     location = {Chicago, Illinois},
     pages = {13:1--13:11},
     articleno = {13},
     numpages = {11},
     url = {},
     doi = {10.1145/3079079.3079086},
     acmid = {3079086},
     publisher = {ACM},
     address = {New York, NY, USA},
     keywords = {GPU, SpMV, linear algebra, sparse matrix},
How naive is naive SpMV on the GPU?

The source code includes our naive SpMV and transpose SpMV implementation. If you use this code, please cite:

How naive is naive SpMV on the GPU?
Markus Steinberger, Andreas Derler, Rhaleb Zayer and Hans-Peter Seidel
IEEE High Performance Extreme Computing Conference, 2016

    author={Markus Steinberger and Andreas Derler and Rhaleb Zayer and Hans-Peter Seidel},
    booktitle={2016 IEEE High Performance Extreme Computing Conference (HPEC)},
    title={How naive is naive SpMV on the GPU?},
    keywords={cache storage;data handling;graphics processing units;matrix multiplication;parallel processing;sparse matrices;GPU hardware;cache performance;complex data format;data conversion;direct multiplication;fast hardware supported atomic operation;format conversion;graphics hardware;highly tuned parallel implementation;linear algebra computation;multiplication transposition;naive SpMV;sparse matrix vector multiplication;transpose operation;Bandwidth;Graphics processing units;Hardware;Instruction sets;Load management;Memory management;Sparse matrices},


Markus Steinberger and Rhaleb Zayer.

Paper Graphs

The supplemental performance plots can be found here:

Operation Size Format Link
SpMV small float pdf
SpMV large float pdf
SpMV small double pdf
SpMV large double pdf
SpMVT small float pdf
SpMVT large float pdf
SpMVT small double pdf
SpMVT large double pdf