Parallelization slows down the computation of SpMV with CSC and SpMV_T with CSR

Issue #438 new
Hong Sung Park created an issue

Hello, I tried to benchmark the performance of sparse matrix-dense vector (SpMV) multiplication and transposed sparse matrix-dense vector (SpMV_T) multiplication.

The tests were done using unsymmetric square matrices from the SuiteSparse Matrix Collection (`https://sparse.tamu.edu/`).

For example: `https://sparse.tamu.edu/Schenk_ISEI/ohne2`

In serial computation (without the -fopenmp flag), the expected behaviour was observed (times in milliseconds):

Blaze CSC (Column Major)
SpMV: 2.64e-07 9.28e-07 9.28e-07 8.55e-07 2.64e-07 CPU TIME: 18
SpMV_T: 9.67e+01 1.97e+02 1.97e+02 3.85e+02 -3.86e+00 CPU TIME: 16
Blaze CSR (Row Major)
SpMV: 2.64e-07 9.28e-07 9.28e-07 8.55e-07 2.64e-07 CPU TIME: 18
SpMV_T: 9.67e+01 1.97e+02 1.97e+02 3.85e+02 -3.86e+00 CPU TIME: 16

But when the -fopenmp flag is enabled, the behaviour becomes odd:

Blaze CSC
SpMV: 2.64e-07 9.28e-07 9.28e-07 8.55e-07 2.64e-07 CPU TIME: 22
SpMV_T: 9.67e+01 1.97e+02 1.97e+02 3.85e+02 -3.86e+00 CPU TIME: 5
Blaze CSR
SpMV: 2.64e-07 9.28e-07 9.28e-07 8.55e-07 2.64e-07 CPU TIME: 4
SpMV_T: 9.67e+01 1.97e+02 1.97e+02 3.85e+02 -3.86e+00 CPU TIME: 24

I expected that SpMV_T with CSC and SpMV with CSR would be sped up, since those combinations can be parallelized without a critical section: each thread computes a dot product and writes to its own entry of the result.

But for SpMV with CSC and SpMV_T with CSR, I expected the computation speed to remain roughly the same since, as far as I know, a critical section (or atomic updates) is unavoidable when the partial results are scattered into the output vector.

However, the observation shows that parallelization actually slows these combinations down, and the slowdown gets worse as the matrix size increases.
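
To make the expectation concrete, here is a minimal hand-rolled sketch of the two access patterns (this is my own illustration, not Blaze's actual kernels; p, i, and v are the usual offset/index/value arrays of the compressed format):

#include <cstdint>

// Row-parallel SpMV on CSR: each thread owns one entry of b, so no
// synchronization is needed. (SpMV_T on CSC has the same structure.)
void spmv_csr(uint32_t num_row, const uint32_t *p, const uint32_t *i,
              const double *v, const double *x, double *b)
{
    #pragma omp parallel for
    for (uint32_t row = 0; row < num_row; ++row)
    {
        double sum = 0.0;
        for (uint32_t idx_nz = p[row]; idx_nz < p[row + 1]; ++idx_nz)
        {
            sum += v[idx_nz] * x[i[idx_nz]];
        }
        b[row] = sum;  // private write, no race
    }
}

// Column-parallel SpMV on CSC: threads scatter into the shared b, so every
// update needs an atomic (or a critical section); b must be zeroed first.
// (SpMV_T on CSR has the same structure.)
void spmv_csc(uint32_t num_col, const uint32_t *p, const uint32_t *i,
              const double *v, const double *x, double *b)
{
    #pragma omp parallel for
    for (uint32_t col = 0; col < num_col; ++col)
    {
        for (uint32_t idx_nz = p[col]; idx_nz < p[col + 1]; ++idx_nz)
        {
            #pragma omp atomic
            b[i[idx_nz]] += v[idx_nz] * x[col];  // shared write, must synchronize
        }
    }
}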

The following is the code I used for the test. GSL is used to load the MTX (Matrix Market) file; please let me know if Blaze has a similar feature.

#include <cstdio>   // FILE, fopen64, fclose
#include <iostream>
#include <iomanip>  // std::setw, std::setprecision
#include <chrono>
#include <string>

#include <blaze/Blaze.h>
#include <blaze/Forward.h>
#include <gsl/gsl_spmatrix.h>
#include <gsl/gsl_spblas.h>

const double m_alpha = 1.0;
const double m_beta = 0.0;
uint32_t print_upto = 5;

int main()
{
    std::string test_case = "ohne2.mtx";
    FILE *load_mtx = fopen64(test_case.c_str(), "r");
    if (load_mtx == nullptr)
    {
        std::cerr << "Could not open " << test_case << "\n";
        return 1;
    }
    gsl_spmatrix *gsl_mat_A = gsl_spmatrix_fscanf(load_mtx);
    fclose(load_mtx);

    uint32_t num_row = gsl_mat_A->size1;
    uint32_t num_col = gsl_mat_A->size2;
    uint32_t num_nz = gsl_mat_A->nz;

    // Build vector x and vector b for b = A*x. The test matrices are square
    // (num_row == num_col), so b can be reused for the transposed product.
    blaze::DynamicVector<double, blaze::columnVector> m_vec_x(num_col, 1.0);
    blaze::DynamicVector<double, blaze::columnVector> m_vec_b(num_row, 0.0);

    // Build the Blaze CSC matrix from GSL's compressed-column representation
    gsl_spmatrix *in_CSC = gsl_spmatrix_ccs(gsl_mat_A);
    blaze::CompressedMatrix<double, blaze::columnMajor> m_mat_A_CSC(num_row, num_col);
    m_mat_A_CSC.reserve(num_nz);
    for (uint32_t idx_ptr = 0; idx_ptr < num_col; ++idx_ptr)
    {
        uint32_t col_start = in_CSC->p[idx_ptr];
        uint32_t col_end = in_CSC->p[idx_ptr + 1];
        for (uint32_t idx_nz = col_start; idx_nz < col_end; ++idx_nz)
        {
            m_mat_A_CSC.append(in_CSC->i[idx_nz], idx_ptr, in_CSC->data[idx_nz]);
        }
        m_mat_A_CSC.finalize(idx_ptr);
    }

    std::cout << "Blaze CSC\n";
    std::cout << "SpMV:    ";
    auto timer_start = std::chrono::steady_clock::now();  // wall-clock time; the "CPU TIME" label below reports elapsed ms
    // m_vec_b = m_alpha * m_mat_A_CSC * m_vec_x + m_beta * m_vec_b;
    m_vec_b = m_mat_A_CSC * m_vec_x;
    auto timer_stop = std::chrono::steady_clock::now();
    for (uint32_t iter = 0; iter < print_upto; ++iter)
    {
        std::cout << std::right << std::setw(10) << std::scientific << std::setprecision(2) << m_vec_b[iter] << "\t";
    }
    std::cout << "CPU TIME: " << std::chrono::duration_cast<std::chrono::milliseconds>(timer_stop - timer_start).count();
    std::cout << std::endl;

    std::cout << "SpMV_T:  ";
    timer_start = std::chrono::steady_clock::now();
    m_vec_b = blaze::trans(m_mat_A_CSC) * m_vec_x;
    timer_stop = std::chrono::steady_clock::now();
    for (uint32_t iter = 0; iter < print_upto; ++iter)
    {
        std::cout << std::right << std::setw(10) << std::scientific << std::setprecision(2) << m_vec_b[iter] << "\t";
    }
    std::cout << "CPU TIME: " << std::chrono::duration_cast<std::chrono::milliseconds>(timer_stop - timer_start).count();
    std::cout << std::endl;

    // Build the Blaze CSR matrix from GSL's compressed-row representation
    gsl_spmatrix *in_CSR = gsl_spmatrix_crs(gsl_mat_A);
    blaze::CompressedMatrix<double, blaze::rowMajor> m_mat_A_CSR(num_row, num_col);
    m_mat_A_CSR.reserve(num_nz);
    for (uint32_t idx_ptr = 0; idx_ptr < num_row; ++idx_ptr)  // CSR iterates over rows, not columns
    {
        uint32_t row_start = in_CSR->p[idx_ptr];
        uint32_t row_end = in_CSR->p[idx_ptr + 1];
        for (uint32_t idx_nz = row_start; idx_nz < row_end; ++idx_nz)
        {
            m_mat_A_CSR.append(idx_ptr, in_CSR->i[idx_nz], in_CSR->data[idx_nz]);
        }
        m_mat_A_CSR.finalize(idx_ptr);
    }

    std::cout << "Blaze CSR\n";
    std::cout << "SpMV:    ";
    timer_start = std::chrono::steady_clock::now();
    // m_vec_b = m_alpha * m_mat_A_CSR * m_vec_x + m_beta * m_vec_b;
    m_vec_b = m_mat_A_CSR * m_vec_x;
    timer_stop = std::chrono::steady_clock::now();
    for (uint32_t iter = 0; iter < print_upto; ++iter)
    {
        std::cout << std::right << std::setw(10) << std::scientific << std::setprecision(2) << m_vec_b[iter] << "\t";
    }
    std::cout << "CPU TIME: " << std::chrono::duration_cast<std::chrono::milliseconds>(timer_stop - timer_start).count();
    std::cout << std::endl;

    std::cout << "SpMV_T:  ";
    timer_start = std::chrono::steady_clock::now();
    m_vec_b = blaze::trans(m_mat_A_CSR) * m_vec_x;
    timer_stop = std::chrono::steady_clock::now();
    for (uint32_t iter = 0; iter < print_upto; ++iter)
    {
        std::cout << std::right << std::setw(10) << std::scientific << std::setprecision(2) << m_vec_b[iter] << "\t";
    }
    std::cout << "CPU TIME: " << std::chrono::duration_cast<std::chrono::milliseconds>(timer_stop - timer_start).count();
    std::cout << std::endl;

    gsl_spmatrix_free(in_CSR);
    gsl_spmatrix_free(in_CSC);
    gsl_spmatrix_free(gsl_mat_A);

    return 0;
}
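
As an aside on the measurement itself: each product above is timed exactly once, so the first parallel run also pays the one-off cost of spinning up the OpenMP thread pool and warming the caches. A small sketch of a more robust timing harness I would use (the helper name and run_count parameter are mine, not Blaze's):

#include <chrono>
#include <cstdint>

// Repeat an operation and keep the best wall-clock time in milliseconds,
// so one-off costs (thread start-up, cold caches) do not dominate the result.
template <typename Op>
long best_time_ms(Op &&op, uint32_t run_count = 10)
{
    long best = -1;
    for (uint32_t run = 0; run < run_count; ++run)
    {
        auto start = std::chrono::steady_clock::now();
        op();
        auto stop = std::chrono::steady_clock::now();
        long ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        if (best < 0 || ms < best)
        {
            best = ms;
        }
    }
    return best;
}

// Usage: std::cout << best_time_ms([&] { m_vec_b = m_mat_A_CSR * m_vec_x; });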

The code was compiled with

  • GCC - g++ (Ubuntu 11.2.0-19ubuntu1) 11.2.0

    • For parallel: g++ -m64 -std=c++17 -O2 -I /usr/local/include -I /usr/local/include/blaze validate_para_blz.cpp -o val_par_blaze -lgsl -lgslcblas -fopenmp
    • For serial: g++ -m64 -std=c++17 -O2 -I /usr/local/include -I /usr/local/include/blaze validate_para_blz.cpp -o val_par_blaze -lgsl -lgslcblas
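
In case it matters, this is how I would confirm how many threads Blaze actually uses at runtime (as far as I know, blaze::getNumThreads()/blaze::setNumThreads() are Blaze's documented shared-memory API, and the count can also be set via the BLAZE_NUM_THREADS or OMP_NUM_THREADS environment variables):

#include <blaze/Blaze.h>
#include <iostream>

int main()
{
    // Report the number of threads Blaze will use for parallel kernels.
    std::cout << "Blaze threads: " << blaze::getNumThreads() << "\n";
    blaze::setNumThreads(4);  // pin the thread count for reproducible timings
    std::cout << "Blaze threads: " << blaze::getNumThreads() << "\n";
    return 0;
}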

Is there some mistake in the way I am using parallelization?
