BLAS Level 1
- Dense Vector Addition
- Daxpy
BLAS Level 2
BLAS Level 3
Matrix Transpose
- Dense Matrix Transpose
- Sparse Matrix Transpose

The following selected benchmarks give an impression of the single- and multi-core performance of the Blaze library. In the single-core benchmarks, Blaze 3.0 (released August 24th, 2016) is compared to the following third-party libraries:

Armadillo, version 7.300.1
Blitz++, version 0.10
Boost uBLAS, version 1.61
Eigen3, version 3.3-beta2
GMM++, version 5.0
MTL4, version 4.0.9555
Intel MKL, version 14.0

The benchmark system is an Intel Xeon E5-2650V3 ("Haswell EP") CPU at 2.3 GHz base frequency with 25 MByte of shared L3 cache. Due to the “Turbo Mode” feature, the processor can increase the clock speed depending on load and temperature. In order to produce reliable single-core results, we turned off the “Turbo Mode” and fixed the clock speed at 2.3 GHz.

The maximum achievable memory bandwidth (as measured by the STREAM benchmark) is about 55.6 GByte/s. Each core has a theoretical peak performance of sixteen flops per cycle in double precision (DP) using AVX (“Advanced Vector Extensions”) vector instructions and FMA. A single core of the Xeon CPU can execute two AVX add and two AVX multiply operations per cycle (assuming that FMA can be used). Full in-cache performance can only be achieved with SIMD-vectorized code. This includes loads and stores, which exist in full-width (AVX) vectorized, half-width (SSE) vectorized, and “scalar” variants. A maximum of one 256-bit wide AVX load and one 128-bit wide store can be sustained per cycle. 256-bit wide AVX stores thus have a two-cycle throughput.
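
To put the graphs below into perspective, the quoted machine parameters can be turned into rough upper bounds. The following is only a sketch: the function names are hypothetical, and the flop and byte counts per loop iteration are assumptions of this example (8-byte doubles, write-allocate traffic ignored), not figures taken from the benchmark suite.

```cpp
#include <cassert>

// Machine parameters as quoted in the text above.
const double clock_hz        = 2.3e9;   // fixed clock speed, Turbo Mode off
const double flops_per_cycle = 16.0;    // double precision, AVX with FMA
const double stream_bw       = 55.6e9;  // STREAM bandwidth in byte/s

// Theoretical in-cache peak of a single core in MFlop/s.
inline double peak_mflops()
{
   return clock_hz * flops_per_cycle / 1.0e6;
}

// Upper bound in MFlop/s for a kernel limited by memory bandwidth.
// `flops` and `bytes` are counted per loop iteration (assumed values).
inline double bandwidth_bound_mflops( double flops, double bytes )
{
   assert( flops > 0.0 && bytes > 0.0 );
   return stream_bw * flops / bytes / 1.0e6;
}
```

Under these assumptions, `peak_mflops()` gives 36800 MFlop/s per core. For out-of-cache vector addition (one flop per 24 bytes moved), `bandwidth_bound_mflops( 1.0, 24.0 )` gives about 2317 MFlop/s, and for daxpy (two flops per 24 bytes) about 4633 MFlop/s.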

The GNU g++ 6.1 compiler was used with the following compiler flags:

g++ -Wall -Wshadow -Woverloaded-virtual -ansi -pedantic -O3 -mavx -mfma -fopenmp -DNDEBUG -DMTL_HAS_BLAS ...

All libraries are benchmarked as given, but configured such that maximum performance can be achieved. We only show double precision results, plotted in MFlop/s, for each test case. For all in-cache benchmarks we make sure that the data has already been loaded into the cache.
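
As an illustration of how an MFlop/s figure for an in-cache run might be obtained, the sketch below times a plain-loop daxpy after one untimed warm-up pass. This is not the harness used for the results on this page; the function names, the warm-up strategy, and the count of two flops per element are assumptions of this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Plain-loop daxpy used as a stand-in kernel: b += a * alpha.
void daxpy( const std::vector<double>& a, std::vector<double>& b, double alpha )
{
   assert( a.size() == b.size() );
   for( std::size_t i = 0; i < a.size(); ++i )
      b[i] += a[i] * alpha;
}

// Runs the kernel once untimed (warm-up), then `reps` times under the
// clock, and returns the measured rate in MFlop/s. The caller owns `b`,
// so the computed values stay observable after the measurement.
double measure_daxpy_mflops( const std::vector<double>& a,
                             std::vector<double>& b, std::size_t reps )
{
   assert( !a.empty() && reps > 0 );

   daxpy( a, b, 0.001 );  // warm-up: touches the data before timing starts

   const auto start = std::chrono::steady_clock::now();
   for( std::size_t r = 0; r < reps; ++r )
      daxpy( a, b, 0.001 );
   const auto stop = std::chrono::steady_clock::now();

   const double seconds = std::chrono::duration<double>( stop - start ).count();
   // Assumed flop count: one multiply and one add per element.
   const double flops = 2.0 * static_cast<double>( a.size() )
                            * static_cast<double>( reps );
   return flops / seconds / 1.0e6;
}
```

A call such as `measure_daxpy_mflops( a, b, 1000 )` with vectors small enough to fit into the cache would then yield one data point of an in-cache curve. The number of repetitions should be large enough that the total run time is well above the clock resolution.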

Please note that, due to the continued development of all libraries, the performance results are subject to change. Also note that the releases used may not be the most recent ones. We are currently updating the results with the newest releases of all libraries.

BLAS Level 1

Dense Vector Addition

blaze::DynamicVector<double> a( N ), b( N ), c( N );
// ... Initialization of the vectors
c = a + b;

images/dvecdvecadd.jpg

Daxpy

blaze::DynamicVector<double> a( N ), b( N );
// ... Initialization of the vectors
b += a * 0.001;

images/daxpy.jpg

BLAS Level 2

Row-major Dense Matrix/Vector Multiplication

blaze::DynamicMatrix<double,rowMajor> A( N, N );
blaze::DynamicVector<double> a( N ), b( N );
// ... Initialization of the matrix and the vector a
b = A * a;

images/dmatdvecmult.jpg

Column-major Dense Matrix/Vector Multiplication

blaze::DynamicMatrix<double,columnMajor> A( N, N );
blaze::DynamicVector<double> a( N ), b( N );
// ... Initialization of the matrix and the vector a
b = A * a;

images/tdmatdvecmult.jpg

Column-major Dense Matrix/Dense Matrix Addition

blaze::DynamicMatrix<double,columnMajor> A( N, N ), B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A + B;

images/tdmattdmatadd.jpg

BLAS Level 3

Row-major Dense Matrix/Dense Matrix Multiplication

blaze::DynamicMatrix<double,rowMajor> A( N, N ), B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;

Please note that, due to the beta state of the Eigen release used, the OpenMP parallelization of its matrix multiplication did not work as expected.

images/dmatdmatmult.jpg

Column-major Dense Matrix/Dense Matrix Multiplication

blaze::DynamicMatrix<double,columnMajor> A( N, N ), B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;

Please note that, due to the beta state of the Eigen release used, the OpenMP parallelization of its matrix multiplication did not work as expected.

images/tdmattdmatmult.jpg

Row-major Dense Matrix/Sparse Matrix Multiplication

blaze::DynamicMatrix<double,rowMajor> A( N, N ), C( N, N );
blaze::CompressedMatrix<double,rowMajor> B( N, N );
// ... Initialization of the matrices
C = A * B;

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.
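
One way such a fill pattern could be generated is sketched below with standard C++ only. The helper `random_pattern` is hypothetical, and giving every row the same number of non-zeros is an assumption of this example; the benchmark suite may distribute the entries differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns, for each of the n rows, the sorted column
// indices of its non-zero entries. Every row receives round(fill * n)
// distinct columns, chosen uniformly at random.
std::vector< std::vector<std::size_t> >
random_pattern( std::size_t n, double fill, unsigned seed )
{
   assert( fill >= 0.0 && fill <= 1.0 );
   const std::size_t per_row =
      static_cast<std::size_t>( fill * static_cast<double>( n ) + 0.5 );

   std::mt19937 rng( seed );
   std::vector<std::size_t> columns( n );
   std::iota( columns.begin(), columns.end(), std::size_t( 0 ) );

   std::vector< std::vector<std::size_t> > pattern( n );
   for( std::size_t i = 0; i < n; ++i ) {
      // Shuffle all column indices and keep the first per_row of them.
      std::shuffle( columns.begin(), columns.end(), rng );
      pattern[i].assign( columns.begin(),
                         columns.begin() + static_cast<std::ptrdiff_t>( per_row ) );
      std::sort( pattern[i].begin(), pattern[i].end() );
   }
   return pattern;
}
```

Calling `random_pattern( N, 0.05, seed )` yields the positions for a 5% fill. The indices could then be inserted row by row into a `blaze::CompressedMatrix`, with the values themselves drawn from any convenient distribution.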

images/dmatsmatmult.jpg

Column-major Dense Matrix/Sparse Matrix Multiplication

blaze::DynamicMatrix<double,columnMajor> A( N, N ), C( N, N );
blaze::CompressedMatrix<double,columnMajor> B( N, N );
// ... Initialization of the matrices
C = A * B;

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

images/tdmattsmatmult.jpg

Row-major Sparse Matrix/Dense Matrix Multiplication

blaze::CompressedMatrix<double,rowMajor> A( N, N );
blaze::DynamicMatrix<double,rowMajor> B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

images/smatdmatmult.jpg

Column-major Sparse Matrix/Dense Matrix Multiplication

blaze::CompressedMatrix<double,columnMajor> A( N, N );
blaze::DynamicMatrix<double,columnMajor> B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

images/tsmattdmatmult.jpg

Matrix Transpose

Dense Matrix Transpose

blaze::DynamicMatrix<double,rowMajor> A( N, N ), B( N, N );
// ... Initialization of the matrices
B = trans( A );

images/dmattrans.jpg

Sparse Matrix Transpose

blaze::CompressedMatrix<double,rowMajor> A( N, N ), B( N, N );
// ... Initialization of the matrices
B = trans( A );

images/smattrans.jpg