# blaze / Benchmarks

- BLAS Level 1
- BLAS Level 2
- BLAS Level 3
- Row-major Dense Matrix/Dense Matrix Multiplication
- Column-major Dense Matrix/Dense Matrix Multiplication
- Row-Major Dense Matrix/Sparse Matrix Multiplication
- Column-Major Dense Matrix/Sparse Matrix Multiplication
- Row-Major Sparse Matrix/Dense Matrix Multiplication
- Column-Major Sparse Matrix/Dense Matrix Multiplication
- Matrix Transpose

The following selected benchmarks give an impression of the single- and multi-core performance of the **Blaze** library. In the single core benchmarks, **Blaze** 3.0 (released August 24th, 2016) is compared to the following third-party libraries:

- Armadillo, version 7.300.1
- Blitz++, version 0.10
- Boost uBLAS, version 1.61
- Eigen3, version 3.3-beta2
- GMM++, version 5.0
- MTL4, version 4.0.9555
- Intel MKL, version 14.0

The benchmark system is an **Intel Xeon E5-2650V3 ("Haswell EP") CPU at 2.3 GHz** base frequency with 25 MByte of shared L3 cache. Due to the “Turbo Mode” feature, the processor can increase the clock speed depending on load and temperature. In order to produce reliable single core results, we turned off “Turbo Mode” and fixed the clock speed at 2.3 GHz.

The maximum achievable memory bandwidth (as measured by the STREAM benchmark) is about 55.6 GByte/s. Each core has a theoretical peak performance of sixteen flops per cycle in double precision (DP) using AVX (“Advanced Vector Extensions”) vector instructions and FMA: a single core of the Xeon CPU can execute two AVX add and two AVX multiply operations per cycle (assuming that FMA can be used). Full in-cache performance can only be achieved with SIMD-vectorized code. This includes loads and stores, which exist in full-width (AVX) vectorized, half-width (SSE) vectorized, and “scalar” variants. A maximum of one 256-bit wide AVX load and one 128-bit wide store can be sustained per cycle; 256-bit wide AVX stores thus have a throughput of two cycles.

The **GNU g++ 6.1** compiler was used with the following compiler flags:

```
g++ -Wall -Wshadow -Woverloaded-virtual -ansi -pedantic -O3 -mavx -mfma -fopenmp -DNDEBUG -DMTL_HAS_BLAS ...
```

All libraries are benchmarked as given, but configured such that maximum performance can be achieved. We show only double precision results, as MFlop/s graphs, for each test case. For all in-cache benchmarks we make sure that the data has already been loaded into the cache.

Please note that due to the continued development of all libraries the performance results are subject to change. Also note that the releases used may not be the most recent ones. We are currently updating the results with the newest releases of all libraries.

## BLAS Level 1

### Dense Vector Addition

```
blaze::DynamicVector<double> a( N ), b( N ), c( N );
// ... Initialization of the vectors
c = a + b;
```

### Daxpy

```
blaze::DynamicVector<double> a( N ), b( N );
// ... Initialization of the vectors
b += a * 0.001;
```

## BLAS Level 2

### Row-major Dense Matrix/Vector Multiplication

```
blaze::DynamicMatrix<double,rowMajor> A( N, N );
blaze::DynamicVector<double> a( N ), b( N );
// ... Initialization of the matrix and the vector a
b = A * a;
```

### Column-major Dense Matrix/Vector Multiplication

```
blaze::DynamicMatrix<double,columnMajor> A( N, N );
blaze::DynamicVector<double> a( N ), b( N );
// ... Initialization of the matrix and the vector a
b = A * a;
```

### Column-major Dense Matrix/Dense Matrix Addition

```
blaze::DynamicMatrix<double,columnMajor> A( N, N ), B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A + B;
```

## BLAS Level 3

### Row-major Dense Matrix/Dense Matrix Multiplication

```
blaze::DynamicMatrix<double,rowMajor> A( N, N ), B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;
```

Please note that, due to the beta state of the Eigen library, the OpenMP parallelization of the matrix multiplication did not work as expected!

### Column-major Dense Matrix/Dense Matrix Multiplication

```
blaze::DynamicMatrix<double,columnMajor> A( N, N ), B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;
```

Please note that, due to the beta state of the Eigen library, the OpenMP parallelization of the matrix multiplication did not work as expected!

### Row-Major Dense Matrix/Sparse Matrix Multiplication

```
blaze::DynamicMatrix<double,rowMajor> A( N, N ), C( N, N );
blaze::CompressedMatrix<double,rowMajor> B( N, N );
// ... Initialization of the matrices
C = A * B;
```

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

### Column-Major Dense Matrix/Sparse Matrix Multiplication

```
blaze::DynamicMatrix<double,columnMajor> A( N, N ), C( N, N );
blaze::CompressedMatrix<double,columnMajor> B( N, N );
// ... Initialization of the matrices
C = A * B;
```

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

### Row-Major Sparse Matrix/Dense Matrix Multiplication

```
blaze::CompressedMatrix<double,rowMajor> A( N, N );
blaze::DynamicMatrix<double,rowMajor> B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;
```

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

### Column-Major Sparse Matrix/Dense Matrix Multiplication

```
blaze::CompressedMatrix<double,columnMajor> A( N, N );
blaze::DynamicMatrix<double,columnMajor> B( N, N ), C( N, N );
// ... Initialization of the matrices
C = A * B;
```

5% of the elements of the sparse matrix are filled with randomly distributed non-zero entries.

## Matrix Transpose

### Dense Matrix Transpose

```
blaze::DynamicMatrix<double,rowMajor> A( N, N ), B( N, N );
// ... Initialization of the matrices
B = trans( A );
```

### Sparse Matrix Transpose

```
blaze::CompressedMatrix<double,rowMajor> A( N, N ), B( N, N );
// ... Initialization of the matrices
B = trans( A );
```
