Making Blaze even faster

Issue #181 new
Mikhail Katliar created an issue

Blaze is fast, but not always the fastest. The chart below shows the performance of different implementations of the dense matrix-matrix product:

[Chart: blasfeo_target_intel_haswell_native.png]

The BLAS implementation from the BLASFEO library (https://github.com/giaf/blasfeo) significantly outperforms Blaze for matrix sizes N > 10.

It would be great if Blaze's performance could be improved to a comparable level. Perhaps the architecture-optimized code from BLASFEO could be used in Blaze?
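
For reference, here is a minimal sketch of the kind of Blaze code being measured (illustrative only, not the exact benchmark behind the chart; the size N and the repetition count are arbitrary):

    #include <blaze/Math.h>
    #include <chrono>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        const std::size_t N = 64, reps = 10000;
        blaze::DynamicMatrix<double> A(N, N, 1.0), B(N, N, 2.0), C(N, N);

        const auto start = std::chrono::steady_clock::now();
        for (std::size_t r = 0; r < reps; ++r) {
            C = A * B;  // the dense matrix-matrix product under discussion
        }
        const auto stop = std::chrono::steady_clock::now();

        // 2*N^3 flops per product; print C(0,0) so the loop is not optimized away.
        const double seconds = std::chrono::duration<double>(stop - start).count();
        std::cout << 2.0 * N * N * N * reps / seconds * 1e-9
                  << " GFLOPS (C(0,0) = " << C(0,0) << ")\n";
    }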

Comments (22)

  1. Klaus Iglberger

    Hi!

    Thanks a lot for raising this issue and for the performance comparison. Unfortunately, this comparison is missing a lot of details that would be necessary to reproduce the results and to give advice on how to adapt Blaze. Which BLAS library do you use as backend for Blaze? Which threshold do you use for switching between the custom kernels and the BLAS kernels? Which data type is used (int, float, double, or even std::complex)? It would also help to see the Blaze source code used for the comparison. If you could share that information, we would be grateful.

    I should point out that Blaze does not attempt to provide the fastest dense matrix multiplication kernel, but fully builds on existing BLAS libraries for that purpose. Only if BLASFEO provides better results than existing BLAS libraries (e.g. Intel MKL, OpenBLAS, ATLAS, ...) would we consider introducing BLASFEO as a backend. Thanks again,

    Best regards,

    Klaus!

  2. Mikhail Katliar reporter

    Hello Klaus,

    The performance comparison of Blasfeo vs other libraries (OpenBLAS, MKL, ATLAS and BLIS) can be found in this paper: https://arxiv.org/abs/1704.02457.

    The results presented in my chart were obtained using double-precision arithmetic.

    I will check which BLAS backend and threshold values were used and will let you know.

    Best regards, Mikhail

  3. Nils Deppe

    Hi,

    Would it be viable to switch to, say, LIBXSMM for certain operations? I've found that it performs better than Blaze for the matrix sizes discussed here.

    Best,

    Nils

  4. Klaus Iglberger

    Hi!

    Integrating different BLAS(-like) libraries such as LIBXSMM or BLASFEO is definitely an option. We will consider this, but not for the next (i.e. Blaze 3.5) release. Thanks for the suggestion,

    Best regards,

    Klaus!

  5. Mikhail Katliar reporter

    Hello Klaus,

    In the example presented, no BLAS backend was used:

    #define BLAZE_BLAS_MODE 0
    

    You can find the full set of corresponding Blaze config files in the attached archive.
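
    For context, the relevant compile-time switches look roughly as follows (BLAZE_BLAS_MODE is the setting used above; the threshold macro name is taken from Blaze's config headers, blaze/config/Thresholds.h, and its default value is version-dependent):

    // 0: never call a BLAS backend; Blaze's own vectorized kernels are used.
    #define BLAZE_BLAS_MODE 0

    // With BLAZE_BLAS_MODE set to 1, products exceeding this size threshold
    // are forwarded to the BLAS backend instead of the built-in kernels:
    // #define BLAZE_DMATDMATMULT_THRESHOLD ...  // default depends on version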

  6. Hans Pabst

    Hello Mikhail, Klaus, and Nils,

    I am preparing an article about LIBXSMM, and the story focuses on (small) matrix multiplications. I was thinking about including example code for Eigen, Blaze, and LIBXSMM. The main point is not performance comparison, but rather doing the right benchmark, i.e., one that reproduces the performance behavior of an application.

    Let me share the example code; perhaps you want to comment on it:
    https://github.com/hfp/libxsmm/tree/master/samples/magazine

    • Allocates a series of matrices (with potentially odd shapes); the working set exceeds L2/L3 sizes (like STREAM).
    • Selects the "streaming case", i.e., which matrices are loaded from main memory (vs. nth-level cache).

    With respect to Blaze, I am using CustomMatrix to (re-)use existing buffers (as in applications that are BLAS-centric, or applications with their own non-Blaze data structures); a minimal standalone sketch is included at the end of this comment. You can try the example code:

    git clone https://github.com/hfp/libxsmm.git
    cd libxsmm; make
    cd samples/magazine; make
    

    The latter attempts to detect BLAZEROOT (at HOME, and in some other locations), but one can also specify it explicitly:

    make BLAZEROOT=/path/to/blaze-3.4
    

    It does something similar for EIGENROOT, but you can also just name the make target and bypass Eigen:

    make BLAZEROOT=/path/to/blaze-3.4 magazine_blaze
    

    Anyhow, please have a look!
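
    Regarding the CustomMatrix usage mentioned above, here is the promised minimal standalone sketch of wrapping pre-existing buffers (sizes and storage order are illustrative):

    #include <blaze/Math.h>
    #include <cstddef>
    #include <vector>

    int main()
    {
        using MatrixView = blaze::CustomMatrix<double, blaze::unaligned,
                                               blaze::unpadded, blaze::columnMajor>;

        const std::size_t m = 13, n = 5, k = 7;  // deliberately odd shapes
        std::vector<double> a(m * k, 1.0), b(k * n, 2.0), c(m * n, 0.0);

        // Wrap the existing buffers; no data is copied.
        MatrixView A(a.data(), m, k), B(b.data(), k, n), C(c.data(), m, n);

        C += A * B;  // accumulates into the caller-owned buffer c
    }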


    P.S.: there seems to be an issue (in Blaze?) if beta=1 (https://github.com/hfp/libxsmm/blob/master/samples/magazine/magazine_blaze.cpp#L93), and it looks like the matrix multiplication expression somehow incorrectly aliases the c-matrix (it is used on both the lhs and the rhs): https://github.com/hfp/libxsmm/blob/master/samples/magazine/magazine_blaze.cpp#L141

  7. Hans Pabst

    Regarding the issue mentioned, c.assign(a * b + beta * c) (as opposed to c = a * b + beta * c) resolves the problem.

  8. Klaus Iglberger

    Hi Hans!

    Thanks a lot for pointing out the problem with beta == 1. We have already taken a look at the issue and apparently discovered a problem in our aliasing detection system that unfortunately has not surfaced in our tests before. We will analyse and fix the problem as quickly as possible.

    I should point out that Blaze does not provide a special path to efficiently evaluate the expression c = a * b + beta * c; (for instance by mapping this to a BLAS kernel). Please rewrite the expression to either c += a * b; (in case beta == 1) or to c *= beta; c += a * b; (in case beta != 1). Thanks again,

    Best regards

    Klaus!

  9. Hans Pabst

    Thank you for the recommendations. I have now implemented the expression such that it does not rely on the library mapping it to the best possible implementation (i.e., no beta-specific expression variants). Btw, c.assign(a * b + beta * c) appears to compile with the MSVC compiler but not with GCC/Clang.

    I settled on the beta=1 case for the magazine code examples, streaming all three matrix operands. This is perhaps less common than streaming only the A and B matrices (and accumulating into C), but it is the most bandwidth-demanding case (not FLOPS-bound). I can nevertheless share the performance results if they are of interest.

  10. Klaus Iglberger

    Hi Hans!

    c.assign(a * b + beta * c); should not be used directly, as this is an internal interface. I'm surprised that MSVC compiles it; GCC/Clang handle this case correctly. The official interface is always via operator=() (e.g. c = a * b + beta * c;).

    It appears you have picked the only case that is faulty. An alternative formulation that works flawlessly is c = beta * c + a * b;. I'm currently testing the fix for the c = a * b + beta * c; formulation and will push it as soon as I have the guarantee that it works fine. I apologize for the inconvenience,

    Best regards,

    Klaus!

  11. Hans Pabst

    [Figure: single-core performance of C += A*B on SKX/TurboON]

    Here is a comparison of single-core (OMP_NUM_THREADS=1) performance for C += A * B on a Skylake Server system (2666 MHz DIMMs). For Blaze and Eigen, I used GNU GCC 8.2 (when using multiple cores, the OpenMP runtime/version also matters). Although "GFLOPS/s" is shown in the diagram, most of the kernels are completely memory-bound. To reproduce these or other results, I have included the benchmark/plot scripts.

    git clone https://github.com/hfp/libxsmm.git
    cd libxsmm; make
    cd samples/magazine; make
    ./benchmark.sh
    ./benchmark-plot.sh eigen
    ./benchmark-plot.sh blaze
    ./benchmark-plot.sh xsmm
    

    LIBXSMM can prefetch the operands when processing a batch of small matrix multiplications. For this case I reran the benchmark (see below). For convenience, I also reran Eigen/Blaze on the same system, which at least gives an idea of the run-to-run variation.

    [Figure: single-core performance of C += A*B with prefetches on SKX/TurboON]

  12. Klaus Iglberger

    Hi Hans!

    Thanks for sharing the performance results. I'm glad that Blaze performs so well in comparison to the special purpose LIBXSMM library, but it also becomes obvious that enabling LIBXSMM as a backend would further increase the performance for most matrix sizes. Therefore we will definitely take a deeper look. Thanks again,

    Best regards,

    Klaus!

  13. Hans Pabst

    "I'm glad that Blaze performs so well"

    Yes, Blaze definitely does well, and it is also based on intrinsics code (which makes the performance more independent of the compiler's capabilities).

    "enabling LIBXSMM as a backend would further increase the performance"

    LIBXSMM integration should be relatively smooth, since Blaze consists of algorithms and data/matrix types. The latter are ideal for, e.g., adding a data member (a function pointer) representing LIBXSMM's kernel (dispatched by the matrix type's constructor). With LIBXSMM, any such pointer representing JIT'ted code is valid throughout the lifetime of the program (like any normal function pointer referring to statically generated code), so no "deep copy" or memory management is needed.
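
    As a rough sketch of this dispatch idea (using LIBXSMM's classic C API, libxsmm_dmmdispatch; the fallback comment is an assumption about how an integration would handle unsupported shapes):

    #include <libxsmm.h>

    // Dispatch (JIT or look up) a kernel for a fixed shape once; the returned
    // function pointer stays valid for the lifetime of the program, so it can
    // be stored in a matrix object without any memory management.
    void multiply(libxsmm_blasint m, libxsmm_blasint n, libxsmm_blasint k,
                  const double* a, const double* b, double* c)
    {
        // NULL leading dimensions/alpha/beta/flags/prefetch select the defaults
        // (alpha = 1, beta = 1, i.e. C += A * B on column-major operands).
        const libxsmm_dmmfunction kernel =
            libxsmm_dmmdispatch(m, n, k, NULL, NULL, NULL, NULL, NULL, NULL, NULL);

        if (kernel) {
            kernel(a, b, c);
        }
        // else: fall back to a generic GEMM path for unsupported shapes.
    }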


    Btw, I revisited Eigen and discovered EIGEN_DONT_PARALLELIZE (similar to BLAZE_USE_SHARED_MEMORY_PARALLELIZATION=0), which helps in the multicore case. My sample code parallelizes over small multiplications, and it is definitely bad if the library attempts to parallelize intra-multiplication (apparently Eigen, and maybe Blaze, lack effective single-thread thresholds). In the case of Eigen, the library can map the expression in question to GEMM, but only if a BLAS library is incorporated (and it was the purpose of the benchmark not to map to BLAS). Eigen nicely tiles the matrix operands/panels, but without a threshold, hence it always copies into thread-local storage (even for small matrices). I fixed this here, and it helped a bit but not much.
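
    For reference, both switches can be set before the respective headers are included (a sketch; they can equally be passed as compiler definitions):

    // Disable intra-multiplication threading when the application itself
    // parallelizes over many small multiplications:
    #define EIGEN_DONT_PARALLELIZE
    #define BLAZE_USE_SHARED_MEMORY_PARALLELIZATION 0

    #include <Eigen/Dense>
    #include <blaze/Math.h>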

  14. Mikhail Katliar reporter

    Below is a fresh comparison chart of dgemm performance of different LA packages:

    The benchmarks were run on an Intel Core i5-6500 CPU @ 3.30GHz with DDR3 memory @ 1866MHz. Blaze (D) and Blaze (S) stand for implementations based on blaze::DynamicMatrix<> and blaze::StaticMatrix<>, respectively. We can see that Blaze (D) performs extremely well on small matrices, outperforming all other implementations for matrix sizes 0 <= N < 7 and staying faster than MKL for N <= 20. Blaze (S) is faster than OpenBLAS for N <= 15 and on par with OpenBLAS for N > 15. MKL becomes considerably faster than Blaze (D) for N >= 15.

    It is interesting that BLASFEO significantly outperforms all other implementations for N > 7. This advantage is due to the different in-memory representation of matrices used by BLASFEO, the so-called panel-major format [1]. This format allows BLASFEO to reduce memory bandwidth by performing more useful operations on small submatrices loaded into registers, compared to conventional row- or column-major storage formats.
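
    An illustrative sketch of the element mapping described in [1] (not BLASFEO's actual code): ps is the panel height, e.g. 4 doubles for 256-bit registers, and cn is the padded number of columns.

    #include <cstddef>

    // The matrix is cut into horizontal panels of ps rows; within a panel,
    // elements are stored column by column, and panels follow each other
    // contiguously. A ps-row column is thus one contiguous SIMD load, which
    // lets the kernel keep submatrices in registers with fewer memory accesses.
    inline std::size_t panel_major_index(std::size_t i, std::size_t j,
                                         std::size_t ps, std::size_t cn)
    {
        const std::size_t panel = i / ps;  // which ps-row panel
        const std::size_t row   = i % ps;  // row offset inside the panel
        return panel * ps * cn + j * ps + row;
    }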

    It is interesting to think about combining the idea of panel-major storage with a template-based implementation. This should give the same performance as BLASFEO for large matrices. For small matrices, the compiler can take advantage of matrix sizes known at compile time, providing even better performance than BLASFEO.

    @Klaus Iglberger what do you think about this idea in general? What is your estimate of effort required to support a new storage format in Blaze, keeping the expression templates mechanism mostly untouched? Or does it look more like developing another matrix library?

    References:

    [1] G. Frison, D. Kouzoupis, T. Sartor, A. Zanelli, M. Diehl: BLASFEO: Basic Linear Algebra Subroutines for Embedded Optimization. ACM Transactions on Mathematical Software (TOMS), 2018.

  15. Klaus Iglberger

    Hi Misha!

    Introducing the panel-major format by means of adding panelMajor in addition to rowMajor and columnMajor would require a significant effort. It would be possible, but would require touching a major portion of the existing code. Introducing the panel-major format in the form of a new matrix class, with special kernels for the matrix multiplication, would require little effort. However, this matrix class would probably cause overhead in all other operations (addition, subtraction, …), since it would have to provide the standard matrix interface, which would have to perform a mapping. But I haven't had time to think this through yet.
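
    A hypothetical skeleton of the second option (class name and interface invented for illustration, not actual Blaze API), showing where the interface-mapping overhead would appear:

    #include <cstddef>
    #include <vector>

    // PS is the panel height in elements (i.e. the SIMD width).
    template <typename T, std::size_t PS>
    class PanelMatrix
    {
     public:
        PanelMatrix(std::size_t m, std::size_t n)
            : n_(n), v_(((m + PS - 1) / PS) * PS * n)
        {}

        // The standard element interface must map (i,j) through the panel
        // layout -- this per-access mapping is the overhead mentioned above
        // for all non-GEMM operations:
        T& operator()(std::size_t i, std::size_t j)
        {
            return v_[(i / PS) * PS * n_ + j * PS + i % PS];
        }

     private:
        std::size_t n_;
        std::vector<T> v_;
    };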

    It is definitely very interesting to try this, but it has a lower priority than many other issues (e.g. #53, #49, and #73).

    Best regards,

    Klaus!

  16. Hans Pabst

    Hello Mikhail,

    Alternative data formats/layouts are also available with MKL, one of which is suitable for very small sizes (and repeated application of such kernels). For LIBXSMM, this format is called "packed" (similar to the instructions operating on multiple consecutive elements ;-). We have implemented GEMM, GETRF, TRMM, and TRSM (samples/packed). We also sketched a "blocked GEMM" (bgemm) with copy-in/out, which actually uses the panel format. MKL adopted this as well, but fully implemented all cases, whereas we stopped with just multiples of the chosen block size (samples/blocked_gemm).

    Using custom data formats somewhat emerged with machine learning, where the "tensor format" is often customized/reordered for the loop nest the kernel operates in. Klaus is right when mentioning the overhead of format conversion. It can become very challenging when optimizing expressions with mixed data layouts (copy-in/out). In fact, this is done at runtime by most deep learning frameworks ("graph compilers").

    Best regards,
    Hans

  17. Mikhail Katliar reporter

    I managed to reach significantly better performance than the other implementations for static matrices of size < 100 on the Haswell architecture, even using the conventional column/row-major format. Below are the benchmark results for double-precision matrix-matrix multiplication (dgemm); my implementation is labeled BlazeFEO. For matrix sizes that are multiples of the SIMD register size, the performance is close to the theoretical peak, which is 52 GFLOPS on my system.
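
    For reference, 52 GFLOPS matches the single-core double-precision peak of a Haswell-class core with two 256-bit FMA units, assuming a sustained clock of about 3.25 GHz: 2 FMAs/cycle × 4 doubles × 2 flops = 16 flops/cycle, and 16 × 3.25 GHz ≈ 52 GFLOPS.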

  18. Klaus Iglberger

    Hi Misha!

    Thanks for these results. This indeed looks very promising. We are very open to a pull request integrating your implementation into Blaze. In case your implementation is not complete, it might be sufficient to provide a proof of concept to give us an idea of what needs to be done. Thanks again,

    Best regards,

    Klaus!

  19. Mikhail Katliar reporter

    Hello Klaus!

    My implementation is separate from Blaze, because I found it difficult to integrate with the Blaze expression templates mechanism. But it is surely possible, and I will be glad if it is eventually integrated into Blaze. It is more a proof of concept than a complete implementation, but it is a working one. I will clean up my code and make it available online so that you can take a look; then we can decide which way to go.
