Improve the performance of dense matrix/dense matrix multiplication kernels

Issue #12 resolved
Klaus Iglberger created an issue

Description

The primary goal of the Blaze library is to provide maximum performance for all operations. In the context of dense matrix/dense matrix multiplications, the philosophy of Blaze is to rely on the efficient matrix/matrix multiplication kernels of existing BLAS libraries. However, BLAS only provides kernels for single and double precision floating point and complex numbers; kernels for integral matrices are not available. Thus the multiplication of integral matrices is not as efficient as it should be.

In order to remove the dependency on BLAS libraries, the performance of the dense matrix/dense matrix multiplication kernels should be improved. The kernels should be generic in order to support matrices with integral, floating point, and complex element types.
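As an illustration of what such a generic, BLAS-free kernel amounts to, here is a minimal sketch (not the actual Blaze kernel; the name gemm_naive and the assumption of contiguous row-major storage are chosen for illustration only). The same template instantiates for integral, floating point, and complex element types alike:

```cpp
#include <cstddef>

// Minimal sketch of a type-generic C += A * B for contiguous row-major
// matrices. The i-k-j loop order keeps the innermost accesses to B and C
// contiguous; a tuned kernel would add blocking and vectorization on top.
template< typename T >
void gemm_naive( const T* A, const T* B, T* C,
                 std::size_t M, std::size_t N, std::size_t K )
{
   for( std::size_t i=0UL; i<M; ++i ) {
      for( std::size_t k=0UL; k<K; ++k ) {
         const T a( A[i*K+k] );
         for( std::size_t j=0UL; j<N; ++j ) {
            C[i*N+j] += a * B[k*N+j];
         }
      }
   }
}
```

The work tracked by this ticket is about closing the gap between such straightforward generic code and tuned BLAS kernels by means of blocking, register usage, and vectorization.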

The performance of the following kernels should be improved:

  • row-major dense/row-major dense matrix multiplication (DMatDMatMultExpr)
  • row-major dense/column-major dense matrix multiplication (DMatTDMatMultExpr)
  • column-major dense/row-major dense matrix multiplication (TDMatDMatMultExpr)
  • column-major dense/column-major dense matrix multiplication (TDMatTDMatMultExpr)

Tasks

  • optimize the performance of the DMatDMatMultExpr kernel
  • optimize the performance of the DMatTDMatMultExpr kernel
  • optimize the performance of the TDMatDMatMultExpr kernel
  • optimize the performance of the TDMatTDMatMultExpr kernel
  • update symmetric refactoring operations as required
  • guarantee correctness and robustness for all modified kernels

Comments (8)

  1. Michael Lehn

    I think that I could help with that. Based on some papers on the BLIS library, and for a lecture I give on high performance computing, I created a simple (as it is intended for teaching) implementation of the matrix-matrix product, which we called ulmBLAS. Its performance is independent of row-/column-major storage. Even matrix views that reference, e.g., every second row and every third column (and are therefore neither row- nor column-major) have no negative impact on performance. If you are interested I can give you more information. Here are some pages that I created recently about this topic:

    I really think that C++ could (and should) be the programming language of choice for many tasks in high performance computing.
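    The storage-order independence described above typically comes from the packing step of BLIS-style algorithms: blocks of the operands are copied once into contiguous buffers, so the compute kernel never touches the original (possibly strided) storage. The following is a rough sketch of that idea only, not the actual ulmBLAS code; pack_block, macro_kernel, and the stride parameters are hypothetical names chosen for illustration:

    ```cpp
    #include <cstddef>

    // Pack a block of a matrix with arbitrary row/column strides into a
    // contiguous row-major buffer. Because only this step sees the original
    // storage, the compute kernel's speed does not depend on whether the
    // source is row-major, column-major, or a strided view.
    template< typename T >
    void pack_block( const T* X, std::size_t rowStride, std::size_t colStride,
                     std::size_t rows, std::size_t cols, T* buffer )
    {
       for( std::size_t i=0UL; i<rows; ++i )
          for( std::size_t j=0UL; j<cols; ++j )
             buffer[i*cols+j] = X[i*rowStride + j*colStride];
    }

    // C += A * B on packed blocks (both row-major after packing). In a real
    // implementation this macro-kernel would dispatch to a vectorized
    // micro-kernel operating on small MR x NR tiles of C.
    template< typename T >
    void macro_kernel( const T* packedA, const T* packedB, T* C,
                       std::size_t ldc, std::size_t mc, std::size_t nc,
                       std::size_t kc )
    {
       for( std::size_t i=0UL; i<mc; ++i )
          for( std::size_t k=0UL; k<kc; ++k )
             for( std::size_t j=0UL; j<nc; ++j )
                C[i*ldc+j] += packedA[i*kc+k] * packedB[k*nc+j];
    }
    ```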

  2. Klaus Iglberger reporter

    Thanks for the interesting links. We would be very interested in learning more about the implementation. Do you have a reference to a performance comparison between your implementation and any major BLAS implementation (MKL, Goto, ATLAS, ...), including some information on the CPU used, so that we can estimate the expected performance? Does the implementation contain kernels for all possible combinations of row-major and column-major matrices? Thanks again for the links; any help is very much appreciated.

  3. Michael Lehn

    For an old Intel Core 2 I have benchmarks here: http://apfel.mathematik.uni-ulm.de/%7Elehn/sghpc/gemm/page14/index.html

    For the benchmarks on [Matrix-Matrix Product Experiments with uBLAS](http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html), see here for the output of /proc/cpuinfo.

    I would suggest the following: I set up a small Blaze extension for the GEMM operation with some of my micro-kernels for AVX and FMA. This way you could double-check the benchmarks. As Blaze can be linked against any BLAS implementation, this can also be used to compare performance with ATLAS, OpenBLAS, ...

    Besides my own micro-kernels, any micro-kernel from BLIS can be used. My micro-kernels are somewhat simpler, as they are primarily intended for teaching purposes. So for true high performance and support for a wide range of hardware, some collaboration and friendship with the BLIS developers would be a good thing :-)
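    To make the discussion of AVX/FMA micro-kernels concrete, here is a hedged sketch of what such a micro-kernel can look like: a 4x4 double-precision tile of C updated with FMA intrinsics from panels packed as sketched above. The name ugemm_4x4, the 4x4 tile size, and the row-major view of C are assumptions for illustration only; the actual ulmBLAS/BLIS micro-kernels use larger register tiles and are typically hand-tuned assembly:

    ```cpp
    #include <immintrin.h>
    #include <cstddef>

    // Hypothetical 4x4 micro-kernel (AVX2 + FMA): packedA holds kc slivers of
    // 4 consecutive elements of A, packedB holds kc slivers of 4 consecutive
    // elements of B. Accumulator c_j holds column j of the 4x4 tile of C.
    void ugemm_4x4( std::size_t kc, const double* packedA,
                    const double* packedB, double* C, std::size_t ldc )
    {
       __m256d c0 = _mm256_setzero_pd();
       __m256d c1 = _mm256_setzero_pd();
       __m256d c2 = _mm256_setzero_pd();
       __m256d c3 = _mm256_setzero_pd();

       for( std::size_t k=0UL; k<kc; ++k ) {
          const __m256d a = _mm256_loadu_pd( packedA + 4UL*k );  // A(0..3,k)
          c0 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k    ] ), c0 );
          c1 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k+1UL] ), c1 );
          c2 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k+2UL] ), c2 );
          c3 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k+3UL] ), c3 );
       }

       // Accumulate the register tile into C (row-major, leading dim ldc)
       const __m256d col[4] = { c0, c1, c2, c3 };
       alignas(32) double tmp[4];
       for( std::size_t j=0UL; j<4UL; ++j ) {
          _mm256_store_pd( tmp, col[j] );
          for( std::size_t i=0UL; i<4UL; ++i )
             C[i*ldc+j] += tmp[i];
       }
    }
    ```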

  4. Klaus Iglberger reporter

    Thanks a lot for the enthusiasm to improve the performance of the matrix/matrix multiplication (MMM). However, after looking into your performance results I realized that the performance of the Blaze implementation is on the same level as that of the BLIS/ulmBLAS implementation. Given these results, it is questionable whether the effort to integrate the code is justified.

    In release 2.4 we significantly improved the performance of the MMM. However, due to a lack of resources we did not manage to match the performance of the MKL (as a representative of the fastest BLAS implementations). This ticket merely serves as a reminder to finish the job; its main purpose is to achieve at least the performance of the MKL. Even if the BLIS/ulmBLAS implementation is slightly faster than our current implementation and we could gain a little performance, we would not reach the MKL performance level and thus could not get rid of the dependency on the BLAS libraries (the second goal of this ticket).

    Still, thanks a lot for the offer to help improve the performance; it is highly appreciated.

  5. Klaus Iglberger reporter

    The performance of the kernels for small dense matrices has been slightly improved by better utilizing the available registers. The new kernels for large matrix multiplications have been optimized for both integral and floating point computations within the limits of a pure C++ implementation (i.e. no use of assembly code).
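    For readers unfamiliar with the technique, "better utilizing the available registers" refers to register blocking: a small tile of C is accumulated in local variables so that each loaded element of A and B is reused several times and the partial sums stay in registers. A minimal sketch of the idea (not the actual Blaze code; the 2x2 tile, row-major storage, and the even-dimension assumption are for illustration only):

    ```cpp
    #include <cstddef>

    // C += A * B with a 2x2 register tile; assumes M and N are even for
    // brevity (a real kernel handles the remainder rows/columns).
    template< typename T >
    void gemm_2x2_regblock( const T* A, const T* B, T* C,
                            std::size_t M, std::size_t N, std::size_t K )
    {
       for( std::size_t i=0UL; i<M; i+=2UL ) {
          for( std::size_t j=0UL; j<N; j+=2UL ) {
             T c00{}, c01{}, c10{}, c11{};   // the 2x2 tile stays in registers
             for( std::size_t k=0UL; k<K; ++k ) {
                const T a0( A[ i      *K+k] );
                const T a1( A[(i+1UL) *K+k] );
                const T b0( B[k*N+j      ] );
                const T b1( B[k*N+j+1UL  ] );
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
             }
             C[ i      *N+j      ] += c00;
             C[ i      *N+j+1UL  ] += c01;
             C[(i+1UL) *N+j      ] += c10;
             C[(i+1UL) *N+j+1UL  ] += c11;
          }
       }
    }
    ```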

    The feature has been implemented, tested, and documented as required. It is immediately available via cloning the Blaze repository and will be officially released in Blaze 3.1.
