Improve the performance of dense matrix/dense matrix multiplication kernels

Issue #12 resolved
Klaus Iglberger created an issue

Description

The primary goal of the Blaze library is to provide maximum performance for all operations. In the context of dense matrix/dense matrix multiplications, the philosophy of Blaze is to rely on the efficient matrix/matrix multiplication kernels of existing BLAS libraries. However, BLAS only provides kernels for single and double precision floating point and complex numbers; kernels for integral matrices are not available. Thus the multiplication of integral matrices is not as efficient as it should be.

In order to remove the dependency on BLAS libraries, the performance of the dense matrix/dense matrix multiplication kernels should be improved. The kernels should be generic in order to support matrices with integral, floating point, and complex element types.
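As an illustration of what such a generic, BLAS-free kernel amounts to, here is a minimal sketch (not the actual Blaze kernel; the name gemm_naive and the assumption of contiguous row-major storage are chosen for illustration only). The same template instantiates for integral, floating point, and complex element types alike:

```cpp
#include <cstddef>

// Minimal sketch of a type-generic C += A * B for contiguous row-major
// matrices. The i-k-j loop order keeps the innermost accesses to B and C
// contiguous; a tuned kernel would add blocking and vectorization on top.
template< typename T >
void gemm_naive( const T* A, const T* B, T* C,
                 std::size_t M, std::size_t N, std::size_t K )
{
   for( std::size_t i=0UL; i<M; ++i ) {
      for( std::size_t k=0UL; k<K; ++k ) {
         const T a( A[i*K+k] );
         for( std::size_t j=0UL; j<N; ++j ) {
            C[i*N+j] += a * B[k*N+j];
         }
      }
   }
}
```

The work tracked by this ticket is about closing the gap between such straightforward generic code and tuned BLAS kernels by means of blocking, register usage, and vectorization.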

The performance of the following kernels should be improved:

  • row-major dense/row-major dense matrix multiplication (DMatDMatMultExpr)
  • row-major dense/column-major dense matrix multiplication (DMatTDMatMultExpr)
  • column-major dense/row-major dense matrix multiplication (TDMatDMatMultExpr)
  • column-major dense/column-major dense matrix multiplication (TDMatTDMatMultExpr)

Tasks

  • optimize the performance of the DMatDMatMultExpr kernel
  • optimize the performance of the DMatTDMatMultExpr kernel
  • optimize the performance of the TDMatDMatMultExpr kernel
  • optimize the performance of the TDMatTDMatMultExpr kernel
  • update symmetric refactoring operations as required
  • guarantee correctness and robustness for all modified kernels

Comments (8)

  1. Michael Lehn

    I think that I could help with that. Based on some papers on the BLIS library, and for a lecture I give on high performance computing, I created a simple (as it is intended for teaching) implementation of the matrix-matrix product, which we called ulmBLAS. Its performance is independent of row-/column-major storage. Even matrix views that reference, e.g., every second row and every third column (and are therefore neither row- nor column-major) have no negative impact on performance. If you are interested I can give you more information. Here are some pages that I created recently about this topic:

    I really think that C++ could (and should) be the programming language of choice for many tasks in high performance computing.
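    The storage-order independence described above typically comes from the packing step of BLIS-style algorithms: blocks of the operands are copied once into contiguous buffers, so the compute kernel never touches the original (possibly strided) storage. The following is a rough sketch of that idea only, not the actual ulmBLAS code; pack_block, macro_kernel, and the stride parameters are hypothetical names chosen for illustration:

    ```cpp
    #include <cstddef>

    // Pack a block of a matrix with arbitrary row/column strides into a
    // contiguous row-major buffer. Because only this step sees the original
    // storage, the compute kernel's speed does not depend on whether the
    // source is row-major, column-major, or a strided view.
    template< typename T >
    void pack_block( const T* X, std::size_t rowStride, std::size_t colStride,
                     std::size_t rows, std::size_t cols, T* buffer )
    {
       for( std::size_t i=0UL; i<rows; ++i )
          for( std::size_t j=0UL; j<cols; ++j )
             buffer[i*cols+j] = X[i*rowStride + j*colStride];
    }

    // C += A * B on packed blocks (both row-major after packing). In a real
    // implementation this macro-kernel would dispatch to a vectorized
    // micro-kernel operating on small MR x NR tiles of C.
    template< typename T >
    void macro_kernel( const T* packedA, const T* packedB, T* C,
                       std::size_t ldc, std::size_t mc, std::size_t nc,
                       std::size_t kc )
    {
       for( std::size_t i=0UL; i<mc; ++i )
          for( std::size_t k=0UL; k<kc; ++k )
             for( std::size_t j=0UL; j<nc; ++j )
                C[i*ldc+j] += packedA[i*kc+k] * packedB[k*nc+j];
    }
    ```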

  2. Klaus Iglberger reporter

    Thanks for the interesting links. We would be very interested in learning more about the implementation. Do you have a reference to a performance comparison between your implementation and any major BLAS implementation (MKL, Goto, ATLAS, ...), including some information on the CPU used, so that we can estimate the expected performance? Does the implementation contain kernels for all possible combinations of row-major and column-major matrices? Thanks again for the links; any help is very much appreciated.

  3. Michael Lehn

    For an old Intel Core 2 I have benchmarks here: http://apfel.mathematik.uni-ulm.de/%7Elehn/sghpc/gemm/page14/index.html

    For the benchmarks on [Matrix-Matrix Product Experiments with uBLAS](http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html), see here for the output of /proc/cpuinfo.

    I would suggest the following: I set up a small Blaze extension for the GEMM operation with some of my micro-kernels for AVX and FMA. This way you could double-check the benchmarks. As Blaze can be linked against any BLAS implementation, this can also be used to compare performance with ATLAS, OpenBLAS, ...

    Besides my own micro-kernels, any micro-kernel from BLIS can be used. My micro-kernels are somewhat simpler, as they are primarily intended for teaching purposes. So for true high performance and support for a wide range of hardware, some collaboration and friendship with the BLIS developers would be a good thing :-)
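    To make the discussion of AVX/FMA micro-kernels concrete, here is a hedged sketch of what such a micro-kernel can look like: a 4x4 double-precision tile of C updated with FMA intrinsics from panels packed as sketched above. The name ugemm_4x4, the 4x4 tile size, and the row-major view of C are assumptions for illustration only; the actual ulmBLAS/BLIS micro-kernels use larger register tiles and are typically hand-tuned assembly:

    ```cpp
    #include <immintrin.h>
    #include <cstddef>

    // Hypothetical 4x4 micro-kernel (AVX2 + FMA): packedA holds kc slivers of
    // 4 consecutive elements of A, packedB holds kc slivers of 4 consecutive
    // elements of B. Accumulator c_j holds column j of the 4x4 tile of C.
    void ugemm_4x4( std::size_t kc, const double* packedA,
                    const double* packedB, double* C, std::size_t ldc )
    {
       __m256d c0 = _mm256_setzero_pd();
       __m256d c1 = _mm256_setzero_pd();
       __m256d c2 = _mm256_setzero_pd();
       __m256d c3 = _mm256_setzero_pd();

       for( std::size_t k=0UL; k<kc; ++k ) {
          const __m256d a = _mm256_loadu_pd( packedA + 4UL*k );  // A(0..3,k)
          c0 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k    ] ), c0 );
          c1 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k+1UL] ), c1 );
          c2 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k+2UL] ), c2 );
          c3 = _mm256_fmadd_pd( a, _mm256_set1_pd( packedB[4UL*k+3UL] ), c3 );
       }

       // Accumulate the register tile into C (row-major, leading dim ldc)
       const __m256d col[4] = { c0, c1, c2, c3 };
       alignas(32) double tmp[4];
       for( std::size_t j=0UL; j<4UL; ++j ) {
          _mm256_store_pd( tmp, col[j] );
          for( std::size_t i=0UL; i<4UL; ++i )
             C[i*ldc+j] += tmp[i];
       }
    }
    ```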

  4. Klaus Iglberger reporter

    Thanks a lot for the enthusiasm to improve the performance of the matrix/matrix multiplication (MMM). However, after looking into your performance results I realized that the performance of the Blaze implementation is on the same level as that of the BLIS/ulmBLAS implementation. Given these results, it is questionable whether the effort to integrate the code is justified.

    In release 2.4 we significantly improved the performance of the MMM. However, due to a lack of resources we did not manage to match the performance of the MKL (as a representative of the fastest BLAS implementations). This ticket merely serves as a reminder to finish the job; its main purpose is to achieve at least the performance of the MKL. Even if the BLIS/ulmBLAS implementation is slightly faster than our current implementation and we could gain a little performance, we would not reach the MKL performance level and thus could not get rid of the dependency on the BLAS libraries (the second goal of this ticket).

    Still, thanks a lot for the offer to help improve the performance; it is highly appreciated.

  5. Klaus Iglberger reporter

    The performance of the kernels for small dense matrices has been slightly improved by better utilizing the available registers. The new kernels for large matrix multiplications have been optimized for both integral and floating point computations within the limits of a pure C++ implementation (i.e. no use of assembly code).
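    For readers unfamiliar with the technique, "better utilizing the available registers" refers to register blocking: a small tile of C is accumulated in local variables so that each loaded element of A and B is reused several times and the partial sums stay in registers. A minimal sketch of the idea (not the actual Blaze code; the 2x2 tile, row-major storage, and the even-dimension assumption are for illustration only):

    ```cpp
    #include <cstddef>

    // C += A * B with a 2x2 register tile; assumes M and N are even for
    // brevity (a real kernel handles the remainder rows/columns).
    template< typename T >
    void gemm_2x2_regblock( const T* A, const T* B, T* C,
                            std::size_t M, std::size_t N, std::size_t K )
    {
       for( std::size_t i=0UL; i<M; i+=2UL ) {
          for( std::size_t j=0UL; j<N; j+=2UL ) {
             T c00{}, c01{}, c10{}, c11{};   // the 2x2 tile stays in registers
             for( std::size_t k=0UL; k<K; ++k ) {
                const T a0( A[ i      *K+k] );
                const T a1( A[(i+1UL) *K+k] );
                const T b0( B[k*N+j      ] );
                const T b1( B[k*N+j+1UL  ] );
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
             }
             C[ i      *N+j      ] += c00;
             C[ i      *N+j+1UL  ] += c01;
             C[(i+1UL) *N+j      ] += c10;
             C[(i+1UL) *N+j+1UL  ] += c11;
          }
       }
    }
    ```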

    The feature has been implemented, tested, and documented as required. It is immediately available via cloning the Blaze repository and will be officially released in Blaze 3.1.
