Using a third-party BLAS library with blaze

Issue #194 resolved
Антон М created an issue

Documentation states

For maximum performance, Blaze expects you to have a BLAS library installed (Intel MKL,
ACML, Atlas, Goto, ...). If you don't have a BLAS library installed on your system, Blaze will
still work and will not be reduced in functionality, but performance may be limited. Thus it is
strongly recommended to install a BLAS library.

I am struggling to understand how exactly to make use of Intel MKL or the like with Blaze and cannot find more detailed documentation on the subject. The questions are:

1) In what cases can such a library help exactly? Can any BLAS operation be replaced? E.g. can it help if I am only performing arithmetic on vectors?

2) How do I make sure the library is used? Is setting the defines from BLAS.h enough? For some reason, after doing this (I have Intel MKL installed), my code compiles fine without explicitly linking MKL (perhaps this is related to (1)).

I guess my proposal is to document this topic more extensively (if not done already).

Comments (7)

  1. Klaus Iglberger

    Hi Антон!

    We admit that there is no exhaustive documentation on which BLAS functions are currently used within Blaze. The formulation in the wiki is very general, but provides us with some implementation flexibility, i.e. we can change the internals without violating documented behavior.

    Issue #177 already provides some answers on which BLAS and LAPACK functions are currently used within Blaze and how to switch between BLAS and Blaze kernels. However, we will also add a FAQ to the tutorial and wiki, which will cover this particular question. Thanks for creating this issue,

    Best regards,

    Klaus!

  2. Klaus Iglberger

    Summary

    The FAQ, which initially covers four items, has been introduced in both the tutorial and the wiki. The updated tutorial is immediately available by cloning the Blaze repository; the updated wiki will be available with the official release of Blaze 3.4.

    FAQ

    A StaticVector/StaticMatrix is larger than expected. Is this a bug?

    The size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix can indeed be larger than expected:

    StaticVector<int,3> a;
    StaticMatrix<int,3,3> A;
    
    sizeof( a );  // Evaluates to 16, 32, or even 64, but not 12
    sizeof( A );  // Evaluates to 48, 96, or even 144, but not 36
    

    In order to achieve the maximum possible performance, the Blaze library tries to enable SIMD vectorization even for small vectors. For that reason Blaze by default uses padding elements for all dense vectors and matrices to guarantee that at least a single SIMD vector can be loaded. Depending on the SIMD instruction set in use, this can significantly increase the size of a StaticVector, StaticMatrix, HybridVector or HybridMatrix:

    StaticVector<int,3> a;
    StaticMatrix<int,3,3> A;
    
    sizeof( a );  // Evaluates to 16 in case of SSE, 32 in case of AVX, and 64 in case of AVX-512
                  // (under the assumption that an integer occupies 4 bytes)
    sizeof( A );  // Evaluates to 48 in case of SSE, 96 in case of AVX, and 144 in case of AVX-512
                  // (under the assumption that an integer occupies 4 bytes)
    

    The configuration file <blaze/config/Optimizations.h> provides a compile time switch that can be used to (de-)activate padding:

    #define BLAZE_USE_PADDING 1
    

    Alternatively it is possible to (de-)activate padding via command line or by defining this symbol manually before including any Blaze header file:

    #define BLAZE_USE_PADDING 1
    #include <blaze/Blaze.h>
    

    If BLAZE_USE_PADDING is set to 1, padding is enabled for all dense vectors and matrices; if it is set to 0, padding is disabled. Note however that disabling padding can considerably reduce the performance of all dense vector and matrix operations!

    Despite disabling padding, a StaticVector/StaticMatrix is still larger than expected. Is this a bug?

    Despite disabling padding via the BLAZE_USE_PADDING compile time switch, the size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix can still be larger than expected:

    #define BLAZE_USE_PADDING 0
    #include <blaze/Blaze.h>
    
    StaticVector<int,3> a;
    StaticVector<int,5> b;
    
    sizeof( a );  // Always evaluates to 12
    sizeof( b );  // Evaluates to 32 with SSE (larger than expected) and to 20 with AVX or AVX-512 (expected)
    

    The reason for this behavior is the SIMD technology in use. If SSE is used, which provides 128 bit wide registers, a single SIMD pack can usually hold 4 integers (128 bit divided by 32 bit). Since the second vector contains enough elements, it is possible to benefit from vectorization. However, SSE requires an alignment of 16 bytes, which ultimately results in a total size of 32 bytes for the StaticVector (2 times 16 bytes due to 5 integer elements). If AVX or AVX-512 is used, which provide 256 bit or 512 bit wide registers, a single SIMD vector can hold 8 or 16 integers, respectively. In that case even the second vector does not hold enough elements to benefit from vectorization, which is why Blaze does not enforce a 32 byte (for AVX) or even 64 byte alignment (for AVX-512).

    It is possible to disable the vectorization entirely by the compile time switch in the <blaze/config/Vectorization.h> configuration file:

    #define BLAZE_USE_VECTORIZATION 1
    

    It is also possible to (de-)activate vectorization via command line or by defining this symbol manually before including any Blaze header file:

    #define BLAZE_USE_VECTORIZATION 1
    #include <blaze/Blaze.h>
    

    In case the switch is set to 1, vectorization is enabled and the Blaze library is allowed to use intrinsics and the necessary alignment to speed up computations. In case the switch is set to 0, vectorization is disabled entirely and the Blaze library chooses default, non-vectorized functionality for the operations. Note that deactivating the vectorization may pose a severe performance limitation for a large number of operations!

    To which extent does Blaze make use of BLAS functions under the hood?

    Currently the only BLAS functions that are utilized by Blaze are the gemm() functions for the multiplication of two dense matrices (i.e. sgemm(), dgemm(), cgemm(), and zgemm()). All other operations are always and unconditionally performed by native Blaze kernels.

    The BLAZE_BLAS_MODE config switch (see <blaze/config/BLAS.h>) determines whether Blaze is allowed to use BLAS kernels. If BLAZE_BLAS_MODE is set to 0, Blaze does not utilize the BLAS kernels and unconditionally uses its own custom kernels. If BLAZE_BLAS_MODE is set to 1, Blaze is allowed to choose between BLAS kernels and its own custom kernels. In case of the dense matrix multiplication this decision is based on the size of the dense matrices: for large matrices Blaze uses the BLAS kernels, for small matrices it uses its own custom kernels. The threshold for this decision can be configured via the BLAZE_DMATDMATMULT_THRESHOLD, BLAZE_DMATTDMATMULT_THRESHOLD, BLAZE_TDMATDMATMULT_THRESHOLD and BLAZE_TDMATTDMATMULT_THRESHOLD config switches (see <blaze/config/Thresholds.h>).

    Please note that the extent to which Blaze uses BLAS kernels can change in future releases of Blaze!

    To which extent does Blaze make use of LAPACK functions under the hood?

    Blaze uses LAPACK functions for matrix decomposition, matrix inversion, computing the determinants and eigenvalues, and the SVD. In contrast to the BLAS functionality, you cannot disable LAPACK or switch to custom kernels. In case you try to use any of these functionalities, but do not provide (i.e. link) a LAPACK library you will get link time errors.

    Please note that the extent to which Blaze uses LAPACK kernels can change in future releases of Blaze!

  3. Sandu Ursu

    I also have Intel MKL.

    And the same question too:

    • How do I make sure the library is used? Is setting the defines from BLAS.h enough?

    Ideally there would be an example, a full program, whose performance one could compare with and without BLAS.

  4. Klaus Iglberger

    Hi Sandu!

    For a row-major matrix multiplication you can make sure that the MKL is called by setting BLAZE_BLAS_MODE to 1 and BLAZE_DMATDMATMULT_THRESHOLD to 0. For instance:

    g++ -DBLAZE_BLAS_MODE=1 -DBLAZE_DMATDMATMULT_THRESHOLD=0 -O3 -DNDEBUG ...
    

    You can find more information in the wiki (see e.g. Configuration Files and the FAQ).

    Best regards,

    Klaus!
