Serious performance degradation with a matrix transpose

Issue #128 resolved
Marcin Copik created an issue

While benchmarking C++ expression template libraries, I discovered that in several test cases Blaze performs much worse than Eigen, Armadillo or Julia. I'm talking here about 50-70x longer execution times in kernels such as A * B * C * D * E * F, where the letters denote matrices and vectors with specific properties. An evaluation of a few similar examples revealed that this problem appears whenever a matrix transposition is involved and the libraries are configured to dispatch matrix operations to an optimized BLAS implementation.

The example below has been evaluated on my workstation, with gcc 5.4 and single-threaded MKL 2017. Blaze has been installed with a CMake Release build, blaze/config/Blas.h is configured to use BLAS for the matrix-matrix product, and blaze/config/Debugging.h is configured for a release build. Armadillo is used here to show that A*B' can be computed as efficiently as A*B. Obviously, I have been measuring timings in a more systematic way, and the code has been simplified for the purpose of this demonstration; the difference is large enough to rule out any deviations caused by poor measurement technique. Compiler and linker flags:

-O3 -DNDEBUG -DBLAZE_BLAS_MODE=1 -DBLAZE_BLAS_INCLUDE_FILE="<mkl_cblas.h>" -I MKL_INCLUDE_PATH -I BLAZE_INCLUDE_PATH  -L MKL_LIB_PATH  -lmkl_rt
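As an alternative to passing the two BLAS-related -D flags on the command line, the same macros can be defined in the configuration header mentioned above (a sketch of the config fragment, not a complete header; the macro names are taken from the flags shown here):

```cpp
// In blaze/config/Blas.h: equivalent to -DBLAZE_BLAS_MODE=1
// and -DBLAZE_BLAS_INCLUDE_FILE="<mkl_cblas.h>"
#define BLAZE_BLAS_MODE 1
#define BLAZE_BLAS_INCLUDE_FILE <mkl_cblas.h>
```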

The Armadillo example can be built with these flags:

-DHAVE_ARMADILLO -DARMA_NO_DEBUG -I ARMADILLO_INCLUDE_PATH -L ARMADILLO_LIB_PATH -larmadillo

Example code. Here I'm using the "array constructor" of the Blaze matrix to initialize it with random values.

#include <algorithm>
#include <array>
#include <chrono>
#include <iostream>
#include <memory>
#include <random>

#include <blaze/Blaze.h>

#if defined(HAVE_ARMADILLO)
#include <armadillo>

    struct naive_armadillo
    {
        template<typename Type_M33, typename Type_M34>
        decltype(auto) operator()(Type_M33 && M33, Type_M34 && M34)
        {
            return (M33*(M34).t()).eval();
        }
    };

    struct naive_armadillo_no_transpose
    {
        template<typename Type_M33, typename Type_M34>
        decltype(auto) operator()(Type_M33 && M33, Type_M34 && M34)
        {
            return (M33*(M34)).eval();
        }
    };
#endif

struct naive_blaze
{
    template<typename Type_M33, typename Type_M34>
    decltype(auto) operator()( Type_M33 && M33, Type_M34 && M34)
    {
        return blaze::evaluate(blaze::trans(M34)*M33);
    }
};

struct naive_blaze_no_transpose
{
    template<typename Type_M33, typename Type_M34>
    decltype(auto) operator()( Type_M33 && M33, Type_M34 && M34)
    {
        return blaze::evaluate(M33*M34);
    }
};

int main()
{

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(1, 2);

    std::array<size_t, 2> sizes_first{1200, 1200};
    std::array<size_t, 2> sizes_second{1200, 1200};
    std::unique_ptr<double[]>   first(new double[sizes_first[0] * sizes_first[1]] ),
                                second(new double[sizes_second[0] * sizes_second[1]] );
    std::for_each(first.get(), first.get() + sizes_first[0] * sizes_first[1],
                    [&](double & val) {
                        val = dis(gen);
                    }
    );
    std::for_each(second.get(), second.get() + sizes_second[0] * sizes_second[1],
                    [&](double & val) {
                        val = dis(gen);
                    }
    );

    typedef std::chrono::high_resolution_clock clock_t;
    {
        blaze::DynamicMatrix<double> blaze_first(sizes_first[0], sizes_first[1], first.get());
        blaze::DynamicMatrix<double> blaze_second(sizes_second[0], sizes_second[1], second.get());
        auto begin = clock_t::now();
        naive_blaze()(blaze_first, blaze_second);
        auto end = clock_t::now();
        std::cout << "Time Blaze with transpose: " << std::chrono::duration_cast<std::chrono::duration<float>>(end - begin).count() << '\n';
        begin = clock_t::now();
        naive_blaze_no_transpose()(blaze_first, blaze_second);
        end = clock_t::now();
        std::cout << "Time Blaze without transpose: " << std::chrono::duration_cast<std::chrono::duration<float>>(end - begin).count() << '\n';
    }
#if defined(HAVE_ARMADILLO)
    {
        arma::Mat<double> arma_first(first.get(), sizes_first[0], sizes_first[1]);
        arma::Mat<double> arma_second(second.get(), sizes_second[0], sizes_second[1]);
        auto begin = clock_t::now();
        naive_armadillo()(arma_first, arma_second);
        auto end = clock_t::now();
        std::cout << "Time Armadillo with transpose: " << std::chrono::duration_cast<std::chrono::duration<float>>(end - begin).count() << '\n';
        begin = clock_t::now();
        naive_armadillo_no_transpose()(arma_first, arma_second);
        end = clock_t::now();
        std::cout << "Time Armadillo without transpose: " << std::chrono::duration_cast<std::chrono::duration<float>>(end - begin).count() << '\n';
    }
#endif
    return 0;
}

An example of output on my machine:

Time Blaze with transpose: 0.228939
Time Blaze without transpose: 0.0238049
Time Armadillo with transpose: 0.0257762
Time Armadillo without transpose: 0.0276416

Blaze is approximately ten times slower when the transposition operation is used. My initial suspicion was that, for some reason, Blaze is unable to execute A*B' as a single call to dgemm and instead performs a rather inefficient multiplication with on-the-fly transposition. I believe that is the case because:

1) When using the reference BLAS library, the execution time of the second (no-transpose) kernel increases drastically, but the same does not happen for the first (transpose) kernel.

2) Running the example built with -DBLAZE_USE_FUNCTION_TRACES=1 produces a very short trace for the second kernel, whereas the trace for the first kernel is gigantic and includes multiple calls to assign_backend, smpAssign, submatrix and trans.

3) Omitting MKL (or any BLAS implementation) from the linker flags leads to a linker error stating that cblas_dgemm could not be found, but only when the second kernel is used. An example using only the first Blaze kernel can be built without linking to a BLAS library, which proves that no BLAS-3 routine is called in that kernel.

I want to stress that I have not seen such behaviour in Armadillo, Eigen or Julia. Blaze is the outlier here, which is why I believe it is a bug, and an important one: I have multiple examples of longer matrix chains where even a single transposition increases the execution time for Blaze from approximately 0.03 s to 2.2 s. Unfortunately, I don't have time right now to investigate it further and propose a PR fixing this problem.

Comments (5)

  1. Klaus Iglberger

    Hi Marcin!

    Thanks a lot for pointing out this performance defect. We can reproduce the behaviour and agree with the assessment that it is a bug. We apologize for the inconvenience and promise to fix it immediately.

    Best regards,

    Klaus!

  2. Klaus Iglberger

    Commit f13d44f resolves the issue by adding the missing specialization of a required type trait. The fix is immediately available by cloning the Blaze repository and will be officially released in Blaze 3.3.

  3. Marcin Copik reporter

    @eagle42 Thanks for such a swift response! I'll run another benchmarking round on the cluster and I'll let you know if there are any weird results for Blaze.
