Performance issue: DMatScalarMultExpr ctor is slow

Here is the benchmark, which multiplies the strictly lower-triangular part of a StaticMatrix by a scalar in two different ways:

#include <blaze/Math.h>

#include <benchmark/benchmark.h>


namespace tmpc :: benchmark
{
    template <typename Real, size_t N, bool SO>
    static void BM_LowerMatrixScalarMultiplyStatic(::benchmark::State& state)
    {
        blaze::StaticMatrix<Real, N, N, SO> A;        
        randomize(A);

        for (auto _ : state)
        {
            for (size_t k = 0; k < N; ++k)
            {
                size_t const rs = N - k - 1;
                auto A21 = submatrix(A, k + 1, k, rs, 1);

                A21 *= 1.1;
            }

            ::benchmark::DoNotOptimize(A(N - 1, N - 1));
        }
    }


    template <typename Real, size_t N, bool SO>
    static void BM_LowerMatrixScalarMultiplyStaticLoop(::benchmark::State& state)
    {
        blaze::StaticMatrix<Real, N, N, SO> A;
        randomize(A);

        for (auto _ : state)
        {
            for (size_t k = 0; k < N; ++k)
            {
                size_t const rs = N - k - 1;
                auto A21 = submatrix(A, k + 1, k, rs, 1);

                for (size_t i = 0; i < rs; ++i)
                    A21(i, 0) *= 1.1;
            }

            ::benchmark::DoNotOptimize(A(N - 1, N - 1));
        }
    }


    BENCHMARK_TEMPLATE(BM_LowerMatrixScalarMultiplyStatic, double, 5, blaze::columnMajor);
    BENCHMARK_TEMPLATE(BM_LowerMatrixScalarMultiplyStaticLoop, double, 5, blaze::columnMajor);
}
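For reference, both benchmark bodies are meant to perform the same arithmetic: for each column k, scale the rs = N - k - 1 elements below the diagonal. For N = 5 that is 4 + 3 + 2 + 1 + 0 = 10 multiplications per iteration. The following Blaze-free sketch spells this out. The `Matrix` alias and the `scaleStrictlyLower` function are hypothetical names introduced only for illustration, and the flat column-major layout is an assumption that ignores any padding Blaze may add.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a column-major N x N matrix of doubles:
// element (i, j) is stored at index i + j * N (no padding assumed).
template <std::size_t N>
using Matrix = std::array<double, N * N>;

// Scales every element strictly below the diagonal by `factor`,
// column by column, mirroring the loop over k in the benchmark.
template <std::size_t N>
void scaleStrictlyLower(Matrix<N>& A, double factor)
{
    for (std::size_t k = 0; k < N; ++k)
    {
        // Number of rows below the diagonal in column k.
        std::size_t const rs = N - k - 1;

        // Pointer to element (k + 1, k); for k == N - 1 this is the
        // one-past-the-end pointer and is never dereferenced (rs == 0).
        double* const A21 = A.data() + (k + 1) + k * N;

        for (std::size_t i = 0; i < rs; ++i)
            A21[i] *= factor;
    }
}
```

Since N cannot be deduced from the array size, it has to be given explicitly, e.g. `scaleStrictlyLower<5>(A, 1.1)`.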

Compiler: g++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0

Compiler options: -O2 -g -DNDEBUG -save-temps -march=native

CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

Benchmark command line:

build/bin/tmpc_bench --benchmark_filter="BM_LowerMatrixScalarMultiply*" --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

Benchmark output:

2019-09-10 23:11:10
Run on (4 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time           CPU Iterations
--------------------------------------------------------------------------------------------------------------------
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_mean               61 ns         61 ns   10598612
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_median             61 ns         61 ns   10598612
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_stddev              0 ns          0 ns   10598612
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_mean            5 ns          5 ns  122790225
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_median          5 ns          5 ns  122790225
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_stddev          0 ns          0 ns  122790225

The version that writes the submatrix-by-scalar multiplication as an explicit for loop is about 12 times faster (5 ns vs. 61 ns).

Profiling the BM_LowerMatrixScalarMultiplyStatic benchmark shows that, surprisingly, the DMatScalarMultExpr constructor takes 70% of the time and is the bottleneck.
