Performance issue: DMatScalarMultExpr ctor is slow

Issue #288 new
Mikhail Katliar created an issue

Here is the benchmark, which multiplies the lower-triangular part of a StaticMatrix with a scalar in two different ways:

#include <blaze/Math.h>

#include <benchmark/benchmark.h>


namespace tmpc :: benchmark
{
    template <typename Real, size_t N, bool SO>
    static void BM_LowerMatrixScalarMultiplyStatic(::benchmark::State& state)
    {
        blaze::StaticMatrix<Real, N, N, SO> A;        
        randomize(A);

        for (auto _ : state)
        {
            for (size_t k = 0; k < N; ++k)
            {
                size_t const rs = N - k - 1;
                auto A21 = submatrix(A, k + 1, k, rs, 1);

                A21 *= 1.1;
            }

            ::benchmark::DoNotOptimize(A(N - 1, N - 1));
        }
    }


    template <typename Real, size_t N, bool SO>
    static void BM_LowerMatrixScalarMultiplyStaticLoop(::benchmark::State& state)
    {
        blaze::StaticMatrix<Real, N, N, SO> A;
        randomize(A);

        for (auto _ : state)
        {
            for (size_t k = 0; k < N; ++k)
            {
                size_t const rs = N - k - 1;
                auto A21 = submatrix(A, k + 1, k, rs, 1);

                for (size_t i = 0; i < rs; ++i)
                    A21(i, 0) *= 1.1;
            }

            ::benchmark::DoNotOptimize(A(N - 1, N - 1));
        }
    }


    BENCHMARK_TEMPLATE(BM_LowerMatrixScalarMultiplyStatic, double, 5, blaze::columnMajor);
    BENCHMARK_TEMPLATE(BM_LowerMatrixScalarMultiplyStaticLoop, double, 5, blaze::columnMajor);
}

Compiler: ++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0

Compiler options: -O2 -g -DNDEBUG -save-temps -march=native

CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

Benchmark command line:

build/bin/tmpc_bench --benchmark_filter="BM_LowerMatrixScalarMultiply*" --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

Benchmark output:

2019-09-10 23:11:10
Run on (4 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time           CPU Iterations
--------------------------------------------------------------------------------------------------------------------
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_mean               61 ns         61 ns   10598612
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_median             61 ns         61 ns   10598612
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_stddev              0 ns          0 ns   10598612
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_mean            5 ns          5 ns  122790225
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_median          5 ns          5 ns  122790225
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_stddev          0 ns          0 ns  122790225

One can see that the version where submatrix*scalar multiplication is written as a for loop is about 12 times faster.

Profiling the BM_LowerMatrixScalarMultiplyStatic benchmark shows the following:

Surprisingly, DMatScalarMultExpr ctor takes 70% of the time and is the bottleneck.

Comments (6)

  1. Klaus Iglberger

    Hi Misha!

    Thanks a lot for reporting this problem and for the detailed analysis. We will try to fix the underlying problem as quickly as possible.

    Best regards,

    Klaus!

  2. Mikhail Katliar reporter

    I played a little with the benchmark, and found that changing A21 *= 1.1 to column(A21) *= 1.1; makes the submatrix benchmark ~7 times faster:

    2019-09-15 22:14:32
    Run on (4 X 3600 MHz CPU s)
    CPU Caches:
      L1 Data 32K (x4)
      L1 Instruction 32K (x4)
      L2 Unified 256K (x4)
      L3 Unified 6144K (x1)
    ---------------------------------------------------------------------------------------------------------------------------------
    Benchmark                                                                                          Time           CPU Iterations
    ---------------------------------------------------------------------------------------------------------------------------------
    BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 5, blaze::columnMajor>_mean                  61 ns         61 ns   10627919
    BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 5, blaze::columnMajor>_median                61 ns         61 ns   10627919
    BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 5, blaze::columnMajor>_stddev                 0 ns          0 ns   10627919
    BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 5, blaze::columnMajor>_mean             8 ns          8 ns   84947740
    BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 5, blaze::columnMajor>_median           8 ns          8 ns   84947740
    BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 5, blaze::columnMajor>_stddev           0 ns          0 ns   84947740
    BM_LowerMatrixScalarMultiplyStatic_Loop<double, 5, blaze::columnMajor>_mean                        6 ns          6 ns  108627940
    BM_LowerMatrixScalarMultiplyStatic_Loop<double, 5, blaze::columnMajor>_median                      6 ns          6 ns  108627940
    BM_LowerMatrixScalarMultiplyStatic_Loop<double, 5, blaze::columnMajor>_stddev                      0 ns          0 ns  108627940
    BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 50, blaze::columnMajor>_mean                879 ns        879 ns     766742
    BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 50, blaze::columnMajor>_median              879 ns        879 ns     766742
    BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 50, blaze::columnMajor>_stddev                3 ns          3 ns     766742
    BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 50, blaze::columnMajor>_mean          171 ns        171 ns    4064772
    BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 50, blaze::columnMajor>_median        171 ns        171 ns    4064772
    BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 50, blaze::columnMajor>_stddev          0 ns          0 ns    4064772
    BM_LowerMatrixScalarMultiplyStatic_Loop<double, 50, blaze::columnMajor>_mean                     587 ns        587 ns    1125954
    BM_LowerMatrixScalarMultiplyStatic_Loop<double, 50, blaze::columnMajor>_median                   588 ns        588 ns    1125954
    BM_LowerMatrixScalarMultiplyStatic_Loop<double, 50, blaze::columnMajor>_stddev                     1 ns          1 ns    1125954
    

    For a 5x5 matrix, the plain loop is the fastest; for a 50x50 matrix the column(submatrix(…)) version is ~3.5 times faster than the plain loop (probably due to vectorization), and the submatrix() version is the slowest for both big and small matrix cases.

  3. Log in to comment