- changed title to Performance issue: DMatScalarMultExpr ctor is slow
Performance issue: DMatScalarMultExpr ctor is slow
Here is the benchmark, which multiplies the lower-triangular part of a StaticMatrix
with a scalar in two different ways:
#include <blaze/Math.h>
#include <benchmark/benchmark.h>
namespace tmpc :: benchmark
{
template <typename Real, size_t N, bool SO>
static void BM_LowerMatrixScalarMultiplyStatic(::benchmark::State& state)
{
blaze::StaticMatrix<Real, N, N, SO> A;
randomize(A);
for (auto _ : state)
{
for (size_t k = 0; k < N; ++k)
{
size_t const rs = N - k - 1;
auto A21 = submatrix(A, k + 1, k, rs, 1);
A21 *= 1.1;
}
::benchmark::DoNotOptimize(A(N - 1, N - 1));
}
}
template <typename Real, size_t N, bool SO>
static void BM_LowerMatrixScalarMultiplyStaticLoop(::benchmark::State& state)
{
blaze::StaticMatrix<Real, N, N, SO> A;
randomize(A);
for (auto _ : state)
{
for (size_t k = 0; k < N; ++k)
{
size_t const rs = N - k - 1;
auto A21 = submatrix(A, k + 1, k, rs, 1);
for (size_t i = 0; i < rs; ++i)
A21(i, 0) *= 1.1;
}
::benchmark::DoNotOptimize(A(N - 1, N - 1));
}
}
BENCHMARK_TEMPLATE(BM_LowerMatrixScalarMultiplyStatic, double, 5, blaze::columnMajor);
BENCHMARK_TEMPLATE(BM_LowerMatrixScalarMultiplyStaticLoop, double, 5, blaze::columnMajor);
}
Compiler: ++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0
Compiler options: -O2 -g -DNDEBUG -save-temps -march=native
CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
Benchmark command line:
build/bin/tmpc_bench --benchmark_filter="BM_LowerMatrixScalarMultiply*" --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
Benchmark output:
2019-09-10 23:11:10
Run on (4 X 3600 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
--------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------------------------------
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_mean 61 ns 61 ns 10598612
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_median 61 ns 61 ns 10598612
BM_LowerMatrixScalarMultiplyStatic<double, 5, blaze::columnMajor>_stddev 0 ns 0 ns 10598612
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_mean 5 ns 5 ns 122790225
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_median 5 ns 5 ns 122790225
BM_LowerMatrixScalarMultiplyStaticLoop<double, 5, blaze::columnMajor>_stddev 0 ns 0 ns 122790225
One can see that the version where submatrix*scalar multiplication is written as a for loop is about 12 times faster.
Profiling the BM_LowerMatrixScalarMultiplyStatic
benchmark shows the following:
Surprisingly,
DMatScalarMultExpr
ctor takes 70% of the time and is the bottleneck.
Comments (6)
-
reporter -
Hi Misha!
Thanks a lot for reporting this problem and for the detailed analysis. We will try to fix the underlying problem as quickly as possible.
Best regards,
Klaus!
-
reporter - edited description
-
reporter - edited description
-
reporter I played a little with the benchmark, and found that changing
A21 *= 1.1
tocolumn(A21) *= 1.1;
makes the submatrix benchmark ~7 times faster:2019-09-15 22:14:32 Run on (4 X 3600 MHz CPU s) CPU Caches: L1 Data 32K (x4) L1 Instruction 32K (x4) L2 Unified 256K (x4) L3 Unified 6144K (x1) --------------------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------------------------------------------------------------- BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 5, blaze::columnMajor>_mean 61 ns 61 ns 10627919 BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 5, blaze::columnMajor>_median 61 ns 61 ns 10627919 BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 5, blaze::columnMajor>_stddev 0 ns 0 ns 10627919 BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 5, blaze::columnMajor>_mean 8 ns 8 ns 84947740 BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 5, blaze::columnMajor>_median 8 ns 8 ns 84947740 BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 5, blaze::columnMajor>_stddev 0 ns 0 ns 84947740 BM_LowerMatrixScalarMultiplyStatic_Loop<double, 5, blaze::columnMajor>_mean 6 ns 6 ns 108627940 BM_LowerMatrixScalarMultiplyStatic_Loop<double, 5, blaze::columnMajor>_median 6 ns 6 ns 108627940 BM_LowerMatrixScalarMultiplyStatic_Loop<double, 5, blaze::columnMajor>_stddev 0 ns 0 ns 108627940 BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 50, blaze::columnMajor>_mean 879 ns 879 ns 766742 BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 50, blaze::columnMajor>_median 879 ns 879 ns 766742 BM_LowerMatrixScalarMultiplyStatic_Submatrix<double, 50, blaze::columnMajor>_stddev 3 ns 3 ns 766742 BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 50, blaze::columnMajor>_mean 171 ns 171 ns 4064772 BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 50, blaze::columnMajor>_median 171 ns 171 ns 4064772 BM_LowerMatrixScalarMultiplyStatic_SubmatrixColumn<double, 50, blaze::columnMajor>_stddev 0 ns 0 ns 4064772 BM_LowerMatrixScalarMultiplyStatic_Loop<double, 50, blaze::columnMajor>_mean 587 ns 587 ns 1125954 BM_LowerMatrixScalarMultiplyStatic_Loop<double, 50, blaze::columnMajor>_median 588 ns 588 ns 1125954 BM_LowerMatrixScalarMultiplyStatic_Loop<double, 50, blaze::columnMajor>_stddev 1 ns 1 ns 1125954
For a 5x5 matrix, the plain loop is the fastest; for a 50x50 matrix the
column(submatrix(…))
version is ~3.5 times faster than the plain loop (probably due to vectorization), and thesubmatrix()
version is the slowest for both big and small matrix cases. -
reporter - attached LowerMatrixScalarMultiply.cpp
Benchmark source
- Log in to comment