DynamicMatrix elementwise multiplication performance

Issue #402 resolved

Zdeněk Hrazdíra created an issue 2021-03-18

I am quite new to the Blaze library, and I want to start using it mainly for image processing. I am using a blaze::DynamicMatrix<float> container for my image. The first thing I did was to benchmark the performance of elementwise multiplication (via the % operator) and compare it to the OpenCV library (which is not known to be superfast and efficient, contrary to the Blaze library). In my benchmark, I am getting pretty bad results with Blaze:

OpenCV: 58ms
Blaze:  240ms

I am benchmarking only the multiplication part, and repeating it 1000 times for less noise. I know there might be many Blaze settings that I can fiddle with, to increase performance, and that this benchmark is not super robust, but I still expected Blaze to beat OpenCV in this benchmark, even with default settings (BLAS not enabled). My benchmark code (all matrices are preallocated, and with exactly the same content):

{
  LOG_FUNCTION("OpenCV matmul");
  for (int i = 0; i < iters; ++i)
    imgOpenCVOut = imgOpenCV.mul(imgOpenCV);
}

{
  LOG_FUNCTION("Blaze matmul");
  for (int i = 0; i < iters; ++i)
    imgBlazeOut = imgBlaze % imgBlaze;
}

I guess I am doing something wrong. Any ideas what it might be?

Comments (8)

Zdeněk Hrazdíra reporter
- edited description
- 2021-03-18T16:28:53+00:00
Klaus Iglberger
Hi Zdeněk!

Thanks a lot for taking the time to report this potential defect. In order to give us all information to investigate, could you please give us a minimum example containing the exact matrix dimensions and types of matrices? Also, could you please post the compilation flags that you use? Thanks a lot,

Best regards,

Klaus!
- 2021-03-18T17:15:08+00:00

Zdeněk Hrazdíra reporter

I tried the same benchmark for multiple image sizes (see attached pictures). Furthermore, I can clearly see that Blaze uses multithreading (a roughly steady 90% overall CPU utilization on 6 cores), while OpenCV does not (single core), and yet Blaze is still slower for some reason. The full code I used for the benchmark:

    auto iters = 1000;
    auto path = "Resources/gui.png";
    auto imgOpenCV = loadImage(path);
    auto imgBlaze = LoadImageBlaze(path);
    auto imgOpenCVOut = imgOpenCV.clone();
    auto imgBlazeOut = imgBlaze;

    LOG_INFO("Matrix size: {}", imgOpenCV.size());

    {
      LOG_FUNCTION("OpenCV matmul");
      for (int i = 0; i < iters; ++i)
        imgOpenCVOut = imgOpenCV.mul(imgOpenCV);
    }

    {
      LOG_FUNCTION("Blaze matmul");
      for (int i = 0; i < iters; ++i)
        imgBlazeOut = imgBlaze % imgBlaze;
    }

where the LoadImageBlaze function is just a manual element-by-element copy from the OpenCV matrix, as follows:

using BlazeMat = blaze::DynamicMatrix<float>;

inline BlazeMat LoadImageBlaze(const std::string& path)
{
  Mat img = loadImage(path);

  BlazeMat out(img.rows, img.cols);
  for (int r = 0; r < out.rows(); ++r)
    for (int c = 0; c < out.columns(); ++c)
      out(r, c) = img.at<float>(r, c);

  return out;
}

I am using Visual Studio 2017, so the compiler options are auto-generated, but here they are:

/Yu"stdafx.h" /MP /GS /W1 /Zc:wchar_t /Gm- /O2 /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "UNICODE" /D "_UNICODE" /D "WIN32" /D "_ENABLE_EXTENDED_ALIGNED_STORAGE" /D "WIN64" /D "QT_NO_DEBUG" /D "QT_PRINTSUPPORT_LIB" /D "QT_WIDGETS_LIB" /D "QT_GUI_LIB" /D "QT_CORE_LIB" /D "NDEBUG" /errorReport:prompt /WX- /Zc:forScope /Gd /MD /openmp /std:c++17 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\Zdeny_PhD_Shenanigans.pch" /diagnostics:classic

I hope this helps.

Thanks for the quick response,

Zdeněk

- 2021-03-18T19:02:09+00:00

Klaus Iglberger
Hi Zdeněk!

Since of course I cannot reproduce exactly the same scenario, could you please disable OpenMP (/openmp) and explicitly enable vectorization (SSE or AVX, depending on your machine; e.g. /arch:AVX)? Our suspicion is that parallelization for such small matrices incurs significant overhead and that currently no vectorization is used. Thanks for the additional effort,

Best regards,

Klaus!
- 2021-03-18T20:26:20+00:00
Zdeněk Hrazdíra reporter
Thanks for the info, I will definitely try that. However, it still seems weird to me that even for a 4096x4096 matrix the single-core approach with OpenCV is faster than the multicore approach with Blaze and OpenMP. That makes no sense to me.

I’ll post my results once I try to enable vectorization explicitly and try that with openmp enabled & disabled.

Zdeněk
- 2021-03-18T21:31:45+00:00
Zdeněk Hrazdíra reporter
I realized that disabling OpenMP is not an option for me, since I use it in some of my own code. I just explicitly enabled /arch:AVX2, and the Blaze version is now significantly faster than the OpenCV version. I thought I did not have to do it explicitly. Thanks for the suggestion!

Best regards,

Zdeněk
- 2021-03-18T22:14:36+00:00
Klaus Iglberger
Hi Zdeněk!

I’m glad I could help. For your reference: if you ever need to, you can deactivate parallelization in Blaze by setting the compilation flag BLAZE_USE_SHARED_MEMORY_PARALLELIZATION to 0 (see the wiki).

If you come across any other problems, please let us know.

Best regards,

Klaus!
- 2021-03-19T06:26:12+00:00
Klaus Iglberger
- changed status to resolved
- 2021-03-19T06:26:19+00:00

Assignee: –

Type: task

Priority: minor

Status: resolved

Votes: 0

Watchers: 1