DynamicMatrix elementwise multiplication performance

Issue #402 resolved
Zdeněk Hrazdíra created an issue

I am quite new to the Blaze library, and I want to start using it mainly for image processing. I am using a blaze::DynamicMatrix<float> container for my image. The first thing I did was to benchmark the performance of elementwise multiplication (via the % operator) and compare it to the OpenCV library (which, unlike Blaze, is not known for being particularly fast or efficient). In my benchmark, I am getting pretty bad results with Blaze:

OpenCV: 58ms
Blaze:  240ms

I am benchmarking only the multiplication part and repeating it 1000 times to reduce noise. I know there may be many Blaze settings I could tune to increase performance, and that this benchmark is not very robust, but I still expected Blaze to beat OpenCV here, even with default settings (BLAS not enabled). My benchmark code (all matrices are preallocated and have exactly the same content):

{
  LOG_FUNCTION("OpenCV matmul");
  for (int i = 0; i < iters; ++i)
    imgOpenCVOut = imgOpenCV.mul(imgOpenCV);
}

{
  LOG_FUNCTION("Blaze matmul");
  for (int i = 0; i < iters; ++i)
    imgBlazeOut = imgBlaze % imgBlaze;
}

I guess I am doing something wrong. Any ideas what it might be?
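
For readers who want to repeat this kind of measurement without the logging macros used above, the timing loop can be sketched with the standard library alone. This is only an illustration under stated assumptions: `hadamard` and `timeMs` are hypothetical names, and a plain `std::vector<float>` stands in for both the OpenCV and the Blaze matrix. It is not how either library implements the product.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the elementwise (Hadamard) product under test.
// Assumes a, b and out all have the same size.
void hadamard(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& out)
{
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = a[i] * b[i];
}

// Hypothetical helper: runs `fn` `iters` times and returns the total
// elapsed wall-clock time in milliseconds, using a monotonic clock.
template <typename Fn>
double timeMs(int iters, Fn&& fn)
{
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i)
    fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `timeMs(1000, [&] { hadamard(a, b, out); })` would then correspond to one of the two loops above. Reading a value from `out` after the loop helps keep the optimizer from discarding the work.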

Comments (8)

  1. Klaus Iglberger

    Hi Zdeněk!

    Thanks a lot for taking the time to report this potential defect. In order to give us all information to investigate, could you please give us a minimum example containing the exact matrix dimensions and types of matrices? Also, could you please post the compilation flags that you use? Thanks a lot,

    Best regards,

    Klaus!

  2. Zdeněk Hrazdíra reporter

    I tried the same benchmark for multiple image sizes (see attached pictures). Furthermore, I can clearly see that Blaze uses multithreading (a roughly steady 90% overall CPU utilization on 6 cores), while OpenCV does not (single core), and yet Blaze is still slower for some reason. The full code I used for the benchmark:

        auto iters = 1000;
        auto path = "Resources/gui.png";
        auto imgOpenCV = loadImage(path);
        auto imgBlaze = LoadImageBlaze(path);
        auto imgOpenCVOut = imgOpenCV.clone();
        auto imgBlazeOut = imgBlaze;
    
        LOG_INFO("Matrix size: {}", imgOpenCV.size());
    
        {
          LOG_FUNCTION("OpenCV matmul");
          for (int i = 0; i < iters; ++i)
            imgOpenCVOut = imgOpenCV.mul(imgOpenCV);
        }
    
        {
          LOG_FUNCTION("Blaze matmul");
          for (int i = 0; i < iters; ++i)
            imgBlazeOut = imgBlaze % imgBlaze;
        }
    

    where the LoadImageBlaze function is just a manual reassignment from the OpenCV matrix, as follows:

    using BlazeMat = blaze::DynamicMatrix<float>;
    
    inline BlazeMat LoadImageBlaze(const std::string& path)
    {
      Mat img = loadImage(path);
    
      BlazeMat out(img.rows, img.cols);
      for (int r = 0; r < out.rows(); ++r)
        for (int c = 0; c < out.columns(); ++c)
          out(r, c) = img.at<float>(r, c);
    
      return out;
    }
    

    I am using Visual Studio 2017, so the compiler options are auto-generated, but here they are:

    /Yu"stdafx.h" /MP /GS /W1 /Zc:wchar_t /Gm- /O2 /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "UNICODE" /D "_UNICODE" /D "WIN32" /D "_ENABLE_EXTENDED_ALIGNED_STORAGE" /D "WIN64" /D "QT_NO_DEBUG" /D "QT_PRINTSUPPORT_LIB" /D "QT_WIDGETS_LIB" /D "QT_GUI_LIB" /D "QT_CORE_LIB" /D "NDEBUG" /errorReport:prompt /WX- /Zc:forScope /Gd /MD /openmp /std:c++17 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\Zdeny_PhD_Shenanigans.pch" /diagnostics:classic 
    

    I hope this helps.

    Thanks for the quick response,

    Zdeněk

  3. Klaus Iglberger

    Hi Zdeněk!

    Since I of course cannot reproduce exactly the same scenario, could you please disable OpenMP (/openmp) and explicitly enable vectorization (SSE or AVX, depending on your machine; e.g. /arch:AVX)? Our suspicion is that parallelization for such small matrices incurs considerable overhead and that currently no vectorization is used. Thanks for the additional effort,

    Best regards,

    Klaus!

  4. Zdeněk Hrazdíra reporter

    Thanks for the info, I will definitely try that. However, it still seems weird to me that even for a 4096x4096 matrix the single-core approach with OpenCV is faster than the multicore approach with Blaze and OpenMP. That makes no sense to me.

    I’ll post my results once I have enabled vectorization explicitly and tried it with OpenMP both enabled and disabled.

    Zdeněk

  5. Zdeněk Hrazdíra reporter

    I realized that disabling OpenMP is not an option for me, since I use it in some of my own code. I just explicitly enabled /arch:AVX2, and the Blaze version is now significantly faster than the OpenCV version. I did not think I had to enable it explicitly. Thanks for the suggestion!

    Best regards,

    Zdeněk
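
The fix above came down to which instruction set the compiler was told to target. A quick way to check this in one's own build is to inspect the predefined SIMD macros, as in the minimal sketch below. It uses only the standard library and is not part of Blaze; the helper name `simdLevel` is hypothetical. With MSVC, `__AVX__` and `__AVX2__` are defined only when a matching `/arch` option is passed. With GCC and Clang, they depend on options such as `-mavx2` or `-march=native`.

```cpp
#include <string>

// Hypothetical helper: reports the widest SIMD instruction set that the
// compiler advertises through predefined macros for this translation unit.
std::string simdLevel()
{
#if defined(__AVX512F__)
  return "AVX-512";
#elif defined(__AVX2__)
  return "AVX2";
#elif defined(__AVX__)
  return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  // x64 targets always support SSE2, even when no AVX option is given.
  return "SSE2";
#else
  return "none detected";
#endif
}
```

Printing `simdLevel()` once at program start, before and after adding /arch:AVX2, should show whether the option actually reached the compiler.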

  6. Klaus Iglberger

    Hi Zdeněk!

    I’m glad I could help. For your reference: If you ever need to, you can deactivate parallelization in Blaze via the compilation flag BLAZE_USE_SHARED_MEMORY_PARALLELIZATION (see the wiki).

    If you come across any other problems, please let us know.

    Best regards,

    Klaus!
