DynamicMatrix elementwise multiplication performance
I am quite new to the Blaze library, and I want to start using it mainly for image processing. I am using a blaze::DynamicMatrix<float> container for my image. The first thing I did was benchmark elementwise multiplication (via the % operator) and compare it to the OpenCV library (which is not known to be superfast and efficient, contrary to the Blaze library). In my benchmark, I am getting pretty bad results with Blaze:
OpenCV: 58ms
Blaze: 240ms
I am benchmarking only the multiplication part, and repeating it 1000 times for less noise. I know there might be many Blaze settings that I can fiddle with, to increase performance, and that this benchmark is not super robust, but I still expected Blaze to beat OpenCV in this benchmark, even with default settings (BLAS not enabled). My benchmark code (all matrices are preallocated, and with exactly the same content):
    {
        LOG_FUNCTION("OpenCV matmul");
        for (int i = 0; i < iters; ++i)
            imgOpenCVOut = imgOpenCV.mul(imgOpenCV);
    }
    {
        LOG_FUNCTION("Blaze matmul");
        for (int i = 0; i < iters; ++i)
            imgBlazeOut = imgBlaze % imgBlaze;
    }
I guess I am doing something wrong. Any ideas what it might be?
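For anyone who wants to try the same measurement pattern without either library, here is a minimal sketch of it: pre-allocated buffers, only the multiplication inside the timed region, many repetitions. It uses a plain std::vector<float> instead of Blaze or OpenCV, and the helper names schurProduct and timeSchurMs are hypothetical, not part of either library. Absolute timings will depend heavily on compiler flags, so treat it as a template rather than a reference result.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise (Schur) product, out[i] = a[i] * b[i].
// All three buffers must already have the same size (no allocation here).
void schurProduct(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::vector<float>& out)
{
    assert(a.size() == b.size() && a.size() == out.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i];
}

// Hypothetical helper: repeats the product `iters` times and returns the
// total elapsed wall-clock time in milliseconds.
double timeSchurMs(const std::vector<float>& a,
                   const std::vector<float>& b,
                   std::vector<float>& out,
                   int iters)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        schurProduct(a, b, out);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling timeSchurMs(img, img, out, 1000) with a buffer of rows * cols floats mirrors the loop above. Building it once with and once without an explicit SIMD target is an easy way to see how much the flags alone change the number.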
Comments (8)
-
Hi Zdeněk!
Thanks a lot for taking the time to report this potential defect. To give us all the information we need to investigate, could you please provide a minimal example containing the exact matrix dimensions and matrix types? Also, could you please post the compilation flags that you use? Thanks a lot,
Best regards,
Klaus!
-
reporter I tried the same benchmark for multiple image sizes (see attached pictures). Furthermore, I can clearly see that Blaze uses multithreading (a roughly steady 90% overall CPU utilization on 6 cores), while OpenCV does not (single core), and yet Blaze is still slower for some reason. The full code I used for the benchmark:
    auto iters = 1000;
    auto path = "Resources/gui.png";
    auto imgOpenCV = loadImage(path);
    auto imgBlaze = LoadImageBlaze(path);
    auto imgOpenCVOut = imgOpenCV.clone();
    auto imgBlazeOut = imgBlaze;
    LOG_INFO("Matrix size: {}", imgOpenCV.size());
    {
        LOG_FUNCTION("OpenCV matmul");
        for (int i = 0; i < iters; ++i)
            imgOpenCVOut = imgOpenCV.mul(imgOpenCV);
    }
    {
        LOG_FUNCTION("Blaze matmul");
        for (int i = 0; i < iters; ++i)
            imgBlazeOut = imgBlaze % imgBlaze;
    }
where the LoadImageBlaze function is just a manual reassignment from the OpenCV matrix, as follows:
    using BlazeMat = blaze::DynamicMatrix<float>;
    inline BlazeMat LoadImageBlaze(const std::string& path)
    {
        Mat img = loadImage(path);
        BlazeMat out(img.rows, img.cols);
        for (int r = 0; r < out.rows(); ++r)
            for (int c = 0; c < out.columns(); ++c)
                out(r, c) = img.at<float>(r, c);
        return out;
    }
I am using Visual Studio 2017, so the compiler options are auto-generated, but here they are:
/Yu"stdafx.h" /MP /GS /W1 /Zc:wchar_t /Gm- /O2 /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "UNICODE" /D "_UNICODE" /D "WIN32" /D "_ENABLE_EXTENDED_ALIGNED_STORAGE" /D "WIN64" /D "QT_NO_DEBUG" /D "QT_PRINTSUPPORT_LIB" /D "QT_WIDGETS_LIB" /D "QT_GUI_LIB" /D "QT_CORE_LIB" /D "NDEBUG" /errorReport:prompt /WX- /Zc:forScope /Gd /MD /openmp /std:c++17 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\Zdeny_PhD_Shenanigans.pch" /diagnostics:classic
I hope this helps.
Thanks for the quick response,
Zdeněk
-
Hi Zdeněk!
Since of course I cannot reproduce exactly the same scenario, could you please disable OpenMP (/openmp) and explicitly enable vectorization (SSE or AVX, depending on your machine; e.g. /arch:AVX)? Our suspicion is that parallelization for such small matrices indeed incurs much overhead and that currently no vectorization is used. Thanks for the additional effort,
Best regards,
Klaus!
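A quick way to check the second suspicion is to ask the compiler which SIMD level it was actually told to target. The sketch below relies only on predefined compiler macros (__AVX__, __AVX2__ and __AVX512F__ are set by MSVC for the matching /arch option and by GCC/Clang for the matching -m flags); the function name compiledSimdLevel is hypothetical. It reports what the build targets, not what the CPU supports, and it says nothing about how a given library uses those settings internally.

```cpp
#include <string>

// Hypothetical helper: names the widest SIMD instruction set enabled for this
// translation unit, judged purely from predefined compiler macros.
std::string compiledSimdLevel()
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    // MSVC does not define __SSE2__; x64 targets always include SSE2, and
    // 32-bit targets signal it through _M_IX86_FP.
    return "SSE2";
#else
    return "none detected";
#endif
}
```

Logging the result of compiledSimdLevel() once at startup, next to the benchmark output, makes it obvious whether a flag such as /arch:AVX or /arch:AVX2 really reached the build.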
-
reporter Thanks for the info, I will definitely try that. However, it still seems weird to me that even for a 4096x4096 matrix the single-core approach with OpenCV is faster than the multicore approach with Blaze and OpenMP. Makes no sense.
I'll post my results once I have enabled vectorization explicitly and tried it with OpenMP both enabled and disabled.
Zdeněk
-
reporter I realized that disabling OpenMP is not an option for me, since I use it in some of my own code. I just explicitly enabled /arch:AVX2, and the Blaze version is now significantly faster than the OpenCV version. I thought I did not have to do it explicitly, thanks for the suggestion!
Best regards,
Zdeněk
-
Hi Zdeněk!
I’m glad I could help. For your reference: If you ever need to, you can deactivate parallelization in Blaze via the compilation flag BLAZE_USE_SHARED_MEMORY_PARALLELIZATION (see the wiki). If you come across any other problems, please let us know.
Best regards,
Klaus!
-
- changed status to resolved