- changed status to resolved
Curious Benchmark Results (Vectorization)
I’ve been trying to understand (and measure) the impact of vectorization, i.e. `BLAZE_USE_VECTORIZATION 0` vs `BLAZE_USE_VECTORIZATION 1`. As a very simple test, I tried adding 2 vectors N times and measuring the duration, since one would think that vectorization should yield much better (i.e. faster) results.
```cpp
const uint64_t startTime = getNanos();
for (size_t i = 0; i < N; ++i)
    w = u + v;
const uint64_t duration = getNanos() - startTime;
```
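`getNanos()` is not shown in the post; a minimal self-contained sketch of the timing harness, using `std::chrono` and a hypothetical `timeAdditions` helper (the actual script is attached to the issue, not reproduced here), might look like:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the post's getNanos(): monotonic clock in nanoseconds.
uint64_t getNanos() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}

// Time N repetitions of the element-wise addition w = u + v and
// return the elapsed wall-clock time in nanoseconds.
uint64_t timeAdditions(std::size_t N, const std::vector<double>& u,
                       const std::vector<double>& v, std::vector<double>& w) {
    const uint64_t startTime = getNanos();
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < u.size(); ++j)
            w[j] = u[j] + v[j];
    return getNanos() - startTime;
}
```

One caveat for any benchmark of this shape: since the result of each repetition is identical, an optimizing compiler may hoist or even elide the repeated work entirely. Consuming `w` after the loop (e.g. summing it and printing the sum) helps keep the measurement honest.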
I tested 3 types (short script attached):

- `std::vector<double>`, for which I provided a naive `operator+` function
- `blaze::DynamicVector<double>` with `BLAZE_USE_VECTORIZATION 0` (vectorization turned off)
- `blaze::DynamicVector<double>` with `BLAZE_USE_VECTORIZATION 1` (vectorization turned on)
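The naive `operator+` for `std::vector<double>` is not shown in the post; a straightforward version (one allocation per call, no expression templates) might look like this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive element-wise addition: allocates a fresh result vector on every call.
// Blaze's expression templates avoid such temporaries by fusing the loop,
// which is one reason one would expect Blaze to be at least as fast.
std::vector<double> operator+(const std::vector<double>& u,
                              const std::vector<double>& v) {
    assert(u.size() == v.size());
    std::vector<double> w(u.size());
    for (std::size_t i = 0; i < u.size(); ++i)
        w[i] = u[i] + v[i];
    return w;
}
```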
What is curious is that on MSVC, `std::vector` performs much better than `blaze::DynamicVector` without vectorization (at least for larger sizes). I would have expected Blaze to be at least as fast (if not a little faster, due to expression templates and general optimizations). What is as expected, though, is that `blaze::DynamicVector` with vectorization clearly performs best (it is indeed ~2x faster than its non-vectorized counterpart).
On g++ (cygwin/Ubuntu), it appears that `std::vector` does indeed perform the worst (as expected), but there is hardly any difference between `blaze::DynamicVector` with or without vectorization. I would have expected the vectorized version to be clearly faster, but it is not. It seems as if vectorization is enabled regardless of whether we set `BLAZE_USE_VECTORIZATION` to `0` or `1`, which is surprising.
Questions:

- On MSVC, how come `std::vector` performs better than `blaze::DynamicVector` without vectorization?
- On g++, how come vectorization appears to be enabled for `blaze::DynamicVector` regardless of the `BLAZE_USE_VECTORIZATION` flag?
- Is my benchmarking procedure even meaningful? Are there better ways to test this?
Side note: I compiled with MSVC in VS 2019 (Release build with `/O2`), and for g++ I used `g++ Script1.cpp -O3 -o Script1 -std=c++17 -I .` to compile the scripts.
Comments (5)
reporter Hi Klaus,

Thanks for the quick reply.

So are you saying that (e.g. on g++) we only see an improvement of `BLAZE_USE_VECTORIZATION 1` over `BLAZE_USE_VECTORIZATION 0` for smaller vector sizes (up to ~256), because of memory bandwidth limits?

How come this reasoning does not seem to apply to MSVC? We see `std::vector` being faster than `blaze::DynamicVector` without vectorization, and `BLAZE_USE_VECTORIZATION 1` vs `BLAZE_USE_VECTORIZATION 0` yielding (almost exactly) the expected 2x speed improvement (for any vector size, even), whereas on g++ we don’t see the 2x improvement at all (not even for smaller vector sizes).
Hi Phil!

> So are you saying that (e.g. on g++) we only see an improvement of `BLAZE_USE_VECTORIZATION 1` over `BLAZE_USE_VECTORIZATION 0` for smaller vector sizes (up to ~256), because of memory bandwidth limits?

That is to be expected, yes. Depending on your architecture, vector addition will eventually be memory bandwidth limited, i.e. vectorization will not pay off anymore. Without knowing your hardware I cannot predict when this will happen. However, if you turn off vectorization by means of `BLAZE_USE_VECTORIZATION 0`, the compiler might still perform some vectorization itself (which is relatively simple for vector additions). Therefore it is hard to predict anything. Also, vectorization depends on your compilation flags. Without knowing which flags you use (e.g. `-msse4.2`, `-mavx`, …) I cannot give more information on what is to be expected.

> How come this reasoning does not seem to apply to MSVC? Since we see `std::vector` being faster than `blaze::DynamicVector` without vectorization, and `BLAZE_USE_VECTORIZATION 1` vs `BLAZE_USE_VECTORIZATION 0` yielding (almost exactly) the expected 2x speed improvement (for any vector size, even). Whereas on g++ we don’t see the 2x improvement at all (not even for smaller vector sizes).

Please consider that the base performance of MSVC is worse than the base performance of GCC. Apparently MSVC does not optimize as aggressively or as well as GCC does. The results seem to confirm that, as it has more problems optimizing the Blaze code (which is nested more deeply and does more optimization manually) than the STL code. But again, without knowing what you are doing there is little I can do to explain the performance.

Best regards,
Klaus!
reporter Thanks, Klaus. I think I am starting to understand the differences better now.

In any case, here are the compiler details:

- MSVC (copied from the log):
  `MSVC\14.27.29110\bin\Hostx64\x64\cl.exe /nologo /TP -D_CRT_SECURE_NO_WARNINGS -I..\ /DWIN32 /D_WINDOWS /GR /EHsc /O2 /Ob2 /DNDEBUG -MD /MP /wd4127 /O2 -std:c++17 /showIncludes /FoCMakeFiles\Script0.dir\Script0.cpp.obj /FdCMakeFiles\Script0.dir\ /FS -c ..\Script1.cpp`
  Note, I provided `-D_CRT_SECURE_NO_WARNINGS` to suppress warnings about the usage of `localtime` (as opposed to `localtime_s`) and `/wd4127` to suppress warnings about `if` statements that could be `if constexpr`.
- GCC: `g++ Script1.cpp -O3 -o Script1 -std=c++17 -I .`
Hi Phil!

In both cases you don’t explicitly ask for vectorization (for GCC this would be, for example, `-msse4.2` or `-mavx`). Therefore the compiler chooses the kind of vectorization depending on your architecture. However, the compiler might also choose not to vectorize. I recommend using Compiler Explorer to get an impression of whether vectorization is performed or not (you can choose Blaze as a library).

Our only concern is that the manual vectorization works (i.e. the run with `BLAZE_USE_VECTORIZATION` set to `1`). Disabling vectorization puts you at the mercy of the compiler, which is something we cannot influence.

Best regards,
Klaus!
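For completeness: `BLAZE_USE_VECTORIZATION` is one of Blaze’s configuration macros, so (assuming the usual Blaze configuration mechanism) it can be overridden either on the command line (e.g. `-DBLAZE_USE_VECTORIZATION=0`) or by defining it before any Blaze header is included:

```cpp
// Must appear before the Blaze headers to override the default from
// blaze/config/Vectorization.h (assumed configuration mechanism).
#define BLAZE_USE_VECTORIZATION 0
#include <blaze/Blaze.h>
```

Note this only disables Blaze’s manual (intrinsics-based) vectorization; as Klaus points out, the compiler’s own auto-vectorizer may still vectorize the scalar fallback loops.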
Hi Phil!
Vector addition is a memory-bandwidth-limited operation. Therefore it is not to be expected that vectorization gives you any performance benefit for large vectors. Parallelization might help, depending on the size of the vectors and the underlying architecture, but it doesn't have to either.
In order to get an impression of the benefits of vectorization, you'll have to choose an operation that is less limited (or not limited at all) by memory bandwidth. That is for instance true for operations on small vectors and matrices that fit into the caches. Also, you might implement your own (naive) matrix multiplication and compare it to the performance of a vectorized implementation.
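The naive matrix multiplication Klaus suggests could be sketched as follows (plain `std::vector` storage and a hypothetical `matmul` name; not code from the thread). Unlike vector addition, each input element is reused n times, so the operation is compute-bound for cache-resident sizes and a vectorized implementation such as Blaze's should beat this loop clearly:

```cpp
#include <cstddef>
#include <vector>

// Naive dense matrix multiply C = A * B for n x n matrices
// stored row-major in flat std::vector<double>s.
std::vector<double> matmul(const std::vector<double>& A,
                           const std::vector<double>& B, std::size_t n) {
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {   // i-k-j order: unit-stride inner loop
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
    return C;
}
```

Timing this against `blaze::DynamicMatrix<double>` multiplication, for sizes that fit in cache, should make the effect of `BLAZE_USE_VECTORIZATION` far more visible than the vector-addition test.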
I hope this answers your questions,
Best regards,
Klaus!