Curious Benchmark Results (Vectorization)

Issue #389 resolved
Phil created an issue

I’ve been trying to understand (and measure) the impact of vectorization, i.e. BLAZE_USE_VECTORIZATION 0 vs BLAZE_USE_VECTORIZATION 1. As a very simple test, I tried adding two vectors N times and measuring the duration, since one would expect vectorization to yield much better (i.e. faster) results.

const uint64_t startTime = getNanos();
for (size_t i = 0; i < N; ++i)
    w = u + v;
const uint64_t duration = getNanos() - startTime;
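
The getNanos() helper is not shown in the post. A minimal sketch, assuming it simply wraps a monotonic clock from the standard library (the body below is an assumption, not the reporter's actual code):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the getNanos() helper used in the snippet above;
// the original implementation is not part of the post.
inline std::uint64_t getNanos()
{
    using namespace std::chrono;
    // steady_clock is monotonic, so the difference between two readings
    // is never negative, even if the system clock is adjusted.
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
```

With a timer like this, the result w should still be read after the loop (e.g. printed or summed) so that the compiler cannot discard the additions as dead code.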

I tested 3 types (short script attached):

  • std::vector<double> for which I provided a naive operator+ function
  • blaze::DynamicVector<double> with BLAZE_USE_VECTORIZATION 0 (vectorization turned off)
  • blaze::DynamicVector<double> with BLAZE_USE_VECTORIZATION 1 (vectorization turned on)
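
The attached script is not reproduced here. A naive operator+ for std::vector<double> might look roughly like the following sketch (an assumption about its shape, not the reporter's actual function):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical naive element-wise addition for std::vector<double>.
// It allocates a fresh result vector on every call.
std::vector<double> operator+(const std::vector<double>& a,
                              const std::vector<double>& b)
{
    assert(a.size() == b.size());
    std::vector<double> result(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        result[i] = a[i] + b[i];
    return result;
}
```

Note that with this version every w = u + v creates a temporary that is then move-assigned into w, so the measurement also includes one allocation and one deallocation per iteration.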

What is curious is that MSVC’s std::vector performs much better than blaze::DynamicVector without vectorization (at least for larger sizes). I would have expected blaze to be at least as fast (if not a little better, due to expression templates and general optimizations). What is as expected, though, is that blaze::DynamicVector with vectorization clearly performs best (and it is indeed ~2x faster than its non-vectorized counterpart).

On g++ (cygwin/Ubuntu), it appears that std::vector does indeed perform the worst (as expected), but there is hardly any difference between blaze::DynamicVector with or without vectorization. I would have expected the vectorized version to be clearly faster, but it is not. It seems as if vectorization is enabled regardless of whether we set BLAZE_USE_VECTORIZATION to 0 or 1, which is surprising.

Questions:

  1. On MSVC, how come std::vector performs better than blaze::DynamicVector without vectorization?
  2. On g++, how come vectorization is enabled for blaze::DynamicVector regardless of flag BLAZE_USE_VECTORIZATION?
  3. Is my benchmarking procedure even meaningful? Are there better ways to test this?

Side note: I compiled with MSVC in VS 2019 (Release build with /O2), and for g++ I used g++ Script1.cpp -O3 -o Script1 -std=c++17 -I . to compile the scripts.

Comments (5)

  1. Klaus Iglberger

    Hi Phil!

    The vector addition is a memory bandwidth limited operation. Therefore it is not to be expected that vectorization gives you any performance benefits for large vectors. Parallelization might help, depending on the size of vectors and the underlying architecture, but doesn't have to either.
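
To make "memory bandwidth limited" concrete, here is a rough back-of-the-envelope sketch. It assumes double precision with 8-byte doubles and counts only the two loads and one store per element (extra traffic such as write-allocate is ignored); the names are illustrative only:

```cpp
#include <cstddef>

// Bytes moved per element of w = u + v:
// two loads (u[i], v[i]) and one store (w[i]).
constexpr std::size_t bytesPerElement = 3 * sizeof(double);

// One floating point addition per element.
constexpr double flopsPerElement = 1.0;

// Arithmetic intensity in flops per byte of memory traffic.
constexpr double vectorAddIntensity()
{
    return flopsPerElement / static_cast<double>(bytesPerElement);
}

// Combined size in bytes of the three vectors for a given length n.
constexpr std::size_t workingSetBytes(std::size_t n)
{
    return n * bytesPerElement;
}
```

With 8-byte doubles this gives one addition per 24 bytes moved, so for vectors that no longer fit into the caches (e.g. 24 MB of data for n = 1,000,000) the loop mostly waits on memory, regardless of how wide the SIMD registers are.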

    In order to get an impression of the benefits of vectorization, you'll have to choose an operation that is less or not limited by memory bandwidth. That is for instance true for operations with small vectors and matrices that fit into the caches. Also, you might implement your own (naive) matrix multiplication and compare that to the performance of a vectorized implementation.
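
A naive matrix multiplication for such a comparison could be sketched as follows. It assumes square n x n matrices stored in row-major order in a std::vector<double>; the function name and layout are illustrative choices, not part of Blaze:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical naive C = A * B for square n x n matrices in row-major order.
// Performs roughly 2*n^3 flops on 3*n^2 values, so the work per byte grows with n.
std::vector<double> naiveMatMul(const std::vector<double>& A,
                                const std::vector<double>& B,
                                std::size_t n)
{
    std::vector<double> C(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            // Dot product of row i of A with column j of B.
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
    return C;
}
```

Timing this against the product of two Blaze matrices of the same size should give a clearer picture than vector addition, because far more arithmetic is done per value loaded.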

    I hope this answers your questions,

    Best regards,

    Klaus!

  2. Phil reporter

    Hi Klaus,

    Thanks for the quick reply.

    So are you saying that (e.g. on g++) we only see an improvement of BLAZE_USE_VECTORIZATION 1 over BLAZE_USE_VECTORIZATION 0 for smaller vector sizes (up to ~256), because of memory bandwidth limits?

    How come this reasoning does not seem to apply to MSVC? Since we see std::vector being faster than blaze::DynamicVector without vectorization, and BLAZE_USE_VECTORIZATION 1 vs BLAZE_USE_VECTORIZATION 0 yielding (almost exactly) the expected 2x speed improvements (for any vector size even). Whereas on g++ we don’t see the 2x improvement at all (not even for smaller vector sizes).

  3. Klaus Iglberger

    Hi Phil!

    So are you saying that (e.g. on g++) we only see an improvement of BLAZE_USE_VECTORIZATION 1 over BLAZE_USE_VECTORIZATION 0 for smaller vector sizes (up to ~256), because of memory bandwidth limits?

    That is to be expected, yes. Depending on your architecture, vector addition will eventually be memory bandwidth limited, i.e. vectorization will not pay off anymore. Without knowing your hardware I cannot predict when this will happen. However, if you turn off vectorization by means of BLAZE_USE_VECTORIZATION 0, the compiler might still perform some vectorization itself (which is relatively simple for vector additions). Therefore it is hard to predict anything. Also, vectorization depends on your compilation flags. Without knowing which flags you use (e.g. -msse4.2, -mavx, …) I cannot give more information on what is to be expected.
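
As an illustration of compiler auto-vectorization, a plain scalar loop like the sketch below contains no explicit SIMD at all, yet GCC will typically vectorize it at -O3. The function name is hypothetical and this is not Blaze's internal code:

```cpp
#include <cstddef>

// A plain scalar loop without any intrinsics. Whether it ends up as SIMD
// code is entirely the compiler's decision.
void addScalar(const double* u, const double* v, double* w, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        w[i] = u[i] + v[i];
}
```

With GCC, compiling with -fopt-info-vec reports which loops were vectorized, and -fno-tree-vectorize switches the auto-vectorizer off, which would be one way to obtain a truly scalar baseline for comparison.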

    How come this reasoning does not seem to apply to MSVC? Since we see std::vector being faster than blaze::DynamicVector without vectorization, and BLAZE_USE_VECTORIZATION 1 vs BLAZE_USE_VECTORIZATION 0 yielding (almost exactly) the expected 2x speed improvements (for any vector size even). Whereas on g++ we don’t see the 2x improvement at all (not even for smaller vector sizes).

    Please consider that the base performance of MSVC is worse than the base performance of GCC. Apparently MSVC does not optimize as aggressively or as well as GCC does. The results seem to confirm that, as it has more trouble optimizing Blaze code (which is nested more deeply and does more optimisation manually) than the STL code. But again, without knowing what you are doing there is little I can do to explain the performance.

    Best regards,

    Klaus!

  4. Phil reporter

    Thanks, Klaus. I think I am starting to understand the differences better now.

    In any case, here are the compiler details:

    • MSVC: MSVC\14.27.29110\bin\Hostx64\x64\cl.exe /nologo /TP -D_CRT_SECURE_NO_WARNINGS -I..\ /DWIN32 /D_WINDOWS /GR /EHsc /O2 /Ob2 /DNDEBUG -MD /MP /wd4127 /O2 -std:c++17 /showIncludes /FoCMakeFiles\Script0.dir\Script0.cpp.obj /FdCMakeFiles\Script0.dir\ /FS -c ..\Script1.cpp (copied from the log). Note, I provided -D_CRT_SECURE_NO_WARNINGS to suppress warnings about the usage of localtime (as opposed to localtime_s) and /wd4127 to suppress warnings about if statements that could be if constexpr.
    • GCC: g++ Script1.cpp -O3 -o Script1 -std=c++17 -I .

  5. Klaus Iglberger

    Hi Phil!

    In both cases you don’t explicitly ask for vectorization (for GCC this would be for example -msse4.2 or -mavx). Therefore the compiler chooses the kind of vectorization depending on your architecture. However, the compiler might also choose not to vectorize. I recommend using Compiler Explorer to get an impression of whether vectorization is performed or not (you can choose Blaze as a library).
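
One way to see which instruction sets a given set of flags enables is to check the compiler's predefined macros. The sketch below is a hypothetical helper (not part of Blaze); it relies on the macros that GCC and Clang define for -msse4.2, -mavx, etc., and on the fact that MSVC targets at least SSE2 on x64 and defines __AVX__ and friends under /arch:

```cpp
#include <string>

// Hypothetical helper: lists the SIMD instruction sets the compiler was
// told it may use, based on predefined macros.
std::string enabledSimdSets()
{
    std::string sets;
#if defined(__SSE2__) || defined(_M_X64)
    sets += "SSE2 ";      // baseline on x86-64
#endif
#if defined(__SSE4_2__)
    sets += "SSE4.2 ";    // e.g. -msse4.2 on GCC/Clang
#endif
#if defined(__AVX__)
    sets += "AVX ";       // e.g. -mavx or /arch:AVX
#endif
#if defined(__AVX2__)
    sets += "AVX2 ";      // e.g. -mavx2 or /arch:AVX2
#endif
#if defined(__AVX512F__)
    sets += "AVX-512F ";  // e.g. -mavx512f or /arch:AVX512
#endif
    return sets.empty() ? "none detected" : sets;
}
```

Printing the result of enabledSimdSets() once at the start of the benchmark makes it easier to compare runs built with different flags.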

    Our only concern is that the manual vectorization works (i.e. the run with BLAZE_USE_VECTORIZATION set to 1). Disabling vectorization puts you at the mercy of the compiler, which is something we cannot influence.

    Best regards,

    Klaus!
