Performance of l2Norm drops when embedded in other expressions

Issue #220 wontfix
gergol created an issue

While playing around with blaze (which is a really great library, many many thanks here!!) I observed the following performance issue:

I have two ways of expressing the same calculation:

result_manual =  blaze::sqrt(blaze::sum(x%x) / blaze::size(x)); // OK
result_norm = blaze::l2Norm(x)/blaze::sqrt(blaze::size(x)); // OK

When passing a plain DynamicMatrix<float>, both perform equally well (~1500 ns per loop). However, when I embed the calculations in a more complex expression, the performance of the result_norm expression drops significantly (it is about 4-5 times slower than result_manual). What could be the reason? Interestingly, the performance does not suffer if I pre-calculate blaze::sqrt(blaze::size(x)) or if I remove the multiplication by x.

result_manual_complex = x *   blaze::sqrt(blaze::sum(x%x) / blaze::size(x));  // OK

result_norm_complex = x * blaze::l2Norm(x)/blaze::sqrt(blaze::size(x));  // SLOW!

result_norm_complex1 = blaze::l2Norm(x)/blaze::sqrt(blaze::size(x));  // OK

float sz = blaze::sqrt(blaze::size(x));
result_norm_complex2 = x * blaze::l2Norm(x)/sz;  // OK

Comments (7)

  1. Klaus Iglberger

    Hi gergol!

    Thanks for raising this issue and pointing out this potential bug. We will investigate as quickly as possible.

    Best regards,

    Klaus!

  2. Klaus Iglberger

    Hi gergol!

    We tried to reproduce the issue, but unfortunately were not able to see a significant performance difference between the first two complex expressions:

    result_manual_complex = x *   blaze::sqrt(blaze::sum(x%x) / blaze::size(x));  // OK
    
    result_norm_complex = x * blaze::l2Norm(x)/blaze::sqrt(blaze::size(x));  // SLOW!
    

    In fact, it looks to us as if both expressions run at the same speed. We tried both GCC 8 and Clang 7.0, SSE and AVX, and matrix sizes of N=10 and N=250.

    In order to proceed we will need further input about your settings:

    • Which matrix size do you use?
    • Which compiler do you use?
    • Which SIMD instruction set do you use?
    • Which compiler flags do you use?

    We would also very much appreciate it if you could provide a minimal example that demonstrates the performance defect. Thanks a lot,

    Best regards,

    Klaus!

  3. gergol reporter

    Hi Klaus,

    • The issue exists for matrices of various sizes. Interestingly, when compiling with clang, the norm expression is FASTER when the non-major dimension is 1 (e.g. for a one-row column-major matrix).
    • I've tested clang6 and gcc 8.2 on ubuntu 18.04
    • SIMD instructions: sse, sse2, sse3, sse4.1, sse4.2, avx, avx2
    • Compiler flags: nothing special: -march=native -O3 -DNDEBUG -std=gnu++17, additionally with and without -fopenmp.

    I've attached a simple CMake project. The archive also contains two txt files with detailed measurements. I hope it helps...

    Best regards

    Gerald

  4. Klaus Iglberger

    Hi Gerald!

    Thanks for the benchmarking project; it helped us understand the problem. The problem can easily be explained with the following minimal example:

    blaze::DynamicMatrix<double> Ad;  // Double precision matrix
    blaze::DynamicMatrix<float>  Af;  // Single precision matrix
    
    const double d = 1.2;   // Double precision scalar
    const float  f = 1.2F;  // Single precision scalar
    
    auto B = evaluate( Ad * d );  // OK, vectorized
    auto C = evaluate( Ad * f );  // OK, vectorized due to floating point promotion
    auto D = evaluate( Af * d );  // SLOW, not vectorized due to narrowing conversion
    auto E = evaluate( Af * f );  // OK, vectorized
    

    The problem only occurs when a single-precision floating-point matrix is multiplied with a double-precision scalar (matrix D). In that case Blaze does not perform a narrowing conversion but works with the given double value. Unfortunately, this mixed-precision multiplication cannot be vectorized. In all other cases, either the type of the matrix elements and the scalar match (matrices B and E) or the given scalar can be promoted (i.e. a float converted to a double; matrix C). Therefore all other operations can be vectorized.

    In your example, the sqrt() function is applied to an integral type (specifically size_t). The resulting value is of type double, which is then used as the factor for the matrix x, so this operation is not vectorized. All other examples perform well since the scalar is always of type float. To resolve the performance problem, please explicitly convert the scalar to float so that Blaze can perform a vectorized multiplication.

    It is admittedly a little surprising that one of these operations performs worse than the others. However, from our point of view, changing the behavior and silently performing the narrowing conversion could cause more serious problems (a silent loss of precision) and would arguably introduce a bug. We have decided to leave the behavior as it is, but to add one more entry to our FAQ to highlight and explain the potential pitfall of mixed-precision operations.

    Thanks again for raising this issue and pointing out a potential performance bug (we rely on such reports, as we cannot test every possible combination ourselves).

    Best regards,

    Klaus!

  5. gergol reporter

    Hi Klaus!

    Thank you for your comprehensive answer, which makes total sense. Perhaps it would be useful to issue a warning or even fail compilation (possibly as an opt-in option) whenever a mixed-precision operation occurs.

    Best regards

    Gerald
