offset views

Issue #219 wontfix
Christopher Mauney created an issue

I've been doing some light testing and have run into an issue that hampers the performance of Blaze compared to a native loop implementation.

Consider a finite-difference stencil:

da[ i ] = ( a[ i + 1 ] - a[ i - 1 ] ) / dx[ i ];

The straightforward (maybe incorrect?) implementation using Blaze looks like:

subvector(da, 0, N-1) = (subvector(a, 1, N-1) - subvector(a, 0, N-1)) / subvector(dx, 0, N-1); 

However, this suffers in performance; my simple tests show it can be almost three times as slow as an explicit loop. I'd venture that this is because of i) loss of alignment and ii) indirection due to the subvector calls. There are likely other things I'm missing, or it may just be a poor implementation on my part.

I've attached the .cpp file I was using to test. If there's a feature I'm missing that would help in this case, let me know.

Thanks

Comments (2)

  1. Klaus Iglberger

    Hi Christopher!

    Thanks for pointing out this potential performance problem. I have analyzed the problem and experimented with several settings/improvements. Here are my findings:

    • For small vectors (N=1000) the overhead of creating subvectors plays some role. For larger vectors (e.g. N=1000000) the overhead is negligible.
    • The SIMD technology used has a significant impact: with AVX the two approaches achieve the same performance, with SSE Blaze runs slower. It apparently pays off to vectorize explicitly by means of AVX.
    • It is possible to set up subvectors with compile-time offset and size (see the wiki; a sketch follows this list). Since all subvectors are created only once, this does not result in a big difference, though.
    • Blaze provides a pow2() function that you can use instead of the more general pow() function (although pow() might have special treatment for an exponent of 2). This is currently not documented in the wiki; we will fix this omission. The sketch after this list shows both features.
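
    To illustrate both points, here is a minimal sketch (the names and sizes are only illustrative, and it assumes that pow2() is available in the blaze namespace alongside pow(), as described above):

    const size_t N( 1000UL );  // a constant, hence usable as a template argument

    blaze::DynamicVector<double> a( N, 2.0 );

    // Subvector with compile-time offset and size, passed as template
    // arguments instead of as runtime function arguments:
    auto half = blaze::subvector<0UL,N/2UL>( a );

    // pow2() squares each element, a shortcut for pow( half, 2.0 ):
    blaze::DynamicVector<double> b( blaze::pow2( half ) );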

    In total, I find that both solutions run at the same speed for large vectors using AVX. However, for this kind of 1D stencil operation I would not expect Blaze to provide any advantage over the optimizations that a modern compiler can apply. Due to the overhead of setting up the subvectors and due to the unaligned accesses, I would even expect Blaze to run slightly slower. From a readability point of view, I would argue that the manual implementation is easier to read and understand; one reason may be that subvectors were not built for this kind of application. Thus, in this case, I would recommend sticking with the for-loop approach.

    I hope this helps,

    Best regards,

    Klaus!

  2. Klaus Iglberger

    Hi Christopher!

    I continued the analysis a little further and investigated how much overhead subvectors introduce compared to a plain for loop and to manual vectorization. I used the operation you proposed

    da[ i ] = ( a[ i + 1 ] - a[ i - 1 ] ) / dx[ i ];
    

    and consequently used three DynamicVectors:

    const size_t N( 10000UL );
    
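    // a holds N+2 elements so that a[i-1] and a[i+1] stay in bounds for
    // i = 1..N; dx and da hold N+1 elements so that index N is valid
    // (element 0 of dx and da is unused).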
    DynamicVector<double> a ( N + 2UL, ... );  // Randomly initialized
    DynamicVector<double> dx( N + 1UL, ... );  // Randomly initialized
    DynamicVector<double> da( N + 1UL, 0.0 );
    

    Approach 1: Plain for loop

    for( size_t i=1UL; i<=N; ++i ) {
       da[i] = ( a[i+1UL] - a[i-1UL] ) / dx[i];
    }
    

    Approach 2: Blaze subvectors

    subvector<1UL,N>( da ) = ( subvector<2UL,N>( a ) - subvector<0UL,N>( a ) ) / subvector<1UL,N>( dx );
    

    Approach 3: Manual vectorization (by means of the Blaze SIMD library)

    double* pa  = a.data();
    double* pdx = dx.data();
    double* pda = da.data();
    
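    // Process four doubles per iteration (the width of one AVX register of
    // doubles); this assumes N is a multiple of 4UL. Unaligned loads and
    // stores (loadu/storeu) are required since the accesses start at i=1
    // and are shifted by one element.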
    for( size_t i=1UL; i<=N; i+=4UL ) {
       storeu( pda+i, ( loadu( pa+i+1UL ) - loadu( pa+i-1UL ) ) / loadu( pdx+i ) );
    }
    

    Using GCC 8.2 and the following compilation flags

    -Wall -Werror -Wextra -Wshadow -O3 -DNDEBUG -mavx -std=c++14
    

    I found that all three solutions show the same performance (within the usual measurement inaccuracies). Thus, from my point of view, subvectors do not incur a measurable overhead. However, I would again prefer approach 1, as it is more readable and may allow for additional compiler optimizations. Thanks again for pointing out this potential problem,

    Best regards,

    Klaus!
