- changed status to wontfix
offset views
I've been doing some light testing, and have run into an issue that is hampering the performance of Blaze, vs. a native loop implementation.
Consider a finite-difference stencil:
da[ i ] = ( a[ i + 1 ] - a[ i - 1 ] ) / dx[ i ];
The straightforward (maybe incorrect?) implementation using Blaze looks like:
subvector(da, 0, N-1) = (subvector(a, 1, N-1) - subvector(a, 0, N-1)) / subvector(dx, 0, N-1);
However, this suffers in performance; my simple tests show it can be almost thrice as slow as an explicit loop. I'd venture that this is because of i.) loss of alignment, ii.) indirection due to subvector call. Likely other things I'm missing, or just poor implementation on my part.
I've attached my .cpp file I was using to test. If there's a feature I'm missing that would help in this case, let me know.
Thanks
Comments (2)
-
-
Hi Christopher!
I continued the analysis a little and investigated how much overhead subvectors introduce in comparison to a plain for loop and manual vectorization. I used the operation proposed by you
da[ i ] = ( a[ i + 1 ] - a[ i - 1 ] ) / dx[ i ];
and consequently used three
DynamicVector
s:const size_t N( 10000UL ); DynamicVector<double> a ( N + 2UL, ... ); // Randomly initialized DynamicVector<double> dx( N + 1UL, ... ); // Randomly initialized DynamicVector<double> da( N + 1UL, 0.0 );
Approach 1: Plain for loop
for( size_t i=1UL; i<=N; ++i ) { da[i] = ( a[i+1UL] - a[i-1UL] ) / dx[i]; }
Approach 2: Blaze subvectors
subvector<1UL,N>( da ) = ( subvector<2UL,N>( a ) - subvector<0UL,N>( a ) ) / subvector<1UL,N>( dx );
Approach 3: Manual vectorization (by means of the Blaze SIMD library)
double* pa = a.data(); double* pdx = dx.data(); double* pda = da.data(); for( size_t i=1UL; i<=N; i+=4UL ) { storeu( pda+i, ( loadu( pa+i+1UL ) - loadu( pa+i-1UL ) ) / loadu( pdx+i ) ); }
Using GCC 8.2 and the following compilation flags
-Wall -Werror -Wextra -Wshadow -O3 -DNDEBUG -mavx -std=c++14
I found that all three solutions show the same performance (with the usual measurement inaccuracies). Thus from my point of view subvectors do not incur a measurable overhead. However, again, I would prefer approach 1 as it is more readable and may allow for additional compiler optimizations. Thanks again for pointing out this potential problem,
Best regards,
Klaus!
- Log in to comment
Hi Christopher!
Thanks for pointing out this potential performance problem. I have analyzed the problem and experimented with several settings/improvements. Here are my findings:
pow2()
function that you can use instead of the more generalpow()
function (althoughpow()
might have a special treatment for an exponent of2
). This is currently not documented in the wiki, we will fix this omission.In total, I find that both solutions run at the same speed for large vectors using AVX. However, for this kind of 1D stencil operation I would not expect that Blaze can provide any advantage to the optimizations that a modern compiler can apply. Due to the overhead for the setup of subvectors and due to the unaligned access I would even expect Blaze to run slightly slower. From a readability point of view I would argue that the manual implementation is easier to read and understand. One of the reasons for that may be that subvectors were not build for this application. Thus in this case I would recommend to stick to the for-loop approach.
I hope this helps,
Best regards,
Klaus!