offset views

Issue #219 wontfix
Christopher Mauney created an issue

I've been doing some light testing and have run into an issue that hampers the performance of Blaze compared to a native loop implementation.

Consider a finite-difference stencil:

da[ i ] = ( a[ i + 1 ] - a[ i - 1 ] ) / dx[ i ];

The straightforward (maybe incorrect?) implementation using Blaze looks like:

subvector(da, 0, N-1) = (subvector(a, 1, N-1) - subvector(a, 0, N-1)) / subvector(dx, 0, N-1); 

However, this suffers in performance; my simple tests show it can be almost three times as slow as an explicit loop. I'd venture that this is because of i) loss of alignment and ii) indirection due to the subvector calls. There are likely other things I'm missing, or it may just be a poor implementation on my part.

I've attached the .cpp file I was using to test. If there's a feature I'm missing that would help in this case, let me know.

Thanks

Comments (2)

  1. Klaus Iglberger

    Hi Christopher!

    Thanks for pointing out this potential performance problem. I have analyzed the problem and experimented with several settings/improvements. Here are my findings:

    • For small vectors (N=1000) the overhead of creating subvectors plays some role. For larger vectors (e.g. N=1000000) the overhead is negligible.
    • The SIMD technology used has a significant impact: with AVX the two approaches achieve the same performance, with SSE Blaze runs slower. It apparently pays off to vectorize explicitly by means of AVX.
    • It is possible to set up subvectors with compile-time offset and size (see the wiki; a sketch follows this list). Since all subvectors are created only once, this does not result in a big difference, though.
    • Blaze provides a pow2() function that you can use instead of the more general pow() function (although pow() might have special treatment for an exponent of 2). This is currently not documented in the wiki; we will fix this omission. The sketch after this list shows both features.
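
    To illustrate both points, here is a minimal sketch (the names and sizes are only illustrative, and it assumes that pow2() is available in the blaze namespace alongside pow(), as described above):

    const size_t N( 1000UL );  // a constant, hence usable as a template argument

    blaze::DynamicVector<double> a( N, 2.0 );

    // Subvector with compile-time offset and size, passed as template
    // arguments instead of as runtime function arguments:
    auto half = blaze::subvector<0UL,N/2UL>( a );

    // pow2() squares each element, a shortcut for pow( half, 2.0 ):
    blaze::DynamicVector<double> b( blaze::pow2( half ) );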

    In total, I find that both solutions run at the same speed for large vectors using AVX. However, for this kind of 1D stencil operation I would not expect Blaze to provide any advantage over the optimizations that a modern compiler can apply. Due to the overhead of setting up the subvectors and due to the unaligned accesses, I would even expect Blaze to run slightly slower. From a readability point of view, I would argue that the manual implementation is easier to read and understand; one reason may be that subvectors were not built for this kind of application. Thus, in this case, I would recommend sticking with the for-loop approach.

    I hope this helps,

    Best regards,

    Klaus!

  2. Klaus Iglberger

    Hi Christopher!

    I continued the analysis a little further and investigated how much overhead subvectors introduce compared to a plain for loop and to manual vectorization. I used the operation you proposed

    da[ i ] = ( a[ i + 1 ] - a[ i - 1 ] ) / dx[ i ];
    

    and consequently used three DynamicVectors:

    const size_t N( 10000UL );
    
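    // a holds N+2 elements so that a[i-1] and a[i+1] stay in bounds for
    // i = 1..N; dx and da hold N+1 elements so that index N is valid
    // (element 0 of dx and da is unused).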
    DynamicVector<double> a ( N + 2UL, ... );  // Randomly initialized
    DynamicVector<double> dx( N + 1UL, ... );  // Randomly initialized
    DynamicVector<double> da( N + 1UL, 0.0 );
    

    Approach 1: Plain for loop

    for( size_t i=1UL; i<=N; ++i ) {
       da[i] = ( a[i+1UL] - a[i-1UL] ) / dx[i];
    }
    

    Approach 2: Blaze subvectors

    subvector<1UL,N>( da ) = ( subvector<2UL,N>( a ) - subvector<0UL,N>( a ) ) / subvector<1UL,N>( dx );
    

    Approach 3: Manual vectorization (by means of the Blaze SIMD library)

    double* pa  = a.data();
    double* pdx = dx.data();
    double* pda = da.data();
    
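    // Process four doubles per iteration (the width of one AVX register of
    // doubles); this assumes N is a multiple of 4UL. Unaligned loads and
    // stores (loadu/storeu) are required since the accesses start at i=1
    // and are shifted by one element.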
    for( size_t i=1UL; i<=N; i+=4UL ) {
       storeu( pda+i, ( loadu( pa+i+1UL ) - loadu( pa+i-1UL ) ) / loadu( pdx+i ) );
    }
    

    Using GCC 8.2 and the following compilation flags

    -Wall -Werror -Wextra -Wshadow -O3 -DNDEBUG -mavx -std=c++14
    

    I found that all three solutions show the same performance (within the usual measurement inaccuracies). Thus, from my point of view, subvectors do not incur a measurable overhead. However, I would again prefer approach 1, as it is more readable and may allow for additional compiler optimizations. Thanks again for pointing out this potential problem,

    Best regards,

    Klaus!
