OpenMP Parallelization
Previous: Matrix/Matrix Multiplication     Next: Serial Execution


One of the main motivations of the Blaze 1.x releases was to achieve maximum performance on a single CPU core for all possible operations. However, today's CPUs are no longer single-core, but provide several (homogeneous or heterogeneous) compute cores. In order to fully exploit the performance potential of a multicore CPU, computations have to be parallelized across all available cores. Therefore, starting with Blaze 2.0, the Blaze library provides shared memory parallelization with OpenMP.


OpenMP Setup


To enable OpenMP-based parallelization, all that needs to be done is to explicitly specify the use of OpenMP on the command line:

-fopenmp  // GNU C++ compiler
-openmp   // Intel C++ compiler
/openmp   // Visual Studio

This simple action will cause the Blaze library to automatically try to run all operations in parallel with the specified number of threads.

As is common for OpenMP applications, the number of threads can be specified either via an environment variable

export OMP_NUM_THREADS=4

or via an explicit call to the omp_set_num_threads() function:

omp_set_num_threads( 4 );

Either way, the best performance can be expected if the specified number of threads matches the available number of cores.


OpenMP Configuration


Note that Blaze does not unconditionally run an operation in parallel. If Blaze deems parallel execution counterproductive for the overall performance, the operation is executed serially. One of the main reasons for executing an operation serially is the size of the operands. For instance, a vector addition is only executed in parallel if the size of both vector operands exceeds a certain threshold; otherwise, the thread setup overhead could seriously degrade performance. However, in order to adjust the Blaze library to a specific system, it is possible to configure these thresholds manually. All OpenMP thresholds are contained in the configuration file ./blaze/config/Thresholds.h.
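As a purely hypothetical illustration (the constant name and value below are invented for this sketch and are not the actual contents of Thresholds.h; consult the header itself for the real entries), an entry in such a configuration file could look like this:

```cpp
// Hypothetical threshold: run a dense vector addition in parallel only if
// both operands contain at least this many elements (name and value are
// illustrative, not taken from the actual Thresholds.h)
const size_t OPENMP_DVECDVECADD_THRESHOLD = 38000UL;
```

Raising such a threshold favors serial execution on systems where thread startup is expensive; lowering it favors parallel execution on systems with many fast cores.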


First Touch Policy


The Blaze library does not (yet) automatically initialize dynamically allocated memory according to the first touch principle. Consider, for instance, the following vector triad example:

const size_t N( 1000000UL );
blaze::DynamicVector<double,blaze::columnVector> a( N ), b( N ), c( N ), d( N );

// Initialization of the vectors b, c, and d
for( size_t i=0UL; i<N; ++i ) {
   b[i] = blaze::rand<double>();
   c[i] = blaze::rand<double>();
   d[i] = blaze::rand<double>();
}

// Performing a vector triad
a = b + c * d;

If this code, which is prototypical for many OpenMP applications that have not been optimized for ccNUMA architectures, is run across several locality domains (LD), it will not scale beyond the maximum performance achievable on a single LD if the working set does not fit into the cache. This is because the initialization loop is executed by a single thread, writing to b, c, and d for the first time. Hence, all memory pages belonging to those arrays will be mapped into a single LD.

As mentioned above, this problem can be solved by performing vector initialization in parallel:

// ...

// Initialization of the vectors b, c, and d
#pragma omp parallel for
for( size_t i=0UL; i<N; ++i ) {
   b[i] = blaze::rand<double>();
   c[i] = blaze::rand<double>();
   d[i] = blaze::rand<double>();
}

// ...

This simple modification makes a huge difference on ccNUMA systems in memory-bound situations (as, for instance, in all BLAS level 1 operations and in some BLAS level 2 operations). Therefore, in order to achieve the maximum possible performance, it is imperative to initialize the memory according to the later use of the data structures.

