OpenMP Parallelization
Previous: Matrix/Matrix Multiplication     Next: C++11 Thread Parallelization


One of the main motivations of the Blaze 1.x releases was to achieve maximum performance on a single CPU core for all possible operations. However, today's CPUs are not single core anymore, but provide several (homogeneous or heterogeneous) compute cores. In order to fully exploit the performance potential of a multicore CPU, computations have to be parallelized across all available cores of a CPU. Therefore, starting with Blaze 2.0, the Blaze library provides shared memory parallelization with OpenMP.


OpenMP Setup


To enable OpenMP-based parallelization, all that needs to be done is to explicitly specify the use of OpenMP on the command line:

-fopenmp // GNU C++ compiler
-openmp // Intel C++ compiler
/openmp // Visual Studio

This simple action will cause the Blaze library to automatically try to run all operations in parallel with the specified number of threads.
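
For example, the following ordinary Blaze code (a minimal sketch; the operand sizes are chosen arbitrarily, but must be large enough to exceed the parallelization thresholds discussed below) is automatically executed in parallel when compiled with one of the flags above:

#include <blaze/Math.h>

int main()
{
   // Large dense operands; with OpenMP enabled, Blaze automatically
   // distributes the multiplication across the available threads
   blaze::DynamicMatrix<double> A( 2000UL, 2000UL, 1.0 );
   blaze::DynamicVector<double> x( 2000UL, 1.0 );
   blaze::DynamicVector<double> y;

   y = A * x;  // Runs in parallel without any Blaze-specific code changes
}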

As is common for OpenMP, the number of threads can be specified either via an environment variable

export OMP_NUM_THREADS=4 // Unix systems
set OMP_NUM_THREADS=4 // Windows systems

or via an explicit call to the omp_set_num_threads() function:

omp_set_num_threads( 4 );

Alternatively, the number of threads can also be specified via the setNumThreads() function provided by the Blaze library:
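
blaze::setNumThreads( 4 );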

Please note that the Blaze library does not limit the available number of threads. Therefore it is YOUR responsibility to choose an appropriate number of threads. The best performance, though, can be expected if the specified number of threads matches the available number of cores.
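
For example, the following minimal sketch matches the number of threads to the number of hardware threads reported by the C++ standard library. Note that std::thread::hardware_concurrency() counts logical threads, not physical cores, and may return 0 if the value cannot be determined:

#include <thread>

const unsigned int threads = std::thread::hardware_concurrency();
blaze::setNumThreads( threads > 0U ? threads : 1U );  // Fall back to a single thread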

In order to query the number of threads used for the parallelization of operations, the getNumThreads() function can be used:

const size_t threads = blaze::getNumThreads();

In the context of OpenMP, the function returns the maximum number of threads OpenMP will use within a parallel region and is therefore equivalent to the omp_get_max_threads() function.
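
For illustration (assuming the code is compiled with OpenMP enabled):

#include <omp.h>

// Both calls report the maximum number of threads used within a parallel region
const size_t blazeThreads = blaze::getNumThreads();
const int    ompThreads   = omp_get_max_threads();
// blazeThreads == static_cast<size_t>( ompThreads )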


OpenMP Configuration


Note that Blaze does not unconditionally run an operation in parallel. If Blaze deems parallel execution counterproductive for the overall performance, the operation is executed serially. One of the main reasons for not executing an operation in parallel is the size of the operands: a vector addition, for instance, is only executed in parallel if the sizes of both vector operands exceed a certain threshold. Otherwise, performance could decrease severely due to the overhead of the thread setup. However, in order to adjust the Blaze library to a specific system, these thresholds can be configured manually. All shared memory thresholds are contained in the configuration file ./blaze/config/Thresholds.h.

Please note that these thresholds are highly sensitive to the system architecture in use and the shared memory parallelization technique (see also C++11 Thread Parallelization and Boost Thread Parallelization). Therefore the default values cannot guarantee maximum performance for all possible situations and configurations; they merely provide a reasonable standard for the current CPU generation.
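
To illustrate, the entries in this file have roughly the following form (a hypothetical excerpt; the actual macro names and default values depend on the Blaze version and should be looked up in the header itself):

// Hypothetical excerpt from ./blaze/config/Thresholds.h: only dense vector
// additions whose operands exceed this number of elements run in parallel
#define BLAZE_SMP_DVECDVECADD_THRESHOLD 38000UL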


First Touch Policy


So far the Blaze library does not (yet) automatically initialize dynamic memory according to the first touch principle. Consider for instance the following vector triad example:

const size_t N( 1000000UL );
blaze::DynamicVector<double,blaze::columnVector> a( N ), b( N ), c( N ), d( N );

// Initialization of the vectors b, c, and d
for( size_t i=0UL; i<N; ++i ) {
   b[i] = blaze::rand<double>();
   c[i] = blaze::rand<double>();
   d[i] = blaze::rand<double>();
}

// Performing a vector triad
a = b + c * d;

If this code, which is prototypical for many OpenMP applications that have not been optimized for ccNUMA architectures, is run across several locality domains (LD), it will not scale beyond the maximum performance achievable on a single LD if the working set does not fit into the cache. This is because the initialization loop is executed by a single thread, writing to b, c, and d for the first time. Hence, all memory pages belonging to those arrays will be mapped into a single LD.

This problem can be solved by performing the vector initialization in parallel:

// ...

// Initialization of the vectors b, c, and d
#pragma omp parallel for
for( size_t i=0UL; i<N; ++i ) {
   b[i] = blaze::rand<double>();
   c[i] = blaze::rand<double>();
   d[i] = blaze::rand<double>();
}

// ...

This simple modification makes a huge difference on ccNUMA systems in memory-bound situations (as for instance in all BLAS level 1 operations and partly in BLAS level 2 operations). Therefore, in order to achieve the maximum possible performance, it is imperative to initialize the memory according to the later use of the data structures.


Limitations of the OpenMP Parallelization


There are a few important limitations to the current Blaze OpenMP parallelization. The first concerns the explicit use of an OpenMP parallel region (see The Parallel Directive), the second the OpenMP sections directive (see The Sections Directive).


The Parallel Directive

In OpenMP, threads are explicitly spawned via the OpenMP parallel directive:

// Serial region, executed by a single thread
#pragma omp parallel
{
   // Parallel region, executed by the specified number of threads
}
// Serial region, executed by a single thread

Conceptually, the specified number of threads (see OpenMP Setup) is created every time a parallel directive is encountered. Therefore, from a performance point of view, it seems beneficial to wrap several operations in a single OpenMP parallel directive:

#pragma omp parallel
{
   y1 = A * x;
   y2 = B * x;
}

Unfortunately, this optimization approach is not allowed within the Blaze library. More explicitly, it is not allowed to put an operation into a parallel region. The reason is that the entire code contained within a parallel region is executed by all threads. Although this appears to comprise only the contained computations, a computation (or more specifically the assignment of an expression to a vector or matrix) can contain additional logic that must not be handled by multiple threads (as for instance memory allocations, the setup of temporaries, etc.). Therefore it is not possible to manually start a parallel region for several operations; instead, Blaze spawns threads automatically, depending on the specifics of the operation at hand and the given operands.
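
Instead of a manual parallel region, the operations are simply written back to back, and Blaze creates the necessary parallel regions internally:

y1 = A * x;  // Blaze parallelizes this operation internally ...
y2 = B * x;  // ... and this one as well, one after the other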


The Sections Directive

OpenMP provides several work-sharing constructs to distribute work among threads. One of these constructs is the sections directive:

// ... Resizing and initialization
#pragma omp sections
{
   #pragma omp section
   y1 = A * x;

   #pragma omp section
   y2 = B * x;
}

In this example, two threads are used to compute two distinct matrix/vector multiplications concurrently. Each of the sections is executed by exactly one thread.

Unfortunately, Blaze does not support concurrent parallel computations, and therefore this approach does not work with any of the Blaze parallelization techniques. All techniques (including the C++11 and Boost thread parallelizations; see C++11 Thread Parallelization and Boost Thread Parallelization) are optimized for the parallel computation of an operation issued from a single thread of execution. This means that Blaze tries to use all available threads to compute the result of a single operation as efficiently as possible. Therefore, for this special case, it is advisable to disable all Blaze parallelizations and let Blaze compute all operations within a sections directive in serial. This can be done either by completely disabling the Blaze parallelization (see Serial Execution) or by selectively serializing all operations within a sections directive via the serial() function:

// ... Resizing and initialization
#pragma omp sections
{
   #pragma omp section
   y1 = serial( A * x );

   #pragma omp section
   y2 = serial( B * x );
}

Please note that the use of the BLAZE_SERIAL_SECTION macro (see also Serial Execution) does NOT work in this context!
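
For clarity, the following sketch shows the pattern to avoid; wrapping the assignments in BLAZE_SERIAL_SECTION blocks inside the sections directive, for instance, does not serialize the contained operations:

// ... Resizing and initialization
#pragma omp sections
{
   #pragma omp section
   {
      BLAZE_SERIAL_SECTION {  // Does NOT have the desired serializing effect here!
         y1 = A * x;
      }
   }

   #pragma omp section
   {
      BLAZE_SERIAL_SECTION {
         y2 = B * x;
      }
   }
}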


Previous: Matrix/Matrix Multiplication     Next: C++11 Thread Parallelization