Blaze 3.6
The fourth and final shared memory parallelization provided with Blaze is based on OpenMP.
To enable the OpenMP-based parallelization, all that needs to be done is to explicitly enable OpenMP on the compiler command line, for instance via the -fopenmp flag of the GNU and Clang compilers.
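Whether the flag actually took effect can be checked from within the program, since a compiler defines the _OPENMP macro only when OpenMP is enabled. The following minimal sketch assumes a GNU or Clang toolchain; the helper name openmpEnabled is purely illustrative and not part of Blaze.

```cpp
// Possible build line (GNU/Clang):  g++ -O3 -fopenmp example.cpp

// Illustrative helper (not part of Blaze): reports whether this translation
// unit was compiled with OpenMP support. The _OPENMP macro is defined by the
// compiler only if OpenMP has been enabled on the command line.
bool openmpEnabled()
{
#ifdef _OPENMP
   return true;
#else
   return false;
#endif
}
```

If the function returns false, the OpenMP flag did not reach the compiler for this translation unit.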
Enabling OpenMP in this way causes the Blaze library to automatically try to run all operations in parallel with the specified number of threads. Note, however, that the HPX-based, the C++11 thread-based, and the Boost thread-based parallelizations have priority, i.e. each of them is preferred if it is enabled in combination with the OpenMP thread parallelization.
As is common for OpenMP, the number of threads can be specified either via the OMP_NUM_THREADS environment variable (for instance export OMP_NUM_THREADS=4 on Unix systems) or via an explicit call to the omp_set_num_threads() function.
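The explicit call can be sketched as follows. The helper requestThreads is a hypothetical name introduced for illustration; the OpenMP calls are guarded so that the code also compiles when OpenMP is disabled.

```cpp
#ifdef _OPENMP
#  include <omp.h>
#endif

// Illustrative helper (not part of Blaze or OpenMP): requests 'n' threads for
// subsequent parallel regions and returns the resulting upper bound on the
// team size. Without OpenMP the program is serial, hence 1 is returned.
int requestThreads( int n )
{
#ifdef _OPENMP
   omp_set_num_threads( n );      // Takes precedence over OMP_NUM_THREADS
   return omp_get_max_threads();  // Upper bound for the next parallel region
#else
   (void)n;                       // Unused in a serial build
   return 1;
#endif
}
```

Calling the function once at program start, before the first parallel operation, is usually sufficient.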
Alternatively, the number of threads can also be specified via the setNumThreads() function provided by the Blaze library, for instance blaze::setNumThreads( 4 ).
Please note that the Blaze library does not limit the available number of threads. Therefore it is your responsibility to choose an appropriate number of threads. The best performance can be expected if the specified number of threads matches the number of available cores.
In order to query the number of threads used for the parallelization of operations, the getNumThreads() function can be used, for instance const size_t threads = blaze::getNumThreads().
In the context of OpenMP, the function returns the maximum number of threads OpenMP will use within a parallel region and is therefore equivalent to the omp_get_max_threads() function.
Note that Blaze does not unconditionally run an operation in parallel. If Blaze deems parallel execution counterproductive for the overall performance, the operation is executed serially. One of the main reasons for not executing an operation in parallel is the size of the operands. For instance, a vector addition is only executed in parallel if the size of both vector operands exceeds a certain threshold. Otherwise, the performance could seriously decrease due to the overhead caused by the thread setup. However, in order to be able to adjust the Blaze library to a specific system, it is possible to configure these thresholds manually. All shared memory thresholds are contained within the configuration file <blaze/config/Thresholds.h>.
Please note that these thresholds are highly sensitive to the system architecture in use and to the shared memory parallelization technique (see also C++11 Thread Parallelization and Boost Thread Parallelization). Therefore the default values cannot guarantee maximum performance for all possible situations and configurations. They merely provide a reasonable standard for the current CPU generation.
So far the Blaze library does not (yet) automatically initialize dynamic memory according to the first touch principle. Consider for instance a vector triad a = b + c * d on large dynamic vectors, in which the operands b, c, and d are initialized by an ordinary serial loop before the triad is evaluated.
If this code, which is prototypical for many OpenMP applications that have not been optimized for ccNUMA architectures, is run across several locality domains (LD), it will not scale beyond the maximum performance achievable on a single LD if the working set does not fit into the cache. This is because the initialization loop is executed by a single thread, writing to b, c, and d for the first time. Hence, all memory pages belonging to those arrays will be mapped into a single LD.
This problem can be solved by performing the vector initialization in parallel.
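The idea can be sketched in plain C++ with OpenMP. This is a simplified stand-in, not Blaze code: raw arrays replace the Blaze vectors, a hand-written loop replaces the Blaze expression a = b + c * d, and the function name firstTouchTriad is hypothetical.

```cpp
#include <cstddef>
#include <memory>

// Plain-array sketch of the vector triad a = b + c * d with first-touch
// initialization. 'new double[n]' does not write to the elements, so the
// initialization loop below is the first access to the memory pages.
double firstTouchTriad( std::size_t n )
{
   std::unique_ptr<double[]> a( new double[n] );
   std::unique_ptr<double[]> b( new double[n] );
   std::unique_ptr<double[]> c( new double[n] );
   std::unique_ptr<double[]> d( new double[n] );

   const std::ptrdiff_t N = static_cast<std::ptrdiff_t>( n );

   // Parallel initialization: with a static schedule each thread writes one
   // contiguous chunk first, so those pages are mapped into its own LD.
   #pragma omp parallel for schedule(static)
   for( std::ptrdiff_t i=0; i<N; ++i ) {
      a[i] = 0.0;
      b[i] = 1.0;
      c[i] = 2.0;
      d[i] = 3.0;
   }

   // Triad with the same static schedule: each thread works on the chunk
   // it has touched during the initialization.
   #pragma omp parallel for schedule(static)
   for( std::ptrdiff_t i=0; i<N; ++i ) {
      a[i] = b[i] + c[i] * d[i];
   }

   return ( n > 0 ) ? a[n-1] : 0.0;  // Last element of the result
}
```

Removing the first pragma yields the serial initialization, in which a single thread touches all pages.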
This simple modification makes a huge difference on ccNUMA in memory-bound situations (as for instance in all BLAS level 1 operations and partially BLAS level 2 operations). Therefore, in order to achieve the maximum possible performance, it is imperative to initialize the memory according to the later use of the data structures.
There are a few important limitations to the current Blaze OpenMP parallelization. The first one involves the explicit use of an OpenMP parallel region (see The Parallel Directive), the other one the OpenMP sections directive (see The Sections Directive).
In OpenMP, threads are explicitly spawned via an OpenMP parallel directive.
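A minimal plain OpenMP sketch of such a region, deliberately without any Blaze operation inside, could look as follows. The helper name countThreadsInRegion is illustrative only.

```cpp
// Illustrative helper: opens a parallel region and returns the number of
// threads that executed it. In a build without OpenMP the pragmas are
// ignored and the block runs exactly once.
int countThreadsInRegion()
{
   int count = 0;

   // The enclosed block is executed by every thread of the team
   #pragma omp parallel
   {
      // Atomic update, since all threads increment the same counter
      #pragma omp atomic
      ++count;
   }

   return count;
}
```

The returned value shows that the entire block is run once per thread, not once in total.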
Conceptually, the specified number of threads (see OpenMP Setup) is created every time a parallel directive is encountered. Therefore, from a performance point of view, it seems to be beneficial to use a single OpenMP parallel directive for several operations, i.e. to enclose several consecutive Blaze operations in one parallel region.
Unfortunately, this optimization approach is not allowed within the Blaze library. More explicitly, it is not allowed to put an operation into a parallel region. The reason is that the entire code contained within a parallel region is executed by all threads. Although this appears to just comprise the contained computations, a computation (or more specifically the assignment of an expression to a vector or matrix) can contain additional logic that must not be handled by multiple threads (as for instance memory allocations, setup of temporaries, etc.). Therefore it is not possible to manually start a parallel region for several operations, but Blaze will spawn threads automatically, depending on the specifics of the operation at hand and the given operands.
OpenMP provides several work-sharing constructs to distribute work among threads. One of these constructs is the sections directive. It could for instance be used to let two threads compute two distinct matrix/vector multiplications concurrently, with each of the sections being executed by exactly one thread.
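The construct itself can be sketched with strictly serial stand-in kernels. The types Vector and Matrix and the helpers multiply and multiplyBoth are hypothetical and not part of Blaze; each row of a matrix is assumed to have at least as many entries as the vector x.

```cpp
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // Row-wise storage, stand-in for a matrix type

// Strictly serial matrix/vector multiplication
Vector multiply( const Matrix& A, const Vector& x )
{
   Vector y( A.size(), 0.0 );
   for( std::size_t i=0; i<A.size(); ++i ) {
      for( std::size_t j=0; j<x.size(); ++j ) {
         y[i] += A[i][j] * x[j];
      }
   }
   return y;
}

// Computes y1 = A * x and y2 = B * x concurrently, one section per product
void multiplyBoth( const Matrix& A, const Matrix& B, const Vector& x,
                   Vector& y1, Vector& y2 )
{
   #pragma omp parallel sections
   {
      #pragma omp section
      y1 = multiply( A, x );  // Executed by exactly one thread

      #pragma omp section
      y2 = multiply( B, x );  // Executed by exactly one (possibly other) thread
   }
}
```

The sketch is safe only because each kernel is serial and the two sections write to distinct result vectors.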
Unfortunately, Blaze does not support concurrent parallel computations and therefore this approach does not work with any of the Blaze parallelization techniques. All techniques (including the C++11 and Boost thread parallelizations; see C++11 Thread Parallelization and Boost Thread Parallelization) are optimized for the parallel computation of an operation within a single thread of execution. This means that Blaze tries to use all available threads to compute the result of a single operation as efficiently as possible. Therefore, for this special case, it is advisable to disable all Blaze parallelizations and to let Blaze compute all operations within a sections directive in serial. This can be done by either completely disabling the Blaze parallelization (see Serial Execution) or by selectively serializing all operations within a sections directive via the serial() function, for instance y1 = serial( A * x ).
Please note that the use of BLAZE_SERIAL_SECTION (see also Serial Execution) does NOT work in this context!