One of the main motivations of the Blaze 1.x releases was to achieve maximum performance on a single CPU core for all possible operations. However, today's CPUs are no longer single-core, but provide several (homogeneous or heterogeneous) compute cores. To fully exploit the performance potential of a multicore CPU, computations have to be parallelized across all of its available cores. Therefore, starting with Blaze 2.0, the Blaze library provides shared memory parallelization with OpenMP.
To enable OpenMP-based parallelization, it is sufficient to explicitly enable OpenMP on the compiler command line, for example with the -fopenmp flag for GCC and Clang.
This simple action will cause the Blaze library to automatically try to run all operations in parallel with the specified number of threads.
As is common for OpenMP, the number of threads can be specified either via the OMP_NUM_THREADS environment variable (for example, export OMP_NUM_THREADS=4 in a POSIX shell) or via an explicit call to the omp_set_num_threads() function.
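As a minimal sketch of the second option, the snippet below requests a thread count through omp_set_num_threads() and reports the size of the team that a parallel region actually receives. The helper name configure_threads is hypothetical and not part of Blaze; the OpenMP calls are guarded so that the code also builds when OpenMP is disabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of Blaze): request a thread count and
// return the team size observed inside a parallel region.
int configure_threads(int requested)
{
#ifdef _OPENMP
    // Sets the default team size for subsequent parallel regions.
    omp_set_num_threads(requested);

    int team = 1;
    #pragma omp parallel
    {
        // Exactly one thread records the team size.
        #pragma omp single
        team = omp_get_num_threads();
    }
    return team;
#else
    // Without OpenMP everything runs on a single thread.
    (void)requested;
    return 1;
#endif
}
```

The call has to happen before the operations that should run in parallel. The runtime may still provide fewer threads than requested, which is why the sketch reports the observed team size instead of assuming it.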
Either way, the best performance can be expected if the specified number of threads matches the available number of cores.
Note that Blaze does not run every operation in parallel unconditionally. If Blaze deems parallel execution counterproductive for the overall performance, the operation is executed serially. One of the main reasons for not executing an operation in parallel is the size of the operands. For instance, a vector addition is only executed in parallel if the size of both vector operands exceeds a certain threshold. Otherwise, performance could decrease seriously due to the overhead of the thread setup. To adapt the Blaze library to a specific system, these thresholds can be configured manually. All OpenMP thresholds are contained in the configuration file ./blaze/config/Thresholds.h.
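The idea of a size threshold can be illustrated with a small sketch in plain C++ and OpenMP. This is not Blaze's implementation: the function add_vectors, the constant kParallelThreshold, and its value are assumptions chosen purely for illustration. The OpenMP if clause runs the loop in parallel only when the operands are large enough.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumed cutoff for illustration only; not a value taken from Blaze.
constexpr std::size_t kParallelThreshold = 50000;

// Hypothetical helper: element-wise addition that runs in parallel
// only if the operands reach the threshold.
std::vector<double> add_vectors(const std::vector<double>& x,
                                const std::vector<double>& y)
{
    if (x.size() != y.size()) {
        throw std::invalid_argument("vector sizes do not match");
    }

    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    std::vector<double> out(x.size());

    // The if clause keeps small inputs on a single thread, which
    // avoids paying the thread setup cost for little work.
    #pragma omp parallel for schedule(static) if(x.size() >= kParallelThreshold)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        out[i] = x[i] + y[i];
    }
    return out;
}
```

A suitable cutoff depends on the machine, which is the reason such thresholds are worth tuning per system.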
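The Blaze library does not yet automatically initialize dynamic memory according to the first touch principle. Consider, for instance, a vector triad a = b + c * d in which the vectors b, c, and d are first initialized in a serial loop and the triad itself is then executed in parallel.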
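The vector triad scenario can be sketched in plain C++ as follows. This is a stand-in under stated assumptions, not the library's own listing: raw heap arrays replace Blaze vectors, the function name triad_serial_init is hypothetical, and the constant fill values are arbitrary.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the vector triad with serial initialization.
double triad_serial_init(std::size_t n)
{
    const std::ptrdiff_t len = static_cast<std::ptrdiff_t>(n);

    // new double[n] leaves the elements uninitialized, so the array
    // data is not written at this point.
    std::unique_ptr<double[]> a(new double[n]);
    std::unique_ptr<double[]> b(new double[n]);
    std::unique_ptr<double[]> c(new double[n]);
    std::unique_ptr<double[]> d(new double[n]);

    // Serial initialization: a single thread writes b, c, and d first.
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        b[i] = 1.0;
        c[i] = 2.0;
        d[i] = 3.0;
    }

    // Parallel vector triad a = b + c * d.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        a[i] = b[i] + c[i] * d[i];
    }

    // Serial checksum so the result can be verified.
    double sum = 0.0;
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        sum += a[i];
    }
    return sum;
}
```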
If this code, which is prototypical of many OpenMP applications that have not been optimized for ccNUMA architectures, is run across several locality domains (LDs), it will not scale beyond the maximum performance achievable on a single LD if the working set does not fit into the cache. This is because the initialization loop is executed by a single thread, writing to b, c, and d for the first time. Hence, all memory pages belonging to those arrays will be mapped into a single LD.
This problem can be solved by performing the vector initialization in parallel, so that each thread is the first to touch the data it later works on.
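A parallel initialization could look like the following sketch, again in plain C++ with raw heap arrays standing in for Blaze vectors. The function name triad_parallel_init and the fill values are hypothetical. The sketch assumes that using the same static schedule in both loops gives each thread the same index range for initialization and computation.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the vector triad with first-touch initialization.
double triad_parallel_init(std::size_t n)
{
    const std::ptrdiff_t len = static_cast<std::ptrdiff_t>(n);

    // new double[n] leaves the elements uninitialized, so the array
    // data is not written at this point.
    std::unique_ptr<double[]> a(new double[n]);
    std::unique_ptr<double[]> b(new double[n]);
    std::unique_ptr<double[]> c(new double[n]);
    std::unique_ptr<double[]> d(new double[n]);

    // Parallel initialization: each thread writes its own chunk first.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        b[i] = 1.0;
        c[i] = 2.0;
        d[i] = 3.0;
    }

    // Parallel triad with the same schedule, so each thread works on
    // the chunk it initialized.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        a[i] = b[i] + c[i] * d[i];
    }

    // Serial checksum so the result can be verified.
    double sum = 0.0;
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        sum += a[i];
    }
    return sum;
}
```

The result is identical to the serially initialized variant; only the placement of the memory pages differs, and that placement is what matters on ccNUMA systems.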
This simple modification makes a huge difference on ccNUMA in memory-bound situations (as for instance in all BLAS level 1 operations and partially BLAS level 2 operations). Therefore, in order to achieve the maximum possible performance, it is imperative to initialize the memory according to the later use of the data structures.