Blaze 3.9
The size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix can indeed be larger than expected:
In order to achieve the maximum possible performance, the Blaze library tries to enable SIMD vectorization even for small vectors. For that reason Blaze by default uses padding elements for all dense vectors and matrices to guarantee that at least a single SIMD vector can be loaded. Depending on the SIMD technology in use, this can significantly increase the size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix:
The configuration file ./blaze/config/Padding.h provides a compile time switch that can be used to (de-)activate padding:
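The switch looks roughly like this (a sketch; check the header of your Blaze version for the exact form):

```cpp
// ./blaze/config/Padding.h (sketch)
#ifndef BLAZE_DEFAULT_PADDING_FLAG
#  define BLAZE_DEFAULT_PADDING_FLAG blaze::padded
#endif
```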
Alternatively, it is possible to (de-)activate padding via the command line or by defining this symbol manually before including any Blaze header file:
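For instance (a sketch; the compiler invocation is illustrative):

```cpp
// Either on the command line:
//   g++ -DBLAZE_DEFAULT_PADDING_FLAG=blaze::unpadded main.cpp
// or manually, before including any Blaze header:
#define BLAZE_DEFAULT_PADDING_FLAG blaze::unpadded
#include <blaze/Blaze.h>
```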
If BLAZE_DEFAULT_PADDING_FLAG is set to blaze::padded, padding is enabled by default for StaticVector, HybridVector, StaticMatrix, and HybridMatrix. If it is set to blaze::unpadded, padding is disabled by default. Note, however, that disabling padding can considerably reduce the performance of all dense vector and matrix operations!
Even when padding is disabled via the BLAZE_DEFAULT_PADDING_FLAG compile time switch (see A StaticVector/StaticMatrix is larger than expected. Is this a bug?), the size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix can still be larger than expected:
The reason for this behavior is the SIMD technology in use. If SSE is used, which provides 128-bit wide registers, a single SIMD pack can usually hold 4 integers (128 bit divided by 32 bit). Since the second vector contains enough elements, it is possible to benefit from vectorization. However, SSE requires an alignment of 16 bytes, which ultimately results in a total size of 32 bytes for the StaticVector (2 times 16 bytes due to 5 integer elements). If AVX or AVX-512 is used, which provide 256-bit or 512-bit wide registers, a single SIMD vector can hold 8 or 16 integers, respectively. In that case not even the second vector holds enough elements to benefit from vectorization, which is why Blaze does not enforce a 32-byte (for AVX) or even 64-byte alignment (for AVX-512).
It is possible to disable the SIMD-specific alignment for StaticVector, StaticMatrix, HybridVector, and HybridMatrix via the compile time switch in the ./blaze/config/Alignment.h configuration file:
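The switch looks roughly like this (a sketch; check the header of your Blaze version for the exact form):

```cpp
// ./blaze/config/Alignment.h (sketch)
#ifndef BLAZE_DEFAULT_ALIGNMENT_FLAG
#  define BLAZE_DEFAULT_ALIGNMENT_FLAG blaze::aligned
#endif
```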
Alternatively, it is possible to set the default alignment flag via the command line or by defining this symbol manually before including any Blaze header file:
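For instance (a sketch; the compiler invocation is illustrative):

```cpp
// Either on the command line:
//   g++ -DBLAZE_DEFAULT_ALIGNMENT_FLAG=blaze::unaligned main.cpp
// or manually, before including any Blaze header:
#define BLAZE_DEFAULT_ALIGNMENT_FLAG blaze::unaligned
#include <blaze/Blaze.h>
```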
If BLAZE_DEFAULT_ALIGNMENT_FLAG is set to blaze::aligned, StaticVector, HybridVector, StaticMatrix, and HybridMatrix use aligned memory by default. If it is set to blaze::unaligned, they do not enforce aligned memory. Note, however, that disabling alignment can considerably reduce the performance of all operations with these vector and matrix types!
Alternatively, it is possible to disable vectorization entirely via the compile time switch in the ./blaze/config/Vectorization.h configuration file:
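The switch looks roughly like this (a sketch; check the header of your Blaze version for the exact form):

```cpp
// ./blaze/config/Vectorization.h (sketch)
#ifndef BLAZE_USE_VECTORIZATION
#  define BLAZE_USE_VECTORIZATION 1
#endif
```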
It is also possible to (de-)activate vectorization via the command line or by defining this symbol manually before including any Blaze header file:
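For instance (a sketch; the compiler invocation is illustrative):

```cpp
// Either on the command line:
//   g++ -DBLAZE_USE_VECTORIZATION=0 main.cpp
// or manually, before including any Blaze header:
#define BLAZE_USE_VECTORIZATION 0
#include <blaze/Blaze.h>
```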
In case the switch is set to 1, vectorization is enabled and the Blaze library is allowed to use intrinsics and the necessary alignment to speed up computations. In case the switch is set to 0, vectorization is disabled entirely and the Blaze library chooses default, non-vectorized functionality for the operations. Note that deactivating the vectorization may pose a severe performance limitation for a large number of operations!
With active vectorization the elements of a StaticVector, HybridVector, StaticMatrix, and HybridMatrix are possibly over-aligned to meet the alignment requirements of the available instruction set (SSE, AVX, AVX-512, ...). The alignment for fundamental types (short, int, float, double, ...) and complex types (complex<float>, complex<double>, ...) is 16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512. All other types are aligned according to their intrinsic alignment:
For this reason StaticVector, HybridVector, StaticMatrix, and HybridMatrix cannot be used in containers using dynamic memory, such as std::vector, without additionally providing an allocator that can produce over-aligned memory:
It is possible to disable vectorization entirely via the compile time switch in the ./blaze/config/Vectorization.h configuration file:
It is also possible to (de-)activate vectorization via the command line or by defining this symbol manually before including any Blaze header file:
In case the switch is set to 1, vectorization is enabled and the Blaze library is allowed to use intrinsics and the necessary alignment to speed up computations. In case the switch is set to 0, vectorization is disabled entirely and the Blaze library chooses default, non-vectorized functionality for the operations. Note that deactivating the vectorization may pose a severe performance limitation for a large number of operations!
Currently the only BLAS functions utilized by Blaze are the gemm() functions for the multiplication of two dense matrices (i.e. sgemm(), dgemm(), cgemm(), and zgemm()). All other operations are always and unconditionally performed by native Blaze kernels.
The BLAZE_BLAS_MODE config switch (see ./blaze/config/BLAS.h) determines whether Blaze is allowed to use BLAS kernels. If BLAZE_BLAS_MODE is set to 0, Blaze does not utilize the BLAS kernels and unconditionally uses its own custom kernels. If BLAZE_BLAS_MODE is set to 1, Blaze is allowed to choose between BLAS kernels and its own custom kernels. In case of the dense matrix multiplication this decision is based on the size of the dense matrices: for large matrices Blaze uses the BLAS kernels, for small matrices it uses its own custom kernels. The threshold for this decision can be configured via the BLAZE_DMATDMATMULT_THRESHOLD, BLAZE_DMATTDMATMULT_THRESHOLD, BLAZE_TDMATDMATMULT_THRESHOLD, and BLAZE_TDMATTDMATMULT_THRESHOLD config switches (see ./blaze/config/Thresholds.h).
Please note that the extent to which Blaze uses BLAS kernels can change in future releases of Blaze!
Blaze uses LAPACK functions for matrix decomposition, matrix inversion, computing determinants and eigenvalues, and the SVD. In contrast to the BLAS functionality (see To which extent does Blaze make use of BLAS functions under the hood?), you cannot disable LAPACK or switch to custom kernels. If you try to use any of these functionalities without providing (i.e. linking) a LAPACK library, you will get link time errors.
Please note that the extent to which Blaze uses LAPACK kernels can change in future releases of Blaze!
The following examples give an overview of different approaches to set up a sparse, row-major NxN matrix with the following pattern, where all values on the diagonal and the two sub-diagonals are filled:
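The original illustration of the pattern is not preserved here; a sketch of the intended tridiagonal structure (x marks a filled element):

```
( x x 0 0 0 ... )
( x x x 0 0 ... )
( 0 x x x 0 ... )
( 0 0 x x x ... )
(      ...      )
```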
Special emphasis is given to the runtime until the matrix setup is complete. In all cases the runtime is benchmarked with Clang-9.0 (compilation flags -O2 and -DNDEBUG) for N=200000.
Approach 1: Using the function call operator
In this approach the function call operator (i.e. operator()) is used to insert the elements into the matrix:
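The original listing is not preserved here; a minimal sketch of this approach, assuming Blaze headers, a size_t N, and placeholder values on the three diagonals, might look like this:

```cpp
blaze::CompressedMatrix<double,blaze::rowMajor> A( N, N );

for( size_t i=0UL; i<N; ++i ) {
   if( i > 0UL   ) A(i,i-1UL) = 1.0;  // lower sub-diagonal
   A(i,i) = 1.0;                      // diagonal
   if( i < N-1UL ) A(i,i+1UL) = 1.0;  // upper sub-diagonal
}
```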
This approach is the most general and convenient, but also the slowest of all (approx. 64 seconds). With every call to operator(), a new element is inserted at the specified position. This implies shifting all subsequent elements and adapting every subsequent row. Since all non-zero elements are stored in a single array inside a CompressedMatrix, this approach is similar to inserting elements at the front of a std::vector: all subsequent elements have to be shifted.
Approach 2: Rowwise reserve and insert
The next approach performs a rowwise reservation of capacity:
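The original listing is not preserved here; a sketch of this approach, under the same assumptions as above, might look like this:

```cpp
blaze::CompressedMatrix<double,blaze::rowMajor> A( N, N );
A.reserve( 3UL*N );  // allocate capacity for all non-zero elements at once

for( size_t i=0UL; i<N; ++i ) {
   A.reserve( i, 3UL );  // assign part of the existing capacity to row i
   if( i > 0UL   ) A.insert( i, i-1UL, 1.0 );
   A.insert( i, i, 1.0 );
   if( i < N-1UL ) A.insert( i, i+1UL, 1.0 );
}
```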
The first call to reserve() performs the memory allocation for the entire matrix. The complete matrix now holds the entire capacity, but each single row has a capacity of 0. The subsequent calls to reserve() therefore distribute the existing capacity among the rows.
Unfortunately, this approach is also rather slow: the runtime is approx. 30 seconds. Its downside is that changing the capacity of a single row causes a change in all following rows, which makes this approach similar to the first one.
Approach 3: reserve/append/finalize
As the wiki explains, the most efficient way to fill a sparse matrix is a combination of reserve(), append(), and finalize():
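The original listing is not preserved here; a sketch of this approach, under the same assumptions as above, might look like this:

```cpp
blaze::CompressedMatrix<double,blaze::rowMajor> A( N, N );
A.reserve( 3UL*N );  // allocate memory for all non-zero elements up front

for( size_t i=0UL; i<N; ++i ) {
   if( i > 0UL   ) A.append( i, i-1UL, 1.0 );
   A.append( i, i, 1.0 );
   if( i < N-1UL ) A.append( i, i+1UL, 1.0 );
   A.finalize( i );  // required for every row, even empty ones!
}
```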
The initial call to reserve() allocates enough memory for all non-zero elements of the entire matrix. append() and finalize() are then used to insert the elements and to mark the end of each single row. This is a very low-level approach, very similar to manually writing to an array, and it results in a mere 0.026 seconds. The append() function writes the new element to the next memory location, and at the end of each row or column the finalize() function sets the internal pointers accordingly. It is very important to note that finalize() has to be called explicitly for each row, even for empty ones; otherwise the internal data structure will be corrupted! Also note that although append() does not allocate new memory, it still invalidates all iterators returned by the end() functions!
Approach 4: Reservation via the constructor
In case the number of non-zero elements is known upfront, it is also possible to perform the reservation via the constructor of CompressedMatrix. For that purpose CompressedMatrix provides a constructor taking a std::vector<size_t>:
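The original listing is not preserved here; a sketch of this approach, under the same assumptions as above, might look like this:

```cpp
// Number of non-zero elements per row: 2 in the first and last row, 3 otherwise
std::vector<size_t> nonzeros( N, 3UL );
nonzeros.front() = 2UL;
nonzeros.back()  = 2UL;

// The constructor performs the per-row reservation up front
blaze::CompressedMatrix<double,blaze::rowMajor> A( N, N, nonzeros );

for( size_t i=0UL; i<N; ++i ) {
   if( i > 0UL   ) A.append( i, i-1UL, 1.0 );
   A.append( i, i, 1.0 );
   if( i < N-1UL ) A.append( i, i+1UL, 1.0 );
   A.finalize( i );
}
```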
The runtime for this approach is 0.027 seconds.
The include file <blaze/Blaze.h> includes the entire functionality of the Blaze library, which by now amounts to several hundred thousand lines of source code. That means that a lot of source code has to be parsed whenever <blaze/Blaze.h> is encountered. However, it is rare that everything is required within a single compilation unit. It is therefore easily possible to reduce compile times by including only those Blaze features that are actually used within the compilation unit. For instance, instead of including <blaze/Blaze.h> it may be enough to include <blaze/math/DynamicVector.h>, which would reduce the compilation times by about 20%.
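For instance (a sketch of the idea):

```cpp
// Instead of pulling in the entire library ...
// #include <blaze/Blaze.h>
// ... include only the features used in this compilation unit:
#include <blaze/math/DynamicVector.h>
```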
Additionally, we take care to implement new Blaze functionality in such a way that compile times do not explode, and we try to reduce the compile times of existing features. Newer releases of Blaze can therefore also improve compile times.
In some cases you might be able to implement the required functionality very conveniently by building on the existing map() functions (see The map() Functions). For instance, the following code demonstrates the addition of a function that merges two vectors of floating point type into a vector of complex numbers:
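The original listing is not preserved here; a sketch of such a merge via the binary map() function, assuming Blaze headers and <complex> (the values are illustrative):

```cpp
blaze::DynamicVector<float> real{ 2.1F, -4.2F,  1.0F,  0.6F };
blaze::DynamicVector<float> imag{ 0.3F,  1.4F,  2.9F, -3.4F };

// Combine the two vectors element-wise into complex numbers
auto cplx = blaze::map( real, imag,
   []( float r, float i ){ return std::complex<float>( r, i ); } );
```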
You will find a summary of the necessary steps to create custom features in Customization.
Sometimes, however, the available customization points might not be sufficient. In this case you are cordially invited to create a pull request that provides the implementation of the feature, or to create an issue according to our Issue Creation Guidelines. Please try to explain the feature as descriptively as possible, for instance by providing conceptual code examples.