blaze / FAQ


A StaticVector/StaticMatrix is larger than expected. Is this a bug?

The size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix can indeed be larger than expected:

StaticVector<int,3> a;
StaticMatrix<int,3,3> A;

sizeof( a );  // Evaluates to 16, 32, or even 64, but not 12
sizeof( A );  // Evaluates to 48, 96, or even 144, but not 36

In order to achieve the maximum possible performance, the Blaze library tries to enable SIMD vectorization even for small vectors. For that reason Blaze by default uses padding elements for all dense vectors and matrices to guarantee that at least a single SIMD vector can be loaded. Depending on the SIMD technology in use, this can significantly increase the size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix:

StaticVector<int,3> a;
StaticMatrix<int,3,3> A;

sizeof( a );  // Evaluates to 16 in case of SSE, 32 in case of AVX, and 64 in case of AVX-512
              // (under the assumption that an integer occupies 4 bytes)
sizeof( A );  // Evaluates to 48 in case of SSE, 96 in case of AVX, and 144 in case of AVX-512
              // (under the assumption that an integer occupies 4 bytes)
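The sizes above are simply the element count rounded up to a full SIMD register. The following standalone sketch reproduces that arithmetic in plain C++; it is not Blaze code, and padded_elements is a hypothetical helper introduced only for illustration:

```cpp
#include <cstddef>

// Hypothetical helper (not part of Blaze): round an element count up to
// the next multiple of the SIMD width, which is the effect of padding.
constexpr std::size_t padded_elements( std::size_t n, std::size_t simd_width )
{
   return ( ( n + simd_width - 1 ) / simd_width ) * simd_width;
}

// SSE: 128-bit registers hold 4 ints (4 bytes each)
static_assert( padded_elements( 3, 4 ) * sizeof(int) == 16, "StaticVector<int,3> with SSE" );
// A row-major 3x3 matrix pads each of its 3 rows separately
static_assert( 3 * padded_elements( 3, 4 ) * sizeof(int) == 48, "StaticMatrix<int,3,3> with SSE" );
// AVX: 256-bit registers hold 8 ints
static_assert( padded_elements( 3, 8 ) * sizeof(int) == 32, "StaticVector<int,3> with AVX" );
```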

The configuration file ./blaze/config/Padding.h provides a compile time switch that can be used to (de-)activate padding:

#define BLAZE_DEFAULT_PADDING_FLAG blaze::padded

Alternatively it is possible to (de-)activate padding via command line or by defining this symbol manually before including any Blaze header file:

g++ ... -DBLAZE_DEFAULT_PADDING_FLAG=blaze::padded ...
#define BLAZE_DEFAULT_PADDING_FLAG blaze::padded
#include <blaze/Blaze.h>

If BLAZE_DEFAULT_PADDING_FLAG is set to blaze::padded, padding is enabled by default for StaticVector, HybridVector, StaticMatrix, and HybridMatrix. If it is set to blaze::unpadded, padding is disabled by default. Note however that disabling padding can considerably reduce the performance of all dense vector and matrix operations!


Despite disabling padding, a StaticVector/StaticMatrix is still larger than expected. Is this a bug?

Despite disabling padding via the BLAZE_DEFAULT_PADDING_FLAG compile time switch (see A StaticVector/StaticMatrix is larger than expected. Is this a bug?), the size of a StaticVector, StaticMatrix, HybridVector, or HybridMatrix can still be larger than expected:

#define BLAZE_DEFAULT_PADDING_FLAG blaze::unpadded
#include <blaze/Blaze.h>

StaticVector<int,3> a;
StaticVector<int,5> b;

sizeof( a );  // Always evaluates to 12
sizeof( b );  // Evaluates to 32 with SSE (larger than expected) and to 20 with AVX or AVX-512 (expected)

The reason for this behavior is the used SIMD technology. If SSE is used, which provides 128-bit wide registers, a single SIMD vector can hold 4 integers (128 bit divided by 32 bit). Since the second vector contains enough elements, it is possible to benefit from vectorization. However, SSE requires an alignment of 16 bytes, which ultimately results in a total size of 32 bytes for the StaticVector (2 times 16 bytes due to 5 integer elements). If AVX or AVX-512 is used, which provide 256-bit or 512-bit wide registers, a single SIMD vector can hold 8 or 16 integers, respectively. In that case not even the second vector holds enough elements to benefit from vectorization, which is why Blaze does not enforce a 32-byte (for AVX) or even 64-byte alignment (for AVX-512).
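The growth of the unpadded 5-element vector is plain alignment rounding: the object's byte size is rounded up to the next multiple of the required alignment. A standalone sketch of that calculation (plain C++, not Blaze code; aligned_size is a hypothetical helper):

```cpp
#include <cstddef>

// Hypothetical helper (not part of Blaze): round a byte count up to the
// next multiple of the required alignment.
constexpr std::size_t aligned_size( std::size_t bytes, std::size_t alignment )
{
   return ( ( bytes + alignment - 1 ) / alignment ) * alignment;
}

// StaticVector<int,3>: too small for SIMD, so only alignof(int) applies
static_assert( aligned_size( 3 * sizeof(int), alignof(int) ) == 12, "" );
// StaticVector<int,5> with SSE: 20 bytes rounded up to a 16-byte boundary
static_assert( aligned_size( 5 * sizeof(int), 16 ) == 32, "" );
```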

It is possible to disable the SIMD-specific alignment for StaticVector, StaticMatrix, HybridVector, or HybridMatrix via the compile time switch in the ./blaze/config/Alignment.h configuration file:

#define BLAZE_DEFAULT_ALIGNMENT_FLAG blaze::aligned

Alternatively it is possible to set the default alignment flag via command line or by defining this symbol manually before including any Blaze header file:

g++ ... -DBLAZE_DEFAULT_ALIGNMENT_FLAG=blaze::aligned ...
#define BLAZE_DEFAULT_ALIGNMENT_FLAG blaze::aligned
#include <blaze/Blaze.h>

If BLAZE_DEFAULT_ALIGNMENT_FLAG is set to blaze::aligned then StaticVector, HybridVector, StaticMatrix, and HybridMatrix use aligned memory by default. If it is set to blaze::unaligned they don't enforce aligned memory. Note however that disabling alignment can considerably reduce the performance of all operations with these vector and matrix types!

Alternatively it is possible to disable vectorization entirely via the compile time switch in the ./blaze/config/Vectorization.h configuration file:

#define BLAZE_USE_VECTORIZATION 1

It is also possible to (de-)activate vectorization via command line or by defining this symbol manually before including any Blaze header file:

g++ ... -DBLAZE_USE_VECTORIZATION=1 ...
#define BLAZE_USE_VECTORIZATION 1
#include <blaze/Blaze.h>

In case the switch is set to 1, vectorization is enabled and the Blaze library is allowed to use intrinsics and the necessary alignment to speed up computations. In case the switch is set to 0, vectorization is disabled entirely and the Blaze library chooses default, non-vectorized functionality for the operations. Note that deactivating the vectorization may pose a severe performance limitation for a large number of operations!


I experience crashes when using StaticVector/StaticMatrix in a std::vector. Is this a bug?

With active vectorization the elements of a StaticVector, HybridVector, StaticMatrix, and HybridMatrix are possibly over-aligned to meet the alignment requirements of the available instruction set (SSE, AVX, AVX-512, ...). The alignment for fundamental types (short, int, float, double, ...) and complex types (complex<float>, complex<double>, ...) is 16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512. All other types are aligned according to their intrinsic alignment:

struct Int { int i; };

using VT1 = blaze::StaticVector<double,3UL>;
using VT2 = blaze::StaticVector<complex<float>,2UL>;
using VT3 = blaze::StaticVector<Int,5UL>;

alignof( VT1 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
alignof( VT2 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
alignof( VT3 );  // Evaluates to 'alignof( Int )'

For this reason StaticVector, HybridVector, StaticMatrix, and HybridMatrix cannot be used in containers using dynamic memory such as std::vector without additionally providing an allocator that can provide over-aligned memory:

using Type = blaze::StaticVector<double,3UL>;
using Allocator = blaze::AlignedAllocator<Type>;

std::vector<Type> v1;  // Might be misaligned for AVX or AVX-512
std::vector<Type,Allocator> v2;  // Properly aligned for AVX or AVX-512

It is possible to disable vectorization entirely via the compile time switch in the <blaze/config/Vectorization.h> configuration file:

#define BLAZE_USE_VECTORIZATION 1

It is also possible to (de-)activate vectorization via command line or by defining this symbol manually before including any Blaze header file:

g++ ... -DBLAZE_USE_VECTORIZATION=1 ...
#define BLAZE_USE_VECTORIZATION 1
#include <blaze/Blaze.h>

In case the switch is set to 1, vectorization is enabled and the Blaze library is allowed to use intrinsics and the necessary alignment to speed up computations. In case the switch is set to 0, vectorization is disabled entirely and the Blaze library chooses default, non-vectorized functionality for the operations. Note that deactivating the vectorization may pose a severe performance limitation for a large number of operations!


To what extent does Blaze make use of BLAS functions under the hood?

Currently the only BLAS functions that are utilized by Blaze are the gemm() functions for the multiplication of two dense matrices (i.e. sgemm(), dgemm(), cgemm(), and zgemm()). All other operations are always and unconditionally performed by native Blaze kernels.

The BLAZE_BLAS_MODE config switch (see <blaze/config/BLAS.h>) determines whether Blaze is allowed to use BLAS kernels. If BLAZE_BLAS_MODE is set to 0 then Blaze does not utilize the BLAS kernels and unconditionally uses its own custom kernels. If BLAZE_BLAS_MODE is set to 1 then Blaze is allowed to choose between using BLAS kernels or its own custom kernels. In case of the dense matrix multiplication this decision is based on the size of the dense matrices. For large matrices, Blaze uses the BLAS kernels, for small matrices it uses its own custom kernels. The threshold for this decision can be configured via the BLAZE_DMATDMATMULT_THRESHOLD, BLAZE_DMATTDMATMULT_THRESHOLD, BLAZE_TDMATDMATMULT_THRESHOLD and BLAZE_TDMATTDMATMULT_THRESHOLD config switches (see <blaze/config/Thresholds.h>).
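Like the padding and vectorization switches above, BLAZE_BLAS_MODE can be set via the command line or by defining the symbol before including any Blaze header file:

```
g++ ... -DBLAZE_BLAS_MODE=1 ...
#define BLAZE_BLAS_MODE 1
#include <blaze/Blaze.h>
```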

Please note that the extent to which Blaze uses BLAS kernels can change in future releases of Blaze!


To what extent does Blaze make use of LAPACK functions under the hood?

Blaze uses LAPACK functions for matrix decomposition, matrix inversion, computing determinants and eigenvalues, and the SVD. In contrast to the BLAS functionality (see the previous question), you cannot disable LAPACK or switch to custom kernels. If you use any of these functionalities but do not provide (i.e. link) a LAPACK library, you will get link time errors.

Please note that the extent to which Blaze uses LAPACK kernels can change in future releases of Blaze!


What is the fastest way to set up a very large sparse matrix?

The following examples give an overview of different approaches to set up a sparse, row-major NxN matrix with the following pattern, where all values on the diagonal and the two adjacent sub-diagonals are filled:

(  1    1    0    0    0   ...   0    0    0  )
(  1    1    1    0    0   ...   0    0    0  )
(  0    1    1    1    0   ...   0    0    0  )
(  0    0    1    1    1   ...   0    0    0  )
(  0    0    0    1    1   ...   0    0    0  )
( ...  ...  ...  ...  ...  ...  ...  ...  ... )
(  0    0    0    0    0   ...   1    1    0  )
(  0    0    0    0    0   ...   1    1    1  )
(  0    0    0    0    0   ...   0    1    1  )

Special emphasis is given to the runtime until the matrix setup is complete. In all cases the runtime is benchmarked with Clang-9.0 (compilation flags -O2 and -DNDEBUG) for N=200000.

Approach 1: Using the function call operator

In this approach the function call operator (i.e. operator()) is used to insert the corresponding elements into the matrix:

blaze::CompressedMatrix<int,rowMajor> A( N, N );

A.reserve( N*3UL-2UL );  // Optional: Reserve capacity for all elements upfront

for( size_t i=0; i<N; ++i ) {
   const size_t jbegin( i == 0UL ? 0UL : i-1UL );
   const size_t jend  ( i == N-1UL ? N-1UL : i+1UL );
   for( size_t j=jbegin; j<=jend; ++j ) {
      A(i,j) = 1;
   }
}

This approach is the most general and convenient, but also the slowest of all (approx. 64 seconds). With every call to operator(), a new element is inserted at the specified position. This implies shifting all subsequent elements and adapting every subsequent row. Since all non-zero elements are stored in a single array inside a CompressedMatrix, this approach is similar to inserting elements at the front of a std::vector; all subsequent elements have to be shifted.
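The analogy with inserting at the front of a std::vector can be made concrete with plain standard C++ (no Blaze involved): every insert shifts all elements already stored, so n inserts cost O(n^2) overall.

```cpp
#include <vector>

// Each insert at the front moves every existing element one slot to the
// right, just as operator() shifts all subsequent non-zeros in a
// CompressedMatrix.
std::vector<int> front_inserted( int n )
{
   std::vector<int> v;
   for( int i = 0; i < n; ++i ) {
      v.insert( v.begin(), i );  // O(size) per insert -> O(n^2) total
   }
   return v;
}
```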

Approach 2: Rowwise reserve and insert

The next approach performs a rowwise reservation of capacity:

blaze::CompressedMatrix<int,rowMajor> A( N, N );

A.reserve( N*3UL );                // Allocate the total amount of memory
A.reserve( 0UL, 2UL );             // Reserve a capacity of 2 for row 0
for( size_t i=1; i<N-1UL; ++i ) {
   A.reserve( i, 3UL );            // Reserve a capacity of 3 for row i
}
A.reserve( N-1UL, 2UL );           // Reserve a capacity of 2 for the last row

for( size_t i=0; i<N; ++i ) {
   const size_t jbegin( i == 0UL ? 0UL : i-1UL );
   const size_t jend  ( i == N-1UL ? N-1UL : i+1UL );
   for( size_t j=jbegin; j<=jend; ++j ) {
      A.insert( i, j, 1 );
   }
}

The first call to reserve() performs the memory allocation for the entire matrix. The complete matrix now holds the entire capacity, but each single row has a capacity of 0. The subsequent calls to reserve() therefore distribute the existing capacity among the rows.

Unfortunately, this approach is also rather slow: the runtime is approx. 30 seconds. Its downside is that changing the capacity of a single row causes a change in all subsequent rows, which makes this approach similar to the first one.

Approach 3: reserve/append/finalize

As the wiki explains, the most efficient way to fill a sparse matrix is a combination of reserve(), append() and finalize():

CompressedMatrix<int,rowMajor> A( N, N );

A.reserve( N*3UL );
for( size_t i=0; i<N; ++i ) {
   const size_t jbegin( i == 0UL ? 0UL : i-1UL );
   const size_t jend  ( i == N-1UL ? N-1UL : i+1UL );
   for( size_t j=jbegin; j<=jend; ++j ) {
      A.append( i, j, 1 );
   }
   A.finalize( i );
}

The initial call to reserve() allocates enough memory for all non-zero elements of the entire matrix. append() and finalize() are then used to insert the elements and to mark the end of each single row. This is a very low-level approach and very similar to writing to an array manually, which results in a mere 0.026 seconds. The append() function writes the new element to the next free memory location, and at the end of each row or column the finalize() function sets the internal pointers accordingly. It is very important to note that the finalize() function has to be explicitly called for each row, even for empty ones! Otherwise the internal data structure will be corrupted! Also note that although append() does not allocate new memory, it still invalidates all iterators returned by the end() functions!
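What reserve()/append()/finalize() amount to internally can be sketched in standalone C++ (our own illustration, not Blaze's implementation): the three arrays of a CSR-style matrix are filled strictly from left to right, so nothing ever has to be shifted.

```cpp
#include <cstddef>
#include <vector>

// CSR-style storage: values and column indices in one packed array each,
// plus one "end of row" index per row (hypothetical names, not Blaze's).
struct CsrPattern
{
   std::vector<std::size_t> row_end;  // index one past the last element of each row
   std::vector<std::size_t> col;      // column indices of the non-zeros
   std::vector<int>         val;      // values of the non-zeros
};

CsrPattern tridiagonal( std::size_t N )
{
   CsrPattern m;
   m.col.reserve( 3UL*N - 2UL );  // reserve(): one allocation for all non-zeros
   m.val.reserve( 3UL*N - 2UL );
   for( std::size_t i=0UL; i<N; ++i ) {
      const std::size_t jbegin( i == 0UL ? 0UL : i-1UL );
      const std::size_t jend  ( i == N-1UL ? N-1UL : i+1UL );
      for( std::size_t j=jbegin; j<=jend; ++j ) {
         m.col.push_back( j );  // append(): write to the next free slot
         m.val.push_back( 1 );
      }
      m.row_end.push_back( m.col.size() );  // finalize(): close row i
   }
   return m;
}
```

Note how finalize() corresponds to recording the row boundary: skipping it for any row, even an empty one, would leave the row_end array inconsistent with the packed data.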

Approach 4: Reservation via the constructor

In case the number of non-zero elements is known upfront, it is also possible to perform the reservation via the constructor of CompressedMatrix. For that purpose CompressedMatrix provides a constructor taking a std::vector<size_t>:

std::vector<size_t> nonzeros( N, 3UL );  // Create a vector of N elements with value 3
nonzeros[  0] = 2UL;                     // We need only 2 elements in the first row ...
nonzeros[N-1] = 2UL;                     //  ... and last row.

CompressedMatrix<int,rowMajor> A( N, N, nonzeros );

for( size_t i=0; i<N; ++i ) {
   const size_t jbegin( i == 0UL ? 0UL : i-1UL );
   const size_t jend  ( i == N-1UL ? N-1UL : i+1UL );
   for( size_t j=jbegin; j<=jend; ++j ) {
      A.insert( i, j, 1 );
   }
}

The runtime for this approach is 0.027 seconds.


The compile time is too high if I include <blaze/Blaze.h>. Can I reduce it?

The include file <blaze/Blaze.h> includes the entire functionality of the Blaze library, which by now is several hundred thousand lines of source code. That means that a lot of source code has to be parsed whenever <blaze/Blaze.h> is encountered. However, it is rare that everything is required within a single compilation unit. Therefore it is easily possible to reduce compile times by including only those Blaze features that are used within the compilation unit. For instance, instead of including <blaze/Blaze.h> it could be enough to include <blaze/math/DynamicVector.h>, which would reduce the compilation times by about 20%.

Additionally we are taking care to implement new Blaze functionality such that compile times do not explode and try to reduce the compile times of existing features. Thus newer releases of Blaze can also improve compile times.


Blaze does not provide feature XYZ. What can I do?

In some cases you might be able to implement the required functionality very conveniently by building on the existing map() functions. For instance, the following code demonstrates the addition of a function that merges two vectors of floating point type into a vector of complex numbers:

#include <complex>
#include <type_traits>

#include <blaze/Blaze.h>

template< typename VT1, typename VT2, bool TF >
decltype(auto) zip( const blaze::DenseVector<VT1,TF>& lhs, const blaze::DenseVector<VT2,TF>& rhs )
{
   return blaze::map( ~lhs, ~rhs, []( const auto& r, const auto& i ) {
      using ET1 = blaze::ElementType_t<VT1>;
      using ET2 = blaze::ElementType_t<VT2>;
      return std::complex<std::common_type_t<ET1,ET2>>( r, i );
   } );
}

You will find a summary of the necessary steps to create custom features in Customization.

Sometimes, however, the available customization points might not be sufficient. In this case you are cordially invited to create a pull request that provides the implementation of a feature or to create an issue according to our Issue Creation Guidelines. Please try to describe the feature as precisely as possible, for instance by providing conceptual code examples.


Previous: Intra-Statement Optimization ---- Next: Issue Creation Guidelines
