Is it possible to control padding and alignment of a Static/Dynamic Vector/Matrix?
Is it possible to create a Static/Dynamic Vector/Matrix (not a CustomMatrix or a CustomVector) with specified padding and alignment? The template parameters include the storage order, but not the padding and alignment flags.
Comments (8)
-
-
reporter Hello Klaus!
The reason I am asking is the following. When using CustomVector or CustomMatrix, I need to first allocate the memory and then create a CustomVector or CustomMatrix which refers to that memory. This means managing 2 objects when 1 object is enough. I can of course derive my own class from CustomVector or CustomMatrix which would do memory (de-)allocation, but I thought maybe there is already an existing solution in Blaze.
Grüß,
Mikhail
-
reporter - edited description
-
Hi Mikhail!
Since there is no fitting vector or matrix type, the only option at this point is to extend CustomVector or CustomMatrix. Below you will find a subclass based on CustomMatrix, which should work well for most purposes:

    template< typename Type                    // Data type of the matrix
            , bool SO = defaultStorageOrder >  // Storage order
    class MyCustomMatrix
       : public CustomMatrix< Type, unaligned, unpadded, SO >
    {
     public:
       explicit inline MyCustomMatrix( size_t m, size_t n )
          : CustomMatrix<Type,unaligned,unpadded,SO>()
          , array_( new Type[m*n] )
       {
          this->reset( array_.get(), m, n );
       }

     private:
       std::unique_ptr<Type[]> array_;
    };

I hope this helps,
Best regards,
Klaus!
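
For readers without Blaze at hand, the ownership pattern in the subclass above can be sketched in standalone C++. This is only an illustration under stated assumptions: MatrixView is a hypothetical stand-in for a non-owning view such as CustomMatrix, OwningMatrix is a hypothetical name, and neither reproduces Blaze's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a non-owning, row-major matrix view.
template <typename T>
class MatrixView {
public:
    MatrixView() = default;

    // Rebind the view to external memory holding m*n elements.
    void reset(T* data, std::size_t m, std::size_t n) {
        data_ = data;
        rows_ = m;
        cols_ = n;
    }

    T& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }

    std::size_t rows() const { return rows_; }
    std::size_t columns() const { return cols_; }

private:
    T* data_ = nullptr;
    std::size_t rows_ = 0;
    std::size_t cols_ = 0;
};

// Owning wrapper: allocates the buffer and binds the base-class view to it,
// so callers manage one object instead of a buffer plus a view.
template <typename T>
class OwningMatrix : public MatrixView<T> {
public:
    OwningMatrix(std::size_t m, std::size_t n)
        : array_(new T[m * n]()) {  // value-initialized: zero for arithmetic T
        this->reset(array_.get(), m, n);
    }

private:
    std::unique_ptr<T[]> array_;  // frees the buffer when the matrix dies
};
```

Because the buffer is held in a std::unique_ptr, the wrapper is implicitly non-copyable; a copy constructor would have to allocate a new buffer and rebind the view.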
-
reporter Thanks Klaus,
I saw this example in the doc and this is exactly what I did.
Grüß,
Mikhail
-
Custom matrices, however, do not have sizes known at compile time, and aligned loads/stores aren’t any faster on recent architectures.
For example, the link groups vmovaps/d and vmovups/d together all the way back to Sandy Bridge (page 193) for vector,memory moves.
My particular use case is just a (jupyter) notebook benchmarking a few small matrix libraries:
- Eigen
- Blaze
- gfortran’s default matmul
- Intel MKL JIT
- My PaddedMatrices.jl (which supports padded and unpadded matrices, but pads by default like Blaze)
For column major C = A * B, where A is Mx32 and B is 32xN, I am testing all combinations of (M=3,…,32)x(N=3,…,32), both with and without padding (i.e., for Eigen, gfortran, and Intel MKL JIT, I pad manually by just reporting the larger, padded number of rows).
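
Padding manually by reporting a larger number of rows amounts to rounding the leading dimension of a column-major array up to a full SIMD register. A minimal sketch of that arithmetic follows; roundUp, paddedRows, and colMajorIndex are hypothetical helper names, and the element type is assumed to be an 8-byte double.

```cpp
#include <cstddef>

// Round n up to the next multiple of width (width must be > 0).
constexpr std::size_t roundUp(std::size_t n, std::size_t width) {
    return ((n + width - 1) / width) * width;
}

// Leading dimension of a column-major matrix of doubles with m logical rows,
// padded so that every column fills whole registers of simdBytes bytes.
constexpr std::size_t paddedRows(std::size_t m, std::size_t simdBytes) {
    return roundUp(m, simdBytes / sizeof(double));
}

// Linear index of element (i, j) in column-major storage with leading
// dimension ld; with padding, ld is paddedRows(m, ...) instead of m.
constexpr std::size_t colMajorIndex(std::size_t i, std::size_t j,
                                    std::size_t ld) {
    return i + j * ld;
}
```

With 64-byte (AVX-512) registers, a matrix with 3 rows is stored with a leading dimension of 8, while one with 32 rows needs no padding at all.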
In each of these cases, I want the compiler and library to be able to take advantage of size information at compile time, meaning to be fair to Blaze, I have to use StaticMatrices.
My workaround of course is to simply align the memory of all the arrays, so it isn’t a big deal (and I’m running the benchmarks now).
I’m just making a case for why I think it’s a reasonable option in general (ie, [a] option for fixed size matrices in shared libraries and [b] no performance difference on recent architectures).
EDIT:
If you’re interested, here are the benchmarks: https://bayeswatch.org/2019/06/06/small-matrix-multiplication-performance-shootout/
For column major C = A * B, where C is M x N, and A is M x 32, and on a system supporting avx512…
PaddedMatrices.jl was faster than Blaze except when M was 8 (or padded to 8) and N <= 3. The advantage was especially large when matrices were not padded.
When they were padded, Blaze was comparable to MKL JIT, edging it out slightly. Without padding, MKL JIT was faster.
Eigen and gfortran were far behind.
-
- changed status to open
-
- changed status to resolved
Summary
Commits 49cb718 and 2e8c7e6 enable the instance-specific alignment and padding configuration for the StaticVector and StaticMatrix class templates. The feature is immediately available via cloning the Blaze repository and will be officially released in Blaze 3.7.
StaticVector
The blaze::StaticVector class template is the representation of a fixed size vector with statically allocated elements of arbitrary type. It can be included via the header file

    #include <blaze/math/StaticVector.h>

The type of the elements, the number of elements, the transpose flag, the alignment, and the padding of the vector can be specified via the five template parameters:

    template< typename Type, size_t N, bool TF, AlignmentFlag AF, PaddingFlag PF >
    class StaticVector;

- Type: specifies the type of the vector elements. StaticVector can be used with any non-cv-qualified element type, including other vector or matrix types.
- N: specifies the total number of vector elements. It is expected that StaticVector is only used for tiny and small vectors.
- TF: specifies whether the vector is a row vector (blaze::rowVector) or a column vector (blaze::columnVector). The default value is blaze::columnVector.
- AF: specifies whether the first element of the vector is properly aligned with respect to the available instruction set (SSE, AVX, ...). Possible values are blaze::aligned and blaze::unaligned. The default value is blaze::aligned.
- PF: specifies whether the vector should be padded to maximize the efficiency of vectorized operations. Possible values are blaze::padded and blaze::unpadded. The default value is blaze::padded.
The blaze::StaticVector is perfectly suited for small to medium vectors whose size is known at compile time:

    // Definition of a 3-dimensional integral column vector
    blaze::StaticVector<int,3UL> a;

    // Definition of a 4-dimensional single precision column vector
    blaze::StaticVector<float,4UL,blaze::columnVector> b;

    // Definition of an unaligned, unpadded 6-dimensional double precision row vector
    blaze::StaticVector<double,6UL,blaze::rowVector,blaze::unaligned,blaze::unpadded> c;

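
How two extra flags can steer the storage of a fixed size vector can be sketched in standalone C++. The sketch below is not Blaze's implementation: TinyVector, kSimdBytes, and roundUp are hypothetical names, the flag enums are redefined locally for illustration, and the register width is fixed at 32 bytes (AVX) as an assumption.

```cpp
#include <cstddef>

// Locally defined flags for illustration only.
enum AlignmentFlag : bool { unaligned = false, aligned = true };
enum PaddingFlag : bool { unpadded = false, padded = true };

constexpr std::size_t kSimdBytes = 32;  // assumption: AVX-sized registers

constexpr std::size_t roundUp(std::size_t n, std::size_t width) {
    return ((n + width - 1) / width) * width;
}

template <typename T, std::size_t N,
          AlignmentFlag AF = aligned, PaddingFlag PF = padded>
struct TinyVector {
    // Stored elements: N, or N rounded up to whole SIMD registers.
    static constexpr std::size_t capacity =
        (PF == padded) ? roundUp(N, kSimdBytes / sizeof(T)) : N;

    // Over-align only on request and never below the element's own alignment.
    static constexpr std::size_t alignment =
        (AF == aligned && kSimdBytes > alignof(T)) ? kSimdBytes : alignof(T);

    alignas(alignment) T data[capacity] = {};  // padding elements start at zero
};
```

Under these assumptions, TinyVector<double,3> occupies 32 bytes, while TinyVector<double,3,unaligned,unpadded> occupies exactly 3 * sizeof(double) bytes.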
Alignment
In case AF is set to blaze::aligned, the elements of a StaticVector are possibly over-aligned to meet the alignment requirements of the available instruction set (SSE, AVX, AVX-512, ...). The alignment for fundamental types (short, int, float, double, ...) and complex types (complex<float>, complex<double>, ...) is 16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512. All other types are aligned according to their intrinsic alignment:

    struct Int { int i; };

    using VT1 = blaze::StaticVector<double,3UL>;
    using VT2 = blaze::StaticVector<complex<float>,2UL>;
    using VT3 = blaze::StaticVector<Int,5UL>;

    alignof( VT1 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( VT2 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( VT3 );  // Evaluates to 'alignof( Int )'

Note that an aligned StaticVector instance may be bigger than the sum of its data elements:

    sizeof( VT1 );  // Evaluates to 32 for both SSE and AVX
    sizeof( VT2 );  // Evaluates to 16 for SSE and 32 for AVX
    sizeof( VT3 );  // Evaluates to 20; no special alignment requirements

Please note that for this reason an aligned StaticVector cannot be used in containers using dynamic memory such as std::vector without additionally providing an allocator that can provide over-aligned memory:

    using Type = blaze::StaticVector<double,3UL>;
    using Allocator = blaze::AlignedAllocator<Type>;

    std::vector<Type> v1;            // Might be misaligned for AVX or AVX-512
    std::vector<Type,Allocator> v2;  // Properly aligned for AVX or AVX-512

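
What such an allocator has to do can be sketched in standalone C++. The following is not blaze::AlignedAllocator: RoundUpAllocator, Vec3, and allAligned are hypothetical names, and the technique shown (over-allocate with std::malloc, round the address up, remember the raw pointer in front of the payload) is just one simple way to obtain over-aligned memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>

template <typename T>
struct RoundUpAllocator {
    using value_type = T;

    RoundUpAllocator() = default;
    template <typename U>
    RoundUpAllocator(const RoundUpAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // The slot in front of the payload stores the raw pointer, so the
        // alignment must also be sufficient for a void*.
        const std::size_t align =
            alignof(T) > alignof(void*) ? alignof(T) : alignof(void*);
        void* raw = std::malloc(n * sizeof(T) + align + sizeof(void*));
        if (raw == nullptr) throw std::bad_alloc();

        const std::uintptr_t start =
            reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        const std::uintptr_t addr = (start + align - 1) / align * align;
        assert(addr % alignof(T) == 0);

        reinterpret_cast<void**>(addr)[-1] = raw;  // remember what to free
        return reinterpret_cast<T*>(addr);
    }

    void deallocate(T* p, std::size_t) noexcept {
        std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U>
bool operator==(const RoundUpAllocator<T>&, const RoundUpAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const RoundUpAllocator<T>&, const RoundUpAllocator<U>&) {
    return false;
}

// Stand-in for an over-aligned fixed size vector (32 bytes, as for AVX).
struct alignas(32) Vec3 { double v[3]; };

using AlignedVec3s = std::vector<Vec3, RoundUpAllocator<Vec3>>;

// True if every element starts on an alignof(Vec3) boundary.
inline bool allAligned(const AlignedVec3s& c) {
    for (const Vec3& x : c)
        if (reinterpret_cast<std::uintptr_t>(&x) % alignof(Vec3) != 0)
            return false;
    return true;
}
```

Since sizeof(Vec3) is a multiple of its alignment, aligning the start of the buffer is enough to align every element in it.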
Padding
Adding padding elements to the end of a StaticVector can have a significant impact on the performance. For instance, assuming that AVX is available, then two padded 3-dimensional vectors of double precision values can be added via a single SIMD addition operation:

    using blaze::StaticVector;
    using blaze::columnVector;
    using blaze::aligned;
    using blaze::unaligned;
    using blaze::padded;
    using blaze::unpadded;

    StaticVector<double,3UL,columnVector,aligned,padded> a1, b1, c1;
    StaticVector<double,3UL,columnVector,unaligned,unpadded> a2, b2, c2;

    // ... Initialization

    c1 = a1 + b1;  // AVX-based vector addition; maximum performance
    c2 = a2 + b2;  // Scalar vector addition; limited performance

    sizeof( a1 );  // Evaluates to 32 for SSE and AVX, and 64 for AVX-512
    sizeof( a2 );  // Evaluates to 24 for SSE, AVX, and AVX-512 (minimum size)

Due to padding, the first addition will run at maximum performance. On the flip side, the size of each vector instance is increased due to the padding elements. The total size of an instance depends on the number of elements and width of the available instruction set (16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512).
The second addition will be limited in performance because, due to the number of elements, some of them need to be handled in a scalar operation. However, the size of an unaligned, unpadded StaticVector instance is guaranteed to be the sum of its elements.
Please also note that Blaze will zero initialize the padding elements in order to achieve maximum performance!
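
Why zeroed padding makes full-width operations safe can be illustrated with a small standalone sketch. PaddedVec3, kLanes, and add are hypothetical names, and the lane count of 4 assumes AVX with double precision; whether a compiler actually emits a single SIMD addition for the loop is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumption: one AVX register holds 4 doubles

// Three logical elements stored in four slots; the extra slot is zero padding.
struct alignas(kLanes * sizeof(double)) PaddedVec3 {
    double v[kLanes] = {0.0, 0.0, 0.0, 0.0};
};

// The loop bound is the padded capacity, so there is no scalar remainder
// loop; the trip count matches exactly one 4-wide register.
inline PaddedVec3 add(const PaddedVec3& a, const PaddedVec3& b) {
    PaddedVec3 c;
    for (std::size_t i = 0; i < kLanes; ++i)
        c.v[i] = a.v[i] + b.v[i];
    assert(c.v[kLanes - 1] == 0.0);  // padding stays zero because 0 + 0 == 0
    return c;
}
```

The padding slot takes part in the addition like any other element, which only yields a valid result because it holds zero in both operands.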
StaticMatrix
The blaze::StaticMatrix class template is the representation of a fixed size matrix with statically allocated elements of arbitrary type. It can be included via the header file

    #include <blaze/math/StaticMatrix.h>

The type of the elements, the number of rows and columns, the storage order of the matrix, the alignment and the padding of the matrix can be specified via the six template parameters:

    template< typename Type, size_t M, size_t N, bool SO, AlignmentFlag AF, PaddingFlag PF >
    class StaticMatrix;

- Type: specifies the type of the matrix elements. StaticMatrix can be used with any non-cv-qualified element type, including vector and other matrix types.
- M: specifies the total number of rows of the matrix.
- N: specifies the total number of columns of the matrix. Note that it is expected that StaticMatrix is only used for tiny and small matrices.
- SO: specifies the storage order (blaze::rowMajor, blaze::columnMajor) of the matrix. The default value is blaze::rowMajor.
- AF: specifies whether the first element of every row/column is properly aligned with respect to the available instruction set (SSE, AVX, ...). Possible values are blaze::aligned and blaze::unaligned. The default value is blaze::aligned.
- PF: specifies whether every row/column of the matrix should be padded to maximize the efficiency of vectorized operations. Possible values are blaze::padded and blaze::unpadded. The default value is blaze::padded.
The blaze::StaticMatrix is perfectly suited for small to medium matrices whose dimensions are known at compile time:

    // Definition of a 3x4 integral row-major matrix
    blaze::StaticMatrix<int,3UL,4UL> A;

    // Definition of a 4x6 single precision row-major matrix
    blaze::StaticMatrix<float,4UL,6UL,blaze::rowMajor> B;

    // Definition of an unaligned, unpadded 6x4 double precision column-major matrix
    blaze::StaticMatrix<double,6UL,4UL,blaze::columnMajor,blaze::unaligned,blaze::unpadded> C;

Alignment
In case AF is set to blaze::aligned, the elements of a StaticMatrix are possibly over-aligned to meet the alignment requirements of the available instruction set (SSE, AVX, AVX-512, ...). The alignment for fundamental types (short, int, float, double, ...) and complex types (complex<float>, complex<double>, ...) is 16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512. All other types are aligned according to their intrinsic alignment:

    struct Int { int i; };

    using MT1 = blaze::StaticMatrix<double,3UL,5UL>;
    using MT2 = blaze::StaticMatrix<complex<float>,2UL,3UL>;
    using MT3 = blaze::StaticMatrix<Int,5UL,4UL>;

    alignof( MT1 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( MT2 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( MT3 );  // Evaluates to 'alignof( Int )'

Note that an aligned StaticMatrix instance may be bigger than the sum of its data elements:

    sizeof( MT1 );  // Evaluates to 160 for SSE, and 192 for AVX and AVX-512
    sizeof( MT2 );  // Evaluates to 64 for SSE and AVX and 128 for AVX-512
    sizeof( MT3 );  // Evaluates to 80; no special alignment requirements

Please note that for this reason an aligned StaticMatrix cannot be used in containers using dynamic memory such as std::vector without additionally providing an allocator that can provide over-aligned memory:

    using Type = blaze::StaticMatrix<double,3UL,5UL>;
    using Allocator = blaze::AlignedAllocator<Type>;

    std::vector<Type> v1;            // Might be misaligned for AVX or AVX-512
    std::vector<Type,Allocator> v2;  // Properly aligned for AVX or AVX-512

Padding
Adding padding elements to the end of every row or column of a StaticMatrix can have a significant impact on the performance. For instance, assuming that AVX is available, then two padded 3x3 matrices of double precision values can be added with three SIMD addition operations:

    using blaze::StaticMatrix;
    using blaze::rowMajor;
    using blaze::aligned;
    using blaze::unaligned;
    using blaze::padded;
    using blaze::unpadded;

    StaticMatrix<double,3UL,3UL,rowMajor,aligned,padded> A1, B1, C1;
    StaticMatrix<double,3UL,3UL,rowMajor,unaligned,unpadded> A2, B2, C2;

    // ... Initialization

    C1 = A1 + B1;  // AVX-based matrix addition; maximum performance
    C2 = A2 + B2;  // Scalar matrix addition; limited performance

    sizeof( A1 );  // Evaluates to 96 for SSE and AVX, and 192 for AVX-512
    sizeof( A2 );  // Evaluates to 72 for SSE, AVX, and AVX-512 (minimum size)

Due to padding, the first addition will run at maximum performance. On the flip side, the size of each matrix instance is increased due to the padding elements. The total size of an instance depends on the number of elements and width of the available instruction set (16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512).
The second addition will be limited in performance because, due to the number of elements, some of them need to be handled in a scalar operation. However, the size of an unaligned, unpadded StaticMatrix instance is guaranteed to be the sum of its elements.
Please also note that Blaze will zero initialize the padding elements in order to achieve maximum performance!
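
The sizes quoted for the 3x3 example follow from padding every row to whole SIMD registers. A minimal sketch of that arithmetic is given below; roundUp, rowSpacing, and storageBytes are hypothetical helper names rather than Blaze functions, and any additional tail padding caused by alignment is ignored.

```cpp
#include <cstddef>

// Round n up to the next multiple of width (width must be > 0).
constexpr std::size_t roundUp(std::size_t n, std::size_t width) {
    return ((n + width - 1) / width) * width;
}

// Number of stored elements per row of a row-major matrix with n columns.
template <typename T>
constexpr std::size_t rowSpacing(std::size_t n, std::size_t simdBytes,
                                 bool padded) {
    return padded ? roundUp(n, simdBytes / sizeof(T)) : n;
}

// Payload size in bytes of an m x n row-major matrix.
template <typename T>
constexpr std::size_t storageBytes(std::size_t m, std::size_t n,
                                   std::size_t simdBytes, bool padded) {
    return m * rowSpacing<T>(n, simdBytes, padded) * sizeof(T);
}
```

For 8-byte doubles, storageBytes<double>(3, 3, 32, true) gives 96 and storageBytes<double>(3, 3, 64, true) gives 192, while the unpadded case gives 72 regardless of the register width.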
Hi Mikhail!
Thanks for creating this proposal. Except for CustomVector and CustomMatrix it is not possible to create vectors and matrices with individual alignment and padding settings. The rationale is that being able to individually specify alignment and padding for every vector and matrix can easily cause a decrease in performance. Therefore StaticVector, StaticMatrix, DynamicVector, and DynamicMatrix don't allow tampering with these settings. It is possible, however, to turn off padding altogether via the BLAZE_USE_PADDING switch (see the wiki for a detailed description).
Could you please explain why you are interested in specifying these two settings individually? What is the problem you are trying to solve? Whereas we can understand the need to specify padding for each vector and matrix individually, we don't understand the need to specify the alignment. Are you interested in specifying an alignment flag (i.e. aligned, unaligned) or the actual alignment in bytes? Could you please elaborate on this?
Best regards,
Klaus!