Is it possible to control padding and alignment of a Static/Dynamic Vector/Matrix?

Issue #134 resolved
Mikhail Katliar created an issue

Is it possible to create a Static/Dynamic Vector/Matrix (not a CustomMatrix or a CustomVector) with specified padding and alignment? The template parameters include the storage order, but not the padding and alignment flags.

Comments (8)

  1. Klaus Iglberger

    Hi Mikhail!

    Thanks for creating this proposal. Except for CustomVector and CustomMatrix, it is not possible to create vectors and matrices with individual alignment and padding settings. The rationale is that being able to individually specify alignment and padding for every vector and matrix can easily cause a decrease in performance. Therefore StaticVector, StaticMatrix, DynamicVector, and DynamicMatrix don't allow tampering with these settings. It is possible, however, to turn off padding altogether via the BLAZE_USE_PADDING switch (see the wiki for a detailed description).

    Could you please explain why you are interested in specifying these two settings individually? What is the problem you are trying to solve? Whereas we can understand the need to specify padding for each vector and matrix individually, we don't understand the need to specify the alignment. Are you interested in specifying an alignment flag (i.e. aligned, unaligned) or the actual alignment in bytes? Could you please elaborate on this?

    Best regards,

    Klaus!

  2. Mikhail Katliar reporter

    Hello Klaus!

    The reason I am asking is the following. When using CustomVector or CustomMatrix, I need to first allocate the memory and then create a CustomVector or CustomMatrix which refers to that memory. This means managing 2 objects when 1 object is enough. I can of course derive my own class from CustomVector or CustomMatrix which would do memory (de-)allocation, but I thought maybe there is already an existing solution in Blaze.

    Grüß,

    Mikhail

  3. Klaus Iglberger

    Hi Mikhail!

    Since there is no fitting vector or matrix type, the only option at this point is to extend CustomVector or CustomMatrix. Below you will find a subclass based on CustomMatrix, which should work well for most purposes:

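    #include <cstddef>
    #include <memory>
    #include <blaze/math/CustomMatrix.h>

    using blaze::CustomMatrix;
    using blaze::unaligned;
    using blaze::unpadded;
    using blaze::defaultStorageOrder;

    template< typename Type                    // Data type of the matrix
            , bool SO = defaultStorageOrder >  // Storage order
    class MyCustomMatrix
       : public CustomMatrix< Type, unaligned, unpadded, SO >
    {
     public:
       // Allocates m*n elements and binds the CustomMatrix base class to them
       explicit inline MyCustomMatrix( size_t m, size_t n )
          : CustomMatrix<Type,unaligned,unpadded,SO>()
          , array_( new Type[m*n] )
       {
          this->reset( array_.get(), m, n );
       }

     private:
       std::unique_ptr<Type[]> array_;  // Owns the memory the base class refers to
    };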
    

    I hope this helps,

    Best regards,

    Klaus!

  4. Mikhail Katliar reporter

    Thanks Klaus,

    I saw this example in the doc and this is exactly what I did.

    Grüß,

    Mikhail

  5. Chris Elrod

    Custom matrices, however, do not have sizes known at compile time, and aligned loads/stores aren’t any faster on recent architectures.

    For example, the link groups vmovaps/d and vmovups/d together all the way back with Sandy Bridge (page 193) for vector,memory moves.

    My particular use case is just a (jupyter) notebook benchmarking a few small matrix libraries:

    1. Eigen
    2. Blaze
    3. gfortran’s default matmul
    4. Intel MKL JIT
    5. My PaddedMatrices.jl (which supports padded and unpadded matrices, but pads by default like Blaze)

    For column major C = A * B, where A is Mx32 and B is 32xN, I am testing all combinations of (M=3,…,32)x(N=3,…,32), both with and without padding (i.e., for Eigen, gfortran, and Intel MKL JIT, I pad manually by just reporting a larger, padded number of rows).

    In each of these cases, I want the compiler and library to be able to take advantage of size information at compile time, meaning to be fair to Blaze, I have to use StaticMatrices.
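
    To illustrate what "size information at compile time" means here, the following is a minimal sketch of a fixed-size, column-major multiplication kernel. It is not taken from Blaze or any of the benchmarked libraries; the function name `matmul` and the flat `std::array` storage are assumptions made for illustration only. Because M, K, and N are template arguments, the loop bounds are constants the compiler can unroll and vectorize.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size kernel for illustration; not a library API.
// All matrices are stored column major: element (i,j) of an M x N matrix
// lives at index i + j*M.
template <std::size_t M, std::size_t K, std::size_t N>
std::array<double, M * N> matmul(const std::array<double, M * K>& A,
                                 const std::array<double, K * N>& B) {
    std::array<double, M * N> C{};  // value-initialized to zero
    for (std::size_t j = 0; j < N; ++j)          // column of C and B
        for (std::size_t k = 0; k < K; ++k)      // inner dimension
            for (std::size_t i = 0; i < M; ++i)  // row of C and A (contiguous)
                C[i + j * M] += A[i + k * M] * B[k + j * K];
    return C;
}

// Usage: the dimensions are passed as template arguments.
inline void matmulExample() {
    const std::array<double, 4> A{1.0, 3.0, 2.0, 4.0};  // [[1,2],[3,4]]
    const std::array<double, 4> B{5.0, 7.0, 6.0, 8.0};  // [[5,6],[7,8]]
    const auto C = matmul<2, 2, 2>(A, B);               // [[19,22],[43,50]]
    assert(C[0] == 19.0 && C[1] == 43.0 && C[2] == 22.0 && C[3] == 50.0);
}
```

    A view over externally allocated memory with run-time dimensions cannot offer the same constant loop bounds, which is the point being made about fixed-size types.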

    My workaround of course is to simply align the memory of all the arrays, so it isn’t a big deal (and I’m running the benchmarks now).

    I’m just making a case for why I think it’s a reasonable option in general (ie, [a] option for fixed size matrices in shared libraries and [b] no performance difference on recent architectures).

    EDIT:
    If you’re interested, here are the benchmarks:

    https://bayeswatch.org/2019/06/06/small-matrix-multiplication-performance-shootout/

    For column major C = A * B, where C is M x N, and A is M x 32, and on a system supporting avx512…

    PaddedMatrices.jl was faster than Blaze except when M was 8 (or padded to 8) and N <= 3. The advantage was especially large when matrices were not padded.

    When they were padded, Blaze was comparable to MKL JIT, edging it out slightly. Without padding, MKL JIT was faster.

    Eigen and gfortran were far behind.

  6. Klaus Iglberger

    Summary

    Commits 49cb718 and 2e8c7e6 enable the instance-specific alignment and padding configuration for the StaticVector and StaticMatrix class templates. The feature is immediately available via cloning the Blaze repository and will be officially released in Blaze 3.7.

    StaticVector

    The blaze::StaticVector class template is the representation of a fixed size vector with statically allocated elements of arbitrary type. It can be included via the header file

    #include <blaze/math/StaticVector.h>
    

    The type of the elements, the number of elements, the transpose flag, the alignment, and the padding of the vector can be specified via the five template parameters:

    template< typename Type, size_t N, bool TF, AlignmentFlag AF, PaddingFlag PF >
    class StaticVector;
    
    • Type : specifies the type of the vector elements. StaticVector can be used with any non-cv-qualified element type, including other vector or matrix types.
    • N : specifies the total number of vector elements. It is expected that StaticVector is only used for tiny and small vectors.
    • TF : specifies whether the vector is a row vector (blaze::rowVector) or a column vector (blaze::columnVector). The default value is blaze::columnVector.
    • AF : specifies whether the first element of the vector is properly aligned with respect to the available instruction set (SSE, AVX, ...). Possible values are blaze::aligned and blaze::unaligned. The default value is blaze::aligned.
    • PF : specifies whether the vector should be padded to maximize the efficiency of vectorized operations. Possible values are blaze::padded and blaze::unpadded. The default value is blaze::padded.

    The blaze::StaticVector is perfectly suited for small to medium vectors whose size is known at compile time:

    // Definition of a 3-dimensional integral column vector
    blaze::StaticVector<int,3UL> a;
    
    // Definition of a 4-dimensional single precision column vector
    blaze::StaticVector<float,4UL,blaze::columnVector> b;
    
    // Definition of an unaligned, unpadded 6-dimensional double precision row vector
    blaze::StaticVector<double,6UL,blaze::rowVector,blaze::unaligned,blaze::unpadded> c;
    

    Alignment

    In case AF is set to blaze::aligned, the elements of a StaticVector are possibly over-aligned to meet the alignment requirements of the available instruction set (SSE, AVX, AVX-512, ...). The alignment for fundamental types (short, int, float, double, ...) and complex types (complex<float>, complex<double>, ...) is 16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512. All other types are aligned according to their intrinsic alignment:

    struct Int { int i; };
    
    using VT1 = blaze::StaticVector<double,3UL>;
    using VT2 = blaze::StaticVector<complex<float>,2UL>;
    using VT3 = blaze::StaticVector<Int,5UL>;
    
    alignof( VT1 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( VT2 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( VT3 );  // Evaluates to 'alignof( Int )'
    

    Note that an aligned StaticVector instance may be bigger than the sum of its data elements:

    sizeof( VT1 );  // Evaluates to 32 for both SSE and AVX
    sizeof( VT2 );  // Evaluates to 16 for SSE and 32 for AVX
    sizeof( VT3 );  // Evaluates to 20; no special alignment requirements
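
    The relationship between over-alignment and instance size can be reproduced with plain C++, independent of Blaze. The following is a minimal sketch; the struct names are hypothetical stand-ins and not Blaze types, and the 32-byte alignment is chosen to mimic the AVX case.

```cpp
#include <cstddef>

// Hypothetical stand-ins for illustration only; these are not Blaze types.
// A 3-element double payload without over-alignment.
struct PlainVec3 {
    double v[3];
};

// The same payload over-aligned to a 32-byte (AVX-sized) boundary.
struct alignas(32) AlignedVec3 {
    double v[3];
};

static_assert(sizeof(PlainVec3) == 3 * sizeof(double), "no extra bytes");
static_assert(alignof(AlignedVec3) == 32, "over-aligned to 32 bytes");
// sizeof is always a multiple of alignof, so the 24-byte payload grows to 32.
static_assert(sizeof(AlignedVec3) == 32, "size rounded up to the alignment");
```

    The size increase follows from the language rule that the size of a type is a multiple of its alignment, so arrays of the type keep every element aligned.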
    

    Please note that for this reason an aligned StaticVector cannot be used in containers using dynamic memory such as std::vector without additionally providing an allocator that can provide over-aligned memory:

    using Type = blaze::StaticVector<double,3UL>;
    using Allocator = blaze::AlignedAllocator<Type>;
    
    std::vector<Type> v1;  // Might be misaligned for AVX or AVX-512
    std::vector<Type,Allocator> v2;  // Properly aligned for AVX or AVX-512
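
    For readers who want to see what such an allocator has to do, here is a minimal sketch using only the standard library. It is not the implementation of blaze::AlignedAllocator; the names `OverAlignedAllocator` and `AlignedVec3` are hypothetical, and the sketch assumes a C++17 compiler, where the aligned forms of operator new and operator delete are available. As far as we know, std::allocator itself also honors over-aligned types from C++17 on, so the misalignment concern mainly applies to older language standards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical over-aligned element type standing in for an aligned vector.
struct alignas(32) AlignedVec3 {
    double v[3];
};

// Minimal allocator sketch: forwards the element type's alignment to the
// aligned forms of operator new/delete (C++17).
template <typename T>
struct OverAlignedAllocator {
    using value_type = T;

    OverAlignedAllocator() = default;
    template <typename U>
    OverAlignedAllocator(const OverAlignedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{alignof(T)}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{alignof(T)});
    }
};

template <typename T, typename U>
bool operator==(const OverAlignedAllocator<T>&, const OverAlignedAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const OverAlignedAllocator<T>&, const OverAlignedAllocator<U>&) noexcept { return false; }

// Returns true if every element of the vector sits on a 32-byte boundary.
inline bool allElementsAligned() {
    std::vector<AlignedVec3, OverAlignedAllocator<AlignedVec3>> v(4);
    for (const auto& e : v) {
        if (reinterpret_cast<std::uintptr_t>(&e) % alignof(AlignedVec3) != 0) return false;
    }
    return true;
}

// Usage check.
inline void allocatorExample() { assert(allElementsAligned()); }
```

    Since the element size is a multiple of the alignment, aligning the start of the buffer is enough to align every element.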
    

    Padding

    Adding padding elements to the end of a StaticVector can have a significant impact on the performance. For instance, assuming that AVX is available, then two padded 3-dimensional vectors of double precision values can be added via a single SIMD addition operation:

    using blaze::StaticVector;
    using blaze::columnVector;
    using blaze::aligned;
    using blaze::unaligned;
    using blaze::padded;
    using blaze::unpadded;
    
    StaticVector<double,3UL,columnVector,aligned,padded> a1, b1, c1;
    StaticVector<double,3UL,columnVector,unaligned,unpadded> a2, b2, c2;
    
    // ... Initialization
    
    c1 = a1 + b1;  // AVX-based vector addition; maximum performance
    c2 = a2 + b2;  // Scalar vector addition; limited performance
    
    sizeof( a1 );  // Evaluates to 32 for SSE and AVX, and 64 for AVX-512
    sizeof( a2 );  // Evaluates to 24 for SSE, AVX, and AVX-512 (minimum size)
    

    Due to padding, the first addition will run at maximum performance. On the flip side, the size of each vector instance is increased due to the padding elements. The total size of an instance depends on the number of elements and width of the available instruction set (16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512).

    The second addition will be limited in performance: because the number of elements is not a multiple of the SIMD width, some of the elements need to be handled by scalar operations. However, the size of an unaligned, unpadded StaticVector instance is guaranteed to be the sum of its elements.

    Please also note that Blaze will zero initialize the padding elements in order to achieve maximum performance!
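
    The size figures quoted above are consistent with rounding the element count up to a multiple of the number of elements per SIMD register. The sketch below works through that arithmetic and shows one reason zeroed padding is convenient: a kernel can run over the whole padded capacity without changing the result. The helper names `paddedSize` and `paddedDot` are hypothetical, and the rounding rule is an assumption inferred from the numbers above, not a statement about Blaze internals.

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the next multiple of `lanes` (elements per SIMD register).
// Assumption: this mirrors how a padded vector chooses its capacity.
constexpr std::size_t paddedSize(std::size_t n, std::size_t lanes) {
    return ((n + lanes - 1) / lanes) * lanes;
}

// Doubles per register: 2 for SSE, 4 for AVX, 8 for AVX-512.
static_assert(paddedSize(3, 2) * sizeof(double) == 32, "SSE");
static_assert(paddedSize(3, 4) * sizeof(double) == 32, "AVX");
static_assert(paddedSize(3, 8) * sizeof(double) == 64, "AVX-512");

// Dot product over the full padded capacity, the way a vectorized kernel
// would traverse it. Zero padding contributes nothing to the sum.
inline double paddedDot(const double* a, const double* b, std::size_t capacity) {
    double sum = 0.0;
    for (std::size_t i = 0; i < capacity; ++i) sum += a[i] * b[i];
    return sum;
}

// Usage: three real elements plus one zeroed padding element (AVX layout).
inline void paddingExample() {
    const double a[4] = {1.0, 2.0, 3.0, 0.0};
    const double b[4] = {4.0, 5.0, 6.0, 0.0};
    assert(paddedDot(a, b, paddedSize(3, 4)) == 32.0);  // 1*4 + 2*5 + 3*6
}
```

    If the padding held arbitrary values instead of zeros, the loop would have to stop at the logical size or mask the last register.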

    StaticMatrix

    The blaze::StaticMatrix class template is the representation of a fixed size matrix with statically allocated elements of arbitrary type. It can be included via the header file

    #include <blaze/math/StaticMatrix.h>
    

    The type of the elements, the number of rows and columns, the storage order of the matrix, the alignment and the padding of the matrix can be specified via the six template parameters:

    template< typename Type, size_t M, size_t N, bool SO, AlignmentFlag AF, PaddingFlag PF >
    class StaticMatrix;
    
    • Type : specifies the type of the matrix elements. StaticMatrix can be used with any non-cv-qualified element type, including vector and other matrix types.
    • M : specifies the total number of rows of the matrix.
    • N : specifies the total number of columns of the matrix. Note that it is expected that StaticMatrix is only used for tiny and small matrices.
    • SO : specifies the storage order (blaze::rowMajor, blaze::columnMajor) of the matrix. The default value is blaze::rowMajor.
    • AF : specifies whether the first element of every row/column is properly aligned with respect to the available instruction set (SSE, AVX, ...). Possible values are blaze::aligned and blaze::unaligned. The default value is blaze::aligned.
    • PF : specifies whether every row/column of the matrix should be padded to maximize the efficiency of vectorized operations. Possible values are blaze::padded and blaze::unpadded. The default value is blaze::padded.

    The blaze::StaticMatrix is perfectly suited for small to medium matrices whose dimensions are known at compile time:

    // Definition of a 3x4 integral row-major matrix
    blaze::StaticMatrix<int,3UL,4UL> A;
    
    // Definition of a 4x6 single precision row-major matrix
    blaze::StaticMatrix<float,4UL,6UL,blaze::rowMajor> B;
    
    // Definition of an unaligned, unpadded 6x4 double precision column-major matrix
    blaze::StaticMatrix<double,6UL,4UL,blaze::columnMajor,blaze::unaligned,blaze::unpadded> C;
    

    Alignment

    In case AF is set to blaze::aligned, the elements of a StaticMatrix are possibly over-aligned to meet the alignment requirements of the available instruction set (SSE, AVX, AVX-512, ...). The alignment for fundamental types (short, int, float, double, ...) and complex types (complex<float>, complex<double>, ...) is 16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512. All other types are aligned according to their intrinsic alignment:

    struct Int { int i; };
    
    using MT1 = blaze::StaticMatrix<double,3UL,5UL>;
    using MT2 = blaze::StaticMatrix<complex<float>,2UL,3UL>;
    using MT3 = blaze::StaticMatrix<Int,5UL,4UL>;
    
    alignof( MT1 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( MT2 );  // Evaluates to 16 for SSE, 32 for AVX, and 64 for AVX-512
    alignof( MT3 );  // Evaluates to 'alignof( Int )'
    

    Note that an aligned StaticMatrix instance may be bigger than the sum of its data elements:

    sizeof( MT1 );  // Evaluates to 160 for SSE, and 192 for AVX and AVX-512
    sizeof( MT2 );  // Evaluates to 64 for SSE and AVX and 128 for AVX-512
    sizeof( MT3 );  // Evaluates to 80; no special alignment requirements
    

    Please note that for this reason an aligned StaticMatrix cannot be used in containers using dynamic memory such as std::vector without additionally providing an allocator that can provide over-aligned memory:

    using Type = blaze::StaticMatrix<double,3UL,5UL>;
    using Allocator = blaze::AlignedAllocator<Type>;
    
    std::vector<Type> v1;  // Might be misaligned for AVX or AVX-512
    std::vector<Type,Allocator> v2;  // Properly aligned for AVX or AVX-512
    

    Padding

    Adding padding elements to the end of every row or column of a StaticMatrix can have a significant impact on the performance. For instance, assuming that AVX is available, then two padded 3x3 matrices of double precision values can be added with three SIMD addition operations:

    using blaze::StaticMatrix;
    using blaze::rowMajor;
    using blaze::aligned;
    using blaze::unaligned;
    using blaze::padded;
    using blaze::unpadded;
    
    StaticMatrix<double,3UL,3UL,rowMajor,aligned,padded> A1, B1, C1;
    StaticMatrix<double,3UL,3UL,rowMajor,unaligned,unpadded> A2, B2, C2;
    
    // ... Initialization
    
    C1 = A1 + B1;  // AVX-based matrix addition; maximum performance
    C2 = A2 + B2;  // Scalar matrix addition; limited performance
    
    sizeof( A1 );  // Evaluates to 96 for SSE and AVX, and 192 for AVX-512
    sizeof( A2 );  // Evaluates to 72 for SSE, AVX, and AVX-512 (minimum size)
    

    Due to padding, the first addition will run at maximum performance. On the flip side, the size of each matrix instance is increased due to the padding elements. The total size of an instance depends on the number of elements and width of the available instruction set (16 bytes for SSE, 32 bytes for AVX, and 64 bytes for AVX-512).

    The second addition will be limited in performance: because the number of elements per row is not a multiple of the SIMD width, some of the elements need to be handled by scalar operations. However, the size of an unaligned, unpadded StaticMatrix instance is guaranteed to be the sum of its elements.

    Please also note that Blaze will zero initialize the padding elements in order to achieve maximum performance!
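
    For matrices the padding applies per row (or per column for column-major storage), so the instance size is the number of rows times the padded row length. The following sketch reproduces the 3x3 figures quoted above for the row-major case. The helper names are hypothetical, and the rounding rule is an assumption inferred from those figures rather than a description of Blaze internals.

```cpp
#include <cstddef>

// Rounds n up to the next multiple of `lanes` (elements per SIMD register).
constexpr std::size_t paddedSize(std::size_t n, std::size_t lanes) {
    return ((n + lanes - 1) / lanes) * lanes;
}

// Bytes occupied by a row-major rows x cols matrix of double when every row
// is padded to a multiple of `lanes`. Assumption: no other overhead.
constexpr std::size_t paddedMatrixBytes(std::size_t rows, std::size_t cols,
                                        std::size_t lanes) {
    return rows * paddedSize(cols, lanes) * sizeof(double);
}

// Linear offset of element (i,j) when consecutive rows are `spacing`
// elements apart.
constexpr std::size_t offset(std::size_t i, std::size_t j, std::size_t spacing) {
    return i * spacing + j;
}

// Doubles per register: 2 for SSE, 4 for AVX, 8 for AVX-512.
static_assert(paddedMatrixBytes(3, 3, 2) == 96, "SSE");
static_assert(paddedMatrixBytes(3, 3, 4) == 96, "AVX");
static_assert(paddedMatrixBytes(3, 3, 8) == 192, "AVX-512");
static_assert(paddedMatrixBytes(3, 3, 1) == 72, "unpadded: sum of the elements");

// With AVX padding, row 1 starts 4 elements (32 bytes) after row 0.
static_assert(offset(1, 0, paddedSize(3, 4)) == 4, "row start");
```

    Because each row starts at a multiple of the register width, an aligned first element implies that every row is aligned as well.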
