Blaze is not generating vectorized code on Arm

Issue #153 duplicate
Adel Ahmadyan created an issue

Hi,

I'm using Blaze for small matrix multiplications (multiplying 6x3 by 3x6 matrices of floats). However, the performance is lower than expected. I suspect the reason is that the generated code is not vectorized with NEON.

Here is an example:

blaze::DynamicMatrix<blaze::StaticMatrix<float, 6, 6>> dst;
blaze::DynamicVector<blaze::StaticMatrix<float, 6, 3>> a, b;

// initialization, resizing and some logic

for (...) { dst(i,j) += a[i] * trans(b[j]); }

  • Target is 64-bit ARM (iPhone, ...)
  • When I check for alignment, all matrices report that they are aligned (dst.isAligned(), a.isAligned(), b.isAligned()).
  • When I look at the assembly code, that line calls the addAssign function, and inside it there is unrolled ARM code, but no NEON instructions.
  • Blaze is linked against Accelerate, which includes BLAS and LAPACK.
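
For comparison, the update performed in the loop body above can be written as a plain C++ reference kernel. This is only a sketch: `Mat` and `addMulTrans` are hypothetical names and not part of Blaze, and the code is meant for validating results or for inspecting what the compiler generates on its own, not as a replacement for the Blaze expression.

```cpp
#include <cassert>
#include <cstddef>

// Minimal fixed-size, row-major matrix; a stand-in for
// blaze::StaticMatrix<float, R, C> (hypothetical type, not part of Blaze).
template <std::size_t R, std::size_t C>
struct Mat {
    float v[R][C] = {};  // zero-initialized storage

    float& operator()(std::size_t i, std::size_t j) {
        assert(i < R && j < C);  // bounds check, active in debug builds only
        return v[i][j];
    }
    const float& operator()(std::size_t i, std::size_t j) const {
        assert(i < R && j < C);
        return v[i][j];
    }
};

// Computes dst += a * trans(b) for 6x3 inputs, giving a 6x6 update.
// Hypothetical helper that mirrors the expression in the loop body.
inline void addMulTrans(Mat<6, 6>& dst, const Mat<6, 3>& a, const Mat<6, 3>& b) {
    for (std::size_t i = 0; i < 6; ++i) {
        for (std::size_t j = 0; j < 6; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < 3; ++k) {
                acc += a(i, k) * b(j, k);  // trans(b)(k, j) == b(j, k)
            }
            dst(i, j) += acc;
        }
    }
}
```

Calling `addMulTrans(dst, a, b)` on the same data as the Blaze expression should give matching results up to floating-point rounding, and the assembly of this function shows whether the compiler emits NEON instructions for the plain loops.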

Any help or suggestion is appreciated. Thanks, Adel

Comments (3)

  1. Klaus Iglberger

    Hi Adel!

    Thanks for creating this issue. This is not a bug, but a missing feature: Blaze currently does not support vectorisation on the ARM architecture. Issue #49 is devoted to adding support for ARM platforms.

    In order to make use of Accelerate for small matrices, you can manually reduce the BLAS thresholds in the <blaze/config/Thresholds.h> configuration file or define the corresponding symbol before the <blaze/Blaze.h> include. The following code snippet demonstrates this for the threshold for the row-major dense matrix multiplication:

    #define BLAZE_DMATDMATMULT_THRESHOLD 0UL
    #include <blaze/Blaze.h>
    

    Please set all necessary thresholds to 0 in order to unconditionally use BLAS kernels.

    Best regards,

    Klaus!
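
The advice to set all necessary thresholds to 0 might look like the sketch below for dense matrix/matrix multiplication. The three macro names beyond BLAZE_DMATDMATMULT_THRESHOLD are an assumption based on the naming used in <blaze/config/Thresholds.h> and should be checked against the header of the Blaze version in use. The include guard is only there so that the fragment compiles on machines without Blaze.

```cpp
// Assumption: these are the four dense matrix/matrix multiplication
// thresholds of <blaze/config/Thresholds.h>; verify the names in your version.
#define BLAZE_DMATDMATMULT_THRESHOLD   0UL  // row-major * row-major
#define BLAZE_DMATTDMATMULT_THRESHOLD  0UL  // row-major * column-major
#define BLAZE_TDMATDMATMULT_THRESHOLD  0UL  // column-major * row-major
#define BLAZE_TDMATTDMATMULT_THRESHOLD 0UL  // column-major * column-major

// The defines must appear before the Blaze include to take effect.
// The __has_include guard skips the include where Blaze is not installed.
#if defined(__has_include)
#  if __has_include(<blaze/Blaze.h>)
#    include <blaze/Blaze.h>
#  endif
#endif
```

Whether a given expression then actually reaches a BLAS kernel can depend on the operand types and the Blaze version, so it is worth confirming in the generated assembly or with a profiler.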
