Blaze is not generating vectorized code on Arm
Hi,
I'm using Blaze for small matrix multiplications (a 6x3 matrix times a 3x6 matrix, both of floats). However, the performance is lower than expected. I suspect the reason is that the generated code is not vectorized with NEON.
Here is an example:
blaze::DynamicMatrix<blaze::StaticMatrix<float, 6, 6>> dst;
blaze::DynamicVector<blaze::StaticMatrix<float, 6, 3>> a, b;
// initialization, resizing and some logic
for (...) {
    dst(i,j) += a[i] * trans(b[j]);
}
- Target is 64-bit ARM (iPhone, ...)
- When I check for alignment, all matrices report being aligned (dst.isAligned(), a.isAligned(), b.isAligned()).
- When I look at the assembly code, that line calls the addAssign function, and inside it there is unrolled ARM code, but no NEON instructions.
- Blaze is linked against Accelerate, which includes BLAS and LAPACK.
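For reference, the loop body above amounts to a 6x3 times 3x6 product accumulated into a 6x6 block. The following is a minimal sketch of that operation in plain C++ without Blaze, useful for comparing results or inspecting what the compiler generates for the same arithmetic. The type aliases and the function name addMulTrans are hypothetical stand-ins, and row-major storage is an assumption; whether the compiler emits NEON for it depends on the toolchain and flags.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for blaze::StaticMatrix<float,6,3> and
// blaze::StaticMatrix<float,6,6>, assumed to be stored row-major.
using Mat6x3 = std::array<float, 6 * 3>;
using Mat6x6 = std::array<float, 6 * 6>;

// Computes dst += a * trans(b), i.e.
// dst(r,c) += sum over k of a(r,k) * b(c,k).
inline void addMulTrans(Mat6x6& dst, const Mat6x3& a, const Mat6x3& b)
{
    for (std::size_t r = 0; r < 6; ++r) {
        for (std::size_t c = 0; c < 6; ++c) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < 3; ++k) {
                acc += a[r * 3 + k] * b[c * 3 + k];
            }
            dst[r * 6 + c] += acc;
        }
    }
}
```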
Any help or suggestion is appreciated.
Thanks,
Adel
Comments (3)
Hi Adel!
Thanks for creating this issue. This is not a bug, but a missing feature: Blaze currently does not support vectorisation on the ARM architecture. Issue #49 is devoted to adding support for ARM platforms.
In order to make use of Accelerate for small matrices, you can manually reduce the BLAS thresholds in the <blaze/config/Thresholds.h> configuration file or define the corresponding symbol before the <blaze/Blaze.h> include. The following code snippet demonstrates this for the threshold for the row-major dense matrix multiplication:
#define BLAZE_DMATDMATMULT_THRESHOLD 0UL
#include <blaze/Blaze.h>
Please set all necessary thresholds to 0 in order to unconditionally use BLAS kernels.
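As a sketch of what "all necessary thresholds" might look like for dense matrix/matrix products, the block below zeroes the four storage-order combinations. The macro names other than BLAZE_DMATDMATMULT_THRESHOLD are given as I understand them from <blaze/config/Thresholds.h> and should be checked against the installed version. The expression a[i] * trans(b[j]) multiplies a row-major matrix by a transposed one, so the row-major times column-major threshold is presumably the relevant one here. This assumes Blaze's BLAS mode is enabled in the build.

```cpp
// Sketch: zero the dense matrix/matrix multiplication thresholds so that
// the BLAS kernels are preferred regardless of matrix size.
// Names other than BLAZE_DMATDMATMULT_THRESHOLD are assumptions to verify
// against <blaze/config/Thresholds.h>.
#define BLAZE_DMATDMATMULT_THRESHOLD   0UL  // row-major * row-major
#define BLAZE_DMATTDMATMULT_THRESHOLD  0UL  // row-major * column-major
#define BLAZE_TDMATDMATMULT_THRESHOLD  0UL  // column-major * row-major
#define BLAZE_TDMATTDMATMULT_THRESHOLD 0UL  // column-major * column-major

// The defines must come before the Blaze include. The guard only keeps this
// sketch compilable where Blaze is not installed; in a real project the
// header would be included unconditionally.
#if defined(__has_include)
#  if __has_include(<blaze/Blaze.h>)
#    include <blaze/Blaze.h>
#  endif
#endif
```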
Best regards,
Klaus!
- changed status to duplicate
Duplicate of #49.