Blaze is not generating vectorized code on Arm

Hi,

I'm using Blaze for small matrix multiplications (a 6x3 matrix times a 3x6 matrix of floats). However, the performance is lower than expected. I suspect the reason is that the generated code does not use NEON instructions.

Here is an example:

blaze::DynamicMatrix<blaze::StaticMatrix<float, 6, 6>> dst;
blaze::DynamicVector<blaze::StaticMatrix<float, 6, 3>> a, b;

// initialization, resizing and some logic

for (...) {
    dst(i,j) += a[i] * trans(b[j]);
}
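For reference, here is a minimal plain C++ sketch of what a single iteration of that loop computes, in case it helps to compare the generated assembly against a hand-written baseline. It does not use Blaze at all; the names Mat6x3, Mat6x6 and addMulTrans are hypothetical, and row-major storage is an assumption.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for the 6x3 and 6x6 float blocks (row-major assumed).
using Mat6x3 = std::array<float, 6 * 3>;
using Mat6x6 = std::array<float, 6 * 6>;

// Computes dst += a * trans(b), where a and b are both 6x3.
// Entry (r, c) of a * trans(b) is the dot product of row r of a and row c of b.
inline void addMulTrans(Mat6x6& dst, const Mat6x3& a, const Mat6x3& b)
{
    for (std::size_t r = 0; r < 6; ++r) {
        for (std::size_t c = 0; c < 6; ++c) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 3; ++k) {
                sum += a[r * 3 + k] * b[c * 3 + k];
            }
            dst[r * 6 + c] += sum;
        }
    }
}
```

Compiling this with the same flags as the Blaze code and looking at its assembly might show whether the compiler vectorizes such a small kernel on its own for this target.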

The target is 64-bit ARM (iPhone, ...).
When I check for alignment, all matrices report that they are aligned (dst.isAligned(), a.isAligned(), b.isAligned()).
When I look at the assembly code, that line calls an addAssign function, which contains unrolled ARM code but no NEON instructions.
Blaze is linked against Accelerate, which includes BLAS and LAPACK.
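One thing that can be checked independently of Blaze is whether the compiler itself has NEON enabled for this translation unit. The sketch below relies on the __ARM_NEON macro from the Arm C Language Extensions (older toolchains may define __ARM_NEON__ instead); the function names are hypothetical. It only reports the compiler setting and says nothing about whether Blaze ships NEON kernels.

```cpp
#include <string>

// Returns true if the compiler advertises NEON (Advanced SIMD) support
// for this translation unit via the ACLE feature macro.
constexpr bool neonEnabledAtCompileTime()
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper that turns the check into a printable message.
inline std::string simdReport()
{
    return neonEnabledAtCompileTime() ? "NEON enabled" : "NEON not enabled";
}
```

Printing simdReport() from the iPhone build would confirm the compiler side. On 64-bit ARM targets the macro is normally defined by default, so a "NEON enabled" result would suggest the missing vectorization is decided inside the library rather than by the compiler flags.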

Any help or suggestions would be appreciated.

Thanks,
Adel
