OpenMP for Nested Matrix

Issue #113 wontfix
Bryan Flynt created an issue

I've been looking for a matrix library with block capability for a long time, so I was happy to find Blaze. My simple test program (shown below) doesn't seem to speed up with OpenMP. Ideally, this scenario would perform the individual (10x10) block operations on a single thread and then use OpenMP over the rows of the outermost matrix. I see no difference with OpenMP on or off. I've also tried setting all the SMP thresholds to 0 and varying OMP_NUM_THREADS; the thread count is reported correctly in the output, but the execution speed does not change. Any ideas? Thanks

#include <iostream>
#include <blaze/Math.h>

using blaze::DynamicMatrix;
using blaze::StaticMatrix;
using blaze::DynamicVector;
using blaze::StaticVector;  // needed for the block vectors declared below
using blaze::rowMajor;
using blaze::columnVector;

int main() {
  std::cout << "Threads = " << blaze::getNumThreads() << std::endl;

  const int NROW = 3000;
  const int NCOL = 3000;

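  // Outer dynamic matrix whose elements are dense 10x10 blocks
  DynamicMatrix< StaticMatrix<double,10,10,rowMajor>, rowMajor > A;

  // Outer dynamic vectors whose elements are 10-element blocks
  DynamicVector< StaticVector<double,10,columnVector>, columnVector > x, y;

  // Resize
  A.resize(NROW,NCOL);
  x.resize(NCOL);
  y.resize(NROW);

  // Block matrix-vector product; this is the operation expected to run in parallel
  y = A * x;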
}

Comments (6)

  1. Klaus Iglberger

    Hi Bryan!

    Thanks for creating the issue. Please give us some time to investigate and to try to reproduce the issue. We will give you an analysis as quickly as possible.

    Best regards,

    Klaus!

  2. Klaus Iglberger

    Hi Bryan!

    As promised we have taken a close look at the issue and tried to reproduce the problem. We have used the code you posted in combination with the current Blaze version from the repository, both clang and gcc, and varying numbers of threads (1 to 4). From our point of view Blaze works flawlessly, i.e. we see an increasing performance with an increasing number of threads.

    However, we have an idea why you might not see a performance increase. A 3000x3000 matrix-vector multiplication with 10x10 block matrices is a memory-bound operation (i.e. the total amount of data is too large to fit in the caches). Some architectures, especially older ones, are able to exploit the maximum memory bandwidth with a single thread. Other architectures can only reach the maximum bandwidth with multiple threads. For instance, on an Intel Xeon E5-2650V3 a single thread cannot reach the maximum memory bandwidth; at least 4 threads are necessary. We assume that the architecture you are using may be able to reach the limit with a single thread. Using more threads would therefore not increase the performance.

    We hope that this explanation is helpful and that you still find Blaze to be useful for your purposes.

    Best regards,

    Klaus!
