Best implementation of parallelization in Blaze, and weird results with AVX/AVX2 compile flags

Issue #353 wontfix
Nafis Sadat created an issue

Hello, I have two issues to raise here (also, as a heads-up, I’m a very novice C++ user who only uses C++ together with R):

1) I was wondering if you guys have any documentation on what would be the best way to implement parallelization in Blaze?

Backstory: I am trying to rewrite some functions I previously wrote in Armadillo, and the math is basically full of tight loops nested in tight loops. Until now, I have placed an OpenMP pragma before the innermost for-loop, and that is how I parallelized over multiple threads in Armadillo (confirmed by watching htop and checking the runtime).

So I went ahead and replaced all of my Armadillo vectors and matrices with Blaze dynamic vectors and matrices respectively, keeping the OpenMP for-loop pragma as-is. Now, if I specify more than one thread in the OMP_NUM_THREADS environment variable, the parallelization is supposed to kick in, but with the Blaze code my program unfortunately segfaults.

Using a single thread, the Blaze version runs slower than the Armadillo code (albeit with equivalent results). So, what would you recommend as the most effective way to implement parallelization in Blaze when my functions contain a lot of tight loops?

2) I saw that Blaze has built-in optimizations for CPUs that support AVX and AVX2 (and my CPU does have those in its instruction set). So I compiled my C++ code with the flags “-mavx -mavx2” (alongside other flags I was already using, like SSE 4.1), and it turns out that if I call the functions from the AVX-compiled shared object, I get weird random results (e.g., if I’m outputting a dynamic vector, a random element of the vector will have a value around 10^200).

Initially I was getting some overflows in a base C++ function I had, and switching from ‘int’ to ‘long int’ fixed that. Here, all of the Blaze matrices I’m using are dynamic matrices of type double. Do you have any tips on what to look out for?

Thanks!

Nafis

Comments (6)

  1. Klaus Iglberger

    Hi Nafis!

    1. Please refer to the Limitations of the OpenMP parallelization section in the wiki. In short, it is not possible to use OpenMP threads both outside and inside the Blaze library. You’ll have to decide at which level you want to apply the parallelization.

    2. Due to the misuse of OpenMP in combination with Blaze you’ve entered the realm of undefined behavior. Anything could happen, including very large values.

    Best regards,

    Klaus!

  2. Nafis Sadat reporter

    Thanks @Klaus Iglberger! Do you have any suggestions on how I can leverage BLAZE_NUM_THREADS for parallelization?

  3. Klaus Iglberger

    You’ll have to benchmark to see how many threads give you the best performance for your particular problem. For small problems the parallelization doesn’t pay off, but above a certain problem size the gain from parallel execution outweighs the overhead of creating threads.

    Please also see the <blaze/config/Thresholds.h> header file for all available thresholds (search for “SMP Thresholds”). These configure the problem size above which parallelization takes place. Feel free to experiment to find the perfect combination for your problem.
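    Since the values in <blaze/config/Thresholds.h> are guarded by #ifndef, they can (to my understanding) also be overridden from the compiler command line instead of editing the header. A hedged example, assuming g++ and the C++11-threads backend; the file name and threshold value are purely illustrative:

```shell
# Sketch: override an SMP threshold at compile time.
# BLAZE_SMP_DMATDMATADD_THRESHOLD governs how large a dense matrix addition
# must be before Blaze parallelizes it (see <blaze/config/Thresholds.h>).
g++ -std=c++14 -O3 -DBLAZE_USE_CPP_THREADS \
    -DBLAZE_SMP_DMATDMATADD_THRESHOLD=10000UL \
    mycode.cpp -o mycode -pthread
```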

  4. Nafis Sadat reporter

    Hey @Klaus Iglberger, I’m finally getting around to experimenting with Blaze threading. I removed all the OpenMP references from my Blaze code, so now all my functions are purely based on iterating over dynamic matrices and vectors. I use the flag -DBLAZE_USE_CPP_THREADS to compile my Blaze C++ file (I can’t use Boost threads because I’m using the RcppBlaze3 wrapper, and I think there’s something inherently odd about R not working properly with Boost threads). I also had to use C++14 instead of C++11, because otherwise I get a bunch of compile errors about the ‘auto’ type and other incompatibilities.

    However, htop shows that regardless of the value of the BLAZE_NUM_THREADS environment variable, I’m still only seeing single-threaded operation. Is there anything else we have to do to invoke the parallelization?

  5. Klaus Iglberger

    Hi Nafis!

    First, please make sure to explicitly set the maximum number of threads (see the wiki). Second, you’ll need to run an operation that is big enough to benefit from parallelization (see the <blaze/config/Thresholds.h> header file for all SMP thresholds). Parallelizing small operations would decrease performance due to the thread setup overhead. For instance, the following matrix addition is executed by four threads:

    blaze::setNumThreads( 4 );
    
    blaze::DynamicMatrix<double> A( 200UL, 200UL, 1.0 );
    blaze::DynamicMatrix<double> B( 200UL, 200UL, 2.0 );
    blaze::DynamicMatrix<double> C( A + B );  // Matrix addition of two 200x200 matrices (see BLAZE_SMP_DMATDMATADD_THRESHOLD)
    

    Best regards,

    Klaus!

  6. Nafis Sadat reporter

    Ah, got it. I wasn’t sure whether I had to set anything else besides the setNumThreads call in the C++ code. I’m going to run that simple snippet you pasted just to check whether multiple threads are picked up.

    As long as I have one of the two threading flags in my compilation, we should be able to pick up the parallelism, right?
