polling thread for progress in OpenMP programs

Issue #19 resolved
BrianS created an issue

I was just prototyping what progress might look like in an OpenMP program. In a fork-join programming model it is hard to think about attentiveness. While not the prettiest, this would at least work and get us off the ground.

#include <iostream>
#include <cmath>
#include <cstdint>
#include <omp.h>
#include <atomic>

std::ostream flusher(nullptr); // null-sink stream: output is discarded
void progress(int& a_counter)
{
    int64_t x = 1250;
    for (int i=0; i<x; i++)
    { flusher<<"."; } // pointless busy work in place of upcxx progress
    a_counter++; // counter for purpose of demo code.
}

int main(int argc, char* argv[])
{
  int counter=0; // counter doesn't need to be atomic: only the polling thread writes it
  std::atomic<bool> poll(true);
  double red=0;
#pragma omp parallel
  {
   #pragma omp single nowait // one thread peels off to poll; nowait lets the rest fall through to the loop
   {
    std::cout<<"thread "<<omp_get_thread_num()<<" polling..."<<std::endl;
    while(poll.load()) progress(counter);
   }
   #pragma omp for reduction(+:red) nowait
   for(int i=0; i<150000; i++) // made up compute job
   {
     double x=sin(2*M_PI*i/150.0)*cos(M_PI*i/150.0)+cos(150./(M_PI*i));
     red+=x*x;
   }
   #pragma omp single // first thread to arrive releases the poller
   {
     poll.store(false);
   }
  }
  std::cout<<"Counter triggered "<<counter<<" times"<<std::endl;
  return 0;
}

Running the program several times produces output like this:

>./a.out;./a.out;./a.out;./a.out
thread 1 polling...
Counter triggered 38 times
thread 4 polling...
Counter triggered 39 times
thread 4 polling...
Counter triggered 40 times
thread 1 polling...
Counter triggered 39 times

Comments (7)

  1. BrianS reporter

    Both the nowaits are needed: without them the implicit barriers would deadlock waiting on the polling thread, which can't exit its loop until poll goes false. The last single can be removed and the code is still correct.

    I'd have to profile real code to see which is better. The atomic bounces the cache lines without the single, but the OMP runtime has to coordinate the single itself with its own machinery.

    Here is a different version that uses all OpenMP directives (I'm leery of mixing C++11 synchronization or threading with OpenMP methods):

    #include <iostream>
    #include <cmath>
    #include <cstdint>
    #include <omp.h>
    
    std::ostream flusher(nullptr); // null-sink stream: output is discarded
    void progress(int& a_counter)
    {
        int64_t x = 1250;
        for (int i=0; i<x; i++)
        { flusher<<"."; } // pointless busy work in place of upcxx progress
        a_counter++; // counter for purpose of demo code.
    }
    
    int main(int argc, char* argv[])
    {
      int counter=0; // counter doesn't need to be atomic: only the polling thread writes it
      bool poll=true;
      double red=0;
    #pragma omp parallel
      {
       #pragma omp single nowait
       {
        std::cout<<"thread "<<omp_get_thread_num()<<" polling...\n";
        bool ready;
        do{
          progress(counter);
          #pragma omp atomic read
          ready=poll;
        }while(ready);
       }
       #pragma omp for reduction(+:red) nowait
       for(int i=0; i<150000; i++) // made up compute job
       {
         double x=sin(2*M_PI*i/150.0)*cos(M_PI*i/150.0)+cos(150./(M_PI*i));
         red+=x*x;
       }
       #pragma omp atomic write // every thread that finishes its loop chunk releases the poller
       poll=false;
      }
      std::cout<<"Counter triggered "<<counter<<" times\n";
      return 0;
    }
    
  2. Former user Account Deleted

    @bvstraalen's example runs on my laptop, but I think it has a performance bug. Upon hitting the omp single nowait, one thread of the squad gets stuck spinning on the flag. The other threads pass through to the parallel-for, but since that's statically scheduled, there's a chunk of work carved out for the spinning thread that doesn't get executed until the spinning thread is released. So one thread has to finish its chunk and reach the bottom omp single, which frees the spinner to enter its chunk of the for-loop. What we want is the par-for to be statically scheduled over an n-1 subsquad of the parallel region. This could be done with nested parallelism:

    int thread_n = omp_get_max_threads();
    omp_set_nested(1);
    
    std::atomic<bool> spinning{true};
    
    #pragma omp parallel num_threads(2)
    {
      if(omp_get_thread_num() == 0) {
        while(spinning.load())
          upcxx::progress();
      }
      else {
        #pragma omp parallel num_threads(thread_n-1)
        {
          #pragma omp for
          for(...) {...}
        }
        spinning.store(false);
      }
    }
    

    Or CUDA style, where all you get is your thread id:

    int thread_n = omp_get_max_threads();
    
    std::atomic<int> spinning{thread_n-1};
    
    #pragma omp parallel num_threads(thread_n)
    {
      if(omp_get_thread_num() == 0) {
        while(spinning.load() != 0)
          upcxx::progress();
      }
      else {
        int me = omp_get_thread_num() - 1; // me ranges over [0...thread_n-2]
        for(int i = LO_INDEX(me); i != HI_INDEX(me); i++) {
          ...
        }
        spinning.fetch_sub(1);
      }
    }
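
    For concreteness, here is a runnable version of the nested variant. This is just a sketch: it reuses the made-up compute job and the null-stream progress() stand-in from the issue body in place of upcxx::progress(), and it assumes at least two OpenMP threads. (omp_set_nested is deprecated as of OpenMP 5.0 in favor of omp_set_max_active_levels.)

    #include <iostream>
    #include <cmath>
    #include <omp.h>
    #include <atomic>
    
    std::ostream flusher(nullptr); // null-sink stream: output is discarded
    void progress(int& a_counter)  // stand-in for upcxx::progress()
    {
        for (int i=0; i<1250; i++)
        { flusher<<"."; }
        a_counter++;
    }
    
    int main(int argc, char* argv[])
    {
      int counter=0;
      int thread_n = omp_get_max_threads(); // assumes thread_n >= 2
      omp_set_nested(1);
      std::atomic<bool> spinning{true};
      double red=0;
    #pragma omp parallel num_threads(2)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load()) progress(counter); // dedicated poller: owns no chunk of the for-loop
        }
        else {
    #pragma omp parallel num_threads(thread_n-1)
          {
    #pragma omp for reduction(+:red) // statically scheduled over the n-1 subsquad only
            for(int i=0; i<150000; i++) // made up compute job
            {
              double x=sin(2*M_PI*i/150.0)*cos(M_PI*i/150.0)+cos(150./(M_PI*i));
              red+=x*x;
            }
          }
          spinning.store(false); // release the poller
        }
      }
      std::cout<<"Counter triggered "<<counter<<" times\n";
      return 0;
    }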
    
  3. Former user Account Deleted

    And we could provide each flavor with boilerplate factored away.

    template<typename Lambda>
    void upcxx::omp_nested(Lambda f) {
      int thread_n = omp_get_max_threads();
      omp_set_nested(1);
    
      std::atomic<bool> spinning{true};
    
      #pragma omp parallel num_threads(2)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load())
            upcxx::progress();
        }
        else {
          #pragma omp parallel num_threads(thread_n-1)
          { f(); }
    
          spinning.store(false);
        }
      }
    }
    
    // user code
    upcxx::omp_nested([&]() {
      #omp for
      for(...) {...}
    });
    
    template<typename Lambda>
    void upcxx::omp_like_cuda(Lambda f) {
      int thread_n = omp_get_max_threads();
      std::atomic<int> spinning{thread_n-1};
    
      #pragma omp parallel num_threads(thread_n)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load() != 0)
            upcxx::progress();
        }
        else {
          int me = omp_get_thread_num() - 1; // me ranges over [0...thread_n-2]
          f(me, thread_n-1);
          spinning.fetch_sub(1);
        }
      }
    }
    
    // user code
    upcxx::omp_like_cuda([&](int thread_me, int thread_n) {
      for(int i=LO(thread_me); i != HI(thread_me); i++) {
        ...
      }
    });
    
  4. BrianS reporter

    Yup, the static schedule has the progress thread show up late and still have to perform its loop portion.

    I think we should bring this one to the OpenMP ECP project and ask what the preferred programming style should be. I don't think we should intercept the OpenMP outliner and metaprogram this one for users, in general. The OpenMP outliner has been notoriously finicky about what it will optimize. Intel's OMP runtime goes into strange spin races when none should exist.

  5. Former user Account Deleted

    I'm not experienced with OpenMP support, so I'll take your word for it that some important compilers may crash if we mix generic programming and OpenMP pragmas. But in theory we should be able to defeat that. Instead of the lambda type being passed through as a template parameter, which opens the door for inlining and aggressive context-sensitive optimizations around the omp pragmas and lambda contents, we can just pass the lambda as a function pointer (std::function is basically just that). Now upcxx::omp_nested is a concrete function, and the omp outliner will just see that it has to make a funptr call from many parallel threads. It's hard to imagine a compiler getting confused about that. The OpenMP standard says that parallel contexts are dynamically inherited, so any omp directives in the lambda are guaranteed by the standard to work.

    void upcxx::omp_nested(const std::function<void()> &f) {
      int thread_n = omp_get_max_threads();
      omp_set_nested(1);
    
      std::atomic<bool> spinning{true};
    
      #pragma omp parallel num_threads(2)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load())
            upcxx::progress();
        }
        else {
          #pragma omp parallel num_threads(thread_n-1)
          { f(); }
    
          spinning.store(false);
        }
      }
    }
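
    Usage looks the same as the template flavor above, with the loop body elided as before:

    // user code
    upcxx::omp_nested([&]() {
      #pragma omp for
      for(...) {...}
    });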
    
  6. BrianS reporter

    We will need to discover how important a peeled-off progress thread is for our progress. The nested OpenMP is another area that will have artifacts.

    But it is good to have some examples of how to service RPC calls in a fork-join model.

    We can also plop this in Khaled's lap! ;-)

  7. Dan Bonachea

    I don't think there is a spec issue here.

    Spec 1.0 Draft 2 provides the mechanisms required to program various flavors of progress threads, and the guide has examples of mixing this feature with OpenMP.
