polling thread for progress in OpenMP programs

Issue #19 resolved
BrianS created an issue

I was just prototyping what progress might look like in an OpenMP program. In a fork-join programming model it is hard to think about attentiveness. While not the prettiest, this would at least work and get us off the ground.

#include <iostream>
#include <cmath>
#include <cstdint>
#include <omp.h>
#include <atomic>

std::ostream flusher(nullptr); // null-sink stream: output is discarded
void progress(int& a_counter)
{
    int64_t x = 1250;
    for (int i=0; i<x; i++)
    { flusher<<"."; } // pointless busy work in place of upcxx progress
    a_counter++; // counter for purpose of demo code.
}

int main(int argc, char* argv[])
{
  int counter=0; // counter doesn't need to be atomic: only the polling thread writes it
  std::atomic<bool> poll(true);
  double red=0;
#pragma omp parallel
  {
   #pragma omp single nowait // one thread peels off to poll; nowait lets the rest fall through to the loop
   {
    std::cout<<"thread "<<omp_get_thread_num()<<" polling..."<<std::endl;
    while(poll.load()) progress(counter);
   }
   #pragma omp for reduction(+:red) nowait
   for(int i=0; i<150000; i++) // made up compute job
   {
     double x=sin(2*M_PI*i/150.0)*cos(M_PI*i/150.0)+cos(150./(M_PI*i));
     red+=x*x;
   }
   #pragma omp single // first thread to arrive releases the poller
   {
     poll.store(false);
   }
  }
  std::cout<<"Counter triggered "<<counter<<" times"<<std::endl;
  return 0;
}

Running the program several times produces output like this:

>./a.out;./a.out;./a.out;./a.out
thread 1 polling...
Counter triggered 38 times
thread 4 polling...
Counter triggered 39 times
thread 4 polling...
Counter triggered 40 times
thread 1 polling...
Counter triggered 39 times

Comments (7)

  1. BrianS reporter

    Both the nowaits are needed: without them the implicit barriers would deadlock waiting on the polling thread, which can't exit its loop until poll goes false. The last single can be removed and the code is still correct.

    I'd have to profile real code to see which is better. The atomic bounces the cache lines without the single, but the OMP runtime has to coordinate the single itself with its own machinery.

    Here is a different version that uses all OpenMP directives (I'm leery of mixing C++11 synchronization or threading with OpenMP methods):

    #include <iostream>
    #include <cmath>
    #include <cstdint>
    #include <omp.h>
    
    std::ostream flusher(nullptr); // null-sink stream: output is discarded
    void progress(int& a_counter)
    {
        int64_t x = 1250;
        for (int i=0; i<x; i++)
        { flusher<<"."; } // pointless busy work in place of upcxx progress
        a_counter++; // counter for purpose of demo code.
    }
    
    int main(int argc, char* argv[])
    {
      int counter=0; // counter doesn't need to be atomic: only the polling thread writes it
      bool poll=true;
      double red=0;
    #pragma omp parallel
      {
       #pragma omp single nowait
       {
        std::cout<<"thread "<<omp_get_thread_num()<<" polling...\n";
        bool ready;
        do{
          progress(counter);
          #pragma omp atomic read
          ready=poll;
        }while(ready);
       }
       #pragma omp for reduction(+:red) nowait
       for(int i=0; i<150000; i++) // made up compute job
       {
         double x=sin(2*M_PI*i/150.0)*cos(M_PI*i/150.0)+cos(150./(M_PI*i));
         red+=x*x;
       }
       #pragma omp atomic write // every thread that finishes its loop chunk releases the poller
       poll=false;
      }
      std::cout<<"Counter triggered "<<counter<<" times\n";
      return 0;
    }
    
  2. Former user Account Deleted

    @bvstraalen's example runs on my laptop, but I think it has a performance bug. Upon hitting the omp single nowait, one thread of the squad gets stuck spinning on the flag. The other threads pass through to the parallel-for, but since that's statically scheduled, there's a chunk of work carved out for the spinning thread that doesn't get executed until the spinning thread is released. So one thread has to finish its chunk and reach the bottom omp single, which frees the spinner to enter its chunk of the for-loop. What we want is the par-for to be statically scheduled over an n-1 subsquad of the parallel region. This could be done with nested parallelism:

    int thread_n = omp_get_max_threads();
    omp_set_nested(1);
    
    std::atomic<bool> spinning{true};
    
    #pragma omp parallel num_threads(2)
    {
      if(omp_get_thread_num() == 0) {
        while(spinning.load())
          upcxx::progress();
      }
      else {
        #pragma omp parallel num_threads(thread_n-1)
        {
          #pragma omp for
          for(...) {...}
        }
        spinning.store(false);
      }
    }
    

    Or CUDA style, where all you get is your thread id:

    int thread_n = omp_get_max_threads();
    
    std::atomic<int> spinning{thread_n-1};
    
    #pragma omp parallel num_threads(thread_n)
    {
      if(omp_get_thread_num() == 0) {
        while(spinning.load() != 0)
          upcxx::progress();
      }
      else {
        int me = omp_get_thread_num() - 1; // me ranges over [0...thread_n-2]
        for(int i = LO_INDEX(me); i != HI_INDEX(me); i++) {
          ...
        }
        spinning.fetch_sub(1);
      }
    }
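
    For concreteness, here is a runnable version of the nested variant. This is just a sketch: it reuses the made-up compute job and the null-stream progress() stand-in from the issue body in place of upcxx::progress(), and it assumes at least two OpenMP threads. (omp_set_nested is deprecated as of OpenMP 5.0 in favor of omp_set_max_active_levels.)

    #include <iostream>
    #include <cmath>
    #include <omp.h>
    #include <atomic>
    
    std::ostream flusher(nullptr); // null-sink stream: output is discarded
    void progress(int& a_counter)  // stand-in for upcxx::progress()
    {
        for (int i=0; i<1250; i++)
        { flusher<<"."; }
        a_counter++;
    }
    
    int main(int argc, char* argv[])
    {
      int counter=0;
      int thread_n = omp_get_max_threads(); // assumes thread_n >= 2
      omp_set_nested(1);
      std::atomic<bool> spinning{true};
      double red=0;
    #pragma omp parallel num_threads(2)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load()) progress(counter); // dedicated poller: owns no chunk of the for-loop
        }
        else {
    #pragma omp parallel num_threads(thread_n-1)
          {
    #pragma omp for reduction(+:red) // statically scheduled over the n-1 subsquad only
            for(int i=0; i<150000; i++) // made up compute job
            {
              double x=sin(2*M_PI*i/150.0)*cos(M_PI*i/150.0)+cos(150./(M_PI*i));
              red+=x*x;
            }
          }
          spinning.store(false); // release the poller
        }
      }
      std::cout<<"Counter triggered "<<counter<<" times\n";
      return 0;
    }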
    
  3. Former user Account Deleted

    And we could provide each flavor with boilerplate factored away.

    template<typename Lambda>
    void upcxx::omp_nested(Lambda f) {
      int thread_n = omp_get_max_threads();
      omp_set_nested(1);
    
      std::atomic<bool> spinning{true};
    
      #pragma omp parallel num_threads(2)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load())
            upcxx::progress();
        }
        else {
          #pragma omp parallel num_threads(thread_n-1)
          { f(); }
    
          spinning.store(false);
        }
      }
    }
    
    // user code
    upcxx::omp_nested([&]() {
      #omp for
      for(...) {...}
    });
    
    template<typename Lambda>
    void upcxx::omp_like_cuda(Lambda f) {
      int thread_n = omp_get_max_threads();
      std::atomic<int> spinning{thread_n-1};
    
      #pragma omp parallel num_threads(thread_n)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load() != 0)
            upcxx::progress();
        }
        else {
          int me = omp_get_thread_num() - 1; // me ranges over [0...thread_n-2]
          f(me, thread_n-1);
          spinning.fetch_sub(1);
        }
      }
    }
    
    // user code
    upcxx::omp_like_cuda([&](int thread_me, int thread_n) {
      for(int i=LO(thread_me); i != HI(thread_me); i++) {
        ...
      }
    });
    
  4. BrianS reporter

    Yup, the static schedule has the progress thread show up late and still have to perform its loop portion.

    I think we should bring this one to the OpenMP ECP project and ask what the preferred programming style should be. I don't think we should intercept the OpenMP outliner and metaprogram this one for users, in general. The OpenMP outliner has been notoriously finicky about what it will optimize. Intel's OMP runtime goes into strange spin races when none should exist.

  5. Former user Account Deleted

    I'm not experienced with OpenMP support, so I'll take your word for it that some important compilers may crash if we mix generic programming and OpenMP pragmas. But in theory we should be able to defeat that. Instead of the lambda type being passed through as a template parameter, which opens the door for inlining and aggressive context-sensitive optimizations around the omp pragmas and lambda contents, we can just pass the lambda as a function pointer (std::function is basically just that). Now upcxx::omp_nested is a concrete function, and the omp outliner will just see that it has to make a funptr call from many parallel threads. It's hard to imagine a compiler getting confused about that. The OpenMP standard says that parallel contexts are dynamically inherited, so any omp directives in the lambda are guaranteed by the standard to work.

    void upcxx::omp_nested(const std::function<void()> &f) {
      int thread_n = omp_get_max_threads();
      omp_set_nested(1);
    
      std::atomic<bool> spinning{true};
    
      #pragma omp parallel num_threads(2)
      {
        if(omp_get_thread_num() == 0) {
          while(spinning.load())
            upcxx::progress();
        }
        else {
          #pragma omp parallel num_threads(thread_n-1)
          { f(); }
    
          spinning.store(false);
        }
      }
    }
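
    Usage looks the same as the template flavor above, with the loop body elided as before:

    // user code
    upcxx::omp_nested([&]() {
      #pragma omp for
      for(...) {...}
    });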
    
  6. BrianS reporter

    We will need to discover how important a peeled-off progress thread is for our progress. The nested OpenMP is another area that will have artifacts.

    But it is good to have some examples of how to service RPC calls in a fork-join model.

    We can also plop this in Khaled's lap! ;-)

  7. Dan Bonachea

    I don't think there is a spec issue here.

    Spec 1.0 Draft 2 provides the mechanisms required to program various flavors of progress threads, and the guide has examples of mixing this feature with OpenMP.
