Proposal: Thread-blocking support for personas

Issue #153 new
Dan Bonachea created an issue

Background

I have a design proposal based on reading the PAW-ATM19 paper by @Alexander Pöppl and @Scott Baden: https://doi.org/10.25344/S43G60

In the paper, they show various implementation strategies for Actors on UPC++. Actors generally spend a large fraction of their time idle (waiting on communication), so the overlap strategy is to intentionally overcommit up to 8 Actors per core to cover that latency.

Arguably the most straightforward strategy (dubbed the "thread-based execution strategy" and "Pond Thread") assigns one thread/persona to each Actor, and incoming RPCs are forwarded to the correct thread by LPC from the master communication thread. Unfortunately, this strategy performs very poorly compared to the other, more complicated strategies that run one thread per physical core and manually schedule Actors onto them. As the paper notes, one major reason for this discrepancy is the overhead incurred in the thread-based strategy by threads corresponding to currently idle Actors busy-waiting for incoming LPCs.

Below is a design proposal for new functionality that could address such a situation by allowing "idle" worker threads to truly sleep and relinquish all computation resources, while still remaining responsive to incoming events. This allows computation to be scheduled dynamically (by the OS) across all available execution contexts, rather than being scheduled at user level (where there is generally less insight into the process-wide scheduling state).

Proposal

void upcxx::sleep_until_progress();

Precondition: ! master_persona().active_with_caller()

Semantics: Issues upcxx::discharge(default_persona_scope()). Next, the calling thread idles (goes to sleep) until any completion notification or LPC arrival is available to any persona in the current thread's active stack. After such time, the thread is awakened and executes upcxx::progress(progress_level::user), then returns from this call.

Spurious wake-ups are permitted, meaning the call is permitted to return when no persona has processed a completion.

Discussion

This proposal would allow an app to "park" worker threads, awakening them when work is sent to them. It can similarly be used by non-master threads to sleep during the latency of communication they initiated, as an alternative to busy-waiting, e.g.:

while (!myfuture.ready()) upcxx::sleep_until_progress();

This of course assumes/requires that the master persona always remains awake and occasionally polls any GEX handles associated with RMA initiated on worker threads.
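
To make the pattern concrete, here is a rough usage sketch for the worker-thread case (this assumes a par-threadmode build, and upcxx::sleep_until_progress() is of course the proposed call, not an existing API): each worker thread issues an rget and sleeps until the value arrives, while the primordial thread keeps the master persona awake to drive progress.

  // Hypothetical usage sketch: sleep_until_progress() is the proposed call.
  #include <upcxx/upcxx.hpp>
  #include <atomic>
  #include <cstdio>
  #include <thread>

  int main() {
    upcxx::init(); // par threadmode assumed for multi-threaded communication

    // rank 0 allocates a value in its shared segment; everyone learns its address
    upcxx::global_ptr<double> gp = nullptr;
    if (upcxx::rank_me() == 0) gp = upcxx::new_<double>(3.14);
    gp = upcxx::broadcast(gp, 0).wait();

    std::atomic<bool> done{false};
    std::thread worker([&]() {
      upcxx::future<double> f = upcxx::rget(gp); // initiated on the worker's persona
      while (!f.ready())
        upcxx::sleep_until_progress();           // proposed: sleep instead of busy-waiting
      std::printf("rank %d fetched %g\n", upcxx::rank_me(), f.result());
      done = true;
    });

    // The master persona stays awake on the primordial thread, polling for
    // network-level progress on behalf of the RMA initiated by the worker.
    while (!done) upcxx::progress();

    worker.join();
    upcxx::barrier();
    if (upcxx::rank_me() == 0) upcxx::delete_(gp);
    upcxx::finalize();
  }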

It might also be beneficial to provide a runtime control allowing end-users to disable sleeping from the command line in non-overcommit scenarios (converting the call into a regular progress call).

Abstract implementation sketch

Each persona gains one additional state field:

  • std::atomic<thread_struct *> sleep_thread

Each thread's thread-specific data gains a few additional state fields:

  • mutex + condition variable - used for the actual sleeping and OS signaling
  • int num_execs - counter of completions in the current progress (already exists?)

The basic idea behind this implementation approach is to minimize the overhead added on existing critical paths elsewhere in the system (in particular when enqueueing completions), even if it means additional work for the thread performing a sleep (which is declaring it has no useful work available).

sleep_until_progress() pseudocode:

  assert(BACKEND_PAR && !master_persona().active_with_caller());
  num_execs = 0; // thread-local variable tracking completions executed
  upcxx::discharge();
  if (num_execs > 0) return; // discharge already made progress
  for (persona p in thread.stack) // mark all my personas as sleeping with this thread
    atomic_store(&p.sleep_thread, mythread, relaxed);
  atomic_thread_fence(release);
  mutex_lock(mythread.mutex);
  while (1) {
    upcxx::progress();
    if (num_execs > 0) { // made some progress
      for (persona p in thread.stack) // clear persona sleep state
        atomic_store(&p.sleep_thread, NULL, relaxed);
      mutex_unlock(mythread.mutex);
      return;
    }
    cond_wait(mythread.mutex, mythread.cond); // sleep until signaled
  }

enqueue_completions(persona p, ...) pseudocode:

  // <<current enqueueing logic ...>>
  #if BACKEND_PAR // new code
    sth = atomic_load(&p.sleep_thread, acquire); // one new load
    if (sth) { // one new branch - the persona's thread might be asleep
      mutex_lock(sth.mutex);
      cond_signal(sth.cond); // wake it up
      mutex_unlock(sth.mutex);
    }
  #endif

I know the runtime internals don't work exactly like this at the moment; the idea with the sketch above is to demonstrate that this feature can ideally incur only one additional load and branch on existing paths (and only in PAR mode).
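
For concreteness, here is a standalone C++ sketch of the same wake-up handshake. All names below are invented for illustration (this is not actual runtime code), it omits the discharge step, and it handles only a single persona per thread rather than the full persona stack; the point is simply that, beyond the queue push itself, the enqueue-side fast path pays one atomic load and one branch, with the mutex/condition-variable traffic confined to the rare case where a sleeper has actually registered.

  #include <atomic>
  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <mutex>

  struct thread_state {
    std::mutex mtx;
    std::condition_variable cond;
    int num_execs = 0; // completions executed in the current "progress"
  };

  struct persona_state {
    std::atomic<thread_state*> sleep_thread{nullptr}; // non-null while its thread sleeps
    std::mutex q_mtx;                                 // stands in for the existing LPC queue
    std::deque<std::function<void()>> lpc_q;
  };

  // Enqueue path: when nobody is asleep, the only new cost is one atomic load and one branch.
  void enqueue_lpc(persona_state &p, std::function<void()> fn) {
    { std::lock_guard<std::mutex> g(p.q_mtx); p.lpc_q.push_back(std::move(fn)); }
    if (thread_state *sth = p.sleep_thread.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> g(sth->mtx); // rare slow path: a sleeper is registered
      sth->cond.notify_one();                  // wake it up
    }
  }

  // Drain the queue, counting executions (stands in for upcxx::progress()).
  void run_progress(persona_state &p, thread_state &me) {
    std::deque<std::function<void()>> work;
    { std::lock_guard<std::mutex> g(p.q_mtx); work.swap(p.lpc_q); }
    for (auto &fn : work) { fn(); ++me.num_execs; }
  }

  // Simplified sleep_until_progress for a thread owning a single persona.
  void sleep_until_progress(persona_state &p, thread_state &me) {
    me.num_execs = 0;
    p.sleep_thread.store(&me, std::memory_order_release);
    std::unique_lock<std::mutex> lk(me.mtx);
    for (;;) {
      run_progress(p, me);
      if (me.num_execs > 0) { // made some progress
        p.sleep_thread.store(nullptr, std::memory_order_relaxed);
        return; // lk unlocks on destruction
      }
      me.cond.wait(lk); // sleep until signaled; spurious wake-ups just re-run the loop
    }
  }

Note that the sleeper re-runs progress after registering itself and before waiting, which (as in the pseudocode above) closes the race with completions enqueued before the registration became visible to the enqueueing thread.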

Thoughts?

Comments (13)

  1. Paul Hargrove

    I like the idea and hope it is something we can work into the runtime implementation at some point. In particular, this might become part of fulfilling the fuzzy promise we've made regarding exploring strong progress in the later years.

  2. john bachan

    I am not in favor of this proposal:

    1. It adds overhead to the send side of each LPC (or any other notification destined to a persona), in that it will require at least a pthread_cond_signal. In defense of this, pthread_cond_signal is cheap when there are no waiting threads (at least on Linux/glibc), which is the case that matters, so maybe this isn't such a big deal. But on other, non-Linux systems pthread_cond_signal might involve grabbing a lock unconditionally, or, gasp, an unconditional syscall (I'm being pessimistic).

    2. This encourages the non-HPC practice of relying on the OS to do context switching in a performant manner. I believe the right approach is to identify what mechanism UPC++ lacks that has caused the user to turn to the OS for support. The answer is a shared work queue. I believe a shared work queue could integrate with and complement our semantics nicely. For instance:

      • Completions could be targeted to the shared queue: operation_cx::as_lpc(upcxx::shared_lpc_queue&, <lambda>)

      • Given a way to name a remote work queue, RPCs could be delivered to them more efficiently than is possible now, since the AM could push to the queue from whatever thread runs it instead of forcing the user to bounce an intermediate RPC->LPC off the master persona.

    Point 2 is the one I believe most strongly in. If it weren't there, I could get over the overheads addressed in 1 as being necessary, and easily elided by config-time options if the user knows they won't need it. But to me, 2 makes a strong case. Shared work queues are fairly common stuff in our world. Pursuing the stated proposal incurs the cost of slowing down (albeit marginally) our runtime all to give the user convenient access to an OS facility that is regularly seen as significantly inferior to pure user-space solutions.

    I think the best course is to spin this off into a separate proposal for adding something like upcxx::shared_lpc_queue.

  3. Dan Bonachea reporter

    It adds an overhead to the send side of each LPC [...] that it will require at least a pthread_cond_signal [...] pthread_cond_signal is cheap when there are no waiting threads (at least on linux/glibc) which is the case that matters.

    John - If you read the proposed algorithm (in enqueue_completions), you'll see it elides the call to pthread_cond_signal entirely when there are no sleeping threads. So for the case of apps that never use this functionality or only rarely signal a sleeping thread, the only added cost is one additional load and branch (technically an atomic load, but on x86 this is just a regular load instruction).

    I agree with your general point that other extensions may be justified to solve SOME problems. However I don't see how a shared work queue would directly apply to @Alexander Pöppl's Actor library, which is the topic of this issue. In particular, the "work" items in the Actor library are NOT independent work units that can be serviced with arbitrary concurrency - incoming messages for a given Actor have serial dependencies dictating the order in which they can be processed. This is part of what made their OpenMP version messy. I think this means you'd need at least one work queue per Actor (many per process, dynamically created), and idle threads would burn CPU racing to poll all the local queues - which starts to look like a poor design for a user-level threading library. This might work, but it definitely requires a re-design and doesn't solve the same problem posed here, which (amongst other things) provides a drop-in replacement for any future::wait() call that relinquishes all processor resources. (It also has the potential secondary benefits of being more power-friendly and more effectively scheduled on an overcommitted host.)

  4. john bachan

    Thanks, I missed that you were conditionally eliding the pthread_cond_signal. But that overhead is exactly what glibc on linux does anyway. So I was at least typing with these performance characteristics in mind.

    I agree that this does allow the OS to better utilize resources on overcommitted hosts. But if MPI doesn't have an analog capability (?), or at least the community isn't screaming for it (?), I don't see why we should feel pressure to solve the issue. It is clearly uncommon practice in HPC.

    I see now that a shared queue isn't enough for actors. But I would still prefer dreaming up general mechanisms that are sufficient without leaving user space.

  5. Dan Bonachea reporter

    I missed that you were conditionally eliding the pthread_cond_signal. But that overhead is exactly what glibc on linux does anyway

    FWIW most safe synchronization patterns using pthread_cond_signal also require grabbing a mutex to enforce a critical section around the signal, overhead which cannot be elided inside the implementation of pthread_cond_signal - but that's not really relevant to this discussion.

    I agree that this does allow the OS to better utilize resources on overcommitted hosts. But if MPI doesn't have an analog capability (?), or at least the community isn't screaming for it (?), I don't see why we should feel pressure to solve the issue. It is clearly uncommon practice in HPC.

    So your argument against the application-motivated proposal is that it's not application-motivated? ;-)

    In all seriousness, MPI_Wait() blocks the calling thread and is permitted to sleep that thread (provided an appropriate mechanism for wake-up and progress on background tasks), although it's implementation-dependent whether it actually would. The same is true for most other MPI blocking calls. Less primitive programming models with real task-scheduling features, like Chapel, provide support to allow tasks to block/suspend on dependencies. Nor is this a new idea; task-scheduling layers have been doing this for decades.

    future::wait() could theoretically sleep a thread, but since it's specified to run any user callbacks or completions on the persona stack of the calling thread, it would need to awaken this particular thread to service any of them - this proposal gives a way to explicitly request that behavior. We don't have any resource-efficient way to spell "block for an LPC arrival", which is an important part of what this proposal could provide.

  6. john bachan

    So your argument against the application-motivated proposal is that it's not application-motivated? ;-)

    Exactly! Just one case (actor frameworks) doesn't seem sufficient.

    MPI_Wait() being permitted to sleep the thread says nothing. Every function call (ever? joke.) is permitted to sleep a thread. What I'm interested in is whether any MPI implementation would sleep during MPI_Wait() for the beneficial resource-management reasons you stated above. And Chapel's blocking facilities have the important implementation detail of blocking in user space for other Chapel-related things via lightweight threads (qthreads or something like that, I think).

    I think it is clear, though the exact mechanism has not been identified, that actors can be handled most efficiently without the OS's involvement if we are willing to enrich our semantics with some unknown cool thing. For that reason, I am pushing back on putting in effort and complicating our runtime to service a need in a suboptimal way, since that energy would feel misspent on the day somebody else does actors "right" and we lose the actor framework shootout.

  7. Dan Bonachea reporter

    What I'm interested in is whether any MPI implementation would sleep during MPI_Wait()

    I don't know if any of the major implementations include such a feature at the moment.

    However, here is a discussion thread from the Open MPI user list, found in a few moments of searching, with MPI users "clamoring" for such functionality and some reputable MPI folks acknowledging that it could be useful in the right context.

    I agree that yielding to a user-space scheduler is generally preferable to yielding into kernel space, but that's orthogonal - neither is a requirement for this proposal. There are implementations of POSIX threads in user-space (here is one). Also, nothing in my sketch above is specific to POSIX threads - it's not impossible that someday a UPC++ implementation could support execution over QThreads or some other threading library (that, after all, is one of the motivations for decoupling personas and other things from POSIX/C++ threading mechanisms).
