Proposal: Thread-blocking support for personas

Issue #153 new
Dan Bonachea created an issue

Background

I have a design proposal based on reading the PAW-ATM19 paper by @Alexander Pöppl and @Scott Baden: https://doi.org/10.25344/S43G60

In the paper, they show various implementation strategies for Actors on UPC++. Actors generally spend a large fraction of their time idle (waiting on communication), so the overlap strategy is to intentionally overcommit up to 8 Actors per core to cover that latency.

Arguably the most straightforward strategy (dubbed the "thread-based execution strategy" and "Pond Thread") assigns one thread/persona to each Actor, and incoming RPCs are forwarded to the correct thread by LPC from the master communication thread. Unfortunately, this strategy performs very poorly compared to the other, more complicated strategies that run one thread per physical core and manually schedule Actors onto them. As the paper notes, one major reason for this discrepancy is the overhead incurred in the thread-based strategy by threads corresponding to currently idle Actors busy-waiting for incoming LPCs.

Below is a design proposal for new functionality that could address such a situation by allowing "idle" worker threads to truly sleep and relinquish all computation resources, while still remaining responsive to incoming events. This allows computation to be scheduled dynamically (by the OS) across all available execution contexts, rather than being scheduled at user level (where there is generally less insight into the process-wide scheduling state).

Proposal

void upcxx::sleep_until_progress();

Precondition: ! master_persona().active_with_caller()

Semantics: Issues upcxx::discharge(default_persona_scope()). Next, the calling thread idles (goes to sleep) until any completion notification or LPC arrival is available to any persona in the current thread's active stack. After such time, the thread is awakened and executes upcxx::progress(progress_level::user), then returns from this call.

Spurious wake-ups are permitted, meaning the call is permitted to return when no persona has processed a completion.

Discussion

This proposal would allow an app to "park" worker threads, awakening them when work is sent to them. It can similarly be used by non-master threads to sleep during the latency of communication they initiated, as an alternative to busy-waiting, e.g.:

while (!myfuture.ready()) upcxx::sleep_until_progress();

This of course assumes/requires that the master persona always remains awake and occasionally polls any GEX handles associated with RMA initiated on worker threads.
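
To make the pattern concrete, here is a rough usage sketch for the worker-thread case (this assumes a par-threadmode build, and upcxx::sleep_until_progress() is of course the proposed call, not an existing API): each worker thread issues an rget and sleeps until the value arrives, while the primordial thread keeps the master persona awake to drive progress.

  // Hypothetical usage sketch: sleep_until_progress() is the proposed call.
  #include <upcxx/upcxx.hpp>
  #include <atomic>
  #include <cstdio>
  #include <thread>

  int main() {
    upcxx::init(); // par threadmode assumed for multi-threaded communication

    // rank 0 allocates a value in its shared segment; everyone learns its address
    upcxx::global_ptr<double> gp = nullptr;
    if (upcxx::rank_me() == 0) gp = upcxx::new_<double>(3.14);
    gp = upcxx::broadcast(gp, 0).wait();

    std::atomic<bool> done{false};
    std::thread worker([&]() {
      upcxx::future<double> f = upcxx::rget(gp); // initiated on the worker's persona
      while (!f.ready())
        upcxx::sleep_until_progress();           // proposed: sleep instead of busy-waiting
      std::printf("rank %d fetched %g\n", upcxx::rank_me(), f.result());
      done = true;
    });

    // The master persona stays awake on the primordial thread, polling for
    // network-level progress on behalf of the RMA initiated by the worker.
    while (!done) upcxx::progress();

    worker.join();
    upcxx::barrier();
    if (upcxx::rank_me() == 0) upcxx::delete_(gp);
    upcxx::finalize();
  }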

It might also be beneficial to provide a runtime control allowing end-users to disable sleeping from the command line in non-overcommit scenarios (converting the call into a regular progress call).

Abstract implementation sketch

Each persona gains one additional state field:

  • std::atomic<thread_struct *> sleep_thread

Each thread's thread-specific data gains a few additional state fields:

  • mutex + condition variable - used for the actual sleeping and OS signaling
  • int num_execs - counter of completions in the current progress (already exists?)

The basic idea behind this implementation approach is to minimize the overhead added on existing critical paths elsewhere in the system (in particular when enqueueing completions), even if it means additional work for the thread performing a sleep (which is declaring it has no useful work available).

sleep_until_progress() pseudocode:

  assert(BACKEND_PAR && !master_persona().active_with_caller());
  num_execs = 0; // thread-local variable tracking completions executed
  upcxx::discharge();
  if (num_execs > 0) return; // discharge already made progress
  for (persona p in thread.stack) // mark all my personas as sleeping with this thread
    atomic_store(&p.sleep_thread, mythread, relaxed);
  atomic_thread_fence(release);
  mutex_lock(mythread.mutex);
  while (1) {
    upcxx::progress();
    if (num_execs > 0) { // made some progress
      for (persona p in thread.stack) // clear persona sleep state
        atomic_store(&p.sleep_thread, NULL, relaxed);
      mutex_unlock(mythread.mutex);
      return;
    }
    cond_wait(mythread.mutex, mythread.cond); // sleep until signaled
  }

enqueue_completions(persona p, ...) pseudocode:

  // <<current enqueueing logic ...>>
  #if BACKEND_PAR // new code
    sth = atomic_load(&p.sleep_thread, acquire); // one new load
    if (sth) { // one new branch - the persona's thread might be asleep
      mutex_lock(sth.mutex);
      cond_signal(sth.cond); // wake it up
      mutex_unlock(sth.mutex);
    }
  #endif

I know the runtime internals don't work exactly like this at the moment; the idea with the sketch above is to demonstrate that this feature can ideally incur only one additional load and branch on existing paths (and only in PAR mode).
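
For concreteness, here is a standalone C++ sketch of the same wake-up handshake. All names below are invented for illustration (this is not actual runtime code), it omits the discharge step, and it handles only a single persona per thread rather than the full persona stack; the point is simply that, beyond the queue push itself, the enqueue-side fast path pays one atomic load and one branch, with the mutex/condition-variable traffic confined to the rare case where a sleeper has actually registered.

  #include <atomic>
  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <mutex>

  struct thread_state {
    std::mutex mtx;
    std::condition_variable cond;
    int num_execs = 0; // completions executed in the current "progress"
  };

  struct persona_state {
    std::atomic<thread_state*> sleep_thread{nullptr}; // non-null while its thread sleeps
    std::mutex q_mtx;                                 // stands in for the existing LPC queue
    std::deque<std::function<void()>> lpc_q;
  };

  // Enqueue path: when nobody is asleep, the only new cost is one atomic load and one branch.
  void enqueue_lpc(persona_state &p, std::function<void()> fn) {
    { std::lock_guard<std::mutex> g(p.q_mtx); p.lpc_q.push_back(std::move(fn)); }
    if (thread_state *sth = p.sleep_thread.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> g(sth->mtx); // rare slow path: a sleeper is registered
      sth->cond.notify_one();                  // wake it up
    }
  }

  // Drain the queue, counting executions (stands in for upcxx::progress()).
  void run_progress(persona_state &p, thread_state &me) {
    std::deque<std::function<void()>> work;
    { std::lock_guard<std::mutex> g(p.q_mtx); work.swap(p.lpc_q); }
    for (auto &fn : work) { fn(); ++me.num_execs; }
  }

  // Simplified sleep_until_progress for a thread owning a single persona.
  void sleep_until_progress(persona_state &p, thread_state &me) {
    me.num_execs = 0;
    p.sleep_thread.store(&me, std::memory_order_release);
    std::unique_lock<std::mutex> lk(me.mtx);
    for (;;) {
      run_progress(p, me);
      if (me.num_execs > 0) { // made some progress
        p.sleep_thread.store(nullptr, std::memory_order_relaxed);
        return; // lk unlocks on destruction
      }
      me.cond.wait(lk); // sleep until signaled; spurious wake-ups just re-run the loop
    }
  }

Note that the sleeper re-runs progress after registering itself and before waiting, which (as in the pseudocode above) closes the race with completions enqueued before the registration became visible to the enqueueing thread.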

Thoughts?

Comments (13)

  1. Paul Hargrove

    I like the idea and hope it is something we can work into the runtime implementation at some point. In particular, this might become part of fulfilling the fuzzy promise we've made regarding exploring strong progress in the later years.

  2. john bachan

    I am not in favor of this proposal:

    1. It adds overhead to the send side of each LPC (or any other notification destined to a persona), in that it will require at least a pthread_cond_signal. In defense of this, pthread_cond_signal is cheap when there are no waiting threads (at least on Linux/glibc), which is the case that matters, so maybe this isn't such a big deal. But on other, non-Linux systems pthread_cond_signal might involve grabbing a lock unconditionally, or, gasp, an unconditional syscall (I'm being pessimistic).

    2. This encourages the non-HPC practice of relying on the OS to do context switching in a performant manner. I believe the right approach is to identify what mechanism UPC++ lacks that has caused the user to turn to the OS for support. The answer is a shared work queue. I believe a shared work queue could integrate with and complement our semantics nicely. For instance:

      • Completions could be targeted to the shared queue: operation_cx::as_lpc(upcxx::shared_lpc_queue&, <lambda>)

      • Given a way to name a remote work queue, RPCs could be delivered to them more efficiently than is possible now, since the AM could push to the queue from whatever thread runs it instead of forcing the user to bounce an intermediate RPC->LPC off the master persona.

    Point 2 is the one I believe most strongly in. If it weren't there, I could get over the overheads addressed in 1 as being necessary, and easily elided by config-time options if the user knows they won't need it. But to me, 2 makes a strong case. Shared work queues are fairly common stuff in our world. Pursuing the stated proposal incurs the cost of slowing down (albeit marginally) our runtime all to give the user convenient access to an OS facility that is regularly seen as significantly inferior to pure user-space solutions.

    I think the best course is to spin this off into a separate proposal for adding something like upcxx::shared_lpc_queue.

  3. Dan Bonachea reporter

    It adds an overhead to the send side of each LPC [...] that it will require at least a pthread_cond_signal [...] pthread_cond_signal is cheap when there are no waiting threads (at least on linux/glibc) which is the case that matters.

    John - If you read the proposed algorithm (in enqueue_completions), you'll see it elides the call to pthread_cond_signal entirely when there are no sleeping threads. So for the case of apps that never use this functionality or only rarely signal a sleeping thread, the only added cost is one additional load and branch (technically an atomic load, but on x86 this is just a regular load instruction).

    I agree with your general point that other extensions may be justified to solve SOME problems. However I don't see how a shared work queue would directly apply to @Alexander Pöppl's Actor library, which is the topic of this issue. In particular, the "work" items in the Actor library are NOT independent work units that can be serviced with arbitrary concurrency - incoming messages for a given Actor have serial dependencies dictating the order in which they can be processed. This is part of what made their OpenMP version messy. I think this means you'd need at least one work queue per Actor (many per process, dynamically created), and idle threads would burn CPU racing to poll all the local queues - which starts to look like a poor design for a user-level threading library. This might work, but it definitely requires a re-design and doesn't solve the same problem posed here, which (amongst other things) provides a drop-in replacement for any future::wait() call that relinquishes all processor resources. (It also has the potential secondary benefits of being more power-friendly and more effectively scheduled on an overcommitted host.)

  4. john bachan

    Thanks, I missed that you were conditionally eliding the pthread_cond_signal. But that overhead is exactly what glibc on linux does anyway. So I was at least typing with these performance characteristics in mind.

    I agree that this does allow the OS to better utilize resources on overcommitted hosts. But if MPI doesn't have an analog capability (?), or at least the community isn't screaming for it (?), I don't see why we should feel pressure to solve the issue. It is clearly uncommon practice in HPC.

    I see now that a shared queue isn't enough for actors. But I would still prefer dreaming up general mechanisms that are sufficient without leaving user space.

  5. Dan Bonachea reporter

    I missed that you were conditionally eliding the pthread_cond_signal. But that overhead is exactly what glibc on linux does anyway

    FWIW most safe synchronization patterns using pthread_cond_signal also require grabbing a mutex to enforce a critical section around the signal, overhead which cannot be elided inside the implementation of pthread_cond_signal - but that's not really relevant to this discussion.

    I agree that this does allow the OS to better utilize resources on overcommitted hosts. But if MPI doesn't have an analog capability (?), or at least the community isn't screaming for it (?), I don't see why we should feel pressure to solve the issue. It is clearly uncommon practice in HPC.

    So your argument against the application-motivated proposal is that it's not application-motivated? ;-)

    In all seriousness, MPI_Wait() blocks the calling thread and is permitted to sleep that thread (provided an appropriate mechanism for wake-up and progress on background tasks), although it's implementation-dependent whether it actually would. The same is true for most other MPI blocking calls. Less primitive programming models with real task-scheduling features, like Chapel, provide support to allow tasks to block/suspend on dependencies. Nor is this a new idea; task-scheduling layers have been doing this for decades.

    future::wait() could theoretically sleep a thread, but since it's specified to run any user callbacks or completions on the persona stack of the calling thread, it would need to awaken this particular thread to service any of them - this proposal gives a way to explicitly request that behavior. We don't have any resource-efficient way to spell "block for an LPC arrival", which is an important part of what this proposal could provide.

  6. john bachan

    So your argument against the application-motivated proposal is that it's not application-motivated? ;-)

    Exactly! Just one case (actor frameworks) doesn't seem sufficient.

    MPI_Wait() being permitted to sleep the thread says nothing. Every function call (ever? joke.) is permitted to sleep a thread. What I'm interested in is whether any MPI implementation would sleep during MPI_Wait() for the beneficial resource-management reasons you stated above. And Chapel's blocking facilities have the important implementation detail of blocking in user space for other Chapel-related things via lightweight threads (qthreads or something like that, I think).

    I think it is clear, though the exact mechanism has not been identified, that actors can be handled most efficiently without the OS's involvement if we are willing to enrich our semantics with some unknown cool thing. For that reason, I am pushing back on putting in effort and complicating our runtime to service a need in a suboptimal way, since that energy would feel misspent on the day somebody else does actors "right" and we lose the actor framework shootout.

  7. Dan Bonachea reporter

    What I'm interested in is whether any MPI implementation would sleep during MPI_Wait()

    I don't know if any of the major implementations include such a feature at the moment.

    However, here is a discussion thread from the Open MPI user list, found in a few moments of searching, with MPI users "clamoring" for such functionality and some reputable MPI folks acknowledging that it could be useful in the right context.

    I agree that yielding to a user-space scheduler is generally preferable to yielding into kernel space, but that's orthogonal - neither is a requirement for this proposal. There are implementations of POSIX threads in user-space (here is one). Also, nothing in my sketch above is specific to POSIX threads - it's not impossible that someday a UPC++ implementation could support execution over QThreads or some other threading library (that, after all, is one of the motivations for decoupling personas and other things from POSIX/C++ threading mechanisms).
