Oversubscription and spin-waiting

Issue #153 resolved
john bachan created an issue

The implementation currently uses a naive heuristic to detect when the application is spin-waiting for operations to complete. It pessimistically assumes those operations depend on attentiveness from another rank that might need the current CPU, so the heuristic invokes sched_yield when it thinks we're spinning.
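For concreteness, a minimal sketch of this kind of heuristic (the threshold and names are illustrative placeholders, not the actual UPC++ internals):

```cpp
#include <sched.h>
#include <atomic>

// Hypothetical spin-detection heuristic: after some number of fruitless
// progress attempts, assume we are spin-waiting on another rank and
// politely give up the CPU. kSpinThreshold is an assumed tuning knob.
constexpr int kSpinThreshold = 100;

void spin_until(std::atomic<bool>& done) {
  int fruitless = 0;
  while (!done.load(std::memory_order_acquire)) {
    if (++fruitless >= kSpinThreshold) {
      sched_yield();   // yield the CPU to a competing co-located rank
      fruitless = 0;
    }
  }
}
```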

When oversubscription is occurring, as is common with SMP runs on our laptops, this can yield a striking performance boost. But on HPC machines, oversubscription is less common, and these calls to sched_yield might actually be hurting us.

We need to discuss a better option. Perhaps an env var like UPCXX_OVERSUBSCRIBED=0/1?

@bonachea had this to say when he first brought this up in a PR comment:

Re: Oversubscription: This comment alludes to a general problem we've been handling in PGAS runtime layers for years, anywhere that blocking occurs (e.g. future::wait). I'd like to discuss deploying a more robust solution for oversubscription in UPC++, because the consequences of "getting it wrong" can be catastrophic to measured performance. Yielding a kernel thread to the OS when unnecessary (e.g. on a batch-scheduled node with sufficient dedicated cores) can add significant latency costs and jitter (especially post Meltdown/Spectre), but failing to yield on an oversubscribed system (usually in a development environment) can also be very bad for performance. IMO this is not something that should be completely hidden from the user, who has more information about the execution context of their system than we'll ever have programmatically.

Berkeley UPC uses gasnett_cpu_count() to detect oversubscription at startup and defaults to enabling polite yield-based blocking with a warning like:

WARNING: Host pcp-d-6 running more threads (64) than there are physical CPU's (16)
         enabling "polite", low-performance synchronization algorithms

Even this is not sufficient for all cases, because sometimes a system can be oversubscribed through the action of several competing and non-cooperating jobs. Consequently, this behavior can also be overridden in the spawner or environment, for example: see upcrun documentation on -polite-sync.
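A rough sketch of this kind of startup check, using sysconf as a stand-in for gasnett_cpu_count() (an assumption - the real GASNet call may count CPUs differently), with illustrative warning text:

```cpp
#include <unistd.h>
#include <cstdio>

// Compare the number of co-located processes against the online CPU
// count and decide whether to enable "polite" yield-based blocking.
// Returns true when the host appears oversubscribed.
bool detect_oversubscription(long procs_on_host) {
  long cpus = sysconf(_SC_NPROCESSORS_ONLN);
  if (cpus > 0 && procs_on_host > cpus) {
    std::fprintf(stderr,
        "WARNING: running more processes (%ld) than there are CPUs (%ld)\n"
        "         enabling \"polite\", low-performance synchronization\n",
        procs_on_host, cpus);
    return true;
  }
  return false;
}
```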

Comments (15)

  1. Dan Bonachea

For reference, MPI implementations face this exact same issue inside blocking calls. Here is the relevant Open MPI documentation on how they handle it - they call the two execution modes "Aggressive" and "Degraded", use some heuristics to detect overcommit and enable the right mode by default, and provide a documented manual override knob.

    This topic was discussed at the 7/18 Pagoda meeting.

The consensus was basically to mimic the BUPC behavior by default, with an optional build-time override that would statically remove the branch and force the never-yield case. I.e.:

    1. At startup, the UPC++ runtime in each process should query gasnett_cpu_count() and compare it to the co-located process count reported by gex_System_QueryHostInfo() (note this may be wider than local_team for various reasons).
    2. Default behavior: (no other settings)
      1. If the comparison indicates overcommit (more than one process per core), issue a warning to the console and dynamically enable "polite/degraded" mode, which always does sched_yield in calls to upcxx::progress() and in the spin loops inside any blocking calls like future::wait()
      2. Otherwise, the system defaults to assuming no overcommit and branches around the sched_yield calls
    3. If the user has set UPCXX_OVERSUBSCRIBED=0/1 or some equivalent upcxx-run argument in the runtime environment, honor that in the branches protecting the yield calls.
      1. UPCXX_OVERSUBSCRIBED=1 is recommended whenever the app starts up its own threads (e.g. via OpenMP) such that the total number of threads competing for cycles on a host exceeds the core count.
    4. If NEVERYIELD=1 (or something similar) is set in the build-time environment, compile away any branches in the progress engine statically and never call sched_yield.
      1. Still do the startup detection and warn if a NEVERYIELD binary is run in overcommit mode, because checking this costs nothing outside the critical path and could prevent catastrophic performance mistakes.
    5. Add some documentation explaining this behavior so users understand when and how to use these knobs.
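    The decision logic in the list above might look roughly like this; the UPCXX_NEVERYIELD macro name and the should_yield helper are illustrative placeholders, not the final knobs:

```cpp
#include <cstdlib>
#include <cstring>

// Sketch: runtime env-var override wins; otherwise fall back to the
// startup overcommit detection. With the (hypothetical) build-time
// UPCXX_NEVERYIELD macro defined, the branch compiles away entirely.
#ifndef UPCXX_NEVERYIELD
bool should_yield(bool overcommit_detected) {
  if (const char* env = std::getenv("UPCXX_OVERSUBSCRIBED"))
    return std::strcmp(env, "0") != 0;   // explicit user override
  return overcommit_detected;            // default: trust detection
}
#else
constexpr bool should_yield(bool) { return false; }  // never yield
#endif
```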

    John agreed to look into implementing this.

  2. Steven Hofmeyr

Be aware that sched_yield in Linux probably doesn't have any benefit for oversubscription, and calling it is a waste of time. At least, this was the case five years ago. If you're going to use it, I would suggest doing a few simple experiments to check its behavior. I suspect that if you're interested in making oversubscription more effective for blocking calls, you'll need to use something other than sched_yield. Here's the email chain in which I discussed the problem with Costin and Paul five years ago:
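    One of the "simple experiments" along these lines could simply time a burst of sched_yield calls to see what each one costs on a given kernel (this measures per-call overhead only, not scheduling behavior under contention; the helper name is illustrative):

```cpp
#include <sched.h>
#include <chrono>

// Average cost of one sched_yield() call, in nanoseconds, over a
// burst of iters calls on an otherwise idle core.
double yield_cost_ns(int iters) {
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) sched_yield();
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```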

  3. Steven Hofmeyr

    hmm, I attached the email chain as pdf to my email reply, but it didn't show up here. I'll forward it on separately.

  4. Steven Hofmeyr

    Also, on my Linux system (kernel 4.15), man sched_yield has this to say at the end:

       sched_yield() is intended for use with real-time scheduling policies
       (i.e., SCHED_FIFO or SCHED_RR).  Use of sched_yield() with
       nondeterministic scheduling policies such as SCHED_OTHER is
       unspecified and very likely means your application design is broken.
    
  5. Steven Hofmeyr

    Note that sched_compat_yield still exists on Cori, but it is set to 0 by default. And it requires root privilege to change that to 1.

  6. Dan Bonachea

    I'm interested to hear more details about Steve's experimental setup with this, and in particular how he's measuring the effect (or non-effect) of the call.

There's a good discussion of modern Linux scheduling policies here, and it still recommends sched_yield for our use case. According to the sources I can find, sched_yield behaves differently under CFS than under prior schedulers, but it definitely doesn't do "nothing". In particular, it should deschedule the running process and move it to or near the end of the run queue for its priority class. We are interested in the case where the competing processes are essentially forked copies of the same process, so they are all in the same priority class, and this should result in the CPU being shared among all our over-committed SPMD processes in a runnable state.

Note we do NOT expect the CPU utilization as reported by top and other tools to drop at all in "polite mode"! The total CPU utilization should still remain near 100% during wait loops - the difference is we allow overcommitted copies of ourselves to share the CPU at a finer granularity, reducing pathological self-interference and effective priority inversion. I.e., the idea is we are "yielding" the processor to other directly competing SPMD processes, NOT to the kernel idle state or even to other lower-priority user processes. It also goes without saying that yielding does not affect processor binding masks - it's probably a bad idea to mix core pinning with overcommit anyhow.

Here's a simple experiment I just performed on a (64-core) Cori Haswell node with Berkeley UPC and smp-conduit (so no network in the picture at all) using the very simple barrierperf.upc benchmark. I'm using manual fork launch instead of upcrun to bypass

    $  uname -a
    Linux nid00011 4.4.73-5.1_4.0.141-cray_ari_c #1 SMP Mon Mar 12 17:11:28 UTC 2018 (4.0.141) x86_64 x86_64 x86_64 GNU/Linux
    
    $ module list
    Currently Loaded Modulefiles:
      1) modules/3.2.10.6                                8) pmi/5.0.13                                     15) rca/2.2.16-6.0.5.0_15.34__g5e09e6d.ari         22) altd/2.0
      2) intel/18.0.1.163                                9) dmapp/7.1.1-6.0.5.0_49.8__g1125556.ari         16) atp/2.1.1                                      23) darshan/3.1.4
      3) craype-network-aries                           10) gni-headers/5.0.12-6.0.5.0_2.15__g2ef1ebc.ari  17) PrgEnv-intel/6.0.4                             24) upcxx/nightly
      4) craype/2.5.14                                  11) xpmem/2.2.4-6.0.5.1_8.18__g35d5e73.ari         18) craype-haswell                                 25) gcc/7.3.0
      5) cray-libsci/18.03.1                            12) job/2.2.2-6.0.5.0_8.47__g3c644b5.ari           19) cray-mpich/7.7.0                               26) Base-opts/2.4.123-6.0.5.0_11.2__g6460790.ari
      6) udreg/2.3.2-6.0.5.0_13.12__ga14955a.ari        13) dvs/2.7_2.2.65-6.0.5.2_16.2__gbec2cb0          20) bupc/2.26.0
      7) ugni/6.0.14-6.0.5.0_16.9__g19583bb.ari         14) alps/6.5.28-6.0.5.0_18.6__g13a91b6.ari         21) git/2.15.1
    
    $ which upcc
    /global/common/cori/ftg/upc/2.26.0/hsw/intel/PrgEnv-intel-6.0.4-18.0.1.163/runtime/inst/bin/upcc
    
    $ upcc -network smp -nopthreads barrierperf.upc 
    
    $ time env UPC_QUIET=1 GASNET_PSHM_NODES='32' ./barrierperf  
    running barrier perf test with 10000 iterations, 32 threads
    Total time:    0.013 sec  Avg Named Barrier latency:    1.275 us
    Total time:    0.013 sec  Avg Anon. Barrier latency:    1.278 us
    done.
    34.940u 13.568s 0:01.75 2771.4% 0+0k 0+0io 0pf+0w
    
    $ time env UPC_QUIET=1 GASNET_PSHM_NODES='64' ./barrierperf
    running barrier perf test with 10000 iterations, 64 threads
    Total time:    0.022 sec  Avg Named Barrier latency:    2.241 us
    Total time:    0.022 sec  Avg Anon. Barrier latency:    2.202 us
    done.
    202.708u 43.204s 0:04.40 5588.6%        0+0k 82+0io 267pf+0w
    
    $ time env UPC_QUIET=1 GASNET_PSHM_NODES='128' ./barrierperf  
    WARNING: Host nid00011 running more threads (128) than there are physical CPU's (64)
             enabling "polite", low-performance synchronization algorithms
    running barrier perf test with 10000 iterations, 128 threads
    Total time:    0.263 sec  Avg Named Barrier latency:   26.299 us
    Total time:    0.266 sec  Avg Anon. Barrier latency:   26.627 us
    done.
    90.308u 340.100s 0:07.18 5994.4%        0+0k 106+0io 97pf+0w
    
    $ time env UPC_QUIET=1 UPC_POLITE_SYNC=0 GASNET_PSHM_NODES='128' ./barrierperf
    WARNING: Host nid00011 running more threads (128) than there are physical CPU's (64)
             but setting UPC_POLITE_SYNC="0" in your environment has
             disabled "polite", low-performance synchronization algorithms
             Results of this run are not suitable for benchmarking
    running barrier perf test with 10000 iterations, 128 threads
    Total time:  350.608 sec  Avg Named Barrier latency: 35060.805 us
    Total time:  324.416 sec  Avg Anon. Barrier latency: 32441.605 us
    done.
    43342.080u 417.712s 11:24.46 6393.3%    0+0k 94+0io 122pf+0w
    

Note the barrier performance when crossing from fully-committed (64) to 2x-polite-overcommit (128) dropped by quite a bit more than the expected linear factor of two - more like a factor of 10 - i.e. yield-based polite scheduling DOES have a cost. This is partially due to extra context-switching overheads (significantly worse today thanks to Meltdown/Spectre), and probably also due to the fact that yielding does not achieve the ideal (for this test) FIFO schedule.

However, the big takeaway is that disabling the yielding and running 2x-"aggressive"-overcommit is worse by more than 1000x! This is due to what amounts to priority inversion, where a process that has just signalled a barrier interferes with its own forward progress by continuing to hold the CPU while waiting for a condition that cannot be signalled until the scheduler forcibly preempts it and allows the descheduled peer process to run. That is exactly the pathological behavior we are looking to avoid, and BUPC's stock polite-yielding code still appears to achieve this design goal as intended. It's conceivable that replacing yield with usleep(some-small-value) might improve this particular example somewhat further (by enforcing a closer-to-FIFO schedule), but that probably doesn't generalize to more complicated scenarios - in any case, the use of sched_yield is not intended to provide optimal behavior, just to avoid pathological behavior.

    Obviously batch scheduled systems like cori are not really the most important use case for overcommit, since on dedicated iron you probably want to run perfect-fit and no yielding. However I believe the same applies to the laptop OSs and development systems we care about. Our runtime is not fundamentally designed to perform "well" for overcommit, and I'd never recommend it for "real" production/high-performance runs, but polite yielding at least makes overcommit "usable" (notably for development).

  7. Steven Hofmeyr

Well, I experimented with this a while back. What I recall is that if you have two tasks running on one core, and one of them spins on yield while the other does some work, then both get 50% of the core - and it makes no difference whether or not polite sync is used. With sched_compat_yield, the spinning task gets 0%. Consequently, if you have an oversubscribed app (even without other apps), you get much worse performance if you don't enable sched_compat_yield, and polite-sync has no benefit at all. My experiments were on a bunch of single-server systems - you can find one of the papers discussing this here: http://upc.lbl.gov/publications/ipdps-iancu.pdf. In particular, take a look at figure 8: not using POSIX yield totally kills the performance of some oversubscribed applications. Note that these effects were seen with only one app oversubscribed (not sharing). Also note that when oversubscribing in a load-balanced way (e.g. 2 or 4 threads per core), there are sometimes big benefits to pinning threads.

    You wrote: "In particular, it should deschedule the running process and move it to or near the end of the run queue for its priority class." This is not how CFS works. The run queue is an rb tree ordered by runtime, so when sched_yield is called, during the context switch the task with the least runtime gets executed. So if a task calls sched_yield but it still has the lowest runtime, it will be the next task scheduled again. The only effect of the sched_yield call is to insert an additional context switch, nothing more. I'm not entirely sure how sched_compat_yield gets around this - I think it changes the runtime for the yielding task so that it's no longer the leftmost task.

I think I can see why spinning on yield makes a difference in your barrier benchmark, since each yield forces a context switch, and the threads are all at pretty much the same runtime, so we just get more frequent switching between threads and hence an earlier completion of the barrier. However, in practice, I don't think this will help at all - what applications spin on barriers? Certainly, I've never found polite sync to have any beneficial impact on performance for any actual applications. So what I'm saying is that if you want to do this properly, think of a mechanism other than using sched_yield, e.g. try something like what OpenMP does, with a spin then a sleep. What I'd like to see is that when you oversubscribe with upcxx, you get the best possible performance. We have an opportunity to do this if we use some OpenMP-like mechanisms; just inserting sched_yield will for all practical purposes give no benefits.
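    An OpenMP-style "spin then sleep" backoff of the kind suggested here might look roughly like this; the thresholds and names are illustrative assumptions, not a proposed implementation:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch: spin briefly for the low-latency case, then back off to
// timed sleeps with an exponentially growing (capped) nap so an
// oversubscribed waiter stops competing for the CPU.
void backoff_wait(std::atomic<bool>& done) {
  int spins = 0;
  auto nap = std::chrono::microseconds(1);
  while (!done.load(std::memory_order_acquire)) {
    if (++spins < 1000) continue;          // phase 1: pure spin
    std::this_thread::sleep_for(nap);      // phase 2: timed sleep
    if (nap < std::chrono::microseconds(256))
      nap *= 2;                            // exponential backoff, capped
  }
}
```

    Note the timed-sleep phase trades wakeup latency for reduced CPU contention, which is exactly the knob OpenMP runtimes expose via their wait-policy settings.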

    To give some idea of the philosophy behind the CFS approach to sched_yield, take a look here: https://lkml.org/lkml/2007/9/19/328. Ingo Molnar (CFS author), has this to say about using sched_yield:

    The correct way to tell the kernel that the task is blocked is to use futexes for example, or any kernel-based locking or wait object - there are myriads of APIs for these. (The only well-defined behavior of yield is for SCHED_FIFO/RR tasks - and that is fully preserved in CFS.)

  8. Dan Bonachea

    The correct way to tell the kernel that the task is blocked is to use futexes for example, or any kernel-based locking or wait object - there are myriads of APIs for these.

    These typical counter-arguments to sched_yield on default-scheduled processes do NOT apply to our situation - because most/all(?) of the kernel-based wait objects they offer as "alternatives" involve either a timer (for an interval that in general is completely unpredictable) or a producer/consumer relationship where both sides can and do talk to the kernel. This doesn't work at all when the "producer" side of the relationship is an RDMA write to physical memory from a NIC on the I/O bus, which completely avoids kernel interaction by design. This is the most compelling situation where we really want a voluntary "one-sided" deschedule operation because there really is no other option except burning CPU to watch the memory hierarchy.

    The other POSIX scheduling policies and renice operations for dynamically adjusting priority (in BOTH directions) all appear to require root privilege to invoke (right?), so again unfortunately not applicable for use by general HPC users running on professionally-administered supercomputers.

I've run a floating point benchmark on Cori that appears to confirm Steve's assertion that overcommitted "idle" threads calling sched_yield() for polite-mode sync in a UPC barrier still consume a fraction of the CPU roughly equal to that of the computing processes. The results imply that processes spamming sched_yield are being scheduled frequently even while compute-intensive processes are being preempted as they try to compute more - which is not the ideal behavior we'd want.

Here is the 2011 Linux kernel commit that removed sched_compat_yield - corresponding to release 2.6.39. Cori compute nodes appear to be running a fork of 4.4.73, so the sched_compat_yield feature is probably gone, despite the fact that /proc/sys/kernel/sched_compat_yield still exists for some reason.

The same commit also introduced a yield_task_fair function in the scheduler that sounds like it might implement the behavior we want - unfortunately, I don't see how a process can invoke it from user mode. Perhaps I'm just misunderstanding what it does.

  9. Dan Bonachea

Measurements with my floating point benchmark on macOS High Sierra (10.13) show that UPCR's sched_yield polite mode provides the expected behavior - namely, adding many "idle" overcommitted threads that just spin-yield in a barrier does not significantly impact the computational throughput of worker threads, whereas disabling the sched_yield()/polite-sync with no other changes has a large negative impact on worker performance.

So while we may not yet have a good high-performance solution for overcommit on Linux, the UPCR approach seems to work quite well on macOS, which is probably the primary development platform of our target users and thus arguably the most important target for reasonable overcommit behavior. FWIW, measurements show the same good/expected sched_yield behavior on both Cygwin 2.9 and Windows Subsystem for Linux (WSL build 17134, which is user-space Linux with kernel emulation, thus using the Microsoft process scheduler).

    IMHO these results alone are sufficient to motivate this work, even if we end up disabling sched_yield calls on Linux.

  10. Dan Bonachea

    This was resolved in pull request #116 at a6d23f1, and is effective at solving overcommit problems on macOS.

    Concerns regarding the Linux CFS scheduler ignoring sched_yield calls remain, but there appears to be nothing further we can do about that at user level.

    Any future performance defects along these lines should be opened as new issues.
