intermittent lpc-stress/opt failures on ARM64

Issue #479 resolved
Dan Bonachea created an issue

GitLab testing has shown intermittent hangs of the lpc-stress test on wombat, so far only for seq-opt-ibv. The timeout failures look like this (complete output from the logs):

Tue May 25 23:58:38 PDT 2021
++ eval echo
+++ echo
+ /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-ibv
Test: lpc-stress.cpp
Ranks: 4
iters = 10000 threads = 10 WORDS = 64
*** Caught a signal (proc 3): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)
*** Caught a signal (proc 0): SIGTERM(15)
*** Caught a signal (proc 1): SIGTERM(15)
Wed May 26 00:03:40 PDT 2021

I've seen this intermittently in GitLab runs, including the following:

What we know:

  • These failures span compilers: clang/12.0.0-gcc1020, clang/4.0.0-gcc640, gcc6.4.0. All of these are compilers built by Paul, but (especially given the similar behavior across them) I have no reason to suspect the compiler.
  • Failures occur both with and without CUDA support built into UPC++.
  • I think I've only seen it fail on seq-opt, despite the fact that every GitLab job runs all four combinations of {seq,par}-{opt,debug}-ibv.
  • The failures above are on PR branches doing work on upcxx::copy(), simply because that's what I've been running lately; the test itself makes no copy calls, so the PR branch changes are almost certainly irrelevant.

This test is almost entirely LPC-based: it spawns threads that talk to each other via LPC. The only inter-process communication is a barrier at startup and another before exit, so the test should be almost entirely independent of conduit. The test is also written to be independent of seq/par mode, and since the "meat" of the test doesn't touch the backend or GASNet, I don't have a theory for why a failure would show up only in seq mode.
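For illustration only, here is a minimal, hypothetical sketch of the kind of thread-to-thread LPC traffic the test generates. This is NOT the actual lpc-stress.cpp source; the thread/iteration counts and the ring-shaped communication pattern are made up for the example:

#include <upcxx/upcxx.hpp>
#include <atomic>
#include <thread>
#include <vector>

int main() {
  upcxx::init();

  constexpr int nthreads = 4;     // illustrative; the real test uses 10 threads
  constexpr int iters    = 1000;  // illustrative; the real test uses 10000 iters
  std::vector<upcxx::persona*> personas(nthreads, nullptr);
  std::atomic<int> ready{0}, done{0};

  auto worker = [&](int me) {
    personas[me] = &upcxx::default_persona();  // publish this thread's persona
    ready++;
    while (ready.load() != nthreads) { /* wait until every persona is published */ }

    std::atomic<int> acks{0};
    const int peer = (me + 1) % nthreads;
    for (int i = 0; i < iters; i++) {
      // Enqueue a "ping" LPC onto the peer's persona -- a multi-producer
      // enqueue onto that thread's LPC queue. The peer executes it from
      // inside upcxx::progress() and replies with an "ack" LPC.
      personas[peer]->lpc_ff([&acks, me, &personas]() {
        personas[me]->lpc_ff([&acks]() { acks++; });
      });
      upcxx::progress();  // drain LPCs other threads have aimed at this thread
    }
    while (acks.load() < iters) upcxx::progress();     // wait for all acks
    done++;
    while (done.load() != nthreads) upcxx::progress(); // keep serving peers
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; t++) threads.emplace_back(worker, t);
  for (auto &th : threads) th.join();

  upcxx::barrier();  // the only inter-process communication, as in the real test
  upcxx::finalize();
}

Every lpc_ff above is a concurrent enqueue onto a single consumer's queue, which is exactly the MPSC data structure discussed below.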

The test is designed to identify memory races in the LPC queues, which have been speculated to exist on non-x86_64 architectures (including wombat's ARM processors). The hangs might indicate such a failure has been found, but there's no way to tell for sure without a lot more information, such as reproducing the problem in a debugger.

So far I have NOT been able to reproduce this manually on wombat at all, even when using the same compiler and configure line used by GitLab CI and performing thousands of trials. Those configs notably disable ODP and run dual-rail; I've also tried varying those dimensions, as well as the conduit (smp, udp, ibv), to no avail.

Comments (8)

  1. Dan Bonachea reporter
    • edited description

    Description edited to add new failures seen on smp and udp conduits.

    Example output from https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46243 :

    Thu May 27 16:50:31 PDT 2021
    ++ eval echo
    +++ echo
    + /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.udp/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-udp
    WARNING: Using GASNet's udp-conduit, which exists for portability convenience.
    WARNING: Support was detected for native GASNet conduits: ibv
    WARNING: You should *really* use the high-performance native GASNet conduit
    WARNING: if communication performance is at all important in this program run.
    Test: lpc-stress.cpp
    Ranks: 4
    iters = 10000 threads = 10 WORDS = 64
    *** Caught a signal (proc 0): SIGTERM(15)
    *** Caught a signal (proc 2): SIGTERM(15)
    *** Caught a signal (proc 1): SIGTERM(15)
    *** Caught a signal (proc 3): SIGTERM(15)
    Thu May 27 16:55:31 PDT 2021
    
    Thu May 27 16:45:29 PDT 2021
    ++ eval echo
    +++ echo
    + /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.smp/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-smp
    WARNING: ignoring GASNET_SUPERNODE_MAXSIZE for smp-conduit with PSHM.
    Test: lpc-stress.cpp
    Ranks: 4
    iters = 10000 threads = 10 WORDS = 64
    *** Caught a signal (proc 0): SIGTERM(15)
    Thu May 27 16:50:31 PDT 2021
    
  2. Paul Hargrove

    I have direct evidence that this is NOT a matter of oversubscribing cores. Below is a snapshot of test-lpc-stress-seq-opt-ibv running in a Wombat GitLab CI tester.
    It shows 1100% CPU, consistent with 11 concurrent threads, TIME+ advancing much faster than wall-clock time, and hwloc shows the processes are NOT bound.

       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    161997 phargrov  20   0 3973952 164288 147072 R  1100   0.1  10:09.95 test-lpc-stress
    161998 phargrov  20   0 3973952 164224 147008 R  1100   0.1  10:09.95 test-lpc-stress
    {phargrov@wombat8 ~}$ hwloc-bind --get --pid 161998
    0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
    {phargrov@wombat8 ~}$ hwloc-bind --get --pid 161997
    0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
    
  3. Dan Bonachea reporter

    I've now run over 100 GitLab trials of this test using ibv-conduit and all four build configs across all four compilers, where the only difference from the GitLab runs demonstrating the problem is the addition of configure option --with-mpsc-queue=biglock:

    https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/pipelines/5906/builds

    Ignoring one job that was inadvertently cancelled, ALL of the remaining jobs passed without hang or error, whereas the same setup with the default MPSC queues has been observed to hang a large fraction of the time.

    This appears to be strong evidence that the default MPSC queues, already suspected to be the root cause, are at least a contributing factor to the observed failures.
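    For intuition about what this experiment swaps in, here is a rough, illustrative sketch of a "big lock" LPC queue. This is NOT the actual UPC++ implementation, just the general shape of a mutex-serialized substitute for the lock-free MPSC queue:

    #include <deque>
    #include <functional>
    #include <mutex>

    // One queue per consumer thread/persona; any thread may enqueue.
    struct biglock_lpc_queue {
      std::mutex lock;
      std::deque<std::function<void()>> q;

      void enqueue(std::function<void()> fn) {   // called by producer threads
        std::lock_guard<std::mutex> g(lock);
        q.push_back(std::move(fn));
      }

      bool burst() {                             // called by the single consumer
        std::deque<std::function<void()>> batch;
        { std::lock_guard<std::mutex> g(lock); batch.swap(q); }
        for (auto &fn : batch) fn();
        return !batch.empty();
      }
    };

    Because every enqueue and dequeue is serialized by one mutex, this variant cannot exhibit the memory-ordering or cache-line-sharing pathologies of a lock-free queue, which is why its clean passing record points at the default (atomic) queue implementation.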

  4. Dan Bonachea reporter

    Fix issue 479: intermittent lpc-stress/opt failures on ARM64

    Turns out this was not actually a hang, but rather a pathological slow-down caused by thrashing the memory system.

    After reproducing the slow-down manually on a wombat login node (Cavium ThunderX2), I used gdb to confirm the master thread was spending most of its time trying to enqueue LPC acknowledgements to producer threads that were spinning on burst (dequeue).

    Inserting padding to separate the head and tail pointers in the atomic MPSC queues seems to reliably solve the problem.

    Even on architectures where this wasn't causing a critical slow-down, it's likely this defect was degrading the performance of our multi-threaded LPC queues any time they exceeded one element. The queue algorithm was designed to allow the head and tail to proceed independently, but with both pointers on the same cache line, false sharing between the producers and the consumer was likely causing many unnecessary coherence misses.
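    For illustration, the shape of the fix (this is NOT the actual UPC++ queue code, and the 64-byte cache-line size is an assumption):

    #include <atomic>

    struct lpc_node { std::atomic<lpc_node*> next{nullptr}; /* ... payload ... */ };

    // Before: the producer-side and consumer-side pointers share a cache line,
    // so every producer enqueue (atomic exchange on tail) and every consumer
    // dequeue (walking head) invalidates the same line in the other's cache.
    struct mpsc_queue_unpadded {
      std::atomic<lpc_node*> tail;  // touched by many producer threads
      lpc_node*              head;  // touched only by the single consumer
    };

    // After: pad/align so each pointer owns its own cache line.  (Where
    // available, std::hardware_destructive_interference_size is the portable
    // spelling of the 64 assumed here.)
    struct mpsc_queue_padded {
      alignas(64) std::atomic<lpc_node*> tail;
      alignas(64) lpc_node*              head;
    };

    With the pointers on separate lines, producers hammering tail no longer force the consumer's reads of head to miss (and vice versa), so the enqueue and dequeue paths really can proceed independently as the algorithm intended.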

    → <<cset e6dc7ff19986>>
