intermittent lpc-stress/opt failures on ARM64

Issue #479 resolved
Dan Bonachea created an issue

GitLab testing has shown intermittent hangs of the lpc-stress test on wombat, so far only for seq-opt-ibv. The timeout failures look like this (complete output from the logs):

Tue May 25 23:58:38 PDT 2021
++ eval echo
+++ echo
+ /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.ibv/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-ibv
Test: lpc-stress.cpp
Ranks: 4
iters = 10000 threads = 10 WORDS = 64
*** Caught a signal (proc 3): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)
*** Caught a signal (proc 0): SIGTERM(15)
*** Caught a signal (proc 1): SIGTERM(15)
Wed May 26 00:03:40 PDT 2021

I've seen this intermittently in GitLab runs, including the following:

What we know:

  • These failures span compilers: clang/12.0.0-gcc1020, clang/4.0.0-gcc640, gcc6.4.0. All of these are compilers built by Paul, but (especially given the similar behavior across them) I have no reason to suspect the compiler.
  • Failures occur both with and without CUDA support built into UPC++.
  • I think I've only seen it fail on seq-opt, despite the fact that every GitLab job runs all four combinations of {seq,par}-{opt,debug}-ibv.
  • The failures above are on PR branches doing work on upcxx::copy(), simply because that's what I've been running lately; the test itself makes no copy calls, so the PR branch changes are almost certainly irrelevant.

This test is almost entirely LPC-based: it spawns threads that talk to each other via LPC. The only inter-process communication is a barrier at startup and another before exit, so the test should be almost entirely independent of conduit. The test is also written to be independent of seq/par mode, and since the "meat" of the test doesn't touch the backend or GASNet, I don't have a theory for why a failure would show up only in seq mode.
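For illustration only, here is a minimal, hypothetical sketch of the kind of thread-to-thread LPC traffic the test generates. This is NOT the actual lpc-stress.cpp source; the thread/iteration counts and the ring-shaped communication pattern are made up for the example:

#include <upcxx/upcxx.hpp>
#include <atomic>
#include <thread>
#include <vector>

int main() {
  upcxx::init();

  constexpr int nthreads = 4;     // illustrative; the real test uses 10 threads
  constexpr int iters    = 1000;  // illustrative; the real test uses 10000 iters
  std::vector<upcxx::persona*> personas(nthreads, nullptr);
  std::atomic<int> ready{0}, done{0};

  auto worker = [&](int me) {
    personas[me] = &upcxx::default_persona();  // publish this thread's persona
    ready++;
    while (ready.load() != nthreads) { /* wait until every persona is published */ }

    std::atomic<int> acks{0};
    const int peer = (me + 1) % nthreads;
    for (int i = 0; i < iters; i++) {
      // Enqueue a "ping" LPC onto the peer's persona -- a multi-producer
      // enqueue onto that thread's LPC queue. The peer executes it from
      // inside upcxx::progress() and replies with an "ack" LPC.
      personas[peer]->lpc_ff([&acks, me, &personas]() {
        personas[me]->lpc_ff([&acks]() { acks++; });
      });
      upcxx::progress();  // drain LPCs other threads have aimed at this thread
    }
    while (acks.load() < iters) upcxx::progress();     // wait for all acks
    done++;
    while (done.load() != nthreads) upcxx::progress(); // keep serving peers
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; t++) threads.emplace_back(worker, t);
  for (auto &th : threads) th.join();

  upcxx::barrier();  // the only inter-process communication, as in the real test
  upcxx::finalize();
}

Every lpc_ff above is a concurrent enqueue onto a single consumer's queue, which is exactly the MPSC data structure discussed below.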

The test is designed to identify memory races in the LPC queues, which have been speculated to exist on non-x86_64 architectures (including wombat's ARM processors). The hangs might indicate such a failure has been found, but there's no way to tell for sure without a lot more information, such as reproducing the problem in a debugger.

So far I have NOT been able to reproduce this manually on wombat at all, even when using the same compiler and configure line used by GitLab CI and performing thousands of trials. Those configs notably disable ODP and run dual-rail; I've also tried varying those dimensions, as well as the conduit (smp, udp, ibv), to no avail.

Comments (8)

  1. Dan Bonachea reporter
    • edited description

    Description edited to add new failures seen on smp and udp conduits.

    Example output from https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/jobs/46243 :

    Thu May 27 16:50:31 PDT 2021
    ++ eval echo
    +++ echo
    + /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.udp/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-udp
    WARNING: Using GASNet's udp-conduit, which exists for portability convenience.
    WARNING: Support was detected for native GASNet conduits: ibv
    WARNING: You should *really* use the high-performance native GASNet conduit
    WARNING: if communication performance is at all important in this program run.
    Test: lpc-stress.cpp
    Ranks: 4
    iters = 10000 threads = 10 WORDS = 64
    *** Caught a signal (proc 0): SIGTERM(15)
    *** Caught a signal (proc 2): SIGTERM(15)
    *** Caught a signal (proc 1): SIGTERM(15)
    *** Caught a signal (proc 3): SIGTERM(15)
    Thu May 27 16:55:31 PDT 2021
    
    Thu May 27 16:45:29 PDT 2021
    ++ eval echo
    +++ echo
    + /autofs/nccs-svm1_envoy_od/phargrov/.gitlab-runner/builds/Xnp8CS4f/0/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.smp/bin/upcxx-run -np 4 -- timeout --foreground -k 420s 300s ./test-lpc-stress-seq-opt-smp
    WARNING: ignoring GASNET_SUPERNODE_MAXSIZE for smp-conduit with PSHM.
    Test: lpc-stress.cpp
    Ranks: 4
    iters = 10000 threads = 10 WORDS = 64
    *** Caught a signal (proc 0): SIGTERM(15)
    Thu May 27 16:50:31 PDT 2021
    
  2. Paul Hargrove

    I have direct evidence that this is NOT a matter of oversubscribing cores. Below is a snapshot of test-lpc-stress-seq-opt-ibv running in a Wombat GitLab CI tester.
    It shows 1100% CPU, consistent with 11 concurrent threads, TIME+ advancing much faster than wall-clock time, and hwloc shows the processes are NOT bound.

       PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    161997 phargrov  20   0 3973952 164288 147072 R  1100   0.1  10:09.95 test-lpc-stress
    161998 phargrov  20   0 3973952 164224 147008 R  1100   0.1  10:09.95 test-lpc-stress
    {phargrov@wombat8 ~}$ hwloc-bind --get --pid 161998
    0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
    {phargrov@wombat8 ~}$ hwloc-bind --get --pid 161997
    0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
    
  3. Dan Bonachea reporter

    I've now run over 100 GitLab trials of this test using ibv-conduit and all four build configs across all four compilers, where the only difference from the GitLab runs demonstrating the problem is the addition of configure option --with-mpsc-queue=biglock:

    https://gitlab-ci.alcf.anl.gov/anl/upcpp/-/pipelines/5906/builds

    Ignoring one job that was inadvertently cancelled, ALL of the remaining jobs passed without hang or error, whereas the same setup with the default MPSC queues has been observed to hang a large fraction of the time.

    This appears to be strong evidence that the default MPSC queues, already suspected to be the root cause, are at least a contributing factor to the observed failures.
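    For intuition about what this experiment swaps in, here is a rough, illustrative sketch of a "big lock" LPC queue. This is NOT the actual UPC++ implementation, just the general shape of a mutex-serialized substitute for the lock-free MPSC queue:

    #include <deque>
    #include <functional>
    #include <mutex>

    // One queue per consumer thread/persona; any thread may enqueue.
    struct biglock_lpc_queue {
      std::mutex lock;
      std::deque<std::function<void()>> q;

      void enqueue(std::function<void()> fn) {   // called by producer threads
        std::lock_guard<std::mutex> g(lock);
        q.push_back(std::move(fn));
      }

      bool burst() {                             // called by the single consumer
        std::deque<std::function<void()>> batch;
        { std::lock_guard<std::mutex> g(lock); batch.swap(q); }
        for (auto &fn : batch) fn();
        return !batch.empty();
      }
    };

    Because every enqueue and dequeue is serialized by one mutex, this variant cannot exhibit the memory-ordering or cache-line-sharing pathologies of a lock-free queue, which is why its clean passing record points at the default (atomic) queue implementation.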

  4. Dan Bonachea reporter

    Fix issue 479: intermittent lpc-stress/opt failures on ARM64

    Turns out this was not actually a hang, but rather a pathological slow-down caused by thrashing the memory system.

    After reproducing the slow-down manually on a wombat login node (Cavium ThunderX2), I used gdb to confirm the master thread was spending most of its time trying to enqueue LPC acknowledgements to producer threads that were spinning on burst (dequeue).

    Inserting padding to separate the head and tail pointers in the atomic MPSC queues seems to reliably solve the problem.

    Even on architectures where this wasn't causing a critical slow-down, it's likely this defect was degrading the performance of our multi-threaded LPC queues any time they exceeded one element. The queue algorithm was designed to allow the head and tail to proceed independently, but with both pointers on the same cache line, false sharing between the producers and the consumer was likely causing many unnecessary coherence misses.
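    For illustration, the shape of the fix (this is NOT the actual UPC++ queue code, and the 64-byte cache-line size is an assumption):

    #include <atomic>

    struct lpc_node { std::atomic<lpc_node*> next{nullptr}; /* ... payload ... */ };

    // Before: the producer-side and consumer-side pointers share a cache line,
    // so every producer enqueue (atomic exchange on tail) and every consumer
    // dequeue (walking head) invalidates the same line in the other's cache.
    struct mpsc_queue_unpadded {
      std::atomic<lpc_node*> tail;  // touched by many producer threads
      lpc_node*              head;  // touched only by the single consumer
    };

    // After: pad/align so each pointer owns its own cache line.  (Where
    // available, std::hardware_destructive_interference_size is the portable
    // spelling of the 64 assumed here.)
    struct mpsc_queue_padded {
      alignas(64) std::atomic<lpc_node*> tail;
      alignas(64) lpc_node*              head;
    };

    With the pointers on separate lines, producers hammering tail no longer force the consumer's reads of head to miss (and vice versa), so the enqueue and dequeue paths really can proceed independently as the algorithm intended.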

    → <<cset e6dc7ff19986>>
