Intermittent rpc-ctor-trace failure on copy-get-d2d(as_rpc)
We now have two instances of an intermittent rpc-ctor-trace failure on EX-dirac-ibv-nvidia-opt, using pgi 21.5:
- https://gasnet-bugs.lbl.gov/tests/detail.php?testname=external-upcxx/rpc-ctor-trace-par&date=2021-06-29&branch=ALL
- https://gasnet-bugs.lbl.gov/tests/detail.php?testname=external-upcxx/rpc-ctor-trace-par&date=2021-07-18&branch=ALL
After the first failure I could not reproduce it or find an explanation based on code inspection, so we chalked it up to a hardware fault and I added extra redundant checks in the test code to detect data corruption.
The harness config should not have changed significantly during this interval, and the only changes to UPC++ 8a2939d..ea21e24 that are potentially relevant are the changes I made to the test to detect corruption.
In both cases the failure mode output is identical, including the failing process rank, with the only difference being the line number difference due to the code changes in the interim. Sadly, the new checking code did NOT trigger to give us a clue what is happening:
copy-get-d2d: as_rpc(Fn&)&& -> (line 899)
T::ctors = 3
T::copies = 0
T::moves = 1
T::dtors = 4
copy-get-d2d: as_rpc(Fn&&)&& -> (line 904)
T::ctors = 3
T::copies = 0
T::moves = 3
T::dtors = 6
[5] ERROR: failed check: moves == expected_moves
copy-get-d2d: as_rpc(Fn&)&& ->: moves=2 expected=1 (line 899)
Test result: SUCCESS
Test result: ERROR
This error does not in itself necessarily represent a semantic defect, as this test is measuring a behavioral property that is not part of the specified library semantics. However the test does not perform data validation (aside from now asserting the RPC correctly runs exactly once per process), so it's conceivable (though seems unlikely) that the copy payload did not arrive correctly.
The unexpected move operation may in fact partially depend on underspecified behavior in the C++ compiler, although if that's the explanation then this would be the first behavioral divergence we've seen despite running this test thousands of times across multiple compiler families and versions. It also makes very little sense that the behavior would be both non-deterministic and (so far) only observed on a single process of an 8-process 4-ppn job.
The failure mode corresponds to an unexpected move operation on the lvalue function object in the completion object for a "get-like" device-to-device copy(remote_cx::as_rpc(fn)
using reference memory kinds. In both cases the failure occurred in opt/PAR mode, but this test notably does not create any personas or threads (all activity occurs on the master persona). One notable aspect of this copy case is the completion actually runs on the initiator, so the completion is never sent on the wire, it is synchronously serialized/deserialized into the heap during initiation, and later enlisted for execution after the transfer is complete.
Comments (7)
-
-
reporter - changed status to wontfix
This failure mode has not recurred in nightly testing since 2021-07-18, BUT the tester was retired after 2021-08-15 because newer NVHPC 21.7 has dropped codegen support for the dirac CPU.
AFAIK this defect has never been seen on any other config, so we are "optimistically" closing this issue under the assumption the misbehavior was somehow specific to NVHPC 21.5 and dirac.
-
reporter - changed status to open
I've now seen this intermittent failure more than once via GitLab on summit, most recently with GCC 11.1.0 and par/debug/smp:
copy-get-d2d: as_rpc(Fn&&)&& -> (line 893) T::ctors = 3 T::copies = 0 T::moves = 4 T::dtors = 7 [0] ERROR: failed check: moves == expected_moves copy-get-d2d: as_rpc(Fn&&)&& ->: moves=4 expected=3 (line 893) Test result: ERROR (rank 0/4: h33n08) Fri Feb 18 18:43:40 PST 2022
and GCC 11.1.0 and seq/debug/smp:
copy-get-d2d: as_rpc(Fn&)&& -> (line 888) T::ctors = 3 [1] ERROR: failed check: moves == expected_moves copy-get-d2d: as_rpc(Fn&)&& ->: moves=2 expected=1 (line 888) T::copies = 0 T::moves = 1 T::dtors = 4 copy-get-d2d: as_rpc(Fn&&)&& -> (line 893) [1] ERROR: failed check: moves == expected_moves copy-get-d2d: as_rpc(Fn&&)&& ->: moves=4 expected=3 (line 893) T::ctors = 3 T::copies = 0 T::moves = 3 T::dtors = 6 Test result: SUCCESS (rank 0/4: h33n12) Test result: ERROR (rank 1/4: h33n12)
In all cases so far the failure is again for a single rank out of several, and these being smp-conduit there is not even an asynchronous network in the picture. There is still some asynchrony arising from the device interaction, but I'm baffled as to where that might affect the loopback remote_cx move count for a copy-get-d2d in a single-threaded test.
-
reporter - changed milestone to 2022.3.0 release
- changed title to Intermittent rpc-ctor-trace failure on copy-get-d2d(as_rpc)
The compiler and network no longer seem relevant to this defect.
The problem seems more prevalent on summit, so at some point I should use it to track this down.
-
reporter With further manual poking on summit using seq/debug/smp I've also seen this failure mode in the copy-get-h2d subtest:
[0] ERROR: failed check: moves == expected_moves copy-get-h2d: as_rpc(Fn&)&& ->: moves=2 expected=1
-
reporter I've finally tracked this down.
The non-determinism arises from a combination of the asynchrony of the CUDA memcpy operations invoked by reference kinds, and the fact the runtime polls completion of those device operations (via
burst_device()
) from two distinct contexts:- The
burst_device()
call withinupcxx::progress(user)
executes in a "user burstable" state that allows synchronous execution ofas_rpc
callbacks (which are only permitted to run during user progress), whereas - the
burst_device()
call insidegasnet::after_gasnet()
(most likely reached viacopy() -> send_am_*() -> gasnet::after_gasnet() -> burst_device()
) executes in a context with internal-only progress whereas_rpc
callbacks are not permitted to run synchronously (regardless of what personas are present).
Commenting out the latter
burst_device()
invocation from internal progress reliably hides the problem.The failure mode is very unlikely because this particular sub-test spends very little time inside internal-only progress, so the inter-rank timing has to line up perfectly to show the problem.
The contextual difference affects the outcome of the
persona_tls::is_burstable()
branch insidepersona_tls::during(backend::master, progress_level::user, remote_cx)
which is invoked inside theburst_device()
callback for completion of the final CUDA memcpy operation. For the common case ofburst_device()
running inside user progress, the branch is true and the as_rpc completion callback executes synchronously. Otherwise forburst_device()
outside user progress the branch is false and the as_rpc completion callback is pushed onto the master persona's self LPC queue, incurring the "extra" move operation. Forcing this branch to false reliably shows an extra move (in many sub-tests).This failure mode does not arise for the
copy-loop-*
sub-tests only because those sub-tests do not invoke any calls that reachgasnet::after_gasnet()
internal progress within their barrier phase (nor doesbarrier()
). Inserting some gratuitousupcxx::progress(progress_level::internal)
calls betweencopy()
and the completion loop of acopy-loop
subtest reveals the same "extra move" failure mode.Proposed resolution appears in PR 406, which raises the expected move counts for affected
copy-get-d(as_rpc)
operations to allow the non-deterministic "extra move". - The
-
reporter - changed status to resolved
issue 494: Adjust expected move counts for copy-get-*2d(as_rpc)
Resolves issue
#494→ <<cset 10fa0c4b8654>>
- Log in to comment
Looking back at what changed in testing, I came to discover that my previous work to move this tester off of the questionable hardware was incomplete (but I've resolved that now). So the "hardware fault" explanation might still be valid IF we believe the same exact failure could occur twice completely at random.
In the future, the questionable hardware will no longer be used.