Intermittent rpc-ctor-trace failure on copy-get-d2d(as_rpc)

Issue #494 resolved
Dan Bonachea created an issue

We now have two instances of an intermittent rpc-ctor-trace failure on EX-dirac-ibv-nvidia-opt, using pgi 21.5:

After the first failure I could not reproduce it or find an explanation based on code inspection, so we chalked it up to a hardware fault and I added extra redundant checks in the test code to detect data corruption.

The harness config should not have changed significantly during this interval, and the only changes to UPC++ 8a2939d..ea21e24 that are potentially relevant are the changes I made to the test to detect corruption.

In both cases the failure mode output is identical, including the failing process rank, with the only difference being the line number difference due to the code changes in the interim. Sadly, the new checking code did NOT trigger to give us a clue what is happening:

copy-get-d2d: as_rpc(Fn&)&& ->                      (line 899)
  T::ctors = 3
  T::copies = 0
  T::moves = 1
  T::dtors = 4

copy-get-d2d: as_rpc(Fn&&)&& ->                     (line 904)
  T::ctors = 3
  T::copies = 0
  T::moves = 3
  T::dtors = 6

[5] ERROR: failed check: moves == expected_moves
copy-get-d2d: as_rpc(Fn&)&& ->: moves=2 expected=1  (line 899)
Test result: SUCCESS
Test result: ERROR

This error does not in itself necessarily represent a semantic defect, as this test is measuring a behavioral property that is not part of the specified library semantics. However the test does not perform data validation (aside from now asserting the RPC correctly runs exactly once per process), so it's conceivable (though seems unlikely) that the copy payload did not arrive correctly.

The unexpected move operation may in fact partially depend on underspecified behavior in the C++ compiler, although if that's the explanation then this would be the first behavioral divergence we've seen despite running this test thousands of times across multiple compiler families and versions. It also makes very little sense that the behavior would be both non-deterministic and (so far) only observed on a single process of an 8-process 4-ppn job.

The failure mode corresponds to an unexpected move operation on the lvalue function object in the completion object for a "get-like" device-to-device copy(remote_cx::as_rpc(fn) using reference memory kinds. In both cases the failure occurred in opt/PAR mode, but this test notably does not create any personas or threads (all activity occurs on the master persona). One notable aspect of this copy case is the completion actually runs on the initiator, so the completion is never sent on the wire, it is synchronously serialized/deserialized into the heap during initiation, and later enlisted for execution after the transfer is complete.

Comments (7)

  1. Paul Hargrove

    Looking back at what changed in testing, I came to discover that my previous work to move this tester off of the questionable hardware was incomplete (but I've resolved that now). So the "hardware fault" explanation might still be valid IF we believe the same exact failure could occur twice completely at random.

    In the future, the questionable hardware will no longer be used.

  2. Dan Bonachea reporter

    This failure mode has not recurred in nightly testing since 2021-07-18, BUT the tester was retired after 2021-08-15 because newer NVHPC 21.7 has dropped codegen support for the dirac CPU.

    AFAIK this defect has never been seen on any other config, so we are "optimistically" closing this issue under the assumption the misbehavior was somehow specific to NVHPC 21.5 and dirac.

  3. Dan Bonachea reporter
    • changed status to open

    I've now seen this intermittent failure more than once via GitLab on summit, most recently with GCC 11.1.0 and par/debug/smp:

    copy-get-d2d: as_rpc(Fn&&)&& ->                     (line 893)
      T::ctors = 3
      T::copies = 0
      T::moves = 4
      T::dtors = 7
    
    [0] ERROR: failed check: moves == expected_moves
    copy-get-d2d: as_rpc(Fn&&)&& ->: moves=4 expected=3     (line 893)
    Test result: ERROR (rank 0/4: h33n08)
    Fri Feb 18 18:43:40 PST 2022
    

    and GCC 11.1.0 and seq/debug/smp:

    copy-get-d2d: as_rpc(Fn&)&& ->                      (line 888)
      T::ctors = 3
    [1] ERROR: failed check: moves == expected_moves
    copy-get-d2d: as_rpc(Fn&)&& ->: moves=2 expected=1  (line 888)
      T::copies = 0
      T::moves = 1
      T::dtors = 4
    
    copy-get-d2d: as_rpc(Fn&&)&& ->                     (line 893)
    [1] ERROR: failed check: moves == expected_moves
    copy-get-d2d: as_rpc(Fn&&)&& ->: moves=4 expected=3     (line 893)
      T::ctors = 3
      T::copies = 0
      T::moves = 3
      T::dtors = 6
    
    Test result: SUCCESS (rank 0/4: h33n12)
    Test result: ERROR (rank 1/4: h33n12)
    

    In all cases so far the failure is again for a single rank out of several, and these being smp-conduit there is not even an asynchronous network in the picture. There is still some asynchrony arising from the device interaction, but I'm baffled as to where that might affect the loopback remote_cx move count for a copy-get-d2d in a single-threaded test.

  4. Dan Bonachea reporter

    With further manual poking on summit using seq/debug/smp I've also seen this failure mode in the copy-get-h2d subtest:

    [0] ERROR: failed check: moves == expected_moves
    copy-get-h2d: as_rpc(Fn&)&& ->: moves=2 expected=1      
    
  5. Dan Bonachea reporter

    I've finally tracked this down.

    The non-determinism arises from a combination of the asynchrony of the CUDA memcpy operations invoked by reference kinds, and the fact the runtime polls completion of those device operations (via burst_device()) from two distinct contexts:

    1. The burst_device() call within upcxx::progress(user) executes in a "user burstable" state that allows synchronous execution of as_rpc callbacks (which are only permitted to run during user progress), whereas
    2. the burst_device() call inside gasnet::after_gasnet() (most likely reached via copy() -> send_am_*() -> gasnet::after_gasnet() -> burst_device()) executes in a context with internal-only progress where as_rpc callbacks are not permitted to run synchronously (regardless of what personas are present).

    Commenting out the latter burst_device() invocation from internal progress reliably hides the problem.

    The failure mode is very unlikely because this particular sub-test spends very little time inside internal-only progress, so the inter-rank timing has to line up perfectly to show the problem.

    The contextual difference affects the outcome of the persona_tls::is_burstable() branch inside persona_tls::during(backend::master, progress_level::user, remote_cx) which is invoked inside the burst_device() callback for completion of the final CUDA memcpy operation. For the common case of burst_device() running inside user progress, the branch is true and the as_rpc completion callback executes synchronously. Otherwise for burst_device() outside user progress the branch is false and the as_rpc completion callback is pushed onto the master persona's self LPC queue, incurring the "extra" move operation. Forcing this branch to false reliably shows an extra move (in many sub-tests).

    This failure mode does not arise for the copy-loop-* sub-tests only because those sub-tests do not invoke any calls that reach gasnet::after_gasnet() internal progress within their barrier phase (nor does barrier()). Inserting some gratuitous upcxx::progress(progress_level::internal) calls between copy() and the completion loop of a copy-loop subtest reveals the same "extra move" failure mode.

    Proposed resolution appears in PR 406, which raises the expected move counts for affected copy-get-d(as_rpc) operations to allow the non-deterministic "extra move".

  6. Log in to comment