hipMemcpyPeerAsync is not consistent with ROCmRDMA

Issue #524 new
Dan Bonachea created an issue

This issue serves as a brain-dump of my findings while trying to get UPC++'s local hipMemcpy*() calls to play nicely with ROCmRDMA offload as used by GASNet native conduits over InfiniBand.

Background

As of PR 397, the UPC++ runtime can effectively operate in one of two modes with respect to the AMD GPUs on a system:

  1. configure --enable-hip --disable-kind-hip
    • This mode globally activates UPC++ "reference HIP kind" support for ROCm HIP, where all upcxx::copy() operations involving HIP devices are explicitly staged through host memory by the appropriate process using hipMemcpy*(). The --disable-kind-hip option disables GASNet's memory kinds support, so the UPC++ runtime only ever passes host memory buffers to GASNet.
  2. configure --enable-hip --enable-kind-hip
    • This mode activates "native HIP kind" support for ROCm HIP, conditional on conduits that provide the required support (currently ibv, ucx). For those conduits we register device segments with GASNet and inter-process copy() operations favor passing device buffers directly to GASNet RMA calls (with appropriate endpoints), allowing ROCmRDMA offload. Conduits that lack GASNet memory kinds support get the same "reference HIP kind" treatment discussed in 1.

I say "favor" above because in "native HIP kind" mode there are some cases where UPC++ still ends up invoking hipMemcpy*(). This notably includes any copy() calls where the data motion takes place between segments with affinity to the same process; these cases do not involve interprocess data communication, so we invoke hipMemcpy*() in the hope that it will use the most efficient transfer path (which probably does not involve the network card, and possibly not even the PCI bus). This is the policy recommended by the GASNet documentation, and such "loopback" paths are not even supported by the current GASNet kinds implementation.

upcxx::copy() is a very general interface, and places no restrictions on the relative affinities of the caller and the buffers. The previous paragraph applies not only to cases where the copy() initiator == src.where() == dest.where() ("fully loopback"), but also to copy() initiator != src.where() == dest.where() ("3rd party loopback"). In the latter case, the process hosting the buffers performs a loopback data transfer at the behest of some other process that invoked copy().
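
For concreteness, here is a rough sketch of what the loopback hip-to-hip case boils down to on the process hosting the buffers. The function, device ids, pointers, stream and the CHECK error-check macro are all illustrative names, not the actual runtime code:

    #include <hip/hip_runtime.h>
    #include <cstdlib>

    // Illustrative error-check macro used in the sketches throughout this issue.
    #define CHECK(call) do { if ((call) != hipSuccess) std::abort(); } while (0)

    // Same-process hip-to-hip transfer: handed to HIP rather than to GASNet.
    // Our runtime currently issues the peer copy with the DESTINATION device current.
    void loopback_peer_copy(void *dst_buf, int dst_dev,
                            const void *src_buf, int src_dev,
                            size_t nbytes, hipStream_t stream) {
      CHECK(hipSetDevice(dst_dev));
      CHECK(hipMemcpyPeerAsync(dst_buf, dst_dev, src_buf, src_dev, nbytes, stream));
      // completion is later detected by querying/syncing an event recorded on `stream`
    }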

Problem

test/copy-cover is a very aggressive test that issues upcxx::copy() across a variety of buffers, attempting to exercise most of the cases in the underlying copy() implementation and validate that data is delivered correctly. It has detected data consistency violations involving "3rd party loopback" operations under certain conditions on an HPE Cray EX system with 4 AMD GPUs per node and UCX networking. The violation is detected in trials that perform the following operations:

  1. Process I invokes copy(private mem -> hip buffer A at process P) "ROCmRDMA put"
  2. Process I invokes copy(hip buffer A at process P -> hip buffer B at process P) "3rd party loopback"
    • where A and B have affinity to the same process P but DIFFERENT physical AMD GPUs
  3. Process I invokes copy(hip buffer B at process P -> private mem) "ROCmRDMA get"

Somewhere along that path we pick up incorrect values that are detected after they arrive back at the initiator in step 3.
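
In UPC++ terms, the failing trial boils down to something like the following. This is a sketch for illustration only, not the actual copy-cover code; gpA, gpB, host and N are illustrative names, and hip_device is the memory kind added by PR 397:

    #include <upcxx/upcxx.hpp>
    #include <cstddef>

    // Sketch of the failing trial, executed at initiator I (not the actual
    // copy-cover code). gpA and gpB are assumed to point into HIP device segments
    // with affinity to process P but resident on DIFFERENT physical GPUs; `host`
    // points to private host memory at I.
    void failing_trial(double *host, std::size_t N,
                       upcxx::global_ptr<double, upcxx::memory_kind::hip_device> gpA,
                       upcxx::global_ptr<double, upcxx::memory_kind::hip_device> gpB) {
      upcxx::copy(host, gpA, N).wait(); // 1. "ROCmRDMA put": I's private mem -> hip buffer A at P
      upcxx::copy(gpA, gpB, N).wait();  // 2. "3rd party loopback": A -> B, done by P via hipMemcpy*()
      upcxx::copy(gpB, host, N).wait(); // 3. "ROCmRDMA get": hip buffer B at P -> I's private mem
      // validation of host[] at I intermittently observes stale values from before step 1
    }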

Observations:

  1. AFAIK none of the other varied scenarios exercised by test/copy-cover have ever been seen to deliver incorrect results on the affected system/configuration.
  2. The failure mode ALWAYS includes a 3rd-party loopback hip-to-hip transfer between DIFFERENT physical devices on the same remote rank (HIP calls this a "peer copy").
  3. The incorrect values are never "garbage" - they are stale values from a previous step or kill write. This strongly suggests a memory consistency/completion race.
  4. At the UPC++ level this occurs with either operation_cx or remote_cx completions, with or without source buffer kill writes. This is strong evidence that the problem lies in the low-level transfer operations, and not somewhere higher in the UPC++ control flow.
  5. The problem is not deterministic, but occurs with relatively high frequency - at least once during each "round" of copy-cover. It seems most likely at 3-4 ranks (regardless of node boundaries), but has also been observed with only 2 ranks (-DHEAPS_PER_KIND=5 -DALLOCS_PER_HEAP=2). Node boundaries do not seem to matter, only process boundaries (when working around UCX's unrelated bug 4383).
  6. Requires UCX ROCmRDMA: the problem does not occur with the reference HIP kind (udp or UCX), nor with CUDA GPUDirect RDMA.
  7. Failures seem to always affect transfers under 32KB in size; smaller transfers seem more likely to fail.
  8. Failures are nearly always in the very first element (i.e. the whole buffer is probably stale).
  9. Adding ridiculous over-synchronization does not seem to affect the problem. Things I've tried include the following (see the sketch after this list):
    • Converting the "ROCmRDMA put" step into a "ROCmRDMA get" from the target
    • Initiating the transfer with the "current device" set to match the source or destination buffer
    • Performing the peer transfer using any of:
      • hipMemcpyPeerAsync(), hipMemcpyDtoDAsync(), hipMemcpyDtoD(), hipMemcpyPeer(), hipMemcpy(hipMemcpyDeviceToDevice), hipMemcpy(hipMemcpyDefault)
      • ... of which the last four are semantically blocking operations
    • Creating and syncing hipEventReleaseToSystem events on the streams of both devices
    • Using hipEventSynchronize() (instead of Query) to synchronize the events
    • Calling hipStreamSynchronize() on both devices before and after the transfer
    • Calling hipDeviceSynchronize() on both devices before and after the transfer
    • Adding sleep(1) calls around the transfer - this seemed to reduce the frequency of invalid outcomes but not eliminate the problem
    • Toggling hipDevice*PeerAccess - seemed to be a no-op, based on Query
    • Setting envvar HSA_FORCE_FINE_GRAIN_PCIE=1 - no apparent effect
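
For concreteness, the most heavily bracketed variant I tried looks roughly like the sketch below. This is again illustrative rather than the actual runtime code, and it reuses the CHECK macro from the earlier sketch:

    // Over-synchronized variant of the peer copy (sketch). Even with all of this
    // bracketing, the stale-data symptom persists.
    void oversynced_peer_copy(void *dst_buf, int dst_dev,
                              const void *src_buf, int src_dev,
                              size_t nbytes, hipStream_t stream) {
      // drain both devices before the transfer
      CHECK(hipSetDevice(src_dev));  CHECK(hipDeviceSynchronize());
      CHECK(hipSetDevice(dst_dev));  CHECK(hipDeviceSynchronize());

      // issue the peer copy (destination device current, as in our original code)
      CHECK(hipMemcpyPeerAsync(dst_buf, dst_dev, src_buf, src_dev, nbytes, stream));

      // record and wait on a system-scope release event on the copy stream
      hipEvent_t ev;
      CHECK(hipEventCreateWithFlags(&ev, hipEventReleaseToSystem));
      CHECK(hipEventRecord(ev, stream));
      CHECK(hipEventSynchronize(ev));   // blocking sync rather than hipEventQuery()
      CHECK(hipEventDestroy(ev));

      // drain both devices again after the transfer
      CHECK(hipSetDevice(src_dev));  CHECK(hipDeviceSynchronize());
      CHECK(hipSetDevice(dst_dev));  CHECK(hipDeviceSynchronize());
    }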

Theories:

The behavior is consistent with a read-after-write consistency violation somewhere along the problematic copy() path yielding stale values instead of updated ones. There are two places this could be happening:

  1. HIP device-to-device Peer copies are not observing prior ROCmRDMA writes to the source memory
  2. ROCmRDMA reading from memory on device B is not observing prior HIP peer copies from device A

Here "prior" is of course defined by the completion and consistency semantics of the operations involved (which are woefully under-documented in both cases). Further experiments with a more complicated test code might determine whether one or both hazards are implicated in the problem.
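
One such experiment might look roughly like the following sketch (which I have not actually run; it reuses the illustrative CHECK macro from earlier): replace the peer copy with an explicitly host-staged copy and validate the staging buffer in between, which would separate the two hazards.

    #include <vector>

    // Host-staged replacement for the peer copy (sketch, not tested).
    // If stage[] already holds stale data, the D2H read from device A missed the
    // prior ROCmRDMA write (hazard 1). If stage[] is correct but the subsequent
    // ROCmRDMA get from device B still returns stale data, the RDMA read missed
    // the H2D write (hazard 2).
    void staged_loopback_copy(void *dst_buf, int dst_dev,
                              const void *src_buf, int src_dev, size_t nbytes) {
      std::vector<char> stage(nbytes);
      CHECK(hipSetDevice(src_dev));
      CHECK(hipMemcpy(stage.data(), src_buf, nbytes, hipMemcpyDeviceToHost));
      // ... validate stage[] against the expected payload here ...
      CHECK(hipSetDevice(dst_dev));
      CHECK(hipMemcpy(dst_buf, stage.data(), nbytes, hipMemcpyHostToDevice));
    }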

ROCm version dependency

This problem was initially seen using ROCm 4.2.0. The newer ROCm 4.5.0 on the same system seems to resolve the original problem, with no other changes to our original code (which uses hipMemcpyPeerAsync() with the destination device driving the peer transfer). For this reason we've decided to make ROCm 4.5.0 our officially supported version floor. However, nothing in the 4.5.0 release notes suggests a bug fix along these lines, so the behavior may have disappeared as a side-effect of other changes (e.g. the "HIP Direct Dispatch" feature introduced in that version).

In the course of investigating this problem I read a lot of HIP documentation, wherein I discovered this recommendation regarding HIP peer copies:

"For multi-gpu or peer-to-peer configurations, it is recommended to set the current device to the device where the src data is physically located."

This recommendation advocates instead using the source device to drive the peer transfer. However, when I make the suggested change, the same data-validation failure symptoms return under ROCm 4.5.0 on the HPE Cray EX system. What's more, with that change I can also reproduce the same symptoms on a different (non-HPE/Cray) system at JLSE using ROCmRDMA in both ucx-conduit and ibv-conduit. This last point suggests that nothing in UCX is a contributing factor (at least for the behavior under ROCm 4.5.0), and rather that the root cause of this misbehavior arises in the interaction between (PCI-driven) ROCmRDMA and (CPU-driven) HIP peer-to-peer memcpy.
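
Concretely, the suggested change amounts to something like this (sketch, with the same illustrative names and CHECK macro as the earlier sketches):

    // Peer copy issued per the quoted recommendation: SOURCE device current.
    // Our original code instead calls hipSetDevice(dst_dev) before the same call.
    // With this source-driven variant the stale-data symptom returns under
    // ROCm 4.5.0, on both the HPE Cray EX system and the JLSE system.
    CHECK(hipSetDevice(src_dev));
    CHECK(hipMemcpyPeerAsync(dst_buf, dst_dev, src_buf, src_dev, nbytes, stream));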
