erroneous RMA Puts with device source with OFI+{CXI,verbs}

Issue #557 resolved
Paul Hargrove created an issue

We have encountered more than one case in which RMA Puts with their source in device memory are incorrect in one way or another at the GASNet level or below. In the past we had the UPCXX_BUG4148_WORKAROUND environment variable to work around a bug of this nature, but it has been removed and the logic for upcxx::copy has changed significantly since then such that it cannot be restored simply by reverting git commits.

This is a request for an up-to-date implementation of a workaround for this class of bug. This time, however, the workaround should be enabled by a UPC++ configure option rather than an environment variable, to avoid critical-path branches on systems without the problem(s) being worked around.

Despite the "RFE:" in the title, this work is currently a blocker for deploying accelerated memory kinds on Perlmutter, and similar Nvidia+Slingshot-11 systems.

Comments (6)

  1. Dan Bonachea

    I am deploying a configure-time workaround to ensure we get correct native memory kind copy behavior on Nvidia+Slingshot-11 systems. However we differ a bit in the "philosophy" of this new option.

    I don't love the idea of a "generic knob" that has a guaranteed impact on protocol choice, because that constrains our implementation in ways that might not be necessary for correctness (or even provide the best performance). For example, the old bug4148 workaround protocol switch mentioned above would be sufficient to avoid the new defect as I understand it, but would also unnecessarily penalize performance for some cases (because it applied to ALL puts that involved device memory on either side). There also might be a user perception that such a knob should continue to be supported beyond the point that we consider it relevant to correctness.

    The copy protocol space is highly multi-dimensional, and I'd prefer to deploy a surgical correctness workaround that is specific to this particular known misbehavior in the provider/conduit. This also allows us to retire the knob once our dependency floor includes the fix.

  2. Paul Hargrove reporter

    The identical misbehavior is observed with many (but not all) versions of the libfabric verbs provider as well as the cxi provider. I believe the proposed logic will address both without any additional effort.

    While the verbs provider is far less important to us, I am adjusting the issue title to help keep track of the scope of the problem (and solution).

  3. Dan Bonachea

    issue 557: Deploy a workaround

    configure --enable-issue557-workaround now disables the use of GASNet native memory kind put for upcxx::copy() from local device memory on ofi-conduit, to avoid correctness problems caused by libfabric providers where this doesn't function as specified.

    The workaround affects the behavior for any native device memory kind and any libfabric provider, although it's currently believed to only be relevant / recommended for NVIDIA CUDA GPUs and certain libfabric providers.

    -DUPCXXI_USING_ISSUE557_WORKAROUND is an undocumented per-TU setting to force this behavior on any MK conduit.

    Resolves issue #557.

    → <<cset 67bd86e05fc0>>

  4. Paul Hargrove reporter

    It turns out that ofi-memory kinds as merged to GASNet-EX's develop branch did not include the defect which had been discovered earlier in the development of that work. Therefore, the workaround --enable-issue557-workaround should not be needed, and I will be removing any mention of it from INSTALL.md.

  5. Paul Hargrove reporter

    My previous comment was slightly flawed, because GASNet-EX bug 4485 is only one of two bugs which motivated the workaround. The other is bug 4494. However, as of a few minutes ago, that one has an effective workaround via an environment setting.
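
The compile-time gating discussed in this thread can be illustrated with a minimal sketch. This is not the UPC++ implementation: only the macro name UPCXXI_USING_ISSUE557_WORKAROUND is taken from the commit message above, while the enums, the constant, and the function `choose_put_protocol` are hypothetical names invented for illustration. It shows how a configure-time setting can narrow protocol choice only for puts whose source is local device memory, without adding a runtime branch on builds that do not enable it.

```cpp
#include <cassert>

// Hypothetical sketch; none of these names are UPC++ internals except the
// macro UPCXXI_USING_ISSUE557_WORKAROUND, which the commit message mentions.
#ifdef UPCXXI_USING_ISSUE557_WORKAROUND
constexpr bool issue557_workaround = true;
#else
constexpr bool issue557_workaround = false;
#endif

// Assumed, simplified view of where the source buffer of a put lives.
enum class memory_kind { host, device };

// Assumed, simplified protocol space: the native memory-kind put,
// or some alternate path that avoids it.
enum class put_protocol { native_put, alternate };

// Select a protocol for a put whose *source* is local memory of kind `src`.
// Because issue557_workaround is a compile-time constant, a build without
// the workaround can fold this function down to "return native_put".
constexpr put_protocol choose_put_protocol(memory_kind src) {
  // Narrow condition: only puts from local device memory are diverted.
  if (issue557_workaround && src == memory_kind::device)
    return put_protocol::alternate;
  return put_protocol::native_put;
}

// Host-sourced puts are unaffected whether or not the workaround is enabled.
static_assert(choose_put_protocol(memory_kind::host) == put_protocol::native_put,
              "host-sourced puts always use the native path in this sketch");
```

Compiling this translation unit with -DUPCXXI_USING_ISSUE557_WORKAROUND makes `choose_put_protocol(memory_kind::device)` return `put_protocol::alternate`; without the define it returns `put_protocol::native_put`.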
