`copy(remote_cx::as_rpc)` with local destination may invoke callback in the wrong context

Issue #477 resolved
Dan Bonachea created an issue

The spec for copy() states:

remote-completion operations execute on the master persona of the host process associated with the destination (i.e. dest.where())

Thus, copy(remote_cx::as_rpc) with a local destination is required to execute the RC callback during user-level progress of the local master persona, regardless of which persona initiated the operation. Unfortunately the current implementation gets this wrong for many cases, as demonstrated by the following test:

#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>
#include "../util.hpp"

using namespace upcxx;

static int done = 0;
static int errs = 0;

#define CHECK(condition, context) do { \
  if (!(condition)) { \
    say() << "ERROR: failed check " << #condition << " in copy " << context \
          << " at line " << __LINE__; \
    errs++; \
  } \
} while (0)

void do_rc(const char *ctx) {
  CHECK(&current_persona() == &master_persona(), ctx);
  #if UPCXX_SPEC_VERSION >= 20201000
   CHECK(upcxx::in_progress(), ctx);
  #endif
  done = 1;
}

int main() {
  upcxx::init();
  print_test_header();

  int peer = (rank_me()+1)%rank_n();
  global_ptr<int> gp = new_array<int>(2);
  dist_object<global_ptr<int>> dgp(gp);
  global_ptr<int> rgp = dgp.fetch(peer).wait();

  {
  #if UPCXX_THREADMODE
  persona p;
  persona_scope ps(p);
  #endif

  barrier();

  copy(gp, gp+1, 1, remote_cx::as_rpc([]() { do_rc("host-to-host loopback"); }));
  CHECK(!done, "host-to-host loopback"); // RC runs locally, but must be in user progress
  while (!done) progress();
  done = 0;

  barrier();

  copy(gp, rgp+1, 1, remote_cx::as_rpc([]() { do_rc("host-to-host put"); }));
  while (!done) progress();
  done = 0;

  barrier();

  copy(rgp, gp+1, 1, remote_cx::as_rpc([]() { do_rc("host-to-host get"); }));
  CHECK(!done, "host-to-host get"); // RC runs locally, but must be in user progress
  while (!done) progress();
  done = 0;

  barrier();
  }

  delete_array(gp);
  print_test_success(errs==0);
  upcxx::finalize();
  return 0;
}

This test fails at far back as 2019.3.0, where copy() was first introduced with prototype Memory Kinds. The exact failure mode varies a bit with UPC++ version and copy case, but most of the CHECK expressions above for "copy host-to-host loopback" and "copy host-to-host get" have been seen to fail for one or more versions/configs.

Here is the 2-rank output from the latest 2021.3.0 release with ibv/debug/par/cuda:

Test: copy_rc2.cpp
Ranks: 2
[0] ERROR: failed check &current_persona() == &master_persona() in copy host-to-host loopback at line 20
[0] ERROR: failed check upcxx::in_progress() in copy host-to-host loopback at line 22
[0] ERROR: failed check !done in copy host-to-host loopback at line 45
[0] ERROR: failed check &current_persona() == &master_persona() in copy host-to-host get at line 20
Test result: ERROR
[1] ERROR: failed check &current_persona() == &master_persona() in copy host-to-host loopback at line 20
[1] ERROR: failed check upcxx::in_progress() in copy host-to-host loopback at line 22
[1] ERROR: failed check !done in copy host-to-host loopback at line 45
[1] ERROR: failed check &current_persona() == &master_persona() in copy host-to-host get at line 20
Test result: ERROR

This same behavior persists in develop @ 5862f75.

It's worth noting that copy(remote_cx::as_rpc) with a local destination is a bit of a silly corner case, since in that case the operation_cx event is often a more efficient means of detecting completion. However the semantics allow for correct operation in cases where the destination is not known to be local.

Comments (3)

  1. Dan Bonachea reporter

    Merge pull request #354 into develop

    • issue477-copy-rc: SQUASH: copy: Add a function reference Update ChangeLog rpc-ctor-trace: update expected move counts for copy(remote_cx::as_rpc) copy: Add a function reference alias to improve readability copy: Remove some unnecessary remote completion moves completions: Rename an internal completion helper test/copy-cover: strengthen RC sanity checking Add issue 477 test Fix issue #477: Ensure copy(remote_cx::as_rpc) callback runs in the correct context Support detail::persona_tls::during() on functions that must be invoked as rvalue progress: repair burstable state tracking

    → <<cset 62f7c9c42650>>

  2. Log in to comment