Some upcxx::copy cases do not discharge() properly

Issue #432 resolved
Dan Bonachea created an issue

Example program:

#include <thread>

#include <upcxx/upcxx.hpp>
#include "util.hpp"

#if !UPCXX_CUDA_ENABLED 
#error This test requires CUDA support
#endif

#if !UPCXX_THREADMODE
  #error This test may only be compiled in PAR threadmode
#endif

using namespace upcxx;

int done = 0;

int main(int argc, char *argv[]) {
  upcxx::init();
  print_test_header();

  int me = upcxx::rank_me();
  int ranks = upcxx::rank_n();
  int peer = (me + 1)%ranks;

  #define SZ 1024
  cuda_device gpu(0);
  device_allocator<cuda_device> da(gpu,10*SZ);
  using gp_d = global_ptr<char,memory_kind::cuda_device>;
  gp_d gp = da.allocate<char>(SZ);
  char *lp = new char[SZ];

  dist_object<gp_d> all_gps(gp);
  gp_d peer_gp = all_gps.fetch(peer).wait();
  upcxx::say() << "starting..";

  std::thread th1{ [&]() {
    upcxx::say() << "launching get..";
    upcxx::copy<char>(peer_gp, lp, SZ, // get remote GPU -> local host
      remote_cx::as_rpc([]() { 
        upcxx::say() << "remote CX";
        done = 1;
      })
    );
    upcxx::discharge();
    upcxx::say() << "discharge complete";
  }};
  while (!done) upcxx::progress();
  upcxx::say() << "wait done";

  th1.join();
  upcxx::say() << "thread joined";

  upcxx::barrier();

  // cleanup
  delete [] lp;
  da.deallocate(gp);
  gpu.destroy();

  UPCXX_ASSERT_ALWAYS(&upcxx::current_persona() == &upcxx::master_persona());
  print_test_success();

  upcxx::finalize();
  return 0;
} 

This program should work by spec, but (as of 2020.10.0/2020.11.0) deadlocks awaiting the completion that never arrives.

This was discovered by source inspection, and is the symptom of a larger problem. The defect affects several (but not all) combinations of upcxx::copy protocol and completion variant, for both the reference memory kinds and the new GDR-enabled kinds. The defect dates back to at least 2020.3.2, although orthogonal bugs in memory_kinds obscure the problem for older versions.

I'll be pursuing a solution.

Comments (2)

  1. Dan Bonachea reporter

    Proposed fix in PR 302

    The problem is that some cases of copy() use/used the initiating persona to progress the operation (in cases where multiple steps are required), and if the initiating persona is destroyed before the operation has reached the remote completion stage, that completion could be lost or the runtime could crash. This does not arise for operations that request an operation completion event delivery to the initiating persona, because there the program is required to reap that event using the initiating persona before allowing that persona to be destroyed (otherwise it's undefined behavior). However for operations with only remote completion (or in some cases remote+source completion), there was nothing to prevent the initiating persona from discharging and exiting while the operation remains in-flight.

    The only effective workaround for end-users encountering this defect in versions up to and including 2020.11.0 is "don't do that". The only way to avoid these corner cases would be to request operation_cx (with any signalling variant) when invoking upcxx::copy(), and ensure the initiating persona reaps the corresponding completion before persona destruction. This workaround notably incurs additional overhead in some cases, and should not be used for newer versions where remote -only completion is a better match to the algorithm.

    The PR fixes the defect by ensuring that copy() operations that don't request operation completion either (1) don't rely on the continued existence of the initiating persona after injection returns, or (2) ensure that discharge() on the initiating persona stalls until the initiating persona is no longer required for the in-flight operation to reach completion.

  2. Dan Bonachea reporter

    issue #432: Some upcxx::copy cases do not discharge() properly

    Add missing discharge interlocks for upcxx::copy() cases with initiator-chained completion dependencies, and in a few places remove such dependencies in favor of restricted AMs.

    Fixes issue #432

    Notable cases:

    • 3rd party copy with remote_cx only - was sending a superfluous AM, that would also cause crashes in the case of a defunct initiator persona.

    • fully loopback copy with remote_cx only - if the initiating persona was defunct, the completion would never run.

    • GDR-enabled get with remote_cx only - if the initiating persona was defunct, the completion would never run.

    • GDR-enabled put from private memory without operation_cx, or from shared memory with remote_cx only - if the initiating persona was defunct, crashes or exit-time errors could ensue.

    • reference put from private memory without operation_cx, or from shared memory with remote_cx only - if the initiating persona was defunct, crashes or exit-time errors could ensue.

    • reference get with remote_cx only - if the initiating persona was defunct, the completion would never run.

    → <<cset be36fbdefb39>>

  3. Log in to comment