Some upcxx::copy cases do not discharge() properly

#include <thread> #include <upcxx/upcxx.hpp> #include "util.hpp" #if !UPCXX_CUDA_ENABLED #error This test requires CUDA support #endif #if !UPCXX_THREADMODE #error This test may only be compiled in PAR threadmode #endif using namespace upcxx; int done = 0; int main(int argc, char *argv[]) { upcxx::init(); print_test_header(); int me = upcxx::rank_me(); int ranks = upcxx::rank_n(); int peer = (me + 1)%ranks; #define SZ 1024 cuda_device gpu(0); device_allocator<cuda_device> da(gpu,10*SZ); using gp_d = global_ptr<char,memory_kind::cuda_device>; gp_d gp = da.allocate<char>(SZ); char *lp = new char[SZ]; dist_object<gp_d> all_gps(gp); gp_d peer_gp = all_gps.fetch(peer).wait(); upcxx::say() << "starting.."; std::thread th1{ [&]() { upcxx::say() << "launching get.."; upcxx::copy<char>(peer_gp, lp, SZ, // get remote GPU -> local host remote_cx::as_rpc([]() { upcxx::say() << "remote CX"; done = 1; }) ); upcxx::discharge(); upcxx::say() << "discharge complete"; }}; while (!done) upcxx::progress(); upcxx::say() << "wait done"; th1.join(); upcxx::say() << "thread joined"; upcxx::barrier(); // cleanup delete [] lp; da.deallocate(gp); gpu.destroy(); UPCXX_ASSERT_ALWAYS(&upcxx::current_persona() == &upcxx::master_persona()); print_test_success(); upcxx::finalize(); return 0; }

Comments (2)

Dan Bonachea reporter

Proposed fix in PR 302

The problem is that some cases of copy() use/used the initiating persona to progress the operation (in cases where multiple steps are required), and if the initiating persona is destroyed before the operation has reached the remote completion stage, that completion could be lost or the runtime could crash. This does not arise for operations that request an operation completion event delivery to the initiating persona, because there the program is required to reap that event using the initiating persona before allowing that persona to be destroyed (otherwise it's undefined behavior). However for operations with only remote completion (or in some cases remote+source completion), there was nothing to prevent the initiating persona from discharging and exiting while the operation remains in-flight.

The only effective workaround for end-users encountering this defect in versions up to and including 2020.11.0 is "don't do that". The only way to avoid these corner cases would be to request operation_cx (with any signalling variant) when invoking upcxx::copy(), and ensure the initiating persona reaps the corresponding completion before persona destruction. This workaround notably incurs additional overhead in some cases, and should not be used for newer versions where remote -only completion is a better match to the algorithm.

The PR fixes the defect by ensuring that copy() operations that don't request operation completion either (1) don't rely on the continued existence of the initiating persona after injection returns, or (2) ensure that discharge() on the initiating persona stalls until the initiating persona is no longer required for the in-flight operation to reach completion.

2020-12-11T17:43:04+00:00

Dan Bonachea reporter

changed status to resolved

issue ~~#432~~: Some upcxx::copy cases do not discharge() properly

Add missing discharge interlocks for upcxx::copy() cases with initiator-chained completion dependencies, and in a few places remove such dependencies in favor of restricted AMs.

Fixes issue ~~#432~~

Notable cases:

3rd party copy with remote_cx only - was sending a superfluous AM, that would also cause crashes in the case of a defunct initiator persona.
fully loopback copy with remote_cx only - if the initiating persona was defunct, the completion would never run.
GDR-enabled get with remote_cx only - if the initiating persona was defunct, the completion would never run.
GDR-enabled put from private memory without operation_cx, or from shared memory with remote_cx only - if the initiating persona was defunct, crashes or exit-time errors could ensue.
reference put from private memory without operation_cx, or from shared memory with remote_cx only - if the initiating persona was defunct, crashes or exit-time errors could ensue.
reference get with remote_cx only - if the initiating persona was defunct, the completion would never run.

→ <<cset be36fbdefb39>>

2020-12-13T04:23:29+00:00

Dan Bonachea reporter
Proposed fix in PR 302

The problem is that some cases of copy() use/used the initiating persona to progress the operation (in cases where multiple steps are required), and if the initiating persona is destroyed before the operation has reached the remote completion stage, that completion could be lost or the runtime could crash. This does not arise for operations that request an operation completion event delivery to the initiating persona, because there the program is required to reap that event using the initiating persona before allowing that persona to be destroyed (otherwise it's undefined behavior). However for operations with only remote completion (or in some cases remote+source completion), there was nothing to prevent the initiating persona from discharging and exiting while the operation remains in-flight.

The only effective workaround for end-users encountering this defect in versions up to and including 2020.11.0 is "don't do that". The only way to avoid these corner cases would be to request operation_cx (with any signalling variant) when invoking upcxx::copy(), and ensure the initiating persona reaps the corresponding completion before persona destruction. This workaround notably incurs additional overhead in some cases, and should not be used for newer versions where remote -only completion is a better match to the algorithm.

The PR fixes the defect by ensuring that copy() operations that don't request operation completion either (1) don't rely on the continued existence of the initiating persona after injection returns, or (2) ensure that discharge() on the initiating persona stalls until the initiating persona is no longer required for the in-flight operation to reach completion.
- 2020-12-11T17:43:04+00:00
Dan Bonachea reporter
- changed status to resolved
issue ~~#432~~: Some upcxx::copy cases do not discharge() properly

Add missing discharge interlocks for upcxx::copy() cases with initiator-chained completion dependencies, and in a few places remove such dependencies in favor of restricted AMs.

Fixes issue ~~#432~~

Notable cases:
- 3rd party copy with remote_cx only - was sending a superfluous AM, that would also cause crashes in the case of a defunct initiator persona.
- fully loopback copy with remote_cx only - if the initiating persona was defunct, the completion would never run.
- GDR-enabled get with remote_cx only - if the initiating persona was defunct, the completion would never run.
- GDR-enabled put from private memory without operation_cx, or from shared memory with remote_cx only - if the initiating persona was defunct, crashes or exit-time errors could ensue.
- reference put from private memory without operation_cx, or from shared memory with remote_cx only - if the initiating persona was defunct, crashes or exit-time errors could ensue.
- reference get with remote_cx only - if the initiating persona was defunct, the completion would never run.
→ <<cset be36fbdefb39>>
- 2020-12-13T04:23:29+00:00
Log in to comment