`copy(remote_cx::as_rpc)` with local destination may invoke callback in the wrong context
The spec for copy()
states:
remote-completion operations execute on the master persona of the host process associated with the destination (i.e. dest.where())
Thus, copy(remote_cx::as_rpc)
with a local destination is required to execute the RC callback during user-level progress of the local master persona, regardless of which persona initiated the operation. Unfortunately the current implementation gets this wrong for many cases, as demonstrated by the following test:
#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>
#include "../util.hpp"
using namespace upcxx;
static int done = 0;
static int errs = 0;
#define CHECK(condition, context) do { \
if (!(condition)) { \
say() << "ERROR: failed check " << #condition << " in copy " << context \
<< " at line " << __LINE__; \
errs++; \
} \
} while (0)
void do_rc(const char *ctx) {
CHECK(¤t_persona() == &master_persona(), ctx);
#if UPCXX_SPEC_VERSION >= 20201000
CHECK(upcxx::in_progress(), ctx);
#endif
done = 1;
}
int main() {
upcxx::init();
print_test_header();
int peer = (rank_me()+1)%rank_n();
global_ptr<int> gp = new_array<int>(2);
dist_object<global_ptr<int>> dgp(gp);
global_ptr<int> rgp = dgp.fetch(peer).wait();
{
#if UPCXX_THREADMODE
persona p;
persona_scope ps(p);
#endif
barrier();
copy(gp, gp+1, 1, remote_cx::as_rpc([]() { do_rc("host-to-host loopback"); }));
CHECK(!done, "host-to-host loopback"); // RC runs locally, but must be in user progress
while (!done) progress();
done = 0;
barrier();
copy(gp, rgp+1, 1, remote_cx::as_rpc([]() { do_rc("host-to-host put"); }));
while (!done) progress();
done = 0;
barrier();
copy(rgp, gp+1, 1, remote_cx::as_rpc([]() { do_rc("host-to-host get"); }));
CHECK(!done, "host-to-host get"); // RC runs locally, but must be in user progress
while (!done) progress();
done = 0;
barrier();
}
delete_array(gp);
print_test_success(errs==0);
upcxx::finalize();
return 0;
}
This test fails at far back as 2019.3.0, where copy()
was first introduced with prototype Memory Kinds. The exact failure mode varies a bit with UPC++ version and copy case, but most of the CHECK
expressions above for "copy host-to-host loopback" and "copy host-to-host get" have been seen to fail for one or more versions/configs.
Here is the 2-rank output from the latest 2021.3.0 release with ibv/debug/par/cuda:
Test: copy_rc2.cpp
Ranks: 2
[0] ERROR: failed check ¤t_persona() == &master_persona() in copy host-to-host loopback at line 20
[0] ERROR: failed check upcxx::in_progress() in copy host-to-host loopback at line 22
[0] ERROR: failed check !done in copy host-to-host loopback at line 45
[0] ERROR: failed check ¤t_persona() == &master_persona() in copy host-to-host get at line 20
Test result: ERROR
[1] ERROR: failed check ¤t_persona() == &master_persona() in copy host-to-host loopback at line 20
[1] ERROR: failed check upcxx::in_progress() in copy host-to-host loopback at line 22
[1] ERROR: failed check !done in copy host-to-host loopback at line 45
[1] ERROR: failed check ¤t_persona() == &master_persona() in copy host-to-host get at line 20
Test result: ERROR
This same behavior persists in develop @ 5862f75.
It's worth noting that copy(remote_cx::as_rpc)
with a local destination is a bit of a silly corner case, since in that case the operation_cx
event is often a more efficient means of detecting completion. However the semantics allow for correct operation in cases where the destination is not known to be local.
Comments (3)
-
reporter -
reporter - changed status to resolved
Fix issue
#477: Ensurecopy(remote_cx::as_rpc)
callback runs in the correct context→ <<cset 42f8c68d4eab>>
-
reporter Merge pull request #354 into develop
- issue477-copy-rc:
SQUASH: copy: Add a function reference
Update ChangeLog
rpc-ctor-trace: update expected move counts for copy(remote_cx::as_rpc)
copy: Add a function reference alias to improve readability
copy: Remove some unnecessary remote completion moves
completions: Rename an internal completion helper
test/copy-cover: strengthen RC sanity checking
Add issue 477 test
Fix issue
#477: Ensurecopy(remote_cx::as_rpc)
callback runs in the correct context Support detail::persona_tls::during() on functions that must be invoked as rvalue progress: repair burstable state tracking
→ <<cset 62f7c9c42650>>
- issue477-copy-rc:
SQUASH: copy: Add a function reference
Update ChangeLog
rpc-ctor-trace: update expected move counts for copy(remote_cx::as_rpc)
copy: Add a function reference alias to improve readability
copy: Remove some unnecessary remote completion moves
completions: Rename an internal completion helper
test/copy-cover: strengthen RC sanity checking
Add issue 477 test
Fix issue
- Log in to comment
Proposed solution in pull request #354