- changed status to resolved
test/rput.cpp lacks sufficient synchronization
current version of test/rput.cpp includes this code:
int value = 100+me;
...
done_g = upcxx::make_future();
done_s = upcxx::rput(
&value, nebr_thing, 1,
source_cx::as_future() |
remote_cx::as_rpc([=]() { got_rpc++; })
);
...
done_g = done_g.then([&]() {
promise<int> *pro1 = new promise<int>;
upcxx::rget(nebr_thing, operation_cx::as_promise(*pro1));
there is a missing synchronization bug in this code, and this is believed to be responsible for intermittent failures observed on cori nightly testing.
Specifically: operation completion / remote completion of the rput
is not ensured before issuing the subsequent rget
which attempts to fetch the value just rput (meaning that network-level reordering can result in later validation failure).
This code needs some kind of a change to ensure correct synchronization (eg adding an operation_cx to the rput and stalling locally before issuing the subsequent rget, or adding a barrier or other acknowledgement to construct synchronization interlock via remote_cx). However since I'm unsure what @jdbachan 's original code was meant to test, I'm hesitant to apply the change (especially given the presence of commented-out code that apparently DOES handle this correctly, but also calls an undocumented operation_cx::as_blocking).
Comments (1)
-
- Log in to comment
"test/rput.cpp" was fixed to not ignore completion events. This was observed to cause a rare datarace. The call to the experimental "as_blocking" completion (which was commented out) has been removed.
Fixes issue 143.
→ <<cset d9ee769eb25e>>