test/rput.cpp lacks sufficient synchronization

Issue #143 resolved
Dan Bonachea created an issue

current version of test/rput.cpp includes this code:

    int value = 100+me;
...
    done_g = upcxx::make_future();
    done_s = upcxx::rput(
      &value, nebr_thing, 1,
      source_cx::as_future() |
      remote_cx::as_rpc([=]() { got_rpc++; })
    );
...
  done_g = done_g.then([&]() {
    promise<int> *pro1 = new promise<int>;
    upcxx::rget(nebr_thing, operation_cx::as_promise(*pro1));

there is a missing synchronization bug in this code, and this is believed to be responsible for intermittent failures observed on cori nightly testing.

Specifically: operation completion / remote completion of the rput is not ensured before issuing the subsequent rget which attempts to fetch the value just rput (meaning that network-level reordering can result in later validation failure).

This code needs some kind of a change to ensure correct synchronization (eg adding an operation_cx to the rput and stalling locally before issuing the subsequent rget, or adding a barrier or other acknowledgement to construct synchronization interlock via remote_cx). However since I'm unsure what @jdbachan 's original code was meant to test, I'm hesitant to apply the change (especially given the presence of commented-out code that apparently DOES handle this correctly, but also calls an undocumented operation_cx::as_blocking).

Comments (1)

  1. john bachan

    "test/rput.cpp" was fixed to not ignore completion events. This was observed to cause a rare datarace. The call to the experimental "as_blocking" completion (which was commented out) has been removed.

    Fixes issue 143.

    → <<cset d9ee769eb25e>>

  2. Log in to comment