Generate error for future::wait() on a non-ready future inside RPC

Dan Bonachea reporter

Here's an example of a program that a user might expect to work, but does not.

#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>
#include <unistd.h>

using namespace std;

void status(const char *p) {
    cout << "Rank " << upcxx::rank_me() << ": " << p << endl;
}

int main(int argc, char *argv[]) {
    upcxx::init();

    assert(upcxx::rank_n() % 2 == 0);
    status("sending outermost RPC.");
    upcxx::rpc(upcxx::rank_me() ^ 1,[]() {
      status("in outermost RPC");
      sleep(1);
      status("sending inner RPC");
      auto f = upcxx::rpc(upcxx::rank_me() ^ 1,[]() {
        status("in inner RPC");
      });
      status("waiting for inner RPC");
      f.wait(); // deadlock here
      status("something else");
    }).wait();

    status("done");

    upcxx::finalize();
    return 0;
}

Here two ranks send rpcs to each other, and the body of each tries to send an rpc back and wait for it. This program silently deadlocks without ever running the "inner RPC".

This represents a fundamental semantic limitation we place on our RPC framework. This limitation is not immediately intuitive, and possibly needs to be explained more clearly. The error check suggested in this bug would at least help users understand this pattern won't work.

The recommended solution for this type of situation is to use future chaining to exit the restricted context, ie:

<       f.wait(); // deadlock here
<       status("something else");
---
>       return f.then([]() {
>         status("something else");
>       });

after this change the program will run as expected.

2017-11-22T05:55:28+00:00

Dan Bonachea reporter

changed milestone to 2018.03.31 release

In the 1/10/18 meeting, John said he'd have this for the March release

2018-01-12T05:18:39+00:00

john bachan

As a means to generalize beyond just wait(), How about in debug mode I maintain a counter of consecutive reentrant progress calls? When that counter hits like 10000 then I take abortive action.

2018-01-16T23:03:42+00:00

Dan Bonachea reporter

How about in debug mode I maintain a counter of consecutive reentrant progress calls? When that counter hits like 10000 then I take abortive action.

I'd say it depends on the details. Is this a thread-specific counter? Does the counter reset when the thread leaves outermost progress? (ie consecutive meaning within the context of a single outer progress call) Do we provide a mechanism (envvar or #define) to tweak the arbitrary value of 10000?

I'm certain that future::wait() on a non-ready future in restricted context is always an error, but I can't say the same with absolute certainty for any finite number of calls to progress.

2018-01-16T23:39:48+00:00

john bachan

changed status to resolved

Fixed issue 105. In debug-mode we assert that future::wait isn't called from within the user-progress context.

UPCXX_ASSERT output format has changed, and it has been more intelligently integrated with gasnet_tools debugger facilities.

→ <<cset 0a20cde62fdc>>

2018-01-18T21:03:21+00:00

Comments (5)