Generate error for future::wait() on a non-ready future inside RPC

Issue #105 resolved
Dan Bonachea created an issue

The spec effectively mandates that futures never become ready inside the dynamic context of an RPC. This means calling future::wait() on a non-ready future inside the restricted context is always a guaranteed hang. At least one external user has already run into this and been confused.

Rather than silently deadlocking, we should probably throw a helpful error message for this case, at least in debug mode.

Comments (5)

  1. Dan Bonachea reporter

    Here's an example of a program that a user might expect to work, but does not.

    #include <upcxx/upcxx.hpp>
    #include <iostream>
    #include <assert.h>
    #include <unistd.h>
    
    using namespace std;
    
    void status(const char *p) {
        cout << "Rank " << upcxx::rank_me() << ": " << p << endl;
    }
    
    int main(int argc, char *argv[]) {
        upcxx::init();
    
        assert(upcxx::rank_n() % 2 == 0);
        status("sending outermost RPC.");
        upcxx::rpc(upcxx::rank_me() ^ 1,[]() {
          status("in outermost RPC");
          sleep(1);
          status("sending inner RPC");
          auto f = upcxx::rpc(upcxx::rank_me() ^ 1,[]() {
            status("in inner RPC");
          });
          status("waiting for inner RPC");
          f.wait(); // deadlock here
          status("something else");
        }).wait();
    
        status("done");
    
        upcxx::finalize();
        return 0;
    }         
    

    Here two ranks send rpcs to each other, and the body of each tries to send an rpc back and wait for it. This program silently deadlocks without ever running the "inner RPC".

    This represents a fundamental semantic limitation we place on our RPC framework. This limitation is not immediately intuitive, and possibly needs to be explained more clearly. The error check suggested in this bug would at least help users understand this pattern won't work.

    The recommended solution for this type of situation is to use future chaining to exit the restricted context, ie:

    <       f.wait(); // deadlock here
    <       status("something else");
    ---
    >       return f.then([]() {
    >         status("something else");
    >       });
    

    after this change the program will run as expected.

  2. john bachan

    As a means to generalize beyond just wait(), How about in debug mode I maintain a counter of consecutive reentrant progress calls? When that counter hits like 10000 then I take abortive action.

  3. Dan Bonachea reporter

    How about in debug mode I maintain a counter of consecutive reentrant progress calls? When that counter hits like 10000 then I take abortive action.

    I'd say it depends on the details. Is this a thread-specific counter? Does the counter reset when the thread leaves outermost progress? (ie consecutive meaning within the context of a single outer progress call) Do we provide a mechanism (envvar or #define) to tweak the arbitrary value of 10000?

    I'm certain that future::wait() on a non-ready future in restricted context is always an error, but I can't say the same with absolute certainty for any finite number of calls to progress.

  4. john bachan

    Fixed issue 105. In debug-mode we assert that future::wait isn't called from within the user-progress context.

    UPCXX_ASSERT output format has changed, and it has been more intelligently integrated with gasnet_tools debugger facilities.

    → <<cset 0a20cde62fdc>>

  5. Log in to comment