entry barriers deadlock when invoked inside user-level progress callbacks

Issue #412 closed
Rob Egan created an issue

So in the process of writing this ticket I figured out my error, but I’m going to submit it because I think that the debug build could better support this user error by throwing an assertion instead of hanging indefinitely.

The following code constructs a shared_ptr of an atomic domain and destroys it in a future. I found this necessary in my dist_ofstream class in order to make the close() operation fully asynchronous but still tear down the atomic domain once all ranks in the team have called close too.

The user implementation error is lines 25 vs 26, but, if possible adding an assertion to the atomic_domain::destroy() method that either no barrier is requested or the code is not in user progress() would have helped me to diagnose the error faster.

#include <upcxx/upcxx.hpp>
#include <cstdio>
#include <memory>
#include <cassert>

int main(int argc, char **argv) {
    upcxx::init();
    upcxx::barrier();

    using AD = upcxx::atomic_domain<int>;
    //AD ad({upcxx::atomic_op::fetch_add, upcxx::atomic_op::load, upcxx::atomic_op::store}, upcxx::world());
    std::vector<upcxx::atomic_op> ops{upcxx::atomic_op::fetch_add, upcxx::atomic_op::load, upcxx::atomic_op::store};
    std::shared_ptr< AD > sh_ad = std::make_shared<AD>(ops, upcxx::world());

    upcxx::barrier();
    sh_ad->destroy();
    upcxx::barrier();
    sh_ad.reset();

    upcxx::barrier();
    if (!upcxx::rank_me()) std::cout << "Normal destory good" << std::endl;
    upcxx::barrier();

    sh_ad = std::make_shared<AD>(ops, upcxx::world());

    upcxx::future<> fut = upcxx::barrier_async().then([sh_ad](){
        assert(upcxx::current_persona().active_with_caller());
        assert(upcxx::master_persona().active_with_caller());
        std::cout << upcxx::rank_me() << " ready\n";
        //sh_ad->destroy(upcxx::entry_barrier::none); // okay
        sh_ad->destroy(); // hangs
        std::cout << upcxx::rank_me() << " destroyed\n";
        });

    fut.wait();
    upcxx::barrier();
    sh_ad.reset();


    upcxx::barrier();
    if (!upcxx::rank_me()) std::cout << "future destroy good" << std::endl;
    upcxx::barrier();

    upcxx::finalize();
    return 0;
}

Comments (6)

  1. Dan Bonachea

    I should note that invoking collectives (such as atomic_domain::destroy()) from callbacks in user-level progress is not a use case we've thought carefully about or really designed for. This particular trivial example does not violate any collective ordering properties, but for more general codes it quickly becomes very difficult to correctly maintain collective ordering preconditions within asynchronous progress callbacks (see issue #416).

    I think this program has raised some ambiguities in the specification with regards to entry barriers (a feature we provide on several collective calls). The entry barriers are intended to insert entries into progress of a given level during invocations from outside progress. However it's underspecified whether the entry barriers themselves (especially those with entry_barrier::internal) require further user-level progress to complete. As currently implemented they do, making entry_barrier::{internal,user} always deadlock when invoked within the restricted context (aside from trivial cases like singleton teams). If we specify/keep this semantic, then I agree with Rob that DEBUG mode should diagnose such calls as an error.

    However I also believe it's possible to change the implementation of entry barriers to avoid this requirement. I've done this in speculative PR #282.

    I'm on the fence as to which solution is best and think this merits discussion.

  2. Dan Bonachea

    issue412: Ensure progress of entry_barrier::internal within restricted context

    issue #412 demonstrates that invoking collective calls with entry barriers from within restricted context (from a callback inside user-level progress) leads to deadlock, because the entry barrier implementation was written under the assumption of the common-case invocation outside restricted context.

    Enhance our entry barriers to properly proceed internal progress sufficient for their own completion, even when invoked from the restricted context. Note this does NOT cause invocations using entry_barrier::user to force nested user-level progress, and invoking an entry_barrier::user inside progress has been made a fatal error.

    Resolves issue #412.

    → <<cset ebebe2297046>>

  3. Dan Bonachea

    Deploy calls to UPCXX_ASSERT_COLLECTIVE_SAFE

    This enforces prohibitions on collective calls initiated from callbacks running with progress.

    Resolves issue #412

    → <<cset 8186fc195593>>

  4. Dan Bonachea

    As of Impl PR 426 we now unconditionally prohibit all collective operation initiation inside the restricted context:

    Initiating collective operations with a progress level of internal or none from within the restricted context (within a callback running inside progress), an action deprecated with a runtime warning since 2020.10.0, is now prohibited with a fatal error. For details, see spec issue 169.

    The now-obsolete workaround for this issue has thus been reverted and regression tests adjusted accordingly.

  5. Log in to comment