Deadlocks arising from synchronous collective calls with internal progress

@Rob Egan has observed and reported a library design issue that I think merits further thought and discussion.

Example

The problem is essentially summarized by this code:

#include <upcxx/upcxx.hpp>
#include <iostream>
#include <assert.h>
#include <unistd.h>

using namespace upcxx;

struct myclass {
   atomic_domain<int> ad;
   myclass(const upcxx::team &myteam) : ad({atomic_op::fetch_add}, myteam) {
   }
   ~myclass() {
     ad.destroy(); // includes a barrier
   }
};

int main() {
  upcxx::init();

  if (!rank_me()) std::cout << "Starting..." << std::endl;

  global_ptr<int> gp = rank_me() ? nullptr : new_<int>(0);
  gp = broadcast(gp, 0).wait();

  if (!rank_me()) { sleep(1); rpc((rank_me()+1)%rank_n(),[](){}).wait(); }

  #if FORCE_BARRIER
    upcxx::barrier();
  #endif

  { myclass mc(world());

    mc.ad.fetch_add(gp, 1, std::memory_order_acq_rel);
  }

  if (!rank_me()) {
    assert(*gp.local() == rank_n());
    delete_(gp);
    std::cout << "SUCCESS" << std::endl;
  }

  upcxx::finalize();
}

Discussion

The example above deadlocks in the 2020.3.2 release when compiled in DEBUG mode without -DFORCE_BARRIER, because the collective atomic_domain constructor includes a broadcast at the GASNet level with internal-level progress (in this case, simply to enforce additional debug checking). However our definition of "collective" explicitly permits the implementation to perform this type of implicit barrier synchronization in any call labelled as collective: (1.7-3-4, emphasis added)

Collective. A constraint placed on some language operations which requires evaluation of such operations to be matched across all participating processes. The behavior of collective operations is undefined unless all processes execute the same sequence of collective operations.

A collective operation need not provide any actual synchronization between pro-cesses, unless otherwise noted. The collective requirement simply states a relative ordering property of calls to collective operations that must be maintained in the par-allel execution trace for all executions of any valid program. Some implementations may include unspecified synchronization between processes within collective operations, but programs must not rely upon the presence or absence of such unspecified synchronization for correctness.

The fundamental problem here is a bad interaction between what I'll call "synchronous" collective calls (those that are not split-phased with an asynchronous completion) and user-level progress. Most of our "synchronous" collective calls (upcxx::barrier() being the most prominent example) include user-level progress, which means they service incoming RPCs that might otherwise prevent other ranks from "catching up" to reach the collective call. Other examples include team::split and {team,atomic_domain<T>}::destroy() (with the default entry_barrier::user).

However we also have a few "synchronous" collective calls with internal-only progress - this currently includes the atomic_domain<T> and cuda_device regular constructors, and the various destroy calls with a non-default entry_barrier::internal. I believe these are all subject to the kind of deadlock demonstrated above, if the ranks are not already in "lock-step" upon reaching the collective call (specifically, without any outstanding point-to-point dependencies on remote user-level progress blocking arrival at the collective); assuming the implementation chooses to exercise its freedom to insert inter-process synchronization without user-level progress.

broadcast, reduce and barrier_async are "asynchronous" collectives with internal progress, which means that if the implementation chose to embed a barrier synchronization in initiation they could exhibit a similar deadlock. However these are explicitly split-phase, so that would be a poor implementation choice and one we're not likely to encounter in practice (and the wording in barrier_async explicitly prohibits such synchronization). So those are not really relevant to this discussion.

Slack discussion (copied with permission)

@Rob Egan: I have a question about the atomic_domain constructor. Is it correct that the construction does not call user-level progress? The specs state that this needs to be constructed and destructed collectively, but in our code it doesn't seem to be calling user level progress, so ranks that are waiting on a pending rpc from a rank that started creating the atomic_domain are causing a deadlock. Is this an oversight or feature? Practically it means for our code we need to call a barrier before construction of the atomic domain.

@Amir Kamil: That is the intended behavior. Collective does not imply synchronization or communication. It just means that the processes involved must invoke the collective in the same order with respect to other collective operations.

@Rob Egan: okay that makes sense, but considering the potential for a deadlock, it might be worthwhile to offer an api for the constructor of an atomic_domain that can call a barrier like there is already one for the destroy method. Not only for the convenience, but also for the warning to the user that this is something to consider. For my class that has an atomic_domain as a member variable, I had to initialize it with a team that ensures a barrier is called, because there is no method to construct an empty atomic_domain without a team and then assign it later after calling the barrier within the constructor body of my class.
i.e. I had to do this:

const upcxx::team &barrier_wrapper(const upcxx::team &team) { 
    barrier(team); 
    return team;
}
class myclass { 
    atomic_domain ad; 
    myclass(const upcxx::team &myteam) : ad({...}, barrier_wrapper(myteam)) {} 
};

because I could not get something like this to work:

class myclass { 
    atomic_domain ad; 
     myclass(const upcxx::team &myteam) { 
                   barrier(myteam); 
                   ad = atomic_domain({...},myteam);
     } 
};

Example

Discussion

Slack discussion (copied with permission)

Comments (3)