hang when creating dist_object within future chains

I’m having trouble creating a stable class that has member dist_objects and/or atomic_domains. (i.e. it creates & destroys them on construction and destruction). When multiple instances of the class may be pending the program hangs or throws assertions possibly indicating that the distributed objects are getting mixed between the instances on different ranks. This is preventing me from using these classes within a block called by progress(), because I need to call wait() in the destructor to avoid any hangs.

I’ve reduced this to a reproducer code, with effectively 2 versions (with and without DO_IN_FUTURE) which moves the construction of the distributed object into a future, instead of the main loop.

It works fine (on 1 machine) when dist_object construction is in the main loop and (just) FIX_NO_DO_IN_FUTURE is defined. The FIX_NO_DO_IN_FUTURE adds a wait() and barrier at the end of every iteration, effectively serializing the code across all ranks, which is something I’m trying to avoid. All the commented out “// futile “ barriers did not fix the hanging problem that occurs without FIX_NO_DO_IN_FUTURE. i.e. if you uncomment them all but do not include FIX_NO_DO_IN_FUTURE, the loop hangs. In my opinion, even a barrier in the loop should not be necessary, let alone a wait + barrier.

With DO_IN_FUTURE defined, I’ve had no success. I am trying to future-chain the loop but I haven’t been able to get any version to be stable at all (>90% of the time it hangs after a few iterations).

#define ITERATIONS 200
//#define DO_IN_FUTURE
#define FIX_NO_DO_IN_FUTURE

using DO = dist_object<int>;
using ShDO = shared_ptr<DO>;

void test_future_chain() {

  barrier();
  future<> fut_all = make_future();;
  for(int i = 0; i < ITERATIONS ; i++) {

      if (!rank_me()) std::cout << "." << std::flush;

      future<> fut = make_future();

      ShDO sh_do;
#ifdef DO_IN_FUTURE
      assert(!sh_do);
      barrier(); // futile
      fut = barrier_async(); // futile to protect future execution
      fut = when_all(fut, fut_all); // futile to chain here
      barrier(); // futile
#else
      //barrier(); // futile to protect dist_object construction
      sh_do = make_shared< DO >(i);
      assert(sh_do);
      assert(*(*sh_do) == i);
      //fut = barrier_async(); // also futile
      //barrier(); // futile to protect dist_object construction

      //fut = when_all(fut, fut_all); // futile to chain here
#endif

      fut = fut.then([i,sh_do]() {
          ShDO sh_do2 = sh_do; // copy since sh_do is const
          future<ShDO> fut_sh_do;

#ifdef DO_IN_FUTURE
          assert(!sh_do2);
#ifdef IDEALLY
          sh_do2 = make_shared< DO >(i)
          fut_sh_do = make_future(sh_do2);
#else  // but none of these following barriers do any good either
          fut_sh_do = barrier_async().then([i]() {
              auto sh_do2 = make_shared< DO >(i);
              DBG("do=", sh_do2->id(), " for i=", i, "\n");
              return sh_do2;
          });
          fut_sh_do = when_all(fut_sh_do, barrier_async()); // futile
#endif

#else
          assert(sh_do2);
          fut_sh_do = make_future(sh_do2);
          //fut_sh_do = when_all(fut_sh_do, barrier_async()); // futile
#endif

          auto fut_rpc = fut_sh_do.then([i](ShDO sh_do_copy) {
            return rpc((rank_me() + 1) % rank_n(), 
                  [](DO &_do, int other_i) {
                DBG("Got do=", *_do, " other_i=", other_i, "\n");
                assert(*_do == other_i);
                return *_do;
            }, *sh_do_copy, i);
          });
          return when_all(fut_rpc, fut_sh_do).then([i](int returned_i, ShDO sh_do_copy) {
             assert(i == *(*sh_do_copy)); 
             assert(i == returned_i);
          });

      });

#ifdef DO_IN_FUTURE
      fut.wait(); // futile
      barrier();  // futile
#else
#ifdef FIX_NO_DO_IN_FUTURE
      // must serialize with wait *and* barrier!!      
      fut.wait();
      barrier();
#endif

#endif

      fut_all = when_all(fut_all, fut);

  }
  if (!rank_me()) std::cout << "done\n" << std::flush;
  fut_all.wait();
  barrier();
}

‌

Comments (7)