Non-portable (spawner-specific) behavior of static constructors+destructors

Issue #419 new
Dan Bonachea created an issue

I've always known that our handling of C++ static initializers/destructors was not ideal, mostly because of interactions with multi-process job spawners and the process kills we sometimes use at exit time to prevent orphaned runaway compute processes. However, I was surprised by how bad the situation can actually be, especially on ibv-conduit with ssh-spawner.

Demo program:

#include <upcxx/upcxx.hpp>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <unistd.h>
#include <cassert>  // assert() used in A::~A()
struct say {
  std::stringstream ss;
  say() {
    *this << "pid:" << (int)getpid() << ": ";
  }
  template<typename T>
  say& operator<<(T const &that) {
    ss << that;
    return *this;
  }
  ~say() {
    *this << "\n";
    std::cout << ss.str() << std::flush;
  }
};

struct A {
  A() {
     say() << "constructor("<<std::setw(18)<<this<<"): init=" << upcxx::initialized();
  }
  ~A(){ 
     say() << "destructor ("<<std::setw(18)<<this<<"): init=" << upcxx::initialized();
     assert(!upcxx::initialized()); 
  }
};

A a1; // static data

int main() {
  say() << "main()";
  A a2; // stack

  upcxx::init();

  say() << "UPC++ process " << upcxx::rank_me() << "/" << upcxx::rank_n();

  upcxx::barrier();
  if (!upcxx::rank_me()) say() << "SUCCESS";
  upcxx::finalize();

  say() << "post-finalize";
  return 0;
}

Output on our Linux cluster from a one-rank run with ibv-conduit and ssh-spawner:

$ upcxx-run -np 1 a.out
pid:740: constructor(          0xb17d11): init=0
pid:740: main()
pid:740: constructor(    0x7ffe079502ef): init=0
pid:762: constructor(          0xb17d11): init=0
pid:762: main()
pid:762: constructor(    0x7ffd12c906ef): init=0
pid:763: UPC++ process 0/1
pid:763: SUCCESS
pid:763: post-finalize
pid:763: destructor (    0x7ffd12c906ef): init=0
pid:762: destructor (          0xb17d11): init=1
a.out: init-bug.cpp:29: A::~A(): Assertion `!upcxx::initialized()' failed.
pid:740: destructor (          0xb17d11): init=1
a.out: init-bug.cpp:29: A::~A(): Assertion `!upcxx::initialized()' failed.
Abort

This output reveals several behaviors that most non-expert users are likely to find surprising:

  1. A single-process job actually involves three separate processes which all run portions of the user program, sometimes concurrently.
  2. Static initializers run on two of the processes, both of which also enter main() and run the user code through upcxx::init(), including the stack object's constructor. (This is an oddball behavior explicitly permitted by GASNet to ensure portable distributed job spawning.)
  3. Inside upcxx::init(), a third process is forked which acts as the actual UPC++ compute process, returning from upcxx::init() and finishing main(). Upon return from main, that process runs the destructor for the stack-allocated object (which was actually created before init() by a different process), but does NOT run the destructor for the static data object.
  4. The other two "hidden" processes (that did not run the UPC++ application) do NOT run the destructor for the stack-allocated object (they never returned from upcxx::init), but they both DO run the destructor for the static data object.
  5. Furthermore, whilst running the static destructors, the hidden processes report upcxx::initialized() == true, even though neither was ever a valid UPC++ compute process with an initialized library.

These behaviors are non-portable: other spawners behave differently, although ssh-spawner is probably the "most surprising". They may cause problems for any code with observable external side effects running in static initializers/destructors or in main() before upcxx::init(). Because behavior varies across spawners, this seems especially likely to affect codes developed using spawners that don't exhibit these behaviors.

In most cases with normal (non-abortive) exits, setting the documented environment variable GASNET_CATCH_EXIT=0 can restore execution of the static destructors when the compute process exits, at the cost of sacrificing the automated protection against orphaned processes that can occur on some systems after an incomplete job termination. However, I don't think we offer any workarounds or mechanisms to deal with the other "weirdnesses" described above.

I'm planning to address surprising behavior (5), and want to discuss what we can do to help mitigate the effect of the others.

Comments (4)

  1. Dan Bonachea reporter

    Surprising behavior (4) resolved in GASNet commit 30b11ac (to appear in forthcoming 2020.10.0).

    The remaining (first three) behaviors are more deeply ingrained in the current design of ssh-spawner, and would require major design changes to address (which would likely involve installing permanent "helper executables" at known locations on the compute nodes). However these behaviors are explicitly permitted by the GASNet specification (and implicitly by UPC++).

    Users encountering problems related to these remaining issues are advised to use a different spawner: UPC++ executables spawned using mpi-spawner or pmi-spawner (e.g., SLURM srun) usually do not exhibit these behaviors.
