Non-portable (spawner-specific) behavior of static constructors+destructors
I've always known that our handling of C++ static initializers/destructors was not ideal, mostly because of interactions with multi-process job spawners and the process kills we sometimes use at exit time to ensure a lack of orphaned run-away compute processes. However, I was surprised by how bad the situation can actually be, especially on ibv-conduit with ssh-spawner.
Demo program:
#include <upcxx/upcxx.hpp>
#include <cassert>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <unistd.h>

// Buffers one line of output, prefixed with the pid, and prints it on destruction.
struct say {
  std::stringstream ss;
  say() {
    *this << "pid:" << (int)getpid() << ": ";
  }
  template<typename T>
  say& operator<<(T const &that) {
    ss << that;
    return *this;
  }
  ~say() {
    *this << "\n";
    std::cout << ss.str() << std::flush;
  }
};

// Reports the library state from its constructor and destructor.
struct A {
  A() {
    say() << "constructor("<<std::setw(18)<<this<<"): init=" << upcxx::initialized();
  }
  ~A(){
    say() << "destructor ("<<std::setw(18)<<this<<"): init=" << upcxx::initialized();
    assert(!upcxx::initialized());
  }
};

A a1; // static data

int main() {
  say() << "main()";
  A a2; // stack
  upcxx::init();
  say() << "UPC++ process " << upcxx::rank_me() << "/" << upcxx::rank_n();
  upcxx::barrier();
  if (!upcxx::rank_me()) say() << "SUCCESS";
  upcxx::finalize();
  say() << "post-finalize";
  return 0;
}
Output on our Linux cluster from a ONE rank run with ibv-conduit and ssh-spawner:
$ upcxx-run -np 1 a.out
pid:740: constructor( 0xb17d11): init=0
pid:740: main()
pid:740: constructor( 0x7ffe079502ef): init=0
pid:762: constructor( 0xb17d11): init=0
pid:762: main()
pid:762: constructor( 0x7ffd12c906ef): init=0
pid:763: UPC++ process 0/1
pid:763: SUCCESS
pid:763: post-finalize
pid:763: destructor ( 0x7ffd12c906ef): init=0
pid:762: destructor ( 0xb17d11): init=1
a.out: init-bug.cpp:29: A::~A(): Assertion `!upcxx::initialized()' failed.
pid:740: destructor ( 0xb17d11): init=1
a.out: init-bug.cpp:29: A::~A(): Assertion `!upcxx::initialized()' failed.
Abort
This output reveals several behaviors that most non-expert users are likely to find surprising:
1. A single-process job actually involves three separate processes, all of which run portions of the user program, sometimes concurrently.
2. Static initializers run on two of the processes, both of which also enter main() and run the user code through upcxx::init(), including the stack object initializer (this is an oddball behavior explicitly permitted by GASNet to ensure portable distributed job spawning).
3. Inside upcxx::init(), a third process is forked which acts as the actual UPC++ compute process, returning from upcxx::init() and finishing main(). Upon return from main, that process runs the destructor for the stack-allocated object (which was actually created before init() by a different process), but does NOT run the destructor for the static data object.
4. The other two "hidden" processes (that did not run the UPC++ application) do NOT run the destructor for the stack-allocated object (they never returned from upcxx::init), but they both DO run the destructor for the static data object.
5. Furthermore, whilst running the static destructors, the hidden processes report upcxx::initialized() == true, even though neither was ever a valid UPC++ compute process with an initialized library.
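In the output above, the stack object was constructed in pid 762 but destroyed in pid 763. Code that needs to notice this kind of hand-off could record the constructing pid and compare it later. The following is only a sketch: `ProcessOwner` and `child_detects_fork` are hypothetical names, not part of UPC++ or GASNet, and a plain `fork()` stands in for whatever the spawner does internally.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: remembers which process constructed it.
class ProcessOwner {
  pid_t creator_;
public:
  ProcessOwner() : creator_(getpid()) {}
  pid_t creator() const { return creator_; }
  // True only in the process that ran the constructor.
  bool in_creating_process() const { return getpid() == creator_; }
};

// Forks a child and asks it whether it still counts as the creator.
// Returns 1 if the child saw a different pid, 0 if not, -1 on error.
int child_detects_fork(const ProcessOwner &owner) {
  assert(owner.in_creating_process());  // must be called by the creating process
  pid_t child = fork();
  if (child < 0) return -1;
  if (child == 0) {
    // _exit skips atexit handlers and static destructors in the child.
    _exit(owner.in_creating_process() ? 0 : 1);
  }
  int status = 0;
  if (waitpid(child, &status, 0) != child) return -1;
  if (!WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

A destructor could consult `in_creating_process()` before performing side effects. Note that this only catches objects that cross a fork; it would not flag the static object in the hidden processes, since each of those constructs and destroys it under the same pid.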
These behaviors are non-portable - other spawners result in different behaviors, although ssh-spawner is probably the "most surprising". These behaviors may cause problems for any code with observable external side-effects running in static initializers/destructors or in pre-init main(). Due to behavioral variation across spawners, this seems especially likely to impact codes developed using spawners that don't demonstrate these behaviors.
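One way user code might avoid side effects in static initializers is to construct such objects on first use rather than at load time, and to make that first use only after upcxx::init() has returned. A minimal sketch using a function-local static; `Journal`, `journal()` and `first_use_demo()` are hypothetical names for illustration, and no UPC++ calls appear so that the snippet stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for any type whose constructor has externally
// observable side effects (printing, opening files, ...).
struct Journal {
  static int constructed;  // counts constructions, for illustration only
  std::vector<std::string> lines;
  Journal() { ++constructed; }
  void record(const std::string &line) { lines.push_back(line); }
};
int Journal::constructed = 0;

// Function-local static: constructed on the first call, not at load time.
Journal &journal() {
  static Journal instance;
  return instance;
}

// Usage sketch: in a UPC++ program the first call would come after
// upcxx::init(), so processes that never return from init never construct it.
std::size_t first_use_demo() {
  assert(Journal::constructed == 0);  // nothing ran at load time
  journal().record("first use");
  assert(Journal::constructed == 1);
  return journal().lines.size();
}
```

This only moves the constructor. The destructor of a function-local static is still run by the normal exit path, so destructor side effects remain subject to the exit-time behavior described in this issue.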
In most cases with normal (non-abortive) exits, setting documented envvar GASNET_CATCH_EXIT=0 can restore the execution of the static destructors on exit of the compute process, with a cost of sacrificing automated protection against orphaned processes that can occur in some systems after an incomplete job termination. However I don't think we offer any workarounds or mechanisms to deal with the other "weirdnesses" described above.
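An alternative that does not depend on GASNET_CATCH_EXIT is for user code to stop relying on static destructors for important teardown and run it explicitly from main() instead. The sketch below assumes a hypothetical `TeardownList` helper (not a UPC++ or GASNet API) that collects actions and runs them once, in reverse order of registration.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: collects teardown actions so they can be run
// explicitly instead of from static destructors.
class TeardownList {
  std::vector<std::function<void()>> actions_;
  bool ran_ = false;
public:
  void add(std::function<void()> action) {
    assert(!ran_ && "teardown already ran");
    actions_.push_back(std::move(action));
  }
  // Runs every action once, most recently added first (mirroring the
  // order in which destructors would run). Later calls are no-ops.
  void run() {
    if (ran_) return;
    ran_ = true;
    for (auto it = actions_.rbegin(); it != actions_.rend(); ++it) (*it)();
    actions_.clear();
  }
  bool ran() const { return ran_; }
};
```

In a UPC++ program the intent would be to call `run()` in main() on the compute process, for example just before upcxx::finalize(), so the actions execute regardless of whether static destructors are later skipped. This does nothing to stop the hidden processes from running static destructors of their own.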
I'm planning to address surprising behavior (5), and want to discuss what we can do to help mitigate the effect of the others.
Comments (4)
-
reporter -
reporter - changed title to Non-portable (spawner-specific) behavior of static constructors+destructors
- marked as minor
Surprising behavior (4) resolved in GASNet commit 30b11ac (to appear in forthcoming 2020.10.0).
The remaining (first three) behaviors are more deeply ingrained in the current design of ssh-spawner, and would require major design changes to address (which would likely involve installing permanent "helper executables" at known locations on the compute nodes). However, these behaviors are explicitly permitted by the GASNet specification (and implicitly by UPC++).
Users encountering problems related to these remaining issues are advised to use a different spawner - UPC++ executables spawned using mpi-spawner and pmi-spawner (e.g. SLURM srun) usually do not exhibit these behaviors.
-
reporter - changed milestone to 2021.3.0 release
Mass roll-over of open issues to next release milestone
-
reporter - changed milestone to Deferred indefinitely
- removed responsible
Defer an issue we are unlikely to address without strong motivation
Surprising behavior (5) addressed in pull request #286, merged at 35ecaef