Difficulties with spawning new processes on the victim's node
Issue #33
resolved
As reported on the ULFM mailing list, it is difficult to use a machinefile to restrict or drive the allocation of newly spawned processes.
Comments (2)
reporter -
- changed status to resolved
open-mpi/ompi#5376 has been imported, and the 'oversubscribe' non-propagation issue has been fixed as well; this should resolve the problem.
This issue is rooted in OMPI and is due to the forwarding of job-level constraints from the original job to all spawnees. In this particular case, adding "-npernode 1" prevents all future processes from sharing a node, across all jobids handled by the same HNP. In a normal MPI application such behavior might be desired, but in the context of ULFM we need to be able to reuse nodes, which means respawning processes on a node where older processes failed.
Multiple solutions might be envisioned, but I think the cleanest one is to provide an info key that prevents inheritance of the original job parameters. I have created an OMPI issue on this topic: open-mpi/ompi#5376.
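To make the failure mode concrete, here is a toy model of the behavior described above. It is only a sketch and not Open MPI's actual mapper: the `ToyHNP` class, its `launch` method, and the `prefer` argument are hypothetical names, and it assumes that the per-node cap counts every process ever mapped on a node, including ones that have since failed.

```python
from collections import Counter


class ToyHNP:
    """Hypothetical stand-in for an HNP that tracks mappings across all jobs."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Assumption: processes are counted per node across all jobids,
        # and a failed process is never subtracted.
        self.mapped = Counter()

    def launch(self, nprocs, npernode=None, prefer=None):
        """Map nprocs processes, honoring an optional per-node cap."""
        candidates = self.nodes if prefer is None else [prefer]
        placement = []
        for node in candidates:
            while len(placement) < nprocs and (
                npernode is None
                or self.mapped[node] + placement.count(node) < npernode
            ):
                placement.append(node)
        if len(placement) < nprocs:
            raise RuntimeError("no node satisfies the mapping constraints")
        for node in placement:
            self.mapped[node] += 1
        return placement


hnp = ToyHNP(["node0", "node1"])

# Original job, launched with the equivalent of "-npernode 1".
print(hnp.launch(2, npernode=1))  # ['node0', 'node1']

# The process on node1 fails. A respawn that inherits the cap is rejected.
try:
    hnp.launch(1, npernode=1, prefer="node1")
except RuntimeError as err:
    print("inherited constraint:", err)

# A respawn that does not inherit the cap can reuse the node.
print(hnp.launch(1, npernode=None, prefer="node1"))  # ['node1']
```

The last call corresponds to what the proposed info key would allow: the spawned job ignores the cap set on the original job, so the node of the failed process can be reused.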