Difficulties with spawning new processes on the victim's node

Issue #33 resolved
George Bosilca created an issue

As reported on the ULFM mailing list, using a machinefile to restrict or drive the allocation of new processes is difficult.

Comments (2)

  1. George Bosilca reporter

    This issue is rooted in OMPI and is due to the forwarding of job-level constraints from the original job to all spawnees. In this particular case, adding "-npernode 1" prevents all future processes from sharing a node, across all jobids handled by the same HNP. In a normal MPI application such behavior might be desirable, but in the context of ULFM we need to be able to reuse nodes, that is, to respawn processes on a node where older processes have failed.

    Multiple solutions might be envisioned, but I think the cleanest one is to provide an info key that prevents inheritance of the original job's parameters. I have created an OMPI issue on this topic: open-mpi/ompi#5376.

  2. Aurelien Bouteiller

    open-mpi/ompi#5376 has been imported, along with a fix for the 'oversubscribe' non-propagation issue; this should resolve the problem.
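
For illustration, the situation discussed in this issue can be sketched with a two-node setup. This is only a sketch under stated assumptions: the host names, the machinefile name hosts.txt, and the binary ./ulfm_app are hypothetical. Only "-npernode 1" comes from the report itself; -machinefile is the usual mpirun option for supplying a host list.

```text
# hosts.txt (hypothetical machinefile, one entry per node)
node01
node02

# Launch one process per node (hypothetical application name)
mpirun -np 2 -npernode 1 -machinefile hosts.txt ./ulfm_app
```

As described in comment 1, the "-npernode 1" constraint is forwarded to every job spawned later under the same HNP. If the process on node02 fails and the application tries to spawn a replacement there, the inherited constraint is what gets in the way of reusing that node.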
