OpenMPI in tutorial containers binds multiple users' code to the same cores

Issue #2281 resolved
Roland Haas created an issue

By default, OpenMPI binds MPI ranks to specific cores. On a shared system like the tutorial server, it binds multiple users' MPI ranks to the same cores, so they compete for CPU time. For example:

16625 ubuntu    20   0    7588   1012    928 R  50.0  0.0   0:38.28 awk
16670 ubuntu    20   0    7588    888    804 R  50.0  0.0   0:31.65 awk
16671 ubuntu    20   0    7588    968    884 R  50.0  0.0   0:31.66 awk
16624 ubuntu    20   0    7588   1064    980 R  43.8  0.0   0:38.32 awk

when running the following in two separate logins in Jupyter notebooks:

!mpirun -n 2 /usr/bin/awk 'BEGIN{while(1);}' /dev/null

A fix is to add the following environment variable to the tutorial notebook:

OMPI_MCA_hwloc_base_binding_policy=none
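
One way to apply this from inside the notebook is sketched below. It assumes the notebook uses a Python kernel and that the cell runs before any `!mpirun` cell. The helper name `child_sees_policy` is made up for illustration. Commands started with `!` are child processes of the kernel, so they inherit variables set in `os.environ`.

```python
import os
import subprocess
import sys

# Variable name and value taken from this issue; setting it in the kernel's
# environment makes it visible to every later `!mpirun` in the notebook.
os.environ["OMPI_MCA_hwloc_base_binding_policy"] = "none"


def child_sees_policy() -> str:
    """Return the binding policy value that a child process inherits.

    Hypothetical helper: it spawns a small Python child process, the same
    way a `!command` cell spawns a child, and reports what that child sees.
    """
    result = subprocess.run(
        [
            sys.executable,
            "-c",
            "import os; print(os.environ.get('OMPI_MCA_hwloc_base_binding_policy', ''))",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_sees_policy())
```

If the printed value is `none`, later `!mpirun` cells in the same kernel session should pick up the setting. A kernel restart clears it, so the cell has to be run again.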
