Running with Simfactory’s default of 6 threads per rank and without SystemTopology results in poor core binding.
ibrun (before SystemTopology) outputs:
    IBRUN: Even number of ranks per node (4)--hacking nodefile...
    IBRUN: Running on 3 unique nodes
    IBRUN: Will place 4 ranks on comet-19-36 with 6 threads each
    IBRUN: Will place 4 ranks on comet-19-40 with 6 threads each
    IBRUN: Will place 4 ranks on comet-25-64 with 6 threads each
    IBRUN: Nodefile is /tmp/lTL5bQXemw
    IBRUN: MPI binding policy: illogical/arbitrary for 6 threads per rank (12 cores per socket)
    IBRUN: Adding OMPI_MCA_btl_openib_use_rd_max=2048 to the environment
    IBRUN: Adding OMPI_MCA_btl_openib_use_srq=1 to the environment
    IBRUN: Adding OMPI_MCA_btl=self,vader,openib to the environment
    IBRUN: Adding OMPI_MCA_btl_openib_ib_timeout=23 to the environment
    IBRUN: Added 4 new environment variables to the execution environment
    IBRUN: Command string is [orterun --bind-to core --map-by socket -n 12 ...
Looking at the resulting core binding, it binds thread 0 of every rank to core 0, i.e. it oversubscribes core 0 by the number of MPI ranks on the node.
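To make the oversubscription concrete, here is a small sketch that counts how many threads share each core. The binding table is hypothetical: it is constructed to match the description above (thread 0 of each of the 4 ranks on a node sits on core 0) and is not copied from the actual run. The placement of the other threads is not reported here, so they are left out. The function names are made up for illustration.

```python
from collections import Counter


def threads_per_core(bindings):
    """Count threads per core; bindings maps (rank, thread) -> core id."""
    return Counter(bindings.values())


def oversubscribed(bindings):
    """Return {core: thread_count} for every core holding more than one thread."""
    return {core: n for core, n in threads_per_core(bindings).items() if n > 1}


# Hypothetical table for one node: 4 ranks per node (from the ibrun output),
# with thread 0 of every rank bound to core 0 as described in the text.
RANKS_PER_NODE = 4
observed = {(rank, 0): 0 for rank in range(RANKS_PER_NODE)}

print(oversubscribed(observed))  # {0: 4}
```

With 4 ranks per node, core 0 carries 4 threads, which is the "oversubscribed by the number of MPI ranks" situation.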
The ibrun docs (https://www.sdsc.edu/support/user_guides/comet.html) say:
    --npernode <n>
        only launch n MPI ranks per node (default: ppn from resource manager)
    --tpr|--tpp|--threads-per-rank|--threads-per-process <n>
        how many threads each MPI rank (often referred to as 'MPI process') will spawn. (default: $OMP_NUM_THREADS (if defined), <ppn>/<npernode> if ppn is divisible by npernode, or 1 otherwise)
    --switches '<implementation-specific>'
        Pass additional command-line switches to the underlying implementation's MPI launcher. These WILL be overridden by any switches ibrun subsequently enables (default: none)
    -bp|--binding-policy <scatter|compact|none>
        Define the CPU affinity's binding policy for each MPI rank. 'scatter' distributes ranks across each binding level, 'compact' fills up a binding level before allocating another, and 'none' disables all affinity settings (default: optimized for job geometry)
    -bl|--binding-level <core|socket|numanode|none>
        Define the level of granularity for CPU affinity for each MPI rank. 'core' binds each rank to a single core; 'socket' binds each rank to all cores on a single CPU socket (good for multithreaded ranks); 'numanode' binds each rank to the subset of cores belonging to a numanode; 'none' disables all affinity settings. (default: optimized for job geometry)
Whatever it picked is far from optimal, though, given that it recognized that there are 6 threads per rank.
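For comparison, here is a minimal sketch of the documented default for threads per rank, together with one layout that would avoid overlap by giving each rank its own contiguous block of cores. Several things here are assumptions: 2 sockets per node (only "12 cores per socket" appears in the ibrun output), cores numbered contiguously socket by socket, and ppn equal to the full core count. The function names are hypothetical, and this is not what ibrun or SystemTopology actually does.

```python
def default_threads_per_rank(ppn, npernode, omp_num_threads=None):
    """Documented ibrun default: $OMP_NUM_THREADS if set, ppn/npernode if divisible, else 1."""
    if omp_num_threads is not None:
        return omp_num_threads
    if ppn % npernode == 0:
        return ppn // npernode
    return 1


def block_layout(npernode, threads_per_rank, cores_per_node):
    """Give each rank on a node its own contiguous, non-overlapping block of cores."""
    if npernode * threads_per_rank > cores_per_node:
        raise ValueError("more threads than cores on the node")
    return {
        rank: list(range(rank * threads_per_rank, (rank + 1) * threads_per_rank))
        for rank in range(npernode)
    }


CORES_PER_SOCKET = 12   # from the ibrun output
SOCKETS_PER_NODE = 2    # assumption, not stated in the output
cores_per_node = CORES_PER_SOCKET * SOCKETS_PER_NODE

tpr = default_threads_per_rank(ppn=cores_per_node, npernode=4)
layout = block_layout(npernode=4, threads_per_rank=tpr, cores_per_node=cores_per_node)

for rank, cores in layout.items():
    # Socket index assumes cores are numbered contiguously per socket.
    socket = cores[0] // CORES_PER_SOCKET
    print(f"rank {rank}: socket {socket}, cores {cores[0]}-{cores[-1]}")
```

Under these assumptions the rule yields 6 threads per rank, and each socket hosts two ranks of 6 cores each with no core shared between threads.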