Simfactory's and ibrun's defaults bind multiple threads to the same core on Comet

Running with Simfactory’s default of 6 threads per rank and without SystemTopology results in incorrect core binding.

ibrun (before SystemTopology) outputs:

IBRUN: Even number of ranks per node (4)--hacking nodefile...
IBRUN: Running on 3 unique nodes
IBRUN: Will place 4 ranks on comet-19-36 with 6 threads each
IBRUN: Will place 4 ranks on comet-19-40 with 6 threads each
IBRUN: Will place 4 ranks on comet-25-64 with 6 threads each
IBRUN: Nodefile is /tmp/lTL5bQXemw
IBRUN: MPI binding policy: illogical/arbitrary for 6 threads per rank (12 cores per socket)
IBRUN: Adding OMPI_MCA_btl_openib_use_rd_max=2048 to the environment
IBRUN: Adding OMPI_MCA_btl_openib_use_srq=1 to the environment
IBRUN: Adding OMPI_MCA_btl=self,vader,openib to the environment
IBRUN: Adding OMPI_MCA_btl_openib_ib_timeout=23 to the environment
IBRUN: Added 4 new environment variables to the execution environment
IBRUN: Command string is [orterun --bind-to core --map-by socket -n 12 ...

and, looking at the core binding, it binds thread 0 of every rank to core 0, i.e. core 0 is oversubscribed by the number of MPI ranks on the node.
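
To make the oversubscription concrete, here is a minimal sketch of how such a check could be scripted. It assumes the per-thread CPU lists have already been collected (on Linux they appear as `Cpus_allowed_list` in `/proc/<pid>/task/<tid>/status`). The helper names and the sample data are hypothetical; the sample only mirrors the situation described above (thread 0 of each of the 4 ranks on a node pinned to core 0) and is not taken from an actual run.

```python
from collections import Counter


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0-2,5' into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def oversubscribed_cores(bindings):
    """Return {core: thread count} for cores with more than one thread pinned to them alone."""
    counts = Counter()
    for cpu_list in bindings:
        cores = parse_cpu_list(cpu_list)
        if len(cores) == 1:  # thread is bound to exactly one core
            counts[next(iter(cores))] += 1
    return {core: n for core, n in counts.items() if n > 1}


# Hypothetical sample: thread 0 of each of 4 ranks on one node, all bound to core 0.
thread0_bindings = ["0", "0", "0", "0"]
print(oversubscribed_cores(thread0_bindings))  # {0: 4}
```

With a correct binding, every thread would report a distinct core (or a disjoint set of cores per rank) and the function would return an empty dict.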

ibrun's docs (https://www.sdsc.edu/support/user_guides/comet.html) say

        --npernode <n>
            only launch n MPI ranks per node (default: ppn from resource manager)

        --tpr|--tpp|--threads-per-rank|--threads-per-process <n>
            how many threads each MPI rank (often referred to as 'MPI process') 
            will spawn.  (default: $OMP_NUM_THREADS (if defined), <ppn>/<npernode>
            if ppn is divisible by npernode, or 1 otherwise)

        --switches '<implementation-specific>'
            Pass additional command-line switches to the underlying implementation's
            MPI launcher.  These WILL be overridden by any switches ibrun 
            subsequently enables (default: none)

        -bp|--binding-policy <scatter|compact|none>
            Define the CPU affinity's binding policy for each MPI rank.  'scatter' 
            distributes ranks across each binding level, 'compact' fills up a 
            binding level before allocating another, and 'none' disables all 
            affinity settings (default: optimized for job geometry)

        -bl|--binding-level <core|socket|numanode|none>
            Define the level of granularity for CPU affinity for each MPI rank.  
            'core' binds each rank to a single core; 'socket' binds each rank to 
            all cores on a single CPU socket (good for multithreaded ranks); 
            'numanode' binds each rank to the subset of cores belonging to a
            numanode; 'none' disables all affinity settings. (default: optimized 
            for job geometry)

though whatever it picked is far from optimal, given that ibrun recognized that there are 6 threads per rank.
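
For reference, the default described for `--threads-per-rank` in the help text quoted above can be written out as a small sketch. The function name is hypothetical, and `ppn=24` is an assumption (two sockets with the 12 cores per socket that ibrun reports), not a value taken from the log.

```python
def default_threads_per_rank(ppn, npernode, omp_num_threads=None):
    """Default per the quoted help text: $OMP_NUM_THREADS if defined,
    ppn/npernode if ppn is divisible by npernode, or 1 otherwise."""
    if omp_num_threads is not None:
        return omp_num_threads
    if ppn % npernode == 0:
        return ppn // npernode
    return 1


# Assumed ppn=24 with 4 ranks per node, matching the 6 threads per rank in the log.
print(default_threads_per_rank(24, 4))  # 6
```

This only restates how the thread count is chosen; it says nothing about how ibrun then maps those threads to cores, which is where the binding goes wrong.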

