hack to unfix OpenMP threads from cores on Cray machines

Issue #774 closed
Roland Haas created an issue

Attached is a simple (and not terribly general) patch that makes Carpet try to spread out the OpenMP threads if the parameter cray_thread_affinity_hack is set.

Right now something like this is required to avoid a 50% slowdown on Kraken and Hopper. Do we want something like this in the official sources (eventually made a bit more general, I'd hope), or is there a way to achieve this via, say, numactl, or do we want to keep Carpet clean of hacks for individual machines?

The patch was tried by me and Christian and does indeed unfix the threads. It is currently specific to a particular arrangement of threads: for n threads per process and m MPI processes, Kraken seems to affix all the threads of process 0 to core 0, all the threads of process 1 to core n, those of process 2 to core 2n, etc. The patch makes the threads of process j affine to cores j*n, ..., (j+1)*n - 1.
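For illustration, the per-process core range the patch aims for can be written down as a tiny function (a sketch only; the real patch manipulates OS affinity masks, and the names `target_core`/`affinity_map` are mine, not from the patch):

```python
def target_core(rank, thread, threads_per_proc):
    """Core that thread `thread` of MPI process `rank` should be pinned to,
    assuming process j owns cores j*n, ..., (j+1)*n - 1 (n = threads/process)."""
    return rank * threads_per_proc + thread

def affinity_map(num_procs, threads_per_proc):
    """Map (rank, thread) -> core for all processes on one node."""
    return {(j, t): target_core(j, t, threads_per_proc)
            for j in range(num_procs)
            for t in range(threads_per_proc)}
```

With 2 processes of 6 threads each this spreads the 12 threads over cores 0-11, instead of piling 6 threads each onto cores 0 and 6 as Kraken does by default.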

Keyword: hack

Comments (10)

  1. Erik Schnetter
    • changed status to resolved

    I believe that the process affinity is inherited across exec. That means that one can write a wrapper that first sets the affinity, and then execs the Cactus executable.

    In fact, such a program already exists; it is called "numactl". I just tried on Hopper -- it does not run because it cannot find certain libraries. Maybe rebuilding numactl would already correct this problem?

    You would use it similarly to "aprun -bla numactl -bla cactus_sim -bla my.par".

    I applied your patch with certain modifications. For example, instead of calling it "Cray hack" I view this as a generic feature that may also be useful elsewhere.

  2. Roland Haas reporter

    The current implementation in Carpet will (I believe) warn on Kraken even when everything is ok, since the run was started with aprun -cc numa_node. In that case, 6 threads are bound to cores 0-5 and 6 threads are bound to cores 6-11 (i.e., half to one socket and half to the other), presumably to avoid having to access memory belonging to the "other" socket.

    Since Carpet checks (on the root thread only) that the number of cores on which this thread may run is at least as large as the number of threads requested, the test fails (since any given thread is only allowed to run on half the cores).
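    A minimal sketch of that check, assuming Linux and using Python's `os.sched_getaffinity` in place of Carpet's C++ code (the function name is mine):

```python
import os

def root_thread_affinity_ok(num_threads):
    # Cores the calling (root) thread is allowed to run on (Linux-only API).
    allowed = os.sched_getaffinity(0)
    # Carpet-style check: fine only if there are at least as many allowed
    # cores as requested threads.
    return len(allowed) >= num_threads
```

    Under `aprun -cc numa_node` on Kraken this returns False for 12 threads, because the root thread is confined to the 6 cores of one socket, even though the process as a whole may use all 12.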

  3. Erik Schnetter

    The current implementation compares the number of threads of a process (6) with the number of cores on which the process can run (6), and therefore should not warn.

    What warning do you see?

  4. Roland Haas reporter

    The issue occurs (e.g., on Kraken, where each node has two CPU sockets and each socket houses 6 cores) if one runs with 12 threads. Then I get:

```
INFO (Carpet): MPI is enabled
INFO (Carpet): Carpet is running on 1 processes
WARNING level 2 in thorn Carpet processor 0 host nid11406 (line 212 of /nics/c/home/rhaas/Zelmani/arrangements/Carpet/Carpet/src/SetupGH.cc):
  -> Although MPI is enabled, the environment variable CACTUS_NUM_PROCS is not set.
INFO (Carpet): This is process 0
INFO (Carpet): OpenMP is enabled
INFO (Carpet): This process contains 12 threads
WARNING level 2 in thorn Carpet processor 0 host nid11406 (line 246 of /nics/c/home/rhaas/Zelmani/arrangements/Carpet/Carpet/src/SetupGH.cc):
  -> Although OpenMP is enabled, the environment variable CACTUS_NUM_THREADS is not set.
INFO (Carpet): There are 12 threads in total
INFO (Carpet): There are 12 threads per process
INFO (Carpet): Host listing: host 0: "nid11406"
INFO (Carpet): Host/process mapping: process 0: host 0 "nid11406"
INFO (Carpet): Host mapping: This is process 0, host 0 "nid11406"
INFO (Carpet): This process runs on host nid11406, pid=8985
INFO (Carpet): This process runs on 6 cores: 0-5
WARNING level 1 in thorn Carpet processor 0 host nid11406 (line 383 of /nics/c/home/rhaas/Zelmani/arrangements/Carpet/Carpet/src/SetupGH.cc):
  -> The number of threads for this process is larger its number of cores. This may indicate a performance problem.
```

    This was run with:

```
aprun -cc numa_node -n @NUM_PROCS@ -d @NUM_THREADS@ ${NODE_PROCS} ${SOCKET_PROCS} @EXECUTABLE@ -L 3 @PARFILE@
```

    and

```
create-submit affinitytest --procs 12 --num-threads 12 --walltime 0:05:00
```

    For this situation the Cray affinity display utility outputs:

```
Hello from rank 0, thread 1, on nid01593. (core affinity = 0-5)
Hello from rank 0, thread 2, on nid01593. (core affinity = 0-5)
Hello from rank 0, thread 3, on nid01593. (core affinity = 0-5)
Hello from rank 0, thread 4, on nid01593. (core affinity = 0-5)
Hello from rank 0, thread 5, on nid01593. (core affinity = 6-11)
Hello from rank 0, thread 6, on nid01593. (core affinity = 6-11)
Hello from rank 0, thread 7, on nid01593. (core affinity = 6-11)
Hello from rank 0, thread 8, on nid01593. (core affinity = 6-11)
Hello from rank 0, thread 11, on nid01593. (core affinity = 0-5)
Hello from rank 0, thread 0, on nid01593. (core affinity = 0-5)
Hello from rank 0, thread 9, on nid01593. (core affinity = 6-11)
Hello from rank 0, thread 10, on nid01593. (core affinity = 6-11)
```

    The reason for the warning is that each individual thread is given 6 cores to run on, not 12. This is actually a sensible setting, since it prevents threads from migrating from one socket to the other, which can impact memory bandwidth. It would even be conceivable to bind each thread to exactly one core, which is fine as long as no two threads '''have''' to run on the same core.

    I think a sharp version of the criterion would be: given the allowed affinity sets of all threads of all processes on this cluster node, is there a way of spreading the threads over the cores so that no core is occupied by more than one thread? If not, warn. This seems like a hard thing to test for (though it sounds suspiciously like one of those travelling-salesman-like problems [it is certainly not the travelling salesman problem]) :-( .
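    For what it's worth, this criterion is exactly a bipartite matching feasibility question (threads on one side, cores on the other), which is decidable in polynomial time rather than being TSP-hard. A minimal sketch, assuming each thread's allowed cores are given as a Python set (the function names are mine):

```python
def can_place_all(affinities):
    """Return True iff every thread can be given its own core, where
    affinities[t] is the set of cores thread t is allowed to run on.
    Classic augmenting-path bipartite matching."""
    match = {}  # core -> thread currently holding it

    def try_place(t, visited):
        for core in affinities[t]:
            if core in visited:
                continue
            visited.add(core)
            # Take a free core, or evict the holder if it can move elsewhere.
            if core not in match or try_place(match[core], visited):
                match[core] = t
                return True
        return False

    return all(try_place(t, set()) for t in range(len(affinities)))
```

    With the `-cc numa_node` layout above (six threads allowed on cores 0-5, six on cores 6-11) this returns True, so no warning would be issued; for a layout where two threads are both pinned only to core 0 it returns False.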

  5. Erik Schnetter

    It seems that CPU affinity is a thread property, not a process property.

    Given this, Carpet's checking needs to be updated. The easiest solution is to take the union of all allowed CPUs over all threads of a process, and check this. This is not as strict as possible, but good enough -- simple mismatches will be detected.
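    A sketch of that weaker union check, under the same assumptions as above (per-thread affinity sets given as Python sets; the function name is mine):

```python
def union_affinity_ok(per_thread_affinities):
    """Per-process check: the union of all threads' allowed cores must be
    at least as large as the number of threads.  Weaker than exact matching,
    but catches simple mismatches."""
    allowed = set().union(*per_thread_affinities)
    return len(allowed) >= len(per_thread_affinities)
```

    For the 12-thread Kraken run the union of cores 0-5 and 6-11 has 12 elements, so the spurious warning disappears; a process whose 12 threads are all confined to cores 0-5 still trips the check.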

  6. Roland Haas reporter

    This patch detects faulty assignments when not using -cc numa_node for 2, 3, and 12 threads on Kraken, and does not misdetect 12, 6, 3, or 2 threads when using -cc numa_node (in the latter case the affinity is '''always''' to either cores 0-5 or 6-11).

    A way to make it fail would be:

```
thread 0: affine to cores 0,1,2,3,4,5
thread 1: affine to core 0
thread 2: affine to core 0
```

    i.e., only thread 0 has its affinity properly set. Kraken does not seem to produce this assignment, though (its assignments without -cc numa_node are otherwise unusual: it does spread threads around a bit, but seems to always avoid core 1 unless one uses numa_node).

    Please apply.
