Running LoopControl on a strange number of threads fails
My machine has 8 cores (according to /proc/cpuinfo). Running, e.g., the Trigger test with 3 threads fails inside LoopControl.
To reproduce:
export OMP_NUM_THREADS=3
mpirun -n 2 exe/cactus_bns_all arrangements/AEIThorns/Trigger/test/trigger.par
Keyword: LoopControl
Comments (5)
-
reporter -
I recently attempted to run the gallery example on Deep Bayou. I set OMP_NUM_THREADS=4, and asked for 12 procs to run on the node (which has 48 cores according to lscpu). Cactus failed with this message:

cactus_sim: /nvme/sbrandt/Cactus/arrangements/Carpet/LoopControl/src/loopcontrol.cc:264: T <unnamed>::divexact(T, T) [with T = int]: Assertion `i % j == 0' failed.

It also said

INFO (Carpet): This process runs on 24 cores

At Roland’s suggestion, I tried adding LoopControl::use_smt_threads = "no" to the par file and all was well. Maybe “no” is a more sensible default?

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
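For readers unfamiliar with the assertion, here is a minimal sketch of what a helper like divexact presumably does. Only the name, the signature and the asserted condition come from the error message above; the function body and the extra helper divides_exactly are assumptions for illustration, not the actual LoopControl source.

```cpp
#include <cassert>

// Hypothetical reconstruction: name, signature and the asserted condition
// are taken from the error message; the body is an assumption.
template <typename T> T divexact(T i, T j) {
  assert(i % j == 0); // aborts when j does not divide i evenly
  return i / j;
}

// Illustrative helper (not part of LoopControl): checks a pair of values
// without aborting, so a mismatch can be reported instead.
template <typename T> bool divides_exactly(T i, T j) {
  return j != 0 && i % j == 0;
}
```

With this sketch, divexact(24, 4) returns 6, whereas a call such as divexact(4, 24) would trip the assertion. Which values LoopControl actually passes at line 264 is not stated in this report.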
-
Steve
You seem to expect to use 4 cores per process, and Carpet says there are 24 cores per process. It might be that your job startup isn’t working right. I would debug this first, before looking at LoopControl.
-
I set and exported OMP_NUM_THREADS=4 and manually asked MPI to run with 12 procs. Is that considered wrong/buggy? I suppose I could also have set CACTUS_NUM_THREADS and CACTUS_NUM_PROCS. Regardless, I thought using SMT threads was generally expected to be unhelpful and was to be avoided by default?
-
Setting these variables is fine. Setting the other variables (CACTUS_...) only allows Cactus to check whether things actually worked out as intended; they are only used for checking.

If you set OMP_NUM_THREADS=4 and expect there to be one thread per core, and Cactus later thinks it’s running on 12 cores, then something went wrong. Did you look at the output of omp_max_threads()? Did your environment variable actually make it to Cactus? Did you actually start one job with 12 processes, or accidentally 12 individual processes that know nothing about each other? Did the queuing system get confused and set up a cgroup with a different number of cores? Lots of things can go wrong.

The error message might come from LoopControl trying to determine how to split 4 threads of 12 cores. The resulting 1/3 threads per core might have caused the problem (although it shouldn’t).
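Two of the questions above (did the environment variable arrive, and how many cores is the process allowed to use) can be checked from inside a process. The following is a rough, Linux-specific sketch; the function names are hypothetical, and this is not how Carpet computes the core count it prints.

```cpp
#include <sched.h>  // sched_getaffinity, CPU_COUNT (Linux/glibc; g++ defines _GNU_SOURCE)
#include <climits>  // INT_MAX
#include <cstdlib>  // std::getenv, std::strtol

// Number of logical CPUs this process is allowed to run on, or -1 on error.
// A restricted CPU affinity mask (e.g. set by the job launcher) shows up here.
int allowed_cpu_count() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0)
    return -1;
  return CPU_COUNT(&set);
}

// Leading integer of OMP_NUM_THREADS as seen by this process,
// or -1 if the variable is unset or does not start with a positive number.
int omp_num_threads_from_env() {
  const char *s = std::getenv("OMP_NUM_THREADS");
  if (!s)
    return -1;
  char *end = nullptr;
  long n = std::strtol(s, &end, 10);
  return (end != s && n > 0 && n <= INT_MAX) ? static_cast<int>(n) : -1;
}
```

Printing both values from every MPI rank would show whether each process really sees OMP_NUM_THREADS=4 and how many cores it was given.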
This still happens even with current (Sun Mar 22 18:46:21 CET 2015) trunk, though failure looks a bit different now: