- changed milestone to ET_2010_11
CT_MultiLevel tests abort when run using more than one thread via simfactory
In git hash f813545 "CT_MultiLevel: request single-thread execution of all tests." of ctthorns, the CT_MultiLevel
tests were changed to force a single OpenMP thread. This is a good thing, as there is a (known, harmless) race condition in the Gauss-Seidel sweep that the code uses when more than a single thread is in use, which renders results non-deterministic.
Unfortunately the change interferes badly with the sanity check in Carpet that uses CACTUS_NUM_THREADS
to verify that the number of threads requested in the RunScript agrees with the number of threads that Carpet sees in use during its SetupGH routine:
if (cactus_num_threads != mynthreads) {
  CCTK_VWarn(CCTK_WARN_ABORT, __LINE__, __FILE__, CCTK_THORNSTRING,
             "The environment variable CACTUS_NUM_THREADS is set to %d, "
             "but there are %d threads on this process. This may "
             "indicate a severe problem with the OpenMP startup "
             "mechanism.",
             cactus_num_threads, mynthreads);
}
which uses the environment variable CACTUS_NUM_THREADS.
This currently causes the tests to fail on any cluster when using simfactory’s --testsuite
option.
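To illustrate the conflict, here is a minimal RunScript-style sketch (the variable names are the ones Carpet actually checks; the values are made up). The point is that the check above only passes when CACTUS_NUM_THREADS matches the OpenMP team size, which a parameter file forcing a single thread then breaks:

```shell
# Sketch of a RunScript fragment: Carpet's SetupGH check compares
# CACTUS_NUM_THREADS against the actual OpenMP team size. Exporting the
# two variables in sync satisfies the check -- but a parfile that then
# forces a single thread makes the actual team size 1 while
# CACTUS_NUM_THREADS still says 4, triggering the abort quoted above.
export OMP_NUM_THREADS=4
export CACTUS_NUM_THREADS=$OMP_NUM_THREADS
echo "CACTUS_NUM_THREADS=$CACTUS_NUM_THREADS"
```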
keyword: CT_MultiLevel
Comments (10)
-
reporter -
reporter - changed milestone to ET_2019_10
-
This seems to me like a Carpet issue rather than a CT_MultiLevel issue, as any other thorn using the parameter Carpet::num_threads would run into the same problem. There are basically two conflicting ways to specify the number of threads, which need to be reconciled in a way that retains the ability to specify the number of threads at the parfile level (so it can be used in the tests). I am not sure what to suggest in this direction.
On the same note, CT_MultiLevel does not use raw OpenMP directives, but adopts the LoopControl headers (except for two minor utility functions), so the solution you suggest can’t be applied.
-
I think the best way to fix this would be to add the option to specify that the test requires a specific number of threads in test.ccl, as is done for the number of processes. So we would have nthreads as well as nprocs. The test system would then set CACTUS_NUM_THREADS and OMP_NUM_THREADS, and Carpet would find a consistent setup.
This solution is probably also the most work to implement.
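A sketch of what such a test.ccl entry might look like, assuming an NTHREADS keyword modeled on the existing NPROCS option (NTHREADS is the proposal here and does not exist in the test system yet):

```
# Hypothetical test.ccl fragment: NPROCS is the existing per-test
# process count; NTHREADS is the proposed (not yet implemented)
# per-test thread count that the test system would propagate into
# CACTUS_NUM_THREADS and OMP_NUM_THREADS.
TEST poisson
{
  NPROCS   2
  NTHREADS 1
}
```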
-
reporter @Eloisa Bentivegna LoopControl's LC_LOOP macros themselves do not enable any parallel threading. They need to be surrounded by a #pragma omp parallel section (see e.g. their use in Llama); without one they are just single-threaded. Having had a look at CT_MultiLevel's code, it seems that there is no multi-threading going on. And indeed, if I run the poisson test with 4 threads, I only see 1 core (per MPI rank) in use, even when removing the num_threads setting.
That is to say: CT_MultiLevel has always used only a single thread (with the exception of a reduction and the loops in CT_Analytic and CT_Dust), even before the num_threads setting was introduced. There was never a chance for a race condition (since there was only one runner), and any claim I made about there being one in the Gauss-Seidel iteration was false, sorry.
It certainly leaves me worried why your data, produced with gcc (4.8.2, admittedly) and -O2, would differ from what is produced on the tutorial server (also gcc) or my workstation. The affected test seems to be the boostedpuncture test (which now of course produces bit-identical results for me whether run with OMP_NUM_THREADS=1 or OMP_NUM_THREADS=64 on my workstation (gcc9, -O2, 24 cores, so I am oversubscribing it)).
My understanding of what Carpet::num_threads can be used for is the case where one somehow cannot pass the OMP_NUM_THREADS variable to the executable (though I do not know of an MPI implementation that would not give me some way of passing environment variables), in which case one would set CACTUS_NUM_THREADS and the num_threads parameter to the same value, which will not cause Carpet to abort the run.
For the upcoming release I would try reverting f813545 "CT_MultiLevel: request single-thread execution of all tests." of ctthorns and regenerating the data with a "recent" gcc compiler (7 or better, I'd say) using OMP_NUM_THREADS=1 (just to be sure).
I am wondering if the issue is just the old gcc compiler used (and everything being roundoff), but will have to see if I can somehow get a gcc-4.8 compiler to run on my workstation.
-
Thanks @Roland Haas , interesting I didn’t realize this earlier… I agree to revert the single-thread commit, and try getting some agreement with a more recent gcc. Also, the test obviously involves data which is very affected by roundoff, and this simply shouldn’t be the case. We have had recurring problems with this as any perturbation of any origin ends up showing up there. Perhaps it’s a matter of turning this into a more sensible test in the first place.
-
reporter True. I reverted the commit and regenerated all data on my workstation using gcc9, -O2, 2 MPI ranks and 1 thread per rank.
This dataset then fails to pass on some of the clusters (e.g. stampede2, comet) but passes on others (and obviously on my workstation, and presumably on the Jenkins system, since it is an Ubuntu VM).
Running with a very high thread count I do see differences (on my workstation) for the reduced quantities, which I can sort of understand, since an OpenMP reduction will produce different answers depending on the number of threads used (and there is one in CT_MultiLevel).
I have checked, and on Stampede2 the number of iterations output to stdout is the same as on my workstation, so it is not "simply" an issue of roundoff-level differences in the error norms triggering one more iteration.
One will have to see if the failing clusters all use aggressive Intel compiler optimizations or if it is e.g. something to do with vectorization.
Unfortunately, even with the recent changes, the CT_MultiLevel tests still have a tendency to hang (just never finish) on the OSX MacPorts testing system, which has me very confused. Have you run the testsuites on a Mac (using MacPorts, ideally) recently?
-
reporter - changed status to resolved
-
Since the main code is not OpenMP parallelised, could we remove the OpenMP reduction? Would that help? Would it be very bad for performance?
Roland: is it just the computed constraints that fail?
-
reporter Removing the OMP reduction likely does not impact speed a lot, since it is only used to compute an error norm (the relevant lines are: https://bitbucket.org/eloisa/ctthorns/src/master/CT_MultiLevel/src/CT_Utils.cc#lines-255 and https://bitbucket.org/eloisa/ctthorns/src/master/CT_MultiLevel/src/CT_Utils.cc#lines-305). However, the errors do not only show up in the error-norm output. For example, on Comet the log file for 2 ranks and 12 threads reads (only showing significant differences):
admbase-curv.d.asc: substantial differences; significant differences on 6 (out of 109) lines (insignificant differences on 88 lines)
admbase-curv.x.asc: substantial differences; significant differences on 6 (out of 182) lines (insignificant differences on 168 lines)
admbase-curv.y.asc: substantial differences; significant differences on 6 (out of 182) lines (insignificant differences on 164 lines)
admbase-curv.z.asc: substantial differences; significant differences on 6 (out of 109) lines (insignificant differences on 97 lines)
ct_multilevel-auxiliaries.d.asc: substantial differences; significant differences on 2 (out of 109) lines (insignificant differences on 16 lines)
ct_multilevel-auxiliaries.x.asc: substantial differences; significant differences on 2 (out of 182) lines (insignificant differences on 12 lines)
ct_multilevel-auxiliaries.y.asc: substantial differences; significant differences on 2 (out of 182) lines (insignificant differences on 19 lines)
ct_multilevel-auxiliaries.z.asc: substantial differences; significant differences on 2 (out of 109) lines (insignificant differences on 19 lines)
ct_multilevel-coeffs.d.asc: substantial differences; significant differences on 6 (out of 109) lines (insignificant differences on 25 lines)
ct_multilevel-coeffs.x.asc: substantial differences; significant differences on 6 (out of 182) lines (insignificant differences on 30 lines)
ct_multilevel-coeffs.y.asc: substantial differences; significant differences on 6 (out of 182) lines (insignificant differences on 26 lines)
ct_multilevel-coeffs.z.asc: substantial differences; significant differences on 6 (out of 109) lines (insignificant differences on 18 lines)
meaning there are differences on the grid. In this case there were no significant differences in the _norm quantities.
@Eloisa Bentivegna do you have a preferred way to try and fix this? One would be to add a parameter single_threaded to CT_MultiLevel and have all #pragma omp parallel directives contain an if clause: #pragma omp parallel if(!single_threaded).