CT_MultiLevel tests abort when run using more than one thread via simfactory

Issue #2289 resolved
Roland Haas created an issue

In git hash f813545 "CT_MultiLevel: request single-thread execution of all tests." of ctthorns, the CT_MultiLevel tests were changed to force a single OpenMP thread. This is a good thing, as there is a (known, harmless) race condition in the Gauss-Seidel sweep used by the code which, when more than a single thread is used, renders results non-deterministic.

Unfortunately the change interferes badly with Carpet's sanity check, which uses CACTUS_NUM_THREADS to verify that the number of threads requested in the RunScript agrees with the number of threads that Carpet sees in use during its SetupGH routine:

if (cactus_num_threads != mynthreads) {
  CCTK_VWarn(CCTK_WARN_ABORT, __LINE__, __FILE__, CCTK_THORNSTRING,
             "The environment variable CACTUS_NUM_THREADS is set to %d, "
             "but there are %d threads on this process. This may "
             "indicate a severe problem with the OpenMP startup "
             "mechanism.",
             cactus_num_threads, mynthreads);
}

where cactus_num_threads is read from the environment variable CACTUS_NUM_THREADS.

This currently causes the tests to fail on any cluster when using simfactory’s --testsuite option.
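
To see the failure mode in isolation, here is a small stand-alone sketch (not Carpet code; it merely mirrors the check quoted above). The test parameter files pin the run to one thread via Carpet::num_threads = 1, while simfactory's RunScript exports CACTUS_NUM_THREADS with the number of threads it was asked for, so the two counts disagree:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
  // Number of threads the RunScript promised (simfactory exports this).
  const char *env = std::getenv("CACTUS_NUM_THREADS");
  const int cactus_num_threads = env ? std::atoi(env) : -1;

  // Number of threads actually in use after the parameter file has forced
  // single-threaded execution (stand-in for Carpet::num_threads = 1).
  omp_set_num_threads(1);
  int mynthreads = 0;
#pragma omp parallel
  {
#pragma omp single
    mynthreads = omp_get_num_threads();
  }

  if (env && cactus_num_threads != mynthreads)
    std::printf("CACTUS_NUM_THREADS=%d but %d thread(s) in use -> Carpet aborts\n",
                cactus_num_threads, mynthreads);
  return 0;
}

Running this with, e.g., CACTUS_NUM_THREADS=2 set in the environment prints the mismatch; in Carpet the same comparison ends in CCTK_WARN_ABORT.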

keyword: CT_MultiLevel

Comments (10)

  1. Roland Haas reporter

    @Eloisa Bentivegna do you have a preferred way of trying to fix this? One option would be to add a parameter single_threaded to CT_MultiLevel and have every #pragma omp parallel carry an if clause, i.e. #pragma omp parallel if(!single_threaded).
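
    Roughly like this (a hypothetical sketch; the function, the loop, and the plumbing of the single_threaded parameter are made up, not CT_MultiLevel code):

      // The if clause decides whether a thread team is created at all:
      // with single_threaded = 1 the region executes on a single thread,
      // otherwise it is parallelised as usual.
      void smooth(const double *in, double *out, int n, int single_threaded) {
        #pragma omp parallel for if(!single_threaded)
        for (int i = 1; i < n - 1; ++i)
          out[i] = 0.5 * (in[i - 1] + in[i + 1]);
      }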

  2. Eloisa Bentivegna

    This seems to me like a Carpet issue rather than a CT_MultiLevel issue, as any other thorn using the parameter Carpet::num_threads would run into the same problem. There are basically two conflicting ways to specify the number of threads, which need to be reconciled in a way that retains the ability to specify the number of threads at the parfile level (so it can be used in the tests). I am not sure what to suggest in this direction.

    On the same note, CT_MultiLevel does not use raw OpenMP directives, but adopts the LoopControl headers (except for two minor utility functions), so the solution you suggest can’t be applied.

  3. Ian Hinder

    I think the best way to fix this would be to add the option to specify that the test requires a specific number of threads in test.ccl, as is done for the number of processes. So we would have nthreads as well as nprocs. The test system would then set CACTUS_NUM_THREADS and OMP_NUM_THREADS, and Carpet would find a consistent setup.

    This solution is probably also the most work to implement.

  4. Roland Haas reporter

    @Eloisa Bentivegna LoopControl’s LC_LOOP macros themselves do not enable any parallel threading. They need to be surrounded by a #pragma omp parallel region (see e.g. their use in Llama). Without one, they are just single-threaded. Having had a look at CT_MultiLevel’s code, it seems that there is no multi-threading going on. And indeed, if I run the poisson test with 4 threads I only see 1 core (per MPI rank) in use, even when removing the num_threads setting.
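
    As a stand-alone illustration (plain OpenMP, neither LoopControl nor CT_MultiLevel code): a work-sharing loop that is not enclosed in a parallel region is executed by a team of exactly one thread, which is the behaviour described above.

      #include <omp.h>
      #include <cstdio>

      int main() {
        // Orphaned work-sharing construct: there is no enclosing parallel
        // region, so the "team" consists of the encountering thread only.
        #pragma omp for
        for (int i = 0; i < 4; ++i)
          std::printf("no parallel region: i=%d, %d thread(s)\n",
                      i, omp_get_num_threads());

        // The same loop inside "#pragma omp parallel" is distributed over
        // OMP_NUM_THREADS threads.
        #pragma omp parallel
        {
          #pragma omp for
          for (int i = 0; i < 4; ++i)
            std::printf("in parallel region: i=%d, %d thread(s)\n",
                        i, omp_get_num_threads());
        }
        return 0;
      }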

    That is to say: even before the num_threads setting was introduced, CT_MultiLevel has always used only a single thread (with the exception of a reduction and the loops in CT_Analytic and CT_Dust).

    There was never a chance for a race condition (since only one thread was ever doing the work), and any claim I made about there being one in the Gauss-Seidel iteration was false, sorry.

    It certainly leaves me worried why your data, produced with gcc (4.8.2 admittedly) and -O2, would differ from what is produced on the tutorial server (also gcc) or my workstation. The affected test seems to be the boostedpuncture test, which now of course produces bit-identical results for me whether run with OMP_NUM_THREADS=1 or OMP_NUM_THREADS=64 on my workstation (gcc9, -O2, 24 cores, so I am oversubscribing it).

    My understanding is that Carpet::num_threads is meant for the case where one somehow cannot pass the OMP_NUM_THREADS variable to the executable (though I do not know of an MPI implementation that would not give me some way of passing environment variables); in that case one would set CACTUS_NUM_THREADS and the num_threads parameter to the same value, which does not cause Carpet to abort the run.

    For the upcoming release I would try reverting f813545 "CT_MultiLevel: request single-thread execution of all tests." of ctthorns and regenerating the data with a “recent” gcc compiler (7 or better, I’d say) using OMP_NUM_THREADS=1 (just to be sure).

    I am wondering if the issue is just the old gcc compiler used (and the differences being at roundoff level), but I will have to see if I can somehow get a gcc-4.8 compiler to run on my workstation.

  5. Eloisa Bentivegna

    Thanks @Roland Haas, interesting, I didn’t realize this earlier. I agree with reverting the single-thread commit and trying to get agreement with a more recent gcc. Also, the test obviously involves data which is very sensitive to roundoff, and this simply shouldn’t be the case. We have had recurring problems with this, as any perturbation of any origin ends up showing up there. Perhaps it’s a matter of turning this into a more sensible test in the first place.

  6. Roland Haas reporter

    True. I reverted the commit and re-generated all data on my workstation using gcc9 with -O2, 2 MPI ranks, and 1 thread per rank.

    This dataset then fails to pass on some of the clusters (e.g. stampede2, comet) but passes on others (and obviously on my workstation, and presumably on the Jenkins system since it is an Ubuntu VM).

    Running with a very high thread count I do see differences (on my workstation) for the reduced quantities, which I can kind of understand, since an OpenMP reduction will produce different answers depending on the number of threads used (and there is one in CT_MultiLevel).
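
    For example (plain OpenMP, not the actual CT_MultiLevel reduction; the summand is arbitrary): floating-point addition is not associative, and the per-thread partial sums are combined in an order that depends on the team size, so the last digits of the result change with the thread count.

      #include <omp.h>
      #include <cmath>
      #include <cstdio>

      int main() {
        const int n = 1000000;
        double sum = 0.0;

        // Each thread accumulates a private partial sum; the partial sums
        // are then combined in an order that depends on the team size.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
          sum += std::sin(1.0e-3 * i) / (i + 1.0);

        // Compare the output for OMP_NUM_THREADS=1 and OMP_NUM_THREADS=64:
        // the trailing digits typically differ at roundoff level.
        std::printf("sum = %.17g (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
      }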

    I have checked, and on Stampede2 the number of iterations output to stdout is the same as on my workstation, so it is not “simply” an issue of roundoff-level differences in the error norms triggering one more iteration.

    One will have to see whether the failing clusters all use aggressive Intel compiler optimizations or whether it is, e.g., something to do with vectorization.

    Unfortunately even with the recent changes the CT_MultiLevel tests still have a tendency to hang (just never finish) on the OSX Macports testing system, which has me very confused.

    Have you run the testsuites on a Mac (using MacPorts ideally) recently?

  7. Roland Haas reporter

    Commit reverted in git hash 82971d4 "Revert "CT_MultiLevel: request single-thread execution of all tests."" of ctthorns

    There are still failures on some clusters even after regenerating the data.

  8. Ian Hinder

    Since the main code is not OpenMP parallelised, could we remove the OpenMP reduction? Would that help? Would it be very bad for performance?

    Roland: is it just the computed constraints that fail?
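
    For illustration, removing it would amount to a plain serial accumulation along these lines (hypothetical names, not the actual CT_Utils.cc code); the summation order is then fixed, so the result no longer depends on the number of threads:

      #include <cmath>
      #include <cstddef>

      // Serial replacement for a "#pragma omp parallel for reduction(+:norm)"
      // loop over the residual: deterministic regardless of OMP_NUM_THREADS.
      double error_norm(const double *residual, std::size_t npoints) {
        double norm = 0.0;
        for (std::size_t i = 0; i < npoints; ++i)
          norm += residual[i] * residual[i];
        return std::sqrt(norm / npoints);
      }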

  9. Roland Haas reporter

    Removing the OMP reduction likely does not impact speed much, since it is only used to compute an error norm (the relevant lines are: https://bitbucket.org/eloisa/ctthorns/src/master/CT_MultiLevel/src/CT_Utils.cc#lines-255 and https://bitbucket.org/eloisa/ctthorns/src/master/CT_MultiLevel/src/CT_Utils.cc#lines-305). However, the errors do not show up only in the error norm output. For example, on Comet the log file for 2 ranks and 12 threads reads (showing only significant differences):

       admbase-curv.d.asc: substantial differences
          significant differences on 6 (out of 109) lines
          (insignificant differences on 88 lines)
       admbase-curv.x.asc: substantial differences
          significant differences on 6 (out of 182) lines
          (insignificant differences on 168 lines)
       admbase-curv.y.asc: substantial differences
          significant differences on 6 (out of 182) lines
          (insignificant differences on 164 lines)
       admbase-curv.z.asc: substantial differences
          significant differences on 6 (out of 109) lines
          (insignificant differences on 97 lines)
       ct_multilevel-auxiliaries.d.asc: substantial differences
          significant differences on 2 (out of 109) lines
          (insignificant differences on 16 lines)
       ct_multilevel-auxiliaries.x.asc: substantial differences
          significant differences on 2 (out of 182) lines
          (insignificant differences on 12 lines)
       ct_multilevel-auxiliaries.y.asc: substantial differences
          significant differences on 2 (out of 182) lines
          (insignificant differences on 19 lines)
       ct_multilevel-auxiliaries.z.asc: substantial differences
          significant differences on 2 (out of 109) lines
          (insignificant differences on 19 lines)
       ct_multilevel-coeffs.d.asc: substantial differences
          significant differences on 6 (out of 109) lines
          (insignificant differences on 25 lines)
       ct_multilevel-coeffs.x.asc: substantial differences
          significant differences on 6 (out of 182) lines
          (insignificant differences on 30 lines)
       ct_multilevel-coeffs.y.asc: substantial differences
          significant differences on 6 (out of 182) lines
          (insignificant differences on 26 lines)
       ct_multilevel-coeffs.z.asc: substantial differences
          significant differences on 6 (out of 109) lines
          (insignificant differences on 18 lines)
    

    meaning there are differences on the grid. In this case there were no significant differences in the _norm quantities.
