McLachlan constraint tests fail

Issue #1995 closed
Ian Hinder created an issue

Several tests fail on several machines, and the cause seems to be the constraints calculated by McLachlan. Peter is looking into this. We see test failures in the thorns Dissipation and RotatingSymmetry90/180. Indications are that most failures happen with Intel 15; some are apparently also seen with Intel 16, while other Intel 16 installations seem to work fine. Using -O1 instead of -O2 seems to prevent the problem, but is not a viable workaround. A simple 'print' statement in the affected (auto-generated) code also makes the problem disappear. Running with one MPI process and one OpenMP thread reproduces the problem. valgrind does not find anything obvious pointing to memory corruption.

Keyword: McLachlan
Keyword: constraints
Keyword: tests
Keyword: compiler
Keyword: optimization

Comments (49)

  1. Frank Löffler
    • removed comment

    Redefining 'restrict' to be empty also does not solve the problem, so this appears to be different from the older restrict issue in earlier versions of the Intel compiler.

  2. Wolfgang Kastaun
    • removed comment

    Maybe related: I just found huge constraint violations in a BNS simulation. This already affects the initial data, and the run otherwise behaves normally, so I assume the problem is only in the calculation of the constraints. The momentum constraints differ, but only by 10% or so. The maximum Hamiltonian constraint violation corresponds to around 16 pi times the maximum mass density. This could mean that either the matter terms are not added (or are added twice), or that the spacetime part is added twice. The latter would not be that obvious in a vacuum simulation, being only a factor of 2. How big was the difference for the failed tests?

    I am using the Payne release compiled with 17.0.1, and use ML_CCZ4 for spacetime evolution.

  3. Wolfgang Kastaun
    • removed comment

    For the case of my run, the 2-norm also differed a lot, so probably it was not just the boundary.

  4. Wolfgang Kastaun
    • removed comment

    I did not use Llama (unless it is active by default now; this is the first run I did using the Payne release).

  5. Wolfgang Kastaun
    • removed comment

    I just looked at runs of TOV stars compiled with gcc 4.9.2 versus Intel 17.0.1, both using the Payne release. The constraints for gcc look normal, while for Intel the Hamiltonian constraint is around 3 orders of magnitude larger.

    Also, disregard my comment about the error for the BNS being of the order of the matter terms. I made an error; the Hamiltonian constraint was actually around 100 times larger than that.

  6. Ian Hinder reporter
    • removed comment

    Do the Intel constraints look OK if you disable as much optimisation as possible? e.g. -O0, possibly -fp-model source. Intel 17 seems to be a bit dodgy still. Do you get the same problems with earlier versions of Intel?

  7. Frank Löffler
    • removed comment

    How do the constraints computed using ADMConstraints compare to those calculated by McLachlan directly?

  8. anonymous
    • removed comment

    Replying to [comment:9 knarf]:

    How do the constraints computed using ADMConstraints compare to those calculated by McLachlan directly?

    I did not check. The cases I provided were using ML_ADMConstraints::ML_Ham / ML_Mom. Also, can someone remind me what the differences between the two are?

  9. anonymous
    • removed comment

    Replying to [comment:8 hinder]:

    Do the Intel constraints look OK if you disable as much optimisation as possible? e.g. -O0, possibly -fp-model source. Intel 17 seems to be a bit dodgy still. Do you get the same problems with earlier versions of Intel?

    With Intel 17.0.2 and -O0, the problem is gone (using ML_ADMConstraints::ML_Ham / ML_Mom). There is, however, still a small difference from the gcc results; the relative difference in H is around 5e-5. Also, with Intel 17.0.2 and -O3, there is a significant difference when going from 80 to 40 MPI processes (4 threads each): H differs by around 6%.

  10. anonymous
    • removed comment

    I just tried with Intel 15.0.4 and -O3 on a different cluster (hydra), and the bug is still the same.

    Maybe it is not a compiler bug but a code bug of the undefined-behavior variety that only manifests with certain compiler/optimization settings.

  11. Ian Hinder reporter
    • removed comment

    Summary of problem from the above reports:

    * Test failures in thorns Dissipation and RotatingSymmetry90/180 (which tests?)
    * Constraints computed using ML_ADMConstraints, not ML_BSSN (is this true for all the failing tests?)
    * Failures occur on several machines (which machines?)
    * Tests do not fail in Jenkins (gcc)
    * See failures with Intel 15 and 16, but not all Intel 16 machines fail (do all Intel 15 machines fail?)
    * Failures with -O2 but not with -O1
    * Adding a print statement to the code makes the problem go away
    * Problem is reproducible using 1 process and 1 thread
    * valgrind does not find any problem
    * Defining "restrict" to empty does not solve the problem
    * NaNs in initial constraints in unigrid multi-block Minkowski
    * May be related to boundary initialisation, but maybe not
    * Large constraints in BNS simulation
        * Intel 17.0.1
        * ML_ADMConstraints::ML_Ham / ML_Mom
    * Large constraints in TOV star
        * Constraints 3 orders of magnitude larger with Intel 17.0.1 than with GCC
        * ML_ADMConstraints::ML_Ham / ML_Mom
    * Problem seen with Intel 17.0.2 and -O3
    * Problem goes away with Intel 17.0.2 and -O0
    * With Intel 17.0.2 and -O3, the values of the constraints depend on the number of MPI processes
    * Problem seen with Intel 15.0.4 and -O3

  12. Wolfgang Kastaun
    • removed comment

    I am re-running the TOV test with Intel 17 and -O3, but now with the Brahe release. So far the constraints look normal.

  13. Frank Löffler
    • removed comment

    Update from Peter Diener:

    I have made a little progress, but am still confused.

    First I made sure that I could reproduce the problem on my laptop with the Intel 17 compilers. Then I made a configuration with no optimisation (-O0) and ensured that the testsuite passes in this case.

    Then, as the default McLachlan produces explicitly vectorized code, I also made sure that I could reproduce the same behaviour with the non-vectorized version (i.e. ML_BSSN_NV). This turned out to be the case. I then inserted a printf statement at the end of the loop and printed out the calculated data for H at the coordinates that are output to ml_admconstraints-ml_ham.x.asc. I found that the same numbers are calculated in both the -O2 and -O0 versions of the executables. However, and this is very interesting, those numbers do not match either of the output files produced by the -O2 and -O0 executables, as you can see from the attached plot, where I plot the data from the Cactus output files for -O0 (purple plus) and -O2 (green cross), the printf output from -O0 (blue asterisk) and -O2 (orange empty square), as well as the actual test suite data (yellow filled square). I made sure that I printed from exactly the same coordinates (x, y and z) that are available in the Cactus output files. So it looks to me like the data gets modified after being calculated and before getting output, and somehow the modification is different at different optimization levels.

    I'm not sure where this happens.

  14. Ian Hinder reporter
    • removed comment

    Are you looking at the right thorn? There are constraints calculated in both ML_BSSN and ML_ADMConstraints. This problem affects ML_ADMConstraints, which is the thorn that outputs the data for the tests. ML_BSSN outputs the constraints calculated from the BSSN variables, vs ML_ADMConstraints which outputs them calculated from the ADM variables.

  15. Peter Diener
    • removed comment

    Earlier today I realized that I had indeed added the printf statements to the wrong constraint function. Doing the same with the correct one (i.e. ML_ADMConstraints) the behavior of the -O2 executable changed and agreed with the -O0 executable. Next, I'll be trying to run both versions in a debugger (without the printf statement) and see if I can pinpoint where the -O2 executable goes wrong.

  16. Peter Diener
    • removed comment

    Also, compiling ML_ADMConstraints with -O1 leads to passing the otherwise failing testsuites on my laptop with the Intel 17 compilers. I attempted to run both the -O0 and -O2 executables in gdb in order to try to track down where the differences occurred. Unfortunately the -O2 version is too optimized to get anything sensible out of the debugger. Many variables (including the loop indices and local temporary variables) are reported as "optimized away", so it is impossible to even know if I'm looking at the same loop iteration as in the -O0 debugging session. At this point I have no further ideas about how to proceed in tracking down the cause of the problem. I recommend (if possible) limiting the optimization level to -O1 for this file for the affected compilers, though it will have to be tested whether this works in all cases.

  17. Peter Diener
    • removed comment

    I should probably also mention that valgrind didn't report any significant memory issues.

  18. Ian Hinder reporter
    • removed comment

    Do any of the machines supported in simfactory show this problem? Can you point to (or attach) the optionlist that you used to reproduce this problem?

  19. Frank Löffler
    • removed comment

    Since using -fp-model precise also seems to solve the issue, I suggest testing this:

    #if __INTEL_COMPILER >= 1500
    #pragma float_control (precise, on)
    #endif

    (untested so far, taken from documentation)

  20. Peter Diener
    • removed comment

    I have added the optionlist I use on my laptop.

    After today's call I experimented with adding -fp-model precise and found that the problem disappeared. In order to investigate the sensitivity of this test to roundoff errors, I then used the Noise thorn to add noise to the initial data with an amplitude of 1e-6. The differences between running with -O2 with and without noise were of order a few times 1e-5. Similarly, the differences between running with -O2 -fp-model precise with and without noise were of the same order of magnitude. On the other hand, the difference between running with -O2 with and without -fp-model precise is of order unity. This suggests (though probably doesn't prove) that the test is not overly sensitive to roundoff error. Rather, I think -fp-model precise turns off the buggy optimization.

    I will now test Frank's suggestion.

  21. Roland Haas
    • removed comment

    On the technical side, since this is Kranc-generated code: how would we (reliably, reproducibly and understandably) add the pragma to the C++ code?

  22. Ian Hinder reporter
    • removed comment

    I don't think this is possible at the moment. Could it be achieved by appending something to the make.code.defn file? If so, then the Kranc MergeFiles thorn option could be used; it allows you to append arbitrary text to any file in the thorn. Maybe we can add something to this makefile which modifies the optimisation flags. That would then apply to the whole thorn rather than just the one source file, but that's probably fine in this case.

  23. Ian Hinder reporter
    • removed comment

    I have added a proof-of-concept example to a branch in McLachlan; see the admcons_intelbug branch. It uses the Kranc "MergeFiles" option to provide a make.code.deps file which tests for the Intel compiler version and adds -fp-model source to CXXFLAGS. We would need to determine which compiler versions are affected and adjust the logic accordingly. Other thorns may also need to know about this bug, in which case it would be better to detect it elsewhere in Cactus, but we don't have any such examples just yet.
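
    For reference, the shape of such a make.code.deps fragment might look as follows. This is a hypothetical sketch, not the contents of the branch: the -dumpversion probe, the version list and the exact flag are assumptions that would need to be checked against the real code.

```make
# Hypothetical sketch only -- consult the actual McLachlan branch for the
# real logic.  Assumes $(CXX) is Intel's icpc, which answers -dumpversion
# with e.g. "17.0.4".
ICPC_VERSION := $(shell $(CXX) -dumpversion 2>/dev/null)
ifneq (,$(filter 15.% 16.% 17.% 18.%,$(ICPC_VERSION)))
CXXFLAGS += -fp-model source
endif
```

    Keeping the test in the thorn's own makefile fragment confines the workaround to ML_ADMConstraints.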

  24. Roland Haas
    • removed comment

    (the same comment is also in the diff for Ian's branch; I just don't know how permanent those comments are)

    This must read: ...

    in both lines 7 and 8, i.e. give the name of the make target.

  25. Frank Löffler
    • removed comment

    The patch tests the C compiler, not the C++ compiler (they should agree in their version, but we should still use the C++ compiler).

    The versions that it tests at the moment are likely not sufficient. We are unlikely to have the time at the moment to test all the different versions around, and we know that version 15 and some version 16 installations fail as well. Thus, I suggest setting the 'broken' flag for >= 15, at least for now.

  26. Ian Hinder reporter
    • removed comment

    I would like to see some solid data for which versions of the compiler have this problem. At the moment, the reports are too anecdotal.

    I am having trouble finding information about this problem, having never seen it myself. From this ticket, I deduce that some of the tests in Dissipation and RotatingSymmetry90/180 fail, and the variables which are wrong are computed by ML_ADMConstraints.

    The tests in those thorns are


    Every one of these tests uses the NoExcision thorn. Listing all the files which do NOT contain "NoExcision" gives:

        Ian-Hinders-MacBook-Pro:CactusNumerical ian ((c3219faa...))$ grep -L NoExcision Dissipation/test/*.par RotatingSymmetry180/test/*.par RotatingSymmetry90/test/*.par
        Ian-Hinders-MacBook-Pro:CactusNumerical ian ((c3219faa...))$

    I have compiled Cactus with Intel 17.0.4 on Minerva, but since NoExcision is disabled there, none of these tests run. I am not sure it is a coincidence that all these failing tests use NoExcision. Could it be NoExcision writing to memory it shouldn't be?

    Peter, if you disable NoExcision, do you still see differences between -O1 and -O2?

  27. Frank Löffler
    • removed comment

    I agree that we don't have a whole lot of information about specific compiler versions.

    On the other hand, I don't see a reasonable way that changing the fp model in ADMConstraints would hide or expose a memory problem in another thorn. If overwriting happens, it would happen regardless of the compiler settings in ADMConstraints. Then a different fp model in ADMConstraints would have to mask (or not) that overwriting. While I suppose it is possible (we do use compilers mainly as black boxes...), that seems unlikely.

  28. Peter Diener
    • removed comment

    In order to turn off NoExcision, I had to change the spin of the black hole from a=0.8 to a=0.0 so as not to get NaNs in the constraints near the horizon. However, doing that still showed a significant difference in the constraints (both Hamiltonian and momentum) between the -O1 and -O2 executables. So I don't think the issue can be blamed on NoExcision.

  29. Ian Hinder reporter
    • removed comment

    OK, good! Which test was this? Are the ADMBase variables also still the same between -O1 and -O2? It might be that the default CarpetIOASCII out_precision (15) is not high enough to see the differences, which might be amplified by the derivatives taken when computing the constraints. The tolerance for this test has been adjusted:

     # incr. abstol since we expect some values of order 100 in the metric
    ABSTOL 1.e-10

    in Dissipation/test/test.ccl.

    I was able to include NoExcision in the thornlist and compile on Minerva with 17.0.4 with no problems (16.0.1 gives an internal compiler error when compiling NoExcision, which is why it was disabled in the first place). The Dissipation tests fail due to large constraints, as reported for the other versions. So the problem is still present in the latest released version of the compiler. I'm compiling with 16.0.3 now.

  30. Peter Diener
    • removed comment

    This was KerrSchild-rotating-180-EE from RotatingSymmetry180. Yes, the ADMBase variables agree to the precision in the output (15 digits). As shown by my earlier Noise test, even differences in the ADMBase variables of 1e-6 do not produce differences of order unity in the constraints, so amplification of roundoff error is unlikely in my opinion.

  31. Ian Hinder reporter
    • removed comment

    The Dissipation tests fail in the same way for Intel 16.0.3. Do we actually have a version of the Intel compiler which does //not// show this problem?

  32. Frank Löffler
    • removed comment

    Replying to [comment:41 hinder]:

    The Dissipation tests fail in the same way for Intel 16.0.3. Do we actually have a version of the Intel compiler which does //not// show this problem?

    Version 14 (and probably earlier) don't show it.

  33. Ian Hinder reporter
    • removed comment

    I have written a new test in ML_BSSN_Test which uses the shifted gauge wave from EinsteinExact and outputs the constraints as calculated by ML_ADMConstraints. This test suffers from the same problem. The test fails when run with Intel 17.0.4, while the test data was generated using gcc 4.9.2. I have pushed the test to master, since it's always good to have more tests. This eliminates the possibility that the problem is caused by "irregular" or strange initial data.

    I have also tried adding

    printf("Hello World!\n");

    as the last line of the loop in ML_ADMConstraints_evaluate_Body. This causes the test to pass, whereas previously it failed. If the problem is so sensitive to changes in the code, I'm worried that it will be difficult to produce a cut-down example to send to Intel. On the other hand, if doing something as drastic as disabling vectorisation doesn't cause the bug to go away, then maybe there is hope.

  34. Ian Hinder reporter
    • removed comment

    Did someone say they had also seen these failures on Intel 18 beta? I don't have access to a machine with it installed. Shall we disable optimisation for all Intel compiler versions >= 15?

  35. Peter Diener
    • removed comment

    I have pushed a version of the patch to the main branch that contains a regex that should only match Intel versions 15 to 18. Please go ahead and retest.
