Intel 2013.1.117 mis-compiles NewRad

Issue #1276 open
Roland Haas created an issue

Intels compiler fails (with -O2) to push values for bmin onto the stack in lines 316 of newrad.cc and line 126 of extrap.cc. Adding printf's for bmin perturbs the bug out of existence, but adding a printf of the address of bmax and reveals that at the time extrap_kernel is call the integer just before this address is still the initialization value of bmin[2] and not the correct value.

The attached patch disables optimization for the two driver functions affected (but not the actual kernel).

The patch is specific (via an #if) for this particular compiler and version. What is the best way of handling this? Target any intel version starting from the known failing one until we know of known good one? Or starting from an older known good one (that would be intel 11 in my case).

Hopefully no similar bug is triggered by Carpet's use of the same idiom.

Keyword: NewRad

Comments (55)

  1. Erik Schnetter
    • removed comment

    Traditionally, we have not included statements like these in source code. Apart from the difficulty of removing these again (when is it safe to do so?), the question is whether the compiler bug would also miscompile other parts of the code.

    The preferred approach is to not use this compiler version, or to determine a lower optimisation setting that makes this compiler still usable. This information would be encoded in Simfactory. Obviously this bug would also be reported to Intel, so that they can correct it.

  2. Wolfgang Kastaun
    • removed comment

    We encountered the very same issue recently with intel 13.0.1 on the marenostrum 3 cluster, manifested as assertion failures in Newrad that went away when inserting print statements. What exactly is the compiler bug (how is it triggered) and how did you diagnose this ? In any case the code seems not special, so if it fails probably other parts of ET will fail as well.

  3. Erik Schnetter
    • removed comment

    Intel 13.1.0.146 (build 20130121) is available. I suggest upgrading to this release, or requesting from the system administrators that they do so.

  4. Roland Haas reporter
    • removed comment

    I do not fully understand what triggers the compiler bug. Unfortunately I do not have a short test case that reproduces it. Only NewRad. The error manifested as a segfault when the code was run, it was easy to detect. To diagnose I added printf statements at various places and ran the code in gdb. Eventually I found that the error is triggered in the very first iteration of the loop in extrap_kernel at which point i,j,k had the values that bmin was initialized with :-INT_MAX,-INT_MAX/2,-INT_MAX/3 in my case. Adding printf statements at the wrong spot removes the error eg adding a printf(%d %d %d\n", bmin[0],...) removes it, indep. of whether it is added in the driver or in the extrap routine, which in itself is odd since it would hint at some kind of cross-procedure optimization.

    I cannot really lower the optimization settings below -O2 (for the whole code) and thus will use a different compiler version. As far are removing the statements goes: we'd never be able to remove them since this compiler version will always be buggy. Only when no one uses the compiler anymore could we do so.

    I am unfamiliar with the procedure of submitting a bug report to intel. Can one submit all of newrad as the failing routine or is a stand-alone code that demonstrates the issue required?

  5. Erik Schnetter
    • removed comment

    First, you would reproduce the bug with the currently released version. Then you would need to find a way for the Intel engineers to reproduce the problem, which is either a self-contained code that can be run (may be large), or a short piece of code where you can point at the assembler output indicating the error there.

  6. Frank Löffler
    • removed comment

    Independent on how to workaround this - it would be good to have one testsuite using Newrad which would make this kind of problem more visible. Currently the thorn doesn't have a testsuite, but I am not sure if another thorn might include it (although I would doubt it).

  7. Roland Haas reporter
    • removed comment

    There are quite a number of tests that include it:

    rhaas@horizon:.../gr/Zelmani$ grep -l NewRad arrangements/*/*/test/*.par
    arrangements/AEIThorns/Trigger/test/trigger.par
    arrangements/CIGR/Hydro_Analysis/test/Hydro_Analysis_TOV.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-EE.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-EE.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-staggered-EE.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-staggered.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-staggered-EE.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-staggered.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr.par
    arrangements/CactusNumerical/RotatingSymmetry180/test/KerrSchild-rotating-180.par
    arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-EE.par
    arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-staggered-EE.par
    arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-staggered.par
    arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90.par
    arrangements/CactusNumerical/RotatingSymmetry90/test/KerrSchild-rotating-90.par
    arrangements/Carpet/CarpetEvolutionMask/test/CarpetEvolutionMask_test.par
    arrangements/Carpet/CarpetEvolutionMask/test/CarpetEvolutionMask_test_off.par
    arrangements/EinsteinAnalysis/AHFinderDirect/test/checkpointML-EE.par
    arrangements/EinsteinAnalysis/AHFinderDirect/test/checkpointML.par
    arrangements/EinsteinAnalysis/AHFinderDirect/test/recoverML-EE.par
    arrangements/EinsteinAnalysis/AHFinderDirect/test/recoverML.par
    arrangements/EinsteinAnalysis/WeylScal4/test/teukolsky.par
    arrangements/EinsteinEvolve/GRHydro/test/A3_oct.par
    arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_ML.par
    arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_ML_disable_internal_excision.par
    arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_no_trp_ML.par
    arrangements/EinsteinInitialData/Exact/test/KS-tilted.par
    arrangements/LSUThorns/QuasiLocalMeasures/test/qlm-bl.par
    arrangements/McLachlan/ML_BSSN_Test/test/ML_BSSN_MP_O8_bh.par
    arrangements/McLachlan/ML_BSSN_Test/test/ML_BSSN_NewRad.par
    arrangements/Zelmani/GRHydro/test/A3_oct.par
    arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_ML.par
    arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_ML_disable_internal_excision.par
    arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_no_trp_ML.par
    

    essentially all tests that use McLachlan.

  8. Frank Löffler
    • removed comment

    That is very good. That also means that this problem must be pretty "new", or appears only on otherwise not tested machines, right? We should otherwise have seen testsuite failures because of that in the past.

    I otherwise agree with Erik: disabling optimization here is 'dirty' and very likely that compiler has problems compiling other code as well. It would be much better to avoid the compiler instead, if possible.

  9. Roland Haas reporter
    • removed comment

    This still happens (for me, on bethe, in the NewRad thorn [only]) with 2013.2.146 . I have so far not been able to reproduce this with a small test case. Maybe I will just have to make up a small Cactus thornlist with NewRad to demonstrate.

  10. Frank Löffler
    • removed comment

    Is this still an issue? Should we, within NewRad and for this release, detect that very compiler version and abort compilation with an error to at least avoid the segfaults and point users to the root of the problem?

  11. Roland Haas reporter
    • removed comment

    This is still an issue. Since the last update of the ticket I found that adding

    #define restrict
    

    ie. removing "restrict" at the beginning of the files also seems to work around the issue.

  12. Erik Schnetter
    • removed comment

    Since we are approaching the release date, I suggest a correction that is completely contained within and #ifdef for the particular compiler version. Either switching off optimization or undefining restrict would both be fine.

    Switching off optimization in the *_loop routines should be fine, since on the kernels are performance relevant.

  13. Frank Löffler
    • removed comment

    Just testing this with Intel 13 2013.3.163: still segfaults. So far I think we don't have any version '13' that works, or do we?

  14. Erik Schnetter
    • removed comment

    Unfortunately, Intel uses different version numbers on the outside of their packages and on the inside. Easiest to check is the output of "icc -V"; what does this show for you?

    I'm using Version 13.1.1.163 Build 20130313.

  15. Frank Löffler
    • removed comment

    Removing the 'restrict' keywords from bmin and bmax also seems to be a workaround. Are we sure we are not doing something wrong here, declaring bmin and bmax 'restrict'? I don't see how, but I am not sure.

  16. Frank Löffler

    I attach a short version of the file, reduced as much as I could so far. It is a single C-file, so be compiled using 'icpc -o ./extrap ./extrap.cc'. If working, you should see "2 2" as output (three lines). If it's not, you see random (large) numbers, and it usually hangs. The difference can be seen by commenting/uncommenting the printf in line 59.

    It would be nice if someone else could confirm that this short example can be used to shop the bug, wherever it might be.

    Also raising severity, since the Intel compiler is crucial and version 13 is the newest, and only available on some systems.

  17. Roland Haas reporter
    • removed comment

    Without restrict I would expect the code to run much slower in some places.

  18. Frank Löffler
    • removed comment

    I agree. But the alternative is to have it not run at all, or possibly to miss places which the compiler also happens to miscompile, if one would redefine 'restrict' only in files we find to be miscompiled.

  19. Erik Schnetter
    • removed comment

    This patch doesn't disable restrict. The default case, where "restrict" remains undefined, leaves restrict enabled (since it remains "restrict"). I recommend changing only HAVE_CCTK_C_RESTRICT and CCTK_C_RESTRICT, and keeping the logic that redefines "restrict". Alternatively, add the Intel #ifdef afterwards, and just overwrite HAVE_CCTK_C_RESTRICT, CCTK_C_RESTRICT, and restrict.

    if BORKEN_INTEL

    undef HAVE_CCTK_C_RESTRICT

    undef CCTK_C_RESTRICT

    undef restrict

    define CCTK_C_RESTRICT

    define restrict

    endif

  20. Frank Löffler
    • removed comment

    The first part of the patch is incorrect, but the second worked. This left 'restrict' for C intact, while it does remove it for C++. The file in question here was C++, so I didn't catch this, and the (most/all?) tests passed for me. For C++, and with this version of the compiler, autoconf sets HAVE_CCTK_CXX_RESTRICT to 1 and CCTK_CXX_RESTRICT to restrict. With the old patch, 'restrict' is then defined empty for problematic versions of the intel compiler (instead of restrict).

    However, you are right: this could be made even better. Attached is a new version of the patch. This one first contains a conditional on the version of the compiler for all CCODE. Then, for C and CXX separately, it skips the autoconf-provided values if the compiler was found to be bad, and instead HAVE_CCTK_CXX_RESTRICT remains undefined and CCTK_CXX_RESTRICT is set empty, which later defines 'restrict' to empty too (similarly for C).

    Because someone might actually want to overwrite this, CCTK_INTEL_COMPILER_DONT_DISABLE_RESTRICT is checked, and if set, doesn't disable restrict even for bad compilers.

    Also, now all Intel compilers with build dates between 20121010 and 20130313 are flagged 'bad'.

  21. Frank Löffler
    • removed comment

    It was reported quite a while back (through TACC). I didn't hear anything back yet though.

  22. Frank Löffler
    • removed comment

    The last version we've tested it with is 20130313, which isn't the most recent version - but the most recent I have currently access to. Also interestingly, this version is marked 'buggy' by intel itself (see http://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping). Newer versions are 20130514, 20130607, 20130728, and the currently most recent 20131008. To the best of my knowledge none of these has been tested by us.

  23. Frank Löffler
    • marked as
    • changed milestone to ET_2014_05
    • removed comment

    The workaround is now enabled for all versions past 20121010, in commit 5049 of the flesh. Marking as relevant for the next release because we need to revisit after some time to see if Intel fixed this, and in which version.

  24. Roland Haas reporter
    • removed comment

    It seems as if Intel14 (Version 14.0.1.106 Build 20131008), which is available on stampede as intel/14.0.1.106 is no longer affected. At least the standalone test produces "2 2" without options, with -O3, -O2, -O1, -O0, -Ofast.

  25. Ian Hinder
    • removed comment

    The version of Intel 14 on Datura is slightly older: Version 14.0.0.080 Build 20130728. The standalone test passes with this version. I am attaching a patch to cctk_Config.h.in which enables restrict for this version and later. We are assuming here that Intel do not release new builds of old versions which might not have the bug fixed.

    OK to apply?

  26. Frank Löffler
    • removed comment

    Yes, please apply. However, please also extend the comment above this section accordingly.

  27. Bruno Mundim
    • changed status to open
    • removed comment

    This bug seems to be partially back on intel 14. I am updating the loewe.cfg file after a machine upgrade and noticed that all tests using the parameter

    ML_BSSN::my_initial_boundary_condition = "extrapolate-gammas"

    failed. After I realized all failed at NewRad.c I recalled this ticket and then I ran Frank's standalone code with -O0, -O1, -O2, -O3, -Ofast flags. As you can see below only the -O1 and -O2 flags corrupted the output. Since the -O2 is usually the default flag, I thought of reporting this issue here again.

    icpc -o ./extrap -O0 ./extrap.cc ./extrap 2 2 2 2 2 2 icpc -o ./extrap -O1 ./extrap.cc /extrap -10184 -10184 -10184 -10184 -10184 -10184 icpc -o ./extrap -O2 ./extrap.cc ./extrap 40896 40896 40896 40896 40896 40896 icpc -o ./extrap -O3 ./extrap.cc ./extrap 2 2 2 2 2 2 icpc -o ./extrap -Ofast ./extrap.cc ./extrap 2 2 2 2 2 2

    The full version of the compiler I used is:

    icpc -V Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.174 Build 20140422 Copyright (C) 1985-2014 Intel Corporation. All rights reserved.

  28. Bruno Mundim
    • removed comment

    No. I was looking at their website and they have already an update for this version:

    https://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping

    I don't know if they would just tell me to try a more recent update. Do you have experience filing bugs there? In any case, I will ask the sysadmin here first to update the compiler version, before filing a bug at intel.

    @Frank: did you file this bug at intel with this extrap.c code snipet?

  29. Frank Löffler
    • removed comment

    I didn't report directly, but through our contact at TACC (which now does not work there anymore).

  30. Bruno Mundim
    • removed comment

    Ok. Do you mind reporting to them? I could also report to them and use your code snippet, if you don't mind. But first I will wait for the sysadmin here update the compiler build to see if this bug goes away.

    What about making your code snippet part of the configure stage where we decide to use restrict or not based on tests with your code?

  31. Ian Hinder
    • removed comment

    The blacklisting of restrict in Cactus should be updated to match this compiler version. And unless and until someone tests this with a newer version, any newer versions should also be blacklisted just in case. The latest version I have tested with is Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728, which is the latest version installed on Datura.

  32. Bruno Mundim
    • removed comment

    I tested the extrap code with other intel compiler versions. The most recent intel 14 still has the bug. The first build for intel 15 works fine. So, we should blacklist the following:

    Intel(R) 64, Version 14.0.3.174 Build 20140422 Intel(R) 64, Version 14.0.4.211 Build 20140805

    and add to a white list the intel 15 one:

    Intel(R) 64, Version 15.0.0.090 Build 20140723

    I haven't tested a full build of ET yet. The sysadmin here will work on compiling the other modules with the newest intel 15. I will have to report back here later.

  33. Frank Löffler
    • removed comment

    I only looked through the patch now without trying - but doesn't this test only run the standalone code, not test if it actually did the correct thing?

  34. Steven R. Brandt
    • removed comment

    It checks to see if it got the correct answer (i.e. 2). If it doesn't, it exits with return code 1 which causes the test to fail. You can artificially cause a failure if you manually change the "return 0" to "return 1" inside the test to see what happens.

  35. Roland Haas reporter
    • removed comment

    This ticket is still open after 2 years. Frank, given that Wolfgang reports that the test is effective, do you want to push this ticket forward so that the patch can be included?

  36. Frank Löffler
    • removed comment

    I would expect that a few tests with different versions of the compiler would be necessary to include the patch. In general it is a good idea to have a test for it. One immediate problem is that the patch adds

    undef CCTK_DISABLE_RESTRICT

    It shouldn't, as by default this isn't defined, and if it is, it was done so presumably by the user and should not be ignored.

  37. Roland Haas reporter

    Unless objected I will close this ticket as “resolved” after 2019-7-14.

    The test remains in cctk_Config.h which is maybe a bit unfortunate since that is not a file one would normally have checked. A test in configure.in (ideally not requiring actually running code, so that it also works with cross compiling) would be nicer. Something like:

    #if (defined __INTEL_COMPILER                &&                 \
         __INTEL_COMPILER_BUILD_DATE >= 20120731 &&                 \
         __INTEL_COMPILER_BUILD_DATE <  20130728 &&                 \
         !defined CCTK_INTEL_COMPILER_DONT_DISABLE_RESTRICT)
    #error "icpc with failing __restrict__ handling detected"
    #endif
    

    would do the trick.

    There are no longer any production machines with intel compilers which fall into this range. The oldest compiler available is on shelob with /usr/local/compilers/Intel/composer_xe_2013.5.192/bin/intel64/icpc which has a build date of 20130607 and thus would fall into the range of “buggy” compilers (>= 20120731 and <20130728) but passes the standalone test for me (no matter what optimization level I choose).

  38. Log in to comment