- changed status to open
- removed comment
Intel 2013.1.117 mis-compiles NewRad
Intels compiler fails (with -O2) to push values for bmin onto the stack in lines 316 of newrad.cc and line 126 of extrap.cc. Adding printf's for bmin perturbs the bug out of existence, but adding a printf of the address of bmax and reveals that at the time extrap_kernel is call the integer just before this address is still the initialization value of bmin[2] and not the correct value.
The attached patch disables optimization for the two driver functions affected (but not the actual kernel).
The patch is specific (via an #if) for this particular compiler and version. What is the best way of handling this? Target any intel version starting from the known failing one until we know of known good one? Or starting from an older known good one (that would be intel 11 in my case).
Hopefully no similar bug is triggered by Carpet's use of the same idiom.
Keyword: NewRad
Comments (54)
-
reporter -
- removed comment
Traditionally, we have not included statements like these in source code. Apart from the difficulty of removing these again (when is it safe to do so?), the question is whether the compiler bug would also miscompile other parts of the code.
The preferred approach is to not use this compiler version, or to determine a lower optimisation setting that makes this compiler still usable. This information would be encoded in Simfactory. Obviously this bug would also be reported to Intel, so that they can correct it.
-
- removed comment
We encountered the very same issue recently with intel 13.0.1 on the marenostrum 3 cluster, manifested as assertion failures in Newrad that went away when inserting print statements. What exactly is the compiler bug (how is it triggered) and how did you diagnose this ? In any case the code seems not special, so if it fails probably other parts of ET will fail as well.
-
- removed comment
Intel 13.1.0.146 (build 20130121) is available. I suggest upgrading to this release, or requesting from the system administrators that they do so.
-
reporter - removed comment
I do not fully understand what triggers the compiler bug. Unfortunately I do not have a short test case that reproduces it. Only NewRad. The error manifested as a segfault when the code was run, it was easy to detect. To diagnose I added printf statements at various places and ran the code in gdb. Eventually I found that the error is triggered in the very first iteration of the loop in extrap_kernel at which point i,j,k had the values that bmin was initialized with :-INT_MAX,-INT_MAX/2,-INT_MAX/3 in my case. Adding printf statements at the wrong spot removes the error eg adding a printf(%d %d %d\n", bmin[0],...) removes it, indep. of whether it is added in the driver or in the extrap routine, which in itself is odd since it would hint at some kind of cross-procedure optimization.
I cannot really lower the optimization settings below -O2 (for the whole code) and thus will use a different compiler version. As far are removing the statements goes: we'd never be able to remove them since this compiler version will always be buggy. Only when no one uses the compiler anymore could we do so.
I am unfamiliar with the procedure of submitting a bug report to intel. Can one submit all of newrad as the failing routine or is a stand-alone code that demonstrates the issue required?
-
- removed comment
First, you would reproduce the bug with the currently released version. Then you would need to find a way for the Intel engineers to reproduce the problem, which is either a self-contained code that can be run (may be large), or a short piece of code where you can point at the assembler output indicating the error there.
-
- removed comment
Independent on how to workaround this - it would be good to have one testsuite using Newrad which would make this kind of problem more visible. Currently the thorn doesn't have a testsuite, but I am not sure if another thorn might include it (although I would doubt it).
-
reporter - removed comment
There are quite a number of tests that include it:
rhaas@horizon:.../gr/Zelmani$ grep -l NewRad arrangements/*/*/test/*.par arrangements/AEIThorns/Trigger/test/trigger.par arrangements/CIGR/Hydro_Analysis/test/Hydro_Analysis_TOV.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-EE.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-EE.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-staggered-EE.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180-staggered.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-rotating-180.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-staggered-EE.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr-staggered.par arrangements/CactusNumerical/RotatingSymmetry180/test/Kerr.par arrangements/CactusNumerical/RotatingSymmetry180/test/KerrSchild-rotating-180.par arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-EE.par arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-staggered-EE.par arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90-staggered.par arrangements/CactusNumerical/RotatingSymmetry90/test/Kerr-rotating-90.par arrangements/CactusNumerical/RotatingSymmetry90/test/KerrSchild-rotating-90.par arrangements/Carpet/CarpetEvolutionMask/test/CarpetEvolutionMask_test.par arrangements/Carpet/CarpetEvolutionMask/test/CarpetEvolutionMask_test_off.par arrangements/EinsteinAnalysis/AHFinderDirect/test/checkpointML-EE.par arrangements/EinsteinAnalysis/AHFinderDirect/test/checkpointML.par arrangements/EinsteinAnalysis/AHFinderDirect/test/recoverML-EE.par arrangements/EinsteinAnalysis/AHFinderDirect/test/recoverML.par arrangements/EinsteinAnalysis/WeylScal4/test/teukolsky.par arrangements/EinsteinEvolve/GRHydro/test/A3_oct.par arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_ML.par arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_ML_disable_internal_excision.par arrangements/EinsteinEvolve/GRHydro/test/GRHydro_test_tov_ppm_no_trp_ML.par arrangements/EinsteinInitialData/Exact/test/KS-tilted.par arrangements/LSUThorns/QuasiLocalMeasures/test/qlm-bl.par arrangements/McLachlan/ML_BSSN_Test/test/ML_BSSN_MP_O8_bh.par arrangements/McLachlan/ML_BSSN_Test/test/ML_BSSN_NewRad.par arrangements/Zelmani/GRHydro/test/A3_oct.par arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_ML.par arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_ML_disable_internal_excision.par arrangements/Zelmani/GRHydro/test/GRHydro_test_tov_ppm_no_trp_ML.par
essentially all tests that use McLachlan.
-
- removed comment
That is very good. That also means that this problem must be pretty "new", or appears only on otherwise not tested machines, right? We should otherwise have seen testsuite failures because of that in the past.
I otherwise agree with Erik: disabling optimization here is 'dirty' and very likely that compiler has problems compiling other code as well. It would be much better to avoid the compiler instead, if possible.
-
reporter - removed comment
This still happens (for me, on bethe, in the NewRad thorn [only]) with 2013.2.146 . I have so far not been able to reproduce this with a small test case. Maybe I will just have to make up a small Cactus thornlist with NewRad to demonstrate.
-
- changed milestone to ET_2013_05
- removed comment
-
- removed comment
Is this still an issue? Should we, within NewRad and for this release, detect that very compiler version and abort compilation with an error to at least avoid the segfaults and point users to the root of the problem?
-
reporter - removed comment
This is still an issue. Since the last update of the ticket I found that adding
#define restrict
ie. removing "restrict" at the beginning of the files also seems to work around the issue.
-
- removed comment
Since we are approaching the release date, I suggest a correction that is completely contained within and #ifdef for the particular compiler version. Either switching off optimization or undefining restrict would both be fine.
Switching off optimization in the *_loop routines should be fine, since on the kernels are performance relevant.
-
- removed comment
Just testing this with Intel 13 2013.3.163: still segfaults. So far I think we don't have any version '13' that works, or do we?
-
- removed comment
Unfortunately, Intel uses different version numbers on the outside of their packages and on the inside. Easiest to check is the output of "icc -V"; what does this show for you?
I'm using Version 13.1.1.163 Build 20130313.
-
- removed comment
Same: Version 13.1.1.163 Build 20130313
-
- removed comment
Removing the 'restrict' keywords from bmin and bmax also seems to be a workaround. Are we sure we are not doing something wrong here, declaring bmin and bmax 'restrict'? I don't see how, but I am not sure.
-
- changed version to development version
- marked as critical
- removed comment
I attach a short version of the file, reduced as much as I could so far. It is a single C-file, so be compiled using 'icpc -o ./extrap ./extrap.cc'. If working, you should see "2 2" as output (three lines). If it's not, you see random (large) numbers, and it usually hangs. The difference can be seen by commenting/uncommenting the printf in line 59.
It would be nice if someone else could confirm that this short example can be used to shop the bug, wherever it might be.
Also raising severity, since the Intel compiler is crucial and version 13 is the newest, and only available on some systems.
-
- removed comment
The attached patch https://trac.einsteintoolkit.org/attachment/ticket/1276/Cactus_intel_no_restrict disables restrict for the version of Intel compiler I can test. Other versions might need to be added, but someone else (who has them) would need to test that. With this patch, testsuites that segfaulted before now pass.
-
reporter - removed comment
-
reporter - removed comment
Without restrict I would expect the code to run much slower in some places.
-
- removed comment
I agree. But the alternative is to have it not run at all, or possibly to miss places which the compiler also happens to miscompile, if one would redefine 'restrict' only in files we find to be miscompiled.
-
- removed comment
This patch doesn't disable restrict. The default case, where "restrict" remains undefined, leaves restrict enabled (since it remains "restrict"). I recommend changing only HAVE_CCTK_C_RESTRICT and CCTK_C_RESTRICT, and keeping the logic that redefines "restrict". Alternatively, add the Intel #ifdef afterwards, and just overwrite HAVE_CCTK_C_RESTRICT, CCTK_C_RESTRICT, and restrict.
#if BORKEN_INTEL # undef HAVE_CCTK_C_RESTRICT # undef CCTK_C_RESTRICT # undef restrict # define CCTK_C_RESTRICT # define restrict #endif
-
- removed comment
The first part of the patch is incorrect, but the second worked. This left 'restrict' for C intact, while it does remove it for C++. The file in question here was C++, so I didn't catch this, and the (most/all?) tests passed for me. For C++, and with this version of the compiler, autoconf sets HAVE_CCTK_CXX_RESTRICT to 1 and CCTK_CXX_RESTRICT to restrict. With the old patch, 'restrict' is then defined empty for problematic versions of the intel compiler (instead of restrict).
However, you are right: this could be made even better. Attached is a new version of the patch. This one first contains a conditional on the version of the compiler for all CCODE. Then, for C and CXX separately, it skips the autoconf-provided values if the compiler was found to be bad, and instead HAVE_CCTK_CXX_RESTRICT remains undefined and CCTK_CXX_RESTRICT is set empty, which later defines 'restrict' to empty too (similarly for C).
Because someone might actually want to overwrite this, CCTK_INTEL_COMPILER_DONT_DISABLE_RESTRICT is checked, and if set, doesn't disable restrict even for bad compilers.
Also, now all Intel compilers with build dates between 20121010 and 20130313 are flagged 'bad'.
-
- changed status to open
- removed comment
Tested and approved.
-
- changed status to resolved
- removed comment
Committed as rev 5001
-
reporter - changed status to open
- removed comment
Did anyone actually report this to intel?
-
- removed comment
It was reported quite a while back (through TACC). I didn't hear anything back yet though.
-
- changed milestone to ET_2014_05
- removed comment
Nothing we can really do about this right now.
-
- changed milestone to ET_2013_11
- removed comment
We should apply the workaround also the newer versions.
-
- removed comment
The last version we've tested it with is 20130313, which isn't the most recent version - but the most recent I have currently access to. Also interestingly, this version is marked 'buggy' by intel itself (see http://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping). Newer versions are 20130514, 20130607, 20130728, and the currently most recent 20131008. To the best of my knowledge none of these has been tested by us.
-
- marked as
- changed milestone to ET_2014_05
- removed comment
The workaround is now enabled for all versions past 20121010, in commit 5049 of the flesh. Marking as relevant for the next release because we need to revisit after some time to see if Intel fixed this, and in which version.
-
reporter - removed comment
It seems as if Intel14 (Version 14.0.1.106 Build 20131008), which is available on stampede as intel/14.0.1.106 is no longer affected. At least the standalone test produces "2 2" without options, with -O3, -O2, -O1, -O0, -Ofast.
-
- removed comment
The problem is present as far back as 20120731; we have changed the blacklist to reflect this in r5092 of the flesh.
-
- removed comment
The version of Intel 14 on Datura is slightly older: Version 14.0.0.080 Build 20130728. The standalone test passes with this version. I am attaching a patch to cctk_Config.h.in which enables restrict for this version and later. We are assuming here that Intel do not release new builds of old versions which might not have the bug fixed.
OK to apply?
-
- removed comment
Yes, please apply. However, please also extend the comment above this section accordingly.
-
reporter - removed comment
Intel only fixes the latest version [https://software.intel.com/en-us/articles/intel-fortran-compiler-supported-compiler-versions]
Typically, bug fixes are provided for the latest version only, unless an explicit support contract for older versions exists.
-
- removed comment
Ian: can you commit the patch - or did you see problems when testing it?
-
- changed status to resolved
- removed comment
Committed as [[changeset:5110/Cactus flesh]].
-
- changed status to open
- removed comment
This bug seems to be partially back on intel 14. I am updating the loewe.cfg file after a machine upgrade and noticed that all tests using the parameter
ML_BSSN::my_initial_boundary_condition = "extrapolate-gammas"
failed. After I realized all failed at NewRad.c I recalled this ticket and then I ran Frank's standalone code with -O0, -O1, -O2, -O3, -Ofast flags. As you can see below only the -O1 and -O2 flags corrupted the output. Since the -O2 is usually the default flag, I thought of reporting this issue here again.
icpc -o ./extrap -O0 ./extrap.cc ./extrap 2 2 2 2 2 2 icpc -o ./extrap -O1 ./extrap.cc /extrap -10184 -10184 -10184 -10184 -10184 -10184 icpc -o ./extrap -O2 ./extrap.cc ./extrap 40896 40896 40896 40896 40896 40896 icpc -o ./extrap -O3 ./extrap.cc ./extrap 2 2 2 2 2 2 icpc -o ./extrap -Ofast ./extrap.cc ./extrap 2 2 2 2 2 2
The full version of the compiler I used is:
icpc -V Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.3.174 Build 20140422 Copyright (C) 1985-2014 Intel Corporation. All rights reserved.
-
- removed comment
Did you also report the issue to Intel?
-
- removed comment
No. I was looking at their website and they have already an update for this version:
I don't know if they would just tell me to try a more recent update. Do you have experience filing bugs there? In any case, I will ask the sysadmin here first to update the compiler version, before filing a bug at intel.
@Frank: did you file this bug at intel with this extrap.c code snipet?
-
- removed comment
I didn't report directly, but through our contact at TACC (which now does not work there anymore).
-
- removed comment
Ok. Do you mind reporting to them? I could also report to them and use your code snippet, if you don't mind. But first I will wait for the sysadmin here update the compiler build to see if this bug goes away.
What about making your code snippet part of the configure stage where we decide to use restrict or not based on tests with your code?
-
- removed comment
The blacklisting of restrict in Cactus should be updated to match this compiler version. And unless and until someone tests this with a newer version, any newer versions should also be blacklisted just in case. The latest version I have tested with is
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.0.080 Build 20130728
, which is the latest version installed on Datura. -
- removed comment
I tested the extrap code with other intel compiler versions. The most recent intel 14 still has the bug. The first build for intel 15 works fine. So, we should blacklist the following:
Intel(R) 64, Version 14.0.3.174 Build 20140422 Intel(R) 64, Version 14.0.4.211 Build 20140805
and add to a white list the intel 15 one:
Intel(R) 64, Version 15.0.0.090 Build 20140723
I haven't tested a full build of ET yet. The sysadmin here will work on compiling the other modules with the newest intel 15. I will have to report back here later.
-
- removed comment
Can please someone with a failing compiler test it before the release?
-
- removed comment
I only looked through the patch now without trying - but doesn't this test only run the standalone code, not test if it actually did the correct thing?
-
- removed comment
It checks to see if it got the correct answer (i.e. 2). If it doesn't, it exits with return code 1 which causes the test to fail. You can artificially cause a failure if you manually change the "return 0" to "return 1" inside the test to see what happens.
-
- removed milestone
- removed comment
-
- removed comment
I tested with intel 15.0.1 20141023 and the standalone test works with -O0, -O1, -O2, -O3.
-
reporter - removed comment
This ticket is still open after 2 years. Frank, given that Wolfgang reports that the test is effective, do you want to push this ticket forward so that the patch can be included?
-
- removed comment
I would expect that a few tests with different versions of the compiler would be necessary to include the patch. In general it is a good idea to have a test for it. One immediate problem is that the patch adds
#undef CCTK_DISABLE_RESTRICT
It shouldn't, as by default this isn't defined, and if it is, it was done so presumably by the user and should not be ignored.
- Log in to comment