- changed status to open
- removed comment
fix sqrt() for AVX512, updates to Vectors thorn
This pull request mostly add better support for the AVX512 instructions found in KNL and modern Intel Xeon / Core iN CPUs.
There is one bugfix commit “Correct error in AVX512 sqrt implemention” which is important for those architectures (eg to compute sqrt(detg)).
I am marking this as critical until I can verify that without this fix the result of a BBH simulation on eg Stampede2 is not incorrect because of using an incorrect sqrt() function.
Pull request is here: https://bitbucket.org/cactuscode/cactusutils/pull-requests/17/rhaas-avx512/diff
Keyword: Vectors
Comments (9)
-
reporter -
reporter Downgrading this since I remember Erik's comment only sqrt(0) was incorrect.
-
reporter - attached sqrt.cc
Version from before 5dea378
-
reporter - attached sqrt-fixed.cc
Version after 5dea378, modified to remove the #error and to compile the code when used with -xMIC-AVX512 on stampede2.
-
reporter It tested this on stampede2 and the modified code produces the correct values. Literally only sqrt(0.) was incorrect and produced a NaN. The fix is to check if the input was 0. and if so then return 0. This happens in the return statement of the function. Other changes are just moving the intrinsic functions into inline functions. Here's a diff between the codes (with the inline functions not shown):
--- sqrt.cc 2019-02-21 17:39:49.371725396 -0600 +++ sqrt-fixed.cc 2019-02-21 17:39:49.303724044 -0600 @@ -14,6 +14,10 @@ typedef __m512i CCTK_INTEGER8_VEC; typedef __mmask8 CCTK_BOOLEAN8_VEC; +#ifdef __AVX512ER__ +#define __knl__ +#endif + // Number of vector elements in a CCTK_REAL_VEC #define CCTK_REAL8_VEC_SIZE 8 @@ -34,40 +38,137 @@ // We start with an approximate result, then perform Goldschmidt iterations // <https://en.wikipedia.org/wiki/Methods_of_computing_square_roots>. // Theoretically the result should be exact. // Initialisation CCTK_REAL8_VEC const y0 = _mm512_rsqrt28_pd(x); - CCTK_REAL8_VEC const x0 = _mm512_mul_pd(x, y0); - CCTK_REAL8_VEC const h0 = _mm512_mul_pd(vec8_set1(0.5), y0); + CCTK_REAL8_VEC const x0 = k8mul(x, y0); + CCTK_REAL8_VEC const h0 = k8mul(vec8_set1(0.5), y0); // Step - CCTK_REAL8_VEC const r0 = _mm512_fnmadd_pd(x0, h0, vec8_set1(0.5)); - CCTK_REAL8_VEC const x1 = _mm512_fmadd_pd(x0, r0, x0); - return x1; + CCTK_REAL8_VEC const r0 = k8nmsub(x0, h0, vec8_set1(0.5)); + CCTK_REAL8_VEC const x1 = k8madd(x0, r0, x0); + return k8ifthen(k8cmpeq(x, vec8_set1(0.0)), x, x1); #else // We start with an approximate result, then perform Goldschmidt iterations // <https://en.wikipedia.org/wiki/Methods_of_computing_square_roots>. // Theoretically the result should be exact. // Initialisation CCTK_REAL8_VEC const y0 = _mm512_rsqrt14_pd(x); - CCTK_REAL8_VEC const x0 = _mm512_mul_pd(x, y0); - CCTK_REAL8_VEC const h0 = _mm512_mul_pd(vec8_set1(0.5), y0); + CCTK_REAL8_VEC const x0 = k8mul(x, y0); + CCTK_REAL8_VEC const h0 = k8mul(vec8_set1(0.5), y0); // Step - CCTK_REAL8_VEC const r0 = _mm512_fnmadd_pd(x0, h0, vec8_set1(0.5)); - CCTK_REAL8_VEC const x1 = _mm512_fmadd_pd(x0, r0, x0); - CCTK_REAL8_VEC const h1 = _mm512_fmadd_pd(h0, r0, h0); + CCTK_REAL8_VEC const r0 = k8nmsub(x0, h0, vec8_set1(0.5)); + CCTK_REAL8_VEC const x1 = k8madd(x0, r0, x0); + CCTK_REAL8_VEC const h1 = k8madd(h0, r0, h0); // Step - CCTK_REAL8_VEC const r1 = _mm512_fnmadd_pd(x1, h1, vec8_set1(0.5)); - CCTK_REAL8_VEC const x2 = _mm512_fmadd_pd(x1, r1, x1); - return x2; + CCTK_REAL8_VEC const r1 = k8nmsub(x1, h1, vec8_set1(0.5)); + CCTK_REAL8_VEC const x2 = k8madd(x1, r1, x1); + return k8ifthen(k8cmpeq(x, vec8_set1(0.0)), x, x2); #endif }
The codes can be compiled like such
icpc -Ofast -g -debug all -xMIC-AVX512 -axMIC-AVX512 ./sqrt-fixed.cc icpc -Ofast -g -debug all -xMIC-AVX512 -axMIC-AVX512 ./sqrt.cc
for KNL and like so:
icpc -Ofast -g -debug all -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 ./sqrt-fixed.cc icpc -Ofast -g -debug all -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 ./sqrt.cc
to make them run on the login nodes.
Note that the simfactory
stampede2-knl.cfg
option list does not cause the KNL code branch to be used b/c-xCORE-AVX2
requires that code runs on older CPUs. This is needed to be able to compile Cactus on the login nodes, otherwise configure fails to execute its tests (the only one that executes code is the one to detect byte sizes of CCTK_REAL etc and that one could be made to work without in the way that newer autoconf version do) and one would have to cross compile.We could add
-D__knl__
to stampede2 which will compile the "best" AVX512 code on stampede2.I have modified the pull request to:
- remove the
#error
from the KNL code since it works fine - made
__AVX512ER__
an alternative for__knl__
inside ofvectors-8-AVX512.h
since that is the macro that the compilers set (see stackoverflow)
- remove the
-
reporter - attached sqrt-fixed.cc
sqrt in git hash 5dea378 of Vectors
-
reporter Applied as git hash aa3ad4462f1d of cactusutils.
-
reporter - edited description
- changed status to closed
-
reporter To clarify why only sqrt(0) failed. The algorithm used (and cited) is computing both sqrt(x) and 1/sqrt(x) and the hardware instruction used for the initial guess in the iteration computes (an approximation to) 1/sqrt(x) hence the iteration incorrectly fails for only x==0.
- Log in to comment