 changed status to open
 removed comment
This pull request mostly add better support for the AVX512 instructions found in KNL and modern Intel Xeon / Core iN CPUs.
There is one bugfix commit “Correct error in AVX512 sqrt implemention” which is important for those architectures (eg to compute sqrt(detg)).
I am marking this as critical until I can verify that without this fix the result of a BBH simulation on eg Stampede2 is not incorrect because of using an incorrect sqrt() function.
Pull request is here: https://bitbucket.org/cactuscode/cactusutils/pullrequests/17/rhaasavx512/diff
Keyword: Vectors
Comments (9)

reporter 
reporter Downgrading this since I remember Erik's comment only sqrt(0) was incorrect.

reporter  attached sqrt.cc
Version from before 5dea378

reporter  attached sqrtfixed.cc
Version after 5dea378, modified to remove the #error and to compile the code when used with xMICAVX512 on stampede2.

reporter It tested this on stampede2 and the modified code produces the correct values. Literally only sqrt(0.) was incorrect and produced a NaN. The fix is to check if the input was 0. and if so then return 0. This happens in the return statement of the function. Other changes are just moving the intrinsic functions into inline functions. Here's a diff between the codes (with the inline functions not shown):
 sqrt.cc 20190221 17:39:49.371725396 0600 +++ sqrtfixed.cc 20190221 17:39:49.303724044 0600 @@ 14,6 +14,10 @@ typedef __m512i CCTK_INTEGER8_VEC; typedef __mmask8 CCTK_BOOLEAN8_VEC; +#ifdef __AVX512ER__ +#define __knl__ +#endif + // Number of vector elements in a CCTK_REAL_VEC #define CCTK_REAL8_VEC_SIZE 8 @@ 34,40 +38,137 @@ // We start with an approximate result, then perform Goldschmidt iterations // <https://en.wikipedia.org/wiki/Methods_of_computing_square_roots>. // Theoretically the result should be exact. // Initialisation CCTK_REAL8_VEC const y0 = _mm512_rsqrt28_pd(x);  CCTK_REAL8_VEC const x0 = _mm512_mul_pd(x, y0);  CCTK_REAL8_VEC const h0 = _mm512_mul_pd(vec8_set1(0.5), y0); + CCTK_REAL8_VEC const x0 = k8mul(x, y0); + CCTK_REAL8_VEC const h0 = k8mul(vec8_set1(0.5), y0); // Step  CCTK_REAL8_VEC const r0 = _mm512_fnmadd_pd(x0, h0, vec8_set1(0.5));  CCTK_REAL8_VEC const x1 = _mm512_fmadd_pd(x0, r0, x0);  return x1; + CCTK_REAL8_VEC const r0 = k8nmsub(x0, h0, vec8_set1(0.5)); + CCTK_REAL8_VEC const x1 = k8madd(x0, r0, x0); + return k8ifthen(k8cmpeq(x, vec8_set1(0.0)), x, x1); #else // We start with an approximate result, then perform Goldschmidt iterations // <https://en.wikipedia.org/wiki/Methods_of_computing_square_roots>. // Theoretically the result should be exact. // Initialisation CCTK_REAL8_VEC const y0 = _mm512_rsqrt14_pd(x);  CCTK_REAL8_VEC const x0 = _mm512_mul_pd(x, y0);  CCTK_REAL8_VEC const h0 = _mm512_mul_pd(vec8_set1(0.5), y0); + CCTK_REAL8_VEC const x0 = k8mul(x, y0); + CCTK_REAL8_VEC const h0 = k8mul(vec8_set1(0.5), y0); // Step  CCTK_REAL8_VEC const r0 = _mm512_fnmadd_pd(x0, h0, vec8_set1(0.5));  CCTK_REAL8_VEC const x1 = _mm512_fmadd_pd(x0, r0, x0);  CCTK_REAL8_VEC const h1 = _mm512_fmadd_pd(h0, r0, h0); + CCTK_REAL8_VEC const r0 = k8nmsub(x0, h0, vec8_set1(0.5)); + CCTK_REAL8_VEC const x1 = k8madd(x0, r0, x0); + CCTK_REAL8_VEC const h1 = k8madd(h0, r0, h0); // Step  CCTK_REAL8_VEC const r1 = _mm512_fnmadd_pd(x1, h1, vec8_set1(0.5));  CCTK_REAL8_VEC const x2 = _mm512_fmadd_pd(x1, r1, x1);  return x2; + CCTK_REAL8_VEC const r1 = k8nmsub(x1, h1, vec8_set1(0.5)); + CCTK_REAL8_VEC const x2 = k8madd(x1, r1, x1); + return k8ifthen(k8cmpeq(x, vec8_set1(0.0)), x, x2); #endif }
The codes can be compiled like such
icpc Ofast g debug all xMICAVX512 axMICAVX512 ./sqrtfixed.cc icpc Ofast g debug all xMICAVX512 axMICAVX512 ./sqrt.cc
for KNL and like so:
icpc Ofast g debug all xCOREAVX2 axCOREAVX512,MICAVX512 ./sqrtfixed.cc icpc Ofast g debug all xCOREAVX2 axCOREAVX512,MICAVX512 ./sqrt.cc
to make them run on the login nodes.
Note that the simfactory
stampede2knl.cfg
option list does not cause the KNL code branch to be used b/cxCOREAVX2
requires that code runs on older CPUs. This is needed to be able to compile Cactus on the login nodes, otherwise configure fails to execute its tests (the only one that executes code is the one to detect byte sizes of CCTK_REAL etc and that one could be made to work without in the way that newer autoconf version do) and one would have to cross compile.We could add
D__knl__
to stampede2 which will compile the "best" AVX512 code on stampede2.I have modified the pull request to:
 remove the
#error
from the KNL code since it works fine  made
__AVX512ER__
an alternative for__knl__
inside ofvectors8AVX512.h
since that is the macro that the compilers set (see stackoverflow)
 remove the

reporter  attached sqrtfixed.cc
sqrt in git hash 5dea378 of Vectors

reporter Applied as git hash aa3ad4462f1d of cactusutils.

reporter  edited description
 changed status to closed

reporter To clarify why only sqrt(0) failed. The algorithm used (and cited) is computing both sqrt(x) and 1/sqrt(x) and the hardware instruction used for the initial guess in the iteration computes (an approximation to) 1/sqrt(x) hence the iteration incorrectly fails for only x==0.
 Log in to comment