fix sqrt() for AVX512, updates to Vectors thorn

Issue #2219 closed
Roland Haas created an issue

This pull request mostly add better support for the AVX512 instructions found in KNL and modern Intel Xeon / Core iN CPUs.

There is one bugfix commit “Correct error in AVX512 sqrt implemention” which is important for those architectures (eg to compute sqrt(detg)).

I am marking this as critical until I can verify that without this fix the result of a BBH simulation on eg Stampede2 is not incorrect because of using an incorrect sqrt() function.

Pull request is here: https://bitbucket.org/cactuscode/cactusutils/pull-requests/17/rhaas-avx512/diff

Keyword: Vectors

Comments (9)

  1. Roland Haas reporter
    • marked as
    • removed comment

    Downgrading this since I remember Erik's comment only sqrt(0) was incorrect.

  2. Roland Haas reporter

    It tested this on stampede2 and the modified code produces the correct values. Literally only sqrt(0.) was incorrect and produced a NaN. The fix is to check if the input was 0. and if so then return 0. This happens in the return statement of the function. Other changes are just moving the intrinsic functions into inline functions. Here's a diff between the codes (with the inline functions not shown):

    --- sqrt.cc 2019-02-21 17:39:49.371725396 -0600
    +++ sqrt-fixed.cc   2019-02-21 17:39:49.303724044 -0600
    @@ -14,6 +14,10 @@
     typedef __m512i CCTK_INTEGER8_VEC;
     typedef __mmask8 CCTK_BOOLEAN8_VEC;
    
    +#ifdef __AVX512ER__
    +#define __knl__
    +#endif
    +
     // Number of vector elements in a CCTK_REAL_VEC
     #define CCTK_REAL8_VEC_SIZE 8
    @@ -34,40 +38,137 @@
       // We start with an approximate result, then perform Goldschmidt iterations
       // <https://en.wikipedia.org/wiki/Methods_of_computing_square_roots>.
       // Theoretically the result should be exact.
       // Initialisation
       CCTK_REAL8_VEC const y0 = _mm512_rsqrt28_pd(x);
    -  CCTK_REAL8_VEC const x0 = _mm512_mul_pd(x, y0);
    -  CCTK_REAL8_VEC const h0 = _mm512_mul_pd(vec8_set1(0.5), y0);
    +  CCTK_REAL8_VEC const x0 = k8mul(x, y0);
    +  CCTK_REAL8_VEC const h0 = k8mul(vec8_set1(0.5), y0);
       // Step
    -  CCTK_REAL8_VEC const r0 = _mm512_fnmadd_pd(x0, h0, vec8_set1(0.5));
    -  CCTK_REAL8_VEC const x1 = _mm512_fmadd_pd(x0, r0, x0);
    -  return x1;
    +  CCTK_REAL8_VEC const r0 = k8nmsub(x0, h0, vec8_set1(0.5));
    +  CCTK_REAL8_VEC const x1 = k8madd(x0, r0, x0);
    +  return k8ifthen(k8cmpeq(x, vec8_set1(0.0)), x, x1);
     #else
       // We start with an approximate result, then perform Goldschmidt iterations
       // <https://en.wikipedia.org/wiki/Methods_of_computing_square_roots>.
       // Theoretically the result should be exact.
       // Initialisation
       CCTK_REAL8_VEC const y0 = _mm512_rsqrt14_pd(x);
    -  CCTK_REAL8_VEC const x0 = _mm512_mul_pd(x, y0);
    -  CCTK_REAL8_VEC const h0 = _mm512_mul_pd(vec8_set1(0.5), y0);
    +  CCTK_REAL8_VEC const x0 = k8mul(x, y0);
    +  CCTK_REAL8_VEC const h0 = k8mul(vec8_set1(0.5), y0);
       // Step
    -  CCTK_REAL8_VEC const r0 = _mm512_fnmadd_pd(x0, h0, vec8_set1(0.5));
    -  CCTK_REAL8_VEC const x1 = _mm512_fmadd_pd(x0, r0, x0);
    -  CCTK_REAL8_VEC const h1 = _mm512_fmadd_pd(h0, r0, h0);
    +  CCTK_REAL8_VEC const r0 = k8nmsub(x0, h0, vec8_set1(0.5));
    +  CCTK_REAL8_VEC const x1 = k8madd(x0, r0, x0);
    +  CCTK_REAL8_VEC const h1 = k8madd(h0, r0, h0);
       // Step
    -  CCTK_REAL8_VEC const r1 = _mm512_fnmadd_pd(x1, h1, vec8_set1(0.5));
    -  CCTK_REAL8_VEC const x2 = _mm512_fmadd_pd(x1, r1, x1);
    -  return x2;
    +  CCTK_REAL8_VEC const r1 = k8nmsub(x1, h1, vec8_set1(0.5));
    +  CCTK_REAL8_VEC const x2 = k8madd(x1, r1, x1);
    +  return k8ifthen(k8cmpeq(x, vec8_set1(0.0)), x, x2);
     #endif
     }
    

    The codes can be compiled like such

    icpc -Ofast -g -debug all -xMIC-AVX512 -axMIC-AVX512 ./sqrt-fixed.cc
    icpc -Ofast -g -debug all -xMIC-AVX512 -axMIC-AVX512 ./sqrt.cc
    

    for KNL and like so:

    icpc -Ofast -g -debug all -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 ./sqrt-fixed.cc
    icpc -Ofast -g -debug all -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512 ./sqrt.cc
    

    to make them run on the login nodes.

    Note that the simfactory stampede2-knl.cfg option list does not cause the KNL code branch to be used b/c -xCORE-AVX2 requires that code runs on older CPUs. This is needed to be able to compile Cactus on the login nodes, otherwise configure fails to execute its tests (the only one that executes code is the one to detect byte sizes of CCTK_REAL etc and that one could be made to work without in the way that newer autoconf version do) and one would have to cross compile.

    We could add -D__knl__ to stampede2 which will compile the "best" AVX512 code on stampede2.

    I have modified the pull request to:

    1. remove the #error from the KNL code since it works fine
    2. made __AVX512ER__an alternative for __knl__ inside of vectors-8-AVX512.h since that is the macro that the compilers set (see stackoverflow)
  3. Roland Haas reporter

    To clarify why only sqrt(0) failed. The algorithm used (and cited) is computing both sqrt(x) and 1/sqrt(x) and the hardware instruction used for the initial guess in the iteration computes (an approximation to) 1/sqrt(x) hence the iteration incorrectly fails for only x==0.

  4. Log in to comment