"Accuracy test failed" on some systems with avx-512

Issue #37 resolved
Shen Yu created an issue

I ran into the “Accuracy test failed” problem when testing the “precession” sample provided in magic-sph (https://github.com/magic-sph/magic).

The tests were run on two systems. The first is a workstation with dual Intel(R) Xeon(R) Gold 5218 CPUs running Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-111-generic x86_64). The second is a cluster with dual Intel(R) Xeon(R) Gold 6248 CPUs running CentOS Linux release 7.7.1908 (Core). On both systems, the compiler and math library come from Intel Parallel Studio 2020 Cluster Edition.

What confuses me is that the problem only happens on the first system.

I did a little research and compiled SHTns without OpenMP support and with the verbose level set to 3, like this: “export CC=icc; export FC=ifort; ./configure --enable-mkl --enable-ishioka --enable-magic-layout --prefix=/home/shen/local --enable-verbose=3 --disable-openmp”. I also compiled magic-sph without OpenMP, like this: “cmake .. -DUSE_SHTNS=yes -DUSE_OMP=no”.

I ran the “precession” sample three times with a single MPI process. Each time it reported a different SHT accuracy value. Here are the relevant parts of the logs:

  • the first try
[SHTns 3.4.2] built for MagIC Jul 21 2020, 22:58:48, id: avx512,ishioka
        Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
        > Condon-Shortley phase = 0, normalization = 0
        => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
        => using Gauss nodes
          Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
        + polar optimization threshold = 1.0e-10
        finding optimal algorithm*****************
        + SHT accuracy = 0.241
ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
        => SHTns is ready.
 ! SHTns uses theta padding with nlat_padded=          72

**************forrtl: severe (154): array index out of bounds
  • the second try
[SHTns 3.4.2] built for MagIC Jul 21 2020, 22:58:48, id: avx512,ishioka
        Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
        > Condon-Shortley phase = 0, normalization = 0
        => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
        => using Gauss nodes
          Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
        + polar optimization threshold = 1.0e-10
        finding optimal algorithm*****************
        + SHT accuracy = 0.683
ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
        => SHTns is ready.
 ! SHTns uses theta padding with nlat_padded=          72

**************forrtl: severe (154): array index out of bounds
  • the third try
[SHTns 3.4.2] built for MagIC Jul 21 2020, 22:58:48, id: avx512,ishioka
        Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
        > Condon-Shortley phase = 0, normalization = 0
        => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
        => using Gauss nodes
          Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
        + polar optimization threshold = 1.0e-10
        finding optimal algorithm*****************
        + SHT accuracy = 0.467
ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
        => SHTns is ready.
 ! SHTns uses theta padding with nlat_padded=          72

**************ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
forrtl: severe (154): array index out of bounds

Please note that the SHT accuracy values differ between runs: 0.241, 0.683 and 0.467 respectively.

I also ran the test with 4 MPI processes. The relevant parts of the logs are:

  • the first try
[SHTns 3.4.2] built for MagIC Jul 21 2020, 22:58:48, id: avx512,ishioka
        Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
        > Condon-Shortley phase = 0, normalization = 0
        => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
        => using Gauss nodes
          Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
        + polar optimization threshold = 1.0e-10
        finding optimal algorithm************************************************************ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
*
        + SHT accuracy = 8.18e-14
        => SHTns is ready.
 ! SHTns uses theta padding with nlat_padded=          72

***********************************************************ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
**
 ! MPI transpose strategy for 5 fields
 ! isend/irecv/waitall communicator= 1.675E-03 s
 ! alltoallv communicator          = 2.730E-03 s
 ! alltoallw communicator          = 1.491E-03 s
 ...

 ! STARTING TIME INTEGRATION AT:
   start_time =  0.0000000000E+00
   step no    =         0
   start dt   =      1.0000E-05

 ! Starting time integration!
 ! Building matrices at time step:       1    1.000000E-05
  ! COURANT: dt= 1.0000E-05 > dt_r=  0.0000E+00 and dt_h=  0.0000E+00

 ! Time step too small, dt=    0.0000E+00
 ! I thus stop the run !



 ! Something went wrong, MagIC will stop now
 ! See below the error message:

 Stop run in steptime!
  • the second try
[SHTns 3.4.2] built for MagIC Jul 21 2020, 22:58:48, id: avx512,ishioka
        Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
        > Condon-Shortley phase = 0, normalization = 0
        => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
        => using Gauss nodes
          Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
        + polar optimization threshold = 1.0e-10
        finding optimal algorithm***************************************************************
        + SHT accuracy = 9.03e-14
        => SHTns is ready.
 ! SHTns uses theta padding with nlat_padded=          72

***********************************************************
 ! MPI transpose strategy for 5 fields
 ! isend/irecv/waitall communicator= 8.008E-04 s
 ! alltoallv communicator          = 1.152E-03 s
 ! alltoallw communicator          = 7.503E-04 s
 ! -> I pack some fields for the MPI transposes
 ! -> I choose alltoallw
 ...
  ! STOPPING TIME INTEGRATION AT:
   stop time =  2.0000000000E-03
   stop step =       201
   steps gone=       200

 !!! regular end of program MagIC !!!



  !***********************************!
  !---- THANK YOU FOR USING MAGIC ----!
  !---- ALWAYS HAPPY TO PLEASE YOU ---!
  !--------  call BACK AGAIN ---------!
  !- GET YOUR NEXT DYNAMO WITH MAGIC -!
  !***********************************!
                                   JW
  • the third try
[SHTns 3.4.2] built for MagIC Jul 21 2020, 22:58:48, id: avx512,ishioka
        Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
        > Condon-Shortley phase = 0, normalization = 0
        => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
        => using Gauss nodes
          Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
        + polar optimization threshold = 1.0e-10
        finding optimal algorithm*************************************************************ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m

        + SHT accuracy = 5.51e-14
        => SHTns is ready.
 ! SHTns uses theta padding with nlat_padded=          72

ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
**********************************************************ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m
ESC[93m Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues ESC[0m

 ! MPI transpose strategy for 5 fields
 ! isend/irecv/waitall communicator= 8.255E-04 s
 ! alltoallv communicator          = 1.142E-03 s
 ! alltoallw communicator          = 7.339E-04 s
 ! -> I pack some fields for the MPI transposes
 ! -> I choose alltoallw

 ...


 ! Something went wrong, MagIC will stop now
 ! See below the error message:

 Stop run in steptime!

  ! COURANT: dt= 1.0000E-05 > dt_r=  0.0000E+00 and dt_h=  0.0000E+00

 ! Time step too small, dt=    0.0000E+00
 ! I thus stop the run !



 ! Something went wrong, MagIC will stop now
 ! See below the error message:

 Stop run in steptime!

Abort(32) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 32) - process 1
Abort(32) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 32) - process 0

These results are even more confusing: the second try ended normally, while the first and third tries failed. Stranger still, the SHT accuracy value is very small in all three runs: 8.18e-14, 9.03e-14, and 5.51e-14. I checked the source file sht_init.c, where the threshold for the “accuracy” is 1e-6 at line 1439. I don’t understand why the program still reports “Accuracy test failed”.
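To compare the reported accuracy values with the number of failure messages in a log, a small helper can tabulate both. This is only a sketch and not part of SHTns or MagIC: the function name summarize_sht_log is hypothetical, the 1e-6 threshold is the value quoted above from sht_init.c, and the patterns are taken from the log excerpts in this report.

```shell
# Summarize an SHTns/MagIC log file given as $1 (hypothetical helper).
# Prints every reported "SHT accuracy" value with its position relative to
# the 1e-6 threshold, then the number of lines containing the failure message.
summarize_sht_log() {
    awk -v threshold=1e-6 '
        /SHT accuracy =/ {
            value = $NF
            verdict = (value + 0 > threshold + 0) ? "above" : "below"
            printf "accuracy %s is %s threshold %s\n", value, verdict, threshold
        }
        /Accuracy test failed/ { failed++ }
        END { printf "failure messages: %d\n", failed }
    ' "$1"
}
```

For example, summarize_sht_log log.txt (with log.txt standing in for a saved log) makes it easy to spot runs where the printed accuracy is below the threshold but failure messages are still present.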

In summary, I hit the SHT accuracy problem when running one of MagIC’s samples. Because the problem only happens on the Ubuntu workstation, I suspect a hardware- or OS-related cause.

Any suggestions would be appreciated. Thank you and best wishes.

PS: please see the sample and logs in the attached file.

Official response

  • Nathanaël Schaeffer repo owner

    The systematic accuracy error (reported initially) has been tracked down to the GNU assembler “as”, which gcc invokes during compilation and which produced wrong avx512 code in some cases.
    See https://sourceware.org/bugzilla/show_bug.cgi?id=23465 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

    To fix it on your system, upgrade the “binutils” package to version 2.32 or later, and then recompile SHTns.
    If you can’t upgrade the “binutils” package, you may try one of the following workarounds:

    1. [almost no performance impact] edit the Makefile and replace the first occurrence of $(shtcc) with $(shtcc) -fno-tree-pre
    2. [small performance impact] if you have icc (Intel compiler), reconfigure SHTns with the option --enable-kernel-compiler=icc
    3. [significant performance impact] edit the Makefile and replace all occurrences of native with skylake, effectively using avx2 instead of avx512.
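
A quick way to tell whether a machine still has a binutils older than 2.32 is to look at the assembler version. The sketch below is not part of SHTns: the function names binutils_version and binutils_is_fixed are hypothetical, and it assumes that the version number is the last field on the first line printed by as --version, which may not hold on every distribution.

```shell
# Extract "major.minor" from the first line of `as --version` output ($1).
# Assumption: the version is the last whitespace-separated field on that line.
binutils_version() {
    printf '%s\n' "$1" | head -n1 | awk '{print $NF}' | grep -oE '^[0-9]+\.[0-9]+'
}

# Succeed (exit status 0) if the given "major.minor" is 2.32 or later.
binutils_is_fixed() {
    local major=${1%%.*} minor=${1#*.}
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 32 ]; }
}

# Report on the assembler found in PATH, if there is one.
if command -v as >/dev/null 2>&1; then
    ver=$(binutils_version "$(as --version 2>/dev/null </dev/null)" || true)
    if [ -z "$ver" ]; then
        echo "could not parse the assembler version"
    elif binutils_is_fixed "$ver"; then
        echo "binutils $ver: 2.32 or later"
    else
        echo "binutils $ver: older than 2.32, upgrade it or apply a workaround"
    fi
fi
```

The check only looks at the version number; it does not test whether the installed assembler actually miscompiles the SHTns kernels.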

Comments (18)

  1. Nathanaël Schaeffer repo owner

    Thanks for reporting this.
    A few other users of SHTns are experiencing a similar issue, but unfortunately I can’t reproduce it on the machines I have access to.

    As possible workarounds I suggest:

    1. edit the Makefile of SHTns, and replace `native` with `skylake`. This will disable avx512 and reduce performance.

    OR (maybe better):

    2. reconfigure SHTns with the following option: `--enable-kernel-compiler=icc`. This usually also reduces performance, but maybe less than the first solution.

    I would appreciate feedback about any of these workarounds.

  2. Shen Yu reporter

    Thanks for your reply

    I tried both suggestions, and they work.

    However, when I set the number of MPI processes to 1 (e.g. mpirun -n 1 ../magic.exe input.nml), another error occurs:

    ...
    [SHTns 3.4.2] built for MagIC Jul 22 2020, 10:45:35, id: avx2,ishioka
            Lmax=42, Mmax*Mres=42, Mres=1, Nlm=946  [1 threads, no Condon-Shortley phase, Robert form, orthonormalized]
            > Condon-Shortley phase = 0, normalization = 0
            => using FFTW : Mmax=42, Nphi=128, Nlat=64  (data layout : phi_inc=72, theta_inc=1, phi_embed=128)
            => using Gauss nodes
              Gauss quadrature for 3/2.x^2 = 1 (should be 1.0) error = -1.55431e-15
            + polar optimization threshold = 1.0e-10
            finding optimal algorithm*******************
            + SHT accuracy = 1.01e-13
            => SHTns is ready.
     ! SHTns uses theta padding with nlat_padded=          72
    
    ********************forrtl: severe (154): array index out of bounds
    Image              PC                Routine            Line        Source
    magic.exe          0000000000913D1B  Unknown               Unknown  Unknown
    ...
    

    This error does not happen when the number of MPI processes is set to 2, 3, or 4. The good news is that actual calculations will never run with a single MPI process.

    Thanks again.

  3. Nathanaël Schaeffer repo owner

    After some more investigation, it seems that the problem comes from the compilation of the sht_kernels_s.c file. In the Makefile, replacing the first occurrence of $(shtcc) with $(shtcc) -fno-tree-pre solves the issue on an affected system (gcc 8.3.0 on Intel(R) Xeon(R) Gold 5218 CPU).

    Could you please confirm?
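
The Makefile change described above can be scripted. The following is a minimal sketch rather than an official SHTns tool: the function name add_no_tree_pre is hypothetical, it relies on the GNU sed 0,/regexp/ address form, and it assumes that the first $(shtcc) in the Makefile is the one used to compile sht_kernels_s.c.

```shell
# Append -fno-tree-pre to the first $(shtcc) in the Makefile given as $1.
# Does nothing if the flag is already present; keeps a backup as <file>.orig.
add_no_tree_pre() {
    local makefile=$1
    if grep -q -e '-fno-tree-pre' "$makefile"; then
        return 0
    fi
    cp "$makefile" "$makefile.orig"
    # 0,/re/ limits the substitution to the first matching line (GNU sed only);
    # the empty pattern in s// reuses that regex and & is the matched text.
    sed -i '0,/\$(shtcc)/s//& -fno-tree-pre/' "$makefile"
}
```

After running add_no_tree_pre Makefile in the SHTns source directory, SHTns still has to be rebuilt for the change to take effect.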

  4. Nathanaël Schaeffer repo owner

    Also, I would like to know what both your systems (the one that works without workaround and the other) tell you when you type gcc --version and uname -a
    Thanks.

  5. Shen Yu reporter

    The new compile option is under testing, and the compiler and system information are:

    The first system:

    The second system:

    $ gcc --version
    gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
    Copyright (C) 2015 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    $ uname -a
    Linux login01 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    

    Thank you.

  6. Shen Yu reporter

    The student who uses the first system told me that the problem was solved after making the change to the Makefile.

    Thank you again.

  7. peter schuck

    I am also having a very similar issue on openSUSE Leap 15.1. Oddly, it’s showing up with time_SHT and it’s intermittent: it doesn’t happen every time. I first noticed it while using the “--enable-cuda” switch. However, this doesn’t seem to be the problem, as I’ve now disabled it and I’m continuing to see the error. I will see if the Makefile modification solves this for me as well and report back.

    Linux wittenator 4.12.14-lp151.28.59-default #1 SMP Wed Aug 5 10:58:34 UTC 2020 (337e42e) x86_64 x86_64 x86_64 GNU/Linux

    Thanks

  8. peter schuck

    Okay I modified

    sht_kernels_s.o : sht_kernels_s.c Makefile $(hfiles) SHT/SH_to_spat_kernel.c
        $(shtcc) -fno-tree-pre -c $< -o $@

    in the Makefile.

    sphwvlts/shtns> gcc --version
    gcc (SUSE Linux) 9.3.1 20200406 [revision 6db837a5288ee3ca5ec504fbd5a765817e556ac2]
    Copyright (C) 2019 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    I ran “./time_SHT 8192 -mmax=4096 -mres=2” 10 times; it failed once:


    finding optimal algorithm t() = 5.91e+08 => nloop=1 (takes 20.7543 s)
    finding best syn ... t(omp2a) = 4.46e+08* t(omp3a) = 4.45e+08* t(omp4a) = 4.29e+08* t(omp6a) = 4.87e+08 t(omp8a) = 8.18e+08 => omp4a
    finding best ana ... t(omp2a) = 5.64e+08* t(omp3a) = 5.73e+08 t(omp4a) = 6.09e+08 t(omp6a) = 5.38e+08* t(omp8a) = 6.77e+08 => omp6a

    scalar SH - poloidal rms error = 0.00465 max error = 0.291 for l=7670,lm=7682597

    SHT accuracy = 0.291

    Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues


    I have the full log of all 10 runs if that helps.
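
Because the failure is intermittent, repeating the run and counting the failing runs gives a rough failure rate. The helper below is only a sketch: the function name count_accuracy_failures and the run_N.log file names are hypothetical, and it simply searches each run's output for the “Accuracy test failed” message shown above.

```shell
# Run a command N times and report how many runs printed the failure message.
# Usage: count_accuracy_failures N command [args...]
count_accuracy_failures() {
    local runs=$1 failures=0 i
    shift
    for i in $(seq 1 "$runs"); do
        # Keep the full output of each run so failing runs can be inspected.
        "$@" > "run_$i.log" 2>&1 || true
        if grep -q "Accuracy test failed" "run_$i.log"; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures of $runs runs failed the accuracy test"
}
```

For the test reported here, the call would be count_accuracy_failures 10 ./time_SHT 8192 -mmax=4096 -mres=2, which leaves run_1.log to run_10.log in the current directory.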

    Thanks

  9. Nathanaël Schaeffer repo owner

    Hello,

    Yes, I would like the logs of the 10 runs, please (it may be more convenient to send them by email).
    Also, could you check whether running time_SHT with the -polaropt=0 option helps?

    Best,

  10. peter schuck

    I also recompiled with --enable-cuda. Similar result. 10 runs of “./time_SHT 8192 -mmax=4096 -mres=2” and 1 failure:

    GPU #0 successfully initialized.
    finding optimal algorithm t() = 6.08e+08 => nloop=1 (takes 21.1976 s)
    finding best syn ... t(gpu1) = 7.65e+08* t(gpu3) = 7.42e+08* t(omp2a) = 5.81e+08 t(omp3a) = 6.65e+08 t(omp4a) = 6.78e+08 t(omp6a) = 6.34e+08 t(omp8a) = 7.46e+08 => gpu3

    finding best ana ... t(gpu1) = 1.3e+09* t(gpu3) = 1.45e+09 t(omp2a) = 1.38e+09 t(omp3a) = 1.52e+09 t(omp4a) = 1.54e+09 t(omp6a) = 1.6e+09 t(omp8a) = 1.62e+09 => gpu1

    scalar SH - poloidal rms error = 1.02 max error = 5.37 for l=1434,lm=5361009

    Accuracy test failed. Please file a bug report at https://bitbucket.org/nschaeff/shtns/issues

    nthreads = 96

    => SHTns is ready.
    Lmax=8192, Mmax*Mres=8192, Mres=2, Nlm=16785409 [96 threads, orthonormalized]
    Gauss grid : Nlat=8256, Nphi=8232

    syn ana vsy van gsp gto v3s v3a
    std: gpu3 gpu1 omp2a omp2a omp2a omp2a omp2a omp2a
    m: fly2 fly2 fly2 fly2 fly2 fly2 fly2 fly2

    generating random test case...
    ** performing 50 scalar SHT
    :: STD

    SHT time (lmax=8192): synthesis = 282.25118 ms [cpu 12483.651] [1981.267 Gflops] analysis = 373.49948 ms [cpu 533.675] [1497.231 Gflops]

    => max error = 24667.3 (l=4137,lm=4137) rms error = 348.063 **** ERROR ****

    :: LTR

    SHT time truncated at l=4096 : synthesis = 179.025900 ms, analysis = 151.707000 ms

    => max error = 4.7504 (l=666,lm=2602657) rms error = 0.469953 **** ERROR ****

    Thanks

  11. Nathanaël Schaeffer repo owner

    Thank you. I looked at the logs. Could you please do another test for me? Without --enable-cuda, do the errors still show up when running ./time_SHT with the -quickinit option?

  12. peter schuck

    added -fno-tree-pre to Makefile (as described above)

    No --enable-cuda

    ./time_SHT 8192 -mmax=4096 -mres=2  -polaropt=0

    No errors in 10 tests!

  13. Nathanaël Schaeffer repo owner

    The systematic accuracy error (reported initially) has been tracked down to the GNU assembler “as”, which gcc invokes during compilation and which produced wrong avx512 code in some cases.
    See https://sourceware.org/bugzilla/show_bug.cgi?id=23465 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512

    To fix it on your system, upgrade the “binutils” package to version 2.32 or later, and then recompile SHTns.
    If you can’t upgrade the “binutils” package, you may try one of the following workarounds:

    1. [almost no performance impact] edit the Makefile and replace the first occurrence of $(shtcc) with $(shtcc) -fno-tree-pre
    2. [small performance impact] if you have icc (Intel compiler), reconfigure SHTns with the option --enable-kernel-compiler=icc
    3. [significant performance impact] edit the Makefile and replace all occurrences of native with skylake, effectively using avx2 instead of avx512.

  14. Nathanaël Schaeffer repo owner

    In v3.4.3, ./configure detects this issue, informs the user how to solve it, and applies a possible workaround.
