FP32 accuracy issues near the equator

Issue #66 open
Zekun Ni created an issue

Recently I’ve spotted a small accuracy problem in the v3.6.x versions. It seems to exist in both the spat_to_SH and cu_spat_to_SH routines whenever the SHT config is set to FP32 mode. Here is an example illustrating the problem.

1. Download a z500 analysis from the “reanalysis-era5-complete” archive, so that the variable is in its original spherical harmonics form with lmax=639.

2. Use SHTns to transform it from spherical harmonics to the regular gaussian grid F320 (320 lines between pole and equator). The SHT config is initialized with norm=sht_fourpi|SHT_NO_CS_PHASE and the grid is set with flags=SHT_THETA_CONTIGUOUS|SHT_SCALAR_ONLY|SHT_FP32. The attached file z500_1.png is the contour plot of the resulting variable, zoomed into 120E-120W, 30N-30S.

3. Use SHTns to transform it back to spherical harmonics and then transform again to the same regular gaussian grid. The attached file z500_2.png is the contour plot of the resulting variable, zoomed into the same area.

As you can see, some artifacts appear near the equator in z500_2.png but not in z500_1.png, which hints at the possibility that the accuracy problem exists only in the transform from the spatial domain to spherical harmonics. The problem disappears if the SHT is switched to double precision (by turning off SHT_FP32), or if I fall back to my own adapted version of single-precision SHTns provided in #44.

Comments (20)

  1. Nathanaël Schaeffer repo owner

    Could you please try the gpu-devel branch? It offers the possibility to perform the recurrence in double precision while keeping the transform itself single precision. This is controlled with the environment variable SHTNS_GPU_REC_PREC (only for single-precision transforms).

    Run your program like:

    • `SHTNS_GPU_REC_PREC=1 ./my_program` for single-precision Legendre polynomial computation
    • `SHTNS_GPU_REC_PREC=2 ./my_program` for double-precision Legendre polynomial computation

    You should see both a performance change and an accuracy change. Please let me know how the two compare. I hope the problem disappears with SHTNS_GPU_REC_PREC=2, but I wonder about the performance hit.

  2. Zekun Ni reporter

    Your guess is correct: the imperfections disappear after setting SHTNS_GPU_REC_PREC=2. However, there is a precipitous slowdown associated with this flag. On my laptop’s 4060 GPU, with SHTNS_GPU_REC_PREC=1, 10 runs of 512 back-and-forth SHT transforms on the F320 gaussian grid take 2.2 s without batching and 1.8 s with batch size 4; with SHTNS_GPU_REC_PREC=2, the same operations take 13.8 s without batching and 4.1 s with batch size 4.

  3. Zekun Ni reporter

    Despite the steep slowdown, this is still a substantial improvement over end-to-end FP64, where the same operations take 20.3 s without batching and 15.3 s with batch size 4. However, compared with the old single-precision version from before v3.6, the unbatched version becomes much slower with FP64 Legendre polynomial computation. The batched version still holds an advantage, though one could argue that comparison is not entirely fair.

  4. Nathanaël Schaeffer repo owner

    Ok, can you try:

    • for performance: can you use a batch size of 8? I expect further improvements up to that size.

    • for the initial accuracy issue: around line 400 in sht_gpu.cu there is a string containing “SHT_ACCURACY 1.0e-15f”; can you change it to 1.0e-18f, for instance?

    • for performance again: can you try to apply this commit and see whether performance improves with SHTNS_GPU_REC_PREC=2?

    https://gricad-gitlab.univ-grenoble-alpes.fr/schaeffn/shtns/-/commit/2a84251424ee933d5ea0d0174dfcb7acaa786844

    Once you report on these 3 tests, I’ll think about a proper fix for this issue that keeps performance acceptable.

  5. Zekun Ni reporter

    Thanks for your timely reply during the holiday season!

    1. A batch size of 4 is the optimal configuration on my laptop 4060; larger batch sizes take slightly more time.
    2. Unfortunately this seems ineffective: no difference can be seen in the contour plot. I also tried 1.0e-13f, with no difference either.
    3. Indeed, this is faster. It now takes 10.2s without batching and 3.3s with batch size=4.

  6. Nathanaël Schaeffer repo owner

    Regarding 3., I was not expecting such a big difference, especially since that commit applies only to synthesis! In addition, can you also add this commit on top of the previous one? It uses the same trick for analysis too (no need to recompile):

    https://gricad-gitlab.univ-grenoble-alpes.fr/schaeffn/shtns/-/commit/dc63884f174d884d728bc7133d24b3b6f824832e

    Regarding 1., there are group sizes and other kernel-launch parameters that I tuned on the GPUs I have access to, but they may not be optimal for your device. If you want, you can play around without recompiling by putting a file named sht_gpu.conf in the directory you run from (an undocumented feature), with the following content:

    4 # nwarp_s : maximum number of warps per block for synthesis
    4 # nf_s : number of fields handled together in the synthesis kernel (sharing a recursion)
    1 # nw_s : maximum number of latitudes handled together in the synthesis kernel (IPC)
    1 # nwarp_a : maximum number of warps per block for analysis
    4 # nf_a : number of fields handled together in the analysis kernel (sharing a recursion)
    4 # lspan_a : number of degrees handled together within a loop of the analysis kernel
    

    I think the above values are the default. There is no failsafe, and some values may not work. For instance nf_* must be a divisor of the batch size. It is normally safe to play with nwarp_s, nw_s and nwarp_a.

    For sht_gpu.conf to be taken into account, you must also define the environment variable SHTNS_GPU_CONF (to anything).
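    Putting the two steps together, a run might look like this (a hypothetical sketch; `./my_program` stands for your own binary, and the values are the defaults listed above):

```shell
# Write the (undocumented) tuning file into the directory you run from.
cat > sht_gpu.conf <<'EOF'
4 # nwarp_s : maximum number of warps per block for synthesis
4 # nf_s : number of fields handled together in the synthesis kernel
1 # nw_s : maximum number of latitudes handled together (IPC)
1 # nwarp_a : maximum number of warps per block for analysis
4 # nf_a : number of fields handled together in the analysis kernel
4 # lspan_a : number of degrees handled together in the analysis kernel
EOF

# The file is only read if SHTNS_GPU_CONF is defined (to anything).
export SHTNS_GPU_CONF=1
# ./my_program    # hypothetical binary: launch as usual and compare timings
```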

  7. Zekun Ni reporter

    Your new patch speeds up the test further: it now finishes in 7.1 s without batching and in 2.6 s with batch size 4. That leaves only a 40~50% slowdown with batch size 4, a big improvement! Without batching, performance still suffers a lot.

  8. Zekun Ni reporter

    I tried to understand the difference between 3.6.x and 3.5.x. It looks like you changed the meaning of y0 and y1 in order to reduce the number of arithmetic operations. Could that be the culprit behind the large increase in error?

    The biggest differences are actually at m=0 with even l. At higher l the input is only on the order of 1e-3 to 1e-2, while your transform gives something between 0.5 and 1 with alternating signs (positive when l mod 4 = 0, negative when l mod 4 = 2). After replacing the m=0 part with the correct input data and transforming back to the gaussian grid, the imperfections are gone. The average and maximum errors also drop sharply (avg error = 2.25 and max error = 50.94 before the replacement, reduced to 0.01 and 0.23 after).

    I hope this issue can be resolved without falling back to FP64 for some arithmetic operations. If that is not possible (or only possible with a significant slowdown), perhaps as a hack we could do FP64 arithmetic only for the m=0 case.

  9. Nathanaël Schaeffer repo owner

    The new recurrence that computes y0 and y1 (with different meaning, yes) involves x*x instead of x, where x is cos(theta) and theta the colatitude. I suspect this is an issue when x is close to zero (= near the equator). For (l-m) odd, the values tend to zero close to the equator, so there seems to be no issue. But for (l-m) even, the values are finite close to the equator.
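    As a toy illustration of finite-precision effects at high degree, here is a stdlib-Python sketch (using the plain three-term Legendre recurrence with fp32 rounding emulated at every step; this is not SHTns’s actual kernel nor Ishioka’s recurrence):

```python
import struct

def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def legendre(lmax, x, rnd=lambda v: v):
    """P_lmax(x) via the three-term recurrence; rnd emulates storage precision."""
    p0, p1 = 1.0, x  # P_0 and P_1
    for l in range(2, lmax + 1):
        p0, p1 = p1, rnd(((2*l - 1) * x * p1 - (l - 1) * p0) / l)
    return p1

x = 0.0        # cos(theta) at the equator
lmax = 640     # comparable to the Lmax=639 case in this issue
ref = legendre(lmax, x)             # full fp64 recurrence as reference
lo = legendre(lmax, f32(x), f32)    # every step rounded to fp32
rel_err = abs(lo - ref) / abs(ref)
print(rel_err)  # per-step roundings accumulate over the 640 degrees
```

This only shows generic fp32 error accumulation; the specific equator artifacts come from the x*x form of the new recurrence, which this sketch does not reproduce.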

    I however don’t understand why m=0 is special.

    In addition, I cannot reproduce the issue when starting from a random set of coefficients, and I could not find the data you use. Could you please send me the list of coefficients needed to produce the pictures above?

  10. Zekun Ni reporter

    I’ve actually found that this issue only becomes prominent when the values in the input are much larger than their standard deviation, for example geopotential at a level high above the surface. After I subtract the mean from the values, the error goes down a lot, though there is still an obvious degradation in precision. This may also explain why the issue is worst at m=0.

  11. Nathanaël Schaeffer repo owner

    With a large mean, I can reproduce the behaviour. Of course, in fp32 the dynamic range is reduced, so a large mean implies larger errors. Ideally the caller, who knows about this large mean, should take measures such as removing the mean before calling shtns and adding it back afterwards. A large mean comes up more frequently than a single large mode elsewhere, although that can happen too. Treating several modes separately to avoid errors piling up would be too costly, but for the mean alone I think it is acceptable to take measures within shtns to mitigate this. What do you think?
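    A minimal stdlib-Python illustration of why removing a large mean helps in fp32 (the 55000 mean and 0.123456 anomaly are made-up numbers for illustration, not taken from SHTns or the data above):

```python
import struct

def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

mean, anomaly = 55000.0, 0.123456   # large offset plus small signal

# Round-trip the raw value through fp32: the mean consumes most
# of the 24 mantissa bits, so little is left for the anomaly.
direct_err = abs((f32(mean + anomaly) - mean) - anomaly)

# Remove the mean first: the full mantissa is available for the anomaly.
centered_err = abs(f32(anomaly) - anomaly)

print(direct_err, centered_err)
```

With these numbers the direct round-trip can only represent the anomaly to within the fp32 spacing near 55000 (about 4e-3), while the centered value keeps full fp32 relative accuracy, several orders of magnitude better.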

    I have pushed a new branch gpu-fp32-wip that implements a quick-and-dirty hack to do so. It basically computes the mean separately and takes it out of the recurrence, where it pollutes other modes due to finite precision. It is fully fp32. Could you please test it? It needs some polish, and maybe a little more thought about the errors in general, before I include it in the main branch.

  12. Zekun Ni reporter

    The hack of computing the mean and removing it from the calculation should largely reduce the error, but in my opinion this is probably not a job the SHT library should do. You have also mentioned there are other modes, such as a large spectral component at some l>0, which cannot be addressed by this hack, though they should occur much less frequently in real-world use cases.

    If there were a way to reduce the numerical error with a more careful computation, instead of falling back to FP64 or resorting to dirty hacks, that would be the best approach. Otherwise, it is probably best to keep the current SHTNS_GPU_REC_PREC switch as a fallback option for users.

    Since FP32 precision is much better in previous versions with the original recurrence formula, an alternative solution would be to reimplement that original recurrence while keeping the other new features (such as batching) intact, and give users the option to fall back to it. I notice there is already a switch, SHTNS_ISHIOKA, which appears to do exactly this, but it is not supported in GPU mode right now. Falling back to FP64 is not a desirable solution, since modern GPUs are heavily optimized for lower-precision computation such as FP32 or even FP16; even a small portion of FP64 may drag performance down a lot. I suspect a fully FP32 version with the original recurrence would outperform the partially FP64 version with the new recurrence and SHTNS_GPU_REC_PREC=2, though it might be slightly slower than the current fully FP32 version.

  13. Nathanaël Schaeffer repo owner

    Yes, the previous recurrence relation will be made available for FP32 soon (I have already done it for synthesis; analysis is coming soon). The correction of the mean will also be included. It is only my quick implementation that is a bit hacky (but it is enough for you to test already); I will clean it up later.
    I think the correction of the mean is a significant improvement for the previous recurrence as well, because it improves its accuracy too (although less spectacularly).

  14. Zekun Ni reporter

    I’ve tested your gpu-fp32-wip branch. It does the job, and the improvement in precision is the same as if one manually subtracts the mean from the input. I didn’t observe any visible difference in transform speed.

    Thank you for following up on this issue and I’ll wait for your release of the previous recurrence!

  15. Nathanaël Schaeffer repo owner

    Branch gpu-fp32-wip (commit d00b3ed) now uses the new recurrence (Ishioka’s) with fp64, and for pure fp32 falls back to a recurrence that is very close to the previous standard one but uses a slightly different normalization internally, which reduces the arithmetic operation count. Regarding accuracy, in my tests it is never worse than the standard recurrence, and often slightly better (probably because fewer operations mean fewer roundings). I also kept the mean correction (only for fp32 for now), which is always a good thing for fp32, at negligible cost.

    For now, only scalar transforms work, as some fixes are still needed for vector transform, but I think you don’t use vector transforms anyway.

    Could you please test?

  16. Zekun Ni reporter

    I’ve tested your latest branch. The (modified) old recurrence runs slightly slower than the new one, but the difference is very small (~2% at batch size 4). However, contrary to what you described in your comment, in most cases I find the error is larger than with the new recurrence, though the two are mostly of the same order of magnitude.

    I tested six settings. For settings #1~#3 I randomly generate the spherical harmonics, with each real and imaginary part following a standard Gaussian distribution N(0,1); for setting #2 I then multiply the value at l=1, m=0 by 1000, and for setting #3 the value at l=1, m=1 by 1000. After that they are converted to grid-point values. Settings #4~#6 mirror #1~#3 respectively, but the Gaussian distribution is changed to N(0, 1/(l+1)^1.5) for each l, which mimics the pattern found in meteorological data. The mean squared errors between the grid-point values after two back-and-forth SHT transforms and the original ones are reported below:

    Setting   Old recurrence   New recurrence
    #1        1.55e-3          4.79e-4
    #2        6.00e-3          2.59e-3
    #3        1.51e-3          6.57e-4
    #4        1.61e-9          1.07e-9
    #5        6.03e-4          2.20e-4
    #6        1.90e-6          1.80e-5

    As you can see, in settings #1~#5 the old recurrence actually produces 20~80% larger errors, whereas in setting #6 the new one produces 3x the error of the old one. So there is no straightforward answer as to which is better. Perhaps in most cases either is accurate enough and the new recurrence is slightly better, but in some cases the new recurrence may fail spectacularly.
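    For reference, the six input sets described above can be generated along these lines (a stdlib-Python sketch; the (l, m) dictionary layout and the seed are illustrative choices, not the reporter’s actual code):

```python
import random

def make_spectrum(lmax, setting, seed=0):
    """Random SH coefficients as {(l, m): (re, im)} following settings #1-#6."""
    rng = random.Random(seed)
    coeffs = {}
    for l in range(lmax + 1):
        # settings #1-#3: N(0,1); settings #4-#6: variance 1/(l+1)^1.5
        sigma = 1.0 if setting <= 3 else (l + 1) ** -0.75  # std = sqrt(variance)
        for m in range(l + 1):
            coeffs[(l, m)] = (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    # settings #2/#5 boost (l=1, m=0); settings #3/#6 boost (l=1, m=1)
    if setting in (2, 5):
        re, im = coeffs[(1, 0)]
        coeffs[(1, 0)] = (1000 * re, 1000 * im)
    elif setting in (3, 6):
        re, im = coeffs[(1, 1)]
        coeffs[(1, 1)] = (1000 * re, 1000 * im)
    return coeffs
```

With a fixed seed the generation is reproducible, so settings #2 and #3 boost one mode of the same base field as #1 (and likewise #5/#6 relative to #4).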

  17. Nathanaël Schaeffer repo owner

    What Lmax did you use for your tests ?
    I think there is a misunderstanding: In the statements I made about the errors, I was comparing the old unmodified recurrence (the one used before batching was introduced) with the old modified one (re-introduced to fix the accuracy issues). They have very similar errors.

    What you call the new recurrence (Ishioka’s recurrence, which works with cos(theta)^2 instead of cos(theta), explaining the accuracy issues at the equator, where cos(theta)^2 becomes too small for a pure fp32 recurrence at large Lmax) is actually better in most of the cases you tested. Indeed, I have also observed in my tests that it is better close to the poles but worse at the equator (which is what you discovered, as the title of the issue says). It is also more sensitive to magnitude contrasts between the modes (your setting #6 has the largest contrast), because it somehow mixes the modes before separating them, which works well for modes of comparable magnitude but not so well for large contrasts (significant bits of the smaller modes are lost).

    The fact that the new one is better near the poles is also likely due to the use of cos(theta)^2 instead of cos(theta): near the poles the values get very close to one. The error close to the poles comes from m=0; indeed, all m>0 modes are zero at the poles. I’m considering using full double precision for m=0 to overcome this issue.

  18. Zekun Ni reporter

    Thanks for your clarification! In the tests I used Lmax=639.

    Using full double precision for m=0 could be a reasonable compromise, though I’m curious about the speed difference.
