Float (single precision) support

Issue #44 new
Zekun Ni created an issue

Hi, I came across your library and found it very helpful (and I’m the reporter of #42). An important reason why it’s helpful for me is its GPU support. However, I also notice that the whole library uses double precision, and single precision can be many times faster than double precision on GPUs (and perhaps also on CPUs, but to a lesser extent). ECMWF has also started to use single-precision arithmetic in its IFS ensemble forecasts.

Fortunately, CUDA programs are built on C++, so I could make the modifications relatively easily. Most of the coding work consists of adding a type template parameter to many functions and replacing “double” with that parameter. Of course, that’s not enough: I also had to make float copies of the CUDA arrays stored in struct shtns_info and adjust several constants controlling the rescaling for large lmax. After repeated tuning of those constants, the SH_to_spat and spat_to_SH functions reach a satisfactory accuracy (~1e-5 average relative error) at lmax=767, which is enough for my usage. 180 rounds of back-and-forth single-precision SHT transforms at lmax=767 take ~0.7s on an NVIDIA T4 GPU, while the same double-precision transforms take ~3s, corresponding to a roughly 4x speedup.
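
To give an idea of the kind of change involved, here is a minimal sketch of the templating approach (the kernel and the names real_t and axpy_kernel are made up for illustration; this is not the actual SHTns code):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch: the floating-point type becomes a template parameter instead of
    // a hard-coded double, so the same kernel source serves both precisions.
    template <typename real_t>
    __global__ void axpy_kernel(int n, real_t a, const real_t* x, real_t* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += a * x[i];   // identical code path for float and double
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Instantiating axpy_kernel<double> would give the double version.
        axpy_kernel<float><<<(n + 255) / 256, 256>>>(n, 0.5f, x, y);
        cudaDeviceSynchronize();
        printf("y[0] = %g\n", y[0]);   // expect 2.5

        cudaFree(x);
        cudaFree(y);
        return 0;
    }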

Although I have met my needs by doing this myself, I’m looking forward to an official release with single-precision support. If you are interested, I could also provide my modifications.

Comments (6)

  1. Zekun Ni reporter

    One thing I should mention: the two _any calls in the leg_m_highllim_kernel function, at lines 973 and 980 of sht_gpu_kernels.cu in version 3.4.5, are also not correct and should be removed, like those in the ileg_m_highllim_kernel function. This is actually a bug and should be fixed.

  2. Nathanaël Schaeffer repo owner

    Hello,

    Single-precision support is one of the features that will be implemented soon. A user has already contributed float support for CUDA; you can find his branch here: https://gricad-gitlab.univ-grenoble-alpes.fr/schaeffn/shtns/-/tree/cuda-float

    Your contribution would also be appreciated, and I can merge everything into the next release.
    Full support for float (i.e. also on the CPU) may have to wait a little longer: the CPU code is written in C, so templates are not supported.
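
    The usual workaround in plain C is to generate both precisions from a single macro, or to include the same source file twice with the real type redefined. A minimal, hypothetical sketch (made-up names, and not how the SHTns sources are actually organized):

        #include <stdio.h>

        /* Sketch: one macro expands to a float and a double variant of the same
           routine, which is roughly what templates would give in C++. */
        #define DEFINE_AXPY(REAL, SUFFIX)                              \
            void axpy_##SUFFIX(int n, REAL a, const REAL *x, REAL *y)  \
            {                                                          \
                for (int i = 0; i < n; ++i)                            \
                    y[i] += a * x[i];                                  \
            }

        DEFINE_AXPY(double, d)   /* defines axpy_d() */
        DEFINE_AXPY(float,  f)   /* defines axpy_f() */

        int main(void)
        {
            float x[3] = {1, 2, 3}, y[3] = {0, 0, 0};
            axpy_f(3, 0.5f, x, y);
            printf("%g %g %g\n", y[0], y[1], y[2]);   /* 0.5 1 1.5 */
            return 0;
        }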

    PS: the _any calls you are referring to are not needed, but they improve performance on older (Kepler, maybe Pascal) hardware. Why do you say it is a bug? In double precision it is correct. I don’t expect this _any condition to work in single precision, though.
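
    For context, here is a minimal, hypothetical sketch of the general warp-vote skip pattern (the kernel, names and threshold are made up; this is not the actual SHTns code). The idea is that a whole warp skips a piece of work only when every lane agrees its contribution is negligible, so dropping the test costs performance but not correctness:

        #include <cstdio>
        #include <cmath>
        #include <cuda_runtime.h>

        __global__ void sum_squares_kernel(int n, const double* val, double* out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            double v = (i < n) ? val[i] : 0.0;   // out-of-range lanes vote with 0
            const double SMALL = 1e-150;         // illustrative cutoff, tuned for double
            // __any_sync() is non-zero if the predicate holds for any lane of the
            // warp, so the work is skipped only when all 32 lanes are negligible.
            // A cutoff chosen for double may wrongly skip terms that still matter
            // in single precision, which would change the results.
            int needed = __any_sync(0xffffffffu, fabs(v) > SMALL);
            if (i < n)
                out[i] = needed ? v * v : 0.0;
        }

        int main()
        {
            const int n = 64;   // two warps: one all-negligible, one not
            double *val, *out;
            cudaMallocManaged(&val, n * sizeof(double));
            cudaMallocManaged(&out, n * sizeof(double));
            for (int i = 0; i < n; ++i) val[i] = (i < 32) ? 0.0 : 2.0;

            sum_squares_kernel<<<1, 64>>>(n, val, out);
            cudaDeviceSynchronize();
            printf("out[0] = %g, out[32] = %g\n", out[0], out[32]);   // 0 and 4

            cudaFree(val);
            cudaFree(out);
            return 0;
        }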

  3. Zekun Ni reporter

    I’ve attached my modified version. There are some dirty hacks, though.

    The problem with the _any calls is that they produce wrong results when I try single precision. I didn’t try a sufficiently large lmax in double precision, so I apologize if they are intentional in the double-precision case.

  4. Nathanaël Schaeffer repo owner

    Thank you! Could you just tell me which version or commit you based your modified version on?
