Float (single precision) support
Hi, I came across your library and found it very helpful (and I’m the reporter of #42). An important reason why it’s helpful for me is its GPU support. However, I also notice that the whole library uses double precision, and single precision can be many times faster than double precision on GPUs (and perhaps also on CPUs, but to a lesser extent). ECMWF has also started to use single-precision arithmetic in its IFS ensemble forecasts.
Fortunately, CUDA programs are built on C++, so I could make the modifications relatively easily. Most of the coding work is adding a type template parameter to many functions and replacing “double” with that parameter. That alone is not enough, though: I also had to make float copies of the CUDA arrays stored in struct shtns_info, and adjust several constants controlling the rescaling for large lmax. After repeated tuning of these constants, the SH_to_spat and spat_to_SH functions reach a satisfactory accuracy (~1e-5 average relative error) at lmax=767, which is enough for my use case. 180 rounds of back-and-forth single-precision SHT transforms at lmax=767 take ~0.7 s on an NVIDIA T4 GPU, while the same double-precision transforms take ~3 s, a roughly 4x speedup.
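To make the two changes above concrete, here is a hedged, self-contained sketch (not SHTNS code; all names such as `rescale_threshold` and `decaying_recurrence` are invented for illustration). It shows a type template parameter replacing hard-coded double, and why the rescaling constants must be retuned for float, whose underflow limit (~1e-38) is far above double's (~1e-308):

```cpp
// Sketch only, not SHTNS code. Illustrates (a) templating a routine over the
// floating-point type instead of hard-coding double, and (b) why rescaling
// thresholds differ between float and double.
#include <cmath>
#include <limits>

// Precision-dependent rescaling threshold, kept well above the underflow
// limit of the chosen type (hypothetical helper, not from SHTNS).
template <typename real_t>
real_t rescale_threshold() {
    return std::sqrt(std::numeric_limits<real_t>::min());
}

// Toy stand-in for a decaying recurrence (like high-degree associated
// Legendre terms at large lmax). Rescales whenever the running value gets
// too small, and counts how often that happens.
template <typename real_t>
real_t decaying_recurrence(int steps, real_t factor, int& rescales) {
    real_t v = real_t(1);
    rescales = 0;
    for (int i = 0; i < steps; ++i) {
        v *= factor;
        if (std::fabs(v) < rescale_threshold<real_t>()) {
            v *= real_t(1e10);  // arbitrary rescale factor for this sketch
            ++rescales;
        }
    }
    return v;
}
```

With 2000 steps and factor 0.9, the double instantiation never needs to rescale, while the float one must rescale several times to avoid flushing to zero; this is the kind of constant retuning described above.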
Although I have met my needs with my own modifications, I’m looking forward to an official release with single-precision support. If you are interested, I could also provide my changes.
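The accuracy figure quoted above (~1e-5 average relative error) can be measured with a sketch like the following. This is not how SHTNS reports errors; here the “round trip” is just a cast to float and back, standing in for a single-precision SH_to_spat followed by spat_to_SH, and `avg_relative_error` is a made-up helper:

```cpp
// Sketch of the accuracy metric: average relative error of roundtripped
// coefficients against the originals (hypothetical helper, not SHTNS API).
#include <cmath>
#include <cstddef>
#include <vector>

double avg_relative_error(const std::vector<double>& ref,
                          const std::vector<double>& rt) {
    double sum = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i)
        sum += std::fabs(rt[i] - ref[i]) / std::fabs(ref[i]);
    return sum / static_cast<double>(ref.size());
}
```

A pure double-to-float-to-double cast gives errors at the float epsilon level (~1e-7); the ~1e-5 quoted above additionally reflects error accumulation inside the transforms at lmax=767.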
Comments (6)

- reporter: One thing I should mention: the two _any calls in the leg_m_highllim_kernel function, located at lines 973 and 980 of sht_gpu_kernels.cu in version 3.4.5, are also not correct and should be removed, like those in the ileg_m_highllim_kernel function. This is actually a bug and should be fixed.

- repo owner: Hello, single-precision support is one of the features that will be implemented soon. A user has already contributed float support for CUDA; you can find his branch here: https://gricad-gitlab.univ-grenoble-alpes.fr/schaeffn/shtns/-/tree/cuda-float
  Your contribution would also be appreciated, and I can merge everything into the next release. Full support for float (i.e. also on the CPU) may have to wait a little longer: the CPU code is in C, so templates are not available.
  PS: the _any calls you are referring to are not needed, but they improve performance on older (Kepler, maybe Pascal) hardware. Why do you say it is a bug? In double precision it is correct. I don’t expect this _any condition to work in single precision, though.

- reporter: attached shtns_gpu_float.zip

- reporter: I’ve attached my modified version. There are some dirty hacks, though. The problem with the _any calls is that they produce wrong results when I try single precision. I didn’t try a sufficiently large lmax in double precision, and I apologize if they are intended in the double-precision case.

- repo owner: Thank you! Could you just tell me which version or commit you based your modified version on?

- reporter: v3.4.6