Memory issues in the cu_spat_to_SH/cu_SH_to_spat functions

Issue #60 resolved
Zekun Ni created an issue

I’m using this library heavily now that single-precision support is available on the GPU. However, the library function cu_spat_to_SH sometimes fails mysteriously with the CUDA error “invalid argument”. Google search results suggested it might be a memory issue, so I took another look at the code. What I found is that, unlike their CPU equivalents spat_to_SH/SH_to_spat, the two cu_* functions actually require slightly larger buffers for the input and output arrays: the spatial and spectral representations must have cfg->spat_stride and cfg->nlm_stride elements respectively, as evidenced here. Since nlat must be divisible by 64, spat_stride is usually equal to nlat*nphi, but the same usually does not hold between nlm_stride and nlm. Unfortunately, nlm_stride is a private member, so this goes beyond a documentation issue: callers have no way to figure out how much they should allocate for the arrays, and some code changes are surely required.
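To make the sizing problem concrete, here is a rough sketch of the caller-side device allocations (single precision assumed; nlm, nlat, nphi are the public read-only members of shtns_cfg, and float2 just stands in for a single-precision complex coefficient):

    #include <shtns.h>
    #include <cuda_runtime.h>

    void alloc_transform_buffers(shtns_cfg cfg, float **d_spat, float2 **d_qlm)
    {
        // Spatial grid: nlat*nphi real values. Since nlat is a multiple of 64,
        // this normally coincides with the internal spat_stride, so it is safe.
        cudaMalloc((void**)d_spat, (size_t)cfg->nlat * cfg->nphi * sizeof(float));

        // Spectral coefficients: nlm complex values is what spat_to_SH/SH_to_spat
        // expect, but cu_spat_to_SH/cu_SH_to_spat actually touch nlm_stride elements,
        // and nlm_stride is private, so there is no clean way to size this correctly.
        cudaMalloc((void**)d_qlm, (size_t)cfg->nlm * sizeof(float2));
    }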

Comments (9)

  1. Nathanaël Schaeffer repo owner

    Hello,
    A first quick answer, with a possible workaround (but it should be confirmed by a careful analysis later).
    If the issue is the one I think it is, based on your description, you can assume that allocating (nlm+mmax+16) complex numbers should be safe (see here). But I’m not sure this is the issue, as this nlm_stride is used for internal buffers only (as far as I remember; I will check later).

    Let me know if this tentative workaround solves the issue.
    By the way, are the failures systematic for a given size, or more or less random?

  2. Zekun Ni reporter

    Unfortunately, according to the link in my original post, nlm_stride is also explicitly used for the output of cu_spat_to_SH: this line sets all nlm_stride elements to zero. I’m not sure whether indices beyond nlm are actually visited in the subsequent CUDA kernel ilegendre, though. I can hard-code (nlm+mmax+1+31)/32 in my code as a workaround, but a less ugly solution is definitely needed.

    The failure is somewhat random, but it is much more frequent when the array itself is small, so that such an overflow is more likely to hit an adjacent memory chunk.
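    Roughly, that overflow looks like this (a sketch only; nlm_stride stands in for the library’s private internal value, and float2 for a single-precision complex coefficient):

        #include <cuda_runtime.h>

        void overflow_pattern(size_t nlm, size_t nlm_stride, cudaStream_t stream)
        {
            float2 *d_qlm;
            // the caller sizes the output for nlm complex coefficients...
            cudaMalloc((void**)&d_qlm, nlm * sizeof(float2));
            // ...but nlm_stride of them get zeroed; whenever nlm_stride > nlm the
            // memset runs past the allocation and can hit an adjacent memory chunk.
            cudaMemsetAsync(d_qlm, 0, nlm_stride * sizeof(float2), stream);
        }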

  3. Nathanaël Schaeffer repo owner

    Yes, if the workaround solves the issue, I will definitely fix the code in a proper and clean way.
    But please use this formula (see the sketch at the end of this comment); the one you wrote is not correct (it mixes up real vs complex numbers and bytes vs sizeof(real)):

    nlm + mmax + 16
    

    Thank you very much for reporting.
    FYI, I also use the double-precision version of these functions extensively and have never stumbled on this kind of issue, so it might also be specific to single precision, which is not quite what you suggest.
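    In caller code, the workaround amounts to something like the following sketch (single precision assumed; cfg->nlm and cfg->mmax are the public shtns_cfg members, float2 stands in for a single-precision complex number, and the helper name is only illustrative):

        #include <shtns.h>
        #include <cuda_runtime.h>

        // allocate the spectral array with the safety margin of mmax+16 complex numbers
        float2* alloc_qlm_padded(shtns_cfg cfg)
        {
            size_t n_cplx = (size_t)cfg->nlm + cfg->mmax + 16;   // count of complex numbers, not bytes or reals
            float2 *d_qlm = NULL;
            cudaMalloc((void**)&d_qlm, n_cplx * sizeof(float2));
            return d_qlm;
        }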

  4. Zekun Ni reporter

    I’ll test the workaround the next time my program fails and see if the “invalid argument” error still appears. In any case, this cudaMemsetAsync call does look likely to trigger memory errors if the output array is not large enough.

    I hadn’t run into this issue during the last two years either, but recently the program has grown more sophisticated and makes more frequent calls on more arrays, and it has started to fail miserably, especially under heavier memory access on the GPUs.

  5. Zekun Ni reporter

    Hi Nathanaël, thanks for the fix; after applying it, the “invalid argument” error has not popped up so far. A variant of my program that kept failing shortly after starting has now been running without errors for an hour. It does look like the error has been fixed by eliminating the memory overflow at sht_gpu_kernels.cu:1576.

  6. Nathanaël Schaeffer repo owner

    Thank you for spotting it! For what it’s worth, I have made a new release, v3.5.4.
