Another memory access issue in cu_SH_to_spat

Issue #61 resolved
Zekun Ni created an issue

My programs ran smoothly for a couple of weeks after issue #60 was fixed, but started to fail again recently after I made some changes. This time the error was different, “an illegal memory access was encountered”, and the point of failure was cu_SH_to_spat instead of cu_spat_to_SH.

Since it was clearly a memory access issue, I scanned through the code again. With lmax=255 under single precision, I inspected the kernel function leg_m_highllim_kernel and found one line that could cause out-of-bounds reads: sht_gpu_kernels.cu:1025. I changed it to

qk[j] = l+((j>>1)&15) <= llim ? ql[2*l+(j&31)] : 0.0;  // substitute 0 instead of reading ql out of bounds

and ran the program again. It has been up for two hours without an error; I’ll see whether this really solves the issue.
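
For context, the fixed line is a guarded load: each thread computes the degree its slot corresponds to, l + ((j>>1)&15), and substitutes zero whenever that degree exceeds llim, so no thread reads past the end of ql. Below is a self-contained sketch of the same guard; the indexing is taken from the fix above, while the kernel wrapper, launch shape, and buffer sizes are illustrative assumptions, not the actual SHTns code.

    #include <cuda_runtime.h>

    // Illustrative wrapper around the guarded load from the fix above.
    // One warp fills the 32 slots of qk; the guard keeps every read of
    // ql (of size 2*(llim+1)) in bounds, since the largest index used,
    // 2*l + 31, is only touched when l + 15 <= llim holds.
    __global__ void guarded_load(float* qk, const float* ql, int l, int llim)
    {
        const int j = threadIdx.x;               // 0..31, one warp assumed
        const int deg = l + ((j >> 1) & 15);     // degree handled by slot j
        qk[j] = (deg <= llim) ? ql[2*l + (j & 31)] : 0.0f;
    }

    int main(void)
    {
        const int llim = 255, l = 240;           // example values
        float *qk, *ql;
        cudaMalloc(&qk, 32 * sizeof(float));
        cudaMalloc(&ql, 2 * (llim + 1) * sizeof(float));
        cudaMemset(ql, 0, 2 * (llim + 1) * sizeof(float));
        guarded_load<<<1, 32>>>(qk, ql, l, llim);
        cudaDeviceSynchronize();
        cudaFree(qk); cudaFree(ql);
        return 0;
    }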

Comments (5)

  1. Nathanaël Schaeffer repo owner

    Yes, your fix is needed to avoid array overflow. Thanks.

    Actually, neither issue #60 nor this one is present in the cuda-rtc branch, which I’m about to make the main branch. The kernels have been almost completely rewritten, but they do not support single precision yet, for lack of time on my side. I plan to add it, possibly before the summer, but there should also be some testing, both in terms of accuracy (which hopefully stays the same) and performance (which should be significantly better). Let me know if you are interested in any of these tasks.

  2. Zekun Ni reporter

    It looks like this change does fix the issue, as my program has been up for more than a day.

    The news that a major upgrade is underway is thrilling. The potential performance improvement could also give a big boost to my project. I’d definitely love to port the new kernels to single precision, but that depends on the timeline of my own project; right now I need to spend quite some time monitoring and debugging every day. If it requires less work later, I’ll take a look at your new upgrade. Has this upgrade already been tested under double precision, so that it’s ready to be ported to single precision?

    By the way, issue #60 does seem to exist in your new branch, namely here.

  3. Nathanaël Schaeffer repo owner

    I use the cuda-rtc branch in production on GPU in double precision quite often.
    Although it may seem that issue #60 also exists in cuda-rtc, ql there is never a buffer provided by the user: another kernel invocation follows which reads ql and writes into the user-provided buffer (see the sketch after this comment).
    I’ve actually started writing the single-precision port; maybe I can finish it in the next two weeks or so. But I’m considering changing the logic a bit, making an shtns_cfg work for either ‘double’ or ‘float’ but not both. That would allow memory savings and shorter initialization times. Would that fit your application?

    By the way, the new branch also provides so-called “batched” transforms, where several identical transforms are performed together. This increases performance quite a bit at low to moderate sizes (about a factor of 3 for lmax=255). I don’t know whether this would fit your problem or not.
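
    As a side note, here is a minimal CUDA sketch of the two-stage pattern described above, purely illustrative and not SHTns’s actual kernels: stage 1 writes into an internal, padded scratch buffer owned by the library, so a mild index overrun never touches user memory, and stage 2 writes exactly the requested range into the user-provided buffer.

        #include <cuda_runtime.h>

        // Stage 1: write into an internal scratch buffer allocated with
        // padding, so a small index overrun stays inside library-owned
        // memory (a stand-in replaces the real transform here).
        __global__ void stage1(float* scratch, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) scratch[i] = (float)i;
        }

        // Stage 2: read the scratch and write exactly n values into the
        // buffer provided by the user.
        __global__ void stage2(const float* scratch, float* user_out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) user_out[i] = scratch[i];
        }

        int main(void)
        {
            const int n = 1024, pad = 32;        // padding absorbs overruns
            float *scratch, *user_out;
            cudaMalloc(&scratch, (n + pad) * sizeof(float));
            cudaMalloc(&user_out, n * sizeof(float));
            stage1<<<(n + 255) / 256, 256>>>(scratch, n);
            stage2<<<(n + 255) / 256, 256>>>(scratch, user_out, n);
            cudaDeviceSynchronize();
            cudaFree(scratch); cudaFree(user_out);
            return 0;
        }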

  4. Zekun Ni reporter

    Great to know that line doesn’t actually cause an out-of-bounds access! I hadn’t read your code in much detail.

    The proposal that an shtns_cfg work with only one scalar type doesn’t affect my project, since I only use single precision. The “batched” transforms should also give a big performance boost in my use case. Looking forward to the release! If you’d like me to test your code, I’d love to do so.

  5. Nathanaël Schaeffer repo owner

    You can try the cuda-rtc branch in single precision. It works here, but I would be interested in the accuracy change (if any) and the performance change (there will be one). Note that batched transforms may be needed to get the performance boost. Also, you now have to pass SHT_FP32 to shtns_set_grid*() to get single precision instead of double precision; a usage sketch follows below.
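
    A minimal host-side sketch of this in use; shtns_create, shtns_set_grid, and shtns_destroy are the stock SHTns calls, the grid sizes are arbitrary examples, and only the placement of SHT_FP32 reflects what is described above (other details of the cuda-rtc branch may differ).

        #include <shtns.h>

        int main(void)
        {
            const int lmax = 255, mmax = 255, mres = 1;
            const int nlat = 384, nphi = 768;    /* arbitrary valid grid */

            shtns_cfg cfg = shtns_create(lmax, mmax, mres, sht_orthonormal);

            /* Per the comment above, OR SHT_FP32 into the grid flags to
             * request single precision; omit it to keep double precision. */
            shtns_set_grid(cfg, (enum shtns_type)(sht_gauss | SHT_FP32),
                           0.0, nlat, nphi);

            /* ... allocate spectral/spatial buffers and run the GPU
             * transforms (e.g. cu_SH_to_spat) here ... */

            shtns_destroy(cfg);
            return 0;
        }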
