Another memory access issue in cu_SH_to_spat

Issue #61 resolved
Zekun Ni created an issue

My programs ran smoothly for a couple of weeks after issue #60 was fixed, but started to fail again recently after I made some changes. This time the error was different, “an illegal memory access was encountered”, and the point of failure was cu_SH_to_spat instead of cu_spat_to_SH.

Since it was clearly a memory access issue, I scanned through the code again. With lmax=255 under single precision, I inspected the kernel function leg_m_highllim_kernel and found one line that could cause out-of-bounds reads: sht_gpu_kernels.cu:1025. I changed it to

qk[j] = l+((j>>1)&15) <= llim ? ql[2*l+(j&31)] : 0.0;  // substitute 0 instead of reading ql out of bounds

and ran the program again. It has been up for two hours without an error; I’ll see whether this really solves the issue.
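
For context, the fixed line is a guarded load: each thread computes the degree its slot corresponds to, l + ((j>>1)&15), and substitutes zero whenever that degree exceeds llim, so no thread reads past the end of ql. Below is a self-contained sketch of the same guard; the indexing is taken from the fix above, while the kernel wrapper, launch shape, and buffer sizes are illustrative assumptions, not the actual SHTns code.

    #include <cuda_runtime.h>

    // Illustrative wrapper around the guarded load from the fix above.
    // One warp fills the 32 slots of qk; the guard keeps every read of
    // ql (of size 2*(llim+1)) in bounds, since the largest index used,
    // 2*l + 31, is only touched when l + 15 <= llim holds.
    __global__ void guarded_load(float* qk, const float* ql, int l, int llim)
    {
        const int j = threadIdx.x;               // 0..31, one warp assumed
        const int deg = l + ((j >> 1) & 15);     // degree handled by slot j
        qk[j] = (deg <= llim) ? ql[2*l + (j & 31)] : 0.0f;
    }

    int main(void)
    {
        const int llim = 255, l = 240;           // example values
        float *qk, *ql;
        cudaMalloc(&qk, 32 * sizeof(float));
        cudaMalloc(&ql, 2 * (llim + 1) * sizeof(float));
        cudaMemset(ql, 0, 2 * (llim + 1) * sizeof(float));
        guarded_load<<<1, 32>>>(qk, ql, l, llim);
        cudaDeviceSynchronize();
        cudaFree(qk); cudaFree(ql);
        return 0;
    }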

Comments (5)

  1. Nathanaël Schaeffer repo owner

    Yes, your fix is needed to avoid array overflow. Thanks.

    Actually, neither issue #60 nor this one is present in the cuda-rtc branch, which I’m about to make the main branch. The kernels have been almost completely rewritten, but they do not support single precision yet, for lack of time on my side. I plan to add it, possibly before the summer, but there should also be some testing, both in terms of accuracy (which hopefully stays the same) and performance (which should be significantly better). Let me know if you are interested in any of these tasks.

  2. Zekun Ni reporter

    It looks like this change does fix the issue, as my program has been up for more than a day.

    The news that a major upgrade is underway is thrilling. The potential performance improvement could also give a big boost to my project. I’d definitely love to port the new kernels to single precision, but that depends on the timeline of my own project; right now I need to spend quite some time monitoring and debugging every day. If it requires less work later, I’ll take a look at your new upgrade. Has this upgrade already been tested under double precision, so that it’s ready to be ported to single precision?

    By the way, issue #60 does seem to exist in your new branch, namely here.

  3. Nathanaël Schaeffer repo owner

    I use the cuda-rtc branch in production on GPU in double precision quite often.
    Although it may seem that issue #60 also exists in cuda-rtc, ql there is never a buffer provided by the user: another kernel invocation follows which reads ql and writes into the user-provided buffer (see the sketch after this comment).
    I’ve actually started writing the single-precision port; maybe I can finish it in the next two weeks or so. But I’m considering changing the logic a bit, making an shtns_cfg work for either ‘double’ or ‘float’ but not both. That would allow memory savings and shorter initialization times. Would that fit your application?

    By the way, the new branch also provides so-called “batched” transforms, where several identical transforms are performed together. This increases performance quite a bit at low to moderate sizes (about a factor of 3 for lmax=255). I don’t know whether this would fit your problem or not.
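
    As a side note, here is a minimal CUDA sketch of the two-stage pattern described above, purely illustrative and not SHTns’s actual kernels: stage 1 writes into an internal, padded scratch buffer owned by the library, so a mild index overrun never touches user memory, and stage 2 writes exactly the requested range into the user-provided buffer.

        #include <cuda_runtime.h>

        // Stage 1: write into an internal scratch buffer allocated with
        // padding, so a small index overrun stays inside library-owned
        // memory (a stand-in replaces the real transform here).
        __global__ void stage1(float* scratch, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) scratch[i] = (float)i;
        }

        // Stage 2: read the scratch and write exactly n values into the
        // buffer provided by the user.
        __global__ void stage2(const float* scratch, float* user_out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) user_out[i] = scratch[i];
        }

        int main(void)
        {
            const int n = 1024, pad = 32;        // padding absorbs overruns
            float *scratch, *user_out;
            cudaMalloc(&scratch, (n + pad) * sizeof(float));
            cudaMalloc(&user_out, n * sizeof(float));
            stage1<<<(n + 255) / 256, 256>>>(scratch, n);
            stage2<<<(n + 255) / 256, 256>>>(scratch, user_out, n);
            cudaDeviceSynchronize();
            cudaFree(scratch); cudaFree(user_out);
            return 0;
        }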

  4. Zekun Ni reporter

    Great to know that line doesn’t actually cause an out-of-bounds access! I hadn’t read your code in much detail.

    The proposal that an shtns_cfg work with only one scalar type doesn’t affect my project, since I only use single precision. The “batched” transforms should also give a big performance boost in my use case. Looking forward to the release! If you’d like me to test your code, I’d love to do so.

  5. Nathanaël Schaeffer repo owner

    You can try the cuda-rtc branch in single precision. It works here, but I would be interested in the accuracy change (if any) and the performance change (there will be one). Note that batched transforms may be needed to get the performance boost. Also, you now have to pass SHT_FP32 to shtns_set_grid*() to get single precision instead of double precision; a usage sketch follows below.
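
    A minimal host-side sketch of this in use; shtns_create, shtns_set_grid, and shtns_destroy are the stock SHTns calls, the grid sizes are arbitrary examples, and only the placement of SHT_FP32 reflects what is described above (other details of the cuda-rtc branch may differ).

        #include <shtns.h>

        int main(void)
        {
            const int lmax = 255, mmax = 255, mres = 1;
            const int nlat = 384, nphi = 768;    /* arbitrary valid grid */

            shtns_cfg cfg = shtns_create(lmax, mmax, mres, sht_orthonormal);

            /* Per the comment above, OR SHT_FP32 into the grid flags to
             * request single precision; omit it to keep double precision. */
            shtns_set_grid(cfg, (enum shtns_type)(sht_gauss | SHT_FP32),
                           0.0, nlat, nphi);

            /* ... allocate spectral/spatial buffers and run the GPU
             * transforms (e.g. cu_SH_to_spat) here ... */

            shtns_destroy(cfg);
            return 0;
        }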
