Support for simultaneous transforms

Nathanaël Schaeffer repo owner

Multiple simultaneous transforms is currently not available in the stable or devel branch, but it exists in an experimental branch.
However, for performance reasons, the fastest memory axis should always be the latitude. As you mention, rearranging the data can take longer than the transform, but this would be roughly the same if SHTns needed to rearrange internally. So if you care about performance and SHTs is a performance bottleneck , you should definitely consider changing your data layout. Note that, although FFTW offers immense flexibility in the layout, not all layouts perform the same.

Currently, the experimental simultaneous transforms have the following features:

they are implemented only for full transforms (not the ones restricted to a single order m).
they have latitude (theta) as the fastest memory axis AND longitude (phi) as the slowest memory axis; the “other” axis is in the middle. This constraint is a hard constraint for GPU for performance reasons. I won’t implement other layouts, so you must rearrange the data if this does not suit you.
they work for both CPU and GPU, but they have been developed with GPU in mind.
the performance on CPU is no better than doing the transforms individually (where cache reuse is best on CPU).
performance on GPU is significantly improved (more exposed parallelism).

If you are interested, I can give you access to the experimental branch. If you need it with legendre-only transforms (single m), it is possible to implement, but I cannot do it before fall this year.

2022-06-30T07:29:37+00:00

Keaton Burns reporter

Thanks for the information. We would need the individual Legendre transforms, but there’s no rush. For FFTs though, it seems like transforms along slow memory axes, while slower than along the fastest axis, are still much faster than rearranging the data. Is it clear if/why that wouldn’t be the case with the Legendre transforms? It seems like having multiple datasets interleaved might still be quite efficient if SIMD was used at the lowest level. But you mentioned there might be cache reuse issues? I’m far from understanding the real implementation details that make SHTns so fast, but I’m just curious!

‌

2022-06-30T12:20:06+00:00

Nathanaël Schaeffer repo owner

Basically, for a spectral to spatial transform, SHTns will read many times (proportional to nlat) the array of spectral values. If this stays in cache, it will be very efficient. If it goes out of cache (because now you have several fields together and it can’t stay in cache), performance will drop significantly.
For spatial to spectral it is similar: SHTns will read many times (proportional to lmax) the array of spatial values.
If the fastest direction was another one, SHTns would need to gather each field separately (or maybe in small groups of 2 to 4). This would indeed be a little bit more efficient than doing it on the caller side.
It would probably be possible to do it reasonably well (like what you say for FFT), but it would require a lot of work I think.

All that said, If many users ask for such a feature (having the fast memory axis being something else that latitude), I can consider implementing it.

2022-06-30T15:02:07+00:00

Comments (3)