Hierarchical Reduce functions needed for CPU
Right now the implementation of the Reduce functions on the CPU is just a sequential implementation. A hierarchical version similar to the GPU implementation should replace this, as it is much more accurate (less rounding error for large summations)
Comments (6)
-
-
For reference, in case we wanna nerd out about it, here is a similar discussion from the Julia folks: implement a better summation algorithm
-
From the Wikipedia page it looks like the heirarchical sum has only slightly worse error and takes the same amount of flops as a sequential sum. The sums should probably be a full binary tree on the GPU as well as CPU if we go the heirarchical route. Either way gives very close to constant error, so they're both enormous improvements over sequential.
-
reporter Yeah, kahan sums don't really work on the GPU (or parallel CPU), and from an implementation perspective it's much easier to have a limited depth to the tree on the GPU. My preference would just be to closely match whatever GPU implementation we use. Currently it's our own (based closely on the old cudpp reduce functions), but we could definitely replace that with Thrust's implementation at some point if we feel that it is better.
-
reporter Actually a hybrid hierarchical / kahan version might be the way to go -- each thread does a sequential kahan sum and then the thread results are combined hierarchically
-
Yeah heirarchical kahan will get us lowest error and can still match exactly cpu-gpu
- Log in to comment
Could also do a Kahan sum on the CPU. This is the default sum in Julia for instance.