Hierarchical Reduce functions needed for CPU

Issue #23 new
Sam Preston created an issue

Right now the implementation of the Reduce functions on the CPU is just a sequential implementation. A hierarchical version similar to the GPU implementation should replace this, as it is much more accurate (less rounding error for large summations)

Comments (6)

  1. Jacob Hinkle

    From the Wikipedia page it looks like the heirarchical sum has only slightly worse error and takes the same amount of flops as a sequential sum. The sums should probably be a full binary tree on the GPU as well as CPU if we go the heirarchical route. Either way gives very close to constant error, so they're both enormous improvements over sequential.

  2. Sam Preston reporter

    Yeah, kahan sums don't really work on the GPU (or parallel CPU), and from an implementation perspective it's much easier to have a limited depth to the tree on the GPU. My preference would just be to closely match whatever GPU implementation we use. Currently it's our own (based closely on the old cudpp reduce functions), but we could definitely replace that with Thrust's implementation at some point if we feel that it is better.

  3. Sam Preston reporter

    Actually a hybrid hierarchical / kahan version might be the way to go -- each thread does a sequential kahan sum and then the thread results are combined hierarchically

  4. Log in to comment