team::allreduce BinaryOp semantics and h/w acceleration

Issue #70 resolved
Dan Bonachea created an issue

The specification of the BinaryOp op argument to team::allreduce reads:

BinaryOp must be a function object representing an associative and commutative mathematical operation taking two values of type T and returning a value implicitly convertible to T. BinaryOp must be referentially transparent and concurrently invocable. BinaryOp may not invoke any UPC++ routine with a progress level other than none.

Nothing in here requires all ranks to pass a BinaryOp corresponding to the same mathematical operation. For example:

int result = wait(
  world().allreduce<int>((int)rank_me(), 
                        [](int a, int b) { return a + b + (int)rank_me(); }));

I believe this lambda op follows all the rules (modulo possible syntax errors). Specifically, the BinaryOp provided by each rank is associative and commutative over its arguments, concurrently invocable, and does not invoke any UPC++ routine with a progress level other than none. Referential transparency is not formally defined, but I believe the BinaryOp provided by each rank satisfies the usual definition of referential transparency within that rank.

However, this code will definitely generate an indeterminate result, because the output depends on the exact rank where each BinaryOp is invoked by the parallel reduction algorithm. We probably need to clarify this in the specification somehow.

On a related note, there is a potentially significant performance benefit to exposing some constants of type BinaryOp that represent the standard arithmetic operations often implemented in network hardware. Otherwise, if the implementation is forced to always treat op as opaque, it may be impossible to take advantage of such acceleration.

Comments (12)

  1. Amir Kamil

    We can specialize for the function objects defined in the STL <functional> header (e.g. std::plus<T>, std::multiplies<T>) for whatever types GASNet supports. We would need to define our own function objects for operations not included in <functional> (whichever of max, min, logical xor, maxloc, minloc GASNet supports). And we should document the cases we specialize so that users know to use them.
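    A minimal sketch of this kind of compile-time dispatch in plain C++; the hw_op enum and op_id trait are illustrative stand-ins, not actual UPC++/GASNet names:

```cpp
#include <functional>

// Hypothetical op codes standing in for the GEX reduction-op enum.
enum class hw_op { user_fn, add, mult };

// Primary template: an unrecognized functor is treated as opaque.
template<typename Fn>
struct op_id { static constexpr hw_op value = hw_op::user_fn; };

// Specializations for <functional> objects eligible for acceleration.
template<typename T>
struct op_id<std::plus<T>> { static constexpr hw_op value = hw_op::add; };

template<typename T>
struct op_id<std::multiplies<T>> { static constexpr hw_op value = hw_op::mult; };

// A real allreduce would branch to the offloaded path when the op
// is recognized, and to the software path otherwise.
template<typename Fn>
constexpr hw_op classify(Fn const&) { return op_id<Fn>::value; }
```

    Note that a user-supplied lambda falls through to the primary template, so the software path is always the safe default.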

  2. BrianS

    We can template-specialize on the binary ops defined in std for which we have hardware acceleration, and document which ones will typically be supported. I'll need to try coding this up before I'm comfortable updating the spec.

  3. Dan Bonachea reporter

    This issue was triaged at the 2018-06-13 Pagoda meeting and assigned a new milestone/priority.

    We agreed all the details about BinaryOp mentioned in this issue need to be clarified for the upcoming release that delivers this feature.

  4. Dan Bonachea reporter

    Notes copied from @jdbachan in PR #36:

    Sadly, std::min<T> cannot be passed directly since several overloads exist; the user would need to add a cast like (T(*)(T const&, T const&))std::min<T>. As if that weren't bad enough, passing function pointers as function objects erases the static knowledge of which function is being called: the type is just "function pointer" rather than a class with an overloaded operator(). So if we were to accelerate std::min, we would need to compare the function pointer given at runtime to determine that it translates to the min hardware operation.

    I think it's clear we need to have upcxx::add, upcxx::min, upcxx::max, etc. as global function objects of distinct types, so that we can dispatch to the correct GASNet enum at compile time.
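    A sketch of the distinct-type constants idea in plain C++; add_t, min_t, op_add, and op_min are hypothetical names, not actual upcxx:: identifiers:

```cpp
#include <algorithm>

// Hypothetical distinct-type function objects in the spirit of the
// proposed upcxx::add / upcxx::min constants.
struct add_t {
  template<typename T> T operator()(T a, T b) const { return a + b; }
};
struct min_t {
  template<typename T> T operator()(T a, T b) const { return std::min(a, b); }
};
constexpr add_t op_add{};
constexpr min_t op_min{};

// Illustrative stand-in for the GASNet reduction-op enum.
enum class hw_op { user_fn, add, min_op };

// Each constant has a unique class type, so dispatch to the hardware
// op is a compile-time trait lookup -- no runtime comparison of
// function pointers is needed, unlike the cast-of-std::min approach.
template<typename Fn> struct op_id { static constexpr hw_op value = hw_op::user_fn; };
template<> struct op_id<add_t> { static constexpr hw_op value = hw_op::add; };
template<> struct op_id<min_t> { static constexpr hw_op value = hw_op::min_op; };
```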

  5. Dan Bonachea reporter

    @PHHargrove and I recently discussed this issue in depth and want to propose a better solution for accommodating offloadable reduction operators. The basic idea is to specify two separate overloads of upcxx::allreduce:

    1. A new overload where the operator argument is specified by an enum value from a fixed list of builtins in the upcxx:: namespace, and which IS eligible for offload (subject to the limitations of hardware capabilities, of course). Example: upcxx::allreduce(myvalue, MIN)

    2. The current overload, which takes an arbitrary lambda or function pointer as the reduction operator. This variant is NEVER offloadable. Example: upcxx::allreduce(myvalue, [](int64_t a, int64_t b) { return std::min(a, b); })

    This approach has the advantage of providing a clear distinction that is easy to explain and understand. There is a potentially fundamental difference in the behavior (and performance) of the implementation here that should be emphasized, and this proposal makes that difference explicit in the type system. Users should be strongly encouraged to use the first variant (i.e., one of the provided built-in operators), and only to roll their own reduction operator when none of the built-in operators is appropriate.
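    In plain C++ the two overloads could coexist roughly as follows (names and enum values are illustrative, not the proposed upcxx API); overload resolution prefers the more specialized enum overload, so builtin constants take the offload-eligible path while arbitrary callables fall through to the software path:

```cpp
#include <cstring>

// Illustrative fixed list of builtin reduction operators.
enum builtin_op { OP_ADD, OP_MIN, OP_MAX };

// Overload 1: builtin enum op -- eligible for hardware offload.
// (Returns a label here just to make the chosen path observable.)
template<typename T>
const char* allreduce_path(T /*value*/, builtin_op /*op*/) {
  return "offload-eligible";
}

// Overload 2: arbitrary callable -- never offloaded.
template<typename T, typename BinaryOp>
const char* allreduce_path(T /*value*/, BinaryOp /*op*/) {
  return "software-only";
}
```

    Partial ordering of function templates makes the enum overload more specialized, so a call like allreduce_path(x, OP_MIN) is unambiguous.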

    The implementation of this approach should also be straightforward and efficient. Thanks to gasnet_fwd.h, the built-in enum values exposed by upcxx.hpp can be the same enum values used by GEX, so that absolutely no translation or magical function detection is required for the GEX-supported data types - the call just passes the enum value straight through to GASNet (which then activates the offload hardware where available).

    As a side note, we also discussed the set of built-in operators that should be exposed, and observed that MINLOC and MAXLOC should eventually be exposed by both GEX and UPC++ (for the same reasons that MPI reductions provide them). At the UPC++ level this should probably use data type T=std::pair<int, U>, which probably justifies a template specialization of upcxx::reduce that would be required to use that particular operator.

  6. john bachan

    I am not in favor of the above proposal for split enum/lambda reductions. The main disadvantage of the approach is that it does not work well with generic code. If I want to write a function that initiates reductions but is parametric with respect to the reduction operator, I would want that function to accept the operator as a lambda; if the user passes upcxx::add, then the collective in my algorithm will just go fast. While the split enum calls won't actually break this capability (since the lambda type would get deduced to the enum's integral type), it would mean that what I think is a lambda isn't actually callable (because it's an int). Example: if upcxx were to expose an exclusive prefix scan reduction to our users, and a user then wants to build a generic inclusive prefix scan parametric on a lambda, they have no nice way of calling the lambda outside the collective without us adding more machinery to let users invoke things that might be reduction enums or proper callables.

    I also don't see the advantage of making the difference in potential performance "explicit in the type system". How is that easier to explain than "use the upcxx-provided constants to ensure the highest probability of offload", where the constants are callable objects like upcxx::add etc.?

    The proposal does not have any advantage over the constant callable approach in terms of runtime efficiency.
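    A sketch of the callable-constant alternative in plain C++: the constant is an ordinary function object that generic user code can invoke directly, while a trait specialization still recognizes it for offload at compile time (all names here are hypothetical, not actual upcxx identifiers):

```cpp
#include <vector>

// Hypothetical callable constant in the spirit of upcxx::add.
struct add_t {
  template<typename T> T operator()(T a, T b) const { return a + b; }
};
constexpr add_t op_add{};

// Illustrative offload classification, as in the earlier sketches.
enum class hw_op { user_fn, add };
template<typename Fn> struct op_id { static constexpr hw_op value = hw_op::user_fn; };
template<> struct op_id<add_t> { static constexpr hw_op value = hw_op::add; };

// Generic user code parametric over the operator: it can invoke `op`
// locally (e.g. to turn an exclusive scan into an inclusive one),
// and a collective runtime can still recognize add_t for offload.
template<typename T, typename Fn>
T fold_local(std::vector<T> const& xs, T init, Fn op) {
  T acc = init;
  for (T const& x : xs) acc = op(acc, x);
  return acc;
}
```

    With the enum-based proposal, op_add would be a bare integer and the fold_local call above would not compile.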
