Vector-based user-defined reduction operator

Issue #130 new
Paul Hargrove created an issue

Currently the reduce API is based on an operator taking two instances of a type and returning the same type. There is nothing fundamentally wrong with that interface, which is going to be very natural for any user to implement. However, the purpose of this issue is to float the idea of supporting a second signature for a reduction function.

At this point in time, I am not advocating that we implement this second signature (which I will describe and motivate below). The main concern at this moment is to get the C++ experts to examine the current planned specification for reduce and determine if we can introduce a second signature in the future without breaking backwards compatibility.

This should be decided prior to the Sep 2018 release, in case we decide some changes to the current interfaces might be required to make this enhancement in the future.

The proposed signature is void BinOp(T *in1, T *in2_out, size_t count)

The semantic being: for (size_t i = 0; i < count; ++i) in2_out[i] = OP(in1[i], in2_out[i]);
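As a concrete sketch (my own illustration; the name vector_sum is not part of any spec), a user-defined sum operator under the proposed signature would look like:

```cpp
#include <cstddef>

// Hypothetical user-defined operator with the proposed vector
// signature: element-wise sum of in1 into in2_out.
template<typename T>
void vector_sum(T *in1, T *in2_out, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i)
    in2_out[i] = in1[i] + in2_out[i];
}
```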

One motivation is that this operator signature is closer to that of GASNet-EX. That alone can result in simpler (faster) code in the UPC++ runtime when implementing the C callback that GASNet-EX will invoke to perform the reduction. Moreover, the design choice in GASNet-EX was made with performance motivations of its own, which should also be applicable to UPC++.

There are multiple ways in which this "vector" signature reduces the number of times a function call overhead is incurred. For the proposed multi-field reduction (issue #95), the benefit should be obvious.

What may not be obvious is that GASNet-EX can (though admittedly it does not yet) take advantage of this signature to reduce the number of function calls even for a single-field reduction such as exists in the UPC++ specification today. For an implementation using a reduction tree with degree N, one currently requires N function calls to reduce the local contribution plus those received from N children. However, with the vector signature this can be reduced to ceil(log_2(N)) function calls by performing the equivalent of Pairwise Summation (though motivated here by call overhead rather than by the round-off reduction described on that page).
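A minimal sketch of that folding scheme (my own illustration, not GASNet-EX code): with the contributions stored contiguously, each round folds the upper half of the buffers onto the lower half with a single vector-op call, so only ceil(log_2(n)) calls are made in total.

```cpp
#include <cstddef>

// Sketch: reduce n contiguous contributions of `count` elements each,
// stored in data[0 .. n*count). VecOp has the proposed signature
// op(in1, in2_out, len), accumulating into in2_out. The result lands
// in data[0 .. count) after ceil(log2(n)) calls to op.
template<typename T, typename VecOp>
void pairwise_reduce(T *data, std::size_t n, std::size_t count, VecOp op) {
  while (n > 1) {
    std::size_t half = n / 2;  // number of buffers folded this round
    // fold the top `half` buffers onto the bottom `half` in ONE call
    op(data + (n - half) * count, data, half * count);
    n -= half;
  }
}
```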

In addition to reducing the number of times the reduction operator must be called, the proposed form allows the compiler to vectorize the loop, which may have a significant performance impact for the case of even moderately long vectors.

Comments (16)

  1. john bachan

    This second signature is probably not necessary as I can generate a conforming function as a loop over the first signature while permitting the compiler to inline the inner per-element function call. I say "probably" because I would be generating the obvious loop, so we would miss the opportunity of calling out to BLAS etc.

    template<typename T, typename ScalarBinOp>
    void vector_bin_op_for_gex(void *in1, void *in2_out, size_t n, void *arg) {
      T *src = static_cast<T*>(in1);
      T *dst = static_cast<T*>(in2_out);
      ScalarBinOp *op = static_cast<ScalarBinOp*>(arg);
      #pragma simd // or whatever
      for(size_t i=0; i != n; i++)
        dst[i] = (*op)(src[i], dst[i]); // inlineable call to scalar op
    }
    
    // John calls "gex" with &vector_bin_op_for_gex<T,ScalarBinOp>
    
  2. john bachan

    And to answer Paul's very direct question: I think it will be easy to add more signatures without breaking backward compatibility. There are template tricks (SFINAE) which allow us to ask the compiler whether a given expression is "typeable" or not and select code blocks accordingly. So I can take the lambda the user gives me, plug it into a number of different candidate call sites to find the one(s) that typecheck, and then decide what to do.
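    As an illustration of the detection John describes (a sketch with made-up names, not UPC++ internals), one can ask whether a callable type-checks against the three-argument vector form and select an overload accordingly:

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Does Op(T*, T*, size_t) type-check? Primary template says no;
// the partial specialization matches only when the call expression
// is well-formed (SFINAE on the decltype).
template<typename Op, typename T, typename = void>
struct is_vector_op : std::false_type {};

template<typename Op, typename T>
struct is_vector_op<Op, T, decltype(
    std::declval<Op>()(std::declval<T*>(), std::declval<T*>(),
                       std::declval<std::size_t>()),
    void())>
  : std::true_type {};

// Dispatch: prefer the vector form when the callable supports it.
template<typename T, typename Op>
typename std::enable_if<is_vector_op<Op, T>::value>::type
reduce_buffers(T *in1, T *in2_out, std::size_t n, Op op) {
  op(in1, in2_out, n);                 // one vector call
}

template<typename T, typename Op>
typename std::enable_if<!is_vector_op<Op, T>::value>::type
reduce_buffers(T *in1, T *in2_out, std::size_t n, Op op) {
  for (std::size_t i = 0; i < n; ++i)  // loop over the scalar form
    in2_out[i] = op(in1[i], in2_out[i]);
}
```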

  3. Paul Hargrove reporter

    @jdbachan thanks for the direct answer to the direct question.
    Based on that, I think we can potentially remove the corresponding item from Wednesday's agenda.

    At some point in the future we can debate the value (if any) of supporting more than one signature for the reduction operator. Your initial response suggests that at least some of the benefit I was expecting may not actually materialize.

  4. john bachan

    Along those lines, we have yet to consider how we are going to advertise the math properties of the reduction operator to the runtime: associativity, idempotence, and maybe commutativity?

    Floating-point summation that (inaccurately) assumes associativity is the common "fast" implementation; floating point with associativity disabled is what provides the determinism in Jim's work.

    Idempotence (min, max, bitwise and, bitwise or) allows for faster communication graphs, like dissemination.

    The obvious thing to do (and what C would do) is to add a flags argument to allreduce. But since this is a property of the reduction operator, it seems natural to make it a derived quantity from the function object. Importantly, this would allow us to declare the math properties of builtin operators (upcxx::plus on float vs int, upcxx::bit_or, etc) to ensure the syntactically concise case gets the "best" implementation (have to decide if performance or determinism constitutes "better"). There are A LOT of ways to accomplish this in C++, ideally the one we choose is syntactically nice for lambdas. Some that come to mind:

    • Type traits (upcxx::is_associative<Foo>::value). Bad for dynamically known math properties; for instance, we can't ask a function pointer about its math model. Bad for lambdas since it's "tricky" to get the type name of a lambda and plug it in as a specialization of the trait.

    • bool getter (foo.is_associative()). Allows dynamism, but the user can't add members to lambdas, function pointers, or any type out of their control. We could provide wrapper types though:

    // Half-baked proposal
    template<typename Foo>
    struct associative_wrapper {
      Foo foo;
      static constexpr bool is_associative() { return true; }
      template<typename T>
      T operator()(T a, T b) { return foo(a,b); }
    };
    
    template<typename Foo>
    associative_wrapper<Foo> make_associative(Foo foo) {
      return {foo};
    }
    
    upcxx::allreduce(..., make_associative([=](int a, int b) { return a + b + 1; }));
    
  5. john bachan

    The proposal above was indeed half-baked. The math query on operations needs to include the type over which it's being applied (+ has different associativity for int vs float).

    struct upcxx::plus {
      template<typename T>
      bool is_associative() const {
        return !std::is_same<T,float>::value; // need double etc.
      }
      template<typename T>
      T operator()(T a, T b) const { return a + b; }
    };
    

    The wrapper would probably remain unchanged, which would take a callable and make it report as "some property" for all types. If the user wanted the math to be type dependent, they can write their own callable class just like our upcxx::plus.

  6. john bachan

    And another thing... reductions of non-trivial types (like intersection of std::set<T>) can benefit from multiple lambda signatures. The signature we advertised of T operator()(T a, T b) is really bad for container types (lots of copying). Better would be void operator()(T &accum, T const &b). And even better is void operator()(T &accum, T &&b) where b's resources can be "stolen" into the container (like splicing in linked list nodes). I have a working implementation which can detect all three and dispatch to the best available one. We should spec it.
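    To make the three shapes concrete, here they are written out for a std::set union (hypothetical free functions of my own; only the signatures matter, and the thread's intersection example would work the same way):

```cpp
#include <iterator>
#include <set>
#include <utility>

// Shape 1, T(T, T): copies both operands.
template<typename T>
std::set<T> union_by_value(std::set<T> a, std::set<T> b) {
  a.insert(b.begin(), b.end());
  return a;
}

// Shape 2, void(T&, T const&): copies only b's elements.
template<typename T>
void union_accum(std::set<T> &accum, std::set<T> const &b) {
  accum.insert(b.begin(), b.end());
}

// Shape 3, void(T&, T&&): may move b's elements; with C++17,
// accum.merge(b) could splice the nodes outright, copying nothing.
template<typename T>
void union_steal(std::set<T> &accum, std::set<T> &&b) {
  accum.insert(std::make_move_iterator(b.begin()),
               std::make_move_iterator(b.end()));
}
```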

  7. BrianS

    reduce of std::set would make AMR mesh generation way easier. MPI has thwarted me for years by requiring all reduced types have the same fixed size.

  8. Amir Kamil

    Here is a concrete proposal for allowing vector operations.

    • Replace BinaryOp in the signature of the reduction operations with BinaryOrVectorOp.
    • Modify the preconditions to state: “BinaryOrVectorOp must be a function-object type representing an associative and commutative mathematical operation. It must either be a BinaryOp that is invocable on two arguments of type T and returns a value implicitly convertible to T, or a VectorOp that is invocable on three arguments of type T*, const T*, and std::size_t, respectively, or both a BinaryOp and a VectorOp.”
    • State that if the BinaryOrVectorOp is both a BinaryOp and a VectorOp, then it will be invoked on T*, const T*, std::size_t arguments.

    Until we decide to officially support nontrivial reductions, I don’t see a need for allowing void operator()(T &accum, T const &b) or void operator()(T &accum, T &&b).

    We already internally convert a BinaryOp into a VectorOp. Allowing the user to provide a VectorOp would make it more efficient when the compiler does not inline the BinaryOp into the VectorOp we construct internally (e.g. if the BinaryOp is a function pointer). However, I’m not sure we have a compelling need for this at the moment.

    Overall, I’m more or less ambivalent about actually implementing this right now. I’d be fine with deferring this indefinitely. I’m also fine with spec’ing and implementing it if others think we should do so.

  9. Paul Hargrove reporter

    Thanks, @Amir Kamil , for looking at this.

    I think the implementation detail you mention, that a BinaryOp is already converted to a VectorOp internally, means the implementation work is both tractable and carries a relatively low risk of hurting performance or stability.

    However, lacking a compelling use case, I think we will need to discuss relative priority of this work before you or another team member invests any significant effort into specification and/or implementation.

  10. Dan Bonachea

    Amir said:

    Allowing the user to provide a VectorOp would make it more efficient when the compiler does not inline the BinaryOp into the VectorOp we construct internally (e.g. if the BinaryOp is a function pointer). However, I’m not sure we have a compelling need for this at the moment.

    Re-stating the points I made in our recent meeting to put this into context:

    1. This proposed extension is entirely motivated by performance of reductions using user-defined reduction operators
    2. We currently only specify reductions on trivial types, which tends to make user-defined operators far less "interesting".
    3. User-provided operators (Binary or Vector) are NEVER offloadable to network hardware. This means that by selecting a custom reduction operator outside our built-in op_fast_* set, the programmer has potentially already surrendered significant performance, even before we talk about the overheads of dispatching to their operator.
    4. We have yet to encounter a stakeholder request or compelling use case for "application specific" user-defined operators on trivial types.

    So while we agree we could deploy this "optimization for the slow case", I agree with Paul and Amir that it doesn't seem like a priority. Apps doing performance-sensitive reductions should be strongly encouraged to use built-in operators, making this all irrelevant. We seem to (again) have consensus not to do this right now, so I'm (again) returning this to "Deferred indefinitely".

    In the interests of full disclosure, I'll mention in passing that MPI provides a couple of additional built-in reduction operators we don't currently support (MINLOC and MAXLOC). I believe these DO represent a real use case (for some applications), but IF we decided these were important for UPC++ app performance then I think the right approach would be to extend our set of built-in operators to include these (rather than encouraging user-defined operators for this purpose). The only reason we have not done this already is because these operators introduce additional typing issues that slightly increase specification complexity.
