optimize empty futures in `make_future()` and `when_all(future<>, future<>...)`

Issue #478 resolved
Dan Bonachea created an issue

The current implementation of when_all(future<>, future<>...) always ends up allocating a new heap object, even when it's semantically unnecessary.

Example program: (WARNING: this test uses unspecified internals which are NOT supported)

#include <upcxx/upcxx.hpp>
#include <iostream>

using namespace upcxx;

int main() {
  upcxx::init();

  future<> f1 = upcxx::make_future();
  future<> f1a = f1;
  future<> f2 = upcxx::make_future();

  future<> w1 = when_all(f1);
  future<> w2 = when_all(f1, f2);

  // the following uses unsupported internals and is UNSAFE UPC++:
  std::cout<< "f1=  " << std::move(f1.impl_).steal_header() << std::endl;
  std::cout<< "f1a= " << std::move(f1a.impl_).steal_header() << std::endl;
  std::cout<< "f2=  " << std::move(f2.impl_).steal_header() << std::endl;

  std::cout<< "w1=  " << std::move(w1.impl_).steal_header() << std::endl;
  std::cout<< "w2=  " << std::move(w1.impl_).steal_header() << std::endl;

  upcxx::finalize();
  return 0;
}

Example output:

$ upcxx -g when-all.cpp && ./a.out
f1=  0x1bcef40
f1a= 0x1bcef40
f2=  0x1bcf7c0
w1=  0x1bcef40
w2=  0x82d860

Note how future f1 and copied future f1a report the same value, indicating they are aliased to the same underlying implicit promise cell which was allocated on the heap (as expected by the semantics of future copy). However f2 came from a different call to make_future, and has its own heap location.

As a more interesting example, a call like: when_all(f) can always trivially return a copy of f. This particular optimization appears above, where w1 returns a copy of f1 (aliasing the same underlying promise). However analogous optimizations are NOT currently attempted with more than one argument to when_all.

Note that in the example above, there's no fundamental reason that all the futures created here cannot all alias the same underlying ready empty implicit promise. Once pull request #345 is merged, we'll even have a ready empty future available in the runtime so that none of the future constructions above need to dynamically allocate anything.

Proposed optimizations:

  1. A specialization make_future<>(void) should always return the canonical ready empty future, avoiding dynamic allocation.
  2. For the (statically detectable) case of when_all(future<>, future<>…) (all the arguments are empty futures), we should dynamically check readiness of the arguments and skip the allocation when possible. Whenever at most 1 of the arguments is non-ready, we should return an alias of the non-ready future (or the canonical ready empty future when they are all ready).
  3. Extra credit: For when_all with exactly one non-empty future and one or more ready empty futures, we can always return an alias of the non-empty future (regardless of readiness).

I believe these optimizations could help with our recommended future chaining idiom (excerpt from the Programming Guide):

  upcxx::future<> fut_all = upcxx::make_future();
  for (long i = 0; i < N; i++) {
    ...
    upcxx::future<> fut = ...
    // conjoin the futures
    fut_all = upcxx::when_all(fut_all, fut);
    ...
  }

Here we currently always perform dynamic allocation on the make_future and when_all calls, but the first allocation is always unnecessary, and the second is unnecessary any time fut is synchronously ready. The latter condition is far more likely to be true once we deploy as_eager_future as the default completion (WIP in PR 345), and it will be synchronously complete in exactly those cases involving no network traffic, which is exactly where the relative overheads of dynamic allocation for future management have the largest potential impact on performance.

Comments (9)

  1. Amir Kamil

    My preliminary assessment of your proposals:

    1. Should be easily doable. It does introduce an additional dependence on the backend, which would require moving the ready_empty_future logic from backend.hpp to backend_fwd.hpp, plus an alternate implementation when UPCXX_BACKEND is undefined.
    2. This is a lot hairier than it seems. The when_all() implementation encodes the underlying futures in the spooky return type. We can’t change this type dynamically, so the logic for optimizing this cannot go in the existing when_all[_fast]() itself. The other place the optimization could conceivably go is when converting the spooky type back to a normal reference-counted future. But the problem there is that the latter’s constructor invokes steal_header() on the r-value spooky future, so that in turn needs to create a header. We can’t steal it from one of the underlying futures, as that effectively invalidates the underlying future. I don’t see a good solution to these issues.
    3. At least as hard as 2).

    Even if we were to find a path forward, my guess is that future conjoining will still perform significantly worse than promises in the case of synchronous completion. I think we should reconsider whether future conjoining is an idiom we should be recommending for cases that are more naturally expressed with promises.

  2. Dan Bonachea reporter

    @Amir Kamil said:

    The when_all() implementation encodes the underlying futures in the spooky return type. We can’t change this type dynamically, so the logic for optimizing this cannot go in the existing when_all_fast itself.

    This seems to me like a good reason to finish the fix to issue #288 and ditch spooky future returns from when_all entirely; i.e. this is already a defect we should fix, because when_all currently breaks the specified type semantics in a user-visible way. If that makes it possible/easier to deploy the runtime optimizations described in this issue, all the better.

    My basic view here is the static optimizations attempted by spooky futures are a failed optimization attempt; they only ever provided a potential benefit for futures with inferred types, and yet they violated the specified types in ways that broke the use of type interference in applications. Also they were only ever capable of optimizing statically ready futures, which I'd argue are far less important in practice than futures that are synchronously readied based on dynamic information (locality of access). I believe the optimizations described in this issue can fully subsume the potential benefits offered by the known-to-be-broken spooky future goop (for free in make_future<> and at the cost of at most one branch per argument in when_all), with the substantial correctness upside of actually adhering to our specified types.

  3. Amir Kamil

    My revised assessment after looking into this is mostly the opposite of my preliminary assessment.

    1. This is not as easy as I thought. There is a chain of dependencies that I have not been able to resolve. get_ready_empty_future cannot move into backend_fwd.hpp because it uses personas (in par mode), and I get errors about persona being incomplete in backend_fwd.hpp. I tried forward declarations of get_ready_empty_future in future/core.hpp. But depending on how I write the declarations, I get either an error that a specialization is defined after the template has been used or a linker error that the specialization is missing from the symbol table. I tried a non-template wrapper around get_ready_empty_future but ran into runtime errors due to interactions with 2) below. Furthermore, get_ready_empty_future calls make_future, so we need to use something like make_fast_future to avoid that circular dependency, and I’m not sure what all the implications of that are.
      There might be a way around this, but I’m not inclined to pursue it further. I don’t think make_future() is critical to performance of the future-conjoining idiom.
    2. My comment about not being able to steal the header from an underlying future is incorrect – the underlying futures are stored by value, so their headers can indeed be stolen. I was able to implement this optimization in two different ways, one that returns the canonical ready empty future and another that steals an underlying header. The latter is slightly more efficient on both GCC and Clang on my machine, so that’s what I’m going with. It also allows the optimization to be applied to 3).
      For a when_all() over solely empty futures, my current implementation only considers stealing the header from the index-0 underlying future. It’ll take more work to get it to consider stealing from a different future, but I think it’s doable with an appropriate hand-rolled fold algorithm (no C++17 at our disposal, unfortunately).
    3. I have an implementation that does this.

    To my surprise, with my current implementation of 2) and 3), performance is not half bad. It’s not as good as promises, but future-conjoining gups with smp/seq/eager gets ~46% of the performance of promises on GCC and 65% on Clang. So I withdraw my comment about removing the conjoining idiom from our materials. (The gap between future- and promise-based implementations may be significantly larger when the eager optimization doesn’t apply, but I don’t have numbers to support that.)

  4. Amir Kamil

    This seems to me like a good reason to finish the fix to issue #288 and ditch spooky future returns from when_all entirely; i.e. this is already a defect we should fix, because when_all currently breaks the specified type semantics in a user-visible way. If that makes it possible/easier to deploy the runtime optimizations described in this issue, all the better.

    I don’t have an opinion one way or another on ditching spooky-future return from when_all(). It turns out to be orthogonal to this optimization, and most of the when_all() implementation details probably are necessary to deal with the type system. The main question is when the spooky future turns into a normal one, and I’m ambivalent about whether that should in the return from when_all() or when the user converts that to a normal future.

  5. Dan Bonachea reporter

    @Amir Kamil wrote:

    This is not as easy as I thought. There is a chain of dependencies that I have not been able to resolve. [...]. Furthermore, get_ready_empty_future calls make_future, so we need to use something like make_fast_future to avoid that circular dependency, and I’m not sure what all the implications of that are.

    I don't have a specific suggestion without looking deeper, but can we perhaps leverage the fact that when_all<>(void) is equivalent to make_future<>(void) in order to help break the cycle?

    There might be a way around this, but I’m not inclined to pursue it further. I don’t think make_future() is critical to performance of the future-conjoining idiom.

    Not sure I agree with this assessment. The base-case make_future will admittedly be lost in the noise for a significantly long conjoining chain. However once these in-flight optimizations are in place and assuming all the conjoins use synchronous/eager empty futures (eg in a single-node run), I could easily see a 4/6-halo exchange being too short to amortize the overhead of that first allocator call. For example, here is abbreviated code from the 3-d halo exchange in our jac3d example in upcxx-extras:

            upcxx::future<> fut = upcxx::make_future();
            if(r[X] != 0){
                fut = upcxx::when_all(fut, upcxx::copy(...));
            }
    
            if(r[X] != p[X]-1){
                fut = upcxx::when_all(fut, upcxx::copy(...));
            }
    
            if(r[Y]!=0){
                fut = upcxx::when_all(fut, upcxx::copy(...));
            }
    
            if(r[Y]!=p[Y]-1){
                fut = upcxx::when_all(fut, upcxx::copy(...));
            }
    
            if(r[Z]!=0){
                fut = upcxx::when_all(fut, upcxx::copy(...));
            }
    
            if(r[Z]!=p[Z]-1){
                fut = upcxx::when_all(fut, upcxx::copy(...));
            }
    
            // Wait for all my upcxx::copy to complete
            fut.wait();
    

    This particular example is not directly relevant because the copy operations actually involve the GPU which will not complete synchronously, but if they were all host memory these copies could all return ready empty futures (after PR 345).

    Setting aside the conjoining case, there are numerous instances in our examples of using make_future<>() to initialize a future to a "safe" value, to workaround the fact that our default future constructor constructs a useless/dangerous object (spec issue 104).

    I'd really like to find a way to implement proposed optimization 1.

  6. Amir Kamil

    I tried a non-template wrapper around get_ready_empty_future but ran into runtime errors due to interactions with 2) below. Furthermore, get_ready_empty_future calls make_future, so we need to use something like make_fast_future to avoid that circular dependency…

    Turns out I had a brain fart and forgot to change the seq-mode implementation of get_ready_empty_future to use make_fast_future. I did fix it for par mode but was testing in seq mode.

    The optimization is now implemented in PR 345. Note that the implementation of make_future<>() now requires UPC++ to be initialized. This wasn’t required before by the implementation, though it was by the spec. One non-conforming test needed to be updated.

  7. Amir Kamil

    So optimization 1) already partially exists prior to PR 345. Here’s the relevant code in impl_result.hpp:

        template<>
        struct future_impl_result<> {
          bool ready() const { return true; }
    
          std::tuple<> result_refs_or_vals() const { return std::tuple<>(); }
    
          typedef future_header_ops_result_ready header_ops;
    
          future_header* steal_header() && {
            return &future_header_result<>::the_always;
          }
        };
    

    There already is a canonical ready empty header for empty futures. make_fast_future<>() returns a spooky future that contains a future_impl_result<>, and when that spooky future is converted to a normal future, future_impl_result<>::steal_header() avoids an allocation.

    However, prior to PR 345, make_future<>() explicitly allocates a future_header_result<> rather than using future_header_result<>::the_always. So it does not do the optimization.

    Thread-safety does not appear to be an issue with future_header_result<>::the_always – it has a ref count of -1, and operations on ref counts do nothing when the ref count is negative. Since a future that uses this header is ready, doing a then() on it synchronously executes the callback on the calling persona.

    Given the above, we no longer need persona-specific ready empty futures in SEQ mode. We can specialize make_future<>() as follows:

      template<>
      future<> make_future() {
        return future<>(
          detail::future_impl_shref<detail::future_header_ops_general, /*unique=*/false>(
            &detail::future_header_result<>::the_always
          ),
          detail::internal_only{}
        );
      }
    

  8. Amir Kamil

    Given the above, we no longer need persona-specific ready empty futures in SEQ mode.

    For that matter, we likely can avoid keeping around a ready empty future at all in either SEQ or PAR mode. We should be able to just call make_future<>() once it is specialized as above.

  9. Log in to comment