optimize empty futures in `make_future()` and `when_all(future<>, future<>...)`
The current implementation of when_all(future<>, future<>...)
always ends up allocating a new heap object, even when it's semantically unnecessary.
Example program: (WARNING: this test uses unspecified internals which are NOT supported)
#include <upcxx/upcxx.hpp>
#include <iostream>
using namespace upcxx;
int main() {
upcxx::init();
future<> f1 = upcxx::make_future();
future<> f1a = f1;
future<> f2 = upcxx::make_future();
future<> w1 = when_all(f1);
future<> w2 = when_all(f1, f2);
// the following uses unsupported internals and is UNSAFE UPC++:
std::cout<< "f1= " << std::move(f1.impl_).steal_header() << std::endl;
std::cout<< "f1a= " << std::move(f1a.impl_).steal_header() << std::endl;
std::cout<< "f2= " << std::move(f2.impl_).steal_header() << std::endl;
std::cout<< "w1= " << std::move(w1.impl_).steal_header() << std::endl;
std::cout<< "w2= " << std::move(w1.impl_).steal_header() << std::endl;
upcxx::finalize();
return 0;
}
Example output:
$ upcxx -g when-all.cpp && ./a.out
f1= 0x1bcef40
f1a= 0x1bcef40
f2= 0x1bcf7c0
w1= 0x1bcef40
w2= 0x82d860
Note how future f1
and copied future f1a
report the same value, indicating they are aliased to the same underlying implicit promise cell which was allocated on the heap (as expected by the semantics of future copy). However f2
came from a different call to make_future
, and has its own heap location.
As a more interesting example, a call like: when_all(f)
can always trivially return a copy of f
. This particular optimization appears above, where w1
returns a copy of f1
(aliasing the same underlying promise). However analogous optimizations are NOT currently attempted with more than one argument to when_all.
Note that in the example above, there's no fundamental reason that all the futures created here cannot all alias the same underlying ready empty implicit promise. Once pull request #345 is merged, we'll even have a ready empty future available in the runtime so that none of the future constructions above need to dynamically allocate anything.
Proposed optimizations:
- A specialization
make_future<>(void)
should always return the canonical ready empty future, avoiding dynamic allocation. - For the (statically detectable) case of
when_all(future<>, future<>…)
(all the arguments are empty futures), we should dynamically check readiness of the arguments and skip the allocation when possible. Whenever at most 1 of the arguments is non-ready, we should return an alias of the non-ready future (or the canonical ready empty future when they are all ready). - Extra credit: For
when_all
with exactly one non-empty future and one or more ready empty futures, we can always return an alias of the non-empty future (regardless of readiness).
I believe these optimizations could help with our recommended future chaining idiom (excerpt from the Programming Guide):
upcxx::future<> fut_all = upcxx::make_future();
for (long i = 0; i < N; i++) {
...
upcxx::future<> fut = ...
// conjoin the futures
fut_all = upcxx::when_all(fut_all, fut);
...
}
Here we currently always perform dynamic allocation on the make_future
and when_all
calls, but the first allocation is always unnecessary, and the second is unnecessary any time fut
is synchronously ready. The latter condition is far more likely to be true once we deploy as_eager_future as the default completion (WIP in PR 345), and it will be synchronously complete in exactly those cases involving no network traffic, which is exactly where the relative overheads of dynamic allocation for future management have the largest potential impact on performance.
Comments (9)
-
-
reporter @Amir Kamil said:
The when_all() implementation encodes the underlying futures in the spooky return type. We can’t change this type dynamically, so the logic for optimizing this cannot go in the existing when_all_fast itself.
This seems to me like a good reason to finish the fix to issue
#288and ditch spooky future returns fromwhen_all
entirely; i.e. this is already a defect we should fix, becausewhen_all
currently breaks the specified type semantics in a user-visible way. If that makes it possible/easier to deploy the runtime optimizations described in this issue, all the better.My basic view here is the static optimizations attempted by spooky futures are a failed optimization attempt; they only ever provided a potential benefit for futures with inferred types, and yet they violated the specified types in ways that broke the use of type interference in applications. Also they were only ever capable of optimizing statically ready futures, which I'd argue are far less important in practice than futures that are synchronously readied based on dynamic information (locality of access). I believe the optimizations described in this issue can fully subsume the potential benefits offered by the known-to-be-broken spooky future goop (for free in
make_future<>
and at the cost of at most one branch per argument inwhen_all
), with the substantial correctness upside of actually adhering to our specified types. -
My revised assessment after looking into this is mostly the opposite of my preliminary assessment.
- This is not as easy as I thought. There is a chain of dependencies that I have not been able to resolve.
get_ready_empty_future
cannot move intobackend_fwd.hpp
because it uses personas (in par mode), and I get errors aboutpersona
being incomplete inbackend_fwd.hpp
. I tried forward declarations ofget_ready_empty_future
infuture/core.hpp
. But depending on how I write the declarations, I get either an error that a specialization is defined after the template has been used or a linker error that the specialization is missing from the symbol table. I tried a non-template wrapper aroundget_ready_empty_future
but ran into runtime errors due to interactions with 2) below. Furthermore,get_ready_empty_future
callsmake_future
, so we need to use something likemake_fast_future
to avoid that circular dependency, and I’m not sure what all the implications of that are.
There might be a way around this, but I’m not inclined to pursue it further. I don’t thinkmake_future()
is critical to performance of the future-conjoining idiom. - My comment about not being able to steal the header from an underlying future is incorrect – the underlying futures are stored by value, so their headers can indeed be stolen. I was able to implement this optimization in two different ways, one that returns the canonical ready empty future and another that steals an underlying header. The latter is slightly more efficient on both GCC and Clang on my machine, so that’s what I’m going with. It also allows the optimization to be applied to 3).
For awhen_all()
over solely empty futures, my current implementation only considers stealing the header from the index-0 underlying future. It’ll take more work to get it to consider stealing from a different future, but I think it’s doable with an appropriate hand-rolled fold algorithm (no C++17 at our disposal, unfortunately). - I have an implementation that does this.
To my surprise, with my current implementation of 2) and 3), performance is not half bad. It’s not as good as promises, but future-conjoining gups with smp/seq/eager gets ~46% of the performance of promises on GCC and 65% on Clang. So I withdraw my comment about removing the conjoining idiom from our materials. (The gap between future- and promise-based implementations may be significantly larger when the eager optimization doesn’t apply, but I don’t have numbers to support that.)
- This is not as easy as I thought. There is a chain of dependencies that I have not been able to resolve.
-
This seems to me like a good reason to finish the fix to issue #288 and ditch spooky future returns from
when_all
entirely; i.e. this is already a defect we should fix, becausewhen_all
currently breaks the specified type semantics in a user-visible way. If that makes it possible/easier to deploy the runtime optimizations described in this issue, all the better.I don’t have an opinion one way or another on ditching spooky-future return from
when_all()
. It turns out to be orthogonal to this optimization, and most of thewhen_all()
implementation details probably are necessary to deal with the type system. The main question is when the spooky future turns into a normal one, and I’m ambivalent about whether that should in the return fromwhen_all()
or when the user converts that to a normal future. -
reporter @Amir Kamil wrote:
This is not as easy as I thought. There is a chain of dependencies that I have not been able to resolve. [...]. Furthermore, get_ready_empty_future calls make_future, so we need to use something like make_fast_future to avoid that circular dependency, and I’m not sure what all the implications of that are.
I don't have a specific suggestion without looking deeper, but can we perhaps leverage the fact that
when_all<>(void)
is equivalent tomake_future<>(void)
in order to help break the cycle?There might be a way around this, but I’m not inclined to pursue it further. I don’t think make_future() is critical to performance of the future-conjoining idiom.
Not sure I agree with this assessment. The base-case
make_future
will admittedly be lost in the noise for a significantly long conjoining chain. However once these in-flight optimizations are in place and assuming all the conjoins use synchronous/eager empty futures (eg in a single-node run), I could easily see a 4/6-halo exchange being too short to amortize the overhead of that first allocator call. For example, here is abbreviated code from the 3-d halo exchange in our jac3d example in upcxx-extras:upcxx::future<> fut = upcxx::make_future(); if(r[X] != 0){ fut = upcxx::when_all(fut, upcxx::copy(...)); } if(r[X] != p[X]-1){ fut = upcxx::when_all(fut, upcxx::copy(...)); } if(r[Y]!=0){ fut = upcxx::when_all(fut, upcxx::copy(...)); } if(r[Y]!=p[Y]-1){ fut = upcxx::when_all(fut, upcxx::copy(...)); } if(r[Z]!=0){ fut = upcxx::when_all(fut, upcxx::copy(...)); } if(r[Z]!=p[Z]-1){ fut = upcxx::when_all(fut, upcxx::copy(...)); } // Wait for all my upcxx::copy to complete fut.wait();
This particular example is not directly relevant because the copy operations actually involve the GPU which will not complete synchronously, but if they were all host memory these copies could all return ready empty futures (after PR 345).
Setting aside the conjoining case, there are numerous instances in our examples of using
make_future<>()
to initialize a future to a "safe" value, to workaround the fact that our default future constructor constructs a useless/dangerous object (spec issue 104).I'd really like to find a way to implement proposed optimization 1.
-
I tried a non-template wrapper around
get_ready_empty_future
but ran into runtime errors due to interactions with 2) below. Furthermore,get_ready_empty_future
callsmake_future
, so we need to use something likemake_fast_future
to avoid that circular dependency…Turns out I had a brain fart and forgot to change the seq-mode implementation of
get_ready_empty_future
to usemake_fast_future
. I did fix it for par mode but was testing in seq mode.The optimization is now implemented in PR 345. Note that the implementation of
make_future<>()
now requires UPC++ to be initialized. This wasn’t required before by the implementation, though it was by the spec. One non-conforming test needed to be updated. -
So optimization 1) already partially exists prior to PR 345. Here’s the relevant code in
impl_result.hpp
:template<> struct future_impl_result<> { bool ready() const { return true; } std::tuple<> result_refs_or_vals() const { return std::tuple<>(); } typedef future_header_ops_result_ready header_ops; future_header* steal_header() && { return &future_header_result<>::the_always; } };
There already is a canonical ready empty header for empty futures.
make_fast_future<>()
returns a spooky future that contains afuture_impl_result<>
, and when that spooky future is converted to a normal future,future_impl_result<>::steal_header()
avoids an allocation.However, prior to PR 345,
make_future<>()
explicitly allocates afuture_header_result<>
rather than usingfuture_header_result<>::the_always
. So it does not do the optimization.Thread-safety does not appear to be an issue with
future_header_result<>::the_always
– it has a ref count of -1, and operations on ref counts do nothing when the ref count is negative. Since a future that uses this header is ready, doing athen()
on it synchronously executes the callback on the calling persona.Given the above, we no longer need persona-specific ready empty futures in SEQ mode. We can specialize
make_future<>()
as follows:template<> future<> make_future() { return future<>( detail::future_impl_shref<detail::future_header_ops_general, /*unique=*/false>( &detail::future_header_result<>::the_always ), detail::internal_only{} ); }
-
Given the above, we no longer need persona-specific ready empty futures in SEQ mode.
For that matter, we likely can avoid keeping around a ready empty future at all in either SEQ or PAR mode. We should be able to just call
make_future<>()
once it is specialized as above. -
- changed status to resolved
Resolved by PR 345.
- Log in to comment
My preliminary assessment of your proposals:
backend.hpp
tobackend_fwd.hpp
, plus an alternate implementation whenUPCXX_BACKEND
is undefined.when_all()
implementation encodes the underlying futures in the spooky return type. We can’t change this type dynamically, so the logic for optimizing this cannot go in the existingwhen_all[_fast]()
itself. The other place the optimization could conceivably go is when converting the spooky type back to a normal reference-counted future. But the problem there is that the latter’s constructor invokessteal_header()
on the r-value spooky future, so that in turn needs to create a header. We can’t steal it from one of the underlying futures, as that effectively invalidates the underlying future. I don’t see a good solution to these issues.Even if we were to find a path forward, my guess is that future conjoining will still perform significantly worse than promises in the case of synchronous completion. I think we should reconsider whether future conjoining is an idiom we should be recommending for cases that are more naturally expressed with promises.