Implementing AMReX fillBoundary based on futures?

Issue #14 resolved
BrianS created an issue

We have been eyeing the idea that upcxx::future can capture the behavior we want everywhere, so let's put that to the test early on.

BoxLib::fillBoundary is a pretty typical operation, currently done with events.

Every rank computes for itself, using BoxLib metadata, which messages it expects to send to every other rank, and creates a 1D buffer in UPC++ global space. It sets a counter for how many messages it expects to send; let's call it "send_info.send_count".

Then an upcxx::shared_array<global_ptr<void*> > sendBuffs is created, so now every rank can see every other rank's send buffer.

The send buffer is created with upcxx::allocate as a 1D array.

These objects persist in BoxLib for the life of this particular fillBoundary "MotionPlan" (in the old Kelp terminology).
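A minimal sketch of this persistent setup, using the old (pre-1.0) UPC++ calls named in this issue (allocate, shared_array, myrank, ranks); SendInfo, compute_motion_plan, and setup_motion_plan are illustrative names, not the actual BoxLib/AMReX code:

    #include <cstddef>
    #include <upcxx.h>

    struct SendInfo {              // illustrative stand-in for BoxLib metadata
      std::size_t send_count;      // messages this rank expects to send
      std::size_t nbytes;          // total send-buffer size in bytes
    };

    SendInfo compute_motion_plan();  // hypothetical: walks BoxLib metadata

    SendInfo send_info;
    upcxx::global_ptr<char> sendBuffer;
    upcxx::shared_array< upcxx::global_ptr<char> > sendBuffs;

    // done once; the objects above persist for the life of the MotionPlan
    void setup_motion_plan() {
      send_info = compute_motion_plan();

      // 1D send buffer in the global address space
      sendBuffer = upcxx::allocate<char>(upcxx::myrank(), send_info.nbytes);

      // publish it so every rank can see every other rank's send buffer
      sendBuffs.init(upcxx::ranks());
      sendBuffs[upcxx::myrank()] = sendBuffer;
    }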

When fillBoundary is then called:

1. Each rank packs the parts of the FArrayBox destined for each rank contiguously into the send buffer, and sets its currentSendCount to 0.

2. Each rank then loops over the receives it wants and executes a remote function call:

    upcxx::async(src_rank, NULL)(BLPgas::Sendrecv, tag,
                                 upcxx::global_ptr<void>(NULL, src_rank),
                                 dst, nbytes, SeqNum,
                                 signal_event, (upcxx::event*)NULL, (int*)NULL);

3. The BLPgas::Sendrecv function then issues one of these for each request made:

    async_copy_and_signal(send_info.src_ptr,
                          send_info.dst_ptr,
                          send_info.nbytes,
                          send_info.signal_event,
                          send_info.done_event,
                          NULL);

and increments that rank's send counter (currentSendCount).

4. The fillBoundary function then waits for all its receives to be satisfied (they are all linked to one event), then unpacks its receive buffer into FArrayBox objects.

5. It then polls until currentSendCount matches the expected send count, calling the progress function in a while loop; after that the send buffer is safe to use for the next send round. A sketch of these last two steps follows.
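A minimal sketch of steps 4 and 5, assuming the old-UPC++ event wait and that upcxx::advance() is the progress function mentioned above; unpack_recv_buffer and the counter handling are illustrative:

    upcxx::event recv_event;       // every incoming copy signals this one event
    int currentSendCount = 0;      // bumped as BLPgas::Sendrecv issues copies

    void finish_fill_boundary() {
      // step 4: wait for all receives, then unpack into FArrayBoxes
      recv_event.wait();
      unpack_recv_buffer();        // hypothetical unpack helper

      // step 5: drain outgoing sends so the send buffer can be reused
      while (currentSendCount < (int)send_info.send_count) {
        upcxx::advance();          // the progress function
      }
      // the send buffer is now safe for the next round
    }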

It is a bit of a hard lift to code this correctly. Partly that is because this implementation is meant to reuse a lot of existing BoxLib code that already had an MPI_Isend/MPI_Irecv implementation. Setting that aside: how would we like to write code like this?

Comments (4)

  1. Former user Account Deleted

    I don't think this example requires futures (and therefore events), or at least I wouldn't do it that way. To me, futures, events, and continuation passing are all about managing the same thing: callback-driven control flow. In this formulation of fillBoundary we have no need for that. In fact, if you were to replace every instance of upcxx::event* with just a regular int*, you could achieve the same algorithm (see the sketch below). Events are only being used as counters that receivers explicitly poll.
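
    For instance, a sketch with hypothetical names, where a plain counter plays the role the event plays today:

    int messages_arrived = 0;          // plain counter in place of an event

    // completion action run at the receiver as each copy lands
    void on_message_delivered() { ++messages_arrived; }

    void wait_for_receives(int expected) {
      while (messages_arrived < expected)
        upcxx::advance();              // poll, exactly as the event version does
    }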

    If upcxx::async_copy_and_signal accepted a lambda to ship to the receiver as a completion action, instead of a remote event to satisfy, the user could still decrement a counter as is the case now, or do other, more informative things like marking a bitmap or satisfying a node in a tasking runtime.
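
    Something like this hypothetical variant (this signature does not exist today; recv_bitmap and msg_index are illustrative):

    upcxx::async_copy_and_signal(src_ptr, dst_ptr, nbytes,
        /*remote_done=*/ [=]() { recv_bitmap.set(msg_index); },  // runs at the receiver
        /*local_done=*/  &done_event);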

    And hopefully, with the addition of VIS-like primitives, the shared flat buffer with explicit packing/unpacking could disappear as well. This is the API for an async VIS copy that I would like to see:

    // T: element type; Dim: dimensionality of the array
    // return: future indicating local completion (the memory is ours again;
    //         it does not mean the remote side has received the data)
    // lens: dimension sizes
    // dst_addr, src_addr: address of the first element and Dim stride values
    //         for both the source (local) and destination (remote) arrays
    //         (strided_ptr is my own concoction)
    // remote_done: lambda to run remotely when the data has been delivered
    template<typename T, int Dim>
    future<> upcxx::put_strided(
      std::array<std::size_t,Dim> lens,
      upcxx::strided_ptr<T,Dim> dst_addr,
      upcxx::strided_ptr<T,Dim> src_addr,
      upcxx::function<void()> remote_done
    );
    

    With this routine the structure of the algorithm would remain, minus the buffer packing, and we would still use futures (events) to wait for all outgoing transmissions to complete. Futures do work well for locally initiated work even if the callbacks aren't utilized. But the remotely initiated side is best dealt with by user-declared state and lambdas nudging that state closer to its "completed" value. Again, if the user wants to use something more interesting than a countdown, they can have it.
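
    For concreteness, the send side of fillBoundary might then read like this sketch (Msg, plan, and arrivals are illustrative; put_strided is the proposal above, not an existing call):

    #include <vector>

    int arrivals = 0;                              // receiver-side counter

    void fill_boundary_sends() {
      std::vector< upcxx::future<> > pending;
      for (const Msg& m : plan.sends) {            // one message per neighbor
        pending.push_back(
          upcxx::put_strided<double,3>(
            m.lens, m.dst, m.src,                  // strided views, no packing
            /*remote_done=*/ []() { ++arrivals; }  // nudges receiver state
          ));
      }
      for (auto& f : pending) f.wait();            // local completion only
    }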

  2. Dan Bonachea

    John said: "This is the API for an async VIS copy that I would like to see."

    We should definitely discuss this further. I'm not sure what all is wrapped up in your "strided_ptr" data structure, and whether it captures all the generality that users will need for strided copies; specifically, the source and destination dimensions might not match in extent (although they probably do in dimensional cardinality), and there needs to be a way to express a transpose. Regardless, the details of how we express the metadata are easily remedied by massaging the API and/or adding arguments.

    More concerning (and relevant to this issue) is that GASNet-level VIS does not currently include a provision for remote completion notification. You can of course achieve this semantically by sending an AM from the initiator to the target after putv completion is reported to the initiator, but that potentially adds at least one network round-trip to the critical-path latency at the target. This is a feature we talked about adding to VIS for GASNet-EX, but it's currently slated as a "maybe". If we believe this will be important for UPC++, then we may need to increase the priority of that feature.

    Part of what's difficult is defining the right interface for GASNet-client-independent remote notification. The most natural way to do this in GASNet is with something that looks like an AM handler. However, if the remote_done lambda in your proposed API implicitly includes a variable-length closure that could consume more than a small fixed number of bytes, things might get ugly.

  3. BrianS reporter

    I'm actually thinking that the VIS API would be expressed as get and put with array_ref arguments.
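
    One possible shape for that (hypothetical signatures; array_ref here is an illustrative strided-view type, not an existing UPC++ class):

    template<typename T, int Dim>
    upcxx::future<> put(upcxx::array_ref<T,Dim> dst,          // remote view
                        upcxx::array_ref<const T,Dim> src);   // local view

    template<typename T, int Dim>
    upcxx::future<> get(upcxx::array_ref<T,Dim> dst,          // local view
                        upcxx::array_ref<const T,Dim> src);   // remote view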

  4. Dan Bonachea

    I believe this is resolved by the combination of the VIS operations and the remote completion feature that will be introduced with issue #76.
