atomic_fetchadd example

Issue #31 resolved
BrianS created an issue

So I had this form in my talk:

void insertBuffer(const Bob& b,
   const vector<global_ptr<char> >& buffers,
   const vector<global_ptr<uint64_t> >& offsets)
{
  ostringstream stream;
  stream << b; // serialization
  string* bytes = new string(stream.str());
  uint64_t s = bytes->size();
  intrank_t key = hash<Bob>{}(b) % upcxx::rank_n(); // assumes a hash<Bob> specialization
  // reserve a landing zone on the target by bumping its offset counter
  upcxx::future<uint64_t> end =
      upcxx::atomic_fetchadd(offsets[key], s);
  end.wait();
  auto f = upcxx::rput(bytes->c_str(),
                       buffers[key] + end.result() - s, s);
  // free the serialization buffer once the rput completes
  f.then([bytes](){ delete bytes; });
}

The idea is that Bob has variable size. I also let the function return and the future go out of scope, relying on the continuation to run when f is ready, at which point the data in bytes is safe to discard. The user's code can go on to whatever operations it needs to construct the next Bob, then call insertBuffer again with that Bob. A better design would also check that you are not overflowing the destination buffer.

This could be combined with a signaling function that updates a remote variable to indicate b has been completed and can be harvested for the hash table.
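One way to express that with UPC++ remote completions is sketched below (not part of the example above; the counter name is made up for illustration):

#include <upcxx/upcxx.hpp>

// Hypothetical target-side counter that the consumer polls before harvesting
// completed Bobs out of the buffer.
uint64_t bobsReady = 0;

// Sketch: the same rput, but requesting both local operation completion (so
// the caller knows when the source buffer may be freed) and a remote
// completion RPC that bumps the counter on the target once the payload lands.
upcxx::future<> signalingPut(const char* src,
                             upcxx::global_ptr<char> dst, uint64_t s)
{
  return upcxx::rput(src, dst, s,
                     upcxx::operation_cx::as_future() |
                     upcxx::remote_cx::as_rpc([](){ bobsReady++; }));
}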

Or, we can go all non-blocking

void insertBuffer2(const Bob& b,
   const vector<global_ptr<char> >& buffers,
   const vector<global_ptr<uint64_t> >& offsets)
{
  ostringstream stream;
  stream << b; // serialization
  string* bytes = new string(stream.str());
  uint64_t s = bytes->size();
  intrank_t key = hash<Bob>{}(b) % upcxx::rank_n();
  global_ptr<char> buff = buffers[key];
  upcxx::future<uint64_t> end =
      upcxx::atomic_fetchadd(offsets[key], s);
  // chain the rput off the fetch-add, and the cleanup off the rput
  end.then([=](uint64_t endVal){
      auto f = upcxx::rput(bytes->c_str(),
                           buff + endVal - s, s);
      f.then([bytes](){ delete bytes; });
    });
}

When the fetchadd completes it triggers the rput; when the rput completes it triggers the delete of the byte string.

Comments (7)

  1. Dan Bonachea

    This might work as a nice tutorial example where you start with just a simple rput of a buffer, then add serialization of an object w/ cleanup of the serialization buffer, target notification, fetch-add acquire of a landing zone, and full asynchrony as successive versions to highlight each feature.

    Regarding serialization, this is not really our feature but it's probably worth including to demonstrate how we recommend non-POD objects be communicated. That being said, how many local data copies are present in the version you gave? Assuming Bob is very large, how would one best do serialization that performs exactly one local data copy?
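
    For reference, the "simple rput of a buffer" starting version mentioned above could be as small as the sketch below (the landing-zone pointer is assumed to be known up front; names are made up):

    #include <upcxx/upcxx.hpp>
    #include <cstddef>

    // Simplest version: the payload is already a flat buffer and the
    // landing zone on the target is known, so a single rput suffices.
    void putBuffer(const char* payload, std::size_t n,
                   upcxx::global_ptr<char> landingZone)
    {
      upcxx::future<> f = upcxx::rput(payload, landingZone, n);
      f.wait(); // block until the transfer is complete
    }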

  2. BrianS reporter

    Oh, the copies could be a huge cost for large Bob objects. stringstream doesn't give us access to the raw pointer. The material I've gone through on Boost serialization is not much clearer: you have an Archiver, but you still push data into it with operator<<. Unless I'm just not seeing it in the documentation.

    If Bob's size can be bounded by some reasonable limit, then we can keep a pool of buffers: on function entry you grab the next free slot, set stream.pubsetbuf(currentBuffer, PoolElementSize), call rput with currentBuffer, and have the last continuation return currentBuffer to the pool. Really, we would like to ostream into memory allocated in the shared segment if that eliminates a copy for the transfer.
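
    A rough sketch of the streambuf part of that idea (pool management elided and names made up; since pubsetbuf on a stringbuf is implementation-defined, this uses a tiny fixed-size streambuf instead):

    #include <ostream>
    #include <streambuf>
    #include <cstddef>

    // Fixed-size put area over a pooled buffer, so serialization writes
    // directly into it with no intermediate string copy. Writes beyond the
    // capacity set failbit rather than growing the buffer.
    struct PoolBuf : std::streambuf {
      PoolBuf(char* buf, std::size_t n) { setp(buf, buf + n); }
      std::size_t used() const { return pptr() - pbase(); }
    };

    // Serialize b straight into a pool slot and report the byte count; the
    // caller rputs [buf, buf+used) and returns the slot to the pool from
    // the final continuation.
    std::size_t serializeInto(const Bob& b, char* buf, std::size_t capacity)
    {
      PoolBuf pb(buf, capacity);
      std::ostream stream(&pb);
      stream << b;
      return pb.used();
    }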

  3. Dan Bonachea

    The reason I ask is that, if I understand correctly, this is always the wrong way to move small data. In particular, the whole reason for the extra round-trip to arrange a landing zone at the target is so you can use rput, which becomes RDMA on the NIC, so that the setup costs (at least those that don't scale with payload size) are amortized and offset by the zero-copy RDMA transfer performance of the payload. However, if the setup cost includes multiple memcopies that scale with payload size, that quickly erodes the benefit of using RDMA.

    On the other hand, if the cost of locally memcopying the payload is trivial, then the most efficient way to perform this communication is likely to serialize Bob and send the buffer by value as part of an RPC closure, where the lambda on the remote side discovers the final location for the payload, copies and/or deserializes it into place, and performs the target notification. With that approach this could all compile down to a single AMMedium on the network, which should give much lower overhead and latency for a small payload.
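
    A sketch of that RPC path using UPC++'s view-based serialization to ship the bytes inside the closure (deserialization and notification on the target are left as comments):

    #include <upcxx/upcxx.hpp>
    #include <sstream>
    #include <string>

    // Sketch: serialize Bob locally, then send the bytes by value inside an
    // RPC. The target-side lambda decides where the data finally lives, so
    // no round-trip or landing-zone setup is needed; for small payloads the
    // whole thing can go out as a single active message.
    upcxx::future<> insertViaRpc(const Bob& b, upcxx::intrank_t target)
    {
      std::ostringstream stream;
      stream << b;                       // serialization
      std::string bytes = stream.str();
      return upcxx::rpc(target,
          [](upcxx::view<char> payload) {
            // runs on the target: copy/deserialize payload.begin()..end()
            // into its final location and perform the notification
          },
          upcxx::make_view(bytes.begin(), bytes.end()));
    }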

  4. BrianS reporter

    I understand the reason, but I'm unclear on what the remedy should be. Bob should be serialized once and not copied again. Do we need the send buffer to be in registered memory? If part of Bob's serialization is to give an upper bound on its linear size, then we can allocate the buffer space in registered memory and perform the rput from there. The destination can deserialize out of registered memory.

  5. Dan Bonachea

    Assuming Bob is large enough that RDMA is a win (despite the initial round-trip to get a landing zone), ideally Bob is serialized into a registered local memory buffer and rput from there with no additional copies. Registration of the source buffer is not a hard requirement, but it could eliminate overheads on some networks.
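
    A sketch of that flow, reusing the PoolBuf streambuf from the earlier sketch and using upcxx::new_array so the rput source lives in the local shared (registered) segment; maxSerializedSize is a hypothetical bound reported by Bob:

    #include <upcxx/upcxx.hpp>
    #include <ostream>
    #include <cstddef>

    // Sketch: serialize directly into the local shared segment so the rput
    // source is registered memory, then free it from the final continuation.
    void insertRegistered(const Bob& b, upcxx::global_ptr<char> dst,
                          std::size_t maxSerializedSize)
    {
      upcxx::global_ptr<char> src = upcxx::new_array<char>(maxSerializedSize);
      PoolBuf pb(src.local(), maxSerializedSize); // streambuf from the sketch above
      std::ostream stream(&pb);
      stream << b;                                // the one local copy of Bob
      upcxx::rput(src.local(), dst, pb.used())
          .then([src]() { upcxx::delete_array(src); });
    }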
