upcxx::copy deadlocks in rpc depending on memory_kind

Issue #607 invalid
whorne created an issue

Hi,

I am still relatively new to UPC++, so apologies in advance if this is way out of left field. I am attempting to perform a upcxx::copy within an RPC and found some odd differences between host memory and CUDA device memory. The following code always deadlocks on my setup, a typical NVIDIA workstation; if I use host memory instead, it runs without issue. Is there something about this that is not allowed by the spec, or is the issue due to something else?

TEST_F(BasicCuda, MinimalBreak) {

  using device_pointer = upcxx::global_ptr<double,upcxx::memory_kind::cuda_device>;

  auto device_allocator = upcxx::make_gpu_allocator<upcxx::cuda_device>(1000*8);
  auto data_to_pass = 
    device_pointer(device_allocator.allocate<double>(100));
  
  if (upcxx::rank_me() == 0) {
    upcxx::rpc_ff(1,[](device_pointer head_rank_data){

      double test_copy_data[100];

      bool done = false;
      upcxx::copy(head_rank_data, test_copy_data, 100, upcxx::remote_cx::as_rpc([&done](){
        done = true;
      }));

      upcxx::discharge();
      while(!done)upcxx::progress();
      std::cout << "This will never be printed \n";


    }, data_to_pass);
  }

  upcxx::barrier();

  device_allocator.destroy();

}

Comments (5)

  1. Dan Bonachea

    Hello @whorne - Thanks for reaching out, this is a great question!

    There are several problems with the sample code fragment you've provided.

    Most importantly, calling upcxx::progress() within an RPC callback (which itself runs inside UPC++ user-level progress) has no effect. See UPC++ Specification section 10.2 "Restricted Context":

    a thread which is already in the restricted context should assume no-op behavior from further attempts at making progress. This makes it pointless to try and wait for UPC++ notifications from within restricted context since there is no viable mechanism to make the notifications visible to the user. Thus, calling any routine which spins on user-level progress until some notification occurs will likely hang the thread.

    This means you should never write a spin-loop around upcxx::progress() within an RPC callback: with those calls doing nothing, such a loop will likely deadlock, as you observed. Similarly, upcxx::discharge() is defined as a spin-loop around upcxx::progress(), so invoking it within an RPC callback (or anywhere else inside the restricted context) is also ill-formed. Incidentally, invoking future::wait() on a non-ready future from within the restricted context is prohibited for the same reason.

    You don't really explain the actual problem you're trying to solve, so I'm not sure what your goal is here. However, the intuition to keep in mind is that code running in the restricted context (e.g. inside an RPC callback) should never "stall" waiting for completion of an asynchronous operation. Instead of stalling, such code should schedule a continuation to execute upon completion of the asynchronous operation. So instead of writing code like this:

    // VERSION 0
        upcxx::rpc_ff(1,[](device_pointer head_rank_data){
          double test_copy_data[100];
          bool done = false;
          // launch asynchronous operation:
          upcxx::copy(head_rank_data, test_copy_data, 100, upcxx::remote_cx::as_rpc([&done](){
            done = true;
          }));
          upcxx::discharge();             // ERRONEOUS - may deadlock
          while(!done)upcxx::progress();  // ERRONEOUS - may deadlock
          // dependent work:
          std::cout << "Do whatever comes next\n";
          // process test_copy_data ...      
        }, data_to_pass);
    

    You should instead write something like this:

    // VERSION 1
        upcxx::rpc_ff(1,[](device_pointer head_rank_data){
          static double test_copy_data[100];
          // launch asynchronous operation:
          upcxx::copy(head_rank_data, test_copy_data, 100).then([](){
            // dependent work inside continuation executes after operation completion
            std::cout << "Do whatever comes next\n";
            // process test_copy_data ...
          });
        }, data_to_pass);
    

    or equivalently (for this case):

    // VERSION 2
        upcxx::rpc_ff(1,[](device_pointer head_rank_data){
          static double test_copy_data[100];
          // launch asynchronous operation:
          upcxx::copy(head_rank_data, test_copy_data, 100, upcxx::remote_cx::as_rpc([](){
            // dependent work inside continuation executes after operation completion
            std::cout << "Do whatever comes next\n";
            // process test_copy_data ...        
          }));
        }, data_to_pass);
    

    Note that above I've also converted test_copy_data from stack memory to static memory (and dropped it from the lambda capture lists, since a variable with static storage duration is referenced inside a lambda without being captured). This is important for correctness, because the continuation generally executes AFTER the stack frame of the RPC callback has returned and its stack variables have been destroyed (where "generally" means "usually" in version 1 above, and "always" in version 2). Depending on your actual use case (e.g. if several of these RPCs might be outstanding concurrently), this deferral might require relocating the destination array to heap memory, as sketched below.
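
    For illustration only, here is one possible (untested) sketch of that heap-based variant, reusing the device_pointer alias and data_to_pass from your original fragment: each incoming RPC gets its own private destination buffer, which the continuation frees once the data has been processed.

    // VERSION 3 (illustrative sketch only)
        upcxx::rpc_ff(1,[](device_pointer head_rank_data){
          // a private heap buffer per incoming RPC, safe if several are outstanding
          double *test_copy_data = new double[100];
          // launch asynchronous operation:
          upcxx::copy(head_rank_data, test_copy_data, 100).then([test_copy_data](){
            // dependent work inside continuation executes after operation completion
            std::cout << "Do whatever comes next\n";
            // process test_copy_data ...
            delete [] test_copy_data;  // release the buffer once it is no longer needed
          });
        }, data_to_pass);

    Any other ownership scheme (e.g. a pool of reusable buffers) would work equally well; the essential point is just that the destination must outlive the continuation.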

    Regarding your finding that the code appears to work on host memory: I suspect the relevant difference is not the type of memory involved, but whether the upcxx::copy() operation happens to complete synchronously. Specifically, if you issue a upcxx::copy() using host memory on both sides between processes sharing the same compute node, the data motion is equivalent to a std::memcpy() and completes synchronously inside the injection call. For such a call using the default operation_cx::as_future() completion variant (as in my version 1 above), the returned future is already ready, which may have misled you in your experiments. So, for example, in my version 1 above, which chains a continuation on the future returned by upcxx::copy(), that continuation might also execute synchronously. However, even when a upcxx::copy() completes the data motion synchronously, a remote_cx::as_rpc() callback destined for the initiating persona should NOT execute synchronously within the call: it is deferred until the next entry to user-level progress from outside the restricted context, meaning your original code fragment should still be expected to deadlock as written.
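
    To make that contrast concrete, here is a small sketch (using a hypothetical local_buf) of the blocking pattern that IS legal from the ordinary, non-restricted context, i.e. directly in the test body rather than inside an RPC callback:

        // ordinary (non-restricted) context, e.g. directly in the test body:
        double local_buf[100];
        upcxx::future<> f = upcxx::copy(data_to_pass, local_buf, 100);
        // f may or may not already be ready at this point (a node-local,
        // host-to-host copy can complete synchronously); portable code
        // should not rely on that, so block on the future:
        f.wait();   // legal here, since we are not inside user-level progress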

    Finally, I'll note that if you happened to test with a very old version of UPC++ (prior to 2021.9.0, where issue #477 was fixed) you might have seen slightly different behavior for programs similar to yours, but that was a defect we have since repaired.

    Does this help clarify the difficulties you encountered?

  2. whorne reporter

    Thank you for the explanation! What I was originally attempting to do was to offload some heavily unbalanced work via RPC that involves device data. I narrowed the deadlock down to a upcxx::copy(...).wait() and attempted to minimally recreate it above. I had thought that using a upcxx::copy of the data might be better than going through a serialization process by hand (copy from GPU to CPU → serialize CPU data across nodes → CPU to GPU on the other side). It looks like, following your continuation examples, I should still be able to do what I need.

  3. Dan Bonachea

    I narrowed the deadlock down to a upcxx::copy(...).wait() and attempted to minimally recreate it above.

    Yes - upcxx::copy(...).wait() from inside an RPC handler will not work, except in the trivial case where the copy happens to complete synchronously. In fact, if you are building in the debug codemode (upcxx -g), that attempt should produce a fatal error at runtime.

    I cannot overstate the importance and utility of using the debug codemode during development; if you're not already using it, you can find more info here.

    I had thought that using a upcxx::copy of the data might be better than going through a serialization process by hand (copy from GPU to CPU → serialize CPU data across nodes → CPU to GPU on the other side).

    Yes! This is absolutely the design goal of upcxx::copy() and the primary selling point of the UPC++ memory kinds feature. We've seen enormous speedups from the transformation you describe on platforms with native memory kinds support.
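
    In case it's useful, here is a rough, untested sketch of what that direct path can look like; the names (gpu_alloc, my_seg, dptrs, remote_seg) are hypothetical, and it assumes an initialized UPC++ runtime built with memory kinds. A single upcxx::copy() moves the data between the two processes' device segments, with no manual host staging:

        // each rank allocates a device segment and carves out 100 doubles
        auto gpu_alloc = upcxx::make_gpu_allocator<upcxx::cuda_device>(1000*sizeof(double));
        upcxx::global_ptr<double,upcxx::memory_kind::cuda_device> my_seg =
          gpu_alloc.allocate<double>(100);
        // publish each rank's device pointer so other ranks can find it
        upcxx::dist_object<upcxx::global_ptr<double,upcxx::memory_kind::cuda_device>> dptrs(my_seg);
        int neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
        auto remote_seg = dptrs.fetch(neighbor).wait();
        // one call performs the cross-process, device-to-device transfer:
        upcxx::copy(remote_seg, my_seg, 100).wait();  // ok: ordinary (non-restricted) context
        upcxx::barrier();      // ensure no rank is still reading another's segment
        gpu_alloc.destroy();

    On platforms with native memory kinds support, a call like that avoids host staging entirely.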

    It looks like, following your continuation examples, I should still be able to do what I need.

    Great! I'll go ahead and close this issue since it sounds resolved, but feel free to create a new issue or reach out over email if you encounter other troubles. We'd also be happy to chat more about your application over Zoom if that would be useful.
