Clarify behavior of discharge() in restricted context

While poking around with Impl issue 607, I encountered an ambiguity regarding upcxx::discharge() from the restricted context.

Consider this program:

#include <upcxx/upcxx.hpp>
#include <iostream>
#include "util.hpp"


using namespace std;
using namespace upcxx;

int done = 0;

int main() {
  upcxx::init();
  print_test_header();

  if (local_team().rank_n() == rank_n()) {
    print_test_skipped("This test requires multiple nodes");
    upcxx::finalize();
    return 0;
  }

  auto gptr = new_<int>();
  rpc((rank_me()+1)%rank_n(),[=](){
      say() << "outer RPC starting";

      static int buffer;
      // activates internal function detail::copy_as_rget:
      copy(gptr, &buffer, 1, remote_cx::as_rpc([](){ done = 1; }));

      // the following property is not guaranteed by spec, 
      // but we are leveraging internal knowledge for this demonstration:
      UPCXX_ASSERT_ALWAYS(progress_required());

      say() << "starting discharge";
      upcxx::discharge(); // <-- DEADLOCK HERE for multi-node
      say() << "discharge complete";

  }).wait();
  say() << "outer RPC complete";

  while(!done)upcxx::progress();
  say() << "transaction complete";
  barrier();

  print_test_success();
  upcxx::finalize();
}

Run with multiple nodes, this test launches a upcxx::copy() from inside the restricted context that is carefully arranged to ensure it will result in progress_required()==true for the initiating persona. It then issues a upcxx::discharge() from inside the restricted context, and this currently deadlocks: because the internal progress call issued by discharge is treated as a no-op inside the restricted context by the implementation, so the discharge never returns. However I don't think this program is doing anything that we explicitly discourage or prohibit in current documentation.

Now the question:

Should it be meaningful to upcxx::discharge() inside the restricted context?

I think this is currently ambiguous in the spec, because section 10.2 Retricted Context is currently focused specifically on user-level progress (emphasis added):

User code running in the restricted context must assume that for the duration of the context all other attempts at making user-level progress, from any thread on any process, may result in a no-op every time.

The immediate implication is that a thread which is already in the restricted context should assume no-op behavior from further attempts at making progress. This makes it pointless to try and wait for UPC++ notifications from within restricted context since there is no viable mechanism to make the notifications visible to the user. Thus, calling any routine which spins on user-level progress until some notification occurs will likely hang the thread.

However discharge specifically spins on internal-level progress (not user-level progress) awaiting a state change specific to internal progress, so it's unclear how any of the current text above applies. I believe we have either a spec defect or an implementation defect, depending on the resolution to the above question. I think there are some subtle questions here about progress and our implementation that determine whether discharge inside the restricted context should be considered meaningful. However based on the outcome of that discussion, I see two possible resolutions:

Decide that attempts to make internal-level progress within the restricted context may also be treated as no-ops.
- Fix the spec section above to clarify the caveats also apply to internal progress
- If we go this route, we should further decide whether discharge should behave differently inside the restricted context:
  - IF we keep the current behavior, we should clarify that discharge in restricted context can lead to deadlock, and the implementation should be adjusted to issue warnings about this behavior, or even fatal errors when we can detect an impending deadlock.
  - We could alternatively specify a change to the behavior of discharge, allowing it to return while progress_required() == true when called inside restricted context, thus avoiding potential deadlock by silently stubbing-away the call.
Decide that attempts to make internal-level progress within the restricted context are meaningful and should not be ignored.
- Fix the implementation to actually do this (instead of ignoring ALL calls to progress inside the restricted context), and any problems that might arise as a result.
  - This would require some non-trivial changes to our progress machinery, and I'm currently unclear on what internal problems might arise.

I'm currently leaning towards option 1, mostly because I'm worried that even if we solve any challenges inside the runtime, allowing a discharge to truly stall inside the restricted context introduces a hidden dependency edge on remote internal progress. I don't currently see a way this could lead to deadlock for an otherwise well-formed program. However allowing the user to deliberately block user-level progress of the current persona dependent on remote activity feels like a poor design and not one that we should encourage, especially in the guise of discharge which is intended as a performance optimization for a process that is about to enter a period of local inattentiveness to progress (which by definition is never the situation inside the restricted context).

Comments (7)