Clarify behavior of discharge() in restricted context

Issue #204 resolved
Dan Bonachea created an issue

While poking around with Impl issue 607, I encountered an ambiguity regarding upcxx::discharge() from the restricted context.

Consider this program:

#include <upcxx/upcxx.hpp>
#include <iostream>
#include "util.hpp"


using namespace std;
using namespace upcxx;

int done = 0;

int main() {
  upcxx::init();
  print_test_header();

  if (local_team().rank_n() == rank_n()) {
    print_test_skipped("This test requires multiple nodes");
    upcxx::finalize();
    return 0;
  }

  auto gptr = new_<int>();
  rpc((rank_me()+1)%rank_n(),[=](){
      say() << "outer RPC starting";

      static int buffer;
      // activates internal function detail::copy_as_rget:
      copy(gptr, &buffer, 1, remote_cx::as_rpc([](){ done = 1; }));

      // the following property is not guaranteed by spec, 
      // but we are leveraging internal knowledge for this demonstration:
      UPCXX_ASSERT_ALWAYS(progress_required());

      say() << "starting discharge";
      upcxx::discharge(); // <-- DEADLOCK HERE for multi-node
      say() << "discharge complete";

  }).wait();
  say() << "outer RPC complete";

  while(!done)upcxx::progress();
  say() << "transaction complete";
  barrier();

  print_test_success();
  upcxx::finalize();
}

Run with multiple nodes, this test launches a upcxx::copy() from inside the restricted context that is carefully arranged to ensure it will result in progress_required()==true for the initiating persona. It then issues a upcxx::discharge() from inside the restricted context, and this currently deadlocks: because the internal progress call issued by discharge is treated as a no-op inside the restricted context by the implementation, so the discharge never returns. However I don't think this program is doing anything that we explicitly discourage or prohibit in current documentation.

Now the question:

Should it be meaningful to upcxx::discharge() inside the restricted context?

I think this is currently ambiguous in the spec, because section 10.2 Retricted Context is currently focused specifically on user-level progress (emphasis added):

User code running in the restricted context must assume that for the duration of the context all other attempts at making user-level progress, from any thread on any process, may result in a no-op every time.

The immediate implication is that a thread which is already in the restricted context should assume no-op behavior from further attempts at making progress. This makes it pointless to try and wait for UPC++ notifications from within restricted context since there is no viable mechanism to make the notifications visible to the user. Thus, calling any routine which spins on user-level progress until some notification occurs will likely hang the thread.

However discharge specifically spins on internal-level progress (not user-level progress) awaiting a state change specific to internal progress, so it's unclear how any of the current text above applies. I believe we have either a spec defect or an implementation defect, depending on the resolution to the above question. I think there are some subtle questions here about progress and our implementation that determine whether discharge inside the restricted context should be considered meaningful. However based on the outcome of that discussion, I see two possible resolutions:

  1. Decide that attempts to make internal-level progress within the restricted context may also be treated as no-ops.

    • Fix the spec section above to clarify the caveats also apply to internal progress
    • If we go this route, we should further decide whether discharge should behave differently inside the restricted context:
      • IF we keep the current behavior, we should clarify that discharge in restricted context can lead to deadlock, and the implementation should be adjusted to issue warnings about this behavior, or even fatal errors when we can detect an impending deadlock.
      • We could alternatively specify a change to the behavior of discharge, allowing it to return while progress_required() == true when called inside restricted context, thus avoiding potential deadlock by silently stubbing-away the call.
  2. Decide that attempts to make internal-level progress within the restricted context are meaningful and should not be ignored.

    • Fix the implementation to actually do this (instead of ignoring ALL calls to progress inside the restricted context), and any problems that might arise as a result.
      • This would require some non-trivial changes to our progress machinery, and I'm currently unclear on what internal problems might arise.

I'm currently leaning towards option 1, mostly because I'm worried that even if we solve any challenges inside the runtime, allowing a discharge to truly stall inside the restricted context introduces a hidden dependency edge on remote internal progress. I don't currently see a way this could lead to deadlock for an otherwise well-formed program. However allowing the user to deliberately block user-level progress of the current persona dependent on remote activity feels like a poor design and not one that we should encourage, especially in the guise of discharge which is intended as a performance optimization for a process that is about to enter a period of local inattentiveness to progress (which by definition is never the situation inside the restricted context).

Comments (7)

  1. Rob Egan

    I do not believe MHM2 call discharge from within the restricted context. We use discharge rarely and only where we know the attentiveness will be lost for a bit. With that rule-of-thumb, then it should never be placed in the restricted context.

    BUT it would be beneficial IMO if it was guaranteed to also be a noop like progress.  I’ve always expected that discharge would be a noop in the restricted context.

    Motivation is two fold:

    1. if there is a function that can be called within or without of the restricted context, say an opaque class member function or deep in some library, you cannot control what it does. If it doesn’t rely external events but calls progress or discharge to be nice to other asynchronous activity the code should not deadlock
    2. when one is refactoring top level code to be asynchronous you can safely throw any function that calls progress or discharge in to a lamba and know it won’t misbehave (unless an external event is required).

    Alternatively, you could just prohibit discharge from being called from the restricted context (like barrier, dist_object constructors, etc). I cannot think of a good example (in well formed code) where discharge is necessary and a call to progress() is not sufficient.

  2. Amir Kamil

    I think I’d prefer discharge() to be (unconditionally) prohibited in the restricted context. In addition to it having blocking semantics as currently defined, I don’t think turning it into a no-op is in line with the intended use case. IMO, the intent is to provide the user with some assurance that they can enter a period of inattentiveness, and I think that assurance falls apart if it becomes a no-op. If the user has a callback that involves a lengthy period of inattentiveness, they should discharge() before entering the progress call that runs that callback. (It would be nice if we had a “yield-for-progress” mechanism, but that may require C++20 coroutines to implement.)

    For users who want the semantics of “discharge if I’m not in progress”, that can easily be accomplished with

    if (!in_progress()) discharge();
    

    But I think this should be explicit so that the behavior matches what the user writes, rather than us doing this implicitly depending on where discharge() is called.

  3. Dan Bonachea reporter

    After our discussion today, we seem to have consensus on Amir's proposal, namely a blanket prohibition to calling discharge() in the restricted context.

    I've updated spec PR 100 and will have an implementation PR soon.

    On a related topic, we also agreed to change the default argument value for progress_required(ps=...) and discharge(ps=...) from top_persona_scope() to default_persona_scope(), as that seems like a better match to user expectations and common use cases. So this work will also include that change.

  4. Log in to comment