Clarify behavior of discharge() in restricted context
While poking around with Impl issue 607, I encountered an ambiguity regarding upcxx::discharge()
from the restricted context.
Consider this program:
#include <upcxx/upcxx.hpp>
#include <iostream>
#include "util.hpp"
using namespace std;
using namespace upcxx;
int done = 0;
int main() {
upcxx::init();
print_test_header();
if (local_team().rank_n() == rank_n()) {
print_test_skipped("This test requires multiple nodes");
upcxx::finalize();
return 0;
}
auto gptr = new_<int>();
rpc((rank_me()+1)%rank_n(),[=](){
say() << "outer RPC starting";
static int buffer;
// activates internal function detail::copy_as_rget:
copy(gptr, &buffer, 1, remote_cx::as_rpc([](){ done = 1; }));
// the following property is not guaranteed by spec,
// but we are leveraging internal knowledge for this demonstration:
UPCXX_ASSERT_ALWAYS(progress_required());
say() << "starting discharge";
upcxx::discharge(); // <-- DEADLOCK HERE for multi-node
say() << "discharge complete";
}).wait();
say() << "outer RPC complete";
while(!done)upcxx::progress();
say() << "transaction complete";
barrier();
print_test_success();
upcxx::finalize();
}
Run with multiple nodes, this test launches a upcxx::copy()
from inside the restricted context that is carefully arranged to ensure it will result in progress_required()==true
for the initiating persona. It then issues a upcxx::discharge()
from inside the restricted context, and this currently deadlocks: because the internal progress call issued by discharge is treated as a no-op inside the restricted context by the implementation, so the discharge never returns. However I don't think this program is doing anything that we explicitly discourage or prohibit in current documentation.
Now the question:
Should it be meaningful to upcxx::discharge()
inside the restricted context?
I think this is currently ambiguous in the spec, because section 10.2 Retricted Context is currently focused specifically on user-level progress (emphasis added):
User code running in the restricted context must assume that for the duration of the context all other attempts at making user-level progress, from any thread on any process, may result in a no-op every time.
The immediate implication is that a thread which is already in the restricted context should assume no-op behavior from further attempts at making progress. This makes it pointless to try and wait for UPC++ notifications from within restricted context since there is no viable mechanism to make the notifications visible to the user. Thus, calling any routine which spins on user-level progress until some notification occurs will likely hang the thread.
However discharge specifically spins on internal-level progress (not user-level progress) awaiting a state change specific to internal progress, so it's unclear how any of the current text above applies. I believe we have either a spec defect or an implementation defect, depending on the resolution to the above question. I think there are some subtle questions here about progress and our implementation that determine whether discharge inside the restricted context should be considered meaningful. However based on the outcome of that discussion, I see two possible resolutions:
-
Decide that attempts to make internal-level progress within the restricted context may also be treated as no-ops.
- Fix the spec section above to clarify the caveats also apply to internal progress
- If we go this route, we should further decide whether discharge should behave differently inside the restricted context:
- IF we keep the current behavior, we should clarify that discharge in restricted context can lead to deadlock, and the implementation should be adjusted to issue warnings about this behavior, or even fatal errors when we can detect an impending deadlock.
- We could alternatively specify a change to the behavior of discharge, allowing it to return while
progress_required() == true
when called inside restricted context, thus avoiding potential deadlock by silently stubbing-away the call.
-
Decide that attempts to make internal-level progress within the restricted context are meaningful and should not be ignored.
- Fix the implementation to actually do this (instead of ignoring ALL calls to progress inside the restricted context), and any problems that might arise as a result.
- This would require some non-trivial changes to our progress machinery, and I'm currently unclear on what internal problems might arise.
- Fix the implementation to actually do this (instead of ignoring ALL calls to progress inside the restricted context), and any problems that might arise as a result.
I'm currently leaning towards option 1, mostly because I'm worried that even if we solve any challenges inside the runtime, allowing a discharge to truly stall inside the restricted context introduces a hidden dependency edge on remote internal progress. I don't currently see a way this could lead to deadlock for an otherwise well-formed program. However allowing the user to deliberately block user-level progress of the current persona dependent on remote activity feels like a poor design and not one that we should encourage, especially in the guise of discharge
which is intended as a performance optimization for a process that is about to enter a period of local inattentiveness to progress (which by definition is never the situation inside the restricted context).
Comments (7)
-
-
reporter -
assigned issue to
-
assigned issue to
-
reporter Proposed spec resolution in spec PR 100
-
I think I’d prefer
discharge()
to be (unconditionally) prohibited in the restricted context. In addition to it having blocking semantics as currently defined, I don’t think turning it into a no-op is in line with the intended use case. IMO, the intent is to provide the user with some assurance that they can enter a period of inattentiveness, and I think that assurance falls apart if it becomes a no-op. If the user has a callback that involves a lengthy period of inattentiveness, they shoulddischarge()
before entering the progress call that runs that callback. (It would be nice if we had a “yield-for-progress” mechanism, but that may require C++20 coroutines to implement.)For users who want the semantics of “discharge if I’m not in progress”, that can easily be accomplished with
if (!in_progress()) discharge();
But I think this should be explicit so that the behavior matches what the user writes, rather than us doing this implicitly depending on where
discharge()
is called. -
reporter After our discussion today, we seem to have consensus on Amir's proposal, namely a blanket prohibition to calling
discharge()
in the restricted context.I've updated spec PR 100 and will have an implementation PR soon.
On a related topic, we also agreed to change the default argument value for
progress_required(ps=...)
anddischarge(ps=...)
fromtop_persona_scope()
todefault_persona_scope()
, as that seems like a better match to user expectations and common use cases. So this work will also include that change. -
reporter Proposed implementation resolution now in impl PR 488
-
reporter - changed status to resolved
Resolved in spec PR 100 and impl PR 488
- Log in to comment
I do not believe MHM2 call discharge from within the restricted context. We use discharge rarely and only where we know the attentiveness will be lost for a bit. With that rule-of-thumb, then it should never be placed in the restricted context.
BUT it would be beneficial IMO if it was guaranteed to also be a noop like progress. I’ve always expected that discharge would be a noop in the restricted context.
Motivation is two fold:
Alternatively, you could just prohibit discharge from being called from the restricted context (like barrier, dist_object constructors, etc). I cannot think of a good example (in well formed code) where discharge is necessary and a call to progress() is not sufficient.