Quiescence issues with parallel backend
Hi,
(sorry if that's the wrong place to post this)
There seems to be an issue with quiescence when using the parallel backend in my application. I use atomically updated counters to keep track of the RPCs and LPCs in flight. In some runs, several LPCs remain in flight, and no matter how many times I call progress, the count doesn't decrease. In other runs the application terminates normally. After the main code of my application finishes, I tried calling progress until all the counters were zero, but that never happened. My first guess was a race condition on the counter variables, but I make sure they are updated atomically, which should prevent this kind of issue.
As an alternative, I also tried the following:
if (config::isGasnetSequentialBackend) {
    // […SNIP…]
} else {
    auto start = std::chrono::steady_clock::now();
    while (activeActors.load() > 0) {
        upcxx::progress();
    }
    auto end = std::chrono::steady_clock::now();
    runTime = std::chrono::duration<double, std::ratio<1>>(end - start).count();
    for (auto &actorPairs : actors) {
        if (actorPairs.second.where() == upcxx::rank_me()) {
            auto aRef = *(actorPairs.second.local());
            aRef->actorThread.join();
            aRef->actorThread = std::thread();
            std::cout << aRef->name << " thread terminated." << std::endl;
        }
    }
}
std::cout << "messages in flight: RPCs: " << rpcsInFlight << " LPCs: " << lpcsInFlight << std::endl;
// Drain the queues, we want no more messages in flight.
while (rpcsInFlight.load() > 0 || lpcsInFlight.load() > 0) {
    upcxx::progress();
    for (auto &actorPairs : actors) {
        if (actorPairs.second.where() == upcxx::rank_me()) {
            Actor *a = *(actorPairs.second.local());
            // Always evaluates to true. Why? there are no more other threads active, they are all joined above.
            if (a->actorPersona->active()) {
                std::cout << "persona of " << a->name << " still active!" << std::endl;
                continue;
            }
            upcxx::persona_scope ps(*a->actorPersona);
            upcxx::discharge(ps);
        }
    }
}
However, the if statement marked with the comment "Always evaluates to true" always evaluates to true. If I omit it, there is an assertion failure. Apparently that persona is still active somewhere, but I don't know where. I checked, and there are no other threads active at this point, and this thread did not assume that persona before that if statement. I didn't see anything in the spec that would help me with this issue.
Best, Alex
Comments (3)
-
reporter - changed status to resolved
I think this was an issue with my code. I talked to John about it, and my actual problem was solved using rank-internal barriers. The issue in the listing here is due to me using the initial personas on threads that finished executing. Apparently those enter an undefined state after thread execution finishes, and therefore performing actions on them will fail (@jdbachan did I get this right?).
-
Yes, default personas are undefined when their principal thread terminates. Glad you were able to solve this!
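To illustrate the lifetime point above without depending on UPC++ itself, here is a minimal standard-library-only sketch. It is an analogy, not UPC++ code: ActorModel and inFlightAfterJoinAndDrain are hypothetical names, and the queue stands in for a persona that belongs to the actor object instead of to the thread. Under that assumption, the queue is still valid after the worker thread has been joined, so the main thread can drain whatever the worker left behind.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical actor model (not a UPC++ type): the callback queue is a member
// of the actor, so its lifetime is tied to the actor, not to the worker thread.
struct ActorModel {
    std::mutex mutex;
    std::vector<std::function<void()>> queue;
    std::thread worker;
};

// Returns the number of callbacks still "in flight" after the worker has been
// joined and the main thread has drained the actor-owned queue.
inline int inFlightAfterJoinAndDrain() {
    std::atomic<int> inFlight{0};
    ActorModel actor;

    // The worker enqueues two callbacks and exits without running them.
    actor.worker = std::thread([&actor, &inFlight] {
        for (int i = 0; i < 2; ++i) {
            inFlight.fetch_add(1);
            std::lock_guard<std::mutex> lock(actor.mutex);
            actor.queue.push_back([&inFlight] { inFlight.fetch_sub(1); });
        }
    });
    actor.worker.join();
    assert(!actor.worker.joinable());  // the worker thread has terminated

    // The queue outlived the thread, so draining it here is well defined.
    std::vector<std::function<void()>> batch;
    {
        std::lock_guard<std::mutex> lock(actor.mutex);
        batch.swap(actor.queue);
    }
    for (auto &fn : batch) {
        fn();
    }
    return inFlight.load();
}
```

Calling inFlightAfterJoinAndDrain() returns 0 in this model. Had the queue been thread-local state of the worker, it would have been destroyed when the thread exited, which is roughly the situation described above for default personas.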
Hi Alex -
I suspect we'll need to either see more of your code or a small complete example to figure this out. Specific questions I have already:
For what it's worth, you should also insert a call to upcxx::progress() inside the scope of ps, beside the discharge, because discharge only makes internal-level progress and will not run any of your callbacks. Note that the call to progress in the outer loop won't progress personas that are not currently active on the stack of the calling thread.
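The effect described here can be modelled with a small standard-library-only sketch. It is an analogy under stated assumptions, not UPC++ code: CallbackQueue and remainingInFlight are hypothetical names, masterQueue stands in for the persona the calling thread already holds, and actorQueue stands in for an actor persona it does not hold. Draining only the first queue never runs the callbacks that decrement the counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for one persona's callback queue (not a UPC++ type).
class CallbackQueue {
public:
    void push(std::function<void()> fn) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(fn));
    }

    // Run every callback queued so far; return how many were executed.
    std::size_t drain() {
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (auto &fn : batch) {
            fn();
        }
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> pending_;
};

// Returns the in-flight count left after a bounded number of drain attempts.
inline int remainingInFlight(bool alsoDrainActorQueue) {
    std::atomic<int> lpcsInFlight{0};
    CallbackQueue masterQueue;  // models the persona the caller already holds
    CallbackQueue actorQueue;   // models an actor persona the caller does not hold

    // Three callbacks are queued on the actor's queue only.
    for (int i = 0; i < 3; ++i) {
        lpcsInFlight.fetch_add(1);
        actorQueue.push([&lpcsInFlight] { lpcsInFlight.fetch_sub(1); });
    }
    assert(lpcsInFlight.load() == 3);

    // Bounded loop so the model terminates even when nothing is drained.
    for (int attempt = 0; attempt < 1000 && lpcsInFlight.load() > 0; ++attempt) {
        masterQueue.drain();
        if (alsoDrainActorQueue) {
            actorQueue.drain();
        }
    }
    return lpcsInFlight.load();
}
```

In this model remainingInFlight(false) leaves all three callbacks pending, while remainingInFlight(true) reaches zero. That mirrors the advice above: progress has to be made while the persona that owns the callbacks is the one being serviced.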