- changed version to Development Branch
- marked as critical
- edited description
- changed milestone to 2020.9.0 release
- assigned issue to
Point of clarification: the code Rob mentions only produces that output starting in the (forthcoming) 2020.3.2 release (or develop).
It's probably too late to design and inject this into our forthcoming 2020.3.2 release, but this is a good idea and I think we can definitely provide something by the Sept release.
I think the only tricky bit is we probably wouldn't want to promise anything about the details of what's tracked for the shared heap in the specification, especially since those details might change in future releases. So this could be an "implementation-defined" extension that returns information in a format like Rob suggests, whose type is subject to change/breakage without notice in subsequent releases.
However, I'd rather we design a more general key/value-like query interface to fetch self-describing information about whatever statistics we have.
Example:
std::vector<std::pair<std::string, size_t>> upcxx::query_my_sheap_status();
where the return might look something like this:
{
{ "Shared heap size", 134217728 }, // this rank's total shared heap size in bytes
{ "Live user object count", 14 }, // shared objects currently allocated by client on this rank
{ "Live user object size", 4096 }, // their total size, in bytes, including allocator padding
{ "Live rdzv buffer count", 2 }, // same for rendezvous buffers
{ "Live rdzv buffer size", 128 },
{ "Live misc buffer count", 1 }, // same for misc buffers
{ "Live misc buffer size", 1024 },
}
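To make the proposal concrete, here is a rough sketch of how client code might consume such a result. Everything in it is hypothetical: mock_query_my_sheap_status() is a stand-in that just returns the example values above (it is not a UPC++ call), and stat_or(), live_bytes() and estimated_free_bytes() are illustrative helpers. The "free" estimate assumes the reported live sizes account for all usage and ignores fragmentation, so treat it as an upper bound at best.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type, matching the signature proposed above.
using sheap_status = std::vector<std::pair<std::string, std::size_t>>;

// Stand-in for the proposed query; NOT a real UPC++ function. It returns
// the example values from this comment so the sketch compiles on its own.
sheap_status mock_query_my_sheap_status() {
  return {
    { "Shared heap size",       134217728 },
    { "Live user object count", 14 },
    { "Live user object size",  4096 },
    { "Live rdzv buffer count", 2 },
    { "Live rdzv buffer size",  128 },
    { "Live misc buffer count", 1 },
    { "Live misc buffer size",  1024 },
  };
}

// Look up one statistic by name. Keys may come and go between releases,
// so the caller supplies a fallback for entries that are not reported.
std::size_t stat_or(const sheap_status &s, const std::string &key,
                    std::size_t fallback) {
  for (const auto &kv : s)
    if (kv.first == key) return kv.second;
  return fallback;
}

// True if `s` ends with `suffix`.
bool ends_with(const std::string &s, const std::string &suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sum every "Live ... size" entry without hard-coding the categories,
// so new buffer kinds added in later releases are picked up automatically.
std::size_t live_bytes(const sheap_status &s) {
  std::size_t total = 0;
  for (const auto &kv : s)
    if (kv.first.compare(0, 5, "Live ") == 0 && ends_with(kv.first, " size"))
      total += kv.second;
  return total;
}

// Rough estimate of remaining shared heap: total minus live bytes.
// Ignores fragmentation; clamps to zero rather than underflowing.
std::size_t estimated_free_bytes(const sheap_status &s) {
  const std::size_t heap = stat_or(s, "Shared heap size", 0);
  const std::size_t live = live_bytes(s);
  return heap > live ? heap - live : 0;
}
```

With the example values, live_bytes() sums 4096 + 128 + 1024 = 5248 bytes, leaving an estimated 134212480 bytes. Because clients would match on key strings, any renaming of keys between releases would silently change results, which is part of why the format would need to be flagged as subject to change.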
Thoughts?
CC: @Amir Kamil
This was discussed in today's Pagoda meeting and a joint session with the HipMer team.
Until now this issue has been "stalled" on reaching consensus for the semantics of providing a rich (probably dictionary-like) interface for insight into the shared heap state, which would provide programmatic access to the same "deep" insights currently only available via the upcxx::bad_shared_alloc::what() exception message (usually post-mortem). The crux of the semantic difficulty is providing enough information to meet client needs, without exposing too many details of the internal implementation that may be subject to change.
In discussion today the HipMer team indicated that for their purposes a simple query that allowed them to compute the total available shared heap memory would be sufficient to address one of their most important current problems. Deploying this ASAP should help address their immediate concerns with the lack of backpressure in the RPC rendezvous algorithm (issue #242), while we work on longer-term, more general solutions to that problem.
So here is the sketch of the proposed API I'm pursuing in the near-term:
Caveats (to be fleshed out later):
We'll probably eventually add similar queries to device_allocator, but that's not a high priority so may happen later.
CC: @Steven Hofmeyr