local_team requires consecutive ranks
It is currently the case that our implementation of upcxx::local_team() only groups together processes with consecutively numbered ranks.

In the extreme case of cyclic (aka round-robin) assignment of ranks to hosts, one may have dozens or hundreds of processes on the same host, each a member of a distinct "singleton" local_team. While GASNet-EX will still use shared-memory paths for communication among processes on the same host, there are optimizations within the UPC++ runtime that will not be applied, and the application is denied use of global_ptr::local() where it would otherwise be available.
The GASNet-level mechanisms for job spawning that live below upcxx-run are designed to request consecutive assignment of ranks on each host by default, ensuring that a single local_team spanning all processes on a host is the common case. Consequently, this implementation property is not a high priority to fix.
We have briefly discussed the addition of a warning at job start if rank assignment prevents creation of a single local_team per host. The outcome was a recognition that doing so "well" will probably require adding (at least) a reduction collective in the startup code (and doing it poorly would involve a non-scalable scan of data linear in the job size).
Comments (4)
reporter:
The original description is not quite right.
First, UPC++ only considers the GASNet neighborhood (the domain for the shared-memory-bypass transport) when deciding on boundaries for local_team, which will never cross a GASNet neighborhood boundary. The GASNet neighborhood defaults to including all processes co-located on a given host (making these two boundaries identical). However, GASNet provides non-default configure and envvar knobs that can result in the neighborhood being a subset of a host. Such cases always result in at least one local_team per neighborhood on such a host, and there is never shared-memory bypass (at the UPC++ or GASNet level) between ranks sharing a host but appearing in different GASNet neighborhoods.

The other wrinkle involves the details of UPC++'s fallback behavior in the presence of discontiguous rank assignment across nodes. The actual behavior (up to and including version 2020.11.0) is that any GASNet neighborhood containing a discontiguous "run" of GASNet jobranks results in all members of that neighborhood reverting to a degenerate singleton local_team(). This means that even block-cyclic process layouts across nodes can result in this (correct but degenerate) singleton local_team() behavior for all processes landing in such neighborhoods.
Users who are trying to debug a process placement or local_team layout issue are highly recommended to spawn using upcxx-run -vv (or UPCXX_VERBOSE=1, which activates a relevant subset of this output) to get console output reporting the process layout and local_team boundaries.

As of de53f0b, upcxx-run -vv will now report when degenerate singleton local_team()s have been activated due to discontiguous rank ids in the process neighborhood. Sample upcxx-run -vv output for a discontiguous spawn:

```
$ upcxx-run -n 8 -vv ./a.out
...
//////////////////////////////////////////////////
upcxx::init():
> CPUs Oversubscribed: no "upcxx::progress() never yields to OS"
> Shared heap statistics:
  max size: 0x8000000 (128 MB)
  min size: 0x8000000 (128 MB)
  P0 base: 0x7ff317b79000
> Local team statistics:
  local teams = 5
  min rank_n = 1
  max rank_n = 2
  min discontig_rank = 2
> WARNING: One or more processes (including rank 2) are co-located in a
  GASNet neighborhood with discontiguous rank IDs. As a result, these
  ranks will use a singleton local_team(). This generally arises when
  the job spawner is directed to assign processes to nodes in a manner
  other than pure-blocked layout. For details, see issue #438
//////////////////////////////////////////////////
UPCXX: Process 0/8 (local_team: 0/2) on pcp-d-6 (16 processors)
UPCXX: Process 1/8 (local_team: 1/2) on pcp-d-6 (16 processors)
UPCXX: Process 4/8 (local_team: 1/2) on pcp-d-5 (16 processors)
UPCXX: Process 3/8 (local_team: 0/2) on pcp-d-5 (16 processors)
UPCXX: Process 2/8 (local_team: 0/1) on pcp-d-16 (16 processors)
UPCXX: Process 5/8 (local_team: 0/1) on pcp-d-16 (16 processors)
UPCXX: Process 6/8 (local_team: 0/2) on pcp-d-15 (16 processors)
UPCXX: Process 7/8 (local_team: 1/2) on pcp-d-15 (16 processors)
...
```
This warning will ONLY print when using upcxx-run -vv, which explicitly requests that job spawn information be reported to the console.
- changed status to wontfix
As of GASNet 90817e7 (currently in the stable branch, to appear in the spring 2021 release), udp-conduit defaults to a process rank assignment that is sensitive to host, which should remove the last source of "random" rank assignments that could generate discontiguous rank assignments and degenerate singleton local_team() in UPC++.

System spawners such as SLURM srun, Cray aprun and jsrun can still be used to force discontiguous rank assignments and trigger this behavior (e.g. assigning ranks cyclically across compute nodes), but this should not happen by default (rank assignments usually default to pure-blocked by compute node). IMO a user who explicitly requests such a layout essentially gets what they deserve with respect to local_team membership, and we should not add overhead to operation under normal/expected layouts in order to diagnose or slightly improve behavior for such corner-case layouts.