`upcxx::local_team_position()` returns incorrect results for discontiguous layouts

Issue #600 resolved
Dan Bonachea created an issue

The upcxx::local_team_position() call added in 2022.3.0 returns incorrect results for discontiguous process layouts, which are permitted only when UPC++ is configured with the non-default option --enable-discontig-ranks.

Simple demonstration on Perlmutter with test/local_team and ofi/debug (interleaved output sorted for clarity):

{pm[2]} env UPCXX_VERBOSE=1 srun -N 2 -n 3 -m cyclic test-local_team-ofi-orig
//////////////////////////////////////////////////
upcxx::init():
> 128 CPUs ARE NOT Oversubscribed: upcxx::progress() never yields to OS
> Shared heap statistics:
  max size: 0x8000000 (128 MB)
  min size: 0x8000000 (128 MB)
  P0 base:  0x7f2d067df000
> Local team statistics:
  local teams = 3
  min rank_n = 1
  max rank_n = 1
  min discontig_rank = 0
> WARNING: All local team's are singletons. Memory sharing between ranks will never succeed.
> WARNING: One or more processes (including rank 0) are co-located in a GASNet neighborhood with discontiguous rank IDs. As a result, these ranks will use a singleton local_team().
This generally arises when the job spawner is directed to assign processes to nodes in a manner other than pure-blocked layout.
For details, see issue #438
//////////////////////////////////////////////////
UPCXX: Process   0/3 (local_team: 1 rank)  on nid001320 (128 processors)
UPCXX: Process   1/3 (local_team: 1 rank)  on nid001432 (128 processors)
UPCXX: Process   2/3 (local_team: 1 rank)  on nid001320 (128 processors)
Test: local_team.cpp
Ranks: 3
[0] local_team: 0/1 (position: 0/2): nid001320
[1] local_team: 0/1 (position: 1/2): nid001432
[2] local_team: 0/1 (position: 0/2): nid001320
*** FATAL ERROR (proc 0): 
//////////////////////////////////////////////////////////////////////
UPC++ assertion failure:
 on process 0 (nid001320)
 at <redacted>/upcxx/test/local_team.cpp:60
 in function: int main()

result=4 expect=3

To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////

This run has three UPC++ local_teams. local_team_position().second currently reports the number of GASNet neighborhoods (two in this example). That is wrong for any run with a discontiguous process layout, where the number of local teams differs from the number of neighborhoods; the correct value in this example is three.

Here is a more complicated example, with four local teams but still only two GASNet neighborhoods:

{pm[2]} env UPCXX_VERBOSE=1 srun -N 2 -n 5 -m plane=2 test-local_team-ofi
//////////////////////////////////////////////////
upcxx::init():
> 128 CPUs ARE NOT Oversubscribed: upcxx::progress() never yields to OS
> Shared heap statistics:
  max size: 0x8000000 (128 MB)
  min size: 0x8000000 (128 MB)
  P0 base:  0x7f3a49b9d000
> Local team statistics:
  local teams = 4
  min rank_n = 1
  max rank_n = 2
  min discontig_rank = 0
> WARNING: One or more processes (including rank 0) are co-located in a GASNet neighborhood with discontiguous rank IDs. As a result, these ranks will use a singleton local_team().
This generally arises when the job spawner is directed to assign processes to nodes in a manner other than pure-blocked layout.
For details, see issue #438
//////////////////////////////////////////////////
UPCXX: Process   0/5 (local_team: 1 rank)  on nid001320 (128 processors)
UPCXX: Process   1/5 (local_team: 1 rank)  on nid001320 (128 processors)
UPCXX: Process 2-3/5 (local_team: 2 ranks) on nid001432 (128 processors)
UPCXX: Process   4/5 (local_team: 1 rank)  on nid001320 (128 processors)
Test: local_team.cpp
Ranks: 5
[0] local_team: 0/1 (position: 0/2): nid001320
[1] local_team: 0/1 (position: 0/2): nid001320
[2] local_team: 0/2 (position: 1/2): nid001432
[3] local_team: 1/2 (position: 1/2): nid001432
[4] local_team: 0/1 (position: 0/2): nid001320
*** FATAL ERROR (proc 0): 
//////////////////////////////////////////////////////////////////////
UPC++ assertion failure:
 on process 0 (nid001320)
 at <redacted>/upcxx/test/local_team.cpp:60
 in function: int main()

result=5 expect=3

To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////
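
To make the mismatch concrete, here is a minimal standalone sketch (plain C++, no UPC++ calls) that counts neighborhoods and local teams for a given rank-to-host layout. It assumes the rule implied by the warning text in the logs above: ranks in a neighborhood with contiguous rank IDs share one local team, while every rank in a neighborhood with discontiguous rank IDs falls back to a singleton local team. Host names stand in for GASNet neighborhoods, and `LayoutSummary` and `summarize_layout` are hypothetical names used only for illustration; this is not the runtime's actual implementation.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical summary type for this sketch; not part of UPC++.
struct LayoutSummary {
    std::size_t neighborhoods;  // distinct hosts (stand-in for GASNet neighborhoods)
    std::size_t local_teams;    // local teams under the assumed singleton-fallback rule
};

// hosts[r] is the host name of rank r.
LayoutSummary summarize_layout(const std::vector<std::string>& hosts) {
    // For each host, track the first and last rank seen and how many ranks it holds.
    struct Span { std::size_t first, last, count; };
    std::map<std::string, Span> by_host;
    for (std::size_t r = 0; r < hosts.size(); ++r) {
        auto it = by_host.find(hosts[r]);
        if (it == by_host.end()) {
            by_host.emplace(hosts[r], Span{r, r, 1});
        } else {
            it->second.last = r;
            ++it->second.count;
        }
    }

    LayoutSummary s{by_host.size(), 0};
    for (const auto& kv : by_host) {
        const Span& sp = kv.second;
        // Rank IDs are contiguous exactly when the span length equals the rank count.
        const bool contiguous = (sp.last - sp.first + 1 == sp.count);
        // Assumed rule: contiguous neighborhood -> one local team;
        // discontiguous neighborhood -> one singleton local team per rank.
        s.local_teams += contiguous ? 1 : sp.count;
    }
    return s;
}
```

Applied to the two layouts shown above, this sketch gives 2 neighborhoods and 3 local teams for the `-n 3 -m cyclic` run, and 2 neighborhoods and 4 local teams for the `-n 5 -m plane=2` run, matching the "local teams" lines in the verbose output. In both cases the neighborhood count is the value currently reported by local_team_position().second.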

This defect should only affect the return value of the local_team_position() function, which is not currently "consumed" elsewhere in the runtime.

Discontiguous layouts are currently strongly discouraged, so this bug is considered minor.
