mmap failure on UDP testing environment with 3 nodes

Issue #337 resolved
Matthew created an issue

Hello,

I am compiling the hello-world example with the following compile line:

upcxx -g -network=udp hello-world.cpp 

and running it with the following:

upcxx-run -backtrace -n 3 -N 3 ./a.out

Whether I copy a.out to each node or compile on each node separately, the result is:

upcxx-run -backtrace -n 3 -N 3 ./a.out
*** FATAL ERROR (proc 1): Failed to mmap 12 MB for intra-node shared memory communication, errno=No such file or directory(2)
[1] Invoking EXECINFO for backtrace...
*** FATAL ERROR (proc 2): Failed to mmap 12 MB for intra-node shared memory communication, errno=No such file or directory(2)
[2] Invoking EXECINFO for backtrace...
[1] 0: ./a.out(+0xb7af5) [0x55a315182af5] ?? ??:0
[1] 1: ./a.out(+0xb8437) [0x55a315183437] ?? ??:0
[1] 2: ./a.out(+0xb8ad5) [0x55a315183ad5] ?? ??:0
[1] 3: ./a.out(+0xb63e5) [0x55a3151813e5] ?? ??:0
[1] 4: ./a.out(+0xb658f) [0x55a31518158f] ?? ??:0
[1] 5: ./a.out(+0xd328a) [0x55a31519e28a] ?? ??:0
[1] 6: ./a.out(+0x812be) [0x55a31514c2be] ?? ??:0
[1] 7: ./a.out(+0x82735) [0x55a31514d735] ?? ??:0
[1] 8: ./a.out(+0x16cb8) [0x55a3150e1cb8] ?? ??:0
[1] 9: ./a.out(+0xb19a) [0x55a3150d619a] ?? ??:0
[1] 10: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f060e793b97] ?? ??:0
[1] 11: ./a.out(+0xb0ca) [0x55a3150d60ca] ?? ??:0
[2] 0: ./a.out(+0xb7af5) [0x56018a9ebaf5] ?? ??:0
[2] 1: ./a.out(+0xb8437) [0x56018a9ec437] ?? ??:0
[2] 2: ./a.out(+0xb8ad5) [0x56018a9ecad5] ?? ??:0
[2] 3: ./a.out(+0xb63e5) [0x56018a9ea3e5] ?? ??:0
[2] 4: ./a.out(+0xb658f) [0x56018a9ea58f] ?? ??:0
[2] 5: ./a.out(+0xd328a) [0x56018aa0728a] ?? ??:0
[2] 6: ./a.out(+0x812be) [0x56018a9b52be] ?? ??:0
[2] 7: ./a.out(+0x82735) [0x56018a9b6735] ?? ??:0
[2] 8: ./a.out(+0x16cb8) [0x56018a94acb8] ?? ??:0
[2] 9: ./a.out(+0xb19a) [0x56018a93f19a] ?? ??:0
[2] 10: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f6bd2fbbb97] ?? ??:0
[2] 11: ./a.out(+0xb0ca) [0x56018a93f0ca] ?? ??:0
bash: line 1:  3707 Aborted                 (core dumped) env 'AMUDP_SLAVE_ARGS=1,ubuntu-server-1:41859,' './a.out'
bash: line 1:  6063 Aborted                 (core dumped) env 'AMUDP_SLAVE_ARGS=1,ubuntu-server-1:41859,' './a.out'

Note that ubuntu-server-1 is the machine that upcxx-run is called on. Each node runs the serial version fine alone. The results are the same when any node is the upcxx-run caller.

Version:

UPC++ version 20190900L  / gex-2019.9.0
Copyright (c) 2019, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
https://upcxx.lbl.gov

g++ (Ubuntu 8.3.0-6ubuntu1~18.04.1) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Running on a single node (SMP) works:

upcxx-run -backtrace -n 4 -N 1 ./a.out
Hello world from process 0 out of 4 processes
Hello world from process 1 out of 4 processes
Hello world from process 3 out of 4 processes
Hello world from process 2 out of 4 processes

Each testing node (total 3) is ubuntu-server-18.0.3 LTS with two cores and 8 GB of RAM.

EDIT: Wording and typos

Comments (8)

  1. Dan Bonachea

    Hi Matthew -

    We suspect this may be an issue we've already fixed in the current development version (to be officially released in 2 weeks). Do all your compute nodes have the same hostname? If so, giving them distinct hostnames may be the simplest workaround (and a good general practice).

    If that's not possible or ineffective, can you please try running after setting the following temporary workaround:
    export GASNET_SUPERNODE_MAXSIZE=1
    This will disable all shared-memory bypass communication (which substantially hurts performance when any processes are co-located, but it sounds like you are trying to run one process per node at the moment). If that's ineffective, the same experiment can instead be done at UPC++ install time with GASNET_CONFIGURE_ARGS=--disable-pshm

  2. Matthew reporter

    AFTER-NOTE: The bottom of the comment has a working solution, partly due to your suggestion. The middle parts of the comment show the process it took to arrive at the workaround and are included for any reader's benefit (if there is any benefit to be had).

    Each node has a different hostname (yes, good practice) and static IP.

    When using:

    export GASNET_SUPERNODE_MAXSIZE=1
    

    before building and running, the following is the output (same version as mentioned before).

    upcxx-run -backtrace -n 3 -N 3 ./a.out
    *** GASNET WARNING(Node 0): int sendPacket(ep_t, amudp_msg_t*, size_t, en_t, packet_type) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
      from function sendPacket
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/other/amudp/amudp_reqrep.cpp:112
      reason: Invalid argument
    
    *** GASNET WARNING(Node 0): int AMUDP_RequestGeneric(amudp_category_t, ep_t, amudp_node_t, handler_t, void*, size_t, uintptr_t, int, __va_list_tag*, uint8_t, uint8_t) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/other/amudp/amudp_reqrep.cpp:1045
    
    GASNet gasnetc_AMRequestShort encountered an AM Error: AM_ERR_RESOURCE(3)
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/udp-conduit/gasnet_core.c:827
    *** WARNING (proc 0): GASNet gasnetc_AMRequestShort returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/udp-conduit/gasnet_core.c:829
    

    After setting the following and then running:

    export GASNET_SUPERNODE_MAXSIZE=1
    export GASNET_VERBOSEENV=1
    

    The output is:

    upcxx-run -backtrace -n 3 -N 3 ./a.out
    ENV parameter: GASNET_NETWORKDEPTH = 128                        (default)
    ENV parameter: GASNET_SPAWNFN = S                               (default)
    GASNET: master host name: ubuntu-server-1
    ENV parameter: GASNET_MASTERIP = *empty*                        (default)
    ENV parameter: GASNET_WORKERIP = *empty*                        (default)
    ENV parameter: GASNET_ROUTE_OUTPUT = 1                          (default)
    ENV parameter: GASNET_SSH_SERVERS = developer@ubuntu-server-1,developer@ubuntu-server-2,developer@ubuntu-server-3
    ENV parameter: GASNET_SSH_REMOTE_PATH = /home/developer         (default)
    ENV parameter: GASNET_SSH_CMD = ssh                             (default)
    ENV parameter: GASNET_SSH_OPTIONS = *empty*                     (default)
    ENV parameter: GASNET_ENV_CMD = env                             (default)
    GASNET: system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  developer@ubuntu-server-1 "echo connected to \$HOST... ; cd '/home/developer' ; env 'AMUDP_SLAVE_ARGS=2,ubuntu-server-1:41793,' './a.out' "  || ( echo "connection to developer@ubuntu-server-1 failed." ; kill 2093 ) &)
    GASNET: system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  developer@ubuntu-server-2 "echo connected to \$HOST... ; cd '/home/developer' ; env 'AMUDP_SLAVE_ARGS=2,ubuntu-server-1:41793,' './a.out' "  || ( echo "connection to developer@ubuntu-server-2 failed." ; kill 2093 ) &)
    GASNET: system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  developer@ubuntu-server-3 "echo connected to \$HOST... ; cd '/home/developer' ; env 'AMUDP_SLAVE_ARGS=2,ubuntu-server-1:41793,' './a.out' "  || ( echo "connection to developer@ubuntu-server-3 failed." ; kill 2093 ) &)
    connected to ...
    connected to ...
    connected to ...
    ENV parameter: GASNET_LINEBUFFERSZ = 1024                       (default)
    GASNET: slave connecting to 127.0.1.1:41793
    GASNET: slave using IP 127.0.0.1
    GASNET: slave connecting to 192.168.2.201:41793
    GASNET: slave using IP 192.168.2.203
    GASNET: slave connecting to 192.168.2.201:41793
    GASNET: slave using IP 192.168.2.202
    GASNET: Endpoint table (nproc=3):
    GASNET:  P#0:   (127.0.0.1:45782)       tag: 0x7f0001010000082d
    GASNET:  P#1:   (192.168.2.203:39829)   tag: 0x7f0001010001082d
    GASNET:  P#2:   (192.168.2.202:37189)   tag: 0x7f0001010002082d
    ENV parameter: GASNET_FAULT_RATE = 0.0                          (default)
    ENV parameter: GASNET_RECVDEPTH = 512                           (default)
    ENV parameter: GASNET_SENDDEPTH = 256                           (default)
    ENV parameter: GASNET_REQUESTTIMEOUT_MAX = 30000000             (default)
    ENV parameter: GASNET_REQUESTTIMEOUT_INITIAL = 100000           (default)
    ENV parameter: GASNET_REQUESTTIMEOUT_BACKOFF = 2                (default)
    GASNET(Node 0): UDP SO_RCVBUF buffer successfully set to 413039 bytes
    GASNET(Node 0): UDP SO_SNDBUF buffer successfully set to 413039 bytes
    ENV parameter: GASNET_FS_SYNC = NO                              (default)
    GASNET(Node 0): Slave 0/3 starting (tag=0x7f0001010000082d)...
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INITVAL = NAN                      (default)
    ENV parameter: GASNET_MALLOC_CLOBBER = NO                       (default)
    ENV parameter: GASNET_MALLOC_CLOBBERVAL = NAN                   (default)
    ENV parameter: GASNET_MALLOC_LEAKALL = NO                       (default)
    ENV parameter: GASNET_MALLOC_SCANFREED = NO                     (default)
    ENV parameter: GASNET_MALLOC_EXTRACHECK = NO                    (default)
    ENV parameter: GASNET_DISABLE_ARGDECODE = NO                    (default)
    ENV parameter: GASNET_DISABLE_ENVDECODE = NO                    (default)
    ENV parameter: GASNET_BACKTRACE = YES
    ENV parameter: GASNET_BACKTRACE_NODES = *not set*               (default)
    ENV parameter: GASNET_TMPDIR = *not set*                        (default)
    ENV parameter: TMPDIR = *not set*                               (default)
    GASNET(Node 2): UDP SO_RCVBUF buffer successfully set to 413039 bytes
    GASNET(Node 2): UDP SO_SNDBUF buffer successfully set to 413039 bytes
    GASNET(Node 2): Slave 2/3 starting (tag=0x7f0001010002082d)...
    GASNET(Node 1): UDP SO_RCVBUF buffer successfully set to 413039 bytes
    GASNET(Node 1): UDP SO_SNDBUF buffer successfully set to 413039 bytes
    GASNET(Node 1): Slave 1/3 starting (tag=0x7f0001010001082d)...
    ENV parameter: GASNET_BACKTRACE_TYPE = EXECINFO                 (default)
    ENV parameter: GASNET_FREEZE_ON_ERROR = NO                      (default)
    ENV parameter: GASNET_FREEZE_SIGNAL = *not set*                 (default)
    ENV parameter: GASNET_BACKTRACE_SIGNAL = *not set*              (default)
    ENV parameter: GASNET_TRACEFILE = *empty*                       (default)
    ENV parameter: GASNET_STATSFILE = *empty*                       (default)
    ENV parameter: GASNET_TRACEFLUSH = NO                           (default)
    ENV parameter: GASNET_TRACELOCAL = YES                          (default)
    ENV parameter: GASNET_MALLOCFILE = *empty*                      (default)
    ENV parameter: GASNET_NODEMAP_EXACT = YES                       (default)
    ENV parameter: GASNET_SUPERNODE_MAXSIZE = 1
    ENV parameter: GASNET_PSHM_NETWORK_DEPTH = 32                   (default)
    ENV parameter: GASNET_BARRIER = DISSEM                          (default)
    ENV parameter: GASNET_COLL_MIN_SCRATCH_SIZE = 1 KB              (default)
    ENV parameter: GASNET_COLL_SCRATCH_SIZE = 2 MB                  (default)
    ENV parameter: GASNET_MAX_SEGSIZE = 128MB/P (128 MB)
    ENV parameter: GASNET_CATCH_EXIT = YES                          (default)
    ENV parameter: GASNET_DISABLE_MUNMAP = NO                       (default)
    ENV parameter: GASNET_FS_SYNC = NO                              (default)
    ENV parameter: GASNET_MAX_THREADS = 1                           (default)
    ENV parameter: GASNET_PSHM_BARRIER_HIER = YES                   (default)
    ENV parameter: GASNET_PSHM_BARRIER_RADIX = 0                    (default)
    ENV parameter: GASNET_COLL_P2P_EAGER_MIN = 16                   (default)
    ENV parameter: GASNET_COLL_P2P_EAGER_SCALE = 16                 (default)
    ENV parameter: GASNET_COLL_ROOTED_GEOM = KNOMIAL_TREE,2         (default)
    ENV parameter: GASNET_COLL_BROADCAST_GEOM = KNOMIAL_TREE,2      (default)
    ENV parameter: GASNET_COLL_SCATTER_GEOM = KNOMIAL_TREE,2        (default)
    ENV parameter: GASNET_COLL_GATHER_GEOM = KNOMIAL_TREE,2         (default)
    ENV parameter: GASNET_COLL_GATHER_ALL_DISSEM_LIMIT_PER_THREAD = 1 KB   (default)
    ENV parameter: GASNET_COLL_GATHER_ALL_DISSEM_LIMIT = 1 KB       (default)
    ENV parameter: GASNET_COLL_EXCHANGE_DISSEM_LIMIT_PER_THREAD = 1 KB   (default)
    ENV parameter: GASNET_COLL_EXCHANGE_DISSEM_LIMIT = 1 KB         (default)
    ENV parameter: GASNET_COLL_EXCHANGE_DISSEM_RADIX = 2            (default)
    ENV parameter: GASNET_COLL_PIPE_SEG_SIZE = 21 KB                (default)
    ENV parameter: GASNET_COLL_AUTOTUNE_WARM_ITERS = 5              (default)
    ENV parameter: GASNET_COLL_AUTOTUNE_PERF_ITERS = 10             (default)
    ENV parameter: GASNET_COLL_AUTOTUNE_ALLOW_FLAT_TREE = 1         (default)
    ENV parameter: GASNET_COLL_TUNING_FILE = *not set*              (default)
    ENV parameter: GASNET_COLL_PRINT_AUTOTUNE_TIMER = NO            (default)
    ENV parameter: GASNET_COLL_PRINT_COLL_ALG = NO                  (default)
    ENV parameter: GASNET_COLL_ENABLE_SEARCH = NO                   (default)
    ENV parameter: GASNET_COLL_ENABLE_PROFILE = NO                  (default)
    ENV parameter: GASNET_COLL_REDUCE_GEOM = KNOMIAL_TREE,2         (default)
    ENV parameter: GASNET_VIS_AMPIPE = YES                          (default)
    ENV parameter: GASNET_VIS_MAXCHUNK = 63 KB                      (default)
    ENV parameter: GASNET_VIS_PUT_MAXCHUNK = 63 KB                  (default)
    ENV parameter: GASNET_VIS_GET_MAXCHUNK = 63 KB                  (default)
    ENV parameter: GASNET_VIS_REMOTECONTIG = YES                    (default)
    *** GASNET WARNING(Node 0): int sendPacket(ep_t, amudp_msg_t*, size_t, en_t, packet_type) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
      from function sendPacket
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/other/amudp/amudp_reqrep.cpp:112
      reason: Invalid argument
    
    *** GASNET WARNING(Node 0): int AMUDP_RequestGeneric(amudp_category_t, ep_t, amudp_node_t, handler_t, void*, size_t, uintptr_t, int, __va_list_tag*, uint8_t, uint8_t) returning an error code: AM_ERR_RESOURCE (Problem with requested resource)
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/other/amudp/amudp_reqrep.cpp:1045
    
    GASNet gasnetc_AMRequestShort encountered an AM Error: AM_ERR_RESOURCE(3)
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/udp-conduit/gasnet_core.c:827
    *** WARNING (proc 0): GASNet gasnetc_AMRequestShort returning an error code: GASNET_ERR_RESOURCE (Problem with requested resource)
      at /home/developer/upcxx_build/.nobs/art/9c603a00da2b66ea76d2e16681946b807808ad41/GASNet-2019.9.0/udp-conduit/gasnet_core.c:829
    

    When I run:

    export GASNET_MASTERIP=192.168.2.201
    export GASNET_VERBOSEENV=1
    

    and not

    export GASNET_SUPERNODE_MAXSIZE=1
    

    I get:

    upcxx-run -backtrace -n 3 -N 3 ./a.out
    ENV parameter: GASNET_NETWORKDEPTH = 128                        (default)
    ENV parameter: GASNET_SPAWNFN = S                               (default)
    GASNET: master host name: ubuntu-server-1
    ENV parameter: GASNET_MASTERIP = 192.168.2.201
    ENV parameter: GASNET_WORKERIP = *empty*                        (default)
    ENV parameter: GASNET_ROUTE_OUTPUT = 1                          (default)
    ENV parameter: GASNET_SSH_SERVERS = developer@ubuntu-server-1,developer@ubuntu-server-2,developer@ubuntu-server-3
    ENV parameter: GASNET_SSH_REMOTE_PATH = /home/developer         (default)
    ENV parameter: GASNET_SSH_CMD = ssh                             (default)
    ENV parameter: GASNET_SSH_OPTIONS = *empty*                     (default)
    ENV parameter: GASNET_ENV_CMD = env                             (default)
    GASNET: system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  developer@ubuntu-server-1 "echo connected to \$HOST... ; cd '/home/developer' ; env 'AMUDP_SLAVE_ARGS=2,192,168,2,201,142,221,' './a.out' "  || ( echo "connection to developer@ubuntu-server-1 failed." ; kill 2926 ) &)
    GASNET: system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  developer@ubuntu-server-2 "echo connected to \$HOST... ; cd '/home/developer' ; env 'AMUDP_SLAVE_ARGS=2,192,168,2,201,142,221,' './a.out' "  || ( echo "connection to developer@ubuntu-server-2 failed." ; kill 2926 ) &)
    GASNET: system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  developer@ubuntu-server-3 "echo connected to \$HOST... ; cd '/home/developer' ; env 'AMUDP_SLAVE_ARGS=2,192,168,2,201,142,221,' './a.out' "  || ( echo "connection to developer@ubuntu-server-3 failed." ; kill 2926 ) &)
    connected to ...
    connected to ...
    GASNET: slave connecting to 192.168.2.201:36573
    ENV parameter: GASNET_LINEBUFFERSZ = 1024                       (default)
    GASNET: slave using IP 192.168.2.203
    GASNET: slave connecting to 192.168.2.201:36573
    GASNET: slave using IP 192.168.2.201
    connected to ...
    GASNET: slave connecting to 192.168.2.201:36573
    GASNET: slave using IP 192.168.2.202
    GASNET: Endpoint table (nproc=3):
    GASNET:  P#0:   (192.168.2.203:45500)   tag: 0xc0a802c900000b6e
    GASNET:  P#1:   (192.168.2.201:38710)   tag: 0xc0a802c900010b6e
    GASNET:  P#2:   (192.168.2.202:52320)   tag: 0xc0a802c900020b6e
    GASNET(Node 1): UDP SO_RCVBUF buffer successfully set to 413039 bytes
    GASNET(Node 1): UDP SO_SNDBUF buffer successfully set to 413039 bytes
    GASNET(Node 1): Slave 1/3 starting (tag=0xc0a802c900010b6e)...
    GASNET(Node 2): UDP SO_RCVBUF buffer successfully set to 413039 bytes
    GASNET(Node 2): UDP SO_SNDBUF buffer successfully set to 413039 bytes
    ENV parameter: GASNET_FAULT_RATE = 0.0                          (default)
    ENV parameter: GASNET_RECVDEPTH = 512                           (default)
    ENV parameter: GASNET_SENDDEPTH = 256                           (default)
    ENV parameter: GASNET_REQUESTTIMEOUT_MAX = 30000000             (default)
    ENV parameter: GASNET_REQUESTTIMEOUT_INITIAL = 100000           (default)
    ENV parameter: GASNET_REQUESTTIMEOUT_BACKOFF = 2                (default)
    GASNET(Node 0): UDP SO_RCVBUF buffer successfully set to 413039 bytes
    GASNET(Node 0): UDP SO_SNDBUF buffer successfully set to 413039 bytes
    ENV parameter: GASNET_FS_SYNC = NO                              (default)
    GASNET(Node 0): Slave 0/3 starting (tag=0xc0a802c900000b6e)...
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    ENV parameter: GASNET_MALLOC_INIT = NO                          (default)
    GASNET(Node 2): Slave 2/3 starting (tag=0xc0a802c900020b6e)...
    ENV parameter: GASNET_MALLOC_INITVAL = NAN                      (default)
    ENV parameter: GASNET_MALLOC_CLOBBER = NO                       (default)
    ENV parameter: GASNET_MALLOC_CLOBBERVAL = NAN                   (default)
    ENV parameter: GASNET_MALLOC_LEAKALL = NO                       (default)
    ENV parameter: GASNET_MALLOC_SCANFREED = NO                     (default)
    ENV parameter: GASNET_MALLOC_EXTRACHECK = NO                    (default)
    ENV parameter: GASNET_DISABLE_ARGDECODE = NO                    (default)
    ENV parameter: GASNET_DISABLE_ENVDECODE = NO                    (default)
    ENV parameter: GASNET_BACKTRACE = YES
    ENV parameter: GASNET_BACKTRACE_NODES = *not set*               (default)
    ENV parameter: GASNET_TMPDIR = *not set*                        (default)
    ENV parameter: TMPDIR = *not set*                               (default)
    ENV parameter: GASNET_BACKTRACE_TYPE = EXECINFO                 (default)
    ENV parameter: GASNET_FREEZE_ON_ERROR = NO                      (default)
    ENV parameter: GASNET_FREEZE_SIGNAL = *not set*                 (default)
    ENV parameter: GASNET_BACKTRACE_SIGNAL = *not set*              (default)
    ENV parameter: GASNET_TRACEFILE = *empty*                       (default)
    ENV parameter: GASNET_STATSFILE = *empty*                       (default)
    ENV parameter: GASNET_TRACEFLUSH = NO                           (default)
    ENV parameter: GASNET_TRACELOCAL = YES                          (default)
    ENV parameter: GASNET_MALLOCFILE = *empty*                      (default)
    ENV parameter: GASNET_NODEMAP_EXACT = YES                       (default)
    ENV parameter: GASNET_SUPERNODE_MAXSIZE = 0                     (default)
    ENV parameter: GASNET_PSHM_NETWORK_DEPTH = 32                   (default)
    *** FATAL ERROR (proc 1): Failed to mmap 12 MB for intra-node shared memory communication, errno=No such file or directory(2)
    [1] Invoking EXECINFO for backtrace...
    [1] 0: ./a.out(+0xb8950) [0x562c66238950] ?? ??:0
    [1] 1: ./a.out(+0xb9292) [0x562c66239292] ?? ??:0
    [1] 2: ./a.out(+0xb9930) [0x562c66239930] ?? ??:0
    [1] 3: ./a.out(+0xb7240) [0x562c66237240] ?? ??:0
    [1] 4: ./a.out(+0xb73ea) [0x562c662373ea] ?? ??:0
    [1] 5: ./a.out(+0xd430e) [0x562c6625430e] ?? ??:0
    [1] 6: ./a.out(+0x81ef2) [0x562c66201ef2] ?? ??:0
    [1] 7: ./a.out(+0x8336b) [0x562c6620336b] ?? ??:0
    [1] 8: ./a.out(+0x16d9c) [0x562c66196d9c] ?? ??:0
    [1] 9: ./a.out(+0xb1bf) [0x562c6618b1bf] ?? ??:0
    [1] 10: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fd32a5b9b97] ?? ??:0
    [1] 11: ./a.out(+0xb0ca) [0x562c6618b0ca] ?? ??:0
    bash: line 1:  3011 Aborted                 (core dumped) env 'AMUDP_SLAVE_ARGS=2,192,168,2,201,142,221,' './a.out'
    

    Finally, when using (the master node's IP is 192.168.2.201)

    export GASNET_SUPERNODE_MAXSIZE=1
    export GASNET_MASTERIP=192.168.2.201
    

    I get correct output:

    upcxx-run -backtrace -n 3 -N 3 ./a.out
    Hello world from process 0 out of 3 processes
    Hello world from process 1 out of 3 processes
    Hello world from process 2 out of 3 processes
    

    Your suggested solution of export GASNET_SUPERNODE_MAXSIZE=1 led me to more errors which I was able to solve (no doubt, you would have been able to help had I not solved the problem). It appears that two env vars are needed to get this working: GASNET_SUPERNODE_MAXSIZE=1 and GASNET_MASTERIP=192.168.2.201 (which is the IP address of the master node).

    Based on what you have said, can I look forward to the next release removing the need for export GASNET_SUPERNODE_MAXSIZE=1? The next step is multiple processes per node. Tomorrow evening I will make an attempt to use a snapshot.

    NOTE: UPCXX may turn out to be the greatest thing to happen to C++ since std::shared_ptr, keep up the great work!

  3. Dan Bonachea

    Glad to hear we are making progress! It sounds like we have multiple independent problems here.

    These lines:

    GASNET:  P#0:   (127.0.0.1:45782)       tag: 0x7f0001010000082d
    GASNET:  P#1:   (192.168.2.203:39829)   tag: 0x7f0001010001082d
    GASNET:  P#2:   (192.168.2.202:37189)   tag: 0x7f0001010002082d
    

    indicate that one of your worker processes (likely the one on ubuntu-server-1) is resolving its own hostname to the localhost network (127.0.0.1) instead of the shared network (192.168.2.x), leaving one worker process that cannot communicate with the others. The new release diagnoses this with a better error message, but a workaround is still required. The simplest one is GASNET_MASTERIP=192.168.2.201, as you've done; the alternative is to fix DNS on your master so that it resolves its own hostname to the shared network (e.g., by editing /etc/hosts).
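
One way to check for the loopback resolution described above is to ask the system resolver what each node's own hostname maps to. The sketch below is only an illustration: the function names `classify_ip`, `resolve_first_ip`, and `check_own_hostname` are hypothetical, it assumes a glibc-based Linux where `getent ahostsv4` is available, and it reports what the system resolver returns, which may not be exactly the lookup GASNet performs.

```shell
# Classify an IPv4 address string: anything in 127.0.0.0/8 is loopback.
classify_ip() {
    case "$1" in
        "")    echo "unresolved" ;;
        127.*) echo "loopback" ;;
        *)     echo "routable" ;;
    esac
}

# Print the first IPv4 address the system resolver (NSS: /etc/hosts, DNS, ...)
# returns for the hostname given as $1. Prints nothing if resolution fails.
resolve_first_ip() {
    getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

# Report how this node resolves its own hostname.
# The body runs in a subshell so its variables do not leak to the caller.
check_own_hostname() (
    name="$(hostname)"
    ip="$(resolve_first_ip "$name")"
    echo "$name ${ip:-none} $(classify_ip "$ip")"
)
```

Running `check_own_hostname` on each node (for example over ssh) prints the hostname, the resolved address, and its classification. A node that reports `loopback` for its own name matches the symptom in the endpoint table above.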

    Regarding the PSHM mapping failure:
    Are you certain all the systems believe they have separate hostnames, as reported from the compute node? I.e., run hostname on each of the nodes and make sure the answers are distinct. If they are all distinct, then the PSHM error is not the defect I expected and we should investigate further...

  4. Matthew reporter

    After-Note: Same deal as the last post, solution at the bottom

    developer@ubuntu-server-1:~$ hostname
    ubuntu-server-1
    
    developer@ubuntu-server-2:~$ hostname
    ubuntu-server-2
    
    developer@ubuntu-server-3:~$ hostname
    ubuntu-server-3
    

    Hosts file on master node:

    developer@ubuntu-server-1:~$ cat /etc/hosts
    127.0.0.1 localhost
    192.168.2.201   ubuntu-server-1
    192.168.2.202   ubuntu-server-2
    192.168.2.203   ubuntu-server-3
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    on ubuntu-server-2

    developer@ubuntu-server-2:~$ cat /etc/hosts
    127.0.0.1 localhost
    127.0.1.1 ubuntu-server-2
    192.168.2.201   ubuntu-server-1
    192.168.2.203   ubuntu-server-3
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    on ubuntu-server-3

    developer@ubuntu-server-3:~$ cat /etc/hosts
    127.0.0.1 localhost
    127.0.1.1 ubuntu-server-3
    192.168.2.201   ubuntu-server-1
    192.168.2.202   ubuntu-server-2
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    I changed the hosts file on ubuntu-server-1 to map its hostname to its own static IP (192.168.2.201) instead of 127.0.0.1, which removed the need for the export that sets the master node IP.

    Solution:

    I changed the hosts file on ALL the nodes so that each hostname maps to the node's own static IP (instead of just on the master node), and "export GASNET_SUPERNODE_MAXSIZE=1" is no longer required.

    New hosts files:

    Hosts file on master node:

    developer@ubuntu-server-1:~$ cat /etc/hosts
    127.0.0.1 localhost
    192.168.2.201   ubuntu-server-1
    192.168.2.202   ubuntu-server-2
    192.168.2.203   ubuntu-server-3
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    on ubuntu-server-2

    developer@ubuntu-server-2:~$ cat /etc/hosts
    127.0.0.1 localhost
    192.168.2.202   ubuntu-server-2
    192.168.2.201   ubuntu-server-1
    192.168.2.203   ubuntu-server-3
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    

    on ubuntu-server-3

    developer@ubuntu-server-3:~$ cat /etc/hosts
    127.0.0.1 localhost
    192.168.2.203   ubuntu-server-3
    192.168.2.201   ubuntu-server-1
    192.168.2.202   ubuntu-server-2
    
    # The following lines are desirable for IPv6 capable hosts
    ::1     ip6-localhost ip6-loopback
    fe00::0 ip6-localnet
    ff00::0 ip6-mcastprefix
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
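
The difference between the old and new hosts files comes down to whether a node's own hostname appears on a 127.x.x.x line. The following sketch scans a hosts-format file for that pattern. The function name `loopback_entries` is hypothetical, and the script assumes a POSIX shell with awk; it only inspects the file and does not modify it.

```shell
# Print "ADDRESS HOSTNAME" for every line of a hosts-format file that maps
# the given hostname to a 127.x.x.x address.
# Arguments: $1 = hostname to look for, $2 = path to the hosts file.
loopback_entries() {
    awk -v host="$1" '
        { sub(/#.*/, "") }            # drop trailing comments; fields are re-split
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++)
                if ($i == host) { print $1, host; break }
        }
    ' "$2"
}
```

For example, `loopback_entries "$(hostname)" /etc/hosts` printing nothing suggests the file no longer maps the node's own name to loopback, as in the new hosts files above. With the earlier ubuntu-server-2 file it would print the `127.0.1.1 ubuntu-server-2` entry.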
    

    Thanks for the help.
