Barrier synchronization failures

Issue #163 resolved
Dan Bonachea created an issue

Attached is simple test of upcxx::barrier() correctness. It is intermittently failing for non-trivial process counts. Failures have been observed in all of: the 2018.3.2 release, the current develop branch (c402e07), and the current Pull request 43 branch (32a8e10). Example traces below from dirac/gcc/smp/debug.

The failures seem to indicate that either the barrier algorithm is not correctly synchronizing the processes, or possibly that rpc() acknowledgements are not being correctly synchronized. The failures are most common/prevalent on smp-conduit, presumably because the fast shared-memory bypass channel exposes a race, but I've also seen violations that include a network conduit. The failures seem most prevalent at process counts over 16, but I've seen it fail for as few as 8 processes.

upcxx::barrier() is currently using a hand-rolled barrier implementation over GASNet AM, so that seems the most likely culprit. Tracking this down is a blocker, at least until we can rule out barrier correctness as a root cause.

Failure with 2018.3.2 release:

{pcp-d-5 ~/UPC/code} which upcxx-meta
/usr/local/pkg/upcxx-dirac/gcc-7.3.0-p1/stable-2018.3.2/bin/upcxx-meta
{pcp-d-5 ~/UPC/code} which g++
/usr/local/pkg/gcc/7.3.0-p1/bin/g++
{pcp-d-5 ~/UPC/code} setenv UPCXX_CODEMODE debug ; setenv UPCXX_GASNET_CONDUIT smp ; `upcxx-meta CXX` `upcxx-meta PPFLAGS` -std=c++11 barrier.cpp -o barrier-release `upcxx-meta LIBFLAGS` `upcxx-meta LDFLAGS` 
{pcp-d-5 ~/UPC/code} upcxx-run -np 8 barrier-release 1000     
Running 1000 of barrier test on 8 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-release 1000 
Running 1000 of barrier test on 16 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 17 barrier-release 1000 
Running 1000 of barrier test on 17 ranks...
1: ERROR gp_peer.is_null()
*** Caught a fatal signal: SIGABRT(6) on node 1/17
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
{pcp-d-5 ~/UPC/code} upcxx-run -np 18 barrier-release 1000  
Running 1000 of barrier test on 18 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 18 barrier-release 1000
Running 1000 of barrier test on 18 ranks...
142: ERROR gp_peer.is_null(): ERROR gp_peer.is_null()

: ERROR gp_peer.is_null()*** Caught a fatal signal: SIGABRT(6) on node 1/18
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 

*** Caught a fatal signal: SIGABRT(6) on node 2/18
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
*** Caught a fatal signal: SIGABRT(6) on node 4/18
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 

Failure with Nightly build:

{pcp-d-5 ~/UPC/code} which upcxx
/usr/local/pkg/upcxx-dirac/gcc-8.2.0/nightly/bin/upcxx
{pcp-d-5 ~/UPC/code} upcxx --version
UPC++ version 20180303 upcxx-2018.3.3-46-gc402e07
Copyright (c) 2018, The Regents of the University of California,
through Lawrence Berkeley National Laboratory.
http://upcxx.lbl.gov

g++ (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

{pcp-d-5 ~/UPC/code} env UPCXX_GASNET_CONDUIT=smp upcxx -g barrier.cpp -o barrier-develop
{pcp-d-5 ~/UPC/code} upcxx-run -np 8 barrier-develop
Running 100000 of barrier test on 8 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-develop 100000 
Running 100000 of barrier test on 16 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 17 barrier-develop 100000 
Running 100000 of barrier test on 17 ranks...
812: ERROR gp_peer.is_null(): ERROR gp_peer.is_null()

*** Caught a fatal signal: SIGABRT(6) on node 8/17
*** Caught a fatal signal: SIGABRT(6) on node 12/17
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-develop 100000 
Running 100000 of barrier test on 16 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-develop 100000
Running 100000 of barrier test on 16 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-develop 100000
Running 100000 of barrier test on 16 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-develop 100000
Running 100000 of barrier test on 16 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 16 barrier-develop 100000
Running 100000 of barrier test on 16 ranks...
12: ERROR gp_peer.is_null()
*** Caught a fatal signal: SIGABRT(6) on node 12/16
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 

Failure with PR branch:

{pcp-d-5 ~/UPC/code} env UPCXX_GASNET_CONDUIT=smp ../upcxx/jdbachan-subteams/bin/upcxx -g barrier.cpp -o barrier-branch
{pcp-d-5 ~/UPC/code} upcxx-run -np 8 barrier-branch
Running 100000 of barrier test on 8 ranks...
5: ERROR gp_peer.is_null()
*** Caught a fatal signal: SIGABRT(6) on node 5/8
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
{pcp-d-5 ~/UPC/code} upcxx-run -np 4 barrier-branch 
Running 100000 of barrier test on 4 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 8 barrier-branch 
Running 100000 of barrier test on 8 ranks...
SUCCESS
{pcp-d-5 ~/UPC/code} upcxx-run -np 12 barrier-branch 
Running 100000 of barrier test on 12 ranks...
0: ERROR gp_peer.is_null()
*** Caught a fatal signal: SIGABRT(6) on node 0/12
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
Abort

Comments (3)

  1. john bachan

    So I reproduced this issue, then changed gp_peer from a static to global variable, and the issue went away. So I conclude that the compiler was delaying initializing the static until first use, which happens after the rpc lands.

  2. Dan Bonachea reporter

    I conclude that the compiler was delaying initializing the static until first use, which happens after the rpc lands.

    Thanks for tracking this down John!

    The explanation is surprising, but makes sense. I can confirm that adding a barrier after the static declaration or moving it to global scope fixes the problem.

    I think UPC++ has actually exposed the result of a compiler optimization that is not permitted to be visible. Here is the relevant text from C++11 [stmt.dcl] :

    The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a block-scope entity with static storage duration, if applicable, is performed before its block is first entered. [...] Otherwise such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization.

    Here C++ clearly requires the constant initialization of the gp_peer local-scope static variable to happen before entering the containing block (which in this case is the body of main), but it appears g++ 8.x is "bending the rules" by delaying that until reaching the declaration (which is required for non-constant initializers). This trick normally would not be visible (even in threaded codes), but it's visible from a UPC++ lambda that is implicitly bound to the address of that variable and invoked by function pointer in the context of the runtime code before execution has reached the declaration on the target process.

    The rule-breaking here seems clearly the compiler's fault, so unfortunately there's probably nothing that can be done about this (unless there's a magical compiler switch to force strict initialization conformance). We may just have to warn users against accessing local-scope static variables from rpc.

  3. Log in to comment