RPC compute-pi regression on PPC/clang-4/smp/opt

Issue #236 wontfix
Dan Bonachea created an issue

CI shows the RPC-based compute-pi example is now failing on PPC/clang-4/smp/opt.

Example failure:

--- App stdout ---
Testing compute-pi-multi-examples.cpp with 4 ranks
Calculating pi with 40000 trials, distributed across 4 ranks.
rpc: pi estimate: 3.1631, rank 0 alone: 3.1712
rpc_no_barrier: pi estimate: 3.1631, rank 0 alone: 3.1712
global_ptrs: pi estimate: 3.1631, rank 0 alone: 3.1712
distobj: pi estimate: 3.1631, rank 0 alone: 3.1712
async_distobj: pi estimate: 3.1631, rank 0 alone: 3.1712
atomics: pi estimate: 3.1631, rank 0 alone: 3.1712
quiescence: pi estimate: -195846, rank 0 alone: 3.1712
--- App stderr ---
*** FATAL ERROR (proc 0): 
//////////////////////////////////////////////////
UPC++ assertion failure:
 rank=0
 file=/home/phargrov/upcnightly/EX-ppc64el-smp-clang_old/runtime/work/opt/upcxx/example/prog-guide/compute-pi-multi-examples.cpp:89

hits mismatch between quiescence and atomics

To have UPC++ freeze during these errors so you can attach a debugger, rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////

NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Caught a fatal signal (proc 0): SIGABRT(6)
------------------

Analysis of the run history shows this configuration/test has a long history of passing, up through 06-29: eaba120.
So far it has been consistently broken since 07-06, following the merge of pull request #86 at b0590aa.

The only GEX changes in the interval in question (def24e8-085d7b5) are to source files in other conduits or tests that are not compiled into this configuration. This points the finger rather squarely at pull request #86 as a causal factor, although it's also possible the new serialization code is treading into undefined behavior (UB) or tickling a bug in a way that happens to matter to the clang 4 optimizer.

The RPC code in question is very simple and is probably representative of many similar RPC codes:

int64_t hits = 0;
// counts the number of ranks for which the RPC has completed
int n_done = 0;

int64_t reduce_to_rank0(int64_t my_hits)
{
    // rpc_ff is "fire-and-forget": it returns no future, so the sender
    // cannot wait for this RPC to complete
    upcxx::rpc_ff(0, [](int64_t my_hits) { hits += my_hits; n_done++; }, my_hits);
    if (upcxx::rank_me() == 0) {
        // spin waiting for RPCs from all ranks to complete
        // When spinning, call the progress function to 
        // ensure rank 0 processes waiting RPCs
        while (n_done != upcxx::rank_n()) upcxx::progress();
    }
    // wait until all RPCs have been processed (quiescence)
    upcxx::barrier();
    return hits;
}

We need to investigate this regression, and either deploy a workaround in the serialization logic (if possible) or investigate raising the clang floor (if we decide it's a compiler bug).

Comments (8)

  1. Paul Hargrove

    @jdbachan This is occurring on the same VM where you are poking at PGI. The compilers in use are set up specifically for our "floor" testing:

    `CC=/usr/local/pkg/clang/4.0.0-gcc640/bin/clang`
    `CXX="/usr/local/pkg/clang/4.0.0-gcc640/bin/clang++ -std=c++14"`

    I am attempting to get an equivalent compiler on Summit to rule out some issue specific to this VM.

  2. Paul Hargrove

    I am attempting to get an equivalent compiler on Summit to rule out some issue specific to this VM.

    This was a dead-end because the installation of GCC 6.4.0 on Summit does not conform to llvm/clang expectations and therefore I cannot build an equivalent compiler there. The only thing I can build would use libstdc++ from GCC 4.8.5, which is too old for UPC++.

    There is another PPC64 (not VM) system I have access to.
    So, I may still be able to reproduce w/o a VM, but it would not be a system we can all use.

  3. Paul Hargrove

    There is another PPC64 (not VM) system I have access to. So, I may still be able to reproduce w/o a VM, but it would not be a system we can all use.

    I can confirm that an equivalent compiler build fails this test in the same manner on a (non-VM) Power8 system.
    I believe this means it is "safe" to pursue this issue on the VM, knowing that it is not an artifact of the virtualization.

    I am going to see if I can find time to narrow the range of compilers showing this issue (but that is not a promise that I can do so).

  4. Paul Hargrove

    Having ruled out the VM itself as a contributor to the problem, I've returned to testing on the VM where the problem first appeared.

    Running with the distro-provided clang-4.0.1 (backed by libstdc++ from g++ 8.3) also shows the problem. This rules out both my own build of clang-4.0 and that build's use of libstdc++ from g++ 6.4.0 as contributing factors.

    Meanwhile, the distro-provided clang-{5.0.1,6.0.0,7.0.0,8.0.0} do NOT reproduce the error (all backed by libstdc++ from g++ 8.3).

    Note the failing versions of clang and libstdc++ are just fine on x86-64 (config EX-dirac-ibv-clang_old).

    These facts place suspicion on a bug in clang-4.0.x on ppc64el (but do not prove one). It remains possible that our C++ code is treading into UB territory and that only this particular toolchain converts that into "bad" behavior.

    Dan observed to me in Slack that this config is passing -std=c++14 as part of the value of CXX. I have confirmed that passing -std=c++11 instead makes no difference.

    Dan has suggested that we may need to raise our clang floor due to this issue if it cannot be worked around (regardless of who or what is at fault). However, I want to note that we don't yet claim support for any version of clang on ppc64el. When we do, we can certainly state clang-5 as the minimum, and doing so would not represent a regression. Of course, this "clang_old" tester should be adjusted to match whatever we document as the ppc64/clang floor.

    For completeness, I tried PR96 with this configuration and can report that this assertion failure remains.

    Finally, I want to mention that I've run the GASNet-level tests with every configuration I've reported on and have seen no sign of problems with the atomics.

  5. Dan Bonachea reporter

    Finally, I want to mention that I've run the GASNet-level tests with every configuration I've reported on and have seen no sign of problems with the atomics.

    Note when Paul says "the atomics" he means GASNet tools atomics (which back GASNet remote atomics), and NOT the clang-provided std::atomic, which backs portions of the UPC++ runtime queues. Those still remain potentially suspect.

  6. Paul Hargrove

    I can report that building clang-4 with a few different compilers did not change the results. However, I am concerned that when built with clang-7 or clang-8, the resulting clang-4 fails its own make check (which it passed when built with two different GCC versions and when built with itself). I'm not entirely sure what to make of that.

  7. Dan Bonachea reporter

    Pull request #98 proposes to add clang 5.0 as our floor compiler version on Linux/ppc64le.

    If accepted, this issue can be closed as "wontfix" for an unsupported compiler version.

  8. Dan Bonachea reporter

    Add clang v5.0+ as a supported compiler on Linux/ppc64le

    Note this deliberately leaves clang-4.0/Linux/ppc64le as unvalidated/unsupported, resolving issue #236

    → <<cset 84369a2a8f85>>
