PGI compiler crashes on test/memberof.cpp with debug

Issue #390 wontfix
Paul Hargrove created an issue

Multiple versions of the PGI compilers have been seen to crash with an internal compiler error (ICE) when compiling test/memberof.cpp. One such failure, seen in GitLab CI with PGI 19.1 on an x86_64/Linux system, looks like the following:

Compiling test-memberof-seq-debug-smp    FAILED
 "/builds/anl/upcpp/bld/upcxx.assert1.optlev0.dbgsym1.gasnet_seq.smp/include/upc
           xx/utility.hpp", line 298: internal error: write_name_entry_to_file: 
           inconsistant number of debug entries 
       char xbuf_[size + align-1];
                 ^
 1 catastrophic error detected in the compilation of "/builds/anl/upcpp/test/memberof.cpp".
 Compilation aborted.
 pgc++-Fatal-/usr/local/pkg/pgi/linux86-64-llvm/19.1/bin/pggpp1-llvm TERMINATED by signal 6

This has also been seen on ppc64le/Linux with a recent compiler version.

Note that this message is the same one seen with issue138.cpp, which is known to be "big".
A search on the PGI/NVIDIA forums finds a report of this error, which suggests that large inputs overflow some internal table, but doesn't give much more information than that.

Reading other Google hits on "inconsistant number of debug entries" suggests that omitting -g was the best (only?) work-around.

Comments (15)

  1. Paul Hargrove reporter

    I tried several things last night as potential work-arounds for the PGI 19.1 (floor) compiler on Dirac.
    I plan to also test PGI 20.4 on Summit(dev) when either/both returns from the current scheduled maintenance.
    None of the following on the upcxx command line was effective:

    • The base case to which the following were appended: upcxx -g [...]/memberof.cpp
    • -Mnodwarf, -Mdwarf2, -Mdwarf3, -Mcoff or -Melf
    • -O0
    • -Wc,-O1 or -Wc,-O3 (indirection prevents upcxx from complaining about -g and -O together)
    • -std=c++14 or -std=c++17
    • -Mnollvm (introduced new compile errors)
    • -s (in case symbol table generation would be elided knowing the linker was instructed to strip)

    There is no (documented) equivalent of GCC's -g0, but I tried -Wc,-g0, -nog, -no-g, -Mnog and -Mno-g, all of which were rejected as pgc++-Error-Unknown switch: [foo].

    I was also unable to resolve the problem by maximizing potentially relevant resource limits:

    $ ulimit -a|grep kby
    data seg size           (kbytes, -d) unlimited
    max locked memory       (kbytes, -l) unlimited
    max memory size         (kbytes, -m) unlimited
    stack size              (kbytes, -s) unlimited
    virtual memory          (kbytes, -v) unlimited
    
  2. Paul Hargrove reporter

    I've discovered that the EX-summitdev-ibv-pgi configuration is misconfigured such that PGI 20.4 is used for the C compiler, but 19.1 (via mpicxx) is mistakenly being used as CXX when configuring UPC++. So, it is possible that the problem is NOT present in recent builds (which I am looking into soon).

  3. Paul Hargrove reporter

    > So, it is possible that the problem is NOT present in recent builds (which I am looking into soon).

    I have confirmed on Summitdev that a build actually using the 20.4 PGI compilers takes an extraordinarily long time to build this test, but does not fail (and the test runs and prints SUCCESS), while 19.9 fails as seen in the nightly testing.

    Using Summit to test 19.9, 19.10, 20.1 and 20.4, I find 19.9 still failing (as expected) and the other three all passing.
    So, my testing suggests this issue was fixed in the 19.10 PGI release, though https://www.pgroup.com/support/release-tprs-2019.htm does not list anything that sounds remotely related to me.

  4. Paul Hargrove reporter

    I can additionally confirm that upcxx -g [...]/issue138.cpp -DMINIMAL fails with the same "inconsistant number of debug entries" message when compiled using PGI 18.10 or 19.9, but passes with 19.10, 20.1 and 20.4. This is consistent with the theory that something was fixed between 19.9 and 19.10.

  5. Dan Bonachea

    I think we agree this is very clearly a bug in the PGI compiler, so marking external.

    Based on Paul's investigations it appears this has been fixed upstream (or at least insofar as this test case can determine), so there's probably no point in reporting this to the PGI maintainers.

    This issue should probably be closed as "wontfix", although we might still want to find a means to tweak our CI to ignore this known failure conditional on PGI version (both test harnesses currently lack the ability to discern based on compiler version).

  6. Paul Hargrove reporter

    I agree that lacking a reproducer for the current release makes a bug report to the vendor a waste of effort.

    As I mentioned in (I think) our 2020-07-022 call, I have an idea to create a upcxx_config.mak at the same point in the build process as the upcxx_config.hpp. This would provide a key piece of the logic needed for version-conditional actions in make (e.g. exclusions or work-arounds). So, while "wontfix" is our normal response to an external bug like this one, I think this can remain open (but maybe not "critical"?) pending such an improvement to our make dev-check infrastructure.

  7. Paul Hargrove reporter

    We have implemented (pull request #242) the idea of a generated makefile fragment, mentioned in my previous comment. However, we discarded the idea of using that mechanism to exclude ICE-inducing tests.

    Instead, we are pursuing conditional #error UPCXX_TEST_SKIPPED in such tests, and logic in the test/check make targets to recognize this case. Proposal appears in pull request 248.

  8. Paul Hargrove reporter

    tests: skip memberof.cpp on old PGI + DEBUG

    This commit adds UPCXX_TEST_SKIPPED to memberof.cpp, conditional on DEBUG builds with PGI older than 19.10.

    This addresses issue 390, eliminating the make dev-check failures.

    → <<cset b445eafc6e18>>
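A minimal sketch of what such a version-conditional skip could look like, for illustration only and not the contents of the actual commit. It assumes the PGI predefined macros `__PGI`, `__PGIC__` and `__PGIC_MINOR__`; `EXAMPLE_DEBUG_BUILD` and the helper `older_than` are hypothetical names introduced here, and the real test may detect debug builds differently.

```cpp
#include <cassert>

// True when compiler version major.minor is older than floor_major.floor_minor.
// Hypothetical helper; it mirrors the preprocessor comparison below so the
// logic can be checked on any compiler.
constexpr bool older_than(int major, int minor, int floor_major, int floor_minor) {
  return major < floor_major || (major == floor_major && minor < floor_minor);
}

// EXAMPLE_DEBUG_BUILD is a hypothetical stand-in for whatever macro marks a
// debug build. On PGI older than 19.10 the #error aborts compilation with a
// marker that the test/check make targets can recognize as a skip.
#if defined(__PGI) && defined(EXAMPLE_DEBUG_BUILD)
#  if __PGIC__ < 19 || (__PGIC__ == 19 && __PGIC_MINOR__ < 10)
#    error UPCXX_TEST_SKIPPED
#  endif
#endif

// Versions reported in this thread: 18.10 and 19.9 fail; 19.10, 20.1 and 20.4 pass.
inline void check_reported_versions() {
  assert(older_than(18, 10, 19, 10));
  assert(older_than(19, 9, 19, 10));
  assert(!older_than(19, 10, 19, 10));
  assert(!older_than(20, 1, 19, 10));
  assert(!older_than(20, 4, 19, 10));
}
```

On a non-PGI compiler the preprocessor block is inert, so the file compiles normally and only the comparison logic is exercised.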

  9. Dan Bonachea

    We've now seen this same symptom in a second test (nodiscard) for PGI versions prior to 19.4 (see discussion in pull request #263).

    We believe the defect is fixed in current versions of PGI, and we disable the affected tests on debug builds using the older versions, so I am resolving this as WONTFIX.

  10. Dan Bonachea

    The current workaround for this problem was added in pull request #275. It strips the -g command-line option when compiling affected tests, which disables the buggy DWARF2 symbol generation code in PGI 19.1 through 19.4.
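The flag-stripping idea can be illustrated with a small sketch. This is not the implementation from pull request #275 (which lives in the build and test infrastructure); the function name `strip_debug_flag` is hypothetical, and the sketch assumes that only an argument exactly equal to `-g` should be dropped.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns a copy of the compiler argument list with every argument that is
// exactly "-g" removed. Other arguments, including ones that merely start
// with "-g", are left untouched and keep their original order.
std::vector<std::string> strip_debug_flag(const std::vector<std::string>& args) {
  std::vector<std::string> out;
  out.reserve(args.size());
  for (const auto& arg : args) {
    if (arg != "-g") out.push_back(arg);
  }
  return out;
}
```

Applied to an argument list such as `upcxx -g memberof.cpp`, this yields `upcxx memberof.cpp`, so the test is still built and run, only without debug symbol generation.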

  11. Dan Bonachea

    I've now seen evidence on Summit of this failure using PGI 20.4, with an expanded version of rpc-ctor-trace currently in development. So I no longer believe that PGI fully solved this problem after 19.4, although they may have pushed it out so that it only occurs on larger inputs.

    PGC++-S-0000-Internal compiler error. read_debug_info: bad debug file
           0  (/gpfs/alpine/csc296/scratch/hargrove/gitlab-builds/vwc8ZDjA/0/anl/upcpp/test/rpc-ctor-trace.cpp)
    
  12. Amir Kamil

    Would it be worth splitting up these tests into two or more, so that we maintain coverage with PGI?

  13. Dan Bonachea

    > Would it be worth splitting up these tests into two or more, so that we maintain coverage with PGI?

    In the current state of the workaround, we still always build and run the affected tests; we just omit the -g compiler option on those tests so that the compiler skips the buggy/resource-intensive debug symbol generation. So we aren't really lacking any coverage here, except for coverage of the debug symbol generation, which we know is non-scalable/broken on this compiler (and not our problem to fix).

  14. Dan Bonachea

    The problem still occurs with the new PGI 19.3 floor on Dirac, for at least the memberof test, so the workaround is still relevant.
