CUDA 9.2 failures on x86_64/Linux and C2050 (Fermi) GPUs

Issue #314 wontfix
Dan Bonachea created an issue

One of our CI configurations EX-dirac-ibv-pgi is deterministically failing both upcxx/copy and padded_cuda_allocator tests. The former manifests as a validation failure with garbage results, and the latter manifests as an error return from a tiny cudaMallocPitch call when the device heap should be otherwise empty.
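For context, here is a minimal standalone sketch of the operation that fails (not the actual padded_cuda_allocator test; the sizes are made up): a tiny cudaMallocPitch against an otherwise empty device heap, which should trivially succeed on a healthy stack.

```cpp
// Hypothetical repro sketch: a small pitched allocation that should always
// succeed when the device heap is otherwise empty.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  void *ptr = nullptr;
  size_t pitch = 0;
  // Width/height chosen arbitrarily; the failing call in the test is similarly tiny.
  cudaError_t err = cudaMallocPitch(&ptr, &pitch, 16 /*width bytes*/, 4 /*height*/);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("cudaMallocPitch OK, pitch=%zu bytes\n", pitch);
  cudaFree(ptr);
  return 0;
}
```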

This config is running PGI 19.10 on pcp-d-{8,9} which are equipped with Tesla C2050 GPUs, and thus are capped at CUDA 9.2 which is the last release to support this hardware. An otherwise identical configuration EX-dirac-ibv-pgi_old running PGI 19.2 on the same nodes consistently passes the same tests.

Our working theory, consistent with all observed evidence, is that something about the new PGI compiler has broken ABI compatibility with the old CUDA release.

Note that despite impacting the same copy test, this problem appears independent of issue #241, which is a non-deterministic failure for all compilers with a different symptom (cleared memory is observed, instead of garbage).

Comments (3)

  1. Dan Bonachea reporter

    We believe this to be an external bug that should not affect any modern GPU hardware (which should be running CUDA 10 or later). The recommended workaround for ancient GPU hardware is to run the older PGI compiler.

    We don't plan on reporting this further, since it's unlikely anyone cares.

  2. Paul Hargrove

    Our working theory, consistent with all observed evidence, is that something about the new PGI compiler has broken ABI compatibility with the old CUDA release.

    Part of that "observed evidence" is that the same PGI 19.10 on Summit, Summitdev, and Cori's GPU nodes (all with CUDA 10.x) does NOT exhibit the failure.
    So, as Dan says, it is the combination of old CUDA and new PGI that is suspected to be at fault (not the new PGI alone).

  3. Paul Hargrove

    Dan and I misdiagnosed this problem earlier today.

    TL;DR:

    • CUDA-9.2 + C2050 GPU is broken with any compiler (not PGI-specific)
    • same CUDA + newer GPU is fine
    • same GPU + older CUDA is fine

    Full version:

    First, I just went to confirm that the just-released PGI 20.1 on pcp-d-{8,9} still shows the problem. It does.

    However, I noticed that the "otherwise identical configuration EX-dirac-ibv-pgi_old" is actually testing CUDA 9.0, not 9.2. Upon further investigation, I find that our floor of PGI-19.1 also fails copy.cpp deterministically with CUDA 9.2.

    Meanwhile, CUDA 9.0 and 9.1 are working fine with (at least) PGI 19.1, 19.10 and 20.1.
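    (Since the exact CUDA version under test turned out to matter, here is a hedged sketch of one way to double-check which CUDA driver and runtime a node actually presents; deviceQuery reports the same information.)

    ```cpp
    // Sketch: print the CUDA driver and runtime versions on the node.
    // CUDA encodes versions as 1000*major + 10*minor (e.g. 9020 for CUDA 9.2).
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int driverVer = 0, runtimeVer = 0;
      cudaDriverGetVersion(&driverVer);
      cudaRuntimeGetVersion(&runtimeVer);
      std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                  driverVer / 1000, (driverVer % 100) / 10,
                  runtimeVer / 1000, (runtimeVer % 100) / 10);
      return 0;
    }
    ```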

    Moving on to test CUDA 9.2 with non-PGI compilers, I found that at least GCC 6.4.0 and GCC 7.4.0 also show the deterministic failure.

    The CUDA packages we use came directly from NVIDIA's RPM repo, and pass checksum validation.

    I found an update to the kernel driver that supports this ancient GPU, but installing it did not resolve the problem.

    Trying a different node pair (pcp-d-{1,2}) with the identical hardware STILL produces the deterministic failure.

    Running smp-conduit on pcp-d-8 fails just as it does with ibv-conduit on the pair pcp-d-{8,9}.

    However, smp-conduit on pcp-d-6 with a Tesla K20Xm (Kepler family) PASSES just fine.

    Similarly, running smp-conduit on a node with a Tesla M4 (Maxwell family) also PASSES.
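    (For anyone retracing this comparison, a hedged sketch of how to tell the GPU families apart from compute capability: Fermi is 2.x, Kepler is 3.x, Maxwell is 5.x.)

    ```cpp
    // Sketch: list each device's name and compute capability to distinguish
    // Fermi (2.x) nodes from Kepler (3.x) / Maxwell (5.x) nodes.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int count = 0;
      if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no usable CUDA devices\n");
        return 1;
      }
      for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        std::printf("device %d: %s, compute capability %d.%d\n",
                    d, prop.name, prop.major, prop.minor);
      }
      return 0;
    }
    ```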

    Other than the GPU-specific kernel drivers, the tests described so far all use 100% identical software stacks (compilers and CUDA are on an NFS shared mount).

    So, this is most likely an incompatibility between CUDA 9.2 and the ancient C2050 GPUs.
    NVIDIA docs claim they should work together, but our evidence seems to indicate otherwise.

    This remains "WONTFIX" since we are talking about a mid-2011 GPU.
    The recommended work-around for those experiencing this problem with C2050 (or other Fermi family) GPUs is to buy new GPUs or downgrade to CUDA 9.1 (or 9.0).
