CUDA 9.2 failures on x86_64/Linux and C2050 (Fermi) GPUs

Issue #314 wontfix
Dan Bonachea created an issue

One of our CI configurations EX-dirac-ibv-pgi is deterministically failing both upcxx/copy and padded_cuda_allocator tests. The former manifests as a validation failure with garbage results, and the latter manifests as an error return from a tiny cudaMallocPitch call when the device heap should be otherwise empty.
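For context, here is a minimal standalone sketch of the operation that fails (not the actual padded_cuda_allocator test; the sizes are made up): a tiny cudaMallocPitch against an otherwise empty device heap, which should trivially succeed on a healthy stack.

```cpp
// Hypothetical repro sketch: a small pitched allocation that should always
// succeed when the device heap is otherwise empty.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  void *ptr = nullptr;
  size_t pitch = 0;
  // Width/height chosen arbitrarily; the failing call in the test is similarly tiny.
  cudaError_t err = cudaMallocPitch(&ptr, &pitch, 16 /*width bytes*/, 4 /*height*/);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("cudaMallocPitch OK, pitch=%zu bytes\n", pitch);
  cudaFree(ptr);
  return 0;
}
```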

This config is running PGI 19.10 on pcp-d-{8,9} which are equipped with Tesla C2050 GPUs, and thus are capped at CUDA 9.2 which is the last release to support this hardware. An otherwise identical configuration EX-dirac-ibv-pgi_old running PGI 19.2 on the same nodes consistently passes the same tests.

Our working theory, consistent with all observed evidence, is that something about the new PGI compiler has broken ABI compatibility with the old CUDA release.

Note that despite impacting the same copy test, this problem appears independent of issue #241, which is a non-deterministic failure for all compilers with a different symptom (cleared memory is observed, instead of garbage).

Comments (3)

  1. Dan Bonachea reporter

    We believe this to be an external bug that should not affect any modern GPU hardware (which should be running CUDA 10 or later). The recommended workaround for ancient GPU hardware is to run the older PGI compiler.

    We don't plan on reporting this further, since it's unlikely anyone cares.

  2. Paul Hargrove

    Our working theory, consistent with all observed evidence, is that something about the new PGI compiler has broken ABI compatibility with the old CUDA release.

    Part of that "observed evidence" is that the same PGI 19.10 on Summit, Summitdev, and Cori's GPU nodes (all with CUDA 10.x) does NOT exhibit the failure.
    So, as Dan says, it is the combination of old CUDA and new PGI that is suspected to be at fault (not the new PGI alone).

  3. Paul Hargrove

    Dan and I misdiagnosed this problem earlier today.

    TL;DR:

    • CUDA-9.2 + C2050 GPU is broken with any compiler (not PGI-specific)
    • same CUDA + newer GPU is fine
    • same GPU + older CUDA is fine

    Full version:

    First, I just went to confirm that the just-released PGI 20.1 on pcp-d-{8,9} still shows the problem. It does.

    However, I noticed that the "otherwise identical configuration EX-dirac-ibv-pgi_old" is actually testing CUDA 9.0, not 9.2. Upon further investigation, I find that our floor of PGI-19.1 also fails copy.cpp deterministically with CUDA 9.2.

    Meanwhile, CUDA 9.0 and 9.1 are working fine with (at least) PGI 19.1, 19.10 and 20.1.
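    (Since the exact CUDA version under test turned out to matter, here is a hedged sketch of one way to double-check which CUDA driver and runtime a node actually presents; deviceQuery reports the same information.)

    ```cpp
    // Sketch: print the CUDA driver and runtime versions on the node.
    // CUDA encodes versions as 1000*major + 10*minor (e.g. 9020 for CUDA 9.2).
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int driverVer = 0, runtimeVer = 0;
      cudaDriverGetVersion(&driverVer);
      cudaRuntimeGetVersion(&runtimeVer);
      std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                  driverVer / 1000, (driverVer % 100) / 10,
                  runtimeVer / 1000, (runtimeVer % 100) / 10);
      return 0;
    }
    ```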

    Moving on to test CUDA 9.2 with non-PGI compilers, I found that at least GCC 6.4.0 and GCC 7.4.0 also show the deterministic failure.

    The CUDA packages we use came directly from NVIDIA's RPM repo, and pass checksum validation.

    I found an update to the kernel driver that supports this ancient GPU, but installing it did not resolve the problem.

    Trying a different node pair (pcp-d-{1,2}) with the identical hardware STILL produces the deterministic failure.

    Running smp-conduit on pcp-d-8 fails just as it does with ibv-conduit on the pair pcp-d-{8,9}.

    However, smp-conduit on pcp-d-6 with a Tesla K20Xm (Kepler family) PASSES just fine.

    Similarly, running smp-conduit on a node with a Tesla M4 (Maxwell family) also PASSES.
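    (For anyone retracing this comparison, a hedged sketch of how to tell the GPU families apart from compute capability: Fermi is 2.x, Kepler is 3.x, Maxwell is 5.x.)

    ```cpp
    // Sketch: list each device's name and compute capability to distinguish
    // Fermi (2.x) nodes from Kepler (3.x) / Maxwell (5.x) nodes.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
      int count = 0;
      if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no usable CUDA devices\n");
        return 1;
      }
      for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        std::printf("device %d: %s, compute capability %d.%d\n",
                    d, prop.name, prop.major, prop.minor);
      }
      return 0;
    }
    ```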

    Other than the GPU-specific kernel drivers, the tests described so far all use 100% identical software stacks (compilers and CUDA are on an NFS shared mount).

    So, this is most likely an incompatibility between CUDA 9.2 and the ancient C2050 GPUs.
    NVIDIA docs claim they should work together, but our evidence seems to indicate otherwise.

    This remains "WONTFIX" since we are talking about a mid-2011 GPU.
    The recommended work-around for those experiencing this problem with C2050 (or other Fermi family) GPUs is to buy new GPUs or downgrade to CUDA 9.1 (or 9.0).
