cannon_cuda fails to validate with 16 ranks

Issue #381 resolved
Dan Bonachea created an issue

We finally have cannon_cuda running in the nightly automated tests, and they have uncovered a validation failure for the CUDA version at 16 ranks.

The program runs correctly with 1 and 4 ranks on the same pcp-d-{10,11} hardware (exactly two physical nodes, each with one Tesla M4).

Failure output:

Running Cannon's algorithm for 2048x2048 matrix multiplication.
ERROR: rank:1 expected 102400 got 120832
ERROR: rank:4 expected 206848 got 188416
ERROR: rank:5 expected 233472 got 176128
ERROR: rank:0 expected 92160 got 108544
ERROR: rank:2 expected 112640 got 129024
ERROR: rank:6 expected 260096 got 196608
ERROR: rank:7 expected 286720 got 285904
ERROR: rank:9 expected 364544 got 335872
ERROR: rank:11 expected 450560 got 417792
ERROR: rank:12 expected 436224 got 542720
ERROR: rank:14 expected 555008 got 684032
ERROR: rank:15 expected 614400 got 720896
ERROR: rank:13 expected 495616 got 622592
ERROR: rank:8 expected 321536 got 212992
ERROR: rank:10 expected 407552 got 294912
Initialization: 0.0143579 s
Parallel Compute: 0.254471 s
Verification: 0.0014017 s

It's possible this is another symptom of issue #241, so it might not be worth tracking down at this time.

Comments (3)

  1. Dan Bonachea (reporter)

    The same code also fails on the (completely different) old-high-sierra Mac system using smp-conduit and more than 4 ranks; I've seen the failure at 9, 16, and 25 ranks. Also, the "wrong answer" values vary between back-to-back runs of the same configuration.

    From this we can conclude that the network is not relevant and that a race is probably involved. The fact that the failure only occurs at larger scales (and consistently never at 4 ranks) suggests an app-level data race, but it might still be issue #241, which has never been adequately diagnosed.

  2. Max Grossman

    Reproduced the issue on Summit with 16 ranks. The code only fails when compiled for CUDA; I haven't been able to get a failure from the host version. This could be because of a difference between the two code paths, or because the CPU version is much slower and therefore hides the data race.

  3. Max Grossman

    This issue is resolved by PR 28: https://bitbucket.org/berkeleylab/upcxx-extras/pull-requests/28/fix-cross-iteration-race-in-cannon-example

    The cause of this bug was a mishandling of the asynchrony of cuBLAS kernels. By default, cuBLAS kernels (e.g. GEMM) run asynchronously with respect to the host application. We were not synchronizing on the kernel, so there was a race between the GEMM kernel on iteration i reading its inputs and the upcxx::copy on iteration i+1 overwriting those inputs. The fix was to issue a cudaDeviceSynchronize after the cuBLAS GEMM kernel on every iteration except the last. The last iteration does not need to be synchronized because we eventually do a cudaMemcpy, which implicitly waits for all prior kernels on the default stream.
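
    For concreteness, here is a minimal sketch (not the actual upcxx-extras code) of the corrected compute loop. The function name cannon_compute_loop and the buffer names d_a/d_b/d_c are hypothetical; it assumes square n x n double tiles already resident in device memory, and comments stand in for the upcxx::copy calls that rotate tiles between neighboring ranks:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // One rank's compute loop in Cannon's algorithm (sketch).
    void cannon_compute_loop(cublasHandle_t handle, int n, int steps,
                             double *d_a, double *d_b, double *d_c)
    {
        const double alpha = 1.0, beta = 1.0;  // accumulate C += A*B each step

        for (int step = 0; step < steps; ++step) {
            // cublasDgemm only enqueues the GEMM on the default stream and
            // returns immediately; the kernel may still be reading d_a/d_b.
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        n, n, n, &alpha, d_a, n, d_b, n, &beta, d_c, n);

            if (step != steps - 1) {
                // Wait for the GEMM to finish consuming d_a/d_b before the
                // next communication step overwrites them. Omitting this
                // synchronization was the cross-iteration race.
                cudaDeviceSynchronize();

                // ... issue upcxx::copy calls here to shift the A and B tiles
                // to/from neighboring ranks, then wait on those copies ...
            }
            // No sync is needed after the final GEMM: the eventual cudaMemcpy
            // of d_c back to the host runs on the same default stream and
            // therefore orders after it.
        }
    }

    A finer-grained alternative would be to synchronize only the relevant stream (cudaStreamSynchronize) or to use CUDA events, but a cudaDeviceSynchronize is the simplest fix and is what the comment above describes.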
