cannon_cuda fails to validate with 16 ranks

Issue #381 resolved
Dan Bonachea created an issue

We finally have cannon_cuda running in the nightly automated tests, and they have uncovered a validation failure for the CUDA version at 16 ranks.

The program runs correctly with 1 and 4 ranks on the same pcp-d-{10,11} hardware (exactly two physical nodes, each with one Tesla M4).

Failure output:

Running Cannon's algorithm for 2048x2048 matrix multiplication.
ERROR: rank:1 expected 102400 got 120832
ERROR: rank:4 expected 206848 got 188416
ERROR: rank:5 expected 233472 got 176128
ERROR: rank:0 expected 92160 got 108544
ERROR: rank:2 expected 112640 got 129024
ERROR: rank:6 expected 260096 got 196608
ERROR: rank:7 expected 286720 got 285904
ERROR: rank:9 expected 364544 got 335872
ERROR: rank:11 expected 450560 got 417792
ERROR: rank:12 expected 436224 got 542720
ERROR: rank:14 expected 555008 got 684032
ERROR: rank:15 expected 614400 got 720896
ERROR: rank:13 expected 495616 got 622592
ERROR: rank:8 expected 321536 got 212992
ERROR: rank:10 expected 407552 got 294912
Initialization: 0.0143579 s
Parallel Compute: 0.254471 s
Verification: 0.0014017 s

It's possible this is another symptom of issue #241, so it might not be worth tracking down at this time.

Comments (3)

  1. Dan Bonachea (reporter)

    The same code also fails on the (completely different) old-high-sierra Mac system using smp-conduit and more than 4 ranks; I've seen the failure at 9, 16, and 25 ranks. Also, the "wrong answer" values vary between back-to-back runs of the same configuration.

    From this we can conclude that the network is not relevant and that a race is probably involved. The fact that the failure only occurs at larger scales (and consistently never at 4 ranks) suggests an app-level data race, but it might still be issue #241, which has never been adequately diagnosed.

  2. Max Grossman

    Reproduced the issue on Summit with 16 ranks. The code only fails when compiled for CUDA; I haven't been able to get a failure from the host version. This could be because of a difference between the two code paths, or because the CPU version is much slower and therefore hides the data race.

  3. Max Grossman

    This issue is resolved by PR 28: https://bitbucket.org/berkeleylab/upcxx-extras/pull-requests/28/fix-cross-iteration-race-in-cannon-example

    The cause of this bug was a mishandling of the asynchrony of cuBLAS kernels. By default, cuBLAS kernels (e.g. GEMM) run asynchronously with respect to the host application. We were not synchronizing on the kernel, so there was a race between the GEMM kernel on iteration i reading its inputs and the upcxx::copy on iteration i+1 overwriting those inputs. The fix was to issue a cudaDeviceSynchronize after the cuBLAS GEMM kernel on every iteration except the last. The last iteration does not need to be synchronized because we eventually do a cudaMemcpy, which implicitly waits for all prior kernels on the default stream.
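
    For concreteness, here is a minimal sketch (not the actual upcxx-extras code) of the corrected compute loop. The function name cannon_compute_loop and the buffer names d_a/d_b/d_c are hypothetical; it assumes square n x n double tiles already resident in device memory, and comments stand in for the upcxx::copy calls that rotate tiles between neighboring ranks:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // One rank's compute loop in Cannon's algorithm (sketch).
    void cannon_compute_loop(cublasHandle_t handle, int n, int steps,
                             double *d_a, double *d_b, double *d_c)
    {
        const double alpha = 1.0, beta = 1.0;  // accumulate C += A*B each step

        for (int step = 0; step < steps; ++step) {
            // cublasDgemm only enqueues the GEMM on the default stream and
            // returns immediately; the kernel may still be reading d_a/d_b.
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        n, n, n, &alpha, d_a, n, d_b, n, &beta, d_c, n);

            if (step != steps - 1) {
                // Wait for the GEMM to finish consuming d_a/d_b before the
                // next communication step overwrites them. Omitting this
                // synchronization was the cross-iteration race.
                cudaDeviceSynchronize();

                // ... issue upcxx::copy calls here to shift the A and B tiles
                // to/from neighboring ranks, then wait on those copies ...
            }
            // No sync is needed after the final GEMM: the eventual cudaMemcpy
            // of d_c back to the host runs on the same default stream and
            // therefore orders after it.
        }
    }

    A finer-grained alternative would be to synchronize only the relevant stream (cudaStreamSynchronize) or to use CUDA events, but a cudaDeviceSynchronize is the simplest fix and is what the comment above describes.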
