some kernels broken with CUDA

Issue #321 resolved
Jonas Thies
created an issue

I just noticed that the phist+ghost+CUDA tests have been failing for more than a month.

The reason I didn't notice it before is that the bug causes a hanging MPI run, and the test is aborted without notification after a timeout.

The failing test can be reproduced by

env GHOST_TYPE=cuda ./phist-1.3.1-kernels-test --gtest_filter=*fused_spmv_mvTmv

It looks like there is an error message from GEMM in some fallback variant on the GPU, and in hybrid CPU/GPU mode the program just hangs because the error is not propagated to the other processes (we don't do that in phist, which may be our fault).

Setting the priority of this issue to critical because we no longer see any test results.

Comments (21)

  1. Dominik Ernst

    I checked out and built phist's devel branch and ran the command, but I see 48 of 48 tests passed. Can you specify more details about how you are building ghost and phist?

  2. Jonas Thies reporter

    It depends a lot on the exact configuration, because it's a fallback kernel that may or may not be triggered. I will do some more experiments and post more information this weekend, hopefully.

  3. Jonas Thies reporter

    I think it may be an issue with the driver: the last stable build was with driver version 384.81, the first failing one (build 551) with version 390.25. Unfortunately, the output does not contain the ghost revision (I will change that asap), and the corresponding GHOST build is no longer available to look it up. The build date of the first failed build was Feb 4, 2018; the date of the last success was Jan 2, 2018. @MelvenZ we should test whether the driver on your GPU works correctly.

  4. Jonas Thies reporter

    Currently I get CUDA errors from tsmm_inplace in some cases. In larger contexts like the block orthogonalization tests, these lead to hanging MPI jobs because rank 0 returns on error and rank 1 continues until the next reduce. Does 'make check' in phist run on your systems?

    Some output from the failing tests:

    2: [ RUN ] CMvecSdMatFusedTest_10_1_1.fused_mvsdi_mvTmv
    2: [GHOST] PE0 ERROR at ghost_tsmttsm_cu_rm_fallback() <tsmttsm_var2_cuvar_var.cu:117>: CUDA Error: invalid argument (11)
    2: [GHOST] PE0 ERROR at ghost_tsmttsmu_cuda_x_x_x_1_rm() <tsmttsm_var2_cu__var_var.cu:177>: CUDA Error: invalid argument (11)

    The hanging test produces the following output:

    11: [ RUN ] CCholQR_Test_59_5.with_random_vectors
    11: [GHOST] PE0 INFO at tsmttsm_plain_kernel() <tsmttsm_var2_plainvar_var.cpp:53>: In UNALIGNED Row-Major TSMTTSM with arbitrary block sizes 5x5 <- 5x29 * 29x5, St7complexIfE St7complexIfE
    11: [GHOST] PE1 ERROR at ghost_tsmttsm_cu_rm_fallback() <tsmttsm_var2_cuvar_var.cu:117>: CUDA Error: invalid argument (11)
    11: [GHOST] PE1 ERROR at ghost_tsmttsmu_cuda_x_x_x_1_rm() <tsmttsm_var2_cuvar_var.cu:177>: CUDA Error: invalid argument (11)
    11: PE1: Error code -1 (Error in CUDA) returned from call gemm_err

  5. Dominik Ernst

    Thank you, phist builds fine now. I reproduced and fixed two bugs. The phist kernel tests fail to run through with an unrelated error message, though:

    SSparseMatFusedTest_speye_25_1:

    [GHOST] PE0 ERROR at ghost_bincrs_header_read() <bincrs_func.c:167>: Could not open binary CRS file Sspeye25.bin: No such file or directory
    

    which looks like I forgot something when building.

  6. Jonas Thies reporter

    The following error still occurs, but only with ghost in Debug mode:

    env GHOST_TYPE=cuda ./phist-1.4.2-kernels-test-Debug --gtest_filter=CMvecSdMatFusedTestWithAlignedViews_10_1_4.fused_mvsd_mvTmv

    Output:

    [GHOST] PE0 INFO at ghost_gemm() <gemm.c:526>: Transparently call special implementation TSMM
    [GHOST] PE0 INFO at ghost_tsmm() <tsmm.cpp:320>: Try: xcols=4, vcols=1, impl=CUDA, unaligned, unroll=1, dt=any, multipleof=1, storage=Row-major
    [GHOST] PE0 INFO at ghost_tsmm() <tsmm.cpp:362>: Potentially non-optimal: xcols=arbitrary, vcols=arbitrary, impl=CUDA, unaligned, unroll=1, dt=any, multipleof=1, storage=Row-major
    [GHOST] PE0 INFO at ghost_gemm() <gemm.c:518>: Transparently call special implementation TSMTTSM
    [GHOST] PE0 INFO at ghost_tsmttsm() <tsmttsm.cpp:334>: Try wstor=Row-major, wcols=4, vcols=4, impl=CUDA, unaligned, unroll=1, dt=any
    [GHOST] PE0 INFO at ghost_tsmttsm() <tsmttsm.cpp:363>: Found kernel with highest specialization grade: wstor=Row-major, wcols=4, vcols=4, impl=CUDA, unaligned, unroll=1, dt=any
    [GHOST] PE0 ERROR at ghost_cu_download2d() <cu_util.c:178>: CUDA Error: an illegal memory access was encountered (77)
    [GHOST] PE0 ERROR at ghost_cu_upload2d() <cu_util.c:159>: CUDA Error: an illegal memory access was encountered (77)
    [GHOST] PE0 ERROR at ghost_cu_barrier() <cu_util.c:228>: CUDA Error: an illegal memory access was encountered (77)
    [GHOST] PE0 ERROR at ghost_tsmmu_cuda_x_x_x_1_1_rm() <tsmm_var2_cuvar_var.cu:101>: CUDA Error: an illegal memory access was encountered (77)
    PE0: Error code -1 (Error in CUDA) returned from call ghost_gemm(W,V,(char)"N",C,(char)"N",(void)&alpha,(void)&beta,GHOST_GEMM_NO_REDUCE,GHOST_GEMM_DEFAULT) (file /home/thie_jo/essex/phist/src/kernels/ghost/kernels_def.hpp, line 1803)
    PE0: Error code -1 (functional error) returned from call (file /home/thie_jo/essex/phist/src/tools/phist_tasks.h, line 167)
    PE0: Error code -1 (functional error) returned from call (void)*(iflag) (file /home/thie_jo/essex/phist/src/kernels/ghost/kernels_def.hpp, line 1804)
    /home/thie_jo/essex/phist/test/kernels/MvecSdMatFusedTest_def.hpp:326: Failure
    Value of: iflag_
    Actual: -1
    Expected: 0
