zgetrf_batched shfl kernel failure seen on pytorch unit tests

Issue #43 resolved
Aswin John Mathews created an issue

A few of the PyTorch unit tests are failing on AMD GPUs with MAGMA installed.

https://github.com/pytorch/pytorch/issues/55552

The issue was traced down the PyTorch stack to MAGMA. The tests are disabled for now: https://github.com/pytorch/pytorch/pull/55534

The PyTorch build installs MAGMA from https://bitbucket.org/icl/magma/src/hipMAGMA/

The issue was root-caused to the function "sgesv_batched", which calls "sgetrf_batched". Specifically, there are two kernels within "sgetrf_batched":

1. sgetrf_batched_smallsq_noshfl 

2. sgetrf_batched_smallsq_shfl

Manually updating zgetrf_batched resolves the issue:

| git diff src/zgetrf_batched.cpp
| diff --git a/src/zgetrf_batched.cpp b/src/zgetrf_batched.cpp
| index 3abbd99f..b552eb3a 100644
| --- a/src/zgetrf_batched.cpp
| +++ b/src/zgetrf_batched.cpp
| @@ -110,13 +110,17 @@ magma_zgetrf_batched(
|
| /* Special case for tiny square matrices */
| if( m == n && m <= 32 ){
|- magma_int_t arch = magma_getdevice_arch();
|- if(arch >= 700){
|- return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|- }
|- else{
|- return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|- }
|+ #ifdef HAVE_CUDA
|+ magma_int_t arch = magma_getdevice_arch();
|+ if(arch >= 700){
|+ return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|+ }
|+ else{
|+ return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|+ }
|+ #elif defined(HAVE_HIP)
|+ return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|+ #endif
| }

For https://github.com/pytorch/pytorch/blob/master/test/test_linalg.py#L3328, the input and output variables at entry and exit of the calls can be seen here:

Opening a PR to disable the _shfl kernel. We (AMD) need this to complete MAGMA support with ROCm.

Does anyone have ideas about why this kernel is failing, or better suggestions?

Comments (13)

  1. Ahmad Abdelfattah

    Hi Aswin,

    Thank you for reporting the issue. The hipMAGMA branch was merged into master a while ago, so the master branch supports both CUDA and HIP builds. I apologize for the confusion caused by that branch still being there.

    The master branch already has the fix you suggested in the PR. Please try testing PyTorch with MAGMA installed from master and let us know.

    The *getrf_batched_smallsq_shfl kernel fails on AMD GPUs because they do not support the warp-shuffle operations that exist on NVIDIA GPUs. This is why the kernel must be disabled for the HIP backend, as you suggested.

  2. Cade Brown

    We should probably either remove or archive the hipMAGMA branch so people do not try to use it anymore.

  3. Aswin John Mathews reporter

    Thanks for the info.

    We are testing out the master branch now: there seems to be one test failure, apparently due to a free() operation on an invalid pointer. We are triaging this issue.

    Cade, could you hold off on changing the hipMAGMA branch until we get everything resolved on our end?

    Thanks,

    Aswin.

  4. Cade Brown

    Sure, but that is an old branch. We may rename it to hipMAGMA-old so that it is clear it shouldn’t be used (as you may have discovered, there are some bugs in there!), while the history is preserved.

  5. Aswin John Mathews reporter

    okay.

    I have identified the possible issue with the master branch.

    https://bitbucket.org/icl/magma/src/6e3a460f9badffb1c391b2d34bb1477ad8eb7367/interface_cuda/interface.cpp#lines-1193

    queue->ptrArray__ is not NULL, which causes magma_free() on an indeterminate pointer, since it is never initialized for HAVE_HIP at https://bitbucket.org/icl/magma/src/6e3a460f9badffb1c391b2d34bb1477ad8eb7367/interface_cuda/interface.cpp#lines-910

    What do you think? We can close this PR and open a new one.

  6. Jeff Daily

    @Cade Brown please don’t rename the hipMAGMA branch. At least for the moment, we baked that branch name into our PyTorch CI image and our manylinux-rocm wheel image. If you renamed it, our images would start failing to build.

    We are definitely moving to the master branch for magma, we just aren’t there yet. We can let you all know as soon as we’re done with the hipMAGMA branch, if that works for you.

  7. Aswin John Mathews reporter

    Hello @Cade Brown, the issue still exists on one of the PyTorch tests, although many of the previously failing tests now pass.

    https://bitbucket.org/icl/magma/src/4c2588119ee636999b230b248fca3e3824a23be9/interface_cuda/interface.cpp#lines-1100

    must have

    queue->ptrArray__ = NULL;

    because

    https://bitbucket.org/icl/magma/src/4c2588119ee636999b230b248fca3e3824a23be9/interface_cuda/interface.cpp#lines-1194

    is executed for HAVE_HIP.

    New PR: https://bitbucket.org/icl/magma/pull-requests/9/fixes-issue-in-hip-with-uninitialized