zgetrf_batched shfl kernel failure seen on pytorch unit tests

Issue #43 resolved
Aswin John Mathews created an issue

A few of the PyTorch unit tests are failing on AMD GPUs with MAGMA installed.

https://github.com/pytorch/pytorch/issues/55552

The issue was traced down the PyTorch stack to MAGMA. The tests are disabled for now: https://github.com/pytorch/pytorch/pull/55534

The PyTorch build installs MAGMA from https://bitbucket.org/icl/magma/src/hipMAGMA/

The issue was root-caused to the function "sgesv_batched", which calls "sgetrf_batched". Specifically, there are two kernels within "sgetrf_batched":

1. sgetrf_batched_smallsq_noshfl 

2. sgetrf_batched_smallsq_shfl

Manually updating zgetrf_batched resolves the issue:

| git diff src/zgetrf_batched.cpp
| diff --git a/src/zgetrf_batched.cpp b/src/zgetrf_batched.cpp
| index 3abbd99f..b552eb3a 100644
| --- a/src/zgetrf_batched.cpp
| +++ b/src/zgetrf_batched.cpp
| @@ -110,13 +110,17 @@ magma_zgetrf_batched(
|
| /* Special case for tiny square matrices */
| if( m == n && m <= 32 ){
|- magma_int_t arch = magma_getdevice_arch();
|- if(arch >= 700){
|- return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|- }
|- else{
|- return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|- }
|+ #ifdef HAVE_CUDA
|+ magma_int_t arch = magma_getdevice_arch();
|+ if(arch >= 700){
|+ return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|+ }
|+ else{
|+ return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|+ }
|+ #elif defined(HAVE_HIP)
|+ return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
|+ #endif
| }

For https://github.com/pytorch/pytorch/blob/master/test/test_linalg.py#L3328, the input and output variables at entry and exit of the calls can be seen here:

Opening a PR to disable the _shfl kernel. We (AMD) need this to complete MAGMA support with ROCm.

Does anyone have ideas about why this kernel is failing, or better suggestions?

Comments (13)

  1. Ahmad Abdelfattah

    Hi Aswin,

    Thank you for reporting the issue. The hipMAGMA branch was merged into master a while ago, so the master branch supports both CUDA and HIP builds. I apologize for the confusion caused by that branch still being there.

    The master branch already has the fix you suggested in the PR. Please try testing PyTorch with MAGMA installed from master and let us know.

    The *getrf_batched_smallsq_shfl kernel fails on AMD GPUs because they do not support the warp-shuffle operations that exist on NVIDIA GPUs. This is why the kernel must be disabled for the HIP backend, as you suggested.

  2. Cade Brown

    We should probably either remove or archive the hipMAGMA branch so people do not try to use it anymore.

  3. Aswin John Mathews reporter

    Thanks for the info.

    We are testing out the master branch now: there seems to be one test failure, apparently due to a free() operation on an invalid pointer. We are triaging this issue.

    Cade, could you hold off on changing the hipMAGMA branch until we get everything resolved on our end?

    Thanks,

    Aswin.

  4. Cade Brown

    Sure, but that is an old branch. We may rename it to hipMAGMA-old so that it is clear it shouldn’t be used (as you may have discovered, there are some bugs in there!), while the history is preserved.

  5. Aswin John Mathews reporter

    okay.

    I have identified the possible issue with the master branch.

    https://bitbucket.org/icl/magma/src/6e3a460f9badffb1c391b2d34bb1477ad8eb7367/interface_cuda/interface.cpp#lines-1193

    queue->ptrArray__ is not NULL, which causes magma_free() on an indeterminate pointer, since it is never initialized for HAVE_HIP at https://bitbucket.org/icl/magma/src/6e3a460f9badffb1c391b2d34bb1477ad8eb7367/interface_cuda/interface.cpp#lines-910

    What do you think? We can close this PR and open a new one.

  6. Jeff Daily

    @Cade Brown please don’t rename the hipMAGMA branch. At least for the moment, we baked that branch name into our PyTorch CI image and our manylinux-rocm wheel image. If you renamed it, our images would start failing to build.

    We are definitely moving to the master branch for magma, we just aren’t there yet. We can let you all know as soon as we’re done with the hipMAGMA branch, if that works for you.

  7. Aswin John Mathews reporter

    Hello @Cade Brown, the issue still exists on one of the PyTorch tests, although many of the previously failing tests now pass.

    https://bitbucket.org/icl/magma/src/4c2588119ee636999b230b248fca3e3824a23be9/interface_cuda/interface.cpp#lines-1100

    must have

    queue->ptrArray__ = NULL;

    because

    https://bitbucket.org/icl/magma/src/4c2588119ee636999b230b248fca3e3824a23be9/interface_cuda/interface.cpp#lines-1194

    is executed for HAVE_HIP.

    New PR: https://bitbucket.org/icl/magma/pull-requests/9/fixes-issue-in-hip-with-uninitialized