getrf_batched kernel produces NaNs on singular square inputs of size <=32

[reposted from MAGMA forum https://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=4035]

The subject line summarizes the issue. Here is a reproducer based on PyTorch:

>>> import torch
>>> m, n = 3, 3
>>> torch.ones(1, m, n, device='cuda').lu()
(tensor([[[1., 1., 1.],
         [1., 0., 0.],
         [1., nan, nan]]], device='cuda:0'), tensor([[1, 2, 3]], device='cuda:0', dtype=torch.int32))

Notice that the nan entries appear only when m == n and m <= 32. In all other cases, getrf_batched works correctly.

The source of this issue is likely in the kernel functions implemented in magmablas/zgetrf_batched_smallsq_shfl.cu and magmablas/zgetrf_batched_smallsq_noshfl.cu.
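
To illustrate one way such NaNs could arise, here is a minimal pure-Python sketch of LU with partial pivoting. It is an assumption about the failure mode, not something verified against the MAGMA kernels: the names ieee_div, lu_partial_pivot and guard_zero_pivot are hypothetical and exist only for this example. With the guard off, the column below a zero pivot is scaled anyway, so 0/0 produces NaN. With the guard on, the scaling is skipped when the pivot is exactly zero.

```python
import math


def ieee_div(x, y):
    """Divide as IEEE-754 hardware would: 0/0 gives NaN, x/0 gives +/-inf.

    Plain Python raises ZeroDivisionError instead, so this helper emulates
    what a GPU kernel would compute.
    """
    if y == 0.0:
        if x == 0.0 or math.isnan(x):
            return math.nan
        return math.copysign(math.inf, x) * math.copysign(1.0, y)
    return x / y


def lu_partial_pivot(a, guard_zero_pivot):
    """LU factorization with partial pivoting of a square matrix (list of rows).

    Returns (lu, pivots): L is stored below the diagonal (unit diagonal
    implied), U on and above it, and pivots are 1-based row indices.
    guard_zero_pivot is a hypothetical switch for this illustration only.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]
    pivots = []
    for k in range(n):
        # Pick the row with the largest magnitude in column k (first on ties).
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        pivots.append(p + 1)
        lu[k], lu[p] = lu[p], lu[k]
        pivot = lu[k][k]
        if guard_zero_pivot and pivot == 0.0:
            # Singular column: everything below the pivot is also zero,
            # so skip the scaling and the (no-op) trailing update.
            continue
        for i in range(k + 1, n):
            lu[i][k] = ieee_div(lu[i][k], pivot)  # 0/0 -> NaN when unguarded
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, pivots


ones = [[1.0] * 3 for _ in range(3)]
print(lu_partial_pivot(ones, guard_zero_pivot=False))
print(lu_partial_pivot(ones, guard_zero_pivot=True))
```

On the 3x3 all-ones matrix, the unguarded variant yields the same pattern as the PyTorch output above (NaN in the last two entries of the third row, pivots 1, 2, 3), while the guarded variant leaves zeros there. This only shows that an unchecked pivot division is consistent with the observed result. Whether the small-square kernels actually behave this way would need to be confirmed in the source.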

Best regards,
Pearu
