`magmablas_dgemm_vbatched` fails for `batchCount >= 65536`

Issue #14 resolved
Trey White created an issue

I am using `magmablas_dgemm_vbatched` on NVIDIA V100 GPUs with the mini-app https://github.com/justinlietz/small_dgemms, which launches many small dgemms. When I run with `batchCount >= 65536`, the call to `magmablas_dgemm_vbatched` produces incorrect results.

The source code appears to be intended to support larger values of `batchCount`, but I believe it has bugs. I was able to get correct results by making the following changes.

  1. Change the magic number 65536 to 65535 in `gemm_template_kernel_vbatched.cuh`. This value is used to limit the Z dimension of the CUDA block grid, and the actual hardware limit on `gridDim.z` is 65535, not 65536.
  2. Add the offset `batch_starting_id` to more of the arguments at line 236 of `gemm_template_kernel_vbatched.cuh`. The existing code adds the offset only to `dA_array`, `dB_array`, and `dC_array`, but it should be added to all of the per-batch array arguments (`m`, `n`, `k`, `ldda`, `lddb`, and `lddc` as well). Here is the modified launch that produces correct results for the small_dgemms mini-app:

```
gemm_template_vbatched_nn_kernel<T, DIM_X, DIM_Y, BLK_M, BLK_N, BLK_K, DIM_XA, DIM_YA, DIM_XB, DIM_YB, CONJA, CONJB>
<<<dimGrid, dimBlock, 0, queue->cuda_stream()>>>
    ( m + batch_starting_id, n + batch_starting_id, k + batch_starting_id,
      dA_array + batch_starting_id, ldda + batch_starting_id,
      dB_array + batch_starting_id, lddb + batch_starting_id,
      dC_array + batch_starting_id, lddc + batch_starting_id,
      alpha, beta,
      roffA, coffA, roffB, coffB, roffC, coffC,
      specM, specN, specK );
```

I think the other versions of this call (nt, tn, tt) need similar corrections.
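To make the intent of both fixes concrete, here is a minimal host-side sketch (not MAGMA's actual code; `chunk_starts` and `kMaxGridZ` are hypothetical names) of how a large batch would be split into launches whose Z grid dimension stays within the 65535 limit, with every per-batch array advanced by the same `batch_starting_id`:

```cpp
#include <vector>

// Hypothetical illustration of the chunking logic, assuming the CUDA
// hardware limit gridDim.z <= 65535 (the source of the 65536 bug).
constexpr int kMaxGridZ = 65535;

// Returns the batch_starting_id of each chunked kernel launch.
// In the real code, each launch would offset *all* per-batch arrays
// (m, n, k, ldda, lddb, lddc, dA_array, dB_array, dC_array) by this id,
// not just the three pointer arrays.
std::vector<int> chunk_starts(int batchCount, int maxGridZ = kMaxGridZ) {
    std::vector<int> starts;
    for (int batch_starting_id = 0; batch_starting_id < batchCount;
         batch_starting_id += maxGridZ) {
        starts.push_back(batch_starting_id);
        // A launch for this chunk would look roughly like:
        //   int this_batch = std::min(maxGridZ, batchCount - batch_starting_id);
        //   dimGrid.z = this_batch;
        //   kernel<<<dimGrid, dimBlock, 0, stream>>>(
        //       m + batch_starting_id, n + batch_starting_id, ...);
    }
    return starts;
}
```

With `batchCount = 65536` this yields two launches starting at ids 0 and 65535, which is why any per-batch array that is not offset by `batch_starting_id` is read with wrong indices in the second launch.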

Comments (5)

  1. Ahmad Abdelfattah

    Thanks for reporting the problem. The issue should be fixed now. I did not do comprehensive testing, but the failing tests I tried are no longer failing. Please let us know if you are still having issues with this kernel.
