I am using magmablas_dgemm_vbatched on NVIDIA V100 GPUs with the mini-app https://github.com/justinlietz/small_dgemms, which calls many small dgemms. When I run with batchCount >= 65536, the call to magmablas_dgemm_vbatched gives incorrect results.
The source code appears to be intended to support larger values of batchCount, but I think it has bugs. I was able to get correct results by making the following changes.
- Change the magic number in gemm_template_kernel_vbatched.cuh that is used to limit the Z dimension of the CUDA grid. The hardware limit on that dimension is 65535, which is smaller than the value the code uses.
- Add the offset batch_starting_id to more of the arguments at line 236 of file gemm_template_kernel_vbatched.cuh. The existing code adds the offset only to some of the arguments, including dC_array, but it should be added to all the array arguments. Here is the modified line that is generating correct results for the nn case:
    gemm_template_vbatched_nn_kernel<T, DIM_X, DIM_Y, BLK_M, BLK_N, BLK_K, DIM_XA, DIM_YA, DIM_XB, DIM_YB, CONJA, CONJB>
        <<<dimGrid, dimBlock, 0, queue->cuda_stream()>>>(
            m+batch_starting_id, n+batch_starting_id, k+batch_starting_id,
            dA_array+batch_starting_id, ldda+batch_starting_id,
            dB_array+batch_starting_id, lddb+batch_starting_id,
            dC_array+batch_starting_id, lddc+batch_starting_id,
            alpha, beta, roffA, coffA, roffB, coffB, roffC, coffC, specM, specN, specK);
I think the other versions of this call (such as tt) need similar corrections.
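
To illustrate why I think every per-problem array needs the offset, here is a small host-only C++ sketch of the chunked launch pattern. It is not MAGMA code: fake_vbatched_kernel, run_chunked and count_mismatches are hypothetical names, the "kernel" is an ordinary loop standing in for blocks indexed by blockIdx.z, and the chunk size assumes the 65535 limit on the grid Z dimension. The size array m plays the role of any per-problem input (m, n, k, ldda, dA_array, and so on), and C plays the role of dC_array.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CUDA caps gridDim.z at 65535, so larger batches need several launches.
constexpr int kMaxGridZ = 65535;

// Host-side stand-in for one kernel launch (hypothetical, not MAGMA code).
// "Block" z handles problem z of the chunk: it reads that problem's size
// from m[z] and writes it through C[z], like a vbatched kernel indexing
// its per-problem arrays with blockIdx.z.
void fake_vbatched_kernel(int grid_z, const int* m, int* const* C) {
    assert(grid_z <= kMaxGridZ);  // a real launch would fail beyond this
    for (int z = 0; z < grid_z; ++z) {
        *C[z] = m[z];
    }
}

// Runs the whole batch in chunks. With offset_all == false only the output
// pointer array is advanced per chunk, mimicking the suspected bug.
std::vector<int> run_chunked(const std::vector<int>& m, bool offset_all) {
    const int batchCount = static_cast<int>(m.size());
    std::vector<int> out(batchCount, -1);
    std::vector<int*> C(batchCount);
    for (int i = 0; i < batchCount; ++i) C[i] = &out[i];

    for (int start = 0; start < batchCount; start += kMaxGridZ) {
        const int chunk = std::min(kMaxGridZ, batchCount - start);
        const int* m_chunk = offset_all ? m.data() + start : m.data();
        fake_vbatched_kernel(chunk, m_chunk, C.data() + start);
    }
    return out;
}

// Counts problems whose result does not match their own size entry.
int count_mismatches(int batchCount, bool offset_all) {
    std::vector<int> m(batchCount);
    for (int i = 0; i < batchCount; ++i) m[i] = i % 7 + 1;
    const std::vector<int> out = run_chunked(m, offset_all);
    int bad = 0;
    for (int i = 0; i < batchCount; ++i) bad += (out[i] != m[i]);
    return bad;
}
```

In this sketch, count_mismatches(65535, false) reports no mismatches because everything fits in one chunk, while count_mismatches(65536, false) reports at least one, since the second chunk reads the sizes of the first. With offset_all set to true the mismatches go away, which matches what I see with the modified line above.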