getri_outofplace_batched fails when batchCount >= 65536

Issue #10 resolved
Vishwak S created an issue

Hi,

I am using MAGMA's batched getri operation to compute batched inverses, but it seems to fail when the batch count is greater than or equal to 65536.

Below are the outputs from the tests:

Single Precision:

% MAGMA 2.3.0  compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 9010. OpenMP threads 40. 
% device 0: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 2: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 3: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Nov  3 11:13:45 2018
% Usage: ./testing/testing_sgetri_batched [options] [-h|--help]

% batchCount   N    CPU Gflop/s (ms)    GPU Gflop/s (ms)   ||I - A*A^{-1}||_1 / (N*cond(A))
%===============================================================================
     65535     2     ---   (  ---  )      0.03 (  41.40)   6.15e-08   ok
     65536     2     ---   (  ---  )      0.03 (  32.24)   1.68e+07   failed
     68523     2     ---   (  ---  )      0.03 (  43.34)   1.68e+07   failed

Double Precision:

% MAGMA 2.3.0  compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 9010. OpenMP threads 40. 
% device 0: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 2: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 3: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Nov  3 11:15:12 2018
% Usage: ./testing/testing_dgetri_batched [options] [-h|--help]

% batchCount   N    CPU Gflop/s (ms)    GPU Gflop/s (ms)   ||I - A*A^{-1}||_1 / (N*cond(A))
%===============================================================================
     65535     2     ---   (  ---  )      0.01 (  81.26)   1.14e-16   ok
     65536     2     ---   (  ---  )      0.02 (  58.56)   9.01e+15   failed
     68523     2     ---   (  ---  )      0.02 (  66.03)   9.01e+15   failed

I passed the option --matrix rand_dominant to ensure that the random matrices generated are not singular by chance.

It would be great if you could provide a solution for this issue or indicate if this is expected behavior. Thank you.

Comments (4)

  1. Mark Gates

    Not intended, but not surprising: CUDA limits the y and z grid dimensions of a kernel launch to 65535. The easiest fix would be to use the x grid dimension for the batch count.
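
Until a fix along those lines is in the library, a caller-side workaround is to split a large batch into pieces of at most 65535 and issue one batched call per piece. This is not the in-kernel fix suggested above, only a rough sketch of the chunking arithmetic. The names `for_each_chunk` and `kMaxGridYZ` are made up for illustration, and the `launch` callback is a hypothetical stand-in for whatever call the application makes (for example, a batched getri call on pointer arrays advanced by `offset`).

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Limit on the y and z grid dimensions of a CUDA kernel launch,
// as stated in the comment above.
constexpr int kMaxGridYZ = 65535;

// Walks the range [0, batchCount) in pieces of at most maxChunk and calls
// launch(offset, count) once per piece. Returns the number of calls made.
//
// `launch` is a hypothetical placeholder: in real code it would invoke the
// batched routine on `count` matrices starting at index `offset`.
int for_each_chunk(int batchCount, int maxChunk,
                   const std::function<void(int, int)>& launch)
{
    assert(maxChunk > 0);  // a non-positive chunk size would never advance
    int launches = 0;
    for (int offset = 0; offset < batchCount; offset += maxChunk) {
        // The last piece may be shorter than maxChunk.
        const int count = std::min(maxChunk, batchCount - offset);
        launch(offset, count);
        ++launches;
    }
    return launches;
}
```

For the failing case batchCount = 68523 from the report, `for_each_chunk(68523, kMaxGridYZ, launch)` invokes the callback twice, with (offset, count) equal to (0, 65535) and (65535, 2988), so each piece stays within the limit.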
