getri_outofplace_batched fails when batchCount is >= 65536
Issue #10
resolved
Hi,
I am using MAGMA's batched getri operation for batched matrix inversion, but it seems to fail when the batch count is greater than or equal to 65536.
Below are the outputs from the tests:
Single Precision:
% MAGMA 2.3.0 compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 9010. OpenMP threads 40.
% device 0: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 2: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 3: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Nov 3 11:13:45 2018
% Usage: ./testing/testing_sgetri_batched [options] [-h|--help]
% batchCount N CPU Gflop/s (ms) GPU Gflop/s (ms) ||I - A*A^{-1}||_1 / (N*cond(A))
%===============================================================================
65535 2 --- ( --- ) 0.03 ( 41.40) 6.15e-08 ok
65536 2 --- ( --- ) 0.03 ( 32.24) 1.68e+07 failed
68523 2 --- ( --- ) 0.03 ( 43.34) 1.68e+07 failed
Double Precision:
% MAGMA 2.3.0 compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 9010. OpenMP threads 40.
% device 0: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 2: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% device 3: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Nov 3 11:15:12 2018
% Usage: ./testing/testing_dgetri_batched [options] [-h|--help]
% batchCount N CPU Gflop/s (ms) GPU Gflop/s (ms) ||I - A*A^{-1}||_1 / (N*cond(A))
%===============================================================================
65535 2 --- ( --- ) 0.01 ( 81.26) 1.14e-16 ok
65536 2 --- ( --- ) 0.02 ( 58.56) 9.01e+15 failed
68523 2 --- ( --- ) 0.02 ( 66.03) 9.01e+15 failed
I passed the option --matrix rand_dominant to ensure that the randomly generated matrices are not singular by chance.
It would be great if you could provide a solution for this issue or indicate if this is expected behavior. Thank you.
Comments (4)
All batch routines should no longer fail for batches larger than 65535 (as of 3424178).
- changed status to resolved
Not intended, but not surprising, because CUDA limits the y and z grid dimensions of a kernel launch to 65535. The easiest fix would be to use the x grid dimension for the batch count, since x allows up to 2^31 - 1 blocks on compute capability 3.0 and later.
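For anyone stuck on an affected release, a caller-side workaround is to split a large batch into chunks that each fit the 65535 limit and call the batched routine once per chunk. The sketch below is host-only C++ and only computes the chunk boundaries. The names `split_batch`, `BatchChunk`, and `kMaxGridYZ` are hypothetical, not part of MAGMA, and this is not necessarily how commit 3424178 fixed the problem.

```cpp
#include <algorithm>
#include <vector>

// CUDA caps gridDim.y and gridDim.z at 65535 blocks per launch.
constexpr int kMaxGridYZ = 65535;

// Hypothetical helper type: one sub-batch, described by where it starts
// in the full batch and how many matrices it holds.
struct BatchChunk {
    int offset;  // index of the first matrix in this chunk
    int count;   // number of matrices in this chunk (<= maxChunk)
};

// Hypothetical helper: split [0, batchCount) into consecutive chunks
// that are each small enough to be used as a y or z grid dimension.
std::vector<BatchChunk> split_batch(int batchCount, int maxChunk = kMaxGridYZ) {
    std::vector<BatchChunk> chunks;
    for (int offset = 0; offset < batchCount; offset += maxChunk) {
        const int count = std::min(maxChunk, batchCount - offset);
        chunks.push_back({offset, count});
    }
    return chunks;
}
```

With the largest failing case from the report, `split_batch(68523)` yields two chunks: offset 0 with 65535 matrices and offset 65535 with 2988 matrices. A caller would then advance the device pointer arrays (and the pivot and info arrays) by `offset` and pass `count` as the batch count for each call.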