ssyevd and dsyevd fail for N > 92672 with possible fix

Issue #53 resolved
Tom Carroll created an issue

Hi,

I’ve posted about this in the MAGMA forums and gotten a good bit of help from Stan (who has also replicated this issue). I have successfully compiled MAGMA with 64-bit integer support (with both OpenBLAS and MKL; my report here uses the MKL build). When I run the following, I get an error claiming that the GPU is out of memory. I’m using four NVIDIA A100s, each with 40 GB, on a machine with 256 GB of RAM.

$ testing/testing_ssyevd -N 92673 -JV --ngpu 4
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Tue Sep 28 13:49:21 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]
% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
magma_ssyevd returned error -113: cannot allocate memory on GPU device.
92673      ---            210.2290           ---           ---         ---      ok

However, both the command-line tool nvidia-smi and calls to cuMemGetInfo in the code show that only about 10 GB is actually in use on each GPU. Running testing_dsyevd with the same parameters hits the same error, though in that case about 20 GB is used per GPU.

This issue is quite important to me, as my current systems of interest have N ~ 95000 and N ~ 110000, so I did some work to track it down.

I tracked this sequence of function calls:

magma_ssyevd_m (line 280) --> magma_ssytrd_mgpu (line 389) --> magma_slatrd_mgpu (line 439) --> magmablas_ssymv_mgpu_sync (line 877) --> magma_queue_sync (line 1240) --> cudaStreamSynchronize

The call to cudaStreamSynchronize returns cudaErrorIllegalAddress (err = 700). Apparently this typically indicates earlier corruption of the CUDA context; since kernel launches are asynchronous, the error surfaces at the next synchronization point rather than at the faulting kernel itself. I searched backwards through the function calls, inserting calls to cudaStreamSynchronize, until I found the place where the error first occurs.

The error seems to appear immediately after the call to ssymv_kernel_L_mgpu from magmablas_ssymv_mgpu (at line 774 in ssymv_mgpu.cu). Taking a look at ssymv_kernel_L_mgpu, I can see that n and lda are declared as int. As a crude attempt at solving this, I changed every int in this function to magma_int_t.

This seems to work!

$ testing/testing_ssyevd -N 93000 -JV --ngpu 4 
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Mon Oct  4 13:23:37 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
93000      ---            340.3685           ---           ---         ---      ok

I have not yet tried this fix for double precision, nor have I checked the results for correctness.

I hope that this is helpful!

Cheers,

tom

Comments (1)

  1. Ahmad Abdelfattah

    Hi Tom,
    We are making a sweep over the lingering issues in MAGMA. This one should be resolved as of 345beb7. Please let us know if the problem persists.

    Thank you for suggesting the fix.
