- changed status to resolved
ssyevd and dsyevd fail for N > 92672 with possible fix
Hi,
I’ve posted about this in the MAGMA forums and received a good bit of help from Stan (who has also replicated the issue). I have successfully compiled MAGMA with 64-bit integer support (with both OpenBLAS and MKL; the report here uses the MKL build). When I run the following, I get an error claiming that the GPU is out of memory. I’m using four NVIDIA A100s, each with 40 GB, on a machine with 256 GB of RAM.
$ testing/testing_ssyevd -N 92673 -JV --ngpu 4
% MAGMA 2.6.1 64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32.
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Tue Sep 28 13:49:21 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]
% jobz = Vectors needed, uplo = Lower, ngpu = 4
% N CPU Time (sec) GPU Time (sec) |S-S_magma| |A-USU^H| |I-U^H U|
%============================================================================
magma_ssyevd returned error -113: cannot allocate memory on GPU device.
92673 --- 210.2290 --- --- --- ok
However, both the command-line tool nvidia-smi and calls to cuMemGetInfo in the code show that only about 10 GB is actually in use on each GPU. Running testing_dsyevd with the same parameters hits the same issue, though in that case about 20 GB is used per GPU.
This issue matters quite a bit to me, as my current systems of interest have N ~ 95000 and N ~ 110000, so I did some work to try to track it down.
I tracked this sequence of function calls:
magma_ssyevd_m (line 280) --> magma_ssytrd_mgpu (line 389) --> magma_slatrd_mgpu (line 439) --> magmablas_ssymv_mgpu_sync (line 877) --> magma_queue_sync (line 1240) --> cudaStreamSynchronize
The call to cudaStreamSynchronize returns cudaErrorIllegalAddress (err = 700), which apparently typically indicates earlier corruption of the CUDA context. I searched backwards through the call chain, inserting calls to cudaStreamSynchronize, until I found the place where the error first occurs.
The error seems to appear immediately after the call to ssymv_kernel_L_mgpu from magmablas_ssymv_mgpu (at line 774 in ssymv_mgpu.cu). Looking at ssymv_kernel_L_mgpu, I can see that n and lda are declared as int. As a crude first attempt at a fix, I changed every int in that function to magma_int_t.
This seems to work!
$ testing/testing_ssyevd -N 93000 -JV --ngpu 4
% MAGMA 2.6.1 64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32.
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Mon Oct 4 13:23:37 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]
% jobz = Vectors needed, uplo = Lower, ngpu = 4
% N CPU Time (sec) GPU Time (sec) |S-S_magma| |A-USU^H| |I-U^H U|
%============================================================================
93000 --- 340.3685 --- --- --- ok
I have not yet tried this for double precision, nor have I made any attempt to check the correctness of the results.
I hope that this is helpful!
Cheers,
tom
Comments (1)
Hi Tom,
We are making a sweep over the lingering issues in MAGMA. This one should be resolved as of commit 345beb7. Please let us know if that is not the case.
Thank you for suggesting the fix.