ssyevd and dsyevd fail for N > 92672 with possible fix

Issue #53 resolved
Tom Carroll created an issue

Hi,

I’ve posted about this in the MAGMA forums and gotten a good bit of help from Stan (who has also replicated this issue). I have successfully compiled MAGMA with 64-bit integer support (with both OpenBLAS and MKL; my report here uses the MKL build). When I run the following, I get an error claiming that the GPU is out of memory. I’m using four NVIDIA A100s, each with 40 GB, on a machine with 256 GB of RAM.

$ testing/testing_ssyevd -N 92673 -JV --ngpu 4
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Tue Sep 28 13:49:21 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]
% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
magma_ssyevd returned error -113: cannot allocate memory on GPU device.
92673      ---            210.2290           ---           ---         ---      ok

However, both the command-line tool nvidia-smi and calls to cuMemGetInfo in the code show that only about 10 GB is actually in use on each GPU. Running testing_dsyevd with the same parameters hits the same error, though in that case about 20 GB is used per GPU.

This issue is quite important to me, as my current systems of interest have N ~ 95000 and N ~ 110000, so I did some work to track it down.

I tracked this sequence of function calls:

magma_ssyevd_m (line 280) --> magma_ssytrd_mgpu (line 389) --> magma_slatrd_mgpu (line 439) --> magmablas_ssymv_mgpu_sync (line 877) --> magma_queue_sync (line 1240) --> cudaStreamSynchronize

The call to cudaStreamSynchronize returns cudaErrorIllegalAddress (err = 700). Apparently this typically indicates earlier corruption of the CUDA context; since kernel launches are asynchronous, the error surfaces at the next synchronization point rather than at the faulting kernel itself. I searched backwards through the function calls, inserting calls to cudaStreamSynchronize, until I found the place where the error first occurs.

The error seems to appear immediately after the call to ssymv_kernel_L_mgpu from magmablas_ssymv_mgpu (at line 774 in ssymv_mgpu.cu). Taking a look at ssymv_kernel_L_mgpu, I can see that n and lda are declared as int. As a crude attempt at solving this, I changed every int in this function to magma_int_t.

This seems to work!

$ testing/testing_ssyevd -N 93000 -JV --ngpu 4 
% MAGMA 2.6.1  64-bit magma_int_t, 64-bit pointer.
Compiled with CUDA support for 8.0
% CUDA runtime 11030, driver 11030. OpenMP threads 32. MKL 2021.0.3, MKL threads 32. 
% device 0: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 1: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 2: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% device 3: NVIDIA A100-PCIE-40GB, 1410.0 MHz clock, 40536.2 MiB memory, capability 8.0
% Mon Oct  4 13:23:37 2021
% Usage: testing/testing_ssyevd [options] [-h|--help]

% jobz = Vectors needed, uplo = Lower, ngpu = 4
%   N   CPU Time (sec)   GPU Time (sec)   |S-S_magma|   |A-USU^H|   |I-U^H U|
%============================================================================
93000      ---            340.3685           ---           ---         ---      ok

I have not yet tried this fix for double precision, nor have I checked the results for correctness.

I hope that this is helpful!

Cheers,

tom

Comments (1)

  1. Ahmad Abdelfattah

    Hi Tom,
    We are making a sweep over the lingering issues in MAGMA. This one should be resolved as of 345beb7. Please let us know if the problem persists.

    Thank you for suggesting the fix.
