magma_sgesv_gpu returned error -112: cannot allocate memory on CPU host.

Issue #35 resolved
Edgar Weckert created an issue

Hi, I’m not sure whether this is the right place to post this issue. While testing how stable MAGMA (version 2.5.4) is for larger matrices of 32-bit floats, I encountered an error for matrices larger than N=46336. It seems that the allocation of the work array as pinned host RAM in sgetrf_gpu (exactly 181 MB for the largest working dimension) comes back with an error. The error also occurs for magma_cgesv. It is independent of the compiler (gcc-9.30, icc-19.1.3.304), CUDA toolkit version (10.2, 11.2), CUDA driver version (455.45.01, 460.27.04), Linux kernel version (3.10.0, 4.19.0, 5.4.0), and GPU (RTX 3090, Tesla K40m, GTX 1080 Ti). Allocating the pinned RAM earlier in the [sc]getrf_gpu routine (before the other malloc calls) lets the pinned allocation succeed, but a failure occurs later (info=1). There is enough host and GPU RAM available for these problem sizes (the c-routine was only used on the RTX 3090), and larger pinned allocations are possible in other contexts.

Any ideas how to solve this problem?

Best, Edgar

P.S. Sample outputs:
largest working case (testing_cgesv_gpu):
% MAGMA 2.5.4 compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 11020, driver 11020. OpenMP threads 18. MKL 2020.0.4, MKL threads 18.
% device 0: GeForce RTX 3090, 1800.0 MHz clock, 24260.2 MiB memory, capability 8.6
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Jan 2 16:00:11 2021
% Usage: ./testing_cgesv_gpu [options] [-h|--help]

% N NRHS CPU Gflop/s (sec) GPU Gflop/s (sec) ||B - AX|| / N*||A||*||X||
%===============================================================================
46336 1 3026.08 ( 87.67) 19283.46 ( 13.76) 9.50e-16 ok

Failure for larger dimensions:
% N NRHS CPU Gflop/s (sec) GPU Gflop/s (sec) ||B - AX|| / N*||A||*||X||
%===============================================================================
magma_cgesv_gpu returned error -112: cannot allocate memory on CPU host.
46337 1 3074.35 ( 86.30) 11206393.34 ( 0.02) 0.00e+00 ok

corresponding system message from the driver:
NVRM: Xid (PCI:0000:65:00): 31, pid=173621, Ch 00000041, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7f73_a2007000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Comments (8)

  1. Mark Gates

    For n > sqrt( 2^31 ) = 46340, or so, depending on padding in lda, you would need to compile MAGMA with ILP64 support to use 64-bit integers, and link with an ILP64 BLAS. For example, see make.inc-examples/make.inc.mkl-gcc-ilp64. What BLAS & LAPACK library are you using?
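
The threshold quoted above can be checked with a few lines of 64-bit arithmetic. The sketch below is not MAGMA code: the helper names (`round_up`, `max_square_dim_int32`, `fits_int32`) are hypothetical, and the alignment of 32 used in the note that follows is an assumption about how the leading dimension might be padded.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: round n up to the next multiple of align.
inline std::int64_t round_up(std::int64_t n, std::int64_t align) {
    assert(align > 0);
    return ((n + align - 1) / align) * align;
}

// Largest n for which n*n is still representable in a signed 32-bit integer.
// The search runs in 64-bit arithmetic, so the check itself cannot overflow.
inline std::int64_t max_square_dim_int32() {
    const std::int64_t int32_max = std::numeric_limits<std::int32_t>::max();
    std::int64_t n = 0;
    while ((n + 1) * (n + 1) <= int32_max) {
        ++n;
    }
    return n;
}

// True if every element count up to lda*n fits in a signed 32-bit integer.
inline bool fits_int32(std::int64_t lda, std::int64_t n) {
    return lda * n <= std::numeric_limits<std::int32_t>::max();
}
```

With these helpers, `max_square_dim_int32()` gives 46340. If lda is padded to a multiple of 32 (an assumption), N=46336 keeps lda=46336 and fits, while N=46337 is padded to lda=46368, and 46368 x 46337 = 2,148,554,016 exceeds 2^31 - 1 = 2,147,483,647. That would be consistent with the reported limit of N=46336.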

  2. Mark Gates

    Though on closer reading of your post, I don’t know why it would fail to allocate the work array. Based on your reported 181 MiB of 4-byte floats, that would be 46336 x 1024, which should be fine with 32-bit integers. Still, other computations with matrices of that size (n x n) may encounter issues — basically wherever an offset (i + j*lda) is computed that exceeds 2^31.

    All the MAGMA alloc routines already take size_t, which is generally 64-bit, so they shouldn’t have an issue, regardless of whether MAGMA is compiled with or without ILP64. The code that calls an alloc routine could have an issue, for instance:

    int     n32 = 47000;  // 32-bit int
    int64_t n64 = 47000;
    magma_smalloc( &ptr, n32*n32 );          // fails; n*n overflows
    magma_smalloc( &ptr, size_t(n32)*n32 );  // ok, C++ syntax
    magma_smalloc( &ptr, (size_t)n32*n32 );  // ok, C syntax
    magma_smalloc( &ptr, n64*n64 );          // ok
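
The same widening rule applies to the offset computation (i + j*lda) mentioned above, not only to allocation sizes. Below is a minimal, MAGMA-independent sketch; the function names are hypothetical, and it assumes a platform where size_t is 64 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Column-major offset of element (i, j), computed entirely in 64-bit.
// Widening happens at the parameter boundary, before the multiplication.
inline std::int64_t offset64(std::int64_t i, std::int64_t j, std::int64_t lda) {
    assert(i >= 0 && j >= 0 && i < lda);
    return i + j * lda;
}

// Element count for an allocation: widen each operand first, then multiply.
inline std::size_t element_count(int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols);
}

// True if the value would survive being stored in a signed 32-bit integer.
inline bool representable_as_int32(std::int64_t value) {
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}
```

For a 47000 x 47000 matrix, `element_count(47000, 47000)` is 2,209,000,000, and the offset of the last element is not representable as a 32-bit int, so any kernel or helper that stores it in one would read or write the wrong address.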
    

  3. Edgar Weckert reporter

    Hi Mark, I compiled MAGMA with ILP64 support, with both gcc and icc, using the corresponding make.inc files from the example directory. The BLAS and LAPACK libraries were in both cases from Intel MKL ver. 2020.0.4. The LAPACK part of the test also seems to work. Therefore, I do not think this is a BLAS problem, since the first error occurs when trying to allocate pinned RAM on the host in the routine sgetrf_gpu for the array 'work'. This array (~181 MB) is rather small compared to the other arrays needed, but according to my debugging it is the only one that MAGMA allocates as pinned. I tried several other pinned options with no success. If one uses a normal (pageable) malloc call to reserve space for 'work' on the host (which would be slower for exchanging data with the GPU), the malloc is successful but the program fails later on. I also tried placing the pinned-memory malloc call before the other mallocs in sgetrf_gpu; then the pinned allocation is successful, but the test still fails. In other CUDA programs on my system I am able to allocate much larger pinned RAM areas than the one required for 'work' without problems. I hope this information helps, Edgar

  4. Edgar Weckert reporter

    What normally happens in cases like this, e.g. where something still works with N=46336 but not with N=46337, is that you see something strange in the 'size_t size' argument that goes to the malloc routine, which I monitored as well. I can send you screen dumps of that later. There is nothing strange to see here, except that the test or the pinned-malloc call fails. In fact, both cgesv_gpu and sgesv_gpu show the problem at the same problem size, even though cgesv requires twice as much RAM, which indicates that it is not an overflow in one of the other malloc calls.

    Edgar

  5. Ahmad Abdelfattah

    Hi,

    We are making a sweep over the lingering issues in MAGMA. This issue should now be fixed as of 725793b.

    For 32-bit builds with mkl, you should be able to run testing_sgesv_gpu up to 46340, after which the tester fails to allocate/initialize memory due to integer overflow.

    I also tried a 64-bit build, and was able to run testing_sgesv_gpu with a matrix of size 140k.

  6. Edgar Weckert reporter

    Hi Ahmad,

    thanks for looking into that. It works now for me as well. If it was only a bug in the scheduler, it was probably less of an issue for real applications. However, I now get only a little more than half of the Gflop/s performance I had before with exactly the same setup, except for a newer CUDA version. Any idea what the reason for that is?

    Best

    Edgar

  7. Ahmad Abdelfattah

    The bug was actually for some auxiliary kernels where pointer arithmetic would experience an overflow for large sizes.

    For the performance regression, I don’t see it on my system, an A100-SXM4 GPU hosted by an AMD EPYC 7742 CPU running MKL 2023.0.2. The performance for CGESV is around 80% of the peak performance.

    ./testing_cgesv_gpu -N 46336 -c --niter 2
    % MAGMA 2.7.2 svn 32-bit magma_int_t, 64-bit pointer.
    % Compiled with CUDA support for 7.0
    % CUDA runtime 12010, driver 12030. OpenMP threads 256. MKL 2023.0.2, MKL threads 128.
    % device 0: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 1: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 2: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 3: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 4: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 5: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 6: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 7: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % Mon Mar 11 12:35:07 2024
    % Usage: ./testing_cgesv_gpu [options] [-h|--help]
    
    %   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
    %===============================================================================
    46336     1     ---   (  ---  )   15546.15 (  17.07)   2.39e-15   ok
    46336     1     ---   (  ---  )   15555.64 (  17.06)   2.25e-10   ok
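
As a rough cross-check of the Gflop/s column, the leading-order LU flop count can be divided by the reported time. This is only a sketch under stated assumptions: it uses the standard (2/3) n^3 estimate for real LU, counts a complex multiply as 6 real flops and a complex add as 2 (giving (8/3) n^3), and ignores lower-order terms and the triangular solves. The function names are hypothetical, and the tester's exact flop formula may differ slightly.

```cpp
#include <cassert>

// Leading-order flop count of an n x n LU factorization.
// Real: (2/3) n^3. Complex: n^3/3 multiplies at 6 flops plus n^3/3 adds
// at 2 flops, i.e. (8/3) n^3, which is 4x the real count.
inline double lu_flops_leading(double n, bool is_complex) {
    const double real_flops = 2.0 / 3.0 * n * n * n;
    return is_complex ? 4.0 * real_flops : real_flops;
}

// Convert a flop count and a wall-clock time in seconds to Gflop/s.
inline double gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}
```

For N=46336 in single-precision complex, `gflops(lu_flops_leading(46336, true), 17.07)` lands within about 1% of the 15546 Gflop/s reported above, and the same estimate with 13.76 s is similarly close to the 19283 Gflop/s in the original RTX 3090 run.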
    

    So I don’t have a definite answer to why you are experiencing such a slowdown, but one thing you can try is to use a GPU-only factorization if something has changed on your CPU setup. By default, MAGMA uses a hybrid CPU-GPU factorization, which sometimes underperforms depending on the CPU. You can change that by editing line 94 in src/cgesv_gpu.cpp. Simply replace magma_cgetrf_gpu with magma_cgetrf_native (they have the same arguments). Hopefully you should see a better performance.

    On my side, I do get better performance, around 90% of the peak performance:

    % MAGMA 2.7.2 svn 32-bit magma_int_t, 64-bit pointer.
    % Compiled with CUDA support for 7.0
    % CUDA runtime 12010, driver 12030. OpenMP threads 256. MKL 2023.0.2, MKL threads 128. 
    % device 0: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 1: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 2: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 3: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 4: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 5: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 6: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % device 7: NVIDIA A100-SXM4-80GB, 1410.0 MHz clock, 81049.6 MiB memory, capability 8.0
    % Mon Mar 11 12:51:27 2024
    % Usage: ./testing_cgesv_gpu [options] [-h|--help]
    
    %   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
    %===============================================================================
    46336     1     ---   (  ---  )   17536.65 (  15.13)   2.17e-16   ok
    46336     1     ---   (  ---  )   17556.98 (  15.11)   2.71e-10   ok 
    

  8. Edgar Weckert reporter

    Hi Ahmad,

    thanks a lot, this was a very helpful hint. I'm back to the performance as before.

    Best Edgar
