magma_sgesv_gpu returned error -112: cannot allocate memory on CPU host.

Issue #35 new
Edgar Weckert created an issue

Hi, I’m not sure whether this is the right place to post this issue. While testing how stable MAGMA (version 2.5.4) is for larger matrices of 32-bit floats, I encountered an error for matrices larger than N=46336. It seems that the allocation of the work array (exactly 181 MB for the largest working dimension) in sgetrf_gpu as pinned host RAM comes back with an error. The error also occurs for magma_cgesv. It is independent of the compiler (gcc-9.30, icc-), CUDA toolkit version (10.2, 11.2), CUDA driver version (455.45.01, 460.27.04), Linux kernel version (3.10.0, 4.19.0, 5.4.0), and GPU (RTX 3090, TESLA K40m, GTX 1080Ti). Performing the pinned allocation earlier in the [sc]getrf_gpu routine (before the other malloc calls) lets it succeed, but a failure occurs later (info=1). There is enough host and GPU RAM available for these problem sizes (the c-routine was only used on the RTX 3090), and larger pinned allocations are possible in other contexts.

Any ideas how to solve this problem?

Best, Edgar

P.S. Sample outputs:
largest working case (testing_cgesv_gpu):
% MAGMA 2.5.4 compiled for CUDA capability >= 3.5, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 11020, driver 11020. OpenMP threads 18. MKL 2020.0.4, MKL threads 18.
% device 0: GeForce RTX 3090, 1800.0 MHz clock, 24260.2 MiB memory, capability 8.6
% device 1: GeForce GTX 1080 Ti, 1582.0 MHz clock, 11178.5 MiB memory, capability 6.1
% Sat Jan 2 16:00:11 2021
% Usage: ./testing_cgesv_gpu [options] [-h|--help]

%     N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
  46336     1   3026.08 ( 87.67)    19283.46 ( 13.76)   9.50e-16   ok

Failure for larger dimensions:
%     N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
magma_cgesv_gpu returned error -112: cannot allocate memory on CPU host.
  46337     1   3074.35 ( 86.30)    11206393.34 ( 0.02)   0.00e+00   ok

corresponding system message from the driver:
NVRM: Xid (PCI:0000:65:00): 31, pid=173621, Ch 00000041, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7f73_a2007000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Comments (4)

  1. Mark Gates

    For n > sqrt( 2^31 ) = 46340, or so, depending on padding in lda, you would need to compile MAGMA with ILP64 support to use 64-bit integers, and link with an ILP64 BLAS. For example, see What BLAS & LAPACK library are you using?

  2. Mark Gates

    Though on closer reading of your post, I don’t know why it would fail to allocate the work array. Based on your reported 181 MiB of 4-byte floats, that would be 46336 x 1024, which should be fine with 32-bit integers. Still, other computations with matrices of that size (n x n) may encounter issues — basically wherever an offset (i + j*lda) is computed that exceeds 2^31.

    All the MAGMA alloc routines already take size_t, which is generally 64-bit, so they shouldn’t have an issue, regardless of whether MAGMA is compiled with or without ILP64. The code that calls an alloc routine could have an issue, for instance:

    int     n32 = 47000;  // 32-bit int
    int64_t n64 = 47000;
    magma_smalloc( &ptr, n32*n32 );          // fails; n*n overflows
    magma_smalloc( &ptr, size_t(n32)*n32 );  // ok, C++ syntax
    magma_smalloc( &ptr, (size_t)n32*n32 );  // ok, C syntax
    magma_smalloc( &ptr, n64*n64 );          // ok

  3. Edgar Weckert reporter

    Hi Mark, I compiled MAGMA with ILP64 support, both with gcc and icc, using the corresponding files from the example directory. The BLAS and LAPACK libraries were in both cases from Intel MKL ver. 2020.0.4. The LAPACK part of the test also seems to work. Therefore, I do not think this is a BLAS problem, since the first error occurs when trying to allocate pinned RAM on the host in the routine sgetrf_gpu for the array 'work'. This array is rather small (~181 MB) compared to the other arrays needed, but according to my debugging it is the only one that MAGMA allocates as pinned. I tried several other pinned options with no success. If one uses a normal (pageable) malloc call to reserve space for 'work' on the host (which would make data exchange with the GPU slower), the malloc is successful but the program fails later on. I also tried placing the pinned memory malloc call before the other mallocs in sgetrf_gpu; then the pinned memory call is successful but the test still fails. In other CUDA programs on my system I'm able to allocate much larger pinned RAM areas than the one required for 'work' without problems. I hope this information helps, Edgar

  4. Edgar Weckert reporter

    What normally happens in these cases, e.g. when something still works with N=46336 but not with N=46337, is that you see something strange in the 'size_t size' argument passed to the malloc routine, which I monitored as well. I can send you screen dumps of that later. There is nothing strange to see here, except that the test or the pinned malloc call fails. In fact, both routines, cgesv_gpu and sgesv_gpu, show the problem at the same problem size even though cgesv requires twice as much RAM, which indicates that it is not an overflow in one of the other malloc calls.

