NVIDIA Multi-Process Service (MPS)

Issue #57 new
jg piccinali repo owner created an issue

PizDaint

CRAY_CUDA_MPS
               Overrides the site default for execution in simultaneous
               contexts on GPU-equipped nodes (e.g. Hyper Q, CUDA proxy).
               Setting to 1 or on will enable the CUDA proxy. To disable
               CUDA proxy, set to 0 or off. Debugging and use of
               performance tools to collect GPU statistics is only
               supported with the CUDA proxy disabled.
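
The accepted values quoted above can be summarised in a small shell helper. This is only a sketch: `mps_state` is a hypothetical name, not a Cray tool, and treating an unset or empty variable as "site default" is an assumption drawn from the wording "overrides the site default".

```shell
# Hypothetical helper (not part of the Cray environment): report how the
# current value of CRAY_CUDA_MPS would be read according to the text above.
mps_state() {
    case "${CRAY_CUDA_MPS-}" in
        1|on)  echo "enabled" ;;       # CUDA proxy on
        0|off) echo "disabled" ;;      # CUDA proxy off (needed for debugging/profiling tools)
        "")    echo "site default" ;;  # assumption: unset or empty means no override
        *)     echo "unrecognized" ;;  # any other value is not documented above
    esac
}
```

Calling `mps_state` before submitting a job is a quick way to confirm which setting the job will inherit.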

Setup

  • git clone https://github.com/lichinka/L2.git L2_lichinka.git
  • cd L2_lichinka.git/17591/
  • module swap PrgEnv-cray PrgEnv-gnu
  • module swap gcc gcc/4.8.2
  • module load craype-accel-nvidia35

Compile

  • cc proxy.c -o $PE_ENV   (with PrgEnv-gnu loaded, $PE_ENV expands to GNU, so the binary is ./GNU)

Run

  • export CRAY_CUDA_MPS=1
  • sbatch.sh santis 1 ./GNU 4 4 1

Big1: 2000x2000x2000

Running cublas on 2000x2000x2000 with 1 and then with 4 PEs...
2000x2000x2000 DGEMM -- 1 PE, overall Gflops = 1008.579518 0.015864 s.
1- pid 0, my Gflops = 1008.579518 0.015864 s.

2000x2000x2000 DGEMM -- 4 PE, overall Gflops = 76.414291 0.209385 s.
2- pid 3, my Gflops = 83.771418 0.190996 s.
2- pid 1, my Gflops = 90.812583 0.176187 s.
2- pid 2, my Gflops = 76.420469 0.209368 s.
2- pid 0, my Gflops = 1070.231465 0.014950 s.
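
The per-process figures above are consistent with the usual DGEMM operation count of 2·m·n·k floating-point operations divided by the elapsed time. The sketch below recomputes them from the printed times; `gflops` is a hypothetical helper, and because the printed times are rounded to six decimals the results agree with the log only to about two decimal places.

```shell
# Hypothetical helper: Gflops of one m x n x k DGEMM that took t seconds,
# assuming the conventional 2*m*n*k operation count.
gflops() {
    awk -v m="$1" -v n="$2" -v k="$3" -v t="$4" \
        'BEGIN { printf "%.2f\n", 2 * m * n * k / t / 1e9 }'
}

gflops 2000 2000 2000 0.015864   # 1 PE, pid 0
gflops 2000 2000 2000 0.190996   # 4 PEs, pid 3
```

The first call gives 1008.57 and the second 83.77, against 1008.58 and 83.77 in the log.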

Small1: 2000x500x2000

2000x500x2000 DGEMM -- 1 PE, overall Gflops = 979.691445 0.004083 s.
3- pid 0, my Gflops = 979.920332 0.004082 s.

2000x500x2000 DGEMM -- 4 PE, overall Gflops = 1053.597048 0.015186 s.
4- pid 0, my Gflops = 263.469581 0.015182 s.
4- pid 1, my Gflops = 331.617963 0.012062 s.
4- pid 3, my Gflops = 310.482197 0.012883 s.
4- pid 2, my Gflops = 291.737080 0.013711 s.
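
The "overall" line for the 4-PE Small1 run can be reproduced by counting the work of all four DGEMMs (4 × 2·m·n·k operations) over the overall elapsed time. The sketch below does that arithmetic; `overall_gflops` is a hypothetical helper and the operation count is the conventional one, not something confirmed from proxy.c. The Big1 4-PE overall figure (76.41), by contrast, agrees with the same formula only when a single DGEMM's work is counted, which may be worth checking in the benchmark source.

```shell
# Hypothetical helper: aggregate Gflops when npes processes each run one
# m x n x k DGEMM and the whole batch takes t seconds.
overall_gflops() {
    awk -v m="$1" -v n="$2" -v k="$3" -v npes="$4" -v t="$5" \
        'BEGIN { printf "%.2f\n", npes * 2 * m * n * k / t / 1e9 }'
}

overall_gflops 2000 500 2000 4 0.015186    # Small1, 4 PEs: work of four DGEMMs
overall_gflops 2000 2000 2000 1 0.209385   # Big1, 4 PEs: matches only with one DGEMM counted
```

These print 1053.60 and 76.41, against 1053.60 and 76.41 in the log (rounded to two decimals).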

Run (nvprof/nvvp)

  • export CRAY_CUDA_MPS=1
  • unset COMPUTE_PROFILE
  • export PMI_NO_FORK=1
  • sbatch.sh santis 1 ./GNU 4 4 1 "" "" "-b nvprof -o nvprof.output.%h.%p"
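  • nvvp   (then import the nvprof.output.* files; nvprof expands %h to the host name and %p to the process ID)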

Big1/1

n1.png

Big1/4

n2.png

Small1/1

  • x

Small1/4

n4.png

Comments (7)

  1. jg piccinali reporter

    scorep/1.4.2

    • scorep --mpp=mpi --cuda cc proxy.c
    • export SCOREP_ENABLE_PROFILING=false
    • export SCOREP_ENABLE_TRACING=true
    • export SCOREP_CUDA_ENABLE=yes

    Big1/1

    v1.png

    Big1/4

    v2.png

    Small1/1

    v3.png

    Small1/4

    v4.png

  2. jg piccinali reporter

    perftools-lite/6.2.5

    • module load perftools-lite
    • export CRAYPAT_LITE=gpu
    • cc proxy2.c -o GNU.2+ptl625
    • export CRAY_CUDA_MPS=1

    p.png p2_profil.png

  3. jg piccinali reporter

    • aprun -n1 nvidia-smi -q
      • Compute Mode : Exclusive_Process

    EXCLUSIVE_PROCESS – the GPU is assigned to only one process at a time, and
    individual process threads may submit work to the GPU concurrently.
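
To check the compute mode without reading the full `nvidia-smi -q` report, the relevant line can be filtered out. This is a minimal sketch: `compute_mode` is a hypothetical helper, and it assumes the report contains a line of the form `Compute Mode : <value>` as shown above.

```shell
# Hypothetical helper: read `nvidia-smi -q` output on stdin and print the
# value of each "Compute Mode" line with surrounding whitespace removed.
compute_mode() {
    awk -F: '/Compute Mode/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}
```

On a compute node this would be used as `aprun -n1 nvidia-smi -q | compute_mode`.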
    