PizDaint

pdf
blog
Excerpt from man aprun:

CRAY_CUDA_MPS
               Overrides the site default for execution in simultaneous
               contexts on GPU-equipped nodes (e.g. Hyper Q, CUDA proxy).
               Setting to 1 or on will enable the CUDA proxy. To disable
               CUDA proxy, set to 0 or off. Debugging and use of
               performance tools to collect GPU statistics is only
               supported with the CUDA proxy disabled.
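
The excerpt lists four accepted values. As a minimal sketch of how a job script could check the current setting before launching, the helper below maps a value to its meaning. The function name `mps_state` is hypothetical and not part of any Cray tool. It matches only the literal values listed in the excerpt, and it assumes that an unset variable means the site default applies.

```shell
# Hypothetical helper: interpret a CRAY_CUDA_MPS value according to the
# man page excerpt above (1/on enables the CUDA proxy, 0/off disables it).
mps_state() {
    case "$1" in
        1|on)  echo enabled ;;
        0|off) echo disabled ;;
        "")    echo site-default ;;   # assumption: unset means site default
        *)     echo unknown ;;
    esac
}

# Report the state for the current environment.
mps_state "${CRAY_CUDA_MPS:-}"
```

Per the excerpt, a profiling or debugging run would want this to report `disabled`.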

Setup

git clone https://github.com/lichinka/L2.git L2_lichinka.git
cd L2_lichinka.git/17591/
module swap PrgEnv-cray PrgEnv-gnu
module swap gcc gcc/4.8.2
module load craype-accel-nvidia35

Compile

# PE_ENV is set by the PrgEnv module (GNU here), so the binary is ./GNU
cc proxy.c -o $PE_ENV

Run

export CRAY_CUDA_MPS=1
sbatch.sh santis 1 ./GNU 4 4 1

Big1: 2000x2000x2000

Running cublas on 2000x2000x2000 with 1 and then with 4 PEs...
2000x2000x2000 DGEMM -- 1 PE, overall Gflops = 1008.579518 0.015864 s.
1- pid 0, my Gflops = 1008.579518 0.015864 s.

2000x2000x2000 DGEMM -- 4 PE, overall Gflops = 76.414291 0.209385 s.
2- pid 3, my Gflops = 83.771418 0.190996 s.
2- pid 1, my Gflops = 90.812583 0.176187 s.
2- pid 2, my Gflops = 76.420469 0.209368 s.
2- pid 0, my Gflops = 1070.231465 0.014950 s.

Small1: 2000x500x2000

2000x500x2000 DGEMM -- 1 PE, overall Gflops = 979.691445 0.004083 s.
3- pid 0, my Gflops = 979.920332 0.004082 s.

2000x500x2000 DGEMM -- 4 PE, overall Gflops = 1053.597048 0.015186 s.
4- pid 0, my Gflops = 263.469581 0.015182 s.
4- pid 1, my Gflops = 331.617963 0.012062 s.
4- pid 3, my Gflops = 310.482197 0.012883 s.
4- pid 2, my Gflops = 291.737080 0.013711 s.
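
The figures above appear consistent with the usual DGEMM operation count of 2*m*n*k floating-point operations divided by the elapsed time. The sketch below recomputes a few of them with awk. The function name `gflops` and its optional fifth argument (number of PEs whose work is summed) are illustrative assumptions, not taken from proxy.c.

```shell
# Hypothetical helper: Gflops for an m x n x k DGEMM that took t seconds.
# Optional 5th argument: number of PEs whose work is summed (default 1).
gflops() {
    awk -v m="$1" -v n="$2" -v k="$3" -v t="$4" -v pes="${5:-1}" \
        'BEGIN { printf "%.1f\n", pes * 2 * m * n * k / t / 1e9 }'
}

gflops 2000 2000 2000 0.015864     # Big1, 1 PE
gflops 2000 500 2000 0.004083      # Small1, 1 PE
gflops 2000 500 2000 0.015186 4    # Small1, 4 PEs, slowest time
```

The results differ slightly from the log in the last digits because the printed times are rounded. Note that the Small1 4 PE "overall" figure matches four times the per-PE work over the slowest time, whereas the Big1 4 PE "overall" figure (76.4) matches the work of a single PE over the slowest time.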

Run (nvprof/nvvp)

export CRAY_CUDA_MPS=1
unset COMPUTE_PROFILE
export PMI_NO_FORK=1
sbatch.sh santis 1 ./GNU 4 4 1 "" "" "-b nvprof -o nvprof.output.%h.%p"
nvvp

NVIDIA Multi-Process Service MPS

Big1/1

Big1/4

Small1/1

Small1/4

scorep/1.4.2

Big1/1

Big1/4

Small1/1

Small1/4

5000x5000x5000 / 1mpi

5000x5000x5000 / 4mpi

perftools-lite/6.2.5