PROPOSALS CrayPat-lite: CUDA

Issue #20 new
jg piccinali repo owner created an issue

MPI+CUDA (Piz Daint)

Get the src

Cloning into 'proposals.git'...
remote: Counting objects: 339, done.
remote: Total 339 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (339/339), 300.16 KiB | 234 KiB/s, done.
Resolving deltas: 100% (139/139), done.
  • cd proposals.git/vihps/JACOBI_CUDA/
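
The clone command itself is not shown above, only its output; a minimal sketch, with a placeholder standing in for the real remote URL:

# <REMOTE> is a placeholder, not the actual location of the proposals repository.
# The explicit target directory matches the "Cloning into 'proposals.git'" line above.
git clone <REMOTE> proposals.git
cd proposals.git/vihps/JACOBI_CUDA/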

Setup

  • module swap PrgEnv-cray PrgEnv-gnu
  • module load craype-accel-nvidia35
  • module use /project/csstaff/proposals
  • module load perflite/622cuda
  • echo CRAYPAT_LITE=$CRAYPAT_LITE
CRAYPAT_LITE=gpu
  • module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.2
  2) nodestat/2.2-1.0502.53712.3.109.ari
  3) sdb/1.0-1.0502.55976.5.27.ari
  4) alps/5.2.1-2.0502.9041.11.6.ari
  5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
  6) udreg/2.3.2-1.0502.9275.1.12.ari
  7) ugni/5.0-1.0502.9685.4.24.ari
  8) gni-headers/3.0-1.0502.9684.5.2.ari
  9) dmapp/7.0.1-1.0502.9501.5.219.ari
 10) xpmem/0.1-2.0502.55507.3.2.ari
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.53325.1.2.ari
 13) craype-network-aries
 14) craype/2.2.1
 15) craype-sandybridge
 16) slurm
 17) cray-mpich/7.1.1
 18) ddt/4.3rc7
 19) linux/jg
 20) gcc/4.8.2
 21) totalview-support/1.1.4
 22) totalview/8.11.0
 23) cray-libsci/13.0.1
 24) pmi/5.0.6-1.0000.10439.140.2.ari
 25) atp/1.7.5
 26) PrgEnv-gnu/5.2.40
 27) rca/1.0.0-2.0502.53711.3.127.ari
 28) perflite/622cuda
 29) cray-libsci_acc/3.0.2
 30) cudatoolkit/5.5.22-1.0502.7944.3.1
 31) craype-accel-nvidia35
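
The setup steps above can be collected into a small script for reuse; a minimal sketch (module names taken verbatim from the bullets above):

#!/bin/bash -l
# Switch to the GNU programming environment and target the sm_35 (Kepler) GPUs.
module swap PrgEnv-cray PrgEnv-gnu
module load craype-accel-nvidia35
# Pick up the CUDA-enabled perftools-lite build from the project area.
module use /project/csstaff/proposals
module load perflite/622cuda
# The lite experiment should now target the GPU.
echo "CRAYPAT_LITE=$CRAYPAT_LITE"   # expected output: CRAYPAT_LITE=gpu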

Compile

  • mkdir -p bin
  • make clean
  • make bin/jacobi_mpi+openmp+cuda PREP=
cc -std=c99 -O3 -march=native -DOMP_MEMLOCALTIY \
-fopenmp -DUSE_MPI -I/include \
-c src/jacobi_cuda.c -o bin/jacobi_mpi+cuda.o

nvcc -ccbin=cc -O3 \
-arch=sm_35 -Xcompiler -march=native \
-c src/jacobi_cuda_kernel.cu -o bin/jacobi_cuda_kernel.o

cc -fopenmp -lm -lstdc++ -L/lib64 -lcudart \
bin/jacobi_mpi+cuda.o \
bin/jacobi_cuda_kernel.o \
-o bin/jacobi_mpi+openmp+cuda

INFO: A maximum of 17 functions from group 'aio' will be traced.
INFO: A maximum of 285 functions from group 'cuda' will be traced.
INFO: A maximum of 107 functions from group 'io' will be traced.
INFO: A maximum of 699 functions from group 'mpi' will be traced.
INFO: A maximum of 32 functions from group 'omp' will be traced.
INFO: creating the CrayPat-instrumented executable 
    'bin/jacobi_mpi+openmp+cuda' (gpu) ...OK
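
The INFO lines come from the link step: with perftools-lite loaded, the cc compiler wrapper instruments the executable automatically, so no separate pat_build invocation is needed; PREP= (presumably a Makefile hook for prefixing the build with an instrumentation tool) is deliberately left empty. As an extra sanity check that the binary really is instrumented, generic tools can be used; a sketch only, since the INFO messages above are the authoritative confirmation:

# Count CrayPat-related strings embedded in the executable; an uninstrumented
# build of the same target is expected to report 0. This is only a heuristic.
strings bin/jacobi_mpi+openmp+cuda | grep -ic craypat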

Run

  • cd ./bin/
  • sbatch ../batch/run_jacobi_mpi+openmp+cuda.sbatch
Submitted batch job 2381
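
The batch script itself is not reproduced here; below is a minimal sketch of an equivalent job, assuming the geometry shown in the report (2 nodes, 1 MPI rank per node, 3 OpenMP threads per rank), the program arguments from the report's "Program invocation" line, and srun as the launcher. The actual batch/run_jacobi_mpi+openmp+cuda.sbatch may differ.

#!/bin/bash -l
#SBATCH --job-name=jacobi_lite
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --time=00:10:00
#SBATCH --output=o_%j

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Run the instrumented binary; CrayPat-lite appends its report to the job output.
srun ../bin/jacobi_mpi+openmp+cuda 4096 4096 0.15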

Reports

  • cat o_2381
...
#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 6.2.2 Revision 13378 (xf 13240)  11/20/14 14:32:58
Experiment:                  lite  lite/gpu     
Number of PEs (MPI ranks):      2
Numbers of PEs per Node:        1  PE on each of  2  Nodes
Numbers of Threads per PE:      3
Number of Cores per Socket:     8
Execution start time:  Wed Jan 28 15:45:47 2015
System name and speed:  santis01 2601 MHz

Avg Process Time:   6.703 secs              
High Memory:      175.551 MBytes     87.775 MBytes per PE
I/O Read Rate:     65.050 MBytes/sec        
I/O Write Rate:     2.690 MBytes/sec        

Table 1:  Profile by Function Group and Function

  Time% |     Time |     Imb. |   Imb. |    Calls |Group
        |          |     Time |  Time% |          | Function
        |          |          |        |          |  PE=HIDE
        |          |          |        |          |   Thread=HIDE

 100.0% | 6.172222 |       -- |     -- | 231370.0 |Total
|----------------------------------------------------------------
|  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |USER
||---------------------------------------------------------------
||  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |main
||===============================================================
|   4.7% | 0.292761 |       -- |     -- |  94873.0 |CUDA
||---------------------------------------------------------------
||   2.4% | 0.146580 | 0.001524 |   2.1% |  17131.5 |cudaMalloc
||   1.3% | 0.079082 | 0.079062 | 100.0% |    229.0 |cudaSetDevice
||   1.0% | 0.061897 | 0.000072 |   0.2% |  77390.0 |cudaMemcpy
||===============================================================
|   1.4% | 0.086431 |       -- |     -- | 128937.5 |ETC
|================================================================

Table 2:  Accelerator Table by Function (top 10 functions shown)

   Host |  Host |   Acc | Acc Copy | Acc Copy | Events |Function=[max10]
  Time% |  Time |  Time |       In |      Out |        | PE=HIDE
        |       |       | (MBytes) | (MBytes) |        |  Thread=HIDE

 100.0% | 0.102 | 0.109 |   70.059 |   42.824 |   6003 |Total
|-----------------------------------------------------------------------------
|  48.4% | 0.049 | 0.064 |   70.055 |   42.820 |   2003 |cudaMemcpy
|  23.3% | 0.024 | 0.020 |    0.004 |       -- |   2000 |launch_jacobi_kernel_async
|  17.9% | 0.018 | 0.025 |       -- |    0.004 |   1000 |wait_jacobi_kernel
|  10.4% | 0.011 |    -- |       -- |       -- |   1000 |launch_copy_kernel
|=============================================================================

Program invocation:  ../bin/jacobi_mpi+openmp+cuda 4096 4096 0.15

For a complete report with expanded tables and notes, run:
  pat_report /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2

For help identifying callers of particular functions:
  pat_report -O callers+src /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2
To see the entire call tree:
  pat_report -O calltree+src /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2

For interactive, graphical performance analysis, run:
  app2 /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2

================  End of CrayPat-lite output  ==========================

Comments (3)

  1. Luca Marsella

    That looks great: would it be possible to create a profile tracing both MPI/OpenMP and CUDA at the same time with perftools/lite? Thanks a lot for your effort!

  2. jg piccinali reporter

    Tracing both MPI/OpenMP and CUDA at the same time?

    Actually, that is already the case: the lite configuration enables the CUDA, MPI, OpenMP, and I/O trace groups:

    grep -- -g $CRAYPAT_ROOT/share/config/lite

    -gcuda # Enable the CUDA trace group.
    -g mpi
    -g omp
    -g io

    MPI/OpenMP entries were not showing up in the report because my test case is too simple: their times fall below the default reporting threshold. An easy workaround is to run pat_report with -T, which disables that threshold (a combined variant is sketched after the table below):

    pat_report -T jacobi_mpi+openmp+cuda+23039-14t.ap2
    
    Table 1:  Profile by Function Group and Function
    
      Time% |     Time |     Imb. |   Imb. |    Calls |Group
            |          |     Time |  Time% |          | Function
            |          |          |        |          |  PE=HIDE
            |          |          |        |          |   Thread=HIDE
    
     100.0% | 6.172222 |       -- |     -- | 231370.0 |Total
    |----------------------------------------------------------------------------
    |  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |USER
    ||---------------------------------------------------------------------------
    ||  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |main
    ||===========================================================================
    |   4.7% | 0.292761 |       -- |     -- |  94873.0 |CUDA
    ||---------------------------------------------------------------------------
    ||   2.4% | 0.146580 | 0.001524 |   2.1% |  17131.5 |cudaMalloc
    ||   1.3% | 0.079082 | 0.079062 | 100.0% |    229.0 |cudaSetDevice
    ||   1.0% | 0.061897 | 0.000072 |   0.2% |  77390.0 |cudaMemcpy
    ||   0.1% | 0.004895 | 0.004895 | 100.0% |      1.5 |cudaGetExportTable
    ||   0.0% | 0.000283 | 0.000006 |   4.1% |     93.0 |cudaFree
    ||   0.0% | 0.000025 | 0.000000 |   2.7% |     28.0 |cudaDeviceSynchronize
    ||===========================================================================
    |   1.4% | 0.086431 |       -- |     -- | 128937.5 |ETC
    ||---------------------------------------------------------------------------
    ||   0.6% | 0.038842 | 0.000265 |   1.4% |  60104.5 |launch_jacobi_kernel_async
    ||   0.4% | 0.026534 | 0.000033 |   0.2% |  35523.0 |wait_jacobi_kernel
    ||   0.3% | 0.016905 | 0.000061 |   0.7% |  32084.0 |launch_copy_kernel
    ||   0.1% | 0.004001 | 0.004001 | 100.0% |    223.0 |==LO_MEMORY== libcuda.so.1
    ||   0.0% | 0.000104 | 0.000017 |  28.2% |   1000.0 |gomp_team_end
    ||   0.0% | 0.000032 | 0.000032 | 100.0% |      2.5 |cudbgApiInit
    ||   0.0% | 0.000014 | 0.000014 | 100.0% |      0.5 |gomp_team_start
    ||===========================================================================
    |   0.5% | 0.033342 |       -- |     -- |   1004.0 |MPI_SYNC
    ||---------------------------------------------------------------------------
    ||   0.3% | 0.016696 | 0.007961 |  47.7% |   1000.0 |MPI_Allreduce(sync)
    ||   0.2% | 0.010159 | 0.010135 |  99.8% |      1.0 |MPI_Init(sync)
    ||   0.1% | 0.006462 | 0.006387 |  98.8% |      2.0 |MPI_Barrier(sync)
    ||   0.0% | 0.000025 | 0.000017 |  66.2% |      1.0 |MPI_Finalize(sync)
    ||===========================================================================
    |   0.3% | 0.018889 |       -- |     -- |   2538.5 |MPI
    ||---------------------------------------------------------------------------
    ||   0.2% | 0.013701 | 0.000047 |   0.7% |   1530.5 |MPI_Sendrecv
    ||   0.1% | 0.005161 | 0.000530 |  18.6% |   1000.0 |MPI_Allreduce
    ||   0.0% | 0.000013 | 0.000004 |  43.7% |      2.0 |MPI_Barrier
    ||   0.0% | 0.000010 | 0.000000 |   4.5% |      2.0 |MPI_Wtime
    ||   0.0% | 0.000002 | 0.000000 |   0.9% |      1.0 |MPI_Finalize
    ||   0.0% | 0.000001 | 0.000000 |  44.7% |      1.0 |MPI_Init
    ||   0.0% | 0.000001 | 0.000000 |   1.5% |      1.0 |MPI_Comm_rank
    ||   0.0% | 0.000000 | 0.000000 |  15.3% |      1.0 |MPI_Comm_size
    ||===========================================================================
    |   0.0% | 0.001036 |       -- |     -- |   4009.0 |OMP
    ||---------------------------------------------------------------------------
    ||   0.0% | 0.001024 | 0.000010 |   1.9% |   2004.5 |omp_get_num_threads
    ||   0.0% | 0.000012 | 0.000000 |   3.8% |   2004.5 |omp_get_thread_num
    ||===========================================================================
    |   0.0% | 0.000077 | 0.000077 | 100.0% |      6.5 |STDIO
    ||---------------------------------------------------------------------------
    ||   0.0% | 0.000077 | 0.000077 | 100.0% |      6.5 |printf
    ||===========================================================================
    |   0.0% | 0.000016 | 0.000016 | 100.0% |      0.5 |PTHREAD
    ||---------------------------------------------------------------------------
    ||   0.0% | 0.000016 | 0.000016 | 100.0% |      0.5 |pthread_create
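
    For reference, the threshold-free view can be combined with the caller listing suggested at the end of the lite output; a sketch (both options appear in the report footer, their combination is assumed to be valid):

    # Full tables without the default reporting threshold, plus caller information.
    pat_report -T -O callers+src jacobi_mpi+openmp+cuda+23039-14t.ap2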
    
  3. jg piccinali reporter

    cat /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.rpt

    #################################################################
    #                                                               #
    #            CrayPat-lite Performance Statistics                #
    #                                                               #
    #################################################################
    
    CrayPat/X:  Version 6.2.2 Revision 13378 (xf 13240)  11/20/14 14:32:58
    Experiment:                  lite  lite/gpu     
    Number of PEs (MPI ranks):      2
    Numbers of PEs per Node:        1  PE on each of  2  Nodes
    Numbers of Threads per PE:     10
    Number of Cores per Socket:     8
    Execution start time:  Tue Feb  3 10:50:04 2015
    System name and speed:  santis01 2601 MHz
    
    Avg Process Time: 115.500 secs               
    High Memory:      710.465 MBytes     355.232 MBytes per PE
    I/O Read Rate:     70.075 MBytes/sec         
    I/O Write Rate:    35.215 MBytes/sec         
    
    Table 1:  Profile by Function Group and Function
    
      Time% |       Time |     Imb. |  Imb. |      Calls |Group
            |            |     Time | Time% |            | Function
            |            |          |       |            |  PE=HIDE
            |            |          |       |            |   Thread=HIDE
    
     100.0% | 112.554892 |       -- |    -- | 17090509.0 |Total
    |-----------------------------------------------------------------------------
    |  81.0% |  91.121734 | 0.813199 |  1.8% |        1.0 |USER
    ||----------------------------------------------------------------------------
    ||  81.0% |  91.121734 | 0.813199 |  1.8% |        1.0 |main
    ||============================================================================
    |   9.6% |  10.773268 |       -- |    -- | 14442639.0 |ETC
    ||----------------------------------------------------------------------------
    ||   2.4% |   2.653874 | 0.012315 |  0.9% |  3032131.0 |cu_copy_H2D_async
    ||   1.8% |   2.001851 | 0.012485 |  1.2% |  2924520.5 |__device_stub__Z39nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_cuda11cu_atomdata10cu_nbparam8cu_plistb
    ||   1.4% |   1.560283 | 0.011285 |  1.4% |  2833267.5 |cu_copy_D2H_async
    ||   1.1% |   1.263958 | 0.003769 |  0.6% |  1262597.5 |nbnxn_cuda_wait_gpu
    ||============================================================================
    |   8.2% |   9.274940 |       -- |    -- |   623283.5 |MPI
    ||----------------------------------------------------------------------------
    ||   6.0% |   6.744410 | 0.790307 | 21.0% |   475142.5 |MPI_Sendrecv
    ||   2.2% |   2.457671 | 0.005005 |  0.4% |   138953.5 |MPI_Alltoall
    |=============================================================================
    
    Table 2:  Accelerator Table by Function (top 10 functions shown)
    
       Host |  Host |   Acc | Acc Copy | Acc Copy | Events |Function=[max10]
      Time% |  Time |  Time |       In |      Out |        | PE=HIDE
            |       |       | (MBytes) | (MBytes) |        |  Thread=HIDE
    
     100.0% | 4.226 | 0.002 |    12605 |     8824 | 373813 |Total
    |-----------------------------------------------------------------------------
    |  39.7% | 1.676 |    -- |    12604 |       -- | 158760 |cu_copy_H2D_async
    |  31.7% | 1.339 |    -- |       -- |       -- |  89856 |__device_stub__Z39nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_cuda11cu_atomdata10cu_nbparam8cu_plistb
    |  25.0% | 1.057 |    -- |       -- |     8824 | 115005 |cu_copy_D2H_async
    |   2.6% | 0.110 |    -- |       -- |       -- |   7488 |__device_stub__Z40nbnxn_kernel_ElecEwTwinCut_VdwLJ_VF_cuda11cu_atomdata10cu_nbparam8cu_plistb
    |   1.0% | 0.040 |    -- |       -- |       -- |   2498 |__device_stub__Z46nbnxn_kernel_ElecEwTwinCut_VdwLJ_VF_prune_cuda11cu_atomdata10cu_nbparam8cu_plistb
    |=============================================================================
    
    Program invocation:
      /apps/santis/sandbox/lucamar/bin/gmx504/patlite/bin/gmx_mpi mdrun -gpu_id 0 -npme -1 -s crambin.tpr
    
    For a complete report with expanded tables and notes, run:
      pat_report /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    
    For help identifying callers of particular functions:
      pat_report -O callers+src /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    To see the entire call tree:
      pat_report -O calltree+src /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    
    For interactive, graphical performance analysis, run:
      app2 /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    
    ================  End of CrayPat-lite output  ==========================
    