CrayPat-lite: CUDA (Issue #20, status: new)
MPI+CUDA (Piz Daint)
Get the src
- ssh daint
- cd $SCRATCH
- git clone https://github.com/eth-cscs/proposals.git proposals.git
Cloning into 'proposals.git'...
remote: Counting objects: 339, done.
remote: Total 339 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (339/339), 300.16 KiB | 234 KiB/s, done.
Resolving deltas: 100% (139/139), done.
- cd proposals.git/vihps/JACOBI_CUDA/
Setup
- module swap PrgEnv-cray PrgEnv-gnu
- module load craype-accel-nvidia35
- module use /project/csstaff/proposals
- module load perflite/622cuda
- echo CRAYPAT_LITE=$CRAYPAT_LITE
CRAYPAT_LITE=gpu
- module list
Currently Loaded Modulefiles:
1) modules/3.2.10.2
2) nodestat/2.2-1.0502.53712.3.109.ari
3) sdb/1.0-1.0502.55976.5.27.ari
4) alps/5.2.1-2.0502.9041.11.6.ari
5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
6) udreg/2.3.2-1.0502.9275.1.12.ari
7) ugni/5.0-1.0502.9685.4.24.ari
8) gni-headers/3.0-1.0502.9684.5.2.ari
9) dmapp/7.0.1-1.0502.9501.5.219.ari
10) xpmem/0.1-2.0502.55507.3.2.ari
11) hss-llm/7.2.0
12) Base-opts/1.0.2-1.0502.53325.1.2.ari
13) craype-network-aries
14) craype/2.2.1
15) craype-sandybridge
16) slurm
17) cray-mpich/7.1.1
18) ddt/4.3rc7
19) linux/jg
20) gcc/4.8.2
21) totalview-support/1.1.4
22) totalview/8.11.0
23) cray-libsci/13.0.1
24) pmi/5.0.6-1.0000.10439.140.2.ari
25) atp/1.7.5
26) PrgEnv-gnu/5.2.40
27) rca/1.0.0-2.0502.53711.3.127.ari
28) perflite/622cuda
29) cray-libsci_acc/3.0.2
30) cudatoolkit/5.5.22-1.0502.7944.3.1
31) craype-accel-nvidia35
Compile
- mkdir -p bin
- make clean
- make bin/jacobi_mpi+openmp+cuda PREP=
cc -std=c99 -O3 -march=native -DOMP_MEMLOCALTIY \
-fopenmp -DUSE_MPI -I/include \
-c src/jacobi_cuda.c -o bin/jacobi_mpi+cuda.o
nvcc -ccbin=cc -O3 \
-arch=sm_35 -Xcompiler -march=native \
-c src/jacobi_cuda_kernel.cu -o bin/jacobi_cuda_kernel.o
cc -fopenmp -lm -lstdc++ -L/lib64 -lcudart \
bin/jacobi_mpi+cuda.o \
bin/jacobi_cuda_kernel.o \
-o bin/jacobi_mpi+openmp+cuda
INFO: A maximum of 17 functions from group 'aio' will be traced.
INFO: A maximum of 285 functions from group 'cuda' will be traced.
INFO: A maximum of 107 functions from group 'io' will be traced.
INFO: A maximum of 699 functions from group 'mpi' will be traced.
INFO: A maximum of 32 functions from group 'omp' will be traced.
INFO: creating the CrayPat-instrumented executable
'bin/jacobi_mpi+openmp+cuda' (gpu) ...OK
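Note: the instrumentation is added at link time by the perflite module (see the INFO lines above), so an uninstrumented baseline binary can be built for comparison simply by unloading the module first. A sketch, reusing the same make target and the module names loaded above:
# sketch only: rebuild without CrayPat-lite instrumentation, then re-enable it
- module unload perflite
- make clean
- make bin/jacobi_mpi+openmp+cuda PREP=
- module load perflite/622cuda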
Run
- cd ./bin/
- sbatch ../batch/run_jacobi_mpi+openmp+cuda.sbatch
Submitted batch job 2381
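The batch script itself is not reproduced here; a minimal sketch of what run_jacobi_mpi+openmp+cuda.sbatch might contain for this configuration (2 nodes, 1 MPI rank per node), with the launcher and resource values being illustrative and possibly different from the actual script:
#!/bin/bash
#SBATCH --job-name=jacobi_mpi+openmp+cuda
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=o_%j
# illustrative thread count (the report below shows 3 threads per PE)
export OMP_NUM_THREADS=3
# 2 PEs, 1 per node; program arguments as in the "Program invocation" line of the report
aprun -n 2 -N 1 -d $OMP_NUM_THREADS ../bin/jacobi_mpi+openmp+cuda 4096 4096 0.15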
Reports
- cat o_2381
...
#################################################################
# #
# CrayPat-lite Performance Statistics #
# #
#################################################################
CrayPat/X: Version 6.2.2 Revision 13378 (xf 13240) 11/20/14 14:32:58
Experiment: lite lite/gpu
Number of PEs (MPI ranks): 2
Numbers of PEs per Node: 1 PE on each of 2 Nodes
Numbers of Threads per PE: 3
Number of Cores per Socket: 8
Execution start time: Wed Jan 28 15:45:47 2015
System name and speed: santis01 2601 MHz
Avg Process Time: 6.703 secs
High Memory: 175.551 MBytes 87.775 MBytes per PE
I/O Read Rate: 65.050 MBytes/sec
I/O Write Rate: 2.690 MBytes/sec
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls |Group
| | Time | Time% | | Function
| | | | | PE=HIDE
| | | | | Thread=HIDE
100.0% | 6.172222 | -- | -- | 231370.0 |Total
|----------------------------------------------------------------
| 93.0% | 5.739670 | 0.083060 | 2.9% | 1.0 |USER
||---------------------------------------------------------------
|| 93.0% | 5.739670 | 0.083060 | 2.9% | 1.0 |main
||===============================================================
| 4.7% | 0.292761 | -- | -- | 94873.0 |CUDA
||---------------------------------------------------------------
|| 2.4% | 0.146580 | 0.001524 | 2.1% | 17131.5 |cudaMalloc
|| 1.3% | 0.079082 | 0.079062 | 100.0% | 229.0 |cudaSetDevice
|| 1.0% | 0.061897 | 0.000072 | 0.2% | 77390.0 |cudaMemcpy
||===============================================================
| 1.4% | 0.086431 | -- | -- | 128937.5 |ETC
|================================================================
Table 2: Accelerator Table by Function (top 10 functions shown)
Host | Host | Acc | Acc Copy | Acc Copy | Events |Function=[max10]
Time% | Time | Time | In | Out | | PE=HIDE
| | | (MBytes) | (MBytes) | | Thread=HIDE
100.0% | 0.102 | 0.109 | 70.059 | 42.824 | 6003 |Total
|-----------------------------------------------------------------------------
| 48.4% | 0.049 | 0.064 | 70.055 | 42.820 | 2003 |cudaMemcpy
| 23.3% | 0.024 | 0.020 | 0.004 | -- | 2000 |launch_jacobi_kernel_async
| 17.9% | 0.018 | 0.025 | -- | 0.004 | 1000 |wait_jacobi_kernel
| 10.4% | 0.011 | -- | -- | -- | 1000 |launch_copy_kernel
|=============================================================================
Program invocation: ../bin/jacobi_mpi+openmp+cuda 4096 4096 0.15
For a complete report with expanded tables and notes, run:
pat_report /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2
For help identifying callers of particular functions:
pat_report -O callers+src /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2
To see the entire call tree:
pat_report -O calltree+src /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2
For interactive, graphical performance analysis, run:
app2 /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2
================ End of CrayPat-lite output ==========================
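For explicit control over which groups are traced, the full perftools workflow could be used instead of the lite module. A sketch (the perftools module version, the pat_build output name and the data-file glob are illustrative, not taken from this run):
- module unload perflite
- module load perftools
- make clean
- make bin/jacobi_mpi+openmp+cuda PREP=
- pat_build -g mpi,omp,cuda -o bin/jacobi+pat bin/jacobi_mpi+openmp+cuda
# run bin/jacobi+pat through the batch script as before, then:
- pat_report bin/jacobi+pat+*.xf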
Comments (3)
- reporter: Tracing both MPI/OpenMP and CUDA at the same time?
Actually, this is already the case:
grep -- -g $CRAYPAT_ROOT/share/config/lite
-gcuda # Enable the CUDA trace group.
-g mpi
-g omp
-g io
MPI/OMP entries were not showing in the report because my test case is too simple. An easy workaround is to rerun pat_report with -T, which disables the reporting thresholds so that low-cost entries are not suppressed:
pat_report -T jacobi_mpi+openmp+cuda+23039-14t.ap2
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls |Group
| | Time | Time% | | Function
| | | | | PE=HIDE
| | | | | Thread=HIDE
100.0% | 6.172222 | -- | -- | 231370.0 |Total
|----------------------------------------------------------------------------
| 93.0% | 5.739670 | 0.083060 | 2.9% | 1.0 |USER
||---------------------------------------------------------------------------
|| 93.0% | 5.739670 | 0.083060 | 2.9% | 1.0 |main
||===========================================================================
| 4.7% | 0.292761 | -- | -- | 94873.0 |CUDA
||---------------------------------------------------------------------------
|| 2.4% | 0.146580 | 0.001524 | 2.1% | 17131.5 |cudaMalloc
|| 1.3% | 0.079082 | 0.079062 | 100.0% | 229.0 |cudaSetDevice
|| 1.0% | 0.061897 | 0.000072 | 0.2% | 77390.0 |cudaMemcpy
|| 0.1% | 0.004895 | 0.004895 | 100.0% | 1.5 |cudaGetExportTable
|| 0.0% | 0.000283 | 0.000006 | 4.1% | 93.0 |cudaFree
|| 0.0% | 0.000025 | 0.000000 | 2.7% | 28.0 |cudaDeviceSynchronize
||===========================================================================
| 1.4% | 0.086431 | -- | -- | 128937.5 |ETC
||---------------------------------------------------------------------------
|| 0.6% | 0.038842 | 0.000265 | 1.4% | 60104.5 |launch_jacobi_kernel_async
|| 0.4% | 0.026534 | 0.000033 | 0.2% | 35523.0 |wait_jacobi_kernel
|| 0.3% | 0.016905 | 0.000061 | 0.7% | 32084.0 |launch_copy_kernel
|| 0.1% | 0.004001 | 0.004001 | 100.0% | 223.0 |==LO_MEMORY== libcuda.so.1
|| 0.0% | 0.000104 | 0.000017 | 28.2% | 1000.0 |gomp_team_end
|| 0.0% | 0.000032 | 0.000032 | 100.0% | 2.5 |cudbgApiInit
|| 0.0% | 0.000014 | 0.000014 | 100.0% | 0.5 |gomp_team_start
||===========================================================================
| 0.5% | 0.033342 | -- | -- | 1004.0 |MPI_SYNC
||---------------------------------------------------------------------------
|| 0.3% | 0.016696 | 0.007961 | 47.7% | 1000.0 |MPI_Allreduce(sync)
|| 0.2% | 0.010159 | 0.010135 | 99.8% | 1.0 |MPI_Init(sync)
|| 0.1% | 0.006462 | 0.006387 | 98.8% | 2.0 |MPI_Barrier(sync)
|| 0.0% | 0.000025 | 0.000017 | 66.2% | 1.0 |MPI_Finalize(sync)
||===========================================================================
| 0.3% | 0.018889 | -- | -- | 2538.5 |MPI
||---------------------------------------------------------------------------
|| 0.2% | 0.013701 | 0.000047 | 0.7% | 1530.5 |MPI_Sendrecv
|| 0.1% | 0.005161 | 0.000530 | 18.6% | 1000.0 |MPI_Allreduce
|| 0.0% | 0.000013 | 0.000004 | 43.7% | 2.0 |MPI_Barrier
|| 0.0% | 0.000010 | 0.000000 | 4.5% | 2.0 |MPI_Wtime
|| 0.0% | 0.000002 | 0.000000 | 0.9% | 1.0 |MPI_Finalize
|| 0.0% | 0.000001 | 0.000000 | 44.7% | 1.0 |MPI_Init
|| 0.0% | 0.000001 | 0.000000 | 1.5% | 1.0 |MPI_Comm_rank
|| 0.0% | 0.000000 | 0.000000 | 15.3% | 1.0 |MPI_Comm_size
||===========================================================================
| 0.0% | 0.001036 | -- | -- | 4009.0 |OMP
||---------------------------------------------------------------------------
|| 0.0% | 0.001024 | 0.000010 | 1.9% | 2004.5 |omp_get_num_threads
|| 0.0% | 0.000012 | 0.000000 | 3.8% | 2004.5 |omp_get_thread_num
||===========================================================================
| 0.0% | 0.000077 | 0.000077 | 100.0% | 6.5 |STDIO
||---------------------------------------------------------------------------
|| 0.0% | 0.000077 | 0.000077 | 100.0% | 6.5 |printf
||===========================================================================
| 0.0% | 0.000016 | 0.000016 | 100.0% | 0.5 |PTHREAD
||---------------------------------------------------------------------------
|| 0.0% | 0.000016 | 0.000016 | 100.0% | 0.5 |pthread_create
|============================================================================
- reporter: cat /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.rpt
#################################################################
# #
# CrayPat-lite Performance Statistics #
# #
#################################################################
CrayPat/X: Version 6.2.2 Revision 13378 (xf 13240) 11/20/14 14:32:58
Experiment: lite lite/gpu
Number of PEs (MPI ranks): 2
Numbers of PEs per Node: 1 PE on each of 2 Nodes
Numbers of Threads per PE: 10
Number of Cores per Socket: 8
Execution start time: Tue Feb 3 10:50:04 2015
System name and speed: santis01 2601 MHz
Avg Process Time: 115.500 secs
High Memory: 710.465 MBytes 355.232 MBytes per PE
I/O Read Rate: 70.075 MBytes/sec
I/O Write Rate: 35.215 MBytes/sec
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls |Group
| | Time | Time% | | Function
| | | | | PE=HIDE
| | | | | Thread=HIDE
100.0% | 112.554892 | -- | -- | 17090509.0 |Total
|-----------------------------------------------------------------------------
| 81.0% | 91.121734 | 0.813199 | 1.8% | 1.0 |USER
||----------------------------------------------------------------------------
|| 81.0% | 91.121734 | 0.813199 | 1.8% | 1.0 |main
||============================================================================
| 9.6% | 10.773268 | -- | -- | 14442639.0 |ETC
||----------------------------------------------------------------------------
|| 2.4% | 2.653874 | 0.012315 | 0.9% | 3032131.0 |cu_copy_H2D_async
|| 1.8% | 2.001851 | 0.012485 | 1.2% | 2924520.5 |__device_stub__Z39nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_cuda11cu_atomdata10cu_nbparam8cu_plistb
|| 1.4% | 1.560283 | 0.011285 | 1.4% | 2833267.5 |cu_copy_D2H_async
|| 1.1% | 1.263958 | 0.003769 | 0.6% | 1262597.5 |nbnxn_cuda_wait_gpu
||============================================================================
| 8.2% | 9.274940 | -- | -- | 623283.5 |MPI
||----------------------------------------------------------------------------
|| 6.0% | 6.744410 | 0.790307 | 21.0% | 475142.5 |MPI_Sendrecv
|| 2.2% | 2.457671 | 0.005005 | 0.4% | 138953.5 |MPI_Alltoall
|=============================================================================
Table 2: Accelerator Table by Function (top 10 functions shown)
Host | Host | Acc | Acc Copy | Acc Copy | Events |Function=[max10]
Time% | Time | Time | In | Out | | PE=HIDE
| | | (MBytes) | (MBytes) | | Thread=HIDE
100.0% | 4.226 | 0.002 | 12605 | 8824 | 373813 |Total
|-----------------------------------------------------------------------------
| 39.7% | 1.676 | -- | 12604 | -- | 158760 |cu_copy_H2D_async
| 31.7% | 1.339 | -- | -- | -- | 89856 |__device_stub__Z39nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_cuda11cu_atomdata10cu_nbparam8cu_plistb
| 25.0% | 1.057 | -- | -- | 8824 | 115005 |cu_copy_D2H_async
| 2.6% | 0.110 | -- | -- | -- | 7488 |__device_stub__Z40nbnxn_kernel_ElecEwTwinCut_VdwLJ_VF_cuda11cu_atomdata10cu_nbparam8cu_plistb
| 1.0% | 0.040 | -- | -- | -- | 2498 |__device_stub__Z46nbnxn_kernel_ElecEwTwinCut_VdwLJ_VF_prune_cuda11cu_atomdata10cu_nbparam8cu_plistb
|=============================================================================
Program invocation: /apps/santis/sandbox/lucamar/bin/gmx504/patlite/bin/gmx_mpi mdrun -gpu_id 0 -npme -1 -s crambin.tpr
For a complete report with expanded tables and notes, run:
pat_report /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
For help identifying callers of particular functions:
pat_report -O callers+src /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
To see the entire call tree:
pat_report -O calltree+src /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
For interactive, graphical performance analysis, run:
app2 /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
================ End of CrayPat-lite output ==========================
- That looks great: would it be possible to create a profile tracing both MPI/OpenMP and CUDA at the same time with perftools/lite? Thanks a lot for your effort!