PROPOSALS CrayPat-lite: CUDA

Issue #20 new
jg piccinali repo owner created an issue

MPI+CUDA (Piz Daint)

Get the src

Cloning into 'proposals.git'...
remote: Counting objects: 339, done.
remote: Total 339 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (339/339), 300.16 KiB | 234 KiB/s, done.
Resolving deltas: 100% (139/139), done.
  • cd proposals.git/vihps/JACOBI_CUDA/
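
The clone command itself is not shown above, only its output; a minimal sketch, with a placeholder standing in for the real remote URL:

# <REMOTE> is a placeholder, not the actual location of the proposals repository.
# The explicit target directory matches the "Cloning into 'proposals.git'" line above.
git clone <REMOTE> proposals.git
cd proposals.git/vihps/JACOBI_CUDA/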

Setup

  • module swap PrgEnv-cray PrgEnv-gnu
  • module load craype-accel-nvidia35
  • module use /project/csstaff/proposals
  • module load perflite/622cuda
  • echo CRAYPAT_LITE=$CRAYPAT_LITE
CRAYPAT_LITE=gpu
  • module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.2
  2) nodestat/2.2-1.0502.53712.3.109.ari
  3) sdb/1.0-1.0502.55976.5.27.ari
  4) alps/5.2.1-2.0502.9041.11.6.ari
  5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
  6) udreg/2.3.2-1.0502.9275.1.12.ari
  7) ugni/5.0-1.0502.9685.4.24.ari
  8) gni-headers/3.0-1.0502.9684.5.2.ari
  9) dmapp/7.0.1-1.0502.9501.5.219.ari
 10) xpmem/0.1-2.0502.55507.3.2.ari
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.53325.1.2.ari
 13) craype-network-aries
 14) craype/2.2.1
 15) craype-sandybridge
 16) slurm
 17) cray-mpich/7.1.1
 18) ddt/4.3rc7
 19) linux/jg
 20) gcc/4.8.2
 21) totalview-support/1.1.4
 22) totalview/8.11.0
 23) cray-libsci/13.0.1
 24) pmi/5.0.6-1.0000.10439.140.2.ari
 25) atp/1.7.5
 26) PrgEnv-gnu/5.2.40
 27) rca/1.0.0-2.0502.53711.3.127.ari
 28) perflite/622cuda
 29) cray-libsci_acc/3.0.2
 30) cudatoolkit/5.5.22-1.0502.7944.3.1
 31) craype-accel-nvidia35
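
The setup steps above can be collected into a small script for reuse; a minimal sketch (module names taken verbatim from the bullets above):

#!/bin/bash -l
# Switch to the GNU programming environment and target the sm_35 (Kepler) GPUs.
module swap PrgEnv-cray PrgEnv-gnu
module load craype-accel-nvidia35
# Pick up the CUDA-enabled perftools-lite build from the project area.
module use /project/csstaff/proposals
module load perflite/622cuda
# The lite experiment should now target the GPU.
echo "CRAYPAT_LITE=$CRAYPAT_LITE"   # expected output: CRAYPAT_LITE=gpu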

Compile

  • mkdir -p bin
  • make clean
  • make bin/jacobi_mpi+openmp+cuda PREP=
cc -std=c99 -O3 -march=native -DOMP_MEMLOCALTIY \
-fopenmp -DUSE_MPI -I/include \
-c src/jacobi_cuda.c -o bin/jacobi_mpi+cuda.o

nvcc -ccbin=cc -O3 \
-arch=sm_35 -Xcompiler -march=native \
-c src/jacobi_cuda_kernel.cu -o bin/jacobi_cuda_kernel.o

cc -fopenmp -lm -lstdc++ -L/lib64 -lcudart \
bin/jacobi_mpi+cuda.o \
bin/jacobi_cuda_kernel.o \
-o bin/jacobi_mpi+openmp+cuda

INFO: A maximum of 17 functions from group 'aio' will be traced.
INFO: A maximum of 285 functions from group 'cuda' will be traced.
INFO: A maximum of 107 functions from group 'io' will be traced.
INFO: A maximum of 699 functions from group 'mpi' will be traced.
INFO: A maximum of 32 functions from group 'omp' will be traced.
INFO: creating the CrayPat-instrumented executable 
    'bin/jacobi_mpi+openmp+cuda' (gpu) ...OK
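
The INFO lines come from the link step: with perftools-lite loaded, the cc compiler wrapper instruments the executable automatically, so no separate pat_build invocation is needed; PREP= (presumably a Makefile hook for prefixing the build with an instrumentation tool) is deliberately left empty. As an extra sanity check that the binary really is instrumented, generic tools can be used; a sketch only, since the INFO messages above are the authoritative confirmation:

# Count CrayPat-related strings embedded in the executable; an uninstrumented
# build of the same target is expected to report 0. This is only a heuristic.
strings bin/jacobi_mpi+openmp+cuda | grep -ic craypat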

Run

  • cd ./bin/
  • sbatch ../batch/run_jacobi_mpi+openmp+cuda.sbatch
Submitted batch job 2381
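
The batch script itself is not reproduced here; below is a minimal sketch of an equivalent job, assuming the geometry shown in the report (2 nodes, 1 MPI rank per node, 3 OpenMP threads per rank), the program arguments from the report's "Program invocation" line, and srun as the launcher. The actual batch/run_jacobi_mpi+openmp+cuda.sbatch may differ.

#!/bin/bash -l
#SBATCH --job-name=jacobi_lite
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --time=00:10:00
#SBATCH --output=o_%j

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Run the instrumented binary; CrayPat-lite appends its report to the job output.
srun ../bin/jacobi_mpi+openmp+cuda 4096 4096 0.15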

Reports

  • cat o_2381
...
#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 6.2.2 Revision 13378 (xf 13240)  11/20/14 14:32:58
Experiment:                  lite  lite/gpu     
Number of PEs (MPI ranks):      2
Numbers of PEs per Node:        1  PE on each of  2  Nodes
Numbers of Threads per PE:      3
Number of Cores per Socket:     8
Execution start time:  Wed Jan 28 15:45:47 2015
System name and speed:  santis01 2601 MHz

Avg Process Time:   6.703 secs              
High Memory:      175.551 MBytes     87.775 MBytes per PE
I/O Read Rate:     65.050 MBytes/sec        
I/O Write Rate:     2.690 MBytes/sec        

Table 1:  Profile by Function Group and Function

  Time% |     Time |     Imb. |   Imb. |    Calls |Group
        |          |     Time |  Time% |          | Function
        |          |          |        |          |  PE=HIDE
        |          |          |        |          |   Thread=HIDE

 100.0% | 6.172222 |       -- |     -- | 231370.0 |Total
|----------------------------------------------------------------
|  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |USER
||---------------------------------------------------------------
||  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |main
||===============================================================
|   4.7% | 0.292761 |       -- |     -- |  94873.0 |CUDA
||---------------------------------------------------------------
||   2.4% | 0.146580 | 0.001524 |   2.1% |  17131.5 |cudaMalloc
||   1.3% | 0.079082 | 0.079062 | 100.0% |    229.0 |cudaSetDevice
||   1.0% | 0.061897 | 0.000072 |   0.2% |  77390.0 |cudaMemcpy
||===============================================================
|   1.4% | 0.086431 |       -- |     -- | 128937.5 |ETC
|================================================================

Table 2:  Accelerator Table by Function (top 10 functions shown)

   Host |  Host |   Acc | Acc Copy | Acc Copy | Events |Function=[max10]
  Time% |  Time |  Time |       In |      Out |        | PE=HIDE
        |       |       | (MBytes) | (MBytes) |        |  Thread=HIDE

 100.0% | 0.102 | 0.109 |   70.059 |   42.824 |   6003 |Total
|-----------------------------------------------------------------------------
|  48.4% | 0.049 | 0.064 |   70.055 |   42.820 |   2003 |cudaMemcpy
|  23.3% | 0.024 | 0.020 |    0.004 |       -- |   2000 |launch_jacobi_kernel_async
|  17.9% | 0.018 | 0.025 |       -- |    0.004 |   1000 |wait_jacobi_kernel
|  10.4% | 0.011 |    -- |       -- |       -- |   1000 |launch_copy_kernel
|=============================================================================

Program invocation:  ../bin/jacobi_mpi+openmp+cuda 4096 4096 0.15

For a complete report with expanded tables and notes, run:
  pat_report /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2

For help identifying callers of particular functions:
  pat_report -O callers+src /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2
To see the entire call tree:
  pat_report -O calltree+src /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2

For interactive, graphical performance analysis, run:
  app2 /scratch/santis/piccinal/proposals.git/vihps/JACOBI_CUDA/bin/jacobi_mpi+openmp+cuda+23039-14t.ap2

================  End of CrayPat-lite output  ==========================

Comments (3)

  1. Luca Marsella

    That looks great: would it be possible to create a profile tracing both MPI/OpenMP and CUDA at the same time with perftools/lite? Thanks a lot for your effort!

  2. jg piccinali reporter

    Tracing both MPI/OpenMP and CUDA at the same time?

    Actually, that is already the case: the lite configuration enables the CUDA, MPI, OpenMP, and I/O trace groups:

    grep -- -g $CRAYPAT_ROOT/share/config/lite

    -gcuda # Enable the CUDA trace group.
    -g mpi
    -g omp
    -g io

    MPI/OpenMP entries were not showing up in the report because my test case is too simple: their times fall below the default reporting threshold. An easy workaround is to run pat_report with -T, which disables that threshold (a combined variant is sketched after the table below):

    pat_report -T jacobi_mpi+openmp+cuda+23039-14t.ap2
    
    Table 1:  Profile by Function Group and Function
    
      Time% |     Time |     Imb. |   Imb. |    Calls |Group
            |          |     Time |  Time% |          | Function
            |          |          |        |          |  PE=HIDE
            |          |          |        |          |   Thread=HIDE
    
     100.0% | 6.172222 |       -- |     -- | 231370.0 |Total
    |----------------------------------------------------------------------------
    |  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |USER
    ||---------------------------------------------------------------------------
    ||  93.0% | 5.739670 | 0.083060 |   2.9% |      1.0 |main
    ||===========================================================================
    |   4.7% | 0.292761 |       -- |     -- |  94873.0 |CUDA
    ||---------------------------------------------------------------------------
    ||   2.4% | 0.146580 | 0.001524 |   2.1% |  17131.5 |cudaMalloc
    ||   1.3% | 0.079082 | 0.079062 | 100.0% |    229.0 |cudaSetDevice
    ||   1.0% | 0.061897 | 0.000072 |   0.2% |  77390.0 |cudaMemcpy
    ||   0.1% | 0.004895 | 0.004895 | 100.0% |      1.5 |cudaGetExportTable
    ||   0.0% | 0.000283 | 0.000006 |   4.1% |     93.0 |cudaFree
    ||   0.0% | 0.000025 | 0.000000 |   2.7% |     28.0 |cudaDeviceSynchronize
    ||===========================================================================
    |   1.4% | 0.086431 |       -- |     -- | 128937.5 |ETC
    ||---------------------------------------------------------------------------
    ||   0.6% | 0.038842 | 0.000265 |   1.4% |  60104.5 |launch_jacobi_kernel_async
    ||   0.4% | 0.026534 | 0.000033 |   0.2% |  35523.0 |wait_jacobi_kernel
    ||   0.3% | 0.016905 | 0.000061 |   0.7% |  32084.0 |launch_copy_kernel
    ||   0.1% | 0.004001 | 0.004001 | 100.0% |    223.0 |==LO_MEMORY== libcuda.so.1
    ||   0.0% | 0.000104 | 0.000017 |  28.2% |   1000.0 |gomp_team_end
    ||   0.0% | 0.000032 | 0.000032 | 100.0% |      2.5 |cudbgApiInit
    ||   0.0% | 0.000014 | 0.000014 | 100.0% |      0.5 |gomp_team_start
    ||===========================================================================
    |   0.5% | 0.033342 |       -- |     -- |   1004.0 |MPI_SYNC
    ||---------------------------------------------------------------------------
    ||   0.3% | 0.016696 | 0.007961 |  47.7% |   1000.0 |MPI_Allreduce(sync)
    ||   0.2% | 0.010159 | 0.010135 |  99.8% |      1.0 |MPI_Init(sync)
    ||   0.1% | 0.006462 | 0.006387 |  98.8% |      2.0 |MPI_Barrier(sync)
    ||   0.0% | 0.000025 | 0.000017 |  66.2% |      1.0 |MPI_Finalize(sync)
    ||===========================================================================
    |   0.3% | 0.018889 |       -- |     -- |   2538.5 |MPI
    ||---------------------------------------------------------------------------
    ||   0.2% | 0.013701 | 0.000047 |   0.7% |   1530.5 |MPI_Sendrecv
    ||   0.1% | 0.005161 | 0.000530 |  18.6% |   1000.0 |MPI_Allreduce
    ||   0.0% | 0.000013 | 0.000004 |  43.7% |      2.0 |MPI_Barrier
    ||   0.0% | 0.000010 | 0.000000 |   4.5% |      2.0 |MPI_Wtime
    ||   0.0% | 0.000002 | 0.000000 |   0.9% |      1.0 |MPI_Finalize
    ||   0.0% | 0.000001 | 0.000000 |  44.7% |      1.0 |MPI_Init
    ||   0.0% | 0.000001 | 0.000000 |   1.5% |      1.0 |MPI_Comm_rank
    ||   0.0% | 0.000000 | 0.000000 |  15.3% |      1.0 |MPI_Comm_size
    ||===========================================================================
    |   0.0% | 0.001036 |       -- |     -- |   4009.0 |OMP
    ||---------------------------------------------------------------------------
    ||   0.0% | 0.001024 | 0.000010 |   1.9% |   2004.5 |omp_get_num_threads
    ||   0.0% | 0.000012 | 0.000000 |   3.8% |   2004.5 |omp_get_thread_num
    ||===========================================================================
    |   0.0% | 0.000077 | 0.000077 | 100.0% |      6.5 |STDIO
    ||---------------------------------------------------------------------------
    ||   0.0% | 0.000077 | 0.000077 | 100.0% |      6.5 |printf
    ||===========================================================================
    |   0.0% | 0.000016 | 0.000016 | 100.0% |      0.5 |PTHREAD
    ||---------------------------------------------------------------------------
    ||   0.0% | 0.000016 | 0.000016 | 100.0% |      0.5 |pthread_create
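
    For reference, the threshold-free view can be combined with the caller listing suggested at the end of the lite output; a sketch (both options appear in the report footer, their combination is assumed to be valid):

    # Full tables without the default reporting threshold, plus caller information.
    pat_report -T -O callers+src jacobi_mpi+openmp+cuda+23039-14t.ap2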
    
  3. jg piccinali reporter

    cat /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.rpt

    #################################################################
    #                                                               #
    #            CrayPat-lite Performance Statistics                #
    #                                                               #
    #################################################################
    
    CrayPat/X:  Version 6.2.2 Revision 13378 (xf 13240)  11/20/14 14:32:58
    Experiment:                  lite  lite/gpu     
    Number of PEs (MPI ranks):      2
    Numbers of PEs per Node:        1  PE on each of  2  Nodes
    Numbers of Threads per PE:     10
    Number of Cores per Socket:     8
    Execution start time:  Tue Feb  3 10:50:04 2015
    System name and speed:  santis01 2601 MHz
    
    Avg Process Time: 115.500 secs               
    High Memory:      710.465 MBytes     355.232 MBytes per PE
    I/O Read Rate:     70.075 MBytes/sec         
    I/O Write Rate:    35.215 MBytes/sec         
    
    Table 1:  Profile by Function Group and Function
    
      Time% |       Time |     Imb. |  Imb. |      Calls |Group
            |            |     Time | Time% |            | Function
            |            |          |       |            |  PE=HIDE
            |            |          |       |            |   Thread=HIDE
    
     100.0% | 112.554892 |       -- |    -- | 17090509.0 |Total
    |-----------------------------------------------------------------------------
    |  81.0% |  91.121734 | 0.813199 |  1.8% |        1.0 |USER
    ||----------------------------------------------------------------------------
    ||  81.0% |  91.121734 | 0.813199 |  1.8% |        1.0 |main
    ||============================================================================
    |   9.6% |  10.773268 |       -- |    -- | 14442639.0 |ETC
    ||----------------------------------------------------------------------------
    ||   2.4% |   2.653874 | 0.012315 |  0.9% |  3032131.0 |cu_copy_H2D_async
    ||   1.8% |   2.001851 | 0.012485 |  1.2% |  2924520.5 |__device_stub__Z39nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_cuda11cu_atomdata10cu_nbparam8cu_plistb
    ||   1.4% |   1.560283 | 0.011285 |  1.4% |  2833267.5 |cu_copy_D2H_async
    ||   1.1% |   1.263958 | 0.003769 |  0.6% |  1262597.5 |nbnxn_cuda_wait_gpu
    ||============================================================================
    |   8.2% |   9.274940 |       -- |    -- |   623283.5 |MPI
    ||----------------------------------------------------------------------------
    ||   6.0% |   6.744410 | 0.790307 | 21.0% |   475142.5 |MPI_Sendrecv
    ||   2.2% |   2.457671 | 0.005005 |  0.4% |   138953.5 |MPI_Alltoall
    |=============================================================================
    
    Table 2:  Accelerator Table by Function (top 10 functions shown)
    
       Host |  Host |   Acc | Acc Copy | Acc Copy | Events |Function=[max10]
      Time% |  Time |  Time |       In |      Out |        | PE=HIDE
            |       |       | (MBytes) | (MBytes) |        |  Thread=HIDE
    
     100.0% | 4.226 | 0.002 |    12605 |     8824 | 373813 |Total
    |-----------------------------------------------------------------------------
    |  39.7% | 1.676 |    -- |    12604 |       -- | 158760 |cu_copy_H2D_async
    |  31.7% | 1.339 |    -- |       -- |       -- |  89856 |__device_stub__Z39nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_cuda11cu_atomdata10cu_nbparam8cu_plistb
    |  25.0% | 1.057 |    -- |       -- |     8824 | 115005 |cu_copy_D2H_async
    |   2.6% | 0.110 |    -- |       -- |       -- |   7488 |__device_stub__Z40nbnxn_kernel_ElecEwTwinCut_VdwLJ_VF_cuda11cu_atomdata10cu_nbparam8cu_plistb
    |   1.0% | 0.040 |    -- |       -- |       -- |   2498 |__device_stub__Z46nbnxn_kernel_ElecEwTwinCut_VdwLJ_VF_prune_cuda11cu_atomdata10cu_nbparam8cu_plistb
    |=============================================================================
    
    Program invocation:
      /apps/santis/sandbox/lucamar/bin/gmx504/patlite/bin/gmx_mpi mdrun -gpu_id 0 -npme -1 -s crambin.tpr
    
    For a complete report with expanded tables and notes, run:
      pat_report /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    
    For help identifying callers of particular functions:
      pat_report -O callers+src /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    To see the entire call tree:
      pat_report -O calltree+src /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    
    For interactive, graphical performance analysis, run:
      app2 /scratch/santis/lucamar/gromacs/patlite/gmx_mpi+27359-14t.ap2
    
    ================  End of CrayPat-lite output  ==========================
    