JACOBI - MPI/CUDA (scorep)

Issue #7 new
jg piccinali repo owner created an issue

DAINT

Get the src

Compile

CCE

  • module load PrgEnv-cray # cce/8.2.4
  • module load craype-accel-nvidia35
  • module load scorep/1.3

GNU

module swap PrgEnv-cray PrgEnv-gnu # gcc/4.8.2
module swap craype/2.05 craype/2.2.0
module swap cray-libsci/12.1.3 cray-libsci/13.0.1
module rm cray-mpich/6.2.2
module load cray-mpich/7.0.3
module load craype-accel-nvidia35
module load scorep/1.3

PGI

  • module swap PrgEnv-cray PrgEnv-pgi # pgi/14.1.0
  • module load cudatoolkit
  • module load scorep/1.3

INTEL

  • module swap PrgEnv-cray PrgEnv-intel # intel/14.0.1.106
  • module load cudatoolkit
  • module load scorep/1.3

  • make clean

  • make PREP=scorep

Run

export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=yes
salloc -N2
export OMP_NUM_THREADS=8
aprun -n2 -N1 -d $OMP_NUM_THREADS \
      jacobi_mpi+openmp+cuda.* \
     4096 4096 0.5
exit
  CUDA Driver Version / Runtime Version     5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535

Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 8 threads + one Tesla K20X for each process.
        1024 of 2049 local rows are calculated on the CPU to balance the load between the CPU and the GPU. (150 iterations max)
    0, 0.250000
  100, 0.002395
 total: 0.431415 s
Application 134913 resources: utime ~12s, stime ~0s, Rss ~197512, inblocks ~43551, outblocks ~43509
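
The three Score-P variables and the thread count from the run above can be grouped in one place so that a tracing run is always set up the same way. This is only a sketch: `scorep_trace_env` is a hypothetical helper name, not part of Score-P, and the default of 8 threads is simply the value used in this issue.

```shell
#!/bin/sh
# Hypothetical helper: export the tracing settings used in this issue.
# $1 = number of OpenMP threads per rank (defaults to 8, as in the run above).
scorep_trace_env() {
    export SCOREP_ENABLE_PROFILING=false   # no profile output
    export SCOREP_ENABLE_TRACING=true      # write a trace instead
    export SCOREP_CUDA_ENABLE=yes          # record CUDA activity
    export OMP_NUM_THREADS="${1:-8}"
}

scorep_trace_env 8
# Show what the job will inherit before calling aprun.
echo "tracing=$SCOREP_ENABLE_TRACING cuda=$SCOREP_CUDA_ENABLE threads=$OMP_NUM_THREADS"
```

Calling the function inside the `salloc` session, right before `aprun`, keeps the settings next to the launch line.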

Reports

Tracing

  • vampir83 scorep-*/traces.otf2
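
When several runs have left `scorep-*` directories behind, the glob above matches more than one trace. A minimal sketch for picking the newest one follows. It assumes the default experiment directory names (as in `scorep-20150521_1611_...` later in this thread), whose timestamp prefix sorts chronologically; `latest_trace` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the traces.otf2 of the newest scorep-* directory
# found below $1 (default: current directory). Prints nothing if none exists.
latest_trace() {
    for t in "${1:-.}"/scorep-*/traces.otf2; do
        [ -f "$t" ] && printf '%s\n' "$t"
    done | LC_ALL=C sort | tail -n 1
}

# Demo with two empty stand-in experiment directories.
demo=$(mktemp -d)
mkdir -p "$demo/scorep-20150521_1607_1" "$demo/scorep-20150521_1611_2"
: > "$demo/scorep-20150521_1607_1/traces.otf2"
: > "$demo/scorep-20150521_1611_2/traces.otf2"
latest_trace "$demo"
```

The result can then be passed to Vampir, for example `vampir83 "$(latest_trace)"`.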

Comments (15)

  1. jg piccinali reporter

    Compile

    module load papi/5.3.2.1
    export LD_LIBRARY_PATH=/opt/cray/papi/5.3.2.1/lib64:$LD_LIBRARY_PATH
    
    SC=/apps/daint/5.2.UP02/scorep/1.4/gnu482sharedlibsmpi711/bin/scorep
    $SC --cuda nvcc  -O3 -arch=sm_35  -c jacobi_cuda_kernel.cu
    $SC --mpp=mpi --thread=omp  --cuda cc -D_CSCS_ITMAX=150 -O3   -DOMP_MEMLOCALITY -fopenmp -DUSE_MPI  -c jacobi_cuda.c -o jacobi_mpi+cuda.o
    $SC --mpp=mpi --thread=omp  --cuda cc -DOMP_MEMLOCALITY -fopenmp -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o \
    -o ./jacobi_mpi+openmp+cuda.GNU.SANTIS+sc14
    

    Run

    salloc -N2
    export SCOREP_ENABLE_PROFILING=false
    export SCOREP_ENABLE_TRACING=true
    export SCOREP_CUDA_ENABLE=yes
    export OMP_NUM_THREADS=8
    

    aprun -n2 -N1 -d $OMP_NUM_THREADS ./jacobi_mpi+openmp+cuda.GNU.SANTIS+sc14 4096 4096 0.5

      CUDA Driver Version / Runtime Version     6.5 / 6.5
      CUDA Capability Major/Minor version number:    3.5
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Maximum sizes of each dimension of a block:    1024 x 1024 x 64
      Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
    
    Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 8 threads + one Tesla K20X for each process.
        1024 of 2049 local rows are calculated on the CPU to balance the load between the CPU and the GPU. (150 iterations max)
        0, 0.250000
      100, 0.002395
     total: 1.089108 s
    [Score-P] src/adapters/cuda/scorep_cupti4_activity.c:257: 
    Warning: [CUPTI Activity] Destroying buffer which is currently in use (6936736, 1, 0)
    

    Report

    Screen Shot 2015-04-13 at 10.38.23.png

  2. jg piccinali reporter

    scorep/1.4.1

    PizDaint

    Setup

    module swap PrgEnv-cray PrgEnv-gnu
    module load craype-accel-nvidia35
    module load papi/5.4.0.1    # !!!
    SC=/apps/daint/5.2.UP02/scorep/1.4.1/gnu482sci1303mpi720cuda6514acc311otf151opa113cube431/bin/scorep
    

    Compile

    cd ~/parallel-debuggers.git/jacobi.git/src/GNU
    
    $SC --cuda nvcc  -O3 -arch=sm_35  -c jacobi_cuda_kernel.cu
    
    $SC --mpp=mpi --thread=omp  --cuda cc -D_CSCS_ITMAX=150 -O3 -DOMP_MEMLOCALITY -fopenmp -DUSE_MPI \
      -c jacobi_cuda.c -o jacobi_mpi+cuda.o
    
    $SC --mpp=mpi --thread=omp  --cuda cc -DOMP_MEMLOCALITY -fopenmp \
    -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o \
    -o ./jacobi_mpi+openmp+cuda.GNU.DAINT+sc141
    
    nvcc  -O3 -arch=sm_35  -c jacobi_cuda_kernel.cu
    
    cc -D_CSCS_ITMAX=150 -O3 -DOMP_MEMLOCALITY -fopenmp -DUSE_MPI \
      -c jacobi_cuda.c -o jacobi_mpi+cuda.o
    
    cc -DOMP_MEMLOCALITY -fopenmp \
    -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o \
    -o ./jacobi_mpi+openmp+cuda.GNU.DAINT+notool
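
The instrumented and the plain `cc` steps above differ only in the Score-P prefix, so both command lines can be generated from one place. The sketch below only prints the commands; `build_cmd` is a hypothetical helper, and it covers the `cc` steps only (the `nvcc` step takes just `--cuda`).

```shell
#!/bin/sh
# Hypothetical helper: print a cc command line, prefixed with the Score-P
# wrapper and the options used above when a wrapper is given.
# $1 = path to scorep (empty string for an uninstrumented build)
# remaining arguments = the compiler command
build_cmd() {
    wrapper=$1
    shift
    if [ -n "$wrapper" ]; then
        echo "$wrapper --mpp=mpi --thread=omp --cuda $*"
    else
        echo "$*"
    fi
}

build_cmd scorep cc -fopenmp -DUSE_MPI -c jacobi_cuda.c -o jacobi_mpi+cuda.o
build_cmd ""     cc -fopenmp -DUSE_MPI -c jacobi_cuda.c -o jacobi_mpi+cuda.o
```

Replacing `echo` with `eval` (or running the printed line by hand) would execute the build; printing first makes it easy to check that both variants use the same compiler flags.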
    

    Run (no tool)

    • aprun -n 2 -N 1 -d 8 -j 1 jacobi_mpi+openmp+cuda.GNU.DAINT+notool 4096 4096 0.5
      CUDA Driver Version / Runtime Version     6.5 / 6.5
      CUDA Capability Major/Minor version number:    3.5
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Maximum sizes of each dimension of a block:    1024 x 1024 x 64
      Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
    
    Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 8 threads + one Tesla K20X for each process.
        1024 of 2049 local rows are calculated on the CPU to balance the load between the CPU and the GPU. (150 iterations max)
        0, 0.250000
      100, 0.002395
     total: 5.444724 s
    real 8.09
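
To compare runs with and without the tool, the `total:` value can be pulled out of the saved program output. A small sketch, assuming the output format shown above; `solver_time` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the number that follows "total:" in the
# program output read from stdin.
solver_time() {
    awk '$1 == "total:" { print $2 }'
}

# Sample lines copied from the run above.
solver_time <<'EOF'
    0, 0.250000
  100, 0.002395
 total: 5.444724 s
real 8.09
EOF
```

Fed with the output of each run, this gives one number per binary that can be compared side by side.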
    

    Run (tracing)

    salloc -N2
    export SCOREP_ENABLE_PROFILING=false
    export SCOREP_ENABLE_TRACING=true
    export SCOREP_CUDA_ENABLE=yes
    export OMP_NUM_THREADS=8
    ~/KEEP/slurm/sbatch.sh daint 2 jacobi_mpi+openmp+cuda.GNU.DAINT+sc141 2 1 8 "4096 4096 0.5"
    
    • aprun -n2 -N1 -d8 ./jacobi_mpi+openmp+cuda.GNU.DAINT+sc141 4096 4096 0.5
    [Score-P] src/adapters/cuda/scorep_cupti4_activity.c:257: 
    Warning: [CUPTI Activity] Destroying buffer which is currently in use (23345104, 1, 0)
    _pmiu_daemon(SIGCHLD): [NID 00166] [c0-0c2s9n2] 
    [Thu May 21 16:07:48 2015] PE RANK 1 exit signal Segmentation fault
    

    Analyze

    • ignoring segfault above...
    • /apps/ela/vampir/8.4.1/bin/vampir scorep-20150521_1611_88946296565420/traces.otf2 eff.png

    Debug

    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:337: Warning: [CUPTI] Call to 'cuptiErr' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti4_activity.c:277: Warning: [CUPTI] Call to 'cuptiActivityRegisterCallbacks( buffer_requested_callback, buffer_completed_callback )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti_activity.c:781: Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti_activity.c:795: Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_MEMCPY )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:361: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RUNTIME_API )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:384: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_SYNCHRONIZE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:388: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RESOURCE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:337: Warning: [CUPTI] Call to 'cuptiErr' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti4_activity.c:277: Warning: [CUPTI] Call to 'cuptiActivityRegisterCallbacks( buffer_requested_callback, buffer_completed_callback )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti_activity.c:781: Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti_activity.c:795: Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_MEMCPY )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:361: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RUNTIME_API )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:384: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_SYNCHRONIZE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
    [Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:388: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RESOURCE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
    
    Processes 0-1: 
    Process stopped in jacobi kernel (jacobi_cuda_kernel.cu:44) with signal CUDA_EXCEPTION_10 (Device Illegal Address).
    Reason/Origin: kill, sigsend or raise
    Your program will probably be terminated if you continue.
    You can use the stack controls to see what the process was doing at the time.
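
With this many warnings it helps to see how often each CUPTI error code occurs. A rough sketch that counts the codes in a saved log follows; `cupti_summary` is a hypothetical helper name, and it relies on `grep -o`, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper: count each CUPTI_ERROR_* code in the log read from
# stdin and print "count code", one code per line, sorted by code name.
cupti_summary() {
    LC_ALL=C grep -o 'CUPTI_ERROR_[A-Z_]*' | LC_ALL=C sort | uniq -c |
        awk '{ print $1, $2 }'
}

# Three shortened sample lines in the format of the log above.
cupti_summary <<'EOF'
Warning: [CUPTI] Call to 'cuptiErr' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_MEMCPY )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RESOURCE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
EOF
```

For the three sample lines this prints `1 CUPTI_ERROR_INVALID_PARAMETER` and `2 CUPTI_ERROR_NOT_INITIALIZED`.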
    
    Currently Loaded Modulefiles:
      1) modules/3.2.10.3
      2) nodestat/2.2-1.0502.53712.3.109.ari
      3) sdb/1.0-1.0502.55976.5.27.ari
      4) alps/5.2.1-2.0502.9041.11.6.ari
      5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
      6) udreg/2.3.2-1.0502.9275.1.12.ari
      7) ugni/5.0-1.0502.9685.4.24.ari
      8) gni-headers/3.0-1.0502.9684.5.2.ari
      9) dmapp/7.0.1-1.0502.9501.5.219.ari
     10) xpmem/0.1-2.0502.55507.3.2.ari
     11) hss-llm/7.2.0
     12) Base-opts/1.0.2-1.0502.53325.1.2.ari
     13) craype-network-aries
     14) craype/2.3.0
     15) craype-sandybridge
     16) slurm
     17) cray-mpich/7.2.0
     18) ddt/5.0
     19) gcc/4.8.2
     20) totalview-support/1.1.4
     21) totalview/8.11.0
     22) cray-libsci/13.0.3
     23) pmi/5.0.6-1.0000.10439.140.2.ari
     24) atp/1.8.1
     25) PrgEnv-gnu/5.2.40
     26) /linux/jg
     27) cray-libsci_acc/3.1.1
     28) cudatoolkit/6.5.14-1.0502.9613.6.1
     29) craype-accel-nvidia35
     30) papi/5.4.0.1
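
Since the setup depends on specific modules (papi, cudatoolkit, craype-accel-nvidia35), a saved module listing like the one above can be checked before building. A sketch under the assumption that the listing uses the `N) name/version` format shown here; `has_module` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the module list read from stdin contains
# module $1, with or without a version suffix.
has_module() {
    grep -Eq "[0-9]+\) $1(/|\$)"
}

# Sample lines copied from the listing above.
listing=' 13) craype-network-aries
 14) craype/2.3.0
 28) cudatoolkit/6.5.14-1.0502.9613.6.1
 30) papi/5.4.0.1'

for m in papi cudatoolkit scorep; do
    if printf '%s\n' "$listing" | has_module "$m"; then
        echo "$m: loaded"
    else
        echo "$m: missing"
    fi
done
```

On the system itself the listing would come from `module list`, which typically writes to stderr, so its output needs to be redirected (for example `module list 2>&1`) before it is piped into the helper.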
    
  3. jg piccinali reporter

    From: Ronny Tschueter

    Loading scorep/1.4.1 (module load scorep/1.4.1) and adding a call to cudaDeviceReset() gets rid of the CUPTI warnings.

    The following message can be safely ignored:

    [NID 00013] 2015-05-26 14:34:28 Apid 158049: initiated application termination
    
  4. jg piccinali reporter

    scorep/1.4.2

    GNU (OK)

    Setup

    • module swap PrgEnv-cray PrgEnv-gnu
    • module load craype-accel-nvidia35
    • module load papi/5.4.1.1 # !!!
    • module load scorep/1.4.2

    Compile

    SC=/apps/santis/scorep/1.4.2/gnu482sci1304mpi722cuda6514acc311otf151opa114cube431/bin/scorep
    
    $SC --cuda nvcc  -arch=sm_35 -O3  -c ../jacobi_cuda_kernel.cu -o jacobi_cuda_kernel.o
    
    $SC --mpp=mpi --thread=omp --cuda cc  -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
    -fopenmp -std=c99 -O3 -c ../jacobi_cuda.c -o jacobi_cuda.o
    
    $SC --mpp=mpi --thread=omp --cuda cc  -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
    -fopenmp -std=c99 -O3 jacobi_cuda_kernel.o jacobi_cuda.o  -o GNU.santis+sc142
    

    ❗ --cuda cc

    Run

    • export SCOREP_ENABLE_PROFILING=false
    • export SCOREP_ENABLE_TRACING=true
    • export SCOREP_CUDA_ENABLE=yes
    • sbatch.sh santis 5 GNU.santis+sc142 1 1 4 "4096 4096 0.1"

    eff0.png

  5. jg piccinali reporter

    scorep/1.4.2

    INTEL (OK)

    Setup

    • module swap PrgEnv-cray PrgEnv-intel
    • module load craype-accel-nvidia35
    • module load papi/5.4.1.1 # !!!
    • module load scorep/1.4.2

    Compile

    SC=/apps/santis/scorep/1.4.2/int1501sci1304mpi722cuda6514acc311otf151opa114cube431/bin/scorep
    
    $SC --cuda nvcc  -arch=sm_35 -O3  -c ../jacobi_cuda_kernel.cu -o jacobi_cuda_kernel.o
    
    $SC --mpp=mpi --thread=omp --cuda cc  -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
    -openmp -O3 -c ../jacobi_cuda.c -o jacobi_cuda.o
    
    $SC --mpp=mpi --thread=omp --cuda cc  -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
    -openmp -O3 jacobi_cuda_kernel.o jacobi_cuda.o  -o INTEL.santis+sc142
    

    ❗ --cuda cc

    ❗ ignoring gcc: unrecognized option '-tcollect'

    Run

    • export SCOREP_ENABLE_PROFILING=false
    • export SCOREP_ENABLE_TRACING=true
    • export SCOREP_CUDA_ENABLE=yes
    • sbatch.sh santis 5 INTEL.santis+sc142 1 1 4 "4096 4096 0.1"

    00.png
