JACOBI - MPI/CUDA (scorep)
Issue #7
new
DAINT
Get the src
- ssh -Y daint01
- git clone --single-branch -b jacobi https://github.com/eth-cscs/parallel-debuggers.git
- cd parallel-debuggers/jacobi.git/src/
Compile
CCE
- module load PrgEnv-cray # cce/8.2.4
- module load craype-accel-nvidia35
- module load scorep/1.3
GNU
- module swap PrgEnv-cray PrgEnv-gnu # gcc/4.8.2
- module swap craype/2.05 craype/2.2.0
- module swap cray-libsci/12.1.3 cray-libsci/13.0.1
- module rm cray-mpich/6.2.2
- module load cray-mpich/7.0.3
- module load craype-accel-nvidia35
- module load scorep/1.3
PGI
- module swap PrgEnv-cray PrgEnv-pgi # pgi/14.1.0
- module load cudatoolkit
- module load scorep/1.3
INTEL
- module swap PrgEnv-cray PrgEnv-intel # intel/14.0.1.106
- module load cudatoolkit
- module load scorep/1.3
- make clean
- make PREP=scorep
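make PREP=scorep works because the Makefile prepends the instrumenter to each compiler call; a minimal sketch of that common pattern, with variable names as assumptions about this Makefile and echo standing in for the real tools:

```shell
#!/bin/sh
# Sketch of the PREP pattern: every compile/link command is prefixed
# with the Score-P wrapper, so "make PREP=scorep" instruments the build.
# CC/NVCC names are assumptions; echo only shows the resulting commands.
PREP=scorep
CC="$PREP cc"
NVCC="$PREP nvcc"
echo "compile: $NVCC -O3 -arch=sm_35 -c jacobi_cuda_kernel.cu"
echo "link:    $CC -fopenmp -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o"
```

With an empty PREP the same Makefile produces an uninstrumented binary, which is why the clean build below needs no changes.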
Run
- export SCOREP_ENABLE_PROFILING=false
- export SCOREP_ENABLE_TRACING=true
- export SCOREP_CUDA_ENABLE=yes
- salloc -N2
- export OMP_NUM_THREADS=8
- aprun -n2 -N1 -d $OMP_NUM_THREADS \
    jacobi_mpi+openmp+cuda.* \
    4096 4096 0.5
- exit
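The run steps above can be collected into one small launcher so every allocation starts with consistent Score-P settings; a sketch, with the aprun line kept as a comment since it only works inside an allocation:

```shell
#!/bin/sh
# Launcher sketch: fix the Score-P tracing configuration once, show the
# effective settings, then (inside salloc) launch the hybrid job.
export SCOREP_ENABLE_PROFILING=false   # no profile, trace only
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=yes          # record CUDA API calls and kernels
export OMP_NUM_THREADS=8
env | grep -E '^(SCOREP_|OMP_NUM)' | sort
# aprun -n2 -N1 -d $OMP_NUM_THREADS jacobi_mpi+openmp+cuda.* 4096 4096 0.5
```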
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 3.5
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 8 threads
+ one Tesla K20X for each process.
1024 of 2049 local rows are calculated on the CPU
to balance the load between the CPU and the GPU. (150 iterations max)
0, 0.250000
100, 0.002395
total: 0.431415 s
Application 134913 resources: utime ~12s, stime ~0s, Rss ~197512, inblocks ~43551, outblocks ~43509
Reports
Tracing
- vampir83 scorep-*/traces.otf2
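Before opening an experiment in Vampir it can help to verify that the scorep-* directory actually contains a traces.otf2 anchor file (a profiling-only run will not have one). A small check; the demo directory is created locally here only because the real scorep-* directories exist only after a run:

```shell
#!/bin/sh
# Look for the OTF2 anchor file in each Score-P experiment directory.
# scorep-demo is a stand-in created for illustration.
mkdir -p scorep-demo
: > scorep-demo/traces.otf2
for d in scorep-*/; do
    [ -d "$d" ] || continue
    if [ -f "${d}traces.otf2" ]; then
        echo "trace found: ${d}traces.otf2"
    else
        echo "no traces.otf2 in $d (profiling-only run?)"
    fi
done
```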
Comments (15)
-
reporter - attached 20140903_CCE.png, 20140903_PGI.png, 20140903_INT.png, 20140903_GNU.png
-
reporter - vampirtrace ? Power ?
-
reporter - changed title to JACOBI - MPI/CUDA
-
reporter Compile
module load papi/5.3.2.1
export LD_LIBRARY_PATH=/opt/cray/papi/5.3.2.1/lib64:$LD_LIBRARY_PATH
SC=/apps/daint/5.2.UP02/scorep/1.4/gnu482sharedlibsmpi711/bin/scorep
$SC --cuda nvcc -O3 -arch=sm_35 -c jacobi_cuda_kernel.cu
$SC --mpp=mpi --thread=omp --cuda cc -D_CSCS_ITMAX=150 -O3 -DOMP_MEMLOCALITY -fopenmp -DUSE_MPI \
  -c jacobi_cuda.c -o jacobi_mpi+cuda.o
$SC --mpp=mpi --thread=omp --cuda cc -DOMP_MEMLOCALITY -fopenmp -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o \
  -o ./jacobi_mpi+openmp+cuda.GNU.SANTIS+sc14
Run
salloc -N2
export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=yes
export OMP_NUM_THREADS=8
aprun -n2 -N1 -d $OMP_NUM_THREADS ./jacobi_mpi+openmp+cuda.GNU.SANTIS+sc14 4096 4096 0.5
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.5
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 8 threads
+ one Tesla K20X for each process.
1024 of 2049 local rows are calculated on the CPU
to balance the load between the CPU and the GPU. (150 iterations max)
0, 0.250000
100, 0.002395
total: 1.089108 s
[Score-P] src/adapters/cuda/scorep_cupti4_activity.c:257: Warning: [CUPTI Activity] Destroying buffer which is currently in use (6936736, 1, 0)
Report
-
reporter scorep/1.4.1
PizDaint
Setup
module swap PrgEnv-cray PrgEnv-gnu
module load craype-accel-nvidia35
module load papi/5.4.0.1 # !!!
SC=/apps/daint/5.2.UP02/scorep/1.4.1/gnu482sci1303mpi720cuda6514acc311otf151opa113cube431/bin/scorep
Compile
cd ~/parallel-debuggers.git/jacobi.git/src/GNU
$SC --cuda nvcc -O3 -arch=sm_35 -c jacobi_cuda_kernel.cu
$SC --mpp=mpi --thread=omp --cuda cc -D_CSCS_ITMAX=150 -O3 -DOMP_MEMLOCALITY -fopenmp -DUSE_MPI \
  -c jacobi_cuda.c -o jacobi_mpi+cuda.o
$SC --mpp=mpi --thread=omp --cuda cc -DOMP_MEMLOCALITY -fopenmp \
  -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o \
  -o ./jacobi_mpi+openmp+cuda.GNU.DAINT+sc141

Without the tool:
nvcc -O3 -arch=sm_35 -c jacobi_cuda_kernel.cu
cc -D_CSCS_ITMAX=150 -O3 -DOMP_MEMLOCALITY -fopenmp -DUSE_MPI \
  -c jacobi_cuda.c -o jacobi_mpi+cuda.o
cc -DOMP_MEMLOCALITY -fopenmp \
  -lcudart jacobi_mpi+cuda.o jacobi_cuda_kernel.o \
  -o ./jacobi_mpi+openmp+cuda.GNU.DAINT+notool
Run (no tool)
- aprun -n 2 -N 1 -d 8 -j 1 jacobi_mpi+openmp+cuda.GNU.DAINT+notool 4096 4096 0.5
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.5
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Jacobi relaxation Calculation: 4096 x 4096 mesh with 2 processes and 8 threads
+ one Tesla K20X for each process.
1024 of 2049 local rows are calculated on the CPU
to balance the load between the CPU and the GPU. (150 iterations max)
0, 0.250000
100, 0.002395
total: 5.444724 s
real 8.09
Run (tracing)
salloc -N2
export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=yes
export OMP_NUM_THREADS=8
~/KEEP/slurm/sbatch.sh daint 2 jacobi_mpi+openmp+cuda.GNU.DAINT+sc141 2 1 8 "4096 4096 0.5"
- aprun -n2 -N1 -d8 ./jacobi_mpi+openmp+cuda.GNU.DAINT+sc141 4096 4096 0.5
[Score-P] src/adapters/cuda/scorep_cupti4_activity.c:257: Warning: [CUPTI Activity] Destroying buffer which is currently in use (23345104, 1, 0)
_pmiu_daemon(SIGCHLD): [NID 00166] [c0-0c2s9n2] [Thu May 21 16:07:48 2015] PE RANK 1 exit signal Segmentation fault
Analyze
- ignoring segfault above...
- /apps/ela/vampir/8.4.1/bin/vampir scorep-20150521_1611_88946296565420/traces.otf2
Debug
[Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:337: Warning: [CUPTI] Call to 'cuptiErr' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
[Score-P] src/adapters/cuda/scorep_cupti4_activity.c:277: Warning: [CUPTI] Call to 'cuptiActivityRegisterCallbacks( buffer_requested_callback, buffer_completed_callback )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
[Score-P] src/adapters/cuda/scorep_cupti_activity.c:781: Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
[Score-P] src/adapters/cuda/scorep_cupti_activity.c:795: Warning: [CUPTI] Call to 'cuptiActivityEnable( CUPTI_ACTIVITY_KIND_MEMCPY )' failed with message: 'CUPTI_ERROR_NOT_INITIALIZED'
[Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:361: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RUNTIME_API )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
[Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:384: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_SYNCHRONIZE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
[Score-P] src/adapters/cuda/scorep_cupti_callbacks.c:388: Warning: [CUPTI] Call to 'cuptiEnableDomain( 1, scorep_cupti_callbacks_subscriber, CUPTI_CB_DOMAIN_RESOURCE )' failed with message: 'CUPTI_ERROR_INVALID_PARAMETER'
(the same seven warnings are printed a second time by the other process)
Processes 0-1: Process stopped in jacobi kernel (jacobi_cuda_kernel.cu:44) with signal CUDA_EXCEPTION_10 (Device Illegal Address). Reason/Origin: kill, sigsend or raise Your program will probably be terminated if you continue. You can use the stack controls to see what the process was doing at the time.
Currently Loaded Modulefiles:
1) modules/3.2.10.3
2) nodestat/2.2-1.0502.53712.3.109.ari
3) sdb/1.0-1.0502.55976.5.27.ari
4) alps/5.2.1-2.0502.9041.11.6.ari
5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
6) udreg/2.3.2-1.0502.9275.1.12.ari
7) ugni/5.0-1.0502.9685.4.24.ari
8) gni-headers/3.0-1.0502.9684.5.2.ari
9) dmapp/7.0.1-1.0502.9501.5.219.ari
10) xpmem/0.1-2.0502.55507.3.2.ari
11) hss-llm/7.2.0
12) Base-opts/1.0.2-1.0502.53325.1.2.ari
13) craype-network-aries
14) craype/2.3.0
15) craype-sandybridge
16) slurm
17) cray-mpich/7.2.0
18) ddt/5.0
19) gcc/4.8.2
20) totalview-support/1.1.4
21) totalview/8.11.0
22) cray-libsci/13.0.3
23) pmi/5.0.6-1.0000.10439.140.2.ari
24) atp/1.8.1
25) PrgEnv-gnu/5.2.40
26) /linux/jg
27) cray-libsci_acc/3.1.1
28) cudatoolkit/6.5.14-1.0502.9613.6.1
29) craype-accel-nvidia35
30) papi/5.4.0.1
-
reporter From: Ronny Tschueter
module load scorep/1.4.1
gets rid of the CUPTI warnings, together with adding a call to cudaDeviceReset().
This message can be safely ignored:
[NID 00013] 2015-05-26 14:34:28 Apid 158049: initiated application termination
-
reporter scorep/1.4.1 (GNU only)
-
reporter - changed title to JACOBI - MPI/CUDA (scorep)
-
reporter scorep/1.4.2
GNU (OK)
Setup
- module swap PrgEnv-cray PrgEnv-gnu
- module load craype-accel-nvidia35
- module load papi/5.4.1.1 # !!!
- module load scorep/1.4.2
Compile
SC=/apps/santis/scorep/1.4.2/gnu482sci1304mpi722cuda6514acc311otf151opa114cube431/bin/scorep
$SC --cuda nvcc -arch=sm_35 -O3 -c ../jacobi_cuda_kernel.cu -o jacobi_cuda_kernel.o
$SC --mpp=mpi --thread=omp --cuda cc -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
  -fopenmp -std=c99 -O3 -c ../jacobi_cuda.c -o jacobi_cuda.o
$SC --mpp=mpi --thread=omp --cuda cc -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
  -fopenmp -std=c99 -O3 jacobi_cuda_kernel.o jacobi_cuda.o -o GNU.santis+sc142
Note: --cuda is also passed to the cc link step.
Run
- export SCOREP_ENABLE_PROFILING=false
- export SCOREP_ENABLE_TRACING=true
- export SCOREP_CUDA_ENABLE=yes
- sbatch.sh santis 5 GNU.santis+sc142 1 1 4 "4096 4096 0.1"
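sbatch.sh is a private wrapper, so its interior is unknown; below is a hypothetical sketch of what it might expand the arguments into. The argument meanings (cluster, walltime minutes, executable, nodes, tasks per node, threads per task, program arguments) are guesses from the call above, and the sketch only prints the generated job script instead of submitting it:

```shell
#!/bin/sh
# Hypothetical sbatch.sh sketch; all parameter meanings are assumptions.
# Defaults mirror the call: sbatch.sh santis 5 GNU.santis+sc142 1 1 4 "4096 4096 0.1"
cluster=${1:-santis}
minutes=${2:-5}
exe=${3:-GNU.santis+sc142}
nodes=${4:-1}
ppn=${5:-1}
threads=${6:-4}
args=${7:-4096 4096 0.1}
ntasks=$((nodes * ppn))
job="#!/bin/sh
#SBATCH --partition=$cluster
#SBATCH --nodes=$nodes
#SBATCH --time=00:$(printf '%02d' "$minutes"):00
export OMP_NUM_THREADS=$threads
aprun -n$ntasks -N$ppn -d$threads ./$exe $args"
printf '%s\n' "$job"
```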
-
reporter scorep/1.4.2
INTEL (OK)
Setup
- module swap PrgEnv-cray PrgEnv-intel
- module load craype-accel-nvidia35
- module load papi/5.4.1.1 # !!!
- module load scorep/1.4.2
Compile
SC=/apps/santis/scorep/1.4.2/int1501sci1304mpi722cuda6514acc311otf151opa114cube431/bin/scorep
$SC --cuda nvcc -arch=sm_35 -O3 -c ../jacobi_cuda_kernel.cu -o jacobi_cuda_kernel.o
$SC --mpp=mpi --thread=omp --cuda cc -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
  -openmp -O3 -c ../jacobi_cuda.c -o jacobi_cuda.o
$SC --mpp=mpi --thread=omp --cuda cc -D_CSCS_ITMAX=100 -DOMP_MEMLOCALITY -DUSE_MPI \
  -openmp -O3 jacobi_cuda_kernel.o jacobi_cuda.o -o INTEL.santis+sc142
Note: --cuda is also passed to the cc link step; the following warning can be ignored:
gcc: unrecognized option '-tcollect'
Run
- export SCOREP_ENABLE_PROFILING=false
- export SCOREP_ENABLE_TRACING=true
- export SCOREP_CUDA_ENABLE=yes
- sbatch.sh santis 5 INTEL.santis+sc142 1 1 4 "4096 4096 0.1"