memcpy cuda (nvvp/nvprof)
Issue #26
new
Setup:
ssh santis01
git clone https://github.com/eth-cscs/SciComp.git SciComp.git
cd SciComp.git/Training/cuda/mpicuda/C
module swap PrgEnv-cray PrgEnv-gnu
module load craype-accel-nvidia35
module swap cudatoolkit/6.5.14-1.0502.9613.6.1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
Compile:
make
ldd GNU.SANTIS |grep cuda
2: libcudart.so.6.5 => /opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/lib64/libcudart.so.6.5 (0x00002acbbbb38000)
3: libcuda.so.1 => /opt/cray/nvidia/default/lib64/libcuda.so.1 (0x00002acbbbe05000)
18: libcublas.so.6.5 => /opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/lib64/libcublas.so.6.5 (0x00002acbc3278000)
Run:
salloc
aprun -n1 ./GNU.SANTIS 128
=== get_gpu_info ===
Process 0 on nid00012 out of 1 Device 0 (Tesla K20X)
=== /proc/driver/nvidia/version ===
NVRM version: NVIDIA UNIX x86_64 Kernel Module 340.81
Wed Feb 18 16:28:19 PST 2015
=== cudaGetDeviceProperties ===
Device 0: "Tesla K20X"
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.5
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
More infos with : aprun -n1 nvidia-smi -q
0 64 127
0 0 0
Profile:
unset COMPUTE_PROFILE
export PMI_NO_FORK=1
aprun -n1 nvprof -o nvprof.output.%h.%p ./GNU.SANTIS 128
Error:
======== Error: unable to locate profiling library libcuinj64.so.
======== Make sure the CUDA toolkit is properly installed.
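A possible first check for this error is whether libcuinj64.so is reachable through LD_LIBRARY_PATH at all. The sketch below is only an illustration: `find_lib` is a hypothetical helper (not part of the CUDA toolkit or the Cray environment), and it assumes the profiler resolves the library through the normal dynamic-loader search path. It inspects the environment of the shell it runs in, so it would have to be launched through aprun to see what a compute node sees.

```shell
# Hypothetical helper: print the first directory in a colon-separated
# search path that contains the named library; return 1 if none does.
find_lib() {
    lib=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Assumption: LD_LIBRARY_PATH is the relevant search path for nvprof.
find_lib libcuinj64.so "${LD_LIBRARY_PATH:-}" \
    || echo "libcuinj64.so not found on LD_LIBRARY_PATH"
```

If a directory is printed, the library is visible in that environment and the cause is more likely to lie in how the job is launched than in the module setup.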
Comments (4)
-
reporter -
reporter memcopy3
Ben's code implements a Newton solver for a local nonlinear problem. Running 5 Newton iterations is just enough for the kernel to take longer than the data transfers. The code can then overlap the D2H and H2D transfers with computation by splitting the work into chunks, which roughly halves the reported total in the runs below.
Setup
- module swap PrgEnv-cray PrgEnv-gnu
- module load craype-accel-nvidia35
Currently Loaded Modulefiles:
1) modules/3.2.10.3
2) nodestat/2.2-1.0502.53712.3.109.ari
3) sdb/1.0-1.0502.55976.5.27.ari
4) alps/5.2.1-2.0502.9041.11.6.ari
5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
6) udreg/2.3.2-1.0502.9275.1.12.ari
7) ugni/5.0-1.0502.9685.4.24.ari
8) gni-headers/3.0-1.0502.9684.5.2.ari
9) dmapp/7.0.1-1.0502.9501.5.219.ari
10) xpmem/0.1-2.0502.55507.3.2.ari
11) hss-llm/7.2.0
12) Base-opts/1.0.2-1.0502.53325.1.2.ari
13) craype-network-aries
14) craype-sandybridge
15) craype/2.4.0
16) slurm
17) cray-mpich/7.2.2
18) ddt/5.0
19) gcc/4.8.2
20) totalview-support/1.1.4
21) totalview/8.11.0
22) cray-libsci/13.0.4
23) pmi/5.0.7-1.0000.10678.155.25.ari
24) atp/1.8.2
25) PrgEnv-gnu/5.2.40
26) cray-libsci_acc/3.1.1
27) cudatoolkit/6.5.14-1.0502.9613.6.1
28) craype-accel-nvidia35
Compile
- cd apps/daint/5.2.UP02/sandbox/jgp/cuda-examples.git/
- nvcc -c -arch=sm_35 -std=c++11 -O3 memcopy3.cu
- cc memcopy3.o -lcublas -lcuda -o memcopy3
Run
No overlap
- aprun -n1 ./memcopy3 20 1
memory copy overlap test of length N = 1048576 : 8MB with 1 chunks total : 0.00388045
- export PMI_NO_FORK=1; aprun -b -n1 nvprof -o nvprof.output.%h.%p ./memcopy3 20 1
- nvvp nvprof.output.nid00015.29230
Overlap (split into 10 chunks)
- aprun -n1 ./memcopy3 20 10
memory copy overlap test of length N = 1048576 : 8MB with 10 chunks total : 0.00180227
- export PMI_NO_FORK=1; aprun -b -n1 nvprof -o nvprof.output.%h.%p ./memcopy3 20 10
- nvvp nvprof.output.nid00015.29335
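To put a number on the gain, the two totals reported above can be divided. This is a small sketch; `speedup` is a hypothetical helper name, and the two values are copied from the memcopy3 output for 1 chunk and 10 chunks.

```shell
# Hypothetical helper: ratio of two timings, rounded to two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

# Totals reported by memcopy3 for 1 chunk and for 10 chunks.
speedup 0.00388045 0.00180227   # prints 2.15
```

These are single runs, so the ratio is indicative rather than a careful measurement.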
-
reporter - changed title to memcpy (nvvp/nvprof)
-
reporter - changed title to memcpy cuda (nvvp/nvprof)
Workaround: run aprun with -b, which bypasses the transfer of the executable to the compute nodes:
aprun -b -n1 nvprof -o nvprof.output.%h.%p ./GNU.SANTIS 128