memcpy cuda (nvvp/nvprof)

Issue #26 new
jg piccinali repo owner created an issue

Setup:

ssh santis01
git clone https://github.com/eth-cscs/SciComp.git   SciComp.git
cd SciComp.git/Training/cuda/mpicuda/C

module swap PrgEnv-cray PrgEnv-gnu
module load craype-accel-nvidia35
module swap cudatoolkit/6.5.14-1.0502.9613.6.1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

Compile:

make
ldd GNU.SANTIS |grep cuda

2:  libcudart.so.6.5 => /opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/lib64/libcudart.so.6.5 (0x00002acbbbb38000)
3:  libcuda.so.1 => /opt/cray/nvidia/default/lib64/libcuda.so.1 (0x00002acbbbe05000)
18: libcublas.so.6.5 => /opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/lib64/libcublas.so.6.5 (0x00002acbc3278000)

Run:

salloc
aprun -n1 ./GNU.SANTIS 128
=== get_gpu_info ===
Process 0 on nid00012 out of 1 Device 0 (Tesla K20X)

=== /proc/driver/nvidia/version ===
NVRM version: NVIDIA UNIX x86_64 Kernel Module  340.81  Wed Feb 18 16:28:19 PST 2015

=== cudaGetDeviceProperties ===
Device 0: "Tesla K20X"
  CUDA Driver Version / Runtime Version     6.5 / 6.5
  CUDA Capability Major/Minor version number:    3.5
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  More infos with : aprun -n1 nvidia-smi  -q

0 64 127
0 0 0 

Profile:

unset COMPUTE_PROFILE
export PMI_NO_FORK=1
aprun -n1 nvprof -o nvprof.output.%h.%p   ./GNU.SANTIS 128

Error:

======== Error: unable to locate profiling library libcuinj64.so.
======== Make sure the CUDA toolkit is properly installed.
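
A quick check that may help narrow this down: the helper below lists which directories of a colon-separated search path contain a given library. It is only a sketch. The function name find_lib_in_path is made up for this note, and it assumes that libcuinj64.so is expected somewhere on LD_LIBRARY_PATH, which may not be how nvprof actually locates it.

```shell
# find_lib_in_path NAME [PATHLIST]
# Print every directory of a colon-separated list (default: $LD_LIBRARY_PATH)
# that contains a file called NAME. Exit status is non-zero if none does.
# The body runs in a subshell so the IFS and globbing changes stay local.
find_lib_in_path() (
    name=$1
    pathlist=${2-${LD_LIBRARY_PATH-}}
    found=1
    IFS=:
    set -f   # disable globbing while the list is split on ':'
    for dir in $pathlist; do
        if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir"
            found=0
        fi
    done
    [ "$found" -eq 0 ]
)

# Usage (hypothetical):
#   find_lib_in_path libcuinj64.so || echo "not on LD_LIBRARY_PATH"
```

The result on the login node and on the compute node can differ, so the check is only meaningful when run in the same environment as nvprof.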

Comments (4)

  1. jg piccinali reporter

    Workaround: launch with aprun -b, which bypasses the transfer of the executable to the compute node.

    aprun -b -n1 nvprof -o nvprof.output.%h.%p ./GNU.SANTIS 128

    ==5278== NVPROF is profiling process 5278, command: ./GNU.SANTIS 128
    
    === get_gpu_info ===
    Process 0 on nid00012 out of 1 Device 0 (Tesla K20X)
    
    === /proc/driver/nvidia/version ===
    NVRM version: NVIDIA UNIX x86_64 Kernel Module  340.87  Thu Mar 19 23:39:02 PDT 2015
    
    === cudaGetDeviceProperties ===
    
    Device 0: "Tesla K20X"
      CUDA Driver Version / Runtime Version     6.5 / 6.5
      CUDA Capability Major/Minor version number:    3.5
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Maximum sizes of each dimension of a block:    1024 x 1024 x 64
      Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
      More infos with : aprun -n1 nvidia-smi  -q
    
    
    0 64 127
    0 0 0
    ==5278== Generated result file: /apps/todi/ddt/CSCS/SciComp.git/Training/cuda/mpicuda/C/nvprof.output.nid00012.5278
    
  2. jg piccinali reporter

    memcopy3

    Ben's code implements a Newton solver for a local nonlinear problem.
    Running 5 Newton iterations is just enough for the kernel to take longer than the data transfers.
    Splitting the work into chunks lets the code overlap the D2H and H2D transfers with computation, which gives more than a 2x speedup in the runs below.
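
Why chunking helps can be sketched with an idealized three-stage pipeline estimate. This is not taken from memcopy3 itself: the function name pipeline_time and the stage times are hypothetical, and the model assumes equal chunks, a device that can copy in both directions while a kernel runs, and no launch or per-transfer overhead.

```shell
# pipeline_time H2D KERNEL D2H NCHUNKS
# Idealized total time when the work is split into NCHUNKS equal chunks and
# the three stages of different chunks run concurrently.
# Inputs are whole-problem stage times in the same (arbitrary) unit.
pipeline_time() {
    awk -v h="$1" -v k="$2" -v d="$3" -v n="$4" 'BEGIN {
        slow = h; if (k > slow) slow = k; if (d > slow) slow = d
        # The first chunk passes through all three stages; after that the
        # slowest stage paces the remaining n-1 chunks.
        printf "%.3f\n", (h + k + d) / n + (n - 1) * slow / n
    }'
}

# Hypothetical stage times: each copy takes 1 unit, the kernel 2 units.
pipeline_time 1 2 1 1    # no overlap
pipeline_time 1 2 1 10   # 10 chunks
```

With these made-up numbers the estimate drops from 4.000 to 2.200 units, and for many chunks it approaches the time of the slowest stage alone, here the kernel.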
    

    Setup

    • module swap PrgEnv-cray PrgEnv-gnu
    • module load craype-accel-nvidia35
    • module list
    Currently Loaded Modulefiles:
      1) modules/3.2.10.3
      2) nodestat/2.2-1.0502.53712.3.109.ari
      3) sdb/1.0-1.0502.55976.5.27.ari
      4) alps/5.2.1-2.0502.9041.11.6.ari
      5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
      6) udreg/2.3.2-1.0502.9275.1.12.ari
      7) ugni/5.0-1.0502.9685.4.24.ari
      8) gni-headers/3.0-1.0502.9684.5.2.ari
      9) dmapp/7.0.1-1.0502.9501.5.219.ari
     10) xpmem/0.1-2.0502.55507.3.2.ari
     11) hss-llm/7.2.0
     12) Base-opts/1.0.2-1.0502.53325.1.2.ari
     13) craype-network-aries
     14) craype-sandybridge
     15) craype/2.4.0
     16) slurm
     17) cray-mpich/7.2.2
     18) ddt/5.0
     19) gcc/4.8.2
     20) totalview-support/1.1.4
     21) totalview/8.11.0
     22) cray-libsci/13.0.4
     23) pmi/5.0.7-1.0000.10678.155.25.ari
     24) atp/1.8.2
     25) PrgEnv-gnu/5.2.40
     26) cray-libsci_acc/3.1.1
     27) cudatoolkit/6.5.14-1.0502.9613.6.1
     28) craype-accel-nvidia35
    

    Compile

    • cd apps/daint/5.2.UP02/sandbox/jgp/cuda-examples.git/
    • nvcc -c -arch=sm_35 -std=c++11 -O3 memcopy3.cu
    • cc memcopy3.o -lcublas -lcuda -o memcopy3

    Run

    No overlap

    • aprun -n1 ./memcopy3 20 1
    memory copy overlap test of length N = 1048576 : 8MB with 1 chunks
    total : 0.00388045
    
    • export PMI_NO_FORK=1; aprun -b -n1 nvprof -o nvprof.output.%h.%p ./memcopy3 20 1
    • nvvp nvprof.output.nid00015.29230
    [screenshot: eff_nvvp0.png]

    Overlap (split into 10 chunks)

    • aprun -n1 ./memcopy3 20 10
    memory copy overlap test of length N = 1048576 : 8MB with 10 chunks
    total : 0.00180227
    
    • export PMI_NO_FORK=1; aprun -b -n1 nvprof -o nvprof.output.%h.%p ./memcopy3 20 10
    • nvvp nvprof.output.nid00015.29335
    [screenshot: eff_nvvp1.png]
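
For reference, the speedup implied by the two "total" values reported above can be computed as below. The helper name speedup is made up for this note, and the ratio comes from single runs, so it is indicative only.

```shell
# speedup T_BASE T_NEW: ratio of two timings, printed to two decimals
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 0.00388045 0.00180227   # totals reported for 1 chunk and 10 chunks
```

This prints 2.15.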