LAMMPS (scorep)

Issue #49 new
jg piccinali repo owner created an issue

Scorep/142

Daint

Setup

  • src: lammps-10Feb15
  • module swap PrgEnv-cray PrgEnv-gnu
  • module swap gcc gcc/4.8.2
  • module load fftw
  • module load craype-accel-nvidia35

Compile

  • cd src; make package-update
  • cd ../lib/gpu/
  • make -f /project/csstaff/lucamar/install_scripts/lammps/xc30/Makefile-gnu_gpu.xc30
  • cd ../../src
  • make yes-standard yes-gpu yes-user-reaxc yes-user-omp
  • make no-voronoi no-reax no-poems no-meam no-kim
  • make xc30 # ---> lmp_xc30

Run

  • bipcanpT110_298.16_2:

    • /project/csstaff/lucamar/test/lammps/test.in (without test.restart)
    • sbatch.sh santis 10 ./lmp_xc30 32 8 1 "-in test.in"
    • ERROR: Pair_coeff command before simulation box is defined (../input.cpp:1472)
  • ffield_SiC2.reaxc:

    • cd /apps/santis/sandbox/jgp/lammps/lammps-10Feb15/src/JG/testB
    • inputs: ffield_SiC2.reaxc, input.dat, 04cn.in
    • egrep "run|thermo|processors" 04cn.in
processors 4 4 1
thermo      10
thermo_style    multi
run 50         # <------- max steps
  • aprun -n16 -N4 -d2 -j1 ./lmp_xc30 -in 04cn.in
Loop time of 75.1372 on 32 procs (16 MPI x 2 OpenMP) 
for 50 steps with 176128 atoms
real 81.38
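
The aprun flags and the "processors" line in 04cn.in have to agree with each other. The sketch below only re-derives the numbers in the log line above ("32 procs (16 MPI x 2 OpenMP)"). The variable names are illustrative and not part of the original job setup.

```shell
# Values copied from "aprun -n16 -N4 -d2 -j1" and "processors 4 4 1".
n=16   # -n: total number of MPI ranks
N=4    # -N: MPI ranks per node
d=2    # -d: depth, i.e. CPUs (OpenMP threads) reserved per rank
px=4; py=4; pz=1   # LAMMPS processor grid from 04cn.in

nodes=$(( n / N ))         # compute nodes used
threads=$(( n * d ))       # total "procs" reported by LAMMPS
grid=$(( px * py * pz ))   # ranks required by the processors command

echo "nodes=$nodes ranks=$n threads_total=$threads grid=$grid"

# LAMMPS expects the processor grid to multiply out to the rank count.
if [ "$grid" -ne "$n" ]; then
    echo "processors grid ($grid) does not match -n ($n)" >&2
fi
```

If -n is changed, the "processors" line in the input file needs to change with it.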

Comments (3)

  1. jg piccinali reporter

    Compile (scorep)

    • module load scorep/1.4.2
    • lib/gpu
      • grep scorep Makefile-gnu_gpu.xc30
    NVCC = scorep --cuda nvcc 
    CUDR_CPP = scorep --mpp=mpi --cuda CC ...
    

    nvcc fatal

    nvcc  -I/opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/include -DUNIX -Xptxas  -v --use_fast_math -arch=sm_35 \
    -D_SINGLE_DOUBLE \
    -DNV_KERNEL \
    lal_atom.cu --cubin -o atom.cubin
    # OK: ptxas info    : Used 16 registers, 348 bytes cmem[0], 8 bytes cmem[2]
    
    scorep --cuda \
    nvcc  -I/opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/include -DUNIX -Xptxas  -v --use_fast_math -arch=sm_35 \
    -D_SINGLE_DOUBLE \
    -DNV_KERNEL \
    lal_atom.cu --cubin -o atom.cubin
    # NOT OK:
    # nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified
    # workaround => remove -o atom.cubin
    

    09/2015: Received a patch for scorep.

  2. jg piccinali reporter
    • For C++ codes, scorep-score must be run with -m (print mangled region names), and the filter file must list the regions under MANGLED:
    SCOREP_REGION_NAMES_BEGIN
     EXCLUDE
     MANGLED
    ...
    SCOREP_REGION_NAMES_END
    

    cube.png

  3. jg piccinali reporter

    Running

    • aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+notool -in 04cn.in
    ---------------- Step       50 ----- CPU =    538.6383 (sec) ----------------
    TotEng   = -27281940.2435 KinEng   =   1264882.8364 Temp     =      2409.2954
    PotEng   = -28546823.0799 E_bond   =         0.0000 E_angle  =         0.0000
    E_dihed  =         0.0000 E_impro  =         0.0000 E_vdwl   = -28540060.4526
    E_coul   =     -6762.6273 E_long   =         0.0000 Press    =     -1214.5315
    Loop time of 538.639 on 32 procs (16 MPI x 2 OpenMP) for 50 steps with 176128 atoms
    real 562.85
    

    Profiling (scorep)

    • aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+sc142 -in 04cn.in
    • square scorep-n16N4d2P/ lammpsP.jpg

    Filtering (scorep)

    • scorep-score -r -m scorep-n16N4d2P/profile.cubex
    Estimated aggregate size of event trace:                   724GB
    Estimated requirements for largest trace buffer (max_buf): 56GB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       56GB
    
    SCOREP_REGION_NAMES_BEGIN
     EXCLUDE
     MANGLED
    ...
    SCOREP_REGION_NAMES_END
    
    • scorep-score -f scorep-n16N4d2P/filterjg scorep-n16N4d2P/profile.cubex
    Estimated aggregate size of event trace:                   11MB
    Estimated requirements for largest trace buffer (max_buf): 762kB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       7MB
    

    Tracing (scorep)

    • export SCOREP_ENABLE_PROFILING=false
    • export SCOREP_ENABLE_TRACING=true
    • export SCOREP_CUDA_ENABLE=yes
    • export SCOREP_FILTERING_FILE=scorep-n16N4d2P/filterjg
    • grep run 04cn.in ==> run 10
    • aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+sc142 -in 04cn.in
    ---------------- Step       50 ----- CPU =     76.6258 (sec) ----------------
    TotEng   = -27281940.2435 KinEng   =   1264882.8364 Temp     =      2409.2954
    PotEng   = -28546823.0799 E_bond   =         0.0000 E_angle  =         0.0000
    E_dihed  =         0.0000 E_impro  =         0.0000 E_vdwl   = -28540060.4526
    E_coul   =     -6762.6273 E_long   =         0.0000 Press    =     -1214.5315
    Loop time of 76.6259 on 32 procs (16 MPI x 2 OpenMP) for 50 steps with 176128 atoms
    real 84.44
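
The tracing settings above can be grouped in one function so that they are applied together before each aprun call. This is only a sketch. setup_scorep_tracing is a hypothetical name, and setting SCOREP_TOTAL_MEMORY to the 7MB estimate printed by scorep-score is an assumption, not something done in the original run.

```shell
# Hypothetical helper: switch Score-P from profiling to tracing.
# $1 is the filter file written after inspecting scorep-score -r -m.
setup_scorep_tracing() {
    export SCOREP_ENABLE_PROFILING=false
    export SCOREP_ENABLE_TRACING=true
    export SCOREP_CUDA_ENABLE=yes
    export SCOREP_FILTERING_FILE="$1"
    # Assumption: size the buffer from the scorep-score estimate
    # for the filtered run (7MB).
    export SCOREP_TOTAL_MEMORY=7M
}

setup_scorep_tracing scorep-n16N4d2P/filterjg
echo "tracing=$SCOREP_ENABLE_TRACING filter=$SCOREP_FILTERING_FILE"
# Next step (not executed here):
#   aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+sc142 -in 04cn.in
```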
    

    lammpsT1.jpg lammpsT2.jpg
