Issue #49 new

jg piccinali repo owner created an issue 2015-08-21

Scorep/142

Daint

Setup

src: lammps-10Feb15
module swap PrgEnv-cray PrgEnv-gnu
module swap gcc gcc/4.8.2
module load fftw
module load craype-accel-nvidia35

Compile

cd src; make package-update
cd ../lib/gpu/
make -f /project/csstaff/lucamar/install_scripts/lammps/xc30/Makefile-gnu_gpu.xc30
cd ../../src
make yes-standard yes-gpu yes-user-reaxc yes-user-omp
make no-voronoi no-reax no-poems no-meam no-kim
make xc30 # ---> lmp_xc30

Run

bipcanpT110_298.16_2:
- /project/csstaff/lucamar/test/lammps/test.in (without test.restart)
- sbatch.sh santis 10 ./lmp_xc30 32 8 1 "-in test.in"
- ERROR: Pair_coeff command before simulation box is defined (../input.cpp:1472)
ffield_SiC2.reaxc:
- cd /apps/santis/sandbox/jgp/lammps/lammps-10Feb15/src/JG/testB
- inputs: ffield_SiC2.reaxc, input.dat, 04cn.in
- egrep "run|thermo|processors" 04cn.in

processors 4 4 1
thermo      10
thermo_style    multi
run 50         # <------- max steps

aprun -n16 -N4 -d2 -j1 ./lmp_xc30 -in 04cn.in

Loop time of 75.1372 on 32 procs (16 MPI x 2 OpenMP) 
for 50 steps with 176128 atoms
real 81.38

Comments (3)

jg piccinali reporter

Compile (scorep)

module load scorep/1.4.2
lib/gpu
- grep scorep Makefile-gnu_gpu.xc30

NVCC = scorep --cuda nvcc 
CUDR_CPP = scorep --mpp=mpi --cuda CC ...

nvcc fatal

wget https://raw.githubusercontent.com/lammps/lammps/master/lib/gpu/lal_atom.cu
module swap PrgEnv-cray PrgEnv-gnu
module swap gcc gcc/4.8.2
module load fftw craype-accel-nvidia35
module load scorep/1.4.2

nvcc  -I/opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/include -DUNIX -Xptxas  -v --use_fast_math -arch=sm_35 \
-D_SINGLE_DOUBLE \
-DNV_KERNEL \
lal_atom.cu --cubin -o atom.cubin
# OK: ptxas info    : Used 16 registers, 348 bytes cmem[0], 8 bytes cmem[2]

scorep --cuda \
nvcc  -I/opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/include -DUNIX -Xptxas  -v --use_fast_math -arch=sm_35 \
-D_SINGLE_DOUBLE \
-DNV_KERNEL \
lal_atom.cu --cubin -o atom.cubin
# NOT OK:
# nvcc fatal   : A single input file is required for a non-link phase when an outputfile is specified
# workaround => remove -o atom.cubin
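
The workaround can be scripted so the rest of the lib/gpu build still finds atom.cubin. This is a minimal sketch, not the Makefile's actual rule. It assumes that nvcc, when given --cubin without -o, writes <basename>.cubin into the current directory. The build_cubin function and the DRY variable are hypothetical helpers introduced here; the flags are copied from the failing command above.

```shell
# Include path copied from the nvcc command in this issue.
CUDA_INC=/opt/nvidia/cudatoolkit6.5/6.5.14-1.0502.9613.6.1/include

# build_cubin SRC OUT (hypothetical helper)
# Compiles SRC through the scorep wrapper without -o, then renames the result.
# Set DRY=echo to print the commands instead of running them.
build_cubin() {
    src=$1
    out=$2
    base=$(basename "$src" .cu)   # lal_atom.cu -> lal_atom
    # Assumption: without -o, nvcc --cubin writes ${base}.cubin in the cwd.
    ${DRY:-} scorep --cuda nvcc -I"$CUDA_INC" -DUNIX -Xptxas -v \
        --use_fast_math -arch=sm_35 -D_SINGLE_DOUBLE -DNV_KERNEL \
        "$src" --cubin &&
    ${DRY:-} mv "${base}.cubin" "$out"
}
```

Usage: `build_cubin lal_atom.cu atom.cubin`, or `DRY=echo build_cubin lal_atom.cu atom.cubin` to check the generated commands first.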

09/2015: Received a patch for scorep.

2015-08-21T12:50:33+00:00

jg piccinali reporter
- For C++ codes, run scorep-score with -m so that region names are listed mangled, matching the MANGLED keyword in the filter file:
```
SCOREP_REGION_NAMES_BEGIN
 EXCLUDE
 MANGLED
...
SCOREP_REGION_NAMES_END
```
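
For illustration, a filter file of this shape could be generated as below. This is only a sketch: the write_filter helper and the `_ZN9LAMMPS_NS4Pair*` pattern are hypothetical placeholders, not the regions that were actually excluded in this issue. The real list should come from the top entries of `scorep-score -r -m`.

```shell
# write_filter FILE (hypothetical helper)
# Writes a Score-P filter file that excludes regions by mangled name.
write_filter() {
    cat > "$1" <<'EOF'
# Placeholder pattern: replace with names reported by scorep-score -r -m
SCOREP_REGION_NAMES_BEGIN
 EXCLUDE
 MANGLED
 _ZN9LAMMPS_NS4Pair*
SCOREP_REGION_NAMES_END
EOF
}

write_filter filter.example
```

The resulting file can then be checked with `scorep-score -f filter.example <experiment>/profile.cubex` before it is used for a measurement run.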
2015-08-25T09:11:46+00:00

jg piccinali reporter

Running

aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+notool -in 04cn.in

---------------- Step       50 ----- CPU =    538.6383 (sec) ----------------
TotEng   = -27281940.2435 KinEng   =   1264882.8364 Temp     =      2409.2954
PotEng   = -28546823.0799 E_bond   =         0.0000 E_angle  =         0.0000
E_dihed  =         0.0000 E_impro  =         0.0000 E_vdwl   = -28540060.4526
E_coul   =     -6762.6273 E_long   =         0.0000 Press    =     -1214.5315
Loop time of 538.639 on 32 procs (16 MPI x 2 OpenMP) for 50 steps with 176128 atoms
real 562.85

Profiling (scorep)

aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+sc142 -in 04cn.in
square scorep-n16N4d2P/

Filtering (scorep)

scorep-score -r -m scorep-n16N4d2P/profile.cubex

Estimated aggregate size of event trace:                   724GB
Estimated requirements for largest trace buffer (max_buf): 56GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       56GB

SCOREP_REGION_NAMES_BEGIN
 EXCLUDE
 MANGLED
...
SCOREP_REGION_NAMES_END

scorep-score -f scorep-n16N4d2P/filterjg scorep-n16N4d2P/profile.cubex

Estimated aggregate size of event trace:                   11MB
Estimated requirements for largest trace buffer (max_buf): 762kB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       7MB

Tracing (scorep)

export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=yes
export SCOREP_FILTERING_FILE=scorep-n16N4d2P/filterjg
grep run 04cn.in   # ==> run 10
aprun -n 16 -N 4 -d 2 -j 1 ./lmp_xc30+sc142 -in 04cn.in

---------------- Step       50 ----- CPU =     76.6258 (sec) ----------------
TotEng   = -27281940.2435 KinEng   =   1264882.8364 Temp     =      2409.2954
PotEng   = -28546823.0799 E_bond   =         0.0000 E_angle  =         0.0000
E_dihed  =         0.0000 E_impro  =         0.0000 E_vdwl   = -28540060.4526
E_coul   =     -6762.6273 E_long   =         0.0000 Press    =     -1214.5315
Loop time of 76.6259 on 32 procs (16 MPI x 2 OpenMP) for 50 steps with 176128 atoms
real 84.44
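
To put the filtered tracing run in context, its loop time can be compared with the initial lmp_xc30 run from the issue description (75.1372 s, also 50 steps with 176128 atoms). This is a rough sketch that assumes the two runs are otherwise comparable; overhead_pct is a hypothetical helper name.

```shell
# overhead_pct BASE INSTRUMENTED (hypothetical helper)
# Prints the relative slowdown of INSTRUMENTED over BASE, in percent.
overhead_pct() {
    awk -v base="$1" -v inst="$2" \
        'BEGIN { printf "%.2f\n", (inst - base) / base * 100 }'
}

# Loop times taken from the two 50-step runs reported in this issue.
overhead_pct 75.1372 76.6259   # prints 1.98
```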

2015-08-29T18:07:27+00:00

Assignee: –

Type: bug

Priority: major

Status: new