CP2K (scorep)
Scorep/142
Daint
Setup
- src
- module swap PrgEnv-cray PrgEnv-gnu
- module swap gcc gcc/4.8.2
- module load fftw
- module load craype-accel-nvidia35
Compile
- cd /apps/daint/5.2.UP02/sandbox/jgp/cp2k/GNU482/cp2k-code-15721-trunk/cp2k/makefiles/
- make ARCH=CRAY-XC30-gfortran-gpu VERSION=psmp
- /project/csstaff/lucamar/install_scripts/cp2k/xc30/CRAY-XC30-gfortran-gpu.psmp
Run
- cd exe/CRAY-XC30-gfortran-gpu
- ln -s /apps/daint/5.2.UP02/sandbox/jgp/cp2k/in/* .
- ./sbatch.sh santis 10 ./cp2k.psmp 8 1 8 H2O-dft-ls_NREP2.inp
-------------------------------------------------------------
-                      T I M I N G                          -
-------------------------------------------------------------
SUBROUTINE    CALLS  ASD     SELF TIME       TOTAL TIME
            MAXIMUM       AVERAGE MAXIMUM  AVERAGE MAXIMUM
CP2K              1  1.0    0.028   0.029   30.801  30.802
real 35.07
Compile (scorep)
- module load scorep/1.4.2
- make ARCH=CRAY-XC30-gfortran-gpusc142 VERSION=psmp
NVCC = scorep --cuda nvcc
CC = scorep --mpp=mpi --thread=omp:pomp_tpd cc
FC = scorep --mpp=mpi --thread=omp:pomp_tpd ftn
LD = scorep --mpp=mpi --thread=omp:pomp_tpd ftn
- Warning: line-map.c: file "src/colloc_int_body.f90" left but not entered
Profile (scorep)
- vim H2O-dft-ls_NREP2.inp
&TIMINGS OFF
TIME_MPI .FALSE.
&END
- export SCOREP_ENABLE_PROFILING=true # true or false
- export SCOREP_ENABLE_TRACING=false # true or false
- export SCOREP_CUDA_ENABLE=no
- export SCOREP_TOTAL_MEMORY=17000000
[Score-P] src/measurement/SCOREP_Memory.c:145:
Error: No free memory page available: Out of memory.
Please increase SCOREP_TOTAL_MEMORY=17000000 and try again.
#13 0x2150FD5 in __list_callstackentry_MOD_list_callstackentry_pop
#14 0x2105A51 in __timings_MOD_timestop
#15 0x1FE170A in __dbcsr_error_handling_MOD_dbcsr_error_stop
- export CRAY_CUDA_MPS=1 # without tool
- export CRAY_CUDA_MPS=0 # with tool
Comments (7)
-
reporter -
reporter - cuda-gdb -c core -e cp2k.psmp
(cuda-gdb) bt
#9 scorep_pomp_lock_destroy (lock=lock@entry=0x2aab277e549c) at ../src/adapters/pomp/SCOREP_Pomp_Lock.c:216
#10 0x00002aaaabd168f7 in POMP2_Destroy_lock (s=0x2aab277e549c) at ../src/adapters/pomp/SCOREP_Pomp_Omp.c:837
#11 0x000000000113616b in ?? ()
#12 0x0000000000000000 in ?? ()
- addr2line -e cp2k.psmp 0x000000000113616b
- cp2k/src/task_list_methods.F:2665
2662 !$ IF (.not.scatter) THEN
2663 !$omp do
2664 !$ do i=1,nthread*10
2665 !$ call omp_destroy_lock(locks(i))
2666 !$ end do
2667 !$omp end do
2668 !$ END IF
-
reporter Profiling (scorep, no OpenMP)
[Score-P] src/measurement/profiling/scorep_profile_collapse.c:77: Warning: Score-P callpath depth limitation of 30 exceeded. Reached callpath depth was 44
- export SCOREP_TOTAL_MEMORY=50M
- export SCOREP_PROFILING_MAX_CALLPATH_DEPTH=10
- scorep-score scorep-n8N1d1/profile.cubex
Estimated aggregate size of event trace: 73MB
Estimated requirements for largest trace buffer (max_buf): 10MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 12MB
flt type max_buf[B]    visits time[s] time[%] time/visit[us] region
    ALL   9,607,220 2,900,785 1010.67   100.0         348.41 ALL
    USR   9,269,260 2,833,655 1006.39    99.6         355.16 USR
    MPI     214,126    29,128    2.71     0.3          93.00 MPI
    COM     116,948    35,942    1.57     0.2          43.60 COM
    OMP       6,886     2,060    0.00     0.0           1.14 OMP
real 132.86
- scorep-score scorep-n8N1d2/profile.cubex
Estimated aggregate size of event trace: 74MB
Estimated requirements for largest trace buffer (max_buf): 10MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 14MB
flt type max_buf[B]    visits time[s] time[%] time/visit[us] region
    ALL   9,639,950 2,910,817 1166.90   100.0         400.88 ALL
    USR   9,301,578 2,843,591 1162.84    99.7         408.94 USR
    MPI     214,126    29,128    2.35     0.2          80.58 MPI
    COM     117,000    35,958    1.48     0.1          41.12 COM
    OMP       7,246     2,140    0.23     0.0         107.91 OMP
real 89.10
Tracing (scorep, no OpenMP)
- export SCOREP_ENABLE_PROFILING=false
- export SCOREP_ENABLE_TRACING=true
- export SCOREP_CUDA_ENABLE=yes
- export SCOREP_TOTAL_MEMORY=2000M
<------------ ?
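The profiling and tracing settings tried in these notes can be collected into one shell helper, so a job script switches modes with a single call. This is only a sketch: scorep_mode is a hypothetical function name, the values are simply the ones listed above (the 2000M trace buffer is still an open question), and CRAY_CUDA_MPS=0 follows the "with tool" note.

```shell
# Hypothetical helper: select the Score-P measurement mode used in these notes.
# Values are copied from the notes, not tuned recommendations.
scorep_mode() {
    case "$1" in
        profile)
            export SCOREP_ENABLE_PROFILING=true
            export SCOREP_ENABLE_TRACING=false
            export SCOREP_CUDA_ENABLE=no
            export SCOREP_TOTAL_MEMORY=50M
            export SCOREP_PROFILING_MAX_CALLPATH_DEPTH=10
            ;;
        trace)
            export SCOREP_ENABLE_PROFILING=false
            export SCOREP_ENABLE_TRACING=true
            export SCOREP_CUDA_ENABLE=yes
            export SCOREP_TOTAL_MEMORY=2000M   # value still unconfirmed in the notes
            ;;
        *)
            echo "usage: scorep_mode profile|trace" >&2
            return 1
            ;;
    esac
    # The notes use CRAY_CUDA_MPS=0 whenever the tool is attached.
    export CRAY_CUDA_MPS=0
}
```

Calling scorep_mode profile (or scorep_mode trace) before launching the instrumented cp2k.psmp keeps the two configurations from being mixed by accident.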
-
reporter Timers off
&GLOBAL
  PROJECT H2O
  RUN_TYPE ENERGY
  PRINT_LEVEL MEDIUM
  &TIMINGS OFF
    TIME_MPI .FALSE.
  &END
  &PROGRAM_RUN_INFO OFF
  &END
  EXTENDED_FFT_LENGTHS
&END GLOBAL
-
reporter try --disable opari2...
-
To disable locks and criticals, instrument as follows:
scorep --opari=--disable=critical,locks gcc ...
To get a list of constructs that can be disabled, see opari2 --help.
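Combined with the wrapper flags from the Compile (scorep) section, the suggestion above could be applied roughly as follows. This is a sketch under assumptions: scorep_wrap is a hypothetical helper name, and overriding CC/FC/LD on the make command line is assumed to take precedence over the arch file (standard GNU make behaviour, not verified against the CP2K makefiles). Editing the arch file directly would be the equivalent.

```shell
# Hypothetical helper: print the Score-P wrapper command for a compiler driver,
# with OPARI2 instrumentation of criticals and locks disabled.
scorep_wrap() {
    printf 'scorep --opari=--disable=critical,locks --mpp=mpi --thread=omp:pomp_tpd %s\n' "$1"
}

# Usage (not executed here):
#   make ARCH=CRAY-XC30-gfortran-gpusc142 VERSION=psmp \
#        CC="$(scorep_wrap cc)" FC="$(scorep_wrap ftn)" LD="$(scorep_wrap ftn)"
```

Disabling lock instrumentation is aimed at the POMP2_Destroy_lock crash shown in the backtrace; whether it actually avoids it has not been confirmed in these notes.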
-
reporter #atomic_kind_types_MOD_*
_cp_error_handling_MOD*
_cp_files_MOD*
__cp_linked_list*MOD*
_cp_log_handling_MOD*
_cp_output_handling_MOD*
#_cp_parser_methods_MOD*
_cp_parser_types_MOD*
_cp_units_MOD*
_cp_dbcsr_interface_MOD*
_dbcsr_config_MOD*
_dbcsr_error_handling_MOD*
#_environment_MOD*
_machine_MOD*
_machine_gfortran_MOD*
_ma_config_MOD*
_input_keyword_types_MOD*
_input_enumeration_types_MOD*
_input_section_types_MOD*
_input_val_types_MOD*
_reference_manager_MOD*
_string_utilities_MOD*
_util_MOD*
#task_list_methods_MOD*
#_timings_MOD*
_timings_MODtimer
I agree that Scalasca's inclusive routine timings match those reported by cp2k well. For the Scalasca measurement you needed to specify a large value for ESD_PATHS (150k) since there are lots of unique execution callpaths in your instrumented (non-filtered) execution. Similarly, ESD_BUFFER_SIZE needed to be almost 4MB as there are lots of measured routines combined with lots of callpaths: in the analysis report there are 2100 measured routines, 30 of which are MPI, and 1820 of which are purely computational (USR) routines. The number of measured callpaths also has a proportional impact on the size of the analysis reports (e.g., epitome.cube is over 200MB) and the time to process them.

Selective instrumentation and/or measurement filtering can substantially reduce all of these costs (as well as reducing measurement overhead). One approach would be to create a filter containing all of the routines marked USR in epik.score. Alternatively, you could try the attached filter, which will filter some of the most prolific modules: for example, there are 23200 callpaths for routines in __dbcsr_error_handling_MOD, which are all USR.
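The advice above refers to Scalasca's epik.score, but the same idea can be sketched for Score-P: take the per-region listing from scorep-score -r and turn every USR region into an EXCLUDE entry of a Score-P filter file. The function name usr_regions_to_filter and the file name cp2k.filt are hypothetical, and the awk rule assumes region names contain no spaces, which may not hold for every region.

```shell
# Hypothetical helper: read `scorep-score -r` output on stdin and print a
# Score-P filter file that excludes every region of type USR.
usr_regions_to_filter() {
    echo "SCOREP_REGION_NAMES_BEGIN"
    echo "  EXCLUDE"
    # The type column is field 1, or field 2 when a filter marker precedes it.
    # The per-type summary row (region name "USR") is skipped.
    awk '($1 == "USR" || $2 == "USR") && $NF != "USR" { print "    " $NF }'
    echo "SCOREP_REGION_NAMES_END"
}

# Usage (not executed here):
#   scorep-score -r scorep-n8N1d1/profile.cubex | usr_regions_to_filter > cp2k.filt
#   scorep-score -f cp2k.filt scorep-n8N1d1/profile.cubex   # preview the effect
#   export SCOREP_FILTERING_FILE=cp2k.filt
```

Excluding every USR routine is deliberately coarse; trimming the generated list to the most prolific modules, as the attached filter does, keeps more detail in the profile.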