CP2K (scorep)

Issue #48 new
jg piccinali repo owner created an issue

Scorep/142

Daint

Setup

  • src
  • module swap PrgEnv-cray PrgEnv-gnu
  • module swap gcc gcc/4.8.2
  • module load fftw
  • module load craype-accel-nvidia35

Compile

  • cd /apps/daint/5.2.UP02/sandbox/jgp/cp2k/GNU482/cp2k-code-15721-trunk/cp2k/makefiles/
  • make ARCH=CRAY-XC30-gfortran-gpu VERSION=psmp
    • /project/csstaff/lucamar/install_scripts/cp2k/xc30/CRAY-XC30-gfortran-gpu.psmp

Run

  • cd exe/CRAY-XC30-gfortran-gpu
  • ln -s /apps/daint/5.2.UP02/sandbox/jgp/cp2k/in/* .
  • ./sbatch.sh santis 10 ./cp2k.psmp 8 1 8 H2O-dft-ls_NREP2.inp
 -------------------------------------------------------------------------------
 -                                T I M I N G                                  -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0    0.028    0.029   30.801   30.802
real 35.07

Compile (scorep)

  • module load scorep/1.4.2
  • make ARCH=CRAY-XC30-gfortran-gpusc142 VERSION=psmp
NVCC     = scorep --cuda nvcc
CC       = scorep --mpp=mpi --thread=omp:pomp_tpd cc
FC       = scorep --mpp=mpi --thread=omp:pomp_tpd ftn
LD       = scorep --mpp=mpi --thread=omp:pomp_tpd ftn
  • Warning: line-map.c: file "src/colloc_int_body.f90" left but not entered

Profile (scorep)

&TIMINGS OFF
    TIME_MPI .FALSE.
&END
  • export SCOREP_ENABLE_PROFILING=true # true or false
  • export SCOREP_ENABLE_TRACING=false # true or false
  • export SCOREP_CUDA_ENABLE=no
  • export SCOREP_TOTAL_MEMORY=17000000
[Score-P] src/measurement/SCOREP_Memory.c:145: 
Error: No free memory page available: Out of memory. 
Please increase SCOREP_TOTAL_MEMORY=17000000 and try again.
#13  0x2150FD5 in __list_callstackentry_MOD_list_callstackentry_pop
#14  0x2105A51 in __timings_MOD_timestop
#15  0x1FE170A in __dbcsr_error_handling_MOD_dbcsr_error_stop
  • export CRAY_CUDA_MPS=1 # without tool
  • export CRAY_CUDA_MPS=0 # with tool
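The out-of-memory error above points at SCOREP_TOTAL_MEMORY being too small for this run. A minimal sketch of a profiling-only environment follows. The function name scorep_profile_env is made up for illustration, and the 50M and depth-10 values are the ones tried later in this issue for the run without OpenMP, not verified recommendations.

```shell
# Sketch: profiling-only Score-P environment for the CP2K run.
# scorep_profile_env is a hypothetical helper name; the values come from
# this issue and will need tuning for other inputs.
scorep_profile_env() {
    export SCOREP_ENABLE_PROFILING=true
    export SCOREP_ENABLE_TRACING=false
    export SCOREP_CUDA_ENABLE=no
    # 17000000 ran out of memory pages; 50M is the value tried later on.
    export SCOREP_TOTAL_MEMORY=50M
    # The run reported a call-path depth of 44 against a limit of 30;
    # 10 is the limit set later in this issue.
    export SCOREP_PROFILING_MAX_CALLPATH_DEPTH=10
    # CUDA MPS is switched off when running with the tool.
    export CRAY_CUDA_MPS=0
}

scorep_profile_env
# Show what the job will see.
env | grep -E '^(SCOREP_|CRAY_CUDA_MPS)' | sort
```

Calling the function from the job script right before the launch line keeps the tool settings in one place and makes it easy to switch back to CRAY_CUDA_MPS=1 for runs without the tool.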

Comments (7)

  1. jg piccinali reporter
    • cuda-gdb -c core -e cp2k.psmp
    (cuda-gdb) bt
    #9  scorep_pomp_lock_destroy (lock=lock@entry=0x2aab277e549c) 
    at ../src/adapters/pomp/SCOREP_Pomp_Lock.c:216
    #10 0x00002aaaabd168f7 in POMP2_Destroy_lock (s=0x2aab277e549c) 
    at ../src/adapters/pomp/SCOREP_Pomp_Omp.c:837
    #11 0x000000000113616b in ?? ()
    #12 0x0000000000000000 in ?? ()
    
    • addr2line -e cp2k.psmp 0x000000000113616b
      • cp2k/src/task_list_methods.F:2665
    2662 !$ IF (.not.scatter) THEN
    2663 !$omp do
    2664 !$  do i=1,nthread*10
    2665 !$    call omp_destroy_lock(locks(i))  
    2666 !$  end do
    2667 !$omp end do
    2668 !$ END IF
    
  2. jg piccinali reporter

    Profiling (scorep, no OpenMP)

    [Score-P] src/measurement/profiling/scorep_profile_collapse.c:77: 
    Warning: Score-P callpath depth limitation of 30 exceeded.
    Reached callpath depth was 44
    
    • export SCOREP_TOTAL_MEMORY=50M
    • export SCOREP_PROFILING_MAX_CALLPATH_DEPTH=10
    • scorep-score scorep-n8N1d1/profile.cubex
    Estimated aggregate size of event trace:                   73MB
    Estimated requirements for largest trace buffer (max_buf): 10MB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       12MB
    flt     type max_buf[B]    visits time[s] time[%] time/visit[us]  region
             ALL  9,607,220 2,900,785 1010.67   100.0         348.41  ALL
             USR  9,269,260 2,833,655 1006.39    99.6         355.16  USR
             MPI    214,126    29,128    2.71     0.3          93.00  MPI
             COM    116,948    35,942    1.57     0.2          43.60  COM
             OMP      6,886     2,060    0.00     0.0           1.14  OMP
    real 132.86
    
    • scorep-score scorep-n8N1d2/profile.cubex
    Estimated aggregate size of event trace:                   74MB
    Estimated requirements for largest trace buffer (max_buf): 10MB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       14MB
    flt     type max_buf[B]    visits time[s] time[%] time/visit[us]  region
             ALL  9,639,950 2,910,817 1166.90   100.0         400.88  ALL
             USR  9,301,578 2,843,591 1162.84    99.7         408.94  USR
             MPI    214,126    29,128    2.35     0.2          80.58  MPI
             COM    117,000    35,958    1.48     0.1          41.12  COM
             OMP      7,246     2,140    0.23     0.0         107.91  OMP
    real 89.10
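
The scorep-score reports above already contain a memory estimate, so it can be picked up by a script instead of being copied by hand. A small sketch, assuming the "Estimated memory requirements (SCOREP_TOTAL_MEMORY)" line keeps the format shown above; suggest_total_memory is a hypothetical helper name. Whether the estimate is enough for a profiling run with deep call paths is not something this issue establishes.

```shell
# Sketch: read scorep-score output on stdin and print the estimated
# SCOREP_TOTAL_MEMORY value in the "<n>M" form used elsewhere in this issue.
suggest_total_memory() {
    awk 'index($0, "(SCOREP_TOTAL_MEMORY)") {
             v = $NF              # last field, e.g. "14MB"
             sub(/MB$/, "M", v)   # "14MB" -> "14M"
             print v
         }'
}

# Demo on the estimate line reported for scorep-n8N1d2 above.
printf '%s\n' 'Estimated memory requirements (SCOREP_TOTAL_MEMORY):       14MB' \
    | suggest_total_memory
# prints: 14M

# Possible use in a job script (assumes scorep-score is on PATH):
#   export SCOREP_TOTAL_MEMORY=$(scorep-score scorep-n8N1d2/profile.cubex | suggest_total_memory)
```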
    

    cuben8N1d2.png

    Tracing (scorep, no OpenMP)

    • export SCOREP_ENABLE_PROFILING=false
    • export SCOREP_ENABLE_TRACING=true
    • export SCOREP_CUDA_ENABLE=yes
    • export SCOREP_TOTAL_MEMORY=2000M # <------------ ?

    vampir.png

  3. jg piccinali reporter

    Timers off

    &GLOBAL
      PROJECT H2O
      RUN_TYPE ENERGY
      PRINT_LEVEL MEDIUM
      &TIMINGS OFF
        TIME_MPI .FALSE.
      &END
      &PROGRAM_RUN_INFO OFF
      &END
      EXTENDED_FFT_LENGTHS
    
    &END GLOBAL
    
  4. Christian Feld

    To disable locks and criticals, instrument as follows:

    scorep --opari=--disable=critical,locks gcc ...

    To get a list of constructs that can be disabled, see opari2 --help.
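
Applied to the arch file used in this issue, the suggestion might look like the sketch below. Combining --opari=--disable=critical,locks with the existing --mpp and --thread flags is an assumption and has not been tested here. The NVCC line is left as it is.

```shell
# Sketch: print CC/FC/LD lines for the arch file with the OPARI2 option
# from the comment above appended to the existing Score-P flags.
SCOREP_FLAGS="--mpp=mpi --thread=omp:pomp_tpd"
OPARI_FLAGS="--opari=--disable=critical,locks"

# Each entry is <make variable>:<Cray compiler wrapper>.
for var in CC:cc FC:ftn LD:ftn; do
    name=${var%%:*}
    compiler=${var##*:}
    printf '%-8s = scorep %s %s %s\n' "$name" "$SCOREP_FLAGS" "$OPARI_FLAGS" "$compiler"
done
```

The printed lines can be pasted over the current CC, FC and LD definitions before rebuilding with make ARCH=CRAY-XC30-gfortran-gpusc142 VERSION=psmp.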

  5. jg piccinali reporter

    #__atomic_kind_types_MOD_*
    __cp_error_handling_MOD*
    __cp_files_MOD*
    __cp_linked_list*MOD*
    __cp_log_handling_MOD*
    __cp_output_handling_MOD*
    #__cp_parser_methods_MOD*
    __cp_parser_types_MOD*
    __cp_units_MOD*
    __cp_dbcsr_interface_MOD*
    __dbcsr_config_MOD*
    __dbcsr_error_handling_MOD*
    #__environment_MOD*
    __machine_MOD*
    __machine_gfortran_MOD*
    __ma_config_MOD*
    __input_keyword_types_MOD*
    __input_enumeration_types_MOD*
    __input_section_types_MOD*
    __input_val_types_MOD*
    __reference_manager_MOD*
    __string_utilities_MOD*
    __util_MOD*
    #__task_list_methods_MOD*
    #__timings_MOD*
    __timings_MOD_timer*

    I agree that Scalasca's inclusive routine timings agree well with those
    reported by cp2k. For the Scalasca measurement you needed to specify a large
    value for ESD_PATHS (150k) since there are lots of unique execution callpaths
    in your instrumented (non-filtered) execution. Similarly ESD_BUFFER_SIZE
    needed to be almost 4MB as there are lots of measured routines combined with
    lots of callpaths: in the analysis report there are 2100 measured routines, 30
    of which are MPI, and 1820 of which are purely computational (USR) routines.
    The number of measured callpaths also has a proportional impact on the size of
    the analysis reports (e.g., epitome.cube is over 200MB) and the time to
    process them.

    Selective instrumentation and/or measurement filtering can substantially
    reduce all of these costs (as well as reducing measurement overhead).

    One approach would be to create a filter containing all of the routines marked
    USR in epik.score. Alternatively, you could try the attached filter which
    will filter some of the most prolific modules: for example there are 23200
    callpaths for routines in __dbcsr_error_handling_MOD which are all USR.
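
The first approach in the quoted advice (filter every routine marked USR) can be sketched for Score-P as well. The sketch swaps epik.score for the per-region output of scorep-score -r and assumes that the type is the first field on each row (flt column blank, as in the reports above) and that region names contain no spaces. usr_filter and usr.filter are hypothetical names.

```shell
# Sketch: read `scorep-score -r` style rows on stdin and write a Score-P
# filter file that excludes every region whose type column is USR.
usr_filter() {
    echo "SCOREP_REGION_NAMES_BEGIN"
    echo "  EXCLUDE"
    # $1 is the type, $NF the region name; the "USR ... USR" row is the
    # per-type summary and is skipped.
    awk '$1 == "USR" && $NF != "USR" { print "    " $NF }'
    echo "SCOREP_REGION_NAMES_END"
}

# Possible use (assumes scorep-score is on PATH):
#   scorep-score -r scorep-n8N1d2/profile.cubex | usr_filter > usr.filter
#   scorep-score -f usr.filter scorep-n8N1d2/profile.cubex   # preview the effect
#   export SCOREP_FILTERING_FILE=usr.filter                  # apply to the next run
```

Excluding every USR region is the bluntest option. Trimming the generated file down to the most frequently visited modules, as in the list above, keeps more of the call tree in the profile.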
    
