perfsuite (BlueWaters)

Issue #29 new
jg piccinali repo owner created an issue

PizDora

Profile

aprun -n 8 -N 8 -d 1 -j 1 psrun -f -p bt-mz_C.8
  • For MPI programs, pass the "-f" option (meaning "fork") to "psrun";
  • For OpenMP programs, pass the "-p" option (meaning "pthread");
  • For hybrid programs (MPI+OpenMP), pass both options: "-f -p";
  • Build with F_INC=-g.
bt-mz_C.8.0.4588.nid00034.xml
bt-mz_C.8.0.4589.nid00034.xml
bt-mz_C.8.0.4590.nid00034.xml
bt-mz_C.8.0.4591.nid00034.xml
bt-mz_C.8.0.4592.nid00034.xml
bt-mz_C.8.0.4593.nid00034.xml
bt-mz_C.8.0.4594.nid00034.xml
bt-mz_C.8.0.4595.nid00034.xml

Analyze

psprocess bt-mz_C.*.xml 
Event Count Information
=======================================================
Index Description                                                  Counter Value
--------------------------------------------------------------------------------
    1 Total cycles..............................................  36,000,548,251
    2 Instructions completed....................................  71,217,539,311

Event Index
--------------------------------------------------------------------------------
 1: PAPI_TOT_CYC     2: PAPI_TOT_INS    

Statistics
======================================================
Counting domain.................................................            user
Multiplexed.....................................................              no
Graduated instructions per cycle................................           1.978
MIPS (cycles)...................................................       5,145.389
MIPS (wall clock)...............................................       5,962.264
CPU time (seconds)..............................................          13.841
Wall clock time (seconds).......................................          11.945
% CPU utilization...............................................         115.876

PizDaint

Setup

module swap PrgEnv-cray PrgEnv-gnu

Compile

cd /apps/daint/5.2.UP02/perfsuitebw/1.1.4/
cd CSCS/proposals.git/vihps/NPB3.3-MZ-MPI/

make bt-mz CLASS=C NPROCS=8 MAIN=bt \
FLINKFLAGS="-dynamic -O3 -fopenmp" \
F_INC=-g

Run

cd bin
cp ../BT-MZ/inputbt-mz.data.sample inputbt-mz.data 
aprun -n8 -N8 -d1 -j1  bt-mz_C.8
  • BT-MZ Benchmark Completed.

Profile

source /apps/daint/5.2.UP02/perfsuitebw/1.1.4/gnu_482/bin/psenv.sh
aprun -n 8 -N 8 -d 1 -j 1 psrun -f -p bt-mz_C.8
psprocess bt-mz_C.8.0.19793.nid00012.xml
PerfSuite Hardware Performance Summary Report

Version                      : 1.0
Created                      : Mon Jun 01 14:05:37 CEST 2015
Generator                    : psprocess Java version 0.1
XML Source                   : bt-mz_C.8.0.19793.nid00012.xml

Execution Information
================================================================================
Collector                    : libpshwpc
Date                         : Mon Jun 01 14:05:31 CEST 2015
Host                         : nid00012
Process ID                   : 19793
Thread                       : 0
User                         : piccinal
Command                      : bt-mz_C.8

Processor and System Information
================================================================================
Node CPUs                    : 16
Vendor                       : Intel
Brand                        : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
CPUID Info                   : family: 6, model: 45, stepping: 7
CPU Revision                 : 7
Clock (MHz)                  : 2601.000
Memory (MB)                  : 32220.29
Pagesize (KB)                : 4

Cache Information
================================================================================
Cache levels                 : 3
--------------------------------
Level 1
Type                         : instruction
Size (KB)                    : 32
Linesize (B)                 : 64
Associativity                : 8
Type                         : data
Size (KB)                    : 32
Linesize (B)                 : 64
Associativity                : 8
--------------------------------
Level 2
Type                         : unified
Size (KB)                    : 256
Linesize (B)                 : 64
Associativity                : 8
--------------------------------
Level 3
Type                         : unified
Size (KB)                    : 20480
Linesize (B)                 : 64
Associativity                : 20

Event Count Information
================================================================================
Index Description                                                  Counter Value
--------------------------------------------------------------------------------
    1 Conditional branch instructions...........................     225,129,721
    2 Branch instructions.......................................     343,974,189
    3 Conditional branch instructions mispredicted..............       1,876,275
    4 Conditional branch instructions not taken.................      84,619,228
    5 Floating point divide instructions........................      92,144,835
    6 Floating point operations.................................  16,073,988,775
    7 Level 1 data cache misses.................................     373,760,238
    8 Level 1 instruction cache misses..........................         806,819
    9 Level 2 data cache accesses...............................     373,760,238
   10 Level 2 instruction cache accesses........................         973,617
   11 Level 2 instruction cache misses..........................         392,637
   12 Level 2 cache misses......................................      35,577,559
   13 Level 3 data cache reads..................................      27,565,820
   14 Level 3 instruction cache accesses........................         392,637
   15 Level 3 total cache accesses..............................      35,577,559
   16 Level 3 cache misses......................................       9,936,629
   17 Level 3 total cache writes................................       4,528,089
   18 Load instructions.........................................  12,163,521,733
   19 Store instructions........................................   6,581,176,220
   20 Cycles with no instruction issue..........................   2,201,866,739
   21 Instruction translation lookaside buffer misses...........          22,990
   22 Total cycles..............................................  16,531,314,158
   23 Instructions completed....................................  33,872,958,388

Event Index
--------------------------------------------------------------------------------
 1: PAPI_BR_CN       2: PAPI_BR_INS      3: PAPI_BR_MSP      4: PAPI_BR_NTK     
 5: PAPI_FDV_INS     6: PAPI_FP_OPS      7: PAPI_L1_DCM      8: PAPI_L1_ICM     
 9: PAPI_L2_DCA     10: PAPI_L2_ICA     11: PAPI_L2_ICM     12: PAPI_L2_TCM     
13: PAPI_L3_DCR     14: PAPI_L3_ICA     15: PAPI_L3_TCA     16: PAPI_L3_TCM     
17: PAPI_L3_TCW     18: PAPI_LD_INS     19: PAPI_SR_INS     20: PAPI_STL_ICY    
21: PAPI_TLB_IM     22: PAPI_TOT_CYC    23: PAPI_TOT_INS    

Statistics
================================================================================
Counting domain.................................................            user
Multiplexed.....................................................             yes
Floating point operations per cycle.............................           0.972
Floating point operations per graduated instruction.............           0.475
Graduated instructions per cycle................................           2.049
Graduated instructions per level 1 instruction cache miss.......      41,983.342
Percentage of cycles with no instruction issued.................          13.319
Graduated loads and stores per floating point operation.........           1.166
Level 2 cache miss ratio (data), data cache miss counts derived.           0.094
Level 2 cache miss ratio (instruction)..........................           0.403
Level 3 cache miss ratio........................................           0.279
Bandwidth used to level 2 cache (MB/s)..........................         358.252
Bandwidth used to level 3 cache (MB/s)..........................         100.058
MFLOPS (cycles).................................................       2,529.045
MFLOPS (wall clock).............................................       2,886.490
MIPS (cycles)...................................................       5,329.496
MIPS (wall clock)...............................................       6,082.745
CPU time (seconds)..............................................           6.356
Wall clock time (seconds).......................................           5.569
% CPU utilization...............................................         114.134

Cupti issue

  • Error 101 for CUDA Driver API function 'cuCtxCreate'. cuptiQuery failed
  • => PAPI must be recompiled without CUDA support.
Currently Loaded Modulefiles:
  1) modules/3.2.10.3
  2) nodestat/2.2-1.0502.53712.3.109.ari
  3) sdb/1.0-1.0502.55976.5.27.ari
  4) alps/5.2.1-2.0502.9041.11.6.ari
  5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
  6) udreg/2.3.2-1.0502.9275.1.12.ari
  7) ugni/5.0-1.0502.9685.4.24.ari
  8) gni-headers/3.0-1.0502.9684.5.2.ari
  9) dmapp/7.0.1-1.0502.9501.5.219.ari
 10) xpmem/0.1-2.0502.55507.3.2.ari
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.53325.1.2.ari
 13) craype-network-aries
 14) craype/2.3.0
 15) craype-sandybridge
 16) slurm
 17) cray-mpich/7.2.0
 18) ddt/5.0
 19) gcc/4.8.2
 20) totalview-support/1.1.4
 21) totalview/8.11.0
 22) cray-libsci/13.0.3
 23) pmi/5.0.6-1.0000.10439.140.2.ari
 24) atp/1.8.1
 25) PrgEnv-gnu/5.2.40

Comments (10)

  1. jg piccinali reporter

    /apps/daint/5.2.UP02/perfsuitebw/1.1.4/gnu_482/share/perfsuite/xml/pshwpc/papi_sandybridge.xml

    <ps_hwpc_eventlist class="PAPI">
      <!-- ===================================================
           Configuration file for Intel Sandy Bridge systems.
    
           $Id: papi_sandybridge.xml,v 1.1 2012/05/07 20:01:01 ruiliu Exp $
           =================================================== -->
      <ps_hwpc_event type="preset" name="PAPI_BR_CN" />
      <ps_hwpc_event type="preset" name="PAPI_BR_INS" />
      <ps_hwpc_event type="preset" name="PAPI_BR_MSP" />
      <ps_hwpc_event type="preset" name="PAPI_BR_NTK" />
      <ps_hwpc_event type="preset" name="PAPI_FDV_INS" />
      <ps_hwpc_event type="preset" name="PAPI_L1_DCM" />
      <ps_hwpc_event type="preset" name="PAPI_L1_ICM" />
      <ps_hwpc_event type="preset" name="PAPI_L2_DCA" />
      <ps_hwpc_event type="preset" name="PAPI_L2_ICA" />
      <ps_hwpc_event type="preset" name="PAPI_L2_ICM" />
      <ps_hwpc_event type="preset" name="PAPI_L2_TCM" />
      <ps_hwpc_event type="preset" name="PAPI_L3_DCR" />
      <ps_hwpc_event type="preset" name="PAPI_L3_ICA" />
      <ps_hwpc_event type="preset" name="PAPI_L3_TCA" />
      <ps_hwpc_event type="preset" name="PAPI_L3_TCM" />
      <ps_hwpc_event type="preset" name="PAPI_L3_TCW" />
      <ps_hwpc_event type="preset" name="PAPI_LD_INS" />
      <ps_hwpc_event type="preset" name="PAPI_SR_INS" />
      <ps_hwpc_event type="preset" name="PAPI_STL_ICY" />
      <ps_hwpc_event type="preset" name="PAPI_TLB_IM" />
      <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
      <ps_hwpc_event type="preset" name="PAPI_TOT_INS" />
    
    </ps_hwpc_eventlist>
    
  2. jg piccinali reporter

    Instrumentation

    Coding

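        /* Initialize the hardware performance counter library. */
        ret = ps_hwpc_init();
        if ( ret != PS_SUCCESS ) { fatal(ret, "Error in ps_hwpc_init"); }
        /* Start counting; the code to be measured follows this call. */
        ret = ps_hwpc_start();
        if ( ret != PS_SUCCESS ) { fatal(ret, "Error in ps_hwpc_start (1)"); }
        /* Pause counting. */
        ret = ps_hwpc_suspend();
        if ( ret != PS_SUCCESS ) { fatal(ret, "Error in ps_hwpc_suspend"); }
    
        /* Stop counting and write the XML output, named with the OUTPREFIX prefix. */
        ret = ps_hwpc_stop(OUTPREFIX);
        if ( ret != PS_SUCCESS ) { fatal(ret, "Error in ps_hwpc_stop"); }
    
        /* Release the library's resources. */
        ret = ps_hwpc_shutdown();
        if ( ret != PS_SUCCESS ) { fatal(ret, "Error in ps_hwpc_shutdown"); }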
    

    Compile

    • cd /apps/daint/5.2.UP02/perfsuitebw/1.1.4/gnu_482/share/perfsuite/examples/hl
    • make clean
    • make CC=cc CSCS="-dynamic -lexpat"
    cc -c -g -O -I/apps/daint/5.2.UP02/perfsuitebw/1.1.4/gnu_482/include hl.c
    
    cc -o hl hl.o -L/apps/daint/5.2.UP02/perfsuitebw/1.1.4/gnu_482/lib \
    -L/apps/daint/5.2.UP02/sandbox/jgp/papi/5.4.1/gnu_482/lib \
    -lpshwpc -lperfsuite -lpapi -dynamic -lexpat
    

    Run

    • aprun -n1 ./hl

    Analyze

    • psprocess hlout.11499.santis01.xml
    PerfSuite Hardware Performance Summary Report
    
    Version                      : 1.0
    Created                      : Mon Jun 01 14:43:55 CEST 2015
    Generator                    : psprocess Java version 0.1
    XML Source                   : hlout.11499.santis01.xml
    
    Execution Information
    ================================================================================
    Collector                    : libpshwpc
    Date                         : Mon Jun 01 14:40:24 CEST 2015
    Host                         : santis01
    Process ID                   : 11499
    Thread                       : 0
    User                         : piccinal
    Command                      : hl
    
    Processor and System Information
    ================================================================================
    Node CPUs                    : 16
    Vendor                       : Intel
    Brand                        : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    CPUID Info                   : family: 6, model: 45, stepping: 7
    CPU Revision                 : 7
    Clock (MHz)                  : 2601.000
    Memory (MB)                  : 32217.99
    Pagesize (KB)                : 4
    
    Cache Information
    ================================================================================
    Cache levels                 : 3
    --------------------------------
    Level 1
    Type                         : instruction
    Size (KB)                    : 32
    Linesize (B)                 : 64
    Associativity                : 8
    Type                         : data
    Size (KB)                    : 32
    Linesize (B)                 : 64
    Associativity                : 8
    --------------------------------
    Level 2
    Type                         : unified
    Size (KB)                    : 256
    Linesize (B)                 : 64
    Associativity                : 8
    --------------------------------
    Level 3
    Type                         : unified
    Size (KB)                    : 20480
    Linesize (B)                 : 64
    Associativity                : 20
    
    Event Count Information
    ================================================================================
    Index Description                                                  Counter Value
    --------------------------------------------------------------------------------
        1 Conditional branch instructions...........................      58,878,672
        2 Branch instructions.......................................      59,196,806
        3 Conditional branch instructions mispredicted..............             257
        4 Conditional branch instructions not taken.................           1,471
        5 Floating point divide instructions........................               0
        6 Floating point operations.................................             139
        7 Level 1 data cache misses.................................             516
        8 Level 1 instruction cache misses..........................             677
        9 Level 2 data cache accesses...............................             516
       10 Level 2 instruction cache accesses........................              72
       11 Level 2 instruction cache misses..........................               6
       12 Level 2 cache misses......................................             -24
       13 Level 3 data cache reads..................................             167
       14 Level 3 instruction cache accesses........................               6
       15 Level 3 total cache accesses..............................             -24
       16 Level 3 cache misses......................................              79
       17 Level 3 total cache writes................................               8
       18 Load instructions.........................................     160,395,802
       19 Store instructions........................................               0
       20 Cycles with no instruction issue..........................               0
       21 Instruction translation lookaside buffer misses...........               0
       22 Total cycles..............................................     118,530,081
       23 Instructions completed....................................     296,061,934
    
    Event Index
    --------------------------------------------------------------------------------
     1: PAPI_BR_CN       2: PAPI_BR_INS      3: PAPI_BR_MSP      4: PAPI_BR_NTK     
     5: PAPI_FDV_INS     6: PAPI_FP_OPS      7: PAPI_L1_DCM      8: PAPI_L1_ICM     
     9: PAPI_L2_DCA     10: PAPI_L2_ICA     11: PAPI_L2_ICM     12: PAPI_L2_TCM     
    13: PAPI_L3_DCR     14: PAPI_L3_ICA     15: PAPI_L3_TCA     16: PAPI_L3_TCM     
    17: PAPI_L3_TCW     18: PAPI_LD_INS     19: PAPI_SR_INS     20: PAPI_STL_ICY    
    21: PAPI_TLB_IM     22: PAPI_TOT_CYC    23: PAPI_TOT_INS    
    
    Statistics
    ================================================================================
    Counting domain.................................................            user
    Multiplexed.....................................................             yes
    Floating point operations per cycle.............................           0.000
    Floating point operations per graduated instruction.............           0.000
    Graduated instructions per cycle................................           2.498
    Graduated instructions per level 1 instruction cache miss.......     437,314.526
    Percentage of cycles with no instruction issued.................           0.000
    Graduated loads and stores per floating point operation.........   1,153,926.633
    Level 2 cache miss ratio (data), data cache miss counts derived.          -0.058
    Level 2 cache miss ratio (instruction)..........................           0.083
    Level 3 cache miss ratio........................................          -3.292
    Bandwidth used to level 2 cache (MB/s)..........................          -0.034
    Bandwidth used to level 3 cache (MB/s)..........................           0.111
    MFLOPS (cycles).................................................           0.003
    MFLOPS (wall clock).............................................           0.003
    MIPS (cycles)...................................................       6,496.723
    MIPS (wall clock)...............................................       6,806.692
    CPU time (seconds)..............................................           0.046
    Wall clock time (seconds).......................................           0.043
    % CPU utilization...............................................         104.771
    
  3. jg piccinali reporter

    MFLOPS not available on Intel Haswell:

    cray-perftools:
        The document that specifies performance monitoring events for Intel
        processors does not include events that could be used to compute a
        count of floating point operations for Haswell processors: Intel 64
        and IA-32 Architectures Software Developer's Manual, Order Number
        253665-050US, February 2014.
    
  4. jg piccinali reporter

    dgemm

    intel+openblas

    Compile

    • module swap PrgEnv-cray PrgEnv-intel
    • make dgemm-naive
    • cc -c -dynamic -O2 -g0 -mavx -fopenmp dgemm-naive.c
    • cc -o dgemm-naive dgemm.o dgemm-naive.o -dynamic -O2 -g0 -mavx -fopenmp -L/users/fgilles/Projects/OpenBlas/libopenblas.a

    Run

    • source /apps/daint/5.2.UP02/perfsuitebw/1.1.4/int_1501/bin/psenv.sh
    • aprun -n1 psrun ./dgemm-naive
    Size: 512 512 512   Gflop/s: 4.64157 blas Gflops: 16.4371
    

    Analyze

    • psprocess dgemm-naive.23265.nid00012.xml
    PerfSuite Hardware Performance Summary Report
    
    Version                      : 1.0
    Created                      : Mon Jun 01 16:35:09 CEST 2015
    Generator                    : psprocess Java version 0.1
    XML Source                   : dgemm-naive.23265.nid00012.xml
    
    Execution Information
    ================================================================================
    Collector                    : libpshwpc
    Date                         : Mon Jun 01 16:34:51 CEST 2015
    Host                         : nid00012
    Process ID                   : 23265
    Thread                       : 0
    User                         : piccinal
    Command                      : dgemm-naive
    
    Processor and System Information
    ================================================================================
    Node CPUs                    : 16
    Vendor                       : Intel
    Brand                        : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    CPUID Info                   : family: 6, model: 45, stepping: 7
    CPU Revision                 : 7
    Clock (MHz)                  : 2601.000
    Memory (MB)                  : 32220.29
    Pagesize (KB)                : 4
    
    Cache Information
    ================================================================================
    Cache levels                 : 3
    --------------------------------
    Level 1
    Type                         : instruction
    Size (KB)                    : 32
    Linesize (B)                 : 64
    Associativity                : 8
    Type                         : data
    Size (KB)                    : 32
    Linesize (B)                 : 64
    Associativity                : 8
    --------------------------------
    Level 2
    Type                         : unified
    Size (KB)                    : 256
    Linesize (B)                 : 64
    Associativity                : 8
    --------------------------------
    Level 3
    Type                         : unified
    Size (KB)                    : 20480
    Linesize (B)                 : 64
    Associativity                : 20
    
    Event Count Information
    ================================================================================
    Index Description                                                  Counter Value
    --------------------------------------------------------------------------------
        1 Conditional branch instructions...........................      11,144,896
        2 Branch instructions.......................................      14,867,440
        3 Conditional branch instructions mispredicted..............          14,728
        4 Conditional branch instructions not taken.................       2,261,883
        5 Floating point divide instructions........................              86
        6 Floating point operations.................................               0
        7 Level 1 data cache misses.................................      23,463,527
        8 Level 1 instruction cache misses..........................              90
        9 Level 2 data cache accesses...............................      23,463,527
       10 Level 2 instruction cache accesses........................             111
       11 Level 2 instruction cache misses..........................              75
       12 Level 2 cache misses......................................      12,186,803
       13 Level 3 data cache reads..................................       9,817,206
       14 Level 3 instruction cache accesses........................              75
       15 Level 3 total cache accesses..............................      12,186,803
       16 Level 3 cache misses......................................              10
       17 Level 3 total cache writes................................              66
       18 Load instructions.........................................     116,334,341
       19 Store instructions........................................      27,544,422
       20 Cycles with no instruction issue..........................         517,295
       21 Instruction translation lookaside buffer misses...........           4,415
       22 Total cycles..............................................     187,479,084
       23 Instructions completed....................................     457,393,487
    
    Event Index
    --------------------------------------------------------------------------------
     1: PAPI_BR_CN       2: PAPI_BR_INS      3: PAPI_BR_MSP      4: PAPI_BR_NTK     
     5: PAPI_FDV_INS     6: PAPI_FP_OPS      7: PAPI_L1_DCM      8: PAPI_L1_ICM     
     9: PAPI_L2_DCA     10: PAPI_L2_ICA     11: PAPI_L2_ICM     12: PAPI_L2_TCM     
    13: PAPI_L3_DCR     14: PAPI_L3_ICA     15: PAPI_L3_TCA     16: PAPI_L3_TCM     
    17: PAPI_L3_TCW     18: PAPI_LD_INS     19: PAPI_SR_INS     20: PAPI_STL_ICY    
    21: PAPI_TLB_IM     22: PAPI_TOT_CYC    23: PAPI_TOT_INS    
    
    Statistics
    ================================================================================
    Counting domain.................................................            user
    Multiplexed.....................................................             yes
    Floating point operations per cycle.............................           0.000
    Floating point operations per graduated instruction.............           0.000
    Graduated instructions per cycle................................           2.440
    Graduated instructions per level 1 instruction cache miss.......   5,082,149.856
    Percentage of cycles with no instruction issued.................           0.276
    Level 2 cache miss ratio (data), data cache miss counts derived.           0.519
    Level 2 cache miss ratio (instruction)..........................           0.676
    Level 3 cache miss ratio........................................           0.000
    Bandwidth used to level 2 cache (MB/s)..........................      10,820.748
    Bandwidth used to level 3 cache (MB/s)..........................           0.009
    MFLOPS (cycles).................................................           0.000
    MFLOPS (wall clock).............................................           0.000
    MIPS (cycles)...................................................       6,345.670
    MIPS (wall clock)...............................................       5,386.716
    CPU time (seconds)..............................................           0.072
    Wall clock time (seconds).......................................           0.085
    % CPU utilization...............................................          84.888
    

    intel+mkl

    Run

    • aprun -n1 psrun ./dgemm-naive
    Size: 512 512 512   Gflop/s: 4.61794 blas Gflops: 9.23317
    

    Analyze

    • psprocess dgemm-naive.12019.nid00013.xml
    PerfSuite Hardware Performance Summary Report
    
    Version                      : 1.0
    Created                      : Mon Jun 01 16:40:51 CEST 2015
    Generator                    : psprocess Java version 0.1
    XML Source                   : dgemm-naive.12019.nid00013.xml
    
    Execution Information
    ================================================================================
    Collector                    : libpshwpc
    Date                         : Mon Jun 01 16:40:31 CEST 2015
    Host                         : nid00013
    Process ID                   : 12019
    Thread                       : 0
    User                         : piccinal
    Command                      : dgemm-naive
    
    Processor and System Information
    ================================================================================
    Node CPUs                    : 16
    Vendor                       : Intel
    Brand                        : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    CPUID Info                   : family: 6, model: 45, stepping: 7
    CPU Revision                 : 7
    Clock (MHz)                  : 2601.000
    Memory (MB)                  : 32220.29
    Pagesize (KB)                : 4
    
    Cache Information
    ================================================================================
    Cache levels                 : 3
    --------------------------------
    Level 1
    Type                         : instruction
    Size (KB)                    : 32
    Linesize (B)                 : 64
    Associativity                : 8
    Type                         : data
    Size (KB)                    : 32
    Linesize (B)                 : 64
    Associativity                : 8
    --------------------------------
    Level 2
    Type                         : unified
    Size (KB)                    : 256
    Linesize (B)                 : 64
    Associativity                : 8
    --------------------------------
    Level 3
    Type                         : unified
    Size (KB)                    : 20480
    Linesize (B)                 : 64
    Associativity                : 20
    
    Event Count Information
    ================================================================================
    Index Description                                                  Counter Value
    --------------------------------------------------------------------------------
        1 Conditional branch instructions...........................      10,631,132
        2 Branch instructions.......................................      15,031,701
        3 Conditional branch instructions mispredicted..............          15,715
        4 Conditional branch instructions not taken.................       2,146,559
        5 Floating point divide instructions........................             156
        6 Floating point operations.................................               0
        7 Level 1 data cache misses.................................      24,995,469
        8 Level 1 instruction cache misses..........................             172
        9 Level 2 data cache accesses...............................      24,995,469
       10 Level 2 instruction cache accesses........................             252
       11 Level 2 instruction cache misses..........................             166
       12 Level 2 cache misses......................................      12,983,919
       13 Level 3 data cache reads..................................      10,410,725
       14 Level 3 instruction cache accesses........................             166
       15 Level 3 total cache accesses..............................      12,983,919
       16 Level 3 cache misses......................................               5
       17 Level 3 total cache writes................................             102
       18 Load instructions.........................................     108,701,545
       19 Store instructions........................................      26,668,418
       20 Cycles with no instruction issue..........................       1,003,490
       21 Instruction translation lookaside buffer misses...........           4,975
       22 Total cycles..............................................     178,273,106
       23 Instructions completed....................................     409,256,479
    
    Event Index
    --------------------------------------------------------------------------------
     1: PAPI_BR_CN       2: PAPI_BR_INS      3: PAPI_BR_MSP      4: PAPI_BR_NTK     
     5: PAPI_FDV_INS     6: PAPI_FP_OPS      7: PAPI_L1_DCM      8: PAPI_L1_ICM     
     9: PAPI_L2_DCA     10: PAPI_L2_ICA     11: PAPI_L2_ICM     12: PAPI_L2_TCM     
    13: PAPI_L3_DCR     14: PAPI_L3_ICA     15: PAPI_L3_TCA     16: PAPI_L3_TCM     
    17: PAPI_L3_TCW     18: PAPI_LD_INS     19: PAPI_SR_INS     20: PAPI_STL_ICY    
    21: PAPI_TLB_IM     22: PAPI_TOT_CYC    23: PAPI_TOT_INS    
    
    Statistics
    ================================================================================
    Counting domain.................................................            user
    Multiplexed.....................................................             yes
    Floating point operations per cycle.............................           0.000
    Floating point operations per graduated instruction.............           0.000
    Graduated instructions per cycle................................           2.296
    Graduated instructions per level 1 instruction cache miss.......   2,379,398.134
    Percentage of cycles with no instruction issued.................           0.563
    Level 2 cache miss ratio (data), data cache miss counts derived.           0.519
    Level 2 cache miss ratio (instruction)..........................           0.659
    Level 3 cache miss ratio........................................           0.000
    Bandwidth used to level 2 cache (MB/s)..........................      12,123.843
    Bandwidth used to level 3 cache (MB/s)..........................           0.005
    MFLOPS (cycles).................................................           0.000
    MFLOPS (wall clock).............................................           0.000
    MIPS (cycles)...................................................       5,971.041
    MIPS (wall clock)...............................................       4,160.875
    CPU time (seconds)..............................................           0.069
    Wall clock time (seconds).......................................           0.098
    % CPU utilization...............................................          69.684
    