PROPOSALS CrayPat-lite: MPI/OpenMP

Issue #1 new
jg piccinali repo owner created an issue

MPI+OpenMP (Piz Daint)

Get the source

Cloning into 'proposals.git'...
remote: Counting objects: 339, done.
remote: Total 339 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (339/339), 300.16 KiB | 234 KiB/s, done.
Resolving deltas: 100% (139/139), done.
  • cd proposals.git/vihps/NPB3.3-MZ-MPI/BT-MZ/

Setup

  • module swap PrgEnv-cray PrgEnv-gnu
  • module use /project/csstaff/proposals
  • module load perflite/622
  • echo CRAYPAT_LITE=$CRAYPAT_LITE
CRAYPAT_LITE=sample_profile
  • module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.2
  2) nodestat/2.2-1.0502.53712.3.109.ari
  3) sdb/1.0-1.0502.55976.5.27.ari
  4) alps/5.2.1-2.0502.9041.11.6.ari
  5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
  6) udreg/2.3.2-1.0502.9275.1.12.ari
  7) ugni/5.0-1.0502.9685.4.24.ari
  8) gni-headers/3.0-1.0502.9684.5.2.ari
  9) dmapp/7.0.1-1.0502.9501.5.219.ari
 10) xpmem/0.1-2.0502.55507.3.2.ari
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.53325.1.2.ari
 13) craype-network-aries
 14) craype/2.2.1
 15) craype-sandybridge
 16) slurm
 17) cray-mpich/7.1.1
 18) ddt/4.3rc7
 19) gcc/4.8.2
 20) totalview-support/1.1.4
 21) totalview/8.11.0
 22) cray-libsci/13.0.1
 23) pmi/5.0.6-1.0000.10439.140.2.ari
 24) atp/1.7.5
 25) PrgEnv-gnu/5.2.40
 26) rca/1.0.0-2.0502.53711.3.127.ari
 27) perflite/622(default)

Compile

  • make clean
  • make bt-mz CLASS=C NPROCS=8 MAIN=bt
ftn -c  -O3 -fopenmp -ffixed-line-length-none    bt.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    initialize.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    exact_solution.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    exact_rhs.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    set_constants.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    adi.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    rhs.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    zone_setup.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    x_solve.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    y_solve.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    exch_qbc.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    solve_subs.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    z_solve.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    add.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    error.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    verify.f
ftn -c  -O3 -fopenmp -ffixed-line-length-none    mpi_setup.f
ftn -O3 -fopenmp -ffixed-line-length-none    -o ../bin/bt-mz_C.8 *.o

INFO: A maximum of 44 functions from group 'io' will be traced.
INFO: A maximum of 107 functions from group 'mpi' will be traced.
INFO: A maximum of 32 functions from group 'omp' will be traced.
INFO: A maximum of 23 functions from group 'realtime' will be traced.
INFO: A maximum of 52 functions from group 'syscall' will be traced.
INFO: creating the CrayPat-instrumented executable
 '../bin/bt-mz_C.8' (sample_profile) ...OK
Built executable ../bin/bt-mz_C.8

Run

  • cd ../bin/
  • sbatch ../jobscript/daint/run.sbatch
  • while read a b c;do e echo $a $b $c|sh |awk '{print "sbatch.sh daint 5 bt-mz_C."$1" "$1,$2,$3" \"\" \"\" \"\" -Ausup"}';done < in
Submitted batch job 2380

Reports

  • cat slurm-*.out
#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 6.2.2 Revision 13378 (xf 13240)  11/20/14 14:32:58
Experiment:                  lite  lite/sample_profile
Number of PEs (MPI ranks):      8
Numbers of PEs per Node:        2  PEs on each of  4  Nodes
Numbers of Threads per PE:      4
Number of Cores per Socket:     8
Execution start time:  Wed Jan 28 15:35:29 2015
System name and speed:  santis02 2601 MHz

Avg Process Time:    28.773 secs               
High Memory:           1401 MBytes     175.176 MBytes per PE
MFLOPS (aggregate):   21516 M/sec         2690 M/sec per PE
I/O Read Rate:       52.164 MBytes/sec         
I/O Write Rate:       5.842 MBytes/sec         
Avg CPU Energy:       17635 joules        4409 joules per node
Avg CPU Power:      612.902 watts      153.225 watts per node
Avg ACC Energy:        2383 joules     595.750 joules per node
Avg ACC Power:       82.821 watts       20.705 watts per node

Table 1:  Profile by Function Group and Function (top 10 functions shown)

  Samp% |   Samp | Imb. |  Imb. |Group
        |        | Samp | Samp% | Function
        |        |      |       |  PE=HIDE
        |        |      |       |   Thread=HIDE

 100.0% | 2837.4 |   -- |    -- |Total
|----------------------------------------------------------
|  91.0% | 2580.9 |   -- |    -- |USER
||---------------------------------------------------------
||  23.2% |  657.8 | 44.2 |  7.2% |binvcrhs_
||  13.5% |  384.1 | 24.9 |  7.0% |z_solve_._omp_fn.0
||  12.4% |  352.2 | 49.8 | 14.1% |y_solve_._omp_fn.0
||  12.4% |  350.5 | 15.5 |  4.8% |x_solve_._omp_fn.0
||  11.9% |  338.2 | 42.8 | 12.8% |matmul_sub_
||  11.3% |  321.0 | 20.0 |  6.7% |compute_rhs_._omp_fn.0
||   3.7% |  103.8 | 30.2 | 25.8% |matvec_sub_
||=========================================================
|   4.4% |  126.0 |   -- |    -- |ETC
||---------------------------------------------------------
||   2.1% |   59.6 |  7.4 | 12.6% |gomp_team_barrier_wait_end
||   2.1% |   59.4 | 16.6 | 25.0% |gomp_barrier_wait_end
||=========================================================
|   2.3% |   65.0 |   -- |    -- |MPI
|   2.3% |   64.1 | 32.9 | 38.7% |PTHREAD
||---------------------------------------------------------
||   2.3% |   64.1 | 32.9 | 38.7% |pthread_join
|==========================================================

Table 2:  File Input Stats by Filename

     Read |     Read |  Read Rate | Reads | Bytes/ |File Name[max15]
     Time |   MBytes | MBytes/sec |       |   Call | PE=HIDE

 0.000962 | 0.050205 |  52.163873 | 829.0 |  63.50 |Total
|-------------------------------------------------------------------
| 0.000957 | 0.050190 |  52.438244 | 827.0 |  63.64 |/proc/self/maps
| 0.000005 | 0.000015 |   2.864533 |   2.0 |   8.00 |_UnknownFile_
|===================================================================

Table 3:  File Output Stats by Filename

    Write |    Write | Write Rate | Writes | Bytes/ |File Name[max15]
     Time |   MBytes | MBytes/sec |        |   Call | PE=HIDE

 0.000870 | 0.005084 |   5.842064 |  159.0 |  33.53 |Total
|--------------------------------------------------------------------
| 0.000774 | 0.001880 |   2.430055 |   54.0 |  36.50 |stdout
| 0.000097 | 0.003204 |  33.127191 |  105.0 |  32.00 |_UnknownFile_
|====================================================================

Program invocation:  ./bt-mz_C.8

For a complete report with expanded tables and notes, run:
  pat_report /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2

For help identifying callers of particular functions:
  pat_report -O callers+src /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2
To see the entire call tree:
  pat_report -O calltree+src /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2

For interactive, graphical performance analysis, run:
  app2 /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2

================  End of CrayPat-lite output  ==========================
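
The per-node and power figures in the summary block of this report can be cross-checked with simple arithmetic. The sketch below assumes that the reported average power is the average energy divided by the average process time; the report does not state this, but the numbers are consistent with it. All variable and function names are illustrative, and the values are copied by hand from the report.

```python
# Values copied from the summary block of the Piz Daint report.
avg_process_time_s = 28.773
n_pes = 8
n_nodes = 4
mflops_aggregate = 21516
cpu_energy_j = 17635
acc_energy_j = 2383


def average_power(energy_j, time_s):
    """Average power in watts, assuming power = energy / elapsed time."""
    return energy_j / time_s


# Assumption: CrayPat-lite derives power from energy and process time.
cpu_power_w = average_power(cpu_energy_j, avg_process_time_s)
acc_power_w = average_power(acc_energy_j, avg_process_time_s)

print(f"CPU power: {cpu_power_w:.3f} W total, {cpu_power_w / n_nodes:.3f} W per node")
print(f"ACC power: {acc_power_w:.3f} W total, {acc_power_w / n_nodes:.3f} W per node")
print(f"CPU energy per node: {cpu_energy_j / n_nodes:.2f} J")
print(f"ACC energy per node: {acc_energy_j / n_nodes:.2f} J")
print(f"MFLOPS per PE: {mflops_aggregate / n_pes:.1f}")
```

The computed values agree with the reported ones (612.902 W, 82.821 W, 4409 J, 595.750 J, 2690 M/sec per PE) to within rounding of the printed inputs.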

MPI+OpenMP (Piz Dora)

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 6.2.2 Revision 13378 (xf 13240)  11/20/14 14:32:58
Experiment:                  lite  lite/sample_profile
Number of PEs (MPI ranks):      8
Numbers of PEs per Node:        6  PEs on  1  Node
                                2  PEs on  1  Node
Numbers of Threads per PE:      4
Number of Cores per Socket:    12
Execution start time:  Thu Jan 29 16:09:53 2015
System name and speed:  dora21 2601 MHz

Avg Process Time:        24.848 secs           
High Memory:               1123 MBytes 140.339 MBytes per PE
MFLOPS:           Not supported (see observation below)
I/O Read Rate:           62.434 MBytes/sec         
I/O Write Rate:           7.885 MBytes/sec         
Avg CPU Energy:            8116 joules    4058 joules per node
Avg CPU Power:          326.620 watts  163.310 watts per node

Table 1:  Profile by Function Group and Function (top 10 functions shown)

  Samp% |   Samp | Imb. |  Imb. |Group
        |        | Samp | Samp% | Function
        |        |      |       |  PE=HIDE
        |        |      |       |   Thread=HIDE

 100.0% | 2417.2 |   -- |    -- |Total
|----------------------------------------------------------
|  85.8% | 2074.1 |   -- |    -- |USER
||---------------------------------------------------------
||  19.6% |  474.5 | 22.5 |  5.2% |binvcrhs_
||  14.7% |  355.2 | 10.8 |  3.4% |z_solve_._omp_fn.0
||  12.9% |  311.2 | 34.8 | 11.5% |y_solve_._omp_fn.0
||  12.5% |  302.1 | 17.9 |  6.4% |x_solve_._omp_fn.0
||  10.2% |  247.0 | 31.0 | 12.7% |compute_rhs_._omp_fn.0
||   9.5% |  229.4 | 17.6 |  8.2% |matmul_sub_
||   3.1% |   76.0 | 15.0 | 18.8% |matvec_sub_
||=========================================================
|   7.2% |  174.1 |   -- |    -- |ETC
||---------------------------------------------------------
||   4.0% |   97.4 |  8.6 |  9.3% |GOMP_parallel
||   2.9% |   71.2 | 10.8 | 15.0% |gomp_team_barrier_wait_end
||=========================================================
|   5.2% |  125.5 | 48.5 | 31.9% |PTHREAD
||---------------------------------------------------------
||   5.2% |  125.5 | 48.5 | 31.9% |pthread_join
|==========================================================

===================  Observations and suggestions  ===================


MFLOPS not available on Intel Haswell:

    The document that specifies performance monitoring events for Intel
    processors does not include events that could be used to compute a
    count of floating point operations for Haswell processors: Intel 64
    and IA-32 Architectures Software Developer's Manual, Order Number
    253665-050US, February 2014.


Node utilization:

    The placement of PEs on nodes would be better balanced with 4 PEs on
    each of the 2 nodes. Use qsub -l mppnppn=4 and aprun -N 4.

=========================  End Observations  =========================

Table 2:  File Input Stats by Filename

     Read |     Read |  Read Rate | Reads | Bytes/ |File Name[max15]
     Time |   MBytes | MBytes/sec |       |   Call | PE=HIDE

 0.000802 | 0.050041 |  62.434119 | 804.0 |  65.26 |Total
|-------------------------------------------------------------------
| 0.000801 | 0.050034 |  62.490427 | 803.0 |  65.33 |/proc/self/maps
| 0.000001 | 0.000008 |   9.036455 |   1.0 |   8.00 |_UnknownFile_
|===================================================================

Table 3:  File Output Stats by Filename

    Write |    Write | Write Rate | Writes | Bytes/ |File Name[max15]
     Time |   MBytes | MBytes/sec |        |   Call | PE=HIDE

 0.000459 | 0.003619 |   7.884816 |  111.0 |  34.19 |Total
|--------------------------------------------------------------------
| 0.000415 | 0.001880 |   4.525584 |   54.0 |  36.50 |stdout
| 0.000044 | 0.001740 |  39.841886 |   57.0 |  32.00 |_UnknownFile_
|====================================================================

Program invocation:  ./bt-mz_C.8

For a complete report with expanded tables and notes, run:
  pat_report /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2

For help identifying callers of particular functions:
  pat_report -O callers+src /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2
To see the entire call tree:
  pat_report -O calltree+src /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2

For interactive, graphical performance analysis, run:
  app2 /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2

================  End of CrayPat-lite output  ==========================
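
The "Node utilization" observation in this report notes that the 8 PEs were placed as 6 + 2 over two nodes and suggests 4 + 4 instead. The sketch below only illustrates that arithmetic; the function names are hypothetical, and the figure of 24 cores per node assumes two sockets per node (the report only gives 12 cores per socket).

```python
def balanced_placement(n_pes, n_nodes):
    """Spread n_pes over n_nodes so that per-node counts differ by at most one."""
    base, extra = divmod(n_pes, n_nodes)
    return [base + 1 if node < extra else base for node in range(n_nodes)]


def threads_per_node(placement, threads_per_pe):
    """Number of OpenMP threads each node has to host for a given placement."""
    return [pes * threads_per_pe for pes in placement]


threads_per_pe = 4          # "Numbers of Threads per PE" in the report
cores_per_node = 2 * 12     # assumption: 2 sockets x 12 cores per socket

observed = [6, 2]                       # placement shown in the report
suggested = balanced_placement(8, 2)    # placement the observation asks for

for label, placement in (("observed", observed), ("suggested", suggested)):
    load = threads_per_node(placement, threads_per_pe)
    print(f"{label}: PEs per node {placement}, threads per node {load}, "
          f"cores per node {cores_per_node}")
```

With the observed placement one node hosts 24 threads and the other 8; with the suggested one each node hosts 16. The report's own hint uses qsub and aprun -N 4; since the job here is submitted with sbatch, the corresponding Slurm option would presumably be --ntasks-per-node=4, which is an assumption about the job script and is not shown in this issue.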

Ignore this (xpat/620) part

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 6.2.0.12614 Revision 12614 (xf 12504)  04/14/14 17:11:54
Experiment:                  lite  lite/sample_profile
Number of PEs (MPI ranks):     16
Numbers of PEs per Node:       16
Numbers of Threads per PE:      1
Number of Cores per Socket:    16
Execution start time:  Mon May 19 16:11:02 2014
System name and speed:  todi4 2100 MHz

Wall Clock Time:    3.833166 secs
High Memory:           14.61 MBytes
MFLOPS (aggregate):  5038.24 M/sec
I/O Read Rate:          0.75 MBytes/Sec
I/O Write Rate:        77.96 MBytes/Sec

Table 1:  Profile by Function Group and Function (top 10 functions shown)

  Samp% |  Samp |  Imb. |  Imb. |Group
        |       |  Samp | Samp% | Function
        |       |       |       |  PE=HIDE

 100.0% | 371.9 |    -- |    -- |Total
|-------------------------------------------------------
|  69.1% | 257.1 |    -- |    -- |MPI
||------------------------------------------------------
||  62.9% | 233.8 |  85.2 | 28.5% |mpi_waitall
||   6.1% |  22.8 |  21.2 | 51.5% |MPI_BARRIER
||======================================================
|  29.7% | 110.4 |    -- |    -- |USER
||------------------------------------------------------
||  10.9% |  40.4 | 103.6 | 76.8% |binvcrhs_
||   3.3% |  12.4 |  30.6 | 75.8% |matmul_sub_
||   3.3% |  12.4 |  27.6 | 73.5% |y_solve_.LOOP@li.52
||   3.3% |  12.2 |  24.8 | 71.4% |x_solve_.LOOP@li.54
||   2.7% |  10.2 |  28.8 | 78.8% |z_solve_.LOOP@li.52
||   1.4% |   5.1 |   9.9 | 70.2% |compute_rhs_.LOOP@li.38
||   1.3% |   5.0 |  12.0 | 75.3% |compute_rhs_.LOOP@li.80
||   1.1% |   4.1 |   9.9 | 75.7% |matvec_sub_
||======================================================
|   1.2% |   4.3 |    -- |    -- |ETC
|=======================================================

===================  Observations and suggestions  ===================


MPI utilization:

    No suggestions were made because all ranks are on one node.

=========================  End Observations  =========================

Table 2:  File Input Stats by Filename

     Read |     Read |  Read Rate | Reads | Bytes/ |File Name[max10]
     Time |   MBytes | MBytes/sec |       |   Call | PE=HIDE

 0.000191 | 0.000143 |   0.747926 |   3.0 |  50.00 |Total
|-------------------------------------------------------------------
| 0.000191 | 0.000143 |   0.747926 |   3.0 |  50.00 |inputbt-mz.data
|===================================================================

Table 3:  File Output Stats by Filename

    Write |    Write | Write Rate | Writes | Bytes/ |File Name[max10]
     Time |   MBytes | MBytes/sec |        |   Call | PE=HIDE

 0.000026 | 0.002050 |  77.957734 |   51.0 |  42.16 |Total
|--------------------------------------------------------------------
| 0.000026 | 0.002050 |  77.957734 |   51.0 |  42.16 |stdout
|====================================================================

Program invocation:  ./CRAY.TODI.A.4 

For a complete report with expanded tables and notes, run:
  pat_report /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2

For help identifying callers of particular functions:
  pat_report -O callers+src /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2
To see the entire call tree:
  pat_report -O calltree+src /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2

For interactive, graphical performance analysis, run:
  app2 /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2

================  End of CrayPat-lite output  ==========================
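
The "Imb. Samp%" column in the Table 1 blocks can be reproduced from the "Samp" and "Imb. Samp" columns. The sketch below assumes the usual CrayPat definition, imbalance% = 100 * (max - avg) / max * N / (N - 1), with "Samp" taken as the per-PE average, "Imb. Samp" as max minus average, and N as the number of PEs. This definition is an assumption that happens to match the printed rows to within rounding; the function name is illustrative.

```python
def imbalance_percent(avg_samples, imb_samples, n_pes):
    """Imbalance percentage from average samples and (max - avg) samples.

    Assumed formula: 100 * (max - avg) / max * N / (N - 1).
    """
    max_samples = avg_samples + imb_samples
    return 100.0 * imb_samples / max_samples * n_pes / (n_pes - 1)


# (function, Samp, Imb. Samp, reported Imb. Samp%, number of PEs)
rows = [
    ("mpi_waitall (16 PEs)", 233.8, 85.2, 28.5, 16),
    ("binvcrhs_   (16 PEs)", 40.4, 103.6, 76.8, 16),
    ("binvcrhs_   (8 PEs, first report)", 657.8, 44.2, 7.2, 8),
]

for name, samp, imb, reported, n_pes in rows:
    computed = imbalance_percent(samp, imb, n_pes)
    print(f"{name}: computed {computed:.2f}%, reported {reported}%")
```

Small differences in the last digit are expected, because the table prints the inputs rounded to one decimal place.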
