PROPOSALS Craypat-lite: MPI/OpenMP
Issue #1
new
MPI+OpenMP (Piz Daint)
Get the source
- ssh daint
- cd $SCRATCH
- git clone https://github.com/eth-cscs/proposals.git proposals.git
Cloning into 'proposals.git'...
remote: Counting objects: 339, done.
remote: Total 339 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (339/339), 300.16 KiB | 234 KiB/s, done.
Resolving deltas: 100% (139/139), done.
- cd proposals.git/vihps/NPB3.3-MZ-MPI/BT-MZ/
Setup
- module swap PrgEnv-cray PrgEnv-gnu
- module use /project/csstaff/proposals
- module load perflite/622
- echo CRAYPAT_LITE=$CRAYPAT_LITE
CRAYPAT_LITE=sample_profile
- module list
Currently Loaded Modulefiles:
1) modules/3.2.10.2
2) nodestat/2.2-1.0502.53712.3.109.ari
3) sdb/1.0-1.0502.55976.5.27.ari
4) alps/5.2.1-2.0502.9041.11.6.ari
5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
6) udreg/2.3.2-1.0502.9275.1.12.ari
7) ugni/5.0-1.0502.9685.4.24.ari
8) gni-headers/3.0-1.0502.9684.5.2.ari
9) dmapp/7.0.1-1.0502.9501.5.219.ari
10) xpmem/0.1-2.0502.55507.3.2.ari
11) hss-llm/7.2.0
12) Base-opts/1.0.2-1.0502.53325.1.2.ari
13) craype-network-aries
14) craype/2.2.1
15) craype-sandybridge
16) slurm
17) cray-mpich/7.1.1
18) ddt/4.3rc7
19) gcc/4.8.2
20) totalview-support/1.1.4
21) totalview/8.11.0
22) cray-libsci/13.0.1
23) pmi/5.0.6-1.0000.10439.140.2.ari
24) atp/1.7.5
25) PrgEnv-gnu/5.2.40
26) rca/1.0.0-2.0502.53711.3.127.ari
27) perflite/622(default)
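The instrumentation at link time depends on the perflite module being loaded, so a quick guard before compiling can catch a missing module. This is a minimal sketch, assuming the module sets CRAYPAT_LITE as in the echo output above; the function name require_craypat_lite is hypothetical and not part of CrayPat.

```shell
# Hypothetical guard: succeed only when CRAYPAT_LITE is set and non-empty.
require_craypat_lite() {
    if [ -z "${CRAYPAT_LITE:-}" ]; then
        echo "CRAYPAT_LITE is not set: load the perflite module first" >&2
        return 1
    fi
    # Report the experiment that will be used at link time.
    echo "CRAYPAT_LITE=$CRAYPAT_LITE"
}
```

It could be chained in front of the build, for example `require_craypat_lite && make bt-mz CLASS=C NPROCS=8 MAIN=bt`.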
Compile
- make clean
- make bt-mz CLASS=C NPROCS=8 MAIN=bt
ftn -c -O3 -fopenmp -ffixed-line-length-none bt.f
ftn -c -O3 -fopenmp -ffixed-line-length-none initialize.f
ftn -c -O3 -fopenmp -ffixed-line-length-none exact_solution.f
ftn -c -O3 -fopenmp -ffixed-line-length-none exact_rhs.f
ftn -c -O3 -fopenmp -ffixed-line-length-none set_constants.f
ftn -c -O3 -fopenmp -ffixed-line-length-none adi.f
ftn -c -O3 -fopenmp -ffixed-line-length-none rhs.f
ftn -c -O3 -fopenmp -ffixed-line-length-none zone_setup.f
ftn -c -O3 -fopenmp -ffixed-line-length-none x_solve.f
ftn -c -O3 -fopenmp -ffixed-line-length-none y_solve.f
ftn -c -O3 -fopenmp -ffixed-line-length-none exch_qbc.f
ftn -c -O3 -fopenmp -ffixed-line-length-none solve_subs.f
ftn -c -O3 -fopenmp -ffixed-line-length-none z_solve.f
ftn -c -O3 -fopenmp -ffixed-line-length-none add.f
ftn -c -O3 -fopenmp -ffixed-line-length-none error.f
ftn -c -O3 -fopenmp -ffixed-line-length-none verify.f
ftn -c -O3 -fopenmp -ffixed-line-length-none mpi_setup.f
ftn -O3 -fopenmp -ffixed-line-length-none -o ../bin/bt-mz_C.8 *.o
INFO: A maximum of 44 functions from group 'io' will be traced.
INFO: A maximum of 107 functions from group 'mpi' will be traced.
INFO: A maximum of 32 functions from group 'omp' will be traced.
INFO: A maximum of 23 functions from group 'realtime' will be traced.
INFO: A maximum of 52 functions from group 'syscall' will be traced.
INFO: creating the CrayPat-instrumented executable
'../bin/bt-mz_C.8' (sample_profile) ...OK
Built executable ../bin/bt-mz_C.8
Run
- cd ../bin/
- sbatch ../jobscript/daint/run.sbatch
- while read a b c; do echo $a $b $c | sh | awk '{print "sbatch.sh daint 5 bt-mz_C."$1" "$1,$2,$3" \"\" \"\" \"\" -Ausup"}'; done < in
Submitted batch job 2380
Reports
- cat slurm-*.out
#################################################################
# #
# CrayPat-lite Performance Statistics #
# #
#################################################################
CrayPat/X: Version 6.2.2 Revision 13378 (xf 13240) 11/20/14 14:32:58
Experiment: lite lite/sample_profile
Number of PEs (MPI ranks): 8
Numbers of PEs per Node: 2 PEs on each of 4 Nodes
Numbers of Threads per PE: 4
Number of Cores per Socket: 8
Execution start time: Wed Jan 28 15:35:29 2015
System name and speed: santis02 2601 MHz
Avg Process Time: 28.773 secs
High Memory: 1401 MBytes 175.176 MBytes per PE
MFLOPS (aggregate): 21516 M/sec 2690 M/sec per PE
I/O Read Rate: 52.164 MBytes/sec
I/O Write Rate: 5.842 MBytes/sec
Avg CPU Energy: 17635 joules 4409 joules per node
Avg CPU Power: 612.902 watts 153.225 watts per node
Avg ACC Energy: 2383 joules 595.750 joules per node
Avg ACC Power: 82.821 watts 20.705 watts per node
Table 1: Profile by Function Group and Function (top 10 functions shown)
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | PE=HIDE
| | | | Thread=HIDE
100.0% | 2837.4 | -- | -- |Total
|----------------------------------------------------------
| 91.0% | 2580.9 | -- | -- |USER
||---------------------------------------------------------
|| 23.2% | 657.8 | 44.2 | 7.2% |binvcrhs_
|| 13.5% | 384.1 | 24.9 | 7.0% |z_solve_._omp_fn.0
|| 12.4% | 352.2 | 49.8 | 14.1% |y_solve_._omp_fn.0
|| 12.4% | 350.5 | 15.5 | 4.8% |x_solve_._omp_fn.0
|| 11.9% | 338.2 | 42.8 | 12.8% |matmul_sub_
|| 11.3% | 321.0 | 20.0 | 6.7% |compute_rhs_._omp_fn.0
|| 3.7% | 103.8 | 30.2 | 25.8% |matvec_sub_
||=========================================================
| 4.4% | 126.0 | -- | -- |ETC
||---------------------------------------------------------
|| 2.1% | 59.6 | 7.4 | 12.6% |gomp_team_barrier_wait_end
|| 2.1% | 59.4 | 16.6 | 25.0% |gomp_barrier_wait_end
||=========================================================
| 2.3% | 65.0 | -- | -- |MPI
| 2.3% | 64.1 | 32.9 | 38.7% |PTHREAD
||---------------------------------------------------------
|| 2.3% | 64.1 | 32.9 | 38.7% |pthread_join
|==========================================================
Table 2: File Input Stats by Filename
Read | Read | Read Rate | Reads | Bytes/ |File Name[max15]
Time | MBytes | MBytes/sec | | Call | PE=HIDE
0.000962 | 0.050205 | 52.163873 | 829.0 | 63.50 |Total
|-------------------------------------------------------------------
| 0.000957 | 0.050190 | 52.438244 | 827.0 | 63.64 |/proc/self/maps
| 0.000005 | 0.000015 | 2.864533 | 2.0 | 8.00 |_UnknownFile_
|===================================================================
Table 3: File Output Stats by Filename
Write | Write | Write Rate | Writes | Bytes/ |File Name[max15]
Time | MBytes | MBytes/sec | | Call | PE=HIDE
0.000870 | 0.005084 | 5.842064 | 159.0 | 33.53 |Total
|--------------------------------------------------------------------
| 0.000774 | 0.001880 | 2.430055 | 54.0 | 36.50 |stdout
| 0.000097 | 0.003204 | 33.127191 | 105.0 | 32.00 |_UnknownFile_
|====================================================================
Program invocation: ./bt-mz_C.8
For a complete report with expanded tables and notes, run:
pat_report /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2
For help identifying callers of particular functions:
pat_report -O callers+src /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2
To see the entire call tree:
pat_report -O calltree+src /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2
For interactive, graphical performance analysis, run:
app2 /scratch/santis/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+22453-14s.ap2
================ End of CrayPat-lite output ==========================
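To compare runs, the function rows of Table 1 can be pulled out of the job output with awk. This is a minimal sketch, assuming the rows keep the `|| Samp% | Samp | Imb. Samp | Imb. Samp% |Function` layout shown above; the function name top_functions is hypothetical and not part of CrayPat.

```shell
# Hypothetical helper: print "Samp% function" for every function row
# (rows that start with "||" followed by a percentage) in Table 1.
# Reads the files given as arguments, or stdin when none are given.
top_functions() {
    awk -F'|' '/^[|][|] *[0-9.]+%/ {
        gsub(/ /, "", $3)   # strip padding around the Samp% column
        print $3, $NF       # last field is the function name
    }' "$@"
}
```

For example, `top_functions slurm-*.out` would list one `Samp% function` pair per line for every report found in the job output files, including the rows under the ETC and PTHREAD groups.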
MPI+OpenMP (Piz Dora)
#################################################################
# #
# CrayPat-lite Performance Statistics #
# #
#################################################################
CrayPat/X: Version 6.2.2 Revision 13378 (xf 13240) 11/20/14 14:32:58
Experiment: lite lite/sample_profile
Number of PEs (MPI ranks): 8
Numbers of PEs per Node: 6 PEs on 1 Node
2 PEs on 1 Node
Numbers of Threads per PE: 4
Number of Cores per Socket: 12
Execution start time: Thu Jan 29 16:09:53 2015
System name and speed: dora21 2601 MHz
Avg Process Time: 24.848 secs
High Memory: 1123 MBytes 140.339 MBytes per PE
MFLOPS: Not supported (see observation below)
I/O Read Rate: 62.434 MBytes/sec
I/O Write Rate: 7.885 MBytes/sec
Avg CPU Energy: 8116 joules 4058 joules per node
Avg CPU Power: 326.620 watts 163.310 watts per node
Table 1: Profile by Function Group and Function (top 10 functions shown)
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | PE=HIDE
| | | | Thread=HIDE
100.0% | 2417.2 | -- | -- |Total
|----------------------------------------------------------
| 85.8% | 2074.1 | -- | -- |USER
||---------------------------------------------------------
|| 19.6% | 474.5 | 22.5 | 5.2% |binvcrhs_
|| 14.7% | 355.2 | 10.8 | 3.4% |z_solve_._omp_fn.0
|| 12.9% | 311.2 | 34.8 | 11.5% |y_solve_._omp_fn.0
|| 12.5% | 302.1 | 17.9 | 6.4% |x_solve_._omp_fn.0
|| 10.2% | 247.0 | 31.0 | 12.7% |compute_rhs_._omp_fn.0
|| 9.5% | 229.4 | 17.6 | 8.2% |matmul_sub_
|| 3.1% | 76.0 | 15.0 | 18.8% |matvec_sub_
||=========================================================
| 7.2% | 174.1 | -- | -- |ETC
||---------------------------------------------------------
|| 4.0% | 97.4 | 8.6 | 9.3% |GOMP_parallel
|| 2.9% | 71.2 | 10.8 | 15.0% |gomp_team_barrier_wait_end
||=========================================================
| 5.2% | 125.5 | 48.5 | 31.9% |PTHREAD
||---------------------------------------------------------
|| 5.2% | 125.5 | 48.5 | 31.9% |pthread_join
|==========================================================
=================== Observations and suggestions ===================
MFLOPS not available on Intel Haswell:
The document that specifies performance monitoring events for Intel
processors does not include events that could be used to compute a
count of floating point operations for Haswell processors: Intel 64
and IA-32 Architectures Software Developer's Manual, Order Number
253665-050US, February 2014.
Node utilization:
The placement of PEs on nodes would be better balanced with 4 PEs on
each of the 2 nodes. Use qsub -l mppnppn=4 and aprun -N 4.
========================= End Observations =========================
Table 2: File Input Stats by Filename
Read | Read | Read Rate | Reads | Bytes/ |File Name[max15]
Time | MBytes | MBytes/sec | | Call | PE=HIDE
0.000802 | 0.050041 | 62.434119 | 804.0 | 65.26 |Total
|-------------------------------------------------------------------
| 0.000801 | 0.050034 | 62.490427 | 803.0 | 65.33 |/proc/self/maps
| 0.000001 | 0.000008 | 9.036455 | 1.0 | 8.00 |_UnknownFile_
|===================================================================
Table 3: File Output Stats by Filename
Write | Write | Write Rate | Writes | Bytes/ |File Name[max15]
Time | MBytes | MBytes/sec | | Call | PE=HIDE
0.000459 | 0.003619 | 7.884816 | 111.0 | 34.19 |Total
|--------------------------------------------------------------------
| 0.000415 | 0.001880 | 4.525584 | 54.0 | 36.50 |stdout
| 0.000044 | 0.001740 | 39.841886 | 57.0 | 32.00 |_UnknownFile_
|====================================================================
Program invocation: ./bt-mz_C.8
For a complete report with expanded tables and notes, run:
pat_report /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2
For help identifying callers of particular functions:
pat_report -O callers+src /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2
To see the entire call tree:
pat_report -O calltree+src /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2
For interactive, graphical performance analysis, run:
app2 /scratch/daint/piccinal/proposals.git/vihps/NPB3.3-MZ-MPI/bin/bt-mz_C.8+3634-1091s.ap2
================ End of CrayPat-lite output ==========================
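The node utilization observation above suggests 4 PEs on each of the 2 nodes instead of the 6 + 2 split. A small helper can derive that -N value from the rank and node counts. This is a minimal sketch, assuming the job is launched with aprun (as in the CrayPat suggestion) and that -d is set to the number of OpenMP threads per PE; the function name balanced_aprun is hypothetical.

```shell
# Hypothetical helper: print an aprun command line that spreads the ranks
# evenly over the nodes (rounding up when the division is not exact).
# Arguments: total ranks, nodes, OpenMP threads per rank, executable.
balanced_aprun() {
    ba_ranks=$1; ba_nodes=$2; ba_threads=$3; ba_exe=$4
    ba_per_node=$(( (ba_ranks + ba_nodes - 1) / ba_nodes ))
    echo "aprun -n $ba_ranks -N $ba_per_node -d $ba_threads $ba_exe"
}

# 8 ranks on 2 nodes with 4 threads each, as in the Dora run above
balanced_aprun 8 2 4 ./bt-mz_C.8
# prints: aprun -n 8 -N 4 -d 4 ./bt-mz_C.8
```

OMP_NUM_THREADS would still need to be exported in the job script to match the -d value.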
Ignore this part (older xpat/620 run on todi)
#################################################################
# #
# CrayPat-lite Performance Statistics #
# #
#################################################################
CrayPat/X: Version 6.2.0.12614 Revision 12614 (xf 12504) 04/14/14 17:11:54
Experiment: lite lite/sample_profile
Number of PEs (MPI ranks): 16
Numbers of PEs per Node: 16
Numbers of Threads per PE: 1
Number of Cores per Socket: 16
Execution start time: Mon May 19 16:11:02 2014
System name and speed: todi4 2100 MHz
Wall Clock Time: 3.833166 secs
High Memory: 14.61 MBytes
MFLOPS (aggregate): 5038.24 M/sec
I/O Read Rate: 0.75 MBytes/Sec
I/O Write Rate: 77.96 MBytes/Sec
Table 1: Profile by Function Group and Function (top 10 functions shown)
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | PE=HIDE
100.0% | 371.9 | -- | -- |Total
|-------------------------------------------------------
| 69.1% | 257.1 | -- | -- |MPI
||------------------------------------------------------
|| 62.9% | 233.8 | 85.2 | 28.5% |mpi_waitall
|| 6.1% | 22.8 | 21.2 | 51.5% |MPI_BARRIER
||======================================================
| 29.7% | 110.4 | -- | -- |USER
||------------------------------------------------------
|| 10.9% | 40.4 | 103.6 | 76.8% |binvcrhs_
|| 3.3% | 12.4 | 30.6 | 75.8% |matmul_sub_
|| 3.3% | 12.4 | 27.6 | 73.5% |y_solve_.LOOP@li.52
|| 3.3% | 12.2 | 24.8 | 71.4% |x_solve_.LOOP@li.54
|| 2.7% | 10.2 | 28.8 | 78.8% |z_solve_.LOOP@li.52
|| 1.4% | 5.1 | 9.9 | 70.2% |compute_rhs_.LOOP@li.38
|| 1.3% | 5.0 | 12.0 | 75.3% |compute_rhs_.LOOP@li.80
|| 1.1% | 4.1 | 9.9 | 75.7% |matvec_sub_
||======================================================
| 1.2% | 4.3 | -- | -- |ETC
|=======================================================
=================== Observations and suggestions ===================
MPI utilization:
No suggestions were made because all ranks are on one node.
========================= End Observations =========================
Table 2: File Input Stats by Filename
Read | Read | Read Rate | Reads | Bytes/ |File Name[max10]
Time | MBytes | MBytes/sec | | Call | PE=HIDE
0.000191 | 0.000143 | 0.747926 | 3.0 | 50.00 |Total
|-------------------------------------------------------------------
| 0.000191 | 0.000143 | 0.747926 | 3.0 | 50.00 |inputbt-mz.data
|===================================================================
Table 3: File Output Stats by Filename
Write | Write | Write Rate | Writes | Bytes/ |File Name[max10]
Time | MBytes | MBytes/sec | | Call | PE=HIDE
0.000026 | 0.002050 | 77.957734 | 51.0 | 42.16 |Total
|--------------------------------------------------------------------
| 0.000026 | 0.002050 | 77.957734 | 51.0 | 42.16 |stdout
|====================================================================
Program invocation: ./CRAY.TODI.A.4
For a complete report with expanded tables and notes, run:
pat_report /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2
For help identifying callers of particular functions:
pat_report -O callers+src /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2
To see the entire call tree:
pat_report -O calltree+src /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2
For interactive, graphical performance analysis, run:
app2 /users/piccinal/pug.git/src/npbmz.git/bin/CRAY.TODI.A.4+27797-2s.ap2
================ End of CrayPat-lite output ==========================