Loop work estimates (CCE compiler only)
Issue #44
new
Description
- Cray Perftools User Guide
- man /opt/cray/perftools/default/man/man1/reveal.1
- man /opt/cray/perftools/default/man/man1/intro_craypat.1
Get the src:
- git clone EuroHack15.git
- cd examples/qwiklab
Setup:
- module load perftools/6.2.3
- module list
Currently Loaded Modulefiles:
1) modules/3.2.10.3
2) nodestat/2.2-1.0502.53712.3.109.ari
3) sdb/1.0-1.0502.55976.5.27.ari
4) alps/5.2.1-2.0502.9041.11.6.ari
5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
6) udreg/2.3.2-1.0502.9275.1.12.ari
7) ugni/5.0-1.0502.9685.4.24.ari
8) gni-headers/3.0-1.0502.9684.5.2.ari
9) dmapp/7.0.1-1.0502.9501.5.219.ari
10) xpmem/0.1-2.0502.55507.3.2.ari
11) hss-llm/7.2.0
12) Base-opts/1.0.2-1.0502.53325.1.2.ari
13) craype-network-aries
14) craype/2.4.0
15) cce/8.3.12
16) totalview-support/1.1.4
17) totalview/8.11.0
18) cray-libsci/13.0.4
19) pmi/5.0.7-1.0000.10678.155.25.ari
20) rca/1.0.0-2.0502.53711.3.127.ari
21) atp/1.8.2
22) PrgEnv-cray/5.2.40
23) craype-sandybridge
24) slurm
25) cray-mpich/7.2.2
26) ddt/5.0
27) perftools/6.2.3
Compile:
- module load perftools/6.2.3
- ftn -O3 -hnoomp -h profile_generate task1.F90 -o CCE8312
- pat_build -w CCE8312 # => CCE8312+pat
Run & Profile:
- aprun -n1 ./CCE8312+pat
CrayPat/X: Version 6.2.3 Revision 13730 03/23/15 16:01:49
PGO data version: L.14.1:B.3.1
Jacobi relaxation Calculation: 1024 x 1024 mesh
0 0.250000
100 0.002397
200 0.001204
300 0.000804
400 0.000603
500 0.000483
600 0.000403
700 0.000345
800 0.000302
900 0.000269
total: 0.920057 s
Experiment data file written:
./EuroHack15.git/examples/qwiklab/CRAY/CCE8312+pat+9804-2t.xf
- Run without tool
PGI/15.x: total: 1.059725 s
CCE/8.3.x: total: 0.840052 s
Loop work estimates
- pat_report -T CCE8312+pat+9804-2t.xf > xfT
Table 2: Inclusive and Exclusive Time in Loops (from -hprofile_generate)

  Loop  |     Loop |     Time |    Loop |   Loop |  Loop |  Loop | Function=/.LOOP[.]
  Incl  |     Incl |    (Loop |     Hit |  Trips | Trips | Trips |
  Time% |     Time |    Adj.) |         |    Avg |   Min |   Max |
|--------------------------------------------------------------------------------------
| 99.7% | 0.919511 | 0.000226 |       1 | 1000.0 |  1000 |  1000 | jacobi1_.LOOP.1.li.41
| 64.5% | 0.595301 | 0.013419 |    1000 | 1022.0 |  1022 |  1022 | jacobi1_.LOOP.2.li.43
| 63.1% | 0.581882 | 0.581882 | 1022000 | 1022.0 |  1022 |  1022 | jacobi1_.LOOP.3.li.44
| 35.1% | 0.323985 | 0.009816 |    1000 | 1022.0 |  1022 |  1022 | jacobi1_.LOOP.4.li.51
| 34.1% | 0.314169 | 0.314169 | 1022000 | 1022.0 |  1022 |  1022 | jacobi1_.LOOP.5.li.52
|======================================================================================
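The "Time (Loop Adj.)" column can be cross-checked from the inclusive times. The sketch below assumes the column is the exclusive time, i.e. a loop's inclusive time minus the inclusive time of the loops nested directly inside it. The nesting in `children` is not stated in the report; it is inferred here from the line numbers and hit counts.

```python
# Inclusive loop times in seconds, copied from Table 2 of the pat_report output.
inclusive = {
    "LOOP.1.li.41": 0.919511,
    "LOOP.2.li.43": 0.595301,
    "LOOP.3.li.44": 0.581882,
    "LOOP.4.li.51": 0.323985,
    "LOOP.5.li.52": 0.314169,
}

# ASSUMPTION: loop nesting inferred from line numbers and hit counts,
# not reported explicitly by pat_report.
children = {
    "LOOP.1.li.41": ["LOOP.2.li.43", "LOOP.4.li.51"],
    "LOOP.2.li.43": ["LOOP.3.li.44"],
    "LOOP.3.li.44": [],
    "LOOP.4.li.51": ["LOOP.5.li.52"],
    "LOOP.5.li.52": [],
}


def exclusive_time(loop):
    """Inclusive time of a loop minus the inclusive time of its direct children."""
    return inclusive[loop] - sum(inclusive[child] for child in children[loop])


exclusive = {loop: round(exclusive_time(loop), 6) for loop in inclusive}

for loop, seconds in exclusive.items():
    print(f"{loop}: {seconds:.6f} s")
```

Under that assumption the computed values agree with the table to within rounding of the last digit, and the two innermost loops (lines 44 and 52) carry nearly all of the run time.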
Comments (7)
!$acc kernels
Timings
CCE: total: 5.064316 s ( speedup = 0.17x )
PGI: total: 5.098948 s ( speedup = 0.21x )
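As a quick check on these figures, the speedup can be recomputed as reference time divided by accelerated time. This sketch assumes the reference for each compiler is its own "Run without tool" total from the description.

```python
# Totals in seconds, copied from this issue.
baseline = {"CCE": 0.840052, "PGI": 1.059725}     # "Run without tool"
acc_kernels = {"CCE": 5.064316, "PGI": 5.098948}  # with !$acc kernels


def speedup(t_reference, t_new):
    """Speedup > 1 means the new run is faster than the reference."""
    return t_reference / t_new


for compiler in baseline:
    ratio = speedup(baseline[compiler], acc_kernels[compiler])
    print(f"{compiler}: speedup = {ratio:.2f}x")
```

Both ratios are well below 1, so the plain kernels version is several times slower than the CPU-only run.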
Compiler report
PGI
- module load craype-accel-nvidia35
- pgfortran -acc -Minfo task2.F90
jacobi_acc_kernels:
     13, Memory zero idiom, loop replaced by call to __c_mzero4
     15, Memory zero idiom, loop replaced by call to __c_mzero4
     23, Generating copyout(anew(2:1023,2:1023))
         Generating copyin(a(:,:))
         Generating Tesla code
     24, Loop is parallelizable
     25, Loop is parallelizable
         Accelerator kernel generated
         24, !$acc loop gang ! blockidx%y
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         28, Max reduction generated for error
     33, Generating copyin(anew(2:1023,2:1023))
         Generating copyout(a(2:1023,2:1023))
         Generating Tesla code
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc loop gang ! blockidx%y
         35, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         Memory copy idiom, loop replaced by call to __c_mcopy4
CCE
- ftn -rm -O3 -hacc task2.F90
ftn-6332 ftn: VECTOR File = task2.F90, Line = 12
  A loop starting at line 12 was not vectorized because it does not map well onto the target architecture.
ftn-6005 ftn: SCALAR File = task2.F90, Line = 12
  A loop starting at line 12 was unrolled 8 times.
ftn-6230 ftn: VECTOR File = task2.F90, Line = 13
  A loop starting at line 13 was replaced with multiple library calls.
ftn-6004 ftn: SCALAR File = task2.F90, Line = 14
  A loop starting at line 14 was fused with the loop starting at line 12.
ftn-6004 ftn: SCALAR File = task2.F90, Line = 15
  A loop starting at line 15 was fused with the loop starting at line 13.
ftn-3021 ftn: IPA File = task2.F90, Line = 19
  "_CPU_TIME_4" (called from "jacobi_acc_kernels") was not inlined because the compiler was unable to locate the routine.
ftn-6286 ftn: VECTOR File = task2.F90, Line = 21
  A loop starting at line 21 was not vectorized because it contains input/output operations at line 41.
ftn-6413 ftn: ACCEL File = task2.F90, Line = 23
  A data region was created at line 23 and ending at line 31.
ftn-6418 ftn: ACCEL File = task2.F90, Line = 23
  If not already present: allocate memory and copy whole array "a" to accelerator, free at line 31 (acc_copyin).
ftn-6416 ftn: ACCEL File = task2.F90, Line = 23
  If not already present: allocate memory and copy whole array "anew" to accelerator, copy back at line 31 (acc_copy).
ftn-6401 ftn: ACCEL File = task2.F90, Line = 24
  A loop starting at line 24 was placed on the accelerator.
ftn-6430 ftn: ACCEL File = task2.F90, Line = 24
  A loop starting at line 24 was partitioned across the thread blocks.
ftn-6415 ftn: ACCEL File = task2.F90, Line = 24
  Allocate memory and copy variable "error" to accelerator, copy back at line 30 (acc_copy).
ftn-6430 ftn: ACCEL File = task2.F90, Line = 25
  A loop starting at line 25 was partitioned across the 128 threads within a threadblock.
ftn-6413 ftn: ACCEL File = task2.F90, Line = 33
  A data region was created at line 33 and ending at line 39.
ftn-6418 ftn: ACCEL File = task2.F90, Line = 33
  If not already present: allocate memory and copy whole array "anew" to accelerator, free at line 39 (acc_copyin).
ftn-6416 ftn: ACCEL File = task2.F90, Line = 33
  If not already present: allocate memory and copy whole array "a" to accelerator, copy back at line 39 (acc_copy).
ftn-6401 ftn: ACCEL File = task2.F90, Line = 34
  A loop starting at line 34 was placed on the accelerator.
ftn-6430 ftn: ACCEL File = task2.F90, Line = 34
  A loop starting at line 34 was partitioned across the thread blocks.
ftn-6430 ftn: ACCEL File = task2.F90, Line = 35
  A loop starting at line 35 was partitioned across the 128 threads within a threadblock.
ftn-3021 ftn: IPA File = task2.F90, Line = 44
  "_CPU_TIME_4" (called from "jacobi_acc_kernels") was not inlined because the compiler was unable to locate the routine.
Data sloshing
PGI
- export PGI_ACC_TIME=1
- aprun -n1 exe
Accelerator Kernel Timing data
/scratch/santis/piccinal/EuroHack15.git/examples/qwiklab/PGI/task2.F90
  jacobi_acc_kernels  NVIDIA  devicenum=0
    time(us): 28,178
    23: data region reached 1000 times
        23: data copyin transfers: 1000
             device time(us): total=8,578 max=36 min=6 avg=8
        31: data copyout transfers: 1000
             device time(us): total=5,810 max=31 min=4 avg=5
    23: compute region reached 1000 times
        25: kernel launched 1000 times
            grid: [8x1022]  block: [128]
            elapsed time(us): total=133,026 max=163 min=131 avg=133
        25: reduction kernel launched 1000 times
            grid: [1]  block: [256]
            elapsed time(us): total=40,693 max=65 min=39 avg=40
    33: data region reached 1000 times
        33: data copyin transfers: 1000
             device time(us): total=7,789 max=36 min=5 avg=7
        39: data copyout transfers: 1000
             device time(us): total=6,001 max=31 min=5 avg=6
    33: compute region reached 1000 times
        35: kernel launched 1000 times
            grid: [8x1022]  block: [128]
            elapsed time(us): total=86,275 max=152 min=85 avg=86
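The PGI_ACC_TIME figures can be added up to see how much of the run they explain. This is a sketch that only sums the totals printed above; the dictionary labels are made up for readability.

```python
# Totals in microseconds, copied from the PGI_ACC_TIME output for task2.F90.
transfers_us = {
    "line 23 copyin": 8578,
    "line 31 copyout": 5810,
    "line 33 copyin": 7789,
    "line 39 copyout": 6001,
}
kernels_us = {
    "line 25 kernel": 133026,
    "line 25 reduction kernel": 40693,
    "line 35 kernel": 86275,
}
wall_clock_s = 5.098948  # PGI "total" reported for the !$acc kernels run

transfer_total_us = sum(transfers_us.values())
kernel_total_us = sum(kernels_us.values())
accounted_s = (transfer_total_us + kernel_total_us) / 1e6

print(f"transfers: {transfer_total_us} us")
print(f"kernels:   {kernel_total_us} us")
print(f"accounted: {accounted_s:.3f} s of {wall_clock_s:.3f} s")
```

The transfer sum reproduces the `time(us): 28,178` header line. Transfers and kernels together account for only about 0.29 s of the 5.1 s total; this output does not break down the remainder.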
CCE
- export CRAY_ACC_DEBUG=2
- aprun -n1 exe
Jacobi relaxation Calculation: 1024 x 1024 mesh
ACC: Initialize CUDA
ACC: Get Device 0
ACC: Create Context
ACC: Set Thread Context
ACC: Start transfer 2 items from task2.F90:23
ACC: allocate, copy to acc 'a' (4194304 bytes)
ACC: allocate, copy to acc 'anew' (4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: Start transfer 3 items from task2.F90:24
ACC: allocate reusable <internal> (4 bytes)
ACC: allocate reusable, copy to acc <internal> (4 bytes)
ACC: allocate reusable <internal> (4088 bytes)
ACC: End transfer (to acc 4 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_$ck_L24_3 blocks:1022 threads:128 async(auto) from task2.F90:24
ACC: Wait async(auto) from task2.F90:30
ACC: Start transfer 3 items from task2.F90:30
ACC: copy to host, done reusable <internal> (4 bytes)
ACC: done reusable <internal> (4 bytes)
ACC: done reusable <internal> (0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: Wait async(auto) from task2.F90:31
ACC: Start transfer 2 items from task2.F90:31
ACC: free 'a' (4194304 bytes)
ACC: copy to host, free 'anew' (4194304 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
ACC: Start transfer 2 items from task2.F90:33
ACC: allocate, copy to acc 'a' (4194304 bytes)
ACC: allocate, copy to acc 'anew' (4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_$ck_L34_5 blocks:1022 threads:128 async(auto) from task2.F90:34
ACC: Wait async(auto) from task2.F90:39
ACC: Start transfer 2 items from task2.F90:39
ACC: copy to host, free 'a' (4194304 bytes)
ACC: free 'anew' (4194304 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
ACC: Start transfer 2 items from task2.F90:23
0 0.250000
ACC: allocate, copy to acc 'a' (4194304 bytes)
ACC: allocate, copy to acc 'anew' (4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: Start transfer 3 items from task2.F90:24
ACC: reusable acquired <internal> (4 bytes)
ACC: reusable acquired <internal> (4 bytes)
ACC: reusable acquired <internal> (4088 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_$ck_L24_3 blocks:1022 threads:128 async(auto) from task2.F90:24
etc...
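To put a number on the data sloshing, the array traffic in the CRAY_ACC_DEBUG log can be totalled over the whole run. This sketch assumes 4-byte reals on a 1024 x 1024 mesh (which matches the 4194304-byte allocations in the log) and 1000 outer iterations (the trip count in Table 2). It ignores the small `<internal>` transfers.

```python
# Sizes taken from the log: each array is 1024 x 1024 elements of 4 bytes.
n = 1024
bytes_per_element = 4            # ASSUMPTION: default 4-byte reals
array_bytes = n * n * bytes_per_element

# Per data region, per the log: both arrays go to the accelerator,
# one array comes back to the host.
to_acc_per_region = 2 * array_bytes
to_host_per_region = 1 * array_bytes

regions_per_iteration = 2        # data regions at task2.F90 lines 23 and 33
iterations = 1000                # ASSUMPTION: outer-loop trip count from Table 2

bytes_per_iteration = regions_per_iteration * (to_acc_per_region + to_host_per_region)
total_bytes = iterations * bytes_per_iteration

print(f"array size:         {array_bytes} bytes")
print(f"moved per iteration: {bytes_per_iteration} bytes")
print(f"moved in total:      {total_bytes / 1e9:.1f} GB")
```

Under these assumptions roughly 25 GB cross the host-device link for arrays that total 8 MB, which is the motivation for hoisting the transfers out of the iteration loop.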
!$acc data copy
Clauses available for use with the data directive:
- copy( list ) - Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
- copyin( list ) - Allocates memory on the GPU and copies data from host to GPU when entering the region.
- copyout( list ) - Allocates memory on the GPU and copies data to the host when exiting the region.
- create( list ) - Allocates memory on the GPU but does not copy.
- present( list ) - Data is already present on the GPU from another containing data region.
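The clause semantics listed above can be modelled as a small lookup table, which makes it easy to estimate the traffic a data region causes. This is an illustrative model only, not an OpenACC API. The example region, with one array under `copy` and one under `create`, is a hypothetical choice, and the 4194304-byte size assumes a 1024 x 1024 array of 4-byte reals.

```python
# For each data clause: (copies host->GPU on entry, copies GPU->host on exit).
CLAUSES = {
    "copy":    (True,  True),
    "copyin":  (True,  False),
    "copyout": (False, True),
    "create":  (False, False),
    "present": (False, False),  # data already on the GPU, nothing moves
}


def transfer_bytes(region):
    """Return (bytes to GPU on entry, bytes to host on exit) for a data region.

    `region` is a list of (clause, size_in_bytes) pairs, one per array.
    """
    to_gpu = sum(size for clause, size in region if CLAUSES[clause][0])
    to_host = sum(size for clause, size in region if CLAUSES[clause][1])
    return to_gpu, to_host


# HYPOTHETICAL region: one array under copy, a scratch array under create.
array_bytes = 1024 * 1024 * 4
example_region = [("copy", array_bytes), ("create", array_bytes)]

to_gpu, to_host = transfer_bytes(example_region)
print(f"entry: {to_gpu} bytes to GPU, exit: {to_host} bytes to host")
```

With a single enclosing data region, this traffic is paid once per run rather than once per iteration.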
Timings
- CCE: ftn -hacc -O3 task3.F90 ; unset CRAY_ACC_DEBUG ; aprun -n1 a.out
- PGI: pgfortran -acc -O3 task3.F90 ; unset PGI_ACC_TIME ; aprun -n1 a.out
CCE: total: 0.168010 s ( speedup = 5x )
PGI: total: 0.391570 s ( speedup = 2.7x )
Compiler reports
PGI
- module load craype-accel-nvidia35; pgfortran -acc -Minfo -O3 task3.F90
     26, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang ! blockidx%y
         26, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         29, Max reduction generated for error
     36, Loop is parallelizable
         Accelerator kernel generated
         35, !$acc loop gang ! blockidx%y
         36, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         Memory copy idiom, loop replaced by call to __c_mcopy4
- export PGI_ACC_TIME=1; aprun -n1 a.out
Accelerator Kernel Timing data
/scratch/santis/piccinal/EuroHack15.git/examples/qwiklab/PGI/task3.F90
  jacobi_acc_kernels_datacopy  NVIDIA  devicenum=0
    time(us): 33
    21: data region reached 1 time
        21: data copyin transfers: 1
             device time(us): total=21 max=21 min=21 avg=21
        45: data copyout transfers: 1
             device time(us): total=12 max=12 min=12 avg=12
    24: compute region reached 1000 times
        26: kernel launched 1000 times
            grid: [8x1022]  block: [128]
              <--- 1022 blocks * 8 blocks * 128 threads/block = 1046528 threads
              <--- 1024x1024 grid = 1048576 cells
              <--- grid = gangs, block = vector size
            elapsed time(us): total=132,077 max=162 min=130 avg=132
              <--- 132077 usec = 0.132 seconds
        26: reduction kernel launched 1000 times
            grid: [1]  block: [256]
            elapsed time(us): total=40,107 max=91 min=39 avg=40
    34: compute region reached 1000 times
        36: kernel launched 1000 times
            grid: [8x1022]  block: [128]
            elapsed time(us): total=84,747 max=108 min=83 avg=84
              <--- 84747 usec = 0.084 seconds
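The launch geometry in this output can be reproduced with a few lines of arithmetic. The sketch assumes the kernels cover the 1022 x 1022 interior points (consistent with the 1022 loop trips in Table 2) and that the x-dimension of the grid is the inner trip count divided by the vector length, rounded up. That is a guess at how the compiler arrived at `[8x1022]`, not something the output states.

```python
import math

interior = 1022        # ASSUMPTION: loop trips per dimension, from Table 2
vector_length = 128    # block: [128] in the PGI output

# ASSUMPTION: blocks in x = ceil(inner trips / vector length).
blocks_x = math.ceil(interior / vector_length)
blocks_y = interior

threads_launched = blocks_x * blocks_y * vector_length
interior_cells = interior * interior
idle_threads = threads_launched - interior_cells

# Kernel totals in microseconds, copied from the PGI_ACC_TIME output for task3.F90.
kernel_total_us = 132077 + 40107 + 84747

print(f"grid: [{blocks_x}x{blocks_y}], threads launched: {threads_launched}")
print(f"interior cells: {interior_cells}, idle threads: {idle_threads}")
print(f"kernel time: {kernel_total_us / 1e6:.3f} s")
```

The computed grid matches the reported `[8x1022]`, and the three kernels together take about 0.26 s of the 0.39 s PGI total for this task.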
CCE
- ftn -rm -O3 -hacc task3.F90
$ grep region task3.lst
A data region was created at line 21 and ending at line 45.
A data region was created at line 24 and ending at line 32.
A data region was created at line 34 and ending at line 40.
$ grep region task2.lst
A data region was created at line 23 and ending at line 31.
A data region was created at line 33 and ending at line 39.
- export CRAY_ACC_DEBUG=2 ; aprun -n1 a.out
Jacobi relaxation Calculation: 1024 x 1024 mesh
ACC: Initialize CUDA
ACC: Get Device 0
ACC: Create Context
ACC: Set Thread Context
ACC: Start transfer 2 items from task3.F90:21
ACC: allocate, copy to acc 'a' (4194304 bytes)
ACC: allocate 'anew' (4194304 bytes)
ACC: End transfer (to acc 4194304 bytes, to host 0 bytes)
ACC: Start transfer 3 items from task3.F90:25
ACC: allocate reusable <internal> (4 bytes)
ACC: allocate reusable, copy to acc <internal> (4 bytes)
ACC: allocate reusable <internal> (4088 bytes)
ACC: End transfer (to acc 4 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L25_3 blocks:1022 threads:128 async(auto) from task3.F90:25
ACC: Wait async(auto) from task3.F90:31
ACC: Start transfer 3 items from task3.F90:31
ACC: copy to host, done reusable <internal> (4 bytes)
ACC: done reusable <internal> (4 bytes)
ACC: done reusable <internal> (0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L35_5 blocks:1022 threads:128 async(auto) from task3.F90:35
ACC: Start transfer 3 items from task3.F90:25
ACC: reusable acquired <internal> (4 bytes)
ACC: reusable acquired <internal> (4 bytes)
ACC: reusable acquired <internal> (4088 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
0 0.250000
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L25_3 blocks:1022 threads:128 async(auto) from task3.F90:25
etc...
ACC: Start transfer 3 items from task3.F90:31
ACC: copy to host, done reusable <internal> (4 bytes)
ACC: done reusable <internal> (4 bytes)
ACC: done reusable <internal> (0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L35_5 blocks:1022 threads:128 async(auto) from task3.F90:35
ACC: Wait async(auto) from task3.F90:45
ACC: Start transfer 2 items from task3.F90:45
ACC: copy to host, free 'a' (4194304 bytes)
ACC: free 'anew' (4194304 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)