Issue #44 new

jg piccinali repo owner created an issue 2015-07-02

Description

Cray Perftools User Guide
man /opt/cray/perftools/default/man/man1/reveal.1
man /opt/cray/perftools/default/man/man1/intro_craypat.1

Get the src:

git clone EuroHack15.git
cd examples/qwiklab

Setup:

module load perftools/6.2.3
module list

Currently Loaded Modulefiles:
  1) modules/3.2.10.3
  2) nodestat/2.2-1.0502.53712.3.109.ari
  3) sdb/1.0-1.0502.55976.5.27.ari
  4) alps/5.2.1-2.0502.9041.11.6.ari
  5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
  6) udreg/2.3.2-1.0502.9275.1.12.ari
  7) ugni/5.0-1.0502.9685.4.24.ari
  8) gni-headers/3.0-1.0502.9684.5.2.ari
  9) dmapp/7.0.1-1.0502.9501.5.219.ari
 10) xpmem/0.1-2.0502.55507.3.2.ari
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.53325.1.2.ari
 13) craype-network-aries
 14) craype/2.4.0
 15) cce/8.3.12
 16) totalview-support/1.1.4
 17) totalview/8.11.0
 18) cray-libsci/13.0.4
 19) pmi/5.0.7-1.0000.10678.155.25.ari
 20) rca/1.0.0-2.0502.53711.3.127.ari
 21) atp/1.8.2
 22) PrgEnv-cray/5.2.40
 23) craype-sandybridge
 24) slurm
 25) cray-mpich/7.2.2
 26) ddt/5.0
 27) perftools/6.2.3

Compile:

module load perftools/6.2.3
ftn -O3 -hnoomp -h profile_generate task1.F90 -o CCE8312
pat_build -w CCE8312 # => CCE8312+pat

Run & Profile:

aprun -n1 ./CCE8312+pat

CrayPat/X:  Version 6.2.3 Revision 13730  03/23/15 16:01:49
PGO data version:  L.14.1:B.3.1
Jacobi relaxation Calculation: 1024 x 1024 mesh
     0   0.250000
   100   0.002397
   200   0.001204
   300   0.000804
   400   0.000603
   500   0.000483
   600   0.000403
   700   0.000345
   800   0.000302
   900   0.000269
total:  0.920057 s
Experiment data file written:
./EuroHack15.git/examples/qwiklab/CRAY/CCE8312+pat+9804-2t.xf

Run without tool

PGI/15.x: total:  1.059725 s
CCE/8.3.x: total:  0.840052 s

Loop work estimates

pat_report -T CCE8312+pat+9804-2t.xf > xfT

Table 2:  Inclusive and Exclusive Time in Loops (from -hprofile_generate)
  Loop |     Loop |     Time |    Loop |   Loop |  Loop |  Loop |Function=/.LOOP[.]
  Incl |     Incl |    (Loop |     Hit |  Trips | Trips | Trips |
 Time% |     Time |    Adj.) |         |    Avg |   Min |   Max |
|-----------------------------------------------------------------------------
| 99.7% | 0.919511 | 0.000226 |       1 | 1000.0 |  1000 |  1000 |jacobi1_.LOOP.1.li.41
| 64.5% | 0.595301 | 0.013419 |    1000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.2.li.43
| 63.1% | 0.581882 | 0.581882 | 1022000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.3.li.44
| 35.1% | 0.323985 | 0.009816 |    1000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.4.li.51
| 34.1% | 0.314169 | 0.314169 | 1022000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.5.li.52
|===========================================
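
As a reading aid for Table 2, the short sketch below adds up the time of the two innermost loops (LOOP.3 and LOOP.5) and compares it with the inclusive time of the outer iteration loop (LOOP.1). The numbers are copied from the table; the variable names are illustrative only.

```python
# Loop times in seconds, copied from Table 2 above.
outer_inclusive = 0.919511      # jacobi1_.LOOP.1.li.41 (inclusive)
innermost = {
    "jacobi1_.LOOP.3.li.44": 0.581882,
    "jacobi1_.LOOP.5.li.52": 0.314169,
}

# Sum the innermost loops and express them as a fraction of the outer loop.
inner_total = sum(innermost.values())
share = inner_total / outer_inclusive

print(f"innermost loops: {inner_total:.6f} s ({share:.1%} of the outer loop)")
```

With these values the two innermost loops account for roughly 97% of the outer loop, which is why they are the candidates for offloading.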

Comments (7)

jg piccinali reporter
- edited description
- 2015-07-02T13:41:25+00:00

jg piccinali reporter

!$acc kernels

Timings

CCE: total:  5.064316 s  ( speedup = 0.17x )
PGI: total:  5.098948 s  ( speedup = 0.21x )

Compiler report

PGI

module load craype-accel-nvidia35
pgfortran -acc -Minfo task2.F90

jacobi_acc_kernels:
     13, Memory zero idiom, loop replaced by call to __c_mzero4
     15, Memory zero idiom, loop replaced by call to __c_mzero4
     23, Generating copyout(anew(2:1023,2:1023))
         Generating copyin(a(:,:))
         Generating Tesla code
     24, Loop is parallelizable
     25, Loop is parallelizable
         Accelerator kernel generated
         24, !$acc loop gang ! blockidx%y
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         28, Max reduction generated for error
     33, Generating copyin(anew(2:1023,2:1023))
         Generating copyout(a(2:1023,2:1023))
         Generating Tesla code
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc loop gang ! blockidx%y
         35, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         Memory copy idiom, loop replaced by call to __c_mcopy4

CCE

ftn -rm -O3 -hacc task2.F90

ftn-6332 ftn: VECTOR File = task2.F90, Line = 12 
  A loop starting at line 12 was not vectorized because it does not map well onto the target architecture.

ftn-6005 ftn: SCALAR File = task2.F90, Line = 12 
  A loop starting at line 12 was unrolled 8 times.

ftn-6230 ftn: VECTOR File = task2.F90, Line = 13 
  A loop starting at line 13 was replaced with multiple library calls.

ftn-6004 ftn: SCALAR File = task2.F90, Line = 14 
  A loop starting at line 14 was fused with the loop starting at line 12.

ftn-6004 ftn: SCALAR File = task2.F90, Line = 15 
  A loop starting at line 15 was fused with the loop starting at line 13.

ftn-3021 ftn: IPA File = task2.F90, Line = 19 
  "_CPU_TIME_4" (called from "jacobi_acc_kernels") was not inlined because the compiler was unable to locate the routine.

ftn-6286 ftn: VECTOR File = task2.F90, Line = 21 
  A loop starting at line 21 was not vectorized because it contains input/output operations at line 41.

ftn-6413 ftn: ACCEL File = task2.F90, Line = 23 
  A data region was created at line 23 and ending at line 31.

ftn-6418 ftn: ACCEL File = task2.F90, Line = 23 
  If not already present: allocate memory and copy whole array "a" to accelerator, free at line 31 (acc_copyin).

ftn-6416 ftn: ACCEL File = task2.F90, Line = 23 
  If not already present: allocate memory and copy whole array "anew" to accelerator, copy back at line 31 (acc_copy).

ftn-6401 ftn: ACCEL File = task2.F90, Line = 24 
  A loop starting at line 24 was placed on the accelerator.

ftn-6430 ftn: ACCEL File = task2.F90, Line = 24 
  A loop starting at line 24 was partitioned across the thread blocks.

ftn-6415 ftn: ACCEL File = task2.F90, Line = 24 
  Allocate memory and copy variable "error" to accelerator, copy back at line 30 (acc_copy).

ftn-6430 ftn: ACCEL File = task2.F90, Line = 25 
  A loop starting at line 25 was partitioned across the 128 threads within a threadblock.

ftn-6413 ftn: ACCEL File = task2.F90, Line = 33 
  A data region was created at line 33 and ending at line 39.

ftn-6418 ftn: ACCEL File = task2.F90, Line = 33 
  If not already present: allocate memory and copy whole array "anew" to accelerator, free at line 39 (acc_copyin).

ftn-6416 ftn: ACCEL File = task2.F90, Line = 33 
  If not already present: allocate memory and copy whole array "a" to accelerator, copy back at line 39 (acc_copy).

ftn-6401 ftn: ACCEL File = task2.F90, Line = 34 
  A loop starting at line 34 was placed on the accelerator.

ftn-6430 ftn: ACCEL File = task2.F90, Line = 34 
  A loop starting at line 34 was partitioned across the thread blocks.

ftn-6430 ftn: ACCEL File = task2.F90, Line = 35 
  A loop starting at line 35 was partitioned across the 128 threads within a threadblock.

ftn-3021 ftn: IPA File = task2.F90, Line = 44 
  "_CPU_TIME_4" (called from "jacobi_acc_kernels") was not inlined because the compiler was unable to locate the routine.

2015-07-02T14:02:31+00:00

jg piccinali reporter
- edited description
- 2015-07-02T14:06:53+00:00

jg piccinali reporter

Data sloshing

PGI

export PGI_ACC_TIME=1
aprun -n1 exe

Accelerator Kernel Timing data
/scratch/santis/piccinal/EuroHack15.git/examples/qwiklab/PGI/task2.F90
  jacobi_acc_kernels  NVIDIA  devicenum=0
    time(us): 28,178
    23: data region reached 1000 times
        23: data copyin transfers: 1000
             device time(us): total=8,578 max=36 min=6 avg=8
        31: data copyout transfers: 1000
             device time(us): total=5,810 max=31 min=4 avg=5
    23: compute region reached 1000 times
        25: kernel launched 1000 times
            grid: [8x1022]  block: [128]
            elapsed time(us): total=133,026 max=163 min=131 avg=133
        25: reduction kernel launched 1000 times
            grid: [1]  block: [256]
            elapsed time(us): total=40,693 max=65 min=39 avg=40
    33: data region reached 1000 times
        33: data copyin transfers: 1000
             device time(us): total=7,789 max=36 min=5 avg=7
        39: data copyout transfers: 1000
             device time(us): total=6,001 max=31 min=5 avg=6
    33: compute region reached 1000 times
        35: kernel launched 1000 times
            grid: [8x1022]  block: [128]
            elapsed time(us): total=86,275 max=152 min=85 avg=86
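
To put the PGI_ACC_TIME figures side by side, the sketch below sums the reported transfer and kernel totals. All values are copied from the output above; the dictionary keys are just labels chosen for this example.

```python
# Totals in microseconds, copied from the PGI_ACC_TIME output above.
transfers_us = {
    "23: copyin": 8_578,
    "31: copyout": 5_810,
    "33: copyin": 7_789,
    "39: copyout": 6_001,
}
kernels_us = {
    "25: kernel": 133_026,
    "25: reduction kernel": 40_693,
    "35: kernel": 86_275,
}

transfer_total = sum(transfers_us.values())
kernel_total = sum(kernels_us.values())

print(f"data transfers: {transfer_total} us")
print(f"kernels:        {kernel_total} us")
print(f"combined:       {(transfer_total + kernel_total) / 1e6:.3f} s")
```

The transfer sum reproduces the `time(us): 28,178` header line. Transfers and kernels together come to about 0.29 s, well below the 5.098948 s wall-clock total, so the device timings reported here do not account for most of the runtime on their own.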

CCE

export CRAY_ACC_DEBUG=2
aprun -n1 exe

Jacobi relaxation Calculation: 1024 x 1024 mesh
ACC: Initialize CUDA
ACC: Get Device 0
ACC: Create Context
ACC: Set Thread Context
ACC: Start transfer 2 items from task2.F90:23
ACC:       allocate, copy to acc 'a' (4194304 bytes)
ACC:       allocate, copy to acc 'anew' (4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: Start transfer 3 items from task2.F90:24
ACC:       allocate reusable <internal> (4 bytes)
ACC:       allocate reusable, copy to acc <internal> (4 bytes)
ACC:       allocate reusable <internal> (4088 bytes)
ACC: End transfer (to acc 4 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_$ck_L24_3 blocks:1022 threads:128 async(auto) from task2.F90:24
ACC: Wait async(auto) from task2.F90:30
ACC: Start transfer 3 items from task2.F90:30
ACC:       copy to host, done reusable <internal> (4 bytes)
ACC:       done reusable <internal> (4 bytes)
ACC:       done reusable <internal> (0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: Wait async(auto) from task2.F90:31
ACC: Start transfer 2 items from task2.F90:31
ACC:       free 'a' (4194304 bytes)
ACC:       copy to host, free 'anew' (4194304 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
ACC: Start transfer 2 items from task2.F90:33
ACC:       allocate, copy to acc 'a' (4194304 bytes)
ACC:       allocate, copy to acc 'anew' (4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_$ck_L34_5 blocks:1022 threads:128 async(auto) from task2.F90:34
ACC: Wait async(auto) from task2.F90:39
ACC: Start transfer 2 items from task2.F90:39
ACC:       copy to host, free 'a' (4194304 bytes)
ACC:       free 'anew' (4194304 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
ACC: Start transfer 2 items from task2.F90:23
     0   0.250000
ACC:       allocate, copy to acc 'a' (4194304 bytes)
ACC:       allocate, copy to acc 'anew' (4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: Start transfer 3 items from task2.F90:24
ACC:       reusable acquired <internal> (4 bytes)
ACC:       reusable acquired <internal> (4 bytes)
ACC:       reusable acquired <internal> (4088 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_$ck_L24_3 blocks:1022 threads:128 async(auto) from task2.F90:24
etc...
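
The amount of data moved per iteration can be read off the `End transfer` lines. The sketch below sums them for the first pass through task2.F90 lines 23 to 39, using lines copied from the log above. The helper name `sum_transfers` is made up for this example and is not part of any Cray tool.

```python
import re

# "End transfer" lines copied from the CRAY_ACC_DEBUG=2 output above
# (first pass through task2.F90 lines 23-39).
LOG = """\
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: End transfer (to acc 4 bytes, to host 0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
"""

PATTERN = re.compile(r"End transfer \(to acc (\d+) bytes, to host (\d+) bytes\)")

def sum_transfers(log_text):
    """Return (bytes to accelerator, bytes to host) summed over the log."""
    to_acc = to_host = 0
    for match in PATTERN.finditer(log_text):
        to_acc += int(match.group(1))
        to_host += int(match.group(2))
    return to_acc, to_host

to_acc, to_host = sum_transfers(LOG)
print(f"to acc: {to_acc} bytes, to host: {to_host} bytes")
```

For this excerpt the sums are 16777220 bytes to the accelerator and 8388612 bytes back to the host, and the log shows the same pattern starting again at task2.F90:23 for the next iteration.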

2015-07-02T14:21:49+00:00

jg piccinali reporter
- edited description
- 2015-07-02T14:22:41+00:00

jg piccinali reporter

!$acc data copy

Clauses available for use with the `data` directive

copy( list ) - Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
copyin( list ) - Allocates memory on GPU and copies data from host to GPU when entering region.
copyout( list ) - Allocates memory on GPU and copies data to the host when exiting region.
create( list ) - Allocates memory on GPU but does not copy.
present( list ) - Data is already present on GPU from another containing data region.
- OpenACC spec
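
The clause descriptions above can be turned into a small transfer model. This is only a sketch: the function name `region_traffic` and the clause choice in the usage line are hypothetical, and the 4-byte element size is inferred from the 4194304-byte arrays reported in the CCE logs for the 1024 x 1024 mesh.

```python
# Per clause: (copy host->GPU on entry, copy GPU->host on exit),
# following the clause descriptions listed above.
CLAUSE_COPIES = {
    "copy":    (True,  True),
    "copyin":  (True,  False),
    "copyout": (False, True),
    "create":  (False, False),
    "present": (False, False),
}

def region_traffic(variables):
    """Bytes moved (to_gpu, to_host) for one entry/exit of a data region.

    `variables` is a list of (clause, size_in_bytes) pairs.
    """
    to_gpu = sum(size for clause, size in variables if CLAUSE_COPIES[clause][0])
    to_host = sum(size for clause, size in variables if CLAUSE_COPIES[clause][1])
    return to_gpu, to_host

ARRAY_BYTES = 1024 * 1024 * 4   # one 1024 x 1024 array, assuming 4-byte elements

# Hypothetical clause choice for illustration: one array copied both ways,
# one array only created on the GPU.
print(region_traffic([("copy", ARRAY_BYTES), ("create", ARRAY_BYTES)]))
```

The point of the model is that a data region placed outside the iteration loop pays this traffic once, whereas a region entered on every iteration pays it every time.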

Timings

CCE: ftn -hacc -O3 task3.F90 ; unset CRAY_ACC_DEBUG ; aprun -n1 a.out
PGI: pgfortran -acc -O3 task3.F90 ; unset PGI_ACC_TIME ; aprun -n1 a.out

CCE: total:  0.168010 s   ( speedup = 5x )
PGI: total:  0.391570 s  ( speedup = 2.7x )
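
The speedups quoted above can be reproduced by dividing the "Run without tool" totals by the accelerated totals. The sketch below assumes that definition of speedup; the `speedup` helper is illustrative.

```python
def speedup(baseline_s, accelerated_s):
    """Speedup as baseline time divided by accelerated time."""
    return baseline_s / accelerated_s

# (baseline from "Run without tool", accelerated total from the timings above)
runs = {
    "CCE": (0.840052, 0.168010),
    "PGI": (1.059725, 0.391570),
}

for compiler, (baseline, accelerated) in runs.items():
    print(f"{compiler}: {speedup(baseline, accelerated):.1f}x")
```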

Compiler reports

PGI

module load craype-accel-nvidia35; pgfortran -acc -Minfo -O3 task3.F90

     26, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang ! blockidx%y
         26, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         29, Max reduction generated for error

     36, Loop is parallelizable
         Accelerator kernel generated
         35, !$acc loop gang ! blockidx%y
         36, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         Memory copy idiom, loop replaced by call to __c_mcopy4

export PGI_ACC_TIME=1; aprun -n1 a.out

Accelerator Kernel Timing data
/scratch/santis/piccinal/EuroHack15.git/examples/qwiklab/PGI/task3.F90
  jacobi_acc_kernels_datacopy  NVIDIA  devicenum=0
    time(us): 33
    21: data region reached 1 time
        21: data copyin transfers: 1
             device time(us): total=21 max=21 min=21 avg=21
        45: data copyout transfers: 1
             device time(us): total=12 max=12 min=12 avg=12

    24: compute region reached 1000 times
        26: kernel launched 1000 times
            grid: [8x1022]  block: [128]    <--- 8 x 1022 blocks x 128 threads/block = 1046528 threads
                                                 (the 1024 x 1024 mesh has 1048576 cells)
                                                 grid = gangs, block = vector length
            elapsed time(us): total=132,077 max=162 min=130 avg=132 <--------------- 132077usec = 0.132seconds
        26: reduction kernel launched 1000 times
            grid: [1]  block: [256]
            elapsed time(us): total=40,107 max=91 min=39 avg=40

    34: compute region reached 1000 times
        36: kernel launched 1000 times
            grid: [8x1022]  block: [128]                  
            elapsed time(us): total=84,747 max=108 min=83 avg=84  <--------------- 84747usec = 0.084seconds
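
The launch configuration arithmetic in the annotations above can be checked with a few lines. The sketch assumes the task3 kernels update the same 2:1023 interior range that the task2 compiler report shows, and that threads beyond that range simply do no work.

```python
# Launch configuration reported by PGI_ACC_TIME: grid [8x1022], block [128].
grid_x, grid_y = 8, 1022
block = 128

threads_per_line = grid_x * block           # threads covering one pass of the inner loop
total_threads = threads_per_line * grid_y

# Assumed loop range: interior points 2:1023 in each dimension.
interior = 1022
interior_points = interior * interior

print(f"threads per inner-loop pass: {threads_per_line} for {interior} points")
print(f"total threads:   {total_threads}")
print(f"interior points: {interior_points}")
print(f"surplus threads: {total_threads - interior_points}")
```

Under these assumptions each pass of the inner loop gets 1024 threads for 1022 points, so only 2 threads per pass (2044 in total) are surplus.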

CCE

ftn -rm -O3 -hacc task3.F90

$ grep region task3.lst
  A data region was created at line 21 and ending at line 45.
  A data region was created at line 24 and ending at line 32.
  A data region was created at line 34 and ending at line 40.

$ grep region task2.lst
  A data region was created at line 23 and ending at line 31.
  A data region was created at line 33 and ending at line 39.

export CRAY_ACC_DEBUG=2 ; aprun -n1 a.out

Jacobi relaxation Calculation: 1024 x 1024 mesh
ACC: Initialize CUDA
ACC: Get Device 0
ACC: Create Context
ACC: Set Thread Context
ACC: Start transfer 2 items from task3.F90:21
ACC:       allocate, copy to acc 'a' (4194304 bytes)
ACC:       allocate 'anew' (4194304 bytes)
ACC: End transfer (to acc 4194304 bytes, to host 0 bytes)
ACC: Start transfer 3 items from task3.F90:25
ACC:       allocate reusable <internal> (4 bytes)
ACC:       allocate reusable, copy to acc <internal> (4 bytes)
ACC:       allocate reusable <internal> (4088 bytes)
ACC: End transfer (to acc 4 bytes, to host 0 bytes)
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L25_3 blocks:1022 threads:128 async(auto) from task3.F90:25
ACC: Wait async(auto) from task3.F90:31
ACC: Start transfer 3 items from task3.F90:31
ACC:       copy to host, done reusable <internal> (4 bytes)
ACC:       done reusable <internal> (4 bytes)
ACC:       done reusable <internal> (0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L35_5 blocks:1022 threads:128 async(auto) from task3.F90:35
ACC: Start transfer 3 items from task3.F90:25
ACC:       reusable acquired <internal> (4 bytes)
ACC:       reusable acquired <internal> (4 bytes)
ACC:       reusable acquired <internal> (4088 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
     0   0.250000
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L25_3 blocks:1022 threads:128 async(auto) from task3.F90:25


etc...


ACC: Start transfer 3 items from task3.F90:31
ACC:       copy to host, done reusable <internal> (4 bytes)
ACC:       done reusable <internal> (4 bytes)
ACC:       done reusable <internal> (0 bytes)
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L35_5 blocks:1022 threads:128 async(auto) from task3.F90:35
ACC: Wait async(auto) from task3.F90:45
ACC: Start transfer 2 items from task3.F90:45
ACC:       copy to host, free 'a' (4194304 bytes)
ACC:       free 'anew' (4194304 bytes)
ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)

2015-07-02T14:52:07+00:00

jg piccinali reporter
User instrumentation
- 2015-07-05T16:18:05+00:00

Assignee: –

Type: bug

Priority: major

Status: new

Votes: 0

Watchers: 1
