Loop work estimates (CCE compiler only)

Issue #44 new
jg piccinali repo owner created an issue

Description

  • Cray Perftools User Guide
  • man /opt/cray/perftools/default/man/man1/reveal.1
  • man /opt/cray/perftools/default/man/man1/intro_craypat.1

Get the src:

  • git clone EuroHack15.git
  • cd examples/qwiklab

Setup:

  • module load perftools/6.2.3
  • module list
Currently Loaded Modulefiles:
  1) modules/3.2.10.3
  2) nodestat/2.2-1.0502.53712.3.109.ari
  3) sdb/1.0-1.0502.55976.5.27.ari
  4) alps/5.2.1-2.0502.9041.11.6.ari
  5) lustre-cray_ari_s/2.5_3.0.101_0.31.1_1.0502.8394.10.1-1.0502.17198.8.51
  6) udreg/2.3.2-1.0502.9275.1.12.ari
  7) ugni/5.0-1.0502.9685.4.24.ari
  8) gni-headers/3.0-1.0502.9684.5.2.ari
  9) dmapp/7.0.1-1.0502.9501.5.219.ari
 10) xpmem/0.1-2.0502.55507.3.2.ari
 11) hss-llm/7.2.0
 12) Base-opts/1.0.2-1.0502.53325.1.2.ari
 13) craype-network-aries
 14) craype/2.4.0
 15) cce/8.3.12
 16) totalview-support/1.1.4
 17) totalview/8.11.0
 18) cray-libsci/13.0.4
 19) pmi/5.0.7-1.0000.10678.155.25.ari
 20) rca/1.0.0-2.0502.53711.3.127.ari
 21) atp/1.8.2
 22) PrgEnv-cray/5.2.40
 23) craype-sandybridge
 24) slurm
 25) cray-mpich/7.2.2
 26) ddt/5.0
 27) perftools/6.2.3

Compile:

  • module load perftools/6.2.3
  • ftn -O3 -hnoomp -h profile_generate task1.F90 -o CCE8312
  • pat_build -w CCE8312 # => CCE8312+pat

Run & Profile:

  • aprun -n1 ./CCE8312+pat
CrayPat/X:  Version 6.2.3 Revision 13730  03/23/15 16:01:49
PGO data version:  L.14.1:B.3.1
Jacobi relaxation Calculation: 1024 x 1024 mesh
     0   0.250000
   100   0.002397
   200   0.001204
   300   0.000804
   400   0.000603
   500   0.000483
   600   0.000403
   700   0.000345
   800   0.000302
   900   0.000269
total:  0.920057 s
Experiment data file written:
./EuroHack15.git/examples/qwiklab/CRAY/CCE8312+pat+9804-2t.xf
  • Run without the tool (baseline timings):
PGI/15.x: total:  1.059725 s
CCE/8.3.x: total:  0.840052 s

Loop work estimates

  • pat_report -T CCE8312+pat+9804-2t.xf > xfT
Table 2:  Inclusive and Exclusive Time in Loops (from -hprofile_generate)
  Loop |     Loop |     Time |    Loop |   Loop |  Loop |  Loop |Function=/.LOOP[.]
  Incl |     Incl |    (Loop |     Hit |  Trips | Trips | Trips |
 Time% |     Time |    Adj.) |         |    Avg |   Min |   Max |
|-----------------------------------------------------------------------------
| 99.7% | 0.919511 | 0.000226 |       1 | 1000.0 |  1000 |  1000 |jacobi1_.LOOP.1.li.41
| 64.5% | 0.595301 | 0.013419 |    1000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.2.li.43
| 63.1% | 0.581882 | 0.581882 | 1022000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.3.li.44
| 35.1% | 0.323985 | 0.009816 |    1000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.4.li.51
| 34.1% | 0.314169 | 0.314169 | 1022000 | 1022.0 |  1022 |  1022 |jacobi1_.LOOP.5.li.52
|===========================================
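The "Time (Loop Adj.)" column can be read as exclusive time: a loop's inclusive time minus the inclusive time of the loops nested directly inside it. A minimal sketch that recomputes it from the table, assuming the nesting below (inferred from the line numbers and hit counts, not taken from the task1.F90 source):

```python
# Inclusive times (seconds) copied from Table 2 of the pat_report output.
inclusive = {
    "LOOP.1.li.41": 0.919511,
    "LOOP.2.li.43": 0.595301,
    "LOOP.3.li.44": 0.581882,
    "LOOP.4.li.51": 0.323985,
    "LOOP.5.li.52": 0.314169,
}

# Assumed nesting: loop 1 (iterations) contains loops 2 and 4,
# which in turn contain loops 3 and 5.
children = {
    "LOOP.1.li.41": ["LOOP.2.li.43", "LOOP.4.li.51"],
    "LOOP.2.li.43": ["LOOP.3.li.44"],
    "LOOP.3.li.44": [],
    "LOOP.4.li.51": ["LOOP.5.li.52"],
    "LOOP.5.li.52": [],
}


def exclusive_time(loop):
    """Inclusive time of a loop minus the inclusive time of its direct children."""
    return inclusive[loop] - sum(inclusive[c] for c in children[loop])


for loop in inclusive:
    print(f"{loop}: {exclusive_time(loop):.6f} s")
```

With this nesting the results agree with the "Loop Adj." column to within rounding of the printed values. The hit counts are consistent with it too: 1000 outer trips times 1022 trips gives the 1022000 hits of loops 3 and 5.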

Comments (7)

  1. jg piccinali reporter

    !$acc kernels

    Timings

    CCE: total:  5.064316 s  ( speedup=0.17x )
    PGI: total:  5.098948 s  ( speedup=0.21x )
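The speedups are presumably relative to the "run without tool" timings in the description. A small sketch that recomputes them, assuming speedup means baseline wall time divided by accelerated wall time:

```python
def speedup(baseline_s, accelerated_s):
    """Speedup as baseline wall time divided by accelerated wall time."""
    return baseline_s / accelerated_s


# Baseline ("run without tool") and !$acc kernels timings quoted in this issue.
baseline = {"CCE": 0.840052, "PGI": 1.059725}
acc_kernels = {"CCE": 5.064316, "PGI": 5.098948}

for compiler in baseline:
    s = speedup(baseline[compiler], acc_kernels[compiler])
    print(f"{compiler}: {s:.2f}x")
```

Under that definition both compilers come out slower than the CPU baseline with !$acc kernels alone: roughly 0.17x for CCE and 0.21x for PGI.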
    

    Compiler report

    PGI

    • module load craype-accel-nvidia35
    • pgfortran -acc -Minfo task2.F90
    jacobi_acc_kernels:
         13, Memory zero idiom, loop replaced by call to __c_mzero4
         15, Memory zero idiom, loop replaced by call to __c_mzero4
         23, Generating copyout(anew(2:1023,2:1023))
             Generating copyin(a(:,:))
             Generating Tesla code
         24, Loop is parallelizable
         25, Loop is parallelizable
             Accelerator kernel generated
             24, !$acc loop gang ! blockidx%y
             25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             28, Max reduction generated for error
         33, Generating copyin(anew(2:1023,2:1023))
             Generating copyout(a(2:1023,2:1023))
             Generating Tesla code
         34, Loop is parallelizable
         35, Loop is parallelizable
             Accelerator kernel generated
             34, !$acc loop gang ! blockidx%y
             35, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             Memory copy idiom, loop replaced by call to __c_mcopy4
    

    CCE

    • ftn -rm -O3 -hacc task2.F90
    ftn-6332 ftn: VECTOR File = task2.F90, Line = 12 
      A loop starting at line 12 was not vectorized because it does not map well onto the target architecture.
    
    ftn-6005 ftn: SCALAR File = task2.F90, Line = 12 
      A loop starting at line 12 was unrolled 8 times.
    
    ftn-6230 ftn: VECTOR File = task2.F90, Line = 13 
      A loop starting at line 13 was replaced with multiple library calls.
    
    ftn-6004 ftn: SCALAR File = task2.F90, Line = 14 
      A loop starting at line 14 was fused with the loop starting at line 12.
    
    ftn-6004 ftn: SCALAR File = task2.F90, Line = 15 
      A loop starting at line 15 was fused with the loop starting at line 13.
    
    ftn-3021 ftn: IPA File = task2.F90, Line = 19 
      "_CPU_TIME_4" (called from "jacobi_acc_kernels") was not inlined because the compiler was unable to locate the routine.
    
    ftn-6286 ftn: VECTOR File = task2.F90, Line = 21 
      A loop starting at line 21 was not vectorized because it contains input/output operations at line 41.
    
    ftn-6413 ftn: ACCEL File = task2.F90, Line = 23 
      A data region was created at line 23 and ending at line 31.
    
    ftn-6418 ftn: ACCEL File = task2.F90, Line = 23 
      If not already present: allocate memory and copy whole array "a" to accelerator, free at line 31 (acc_copyin).
    
    ftn-6416 ftn: ACCEL File = task2.F90, Line = 23 
      If not already present: allocate memory and copy whole array "anew" to accelerator, copy back at line 31 (acc_copy).
    
    ftn-6401 ftn: ACCEL File = task2.F90, Line = 24 
      A loop starting at line 24 was placed on the accelerator.
    
    ftn-6430 ftn: ACCEL File = task2.F90, Line = 24 
      A loop starting at line 24 was partitioned across the thread blocks.
    
    ftn-6415 ftn: ACCEL File = task2.F90, Line = 24 
      Allocate memory and copy variable "error" to accelerator, copy back at line 30 (acc_copy).
    
    ftn-6430 ftn: ACCEL File = task2.F90, Line = 25 
      A loop starting at line 25 was partitioned across the 128 threads within a threadblock.
    
    ftn-6413 ftn: ACCEL File = task2.F90, Line = 33 
      A data region was created at line 33 and ending at line 39.
    
    ftn-6418 ftn: ACCEL File = task2.F90, Line = 33 
      If not already present: allocate memory and copy whole array "anew" to accelerator, free at line 39 (acc_copyin).
    
    ftn-6416 ftn: ACCEL File = task2.F90, Line = 33 
      If not already present: allocate memory and copy whole array "a" to accelerator, copy back at line 39 (acc_copy).
    
    ftn-6401 ftn: ACCEL File = task2.F90, Line = 34 
      A loop starting at line 34 was placed on the accelerator.
    
    ftn-6430 ftn: ACCEL File = task2.F90, Line = 34 
      A loop starting at line 34 was partitioned across the thread blocks.
    
    ftn-6430 ftn: ACCEL File = task2.F90, Line = 35 
      A loop starting at line 35 was partitioned across the 128 threads within a threadblock.
    
    ftn-3021 ftn: IPA File = task2.F90, Line = 44 
      "_CPU_TIME_4" (called from "jacobi_acc_kernels") was not inlined because the compiler was unable to locate the routine.
    
  2. jg piccinali reporter

    Data sloshing: the arrays are transferred between host and GPU on every iteration

    PGI

    • export PGI_ACC_TIME=1
    • aprun -n1 exe
    Accelerator Kernel Timing data
    /scratch/santis/piccinal/EuroHack15.git/examples/qwiklab/PGI/task2.F90
      jacobi_acc_kernels  NVIDIA  devicenum=0
        time(us): 28,178
        23: data region reached 1000 times
            23: data copyin transfers: 1000
                 device time(us): total=8,578 max=36 min=6 avg=8
            31: data copyout transfers: 1000
                 device time(us): total=5,810 max=31 min=4 avg=5
        23: compute region reached 1000 times
            25: kernel launched 1000 times
                grid: [8x1022]  block: [128]
                elapsed time(us): total=133,026 max=163 min=131 avg=133
            25: reduction kernel launched 1000 times
                grid: [1]  block: [256]
                elapsed time(us): total=40,693 max=65 min=39 avg=40
        33: data region reached 1000 times
            33: data copyin transfers: 1000
                 device time(us): total=7,789 max=36 min=5 avg=7
            39: data copyout transfers: 1000
                 device time(us): total=6,001 max=31 min=5 avg=6
        33: compute region reached 1000 times
            35: kernel launched 1000 times
                grid: [8x1022]  block: [128]
                elapsed time(us): total=86,275 max=152 min=85 avg=86
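To see how the PGI_ACC_TIME numbers add up, the totals can be summed by hand. A sketch using only the values printed in the report above (the dictionary labels are just for illustration):

```python
# Device times in microseconds, copied from the PGI_ACC_TIME report for task2.F90.
data_transfer_us = {
    "line 23 copyin": 8_578,
    "line 31 copyout": 5_810,
    "line 33 copyin": 7_789,
    "line 39 copyout": 6_001,
}
kernel_us = {
    "line 25 kernel": 133_026,
    "line 25 reduction": 40_693,
    "line 35 kernel": 86_275,
}

total_data_us = sum(data_transfer_us.values())
total_kernel_us = sum(kernel_us.values())

print(f"data transfers: {total_data_us} us")
print(f"kernels:        {total_kernel_us} us")
print(f"device total:   {(total_data_us + total_kernel_us) / 1e6:.3f} s")
```

The data transfer sum matches the `time(us): 28,178` line at the top of the report. Transfers plus kernels come to about 0.29 s, well below the 5.098948 s wall time quoted for PGI in the first comment; this report does not show where the remaining time goes.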
    

    CCE

    • export CRAY_ACC_DEBUG=2
    • aprun -n1 exe
    Jacobi relaxation Calculation: 1024 x 1024 mesh
    ACC: Initialize CUDA
    ACC: Get Device 0
    ACC: Create Context
    ACC: Set Thread Context
    ACC: Start transfer 2 items from task2.F90:23
    ACC:       allocate, copy to acc 'a' (4194304 bytes)
    ACC:       allocate, copy to acc 'anew' (4194304 bytes)
    ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
    ACC: Start transfer 3 items from task2.F90:24
    ACC:       allocate reusable <internal> (4 bytes)
    ACC:       allocate reusable, copy to acc <internal> (4 bytes)
    ACC:       allocate reusable <internal> (4088 bytes)
    ACC: End transfer (to acc 4 bytes, to host 0 bytes)
    ACC: Execute kernel jacobi_acc_kernels_$ck_L24_3 blocks:1022 threads:128 async(auto) from task2.F90:24
    ACC: Wait async(auto) from task2.F90:30
    ACC: Start transfer 3 items from task2.F90:30
    ACC:       copy to host, done reusable <internal> (4 bytes)
    ACC:       done reusable <internal> (4 bytes)
    ACC:       done reusable <internal> (0 bytes)
    ACC: End transfer (to acc 0 bytes, to host 4 bytes)
    ACC: Wait async(auto) from task2.F90:31
    ACC: Start transfer 2 items from task2.F90:31
    ACC:       free 'a' (4194304 bytes)
    ACC:       copy to host, free 'anew' (4194304 bytes)
    ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
    ACC: Start transfer 2 items from task2.F90:33
    ACC:       allocate, copy to acc 'a' (4194304 bytes)
    ACC:       allocate, copy to acc 'anew' (4194304 bytes)
    ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
    ACC: Execute kernel jacobi_acc_kernels_$ck_L34_5 blocks:1022 threads:128 async(auto) from task2.F90:34
    ACC: Wait async(auto) from task2.F90:39
    ACC: Start transfer 2 items from task2.F90:39
    ACC:       copy to host, free 'a' (4194304 bytes)
    ACC:       free 'anew' (4194304 bytes)
    ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
    ACC: Start transfer 2 items from task2.F90:23
         0   0.250000
    ACC:       allocate, copy to acc 'a' (4194304 bytes)
    ACC:       allocate, copy to acc 'anew' (4194304 bytes)
    ACC: End transfer (to acc 8388608 bytes, to host 0 bytes)
    ACC: Start transfer 3 items from task2.F90:24
    ACC:       reusable acquired <internal> (4 bytes)
    ACC:       reusable acquired <internal> (4 bytes)
    ACC:       reusable acquired <internal> (4088 bytes)
    ACC: End transfer (to acc 0 bytes, to host 0 bytes)
    ACC: Execute kernel jacobi_acc_kernels_$ck_L24_3 blocks:1022 threads:128 async(auto) from task2.F90:24
    etc...
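The log makes the per-iteration traffic easy to estimate. A rough sketch, assuming 4-byte reals (consistent with 1024 x 1024 x 4 = 4194304 bytes per array) and 1000 iterations, and ignoring the 4-byte reduction scalar:

```python
N = 1024                  # mesh is 1024 x 1024
BYTES_PER_ELEMENT = 4     # assumption: 4-byte reals
array_bytes = N * N * BYTES_PER_ELEMENT

# Per-iteration transfers read off the CRAY_ACC_DEBUG log:
# both arrays go to the accelerator at lines 23 and 33,
# one array comes back to the host at lines 31 and 39.
to_acc_per_iter = 2 * array_bytes + 2 * array_bytes
to_host_per_iter = array_bytes + array_bytes

iterations = 1000
total_bytes = iterations * (to_acc_per_iter + to_host_per_iter)

print(f"array size:     {array_bytes} bytes")
print(f"per iteration:  {to_acc_per_iter + to_host_per_iter} bytes")
print(f"whole run:      {total_bytes / 2**30:.2f} GiB")
```

That is about 24 MiB moved per iteration, or roughly 23 GiB over the whole run, for arrays that never need to leave the GPU between iterations.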
    
  3. jg piccinali reporter

    !$acc data copy

    Clauses available for use with the data directive:

    • copy( list ) - Allocates memory on the GPU, copies data from the host to the GPU on entry to the region, and copies it back to the host on exit.
    • copyin( list ) - Allocates memory on the GPU and copies data from the host to the GPU on entry to the region.
    • copyout( list ) - Allocates memory on the GPU and copies data to the host on exit from the region.
    • create( list ) - Allocates memory on the GPU but does not copy.
    • present( list ) - Data is already present on the GPU from another containing data region.
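The clause list can be summarised as a small lookup table. The sketch below is only a model of the descriptions above (the names `ClauseBehaviour`, `DATA_CLAUSES` and `matching_clauses` are made up for illustration, not part of any OpenACC API); it can help when guessing which clause produced a given transfer pattern in a runtime log:

```python
from typing import NamedTuple


class ClauseBehaviour(NamedTuple):
    allocates: bool    # allocate GPU memory on entry to the region
    copies_in: bool    # host -> GPU on entry
    copies_out: bool   # GPU -> host on exit


DATA_CLAUSES = {
    "copy":    ClauseBehaviour(True, True, True),
    "copyin":  ClauseBehaviour(True, True, False),
    "copyout": ClauseBehaviour(True, False, True),
    "create":  ClauseBehaviour(True, False, False),
    "present": ClauseBehaviour(False, False, False),
}


def matching_clauses(copies_in, copies_out):
    """Allocating clauses whose transfer behaviour matches what a log shows."""
    return [
        name
        for name, b in DATA_CLAUSES.items()
        if b.allocates and b.copies_in == copies_in and b.copies_out == copies_out
    ]


print(matching_clauses(copies_in=True, copies_out=True))
print(matching_clauses(copies_in=False, copies_out=False))
```

In the CCE log for task3 further down, 'a' is copied to the accelerator at line 21 and back to the host at line 45, while 'anew' is only allocated and freed. That pattern is what copy and create would produce, although the task3.F90 source is not shown here, so this is an inference.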

    Timings

    • CCE: ftn -hacc -O3 task3.F90 ; unset CRAY_ACC_DEBUG ; aprun -n1 a.out
    • PGI: pgfortran -acc -O3 task3.F90 ; unset PGI_ACC_TIME ; aprun -n1 a.out
    CCE: total:  0.168010 s   ( speedup = 5x )
    PGI: total:  0.391570 s  ( speedup = 2.7x )
    

    Compiler reports

    PGI

    • module load craype-accel-nvidia35; pgfortran -acc -Minfo -O3 task3.F90
         26, Loop is parallelizable
             Accelerator kernel generated
             25, !$acc loop gang ! blockidx%y
             26, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             29, Max reduction generated for error
    
         36, Loop is parallelizable
             Accelerator kernel generated
             35, !$acc loop gang ! blockidx%y
             36, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             Memory copy idiom, loop replaced by call to __c_mcopy4
    
    • export PGI_ACC_TIME=1; aprun -n1 a.out
    Accelerator Kernel Timing data
    /scratch/santis/piccinal/EuroHack15.git/examples/qwiklab/PGI/task3.F90
      jacobi_acc_kernels_datacopy  NVIDIA  devicenum=0
        time(us): 33
        21: data region reached 1 time
            21: data copyin transfers: 1
                 device time(us): total=21 max=21 min=21 avg=21
            45: data copyout transfers: 1
                 device time(us): total=12 max=12 min=12 avg=12
    
        24: compute region reached 1000 times
            26: kernel launched 1000 times
                grid: [8x1022]  block: [128]    <------------- 1022 blocks * 8 blocks * 128 threads/block = 1046528 threads
                                                                                 1024 x 1024 mesh = 1048576 cells
                                                                                 grid = gangs, block = vector length
                elapsed time(us): total=132,077 max=162 min=130 avg=132 <--------------- 132077 us = 0.132 s
            26: reduction kernel launched 1000 times
                grid: [1]  block: [256]
                elapsed time(us): total=40,107 max=91 min=39 avg=40
    
        34: compute region reached 1000 times
            36: kernel launched 1000 times
                grid: [8x1022]  block: [128]                  
                elapsed time(us): total=84,747 max=108 min=83 avg=84  <--------------- 84747 us = 0.085 s
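The launch geometry annotated in the report can be checked with a few lines of arithmetic. A sketch, assuming the loops cover the 1022 x 1022 interior of the mesh (indices 2..1023, as the loop trip counts and the task2 compiler report suggest):

```python
import math

threads_per_block = 128
grid = (8, 1022)                  # grid: [8x1022] from the report

total_threads = grid[0] * grid[1] * threads_per_block
mesh_cells = 1024 * 1024
interior_cells = 1022 * 1022      # assumption: loops run over 2..1023 in both directions

# Blocks needed along the inner loop to cover 1022 iterations with 128 threads each.
blocks_inner = math.ceil(1022 / threads_per_block)

print(f"threads launched: {total_threads}")
print(f"mesh cells:       {mesh_cells}")
print(f"interior cells:   {interior_cells}")
print(f"blocks per row:   {blocks_inner}")
print(f"idle threads:     {total_threads - interior_cells}")
```

So the 8 in `[8x1022]` is simply the number of 128-thread blocks needed to cover one row of 1022 points, and only a small fraction of the launched threads has no work under this assumption.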
    

    CCE

    • ftn -rm -O3 -hacc task3.F90
    $ grep region task3.lst
      A data region was created at line 21 and ending at line 45.
      A data region was created at line 24 and ending at line 32.
      A data region was created at line 34 and ending at line 40.
    
    $ grep region task2.lst
      A data region was created at line 23 and ending at line 31.
      A data region was created at line 33 and ending at line 39.
    
    • export CRAY_ACC_DEBUG=2 ; aprun -n1 a.out
    Jacobi relaxation Calculation: 1024 x 1024 mesh
    ACC: Initialize CUDA
    ACC: Get Device 0
    ACC: Create Context
    ACC: Set Thread Context
    ACC: Start transfer 2 items from task3.F90:21
    ACC:       allocate, copy to acc 'a' (4194304 bytes)
    ACC:       allocate 'anew' (4194304 bytes)
    ACC: End transfer (to acc 4194304 bytes, to host 0 bytes)
    ACC: Start transfer 3 items from task3.F90:25
    ACC:       allocate reusable <internal> (4 bytes)
    ACC:       allocate reusable, copy to acc <internal> (4 bytes)
    ACC:       allocate reusable <internal> (4088 bytes)
    ACC: End transfer (to acc 4 bytes, to host 0 bytes)
    ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L25_3 blocks:1022 threads:128 async(auto) from task3.F90:25
    ACC: Wait async(auto) from task3.F90:31
    ACC: Start transfer 3 items from task3.F90:31
    ACC:       copy to host, done reusable <internal> (4 bytes)
    ACC:       done reusable <internal> (4 bytes)
    ACC:       done reusable <internal> (0 bytes)
    ACC: End transfer (to acc 0 bytes, to host 4 bytes)
    ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L35_5 blocks:1022 threads:128 async(auto) from task3.F90:35
    ACC: Start transfer 3 items from task3.F90:25
    ACC:       reusable acquired <internal> (4 bytes)
    ACC:       reusable acquired <internal> (4 bytes)
    ACC:       reusable acquired <internal> (4088 bytes)
    ACC: End transfer (to acc 0 bytes, to host 0 bytes)
         0   0.250000
    ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L25_3 blocks:1022 threads:128 async(auto) from task3.F90:25
    
    
    etc...
    
    
    ACC: Start transfer 3 items from task3.F90:31
    ACC:       copy to host, done reusable <internal> (4 bytes)
    ACC:       done reusable <internal> (4 bytes)
    ACC:       done reusable <internal> (0 bytes)
    ACC: End transfer (to acc 0 bytes, to host 4 bytes)
    ACC: Execute kernel jacobi_acc_kernels_datacopy_$ck_L35_5 blocks:1022 threads:128 async(auto) from task3.F90:35
    ACC: Wait async(auto) from task3.F90:45
    ACC: Start transfer 2 items from task3.F90:45
    ACC:       copy to host, free 'a' (4194304 bytes)
    ACC:       free 'anew' (4194304 bytes)
    ACC: End transfer (to acc 0 bytes, to host 4194304 bytes)
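For comparison with the !$acc kernels version, the array traffic in this log can be tallied the same way. A sketch using the byte counts printed by CRAY_ACC_DEBUG, assuming 1000 iterations and ignoring the 4-byte reduction scalar:

```python
array_bytes = 4194304     # size of 'a' and 'anew' as printed in the log
iterations = 1000

# task2 (!$acc kernels only): per iteration both arrays go to the GPU twice
# (4 array copies) and one array comes back twice (2 array copies).
task2_bytes = iterations * (4 * array_bytes + 2 * array_bytes)

# task3 (!$acc data region): 'a' goes in once at line 21
# and comes back once at line 45.
task3_bytes = 2 * array_bytes

ratio = task2_bytes / task3_bytes

print(f"task2: {task2_bytes} bytes")
print(f"task3: {task3_bytes} bytes")
print(f"ratio: {ratio:.0f}x less array traffic with the data region")
```

With the data region the arrays cross the bus twice in total instead of six times per iteration, which fits the drop from about 5 s to well under 1 s in the timings above.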
    