OpenACC / saxpy

Issue #2 new
jg piccinali repo owner created an issue

Get the src

ssh -Y daint01
git clone --single-branch -b openacc.saxpy https://github.com/eth-cscs/parallel-debuggers.git
cd parallel-debuggers/openacc.saxpy.git/

Compile

PGI

module swap PrgEnv-cray  PrgEnv-pgi
module swap craype/2.05 craype/2.2.0
module swap pgi /apps/daint/scorep/mf/pgi/1470
module swap cray-mpich/6.2.2 cray-mpich/7.0.3
module load  craype-accel-nvidia35
module rm    libsci_acc
See the PGI/14.7 setup in comment 1 below for the corresponding compile and run steps.

CCE

  • module load PrgEnv-cray            # cce/8.2.3
  • module load perftools-lite         # 6.2.0.12614
  • module load craype-accel-nvidia35  # cudatoolkit/5.5.20-1.0402.7700.8.1
  • make clean
  • make PERFFLAGS=-O3

Run (sample_profile)

  • salloc -N1
  • aprun -n1 ./CRAY.TODI 12
CrayPat/X:  Version 6.2.0.12614 Revision 12614  04/14/14 17:11:54
pat[WARNING][0]: 
Collection of accelerator performance data 
for sampling experiments is not supported.  
To collect accelerator performance data perform a trace experiment.  
See the intro_craypat(1) man page on how to perform a trace experiment.

Run (gpu)

  • export CRAYPAT_LITE=gpu
  • make clean
  • make PERFFLAGS=-O3
  • aprun -n1 ./CRAY.TODI 12
CrayPat/X:  Version 6.2.0.12614 Revision 12614  04/14/14 17:11:54
using MPI with 1 PEs, N=12
_OPENACC version:201306
c[0]=0
c[1]=101
c[N/2]=606
c[N-1]=1111

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

CrayPat/X:  Version 6.2.0.12614 Revision 12614 (xf 12504)  04/14/14 17:11:54
Experiment:                  lite  lite/gpu
Number of PEs (MPI ranks):      1
Numbers of PEs per Node:        1
Numbers of Threads per PE:      1
Number of Cores per Socket:    16
Execution start time:  Mon May 19 16:31:35 2014
System name and speed:  todi4 2100 MHz

Wall Clock Time: 0.077983 secs
High Memory:        39.04 MBytes

Table 1:  Accelerator Table by Function (top 10 functions shown)

   Host |  Host |   Acc | Acc Copy | Acc Copy | Events |Function=[max10]
  Time% |  Time |  Time |       In |      Out |        | PE=HIDE
        |       |       | (MBytes) | (MBytes) |        |  Thread=HIDE

 100.0% | 0.000 | 0.000 |    0.000 |    0.000 |      5 |Total
|------------------------------------------------------------------------------------------------------------------
|  46.7% | 0.000 | 0.000 |    0.000 |       -- |      1 |saxpy(int, double, double*, double*).ACC_COPY@li.69
|  25.8% | 0.000 | 0.000 |       -- |       -- |      1 |saxpy(int, double, double*, double*).ACC_ASYNC_KERNEL@li.69
|  17.6% | 0.000 | 0.000 |       -- |    0.000 |      1 |saxpy(int, double, double*, double*).ACC_COPY@li.70
|   8.5% | 0.000 |    -- |       -- |       -- |      1 |saxpy(int, double, double*, double*).ACC_SYNC_WAIT@li.70
|   1.4% | 0.000 |    -- |       -- |       -- |      1 |saxpy(int, double, double*, double*).ACC_REGION@li.69
|==================================================================================================================

Program invocation:  ./CRAY.TODI 12 

For a complete report with expanded tables and notes, run:
  pat_report /users/piccinal/pug.git/src/openacc.saxpy.git/CRAY.TODI+11588-3t.ap2

For help identifying callers of particular functions:
  pat_report -O callers+src /users/piccinal/pug.git/src/openacc.saxpy.git/CRAY.TODI+11588-3t.ap2
To see the entire call tree:
  pat_report -O calltree+src /users/piccinal/pug.git/src/openacc.saxpy.git/CRAY.TODI+11588-3t.ap2

For interactive, graphical performance analysis, run:
  app2 /users/piccinal/pug.git/src/openacc.saxpy.git/CRAY.TODI+11588-3t.ap2

================  End of CrayPat-lite output  ==========================
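
The accelerator table above attributes the host time to ACC_COPY, ACC_ASYNC_KERNEL, ACC_SYNC_WAIT and ACC_REGION events at source lines 69-70 of the saxpy kernel. For reference, a minimal OpenACC saxpy in C has roughly the shape sketched below; this is an illustration, not the repository source, so data clauses, initialization and line numbers may differ.

  /* Minimal OpenACC saxpy sketch (illustrative only).  The copyin/copy
   * clauses produce the ACC_COPY events, and the parallel loop produces the
   * ACC_REGION/ACC_ASYNC_KERNEL/ACC_SYNC_WAIT events shown in Table 1. */
  #include <stdio.h>
  #include <stdlib.h>

  void saxpy(int n, double a, double *x, double *c)
  {
      #pragma acc parallel loop copyin(x[0:n]) copy(c[0:n])
      for (int i = 0; i < n; ++i)
          c[i] = a * x[i] + c[i];
  }

  int main(int argc, char **argv)
  {
      int n = (argc > 1) ? atoi(argv[1]) : 12;
      double *x = malloc(n * sizeof *x);
      double *c = malloc(n * sizeof *c);
      /* Initial values chosen here so that c[i] ends up as 101*i, matching
       * the pattern printed in the run above; the real example may differ. */
      for (int i = 0; i < n; ++i) { x[i] = i; c[i] = i; }

      saxpy(n, 100.0, x, c);

      printf("c[0]=%g\nc[1]=%g\nc[N/2]=%g\nc[N-1]=%g\n",
             c[0], c[1], c[n / 2], c[n - 1]);
      free(x);
      free(c);
      return 0;
  }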

Comments (11)

  1. jg piccinali reporter

    PGI/14.7

    Setup

    module swap PrgEnv-cray PrgEnv-pgi
    module swap craype/2.05 craype/2.2.0
    module swap pgi /apps/daint/scorep/mf/pgi/1470
    module swap cray-mpich/6.2.2 cray-mpich/7.0.3
    module load cudatoolkit/5.5.20-1.0501.7945.8.2
    module load scorep/1.3
    module list
    

    Compile

    Fortran (ok)

    make OBJ=mpiacc_f.o        CC="scorep --mpp=mpi --cuda ftn"
    

    C (ok)

    make OBJ=mpiacc_c.o        CC="scorep --mpp=mpi --cuda cc"
    

    C++ (issue)

    make OBJ=mpiacc_cxx.o        CC="scorep --mpp=mpi --cuda CC"
    
    • scorep --mpp=mpi --cuda CC -g -acc -ta=nvidia:cc35 -mcmodel=medium -c mpiacc_c.cpp -o PGI_mpiacc_c.o
    • scorep --mpp=mpi --cuda CC -g -acc -ta=nvidia:cc35 -mcmodel=medium PGI_mpiacc_c.o -o PGI.DAINT
    using MPI with 1 PEs, N=12
    _OPENACC version:201111
    c[0]=5.02621e+180
    c[1]=1.78826e+161
    c[N/2]=1.07296e+162
    c[N-1]=1.96709e+162
    [Score-P] src/measurement/SCOREP_RuntimeManagement.c:566: Warning: If you are using MPICH1, please ignore this warning. 
    If not, it seems that your interprocess communication library (e.g., MPI) hasn't been initialized. Score-P can't generate output.
    Application 2780523 resources: utime ~0s, stime ~0s, Rss ~157008, inblocks ~2920, outblocks ~7235
    

    Run

    export SCOREP_ENABLE_PROFILING=false
    export SCOREP_ENABLE_TRACING=true
    export SCOREP_CUDA_ENABLE=yes,flushatexit
    export SCOREP_TOTAL_MEMORY=1G
    aprun -n1 -N1 -d1  PGI.DAINT     12
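
    For reference, the instrumented program is expected to have roughly the shape sketched below (illustrative C only, not the repository's mpiacc_cxx source; array names and initial values are invented). The relevant point for the Score-P warning above is that, with --mpp=mpi, Score-P cannot generate output unless MPI is actually initialized, so the measured run must reach MPI_Init and MPI_Finalize.

    /* Minimal MPI + OpenACC driver sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    void saxpy(int n, double a, double *x, double *c)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(c[0:n])
        for (int i = 0; i < n; ++i)
            c[i] = a * x[i] + c[i];
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);        /* Score-P's MPI adapter needs this (see warning above) */

        int npes = 1;
        MPI_Comm_size(MPI_COMM_WORLD, &npes);
        int n = (argc > 1) ? atoi(argv[1]) : 12;
        printf("using MPI with %d PEs, N=%d\n", npes, n);
    #ifdef _OPENACC
        printf("_OPENACC version:%d\n", _OPENACC);   /* source of the version line in the logs */
    #endif

        double *x = malloc(n * sizeof *x);
        double *c = malloc(n * sizeof *c);
        for (int i = 0; i < n; ++i) { x[i] = i; c[i] = i; }   /* illustrative initialization */

        saxpy(n, 100.0, x, c);

        printf("c[0]=%g\nc[N-1]=%g\n", c[0], c[n - 1]);
        free(x);
        free(c);

        MPI_Finalize();                /* end of the measured MPI run */
        return 0;
    }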
    