MAQAO:CQA

Issue #31
jg piccinali (repo owner) created an issue

matmul

Compile

  • module swap PrgEnv-cray PrgEnv-gnu
  • cd /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul
  • make clean; make OPTFLAGS="-O3 -g -dynamic" KERNEL=0 ; mv matmul matmul.0
  • make clean; make OPTFLAGS="-O3 -g -dynamic" KERNEL=1 ; mv matmul matmul.1
  • make clean; make OPTFLAGS="-O3 -g -dynamic -march=native" KERNEL=1 ; mv matmul matmul.1+
  • make clean; make OPTFLAGS="-O3 -g -dynamic -march=native" KERNEL=2 ; mv matmul matmul.2

KERNEL=1+

gcc -O3 -march=native -c -o kernel.o kernel.c     # note: kernel.c alone gets -O3 -march=native; the rest stays at -O2
gcc -O2 -c -o rdtsc.o rdtsc.c
gcc -O2 -D KERNEL=1 -c -o driver.o driver.c
gcc -O2 -o matmul kernel.o rdtsc.o driver.o

Run

SandyBridge

o_matmul.0    :cycles per FMA: 2.52
o_matmul.1    :cycles per FMA: 1.62
o_matmul.1arch:cycles per FMA: 0.64
o_matmul.2    :cycles per FMA: 0.55

Haswell

o_matmul.0    :cycles per FMA:  2.25
o_matmul.1    :cycles per FMA:  0.51
o_matmul.1arch:cycles per FMA:  0.46
o_matmul.2    :cycles per FMA:  0.31

Analyze

KERNEL=0

  • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
  • aprun -n1 maqao cqa ./matmul.0 100 1000 fct=kernel0
100 is a bad argument. Argument syntaxe : argument=value

1000 is a bad argument. Argument syntaxe : argument=value
(The trailing 100 and 1000 are arguments intended for matmul itself; MAQAO warns that it does not recognize them, but the CQA analysis proceeds.)

Section 1: Function: kernel0
============================

Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

Section 1.1: Binary loops in the function named kernel0
=======================================================

Section 1.1.1: Binary loop #2
=============================

The loop is defined in -1:-1--1
In the binary file, the address of the loop is: 400a08
Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level and you can prevent CQA from suggesting already used flags by adding -sox.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

2% of peak computational performance is used (0.67 out of 32.00 FLOP per cycle (1.73 GFLOPS @ 2.60GHz))

Vectorization status
--------------------
Your loop is not vectorized (all SSE/AVX instructions are used in scalar mode).
Only 12% of vector length is used.

Vectorization
-------------
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
By fully vectorizing your loop, you can lower the cost of an iteration from 3.00 to 0.38 cycles (8.00x speedup).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Two propositions:
 - Try another compiler or update/tune your current one:
 - Remove inter-iterations dependences from your loop and make it unit-stride.
  * If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly:
  * If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA):
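The AoS-versus-SoA advice can be illustrated with a minimal C sketch (the `point` structures and `scale_x_*` helpers here are hypothetical, not from the matmul source):

```c
#include <stddef.h>

/* Array of structures (AoS): x, y, z are interleaved, so a loop over x
   alone strides through memory and cannot be packed into full vectors. */
struct point_aos { float x, y, z; };

/* Structure of arrays (SoA): each field is contiguous and unit-stride. */
struct points_soa { float *x, *y, *z; };

/* With SoA the accesses are contiguous, so the compiler can vectorize
   the loop at full vector width. */
void scale_x_soa(struct points_soa *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= s;
}

/* With AoS consecutive x values are 12 bytes apart (strided access). */
void scale_x_aos(struct point_aos *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p[i].x *= s;
}
```

Both functions compute the same result; only the memory layout, and hence the vectorizability, differs.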

Bottlenecks
-----------
Detected a non usual bottleneck.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.

Data dependencies
-----------------
Performance is bounded by DATA DEPENDENCIES (frequent in reduction loops).
By removing most critical dependency chains, you can lower the cost of an iteration from 3.00 to 2.00 cycles (1.50x speedup).

 - Try another compiler or update/tune your current one:
 - Remove inter-iterations dependences from your loop.
 - If not possible, break them into several independent dependency chains (if not done by your compiler with appropriate flags). For example, for a
 '+' reduction, use partial sums.
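The partial-sums transformation can be sketched in C as follows (hypothetical `sum_*` helpers; note that reassociating floating-point additions changes rounding, which is why compilers only do this themselves under flags such as `-ffast-math`):

```c
#include <stddef.h>

/* Naive reduction: each addition depends on the previous one, so the
   loop is serialized by the latency of the add unit. */
float sum_naive(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums: the dependency chains can execute in
   parallel, hiding the add latency. */
float sum_partial(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```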

FMA
---
Presence of both ADD/SUB and MUL operations.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to 
enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).
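The two expression shapes can be made concrete in C (hypothetical helper names; compiling with `-O3 -march=native` as in the KERNEL=1+ build lets the compiler emit an FMA for the first form):

```c
/* a + b*c: the MUL feeds a dependent ADD, so the pair can be contracted
   into a single fused multiply-add on FMA-capable targets. */
float fusable(float a, float b, float c)
{
    return a + b * c;
}

/* (a + b)*c: the ADD feeds the MUL, so no contraction is possible. */
float not_fusable(float a, float b, float c)
{
    return (a + b) * c;
}
```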

All innermost loops were analyzed.

KERNEL=1

  • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
  • aprun -n1 maqao cqa ./matmul.1 100 1000 fct=kernel1 --confidence-levels=gain,potential,hint
100 is a bad argument. Argument syntaxe : argument=value
1000 is a bad argument. Argument syntaxe : argument=value

Section 1: Function: kernel1
============================

Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

Section 1.1: Binary loops in the function named kernel1
=======================================================

Section 1.1.1: Binary loop #8
=============================

The loop is defined in -1:-1--1
In the binary file, the address of the loop is: 400bab
Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level and you can prevent CQA from suggesting already used flags by adding -sox.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

14% of peak computational performance is used (4.57 out of 32.00 FLOP per cycle (11.89 GFLOPS @ 2.60GHz))

Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation).
By removing them, you can lower the cost of an iteration from 1.75 to 1.50 cycles (1.17x speedup).

Vectorization status
--------------------
Your loop is vectorized (all SSE/AVX instructions are used in vector mode) but on 50% vector length.


Vectorization
-------------
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
By fully vectorizing your loop, you can lower the cost of an iteration from 1.75 to 0.88 cycles (2.00x speedup).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Propositions:
 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries,
  2) inform your compiler that your arrays are vector aligned:
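One way to follow both steps with GCC (as used here) is C11 `aligned_alloc` plus the `__builtin_assume_aligned` built-in that CQA itself suggests in the KERNEL=1+ run; a sketch with hypothetical helper names:

```c
#include <stdlib.h>

/* Allocate a float array on a 32-byte boundary (AVX vector width).
   C11 aligned_alloc requires the size to be a multiple of the
   alignment, hence the round-up. */
float *alloc_avx(size_t n)
{
    return aligned_alloc(32, ((n * sizeof(float) + 31) / 32) * 32);
}

/* Telling GCC the pointer is 32-byte aligned lets it use aligned
   vector loads/stores (vmovaps rather than vmovups) and drop peel
   loops. */
void scale(float *restrict v, size_t n, float s)
{
    float *a = __builtin_assume_aligned(v, 32);  /* GCC/Clang built-in */
    for (size_t i = 0; i < n; i++)
        a[i] *= s;
}
```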


Bottlenecks
-----------
Front-end is a bottleneck.



FMA
---
Presence of both ADD/SUB and MUL operations.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


Vector unaligned load/store instructions
----------------------------------------
Detected 1 suboptimal vector unaligned load/store instructions.

MOVUPS: 1 occurrences

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries,
  2) inform your compiler that your arrays are vector aligned:


Type of elements and instruction set
------------------------------------
2 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (four at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 4: addition or subtraction
 - 4: multiply
The binary loop is loading 36 bytes (9 single precision FP elements).
The binary loop is storing 16 bytes (4 single precision FP elements).


Arithmetic intensity
--------------------
Arithmetic intensity is 0.15 FP operations per loaded or stored byte.
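The 0.15 figure follows directly from the counts just above: 8 FP operations against 36 + 16 = 52 bytes moved per iteration.

```c
/* Arithmetic intensity = FP operations / (bytes loaded + bytes stored). */
double arithmetic_intensity(double flops, double loaded, double stored)
{
    return flops / (loaded + stored);
}
```

For loop #8: 8 / 52 ≈ 0.154, which CQA reports as 0.15; for loop #5 further down, 2 / (12 + 4) = 0.125, reported as 0.12.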


Section 1.1.2: Binary loop #5
=============================

The loop is defined in -1:-1--1
In the binary file, the address of the loop is: 400c80
Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level and you can prevent CQA from suggesting already used flags by adding -sox.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

4% of peak computational performance is used (1.33 out of 32.00 FLOP per cycle (3.47 GFLOPS @ 2.60GHz))

Vectorization status
--------------------
Your loop is not vectorized (all SSE/AVX instructions are used in scalar mode).
Only 12% of vector length is used.


Vectorization
-------------
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
By fully vectorizing your loop, you can lower the cost of an iteration from 1.50 to 0.19 cycles (8.00x speedup).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Two propositions:
 - Try another compiler or update/tune your current one:
 - Remove inter-iterations dependences from your loop and make it unit-stride.
  * If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly:
  * If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA):


Bottlenecks
-----------
Front-end is a bottleneck.
Load units are a bottleneck.

Try to reduce the number of loads.
For example, provide more information to your compiler:
 - hardcode the bounds of the corresponding 'for' loop,


FMA
---
Presence of both ADD/SUB and MUL operations.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


Type of elements and instruction set
------------------------------------
2 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in scalar mode (one at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 2 FP arithmetical operations:
 - 1: addition or subtraction
 - 1: multiply
The binary loop is loading 12 bytes (3 single precision FP elements).
The binary loop is storing 4 bytes (1 single precision FP elements).


Arithmetic intensity
--------------------
Arithmetic intensity is 0.12 FP operations per loaded or stored byte.



All innermost loops were analyzed.

Comments (5)

  1. jg piccinali reporter
    • module load PrgEnv-cray
    • module load perftools/6.2.3
    • make CC=cc CFLAGS="-O3 -dynamic -hpl=reveal623jg.pl" KERNEL=0
    Attachment: Screen Shot 2015-05-26 at 20.41.44.png
  2. jg piccinali reporter

    KERNEL=1+

    • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
    • aprun -n 1 -N 1 -d 1 -j 1 maqao cqa ./matmul.1+ 100 1000 fct=kernel1 --confidence-levels=gain,potential,hint
    100 is a bad argument. Argument syntaxe : argument=value
    
    1000 is a bad argument. Argument syntaxe : argument=value
    
    Section 1: Function: kernel1
    ============================
    
    These loops are supposed to be defined in: /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c
    
    Section 1.1: Source loop ending at line 30
    ==========================================
    
    Composition and unrolling
    -------------------------
    It is composed of the following loops [ID (first-last source line)]:
     - 5 (29-30)
     - 8 (30-30)
    and is unrolled by 8 (including vectorization).
    
    The following loops are considered as:
     - unrolled and/or vectorized main: 8
     - peel or tail: 5
    The analysis will be displayed for the unrolled and/or vectorized loops: 8
    
    Section 1.1.1: Binary (unrolled and/or vectorized) loop #8
    ==========================================================
    
    The loop is defined in /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c:30-30
    In the binary file, the address of the loop is: 400c59
    22% of peak computational performance is used (7.11 out of 32.00 FLOP per cycle (18.49 GFLOPS @ 2.60GHz))
    
    Code clean check
    ----------------
    Detected a slowdown caused by scalar integer instructions (typically used for address computation).
    By removing them, you can lower the cost of an iteration from 2.25 to 2.00 cycles (1.12x speedup).
    
    Vectorization status
    --------------------
    Your loop is vectorized (all SSE/AVX instructions are used in vector mode) but on 75% vector length.
    
    
    Vectorization
    -------------
    Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
    By fully vectorizing your loop, you can lower the cost of an iteration from 2.25 to 1.97 cycles (1.14x speedup).
    Since your execution units are vector units, only a fully vectorized loop can use their full power.
    
    Propositions:
     - Use vector aligned instructions:
      1) align your arrays on 32 bytes boundaries,
      2) inform your compiler that your arrays are vector aligned:
       * use the __builtin_assume_aligned built-in.
    
    
    Bottlenecks
    -----------
    Front-end is a bottleneck.
    
    
    
    Complex instructions
    --------------------
    Detected COMPLEX INSTRUCTIONS.
    
    These instructions generate more than one micro-operation and only one of them can be decoded during a cycle and the extra micro-operations increase pressure on execution units.
    VINSERTF128: 1 occurrences
    
    
    
    Vector unaligned load/store instructions
    ----------------------------------------
    Detected 1 suboptimal vector unaligned load/store instructions.
    
    VINSERTF128: 1 occurrences
    
     - Use vector aligned instructions:
      1) align your arrays on 32 bytes boundaries,
      2) inform your compiler that your arrays are vector aligned:
       * use the __builtin_assume_aligned built-in.
    
    
    Type of elements and instruction set
    ------------------------------------
    1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
    
    
    Matching between your loop (in the source code) and the binary loop
    -------------------------------------------------------------------
    The binary loop is composed of 16 FP arithmetical operations:
     - 8: fused multiply-add
    The binary loop is loading 68 bytes (17 single precision FP elements).
    The binary loop is storing 32 bytes (8 single precision FP elements).
    
    
    Arithmetic intensity
    --------------------
    Arithmetic intensity is 0.16 FP operations per loaded or stored byte.
    
    
    
    All innermost loops were analyzed.
    
  3. jg piccinali reporter

    KERNEL=2

    • make clean; make OPTFLAGS="-O3 -g -march=native " KERNEL=2
    • mv matmul matmul.2
    • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
    • aprun -n1 maqao cqa ./matmul.2 104 1000 fct=kernel2 --confidence-levels=gain,potential,hint
    104 is a bad argument. Argument syntaxe : argument=value
    
    1000 is a bad argument. Argument syntaxe : argument=value
    
    Section 1: Function: kernel2
    ============================
    
    These loops are supposed to be defined in: /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c
    
    Section 1.1: Source loop ending at line 61
    ==========================================
    
    Composition and unrolling
    -------------------------
    It is composed of the following loops [ID (first-last source line)]:
     - 9 (60-61)
     - 12 (61-61)
    and is unrolled by 8 (including vectorization).
    
    The following loops are considered as:
     - unrolled and/or vectorized main: 12
     - peel or tail: 9
    The analysis will be displayed for the unrolled and/or vectorized loops: 12
    
    Section 1.1.1: Binary (unrolled and/or vectorized) loop #12
    ===========================================================
    
    The loop is defined in /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c:61-61
    In the binary file, the address of the loop is: 401190
    25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (20.80 GFLOPS @ 2.60GHz))
    
    Code clean check
    ----------------
    Detected a slowdown caused by scalar integer instructions (typically used for address computation).
    By removing them, you can lower the cost of an iteration from 2.00 to 1.75 cycles (1.14x speedup).
    
    Vectorization status
    --------------------
    Your loop is fully vectorized (all SSE/AVX instructions are used in vector mode and on full vector length).
    
    
    Bottlenecks
    -----------
    Front-end is a bottleneck.
    
    
    
    Complex instructions
    --------------------
    Detected COMPLEX INSTRUCTIONS.
    
    These instructions generate more than one micro-operation and only one of them can be decoded during a cycle and the extra micro-operations increase pressure on execution units.
    CMP: 1 occurrences
    
    
    
    Type of elements and instruction set
    ------------------------------------
    1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
    
    
    Matching between your loop (in the source code) and the binary loop
    -------------------------------------------------------------------
    The binary loop is composed of 16 FP arithmetical operations:
     - 8: fused multiply-add
    The binary loop is loading 68 bytes (17 single precision FP elements).
    The binary loop is storing 32 bytes (8 single precision FP elements).
    
    
    Arithmetic intensity
    --------------------
    Arithmetic intensity is 0.16 FP operations per loaded or stored byte.
    
    All innermost loops were analyzed.
    