MAQAO:CQA

Issue #31
jg piccinali (repo owner) created an issue

matmul

Compile

  • module swap PrgEnv-cray PrgEnv-gnu
  • cd /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul
  • make clean; make OPTFLAGS="-O3 -g -dynamic" KERNEL=0 ; mv matmul matmul.0
  • make clean; make OPTFLAGS="-O3 -g -dynamic" KERNEL=1 ; mv matmul matmul.1
  • make clean; make OPTFLAGS="-O3 -g -dynamic -march=native" KERNEL=1 ; mv matmul matmul.1+
  • make clean; make OPTFLAGS="-O3 -g -dynamic -march=native" KERNEL=2 ; mv matmul matmul.2

KERNEL=1+

gcc -O3 -march=native -c -o kernel.o kernel.c     # note: kernel.c alone gets -O3 -march=native; the rest stays at -O2
gcc -O2 -c -o rdtsc.o rdtsc.c
gcc -O2 -D KERNEL=1 -c -o driver.o driver.c
gcc -O2 -o matmul kernel.o rdtsc.o driver.o

Run

SandyBridge

o_matmul.0    :cycles per FMA: 2.52
o_matmul.1    :cycles per FMA: 1.62
o_matmul.1arch:cycles per FMA: 0.64
o_matmul.2    :cycles per FMA: 0.55

Haswell

o_matmul.0    :cycles per FMA:  2.25
o_matmul.1    :cycles per FMA:  0.51
o_matmul.1arch:cycles per FMA:  0.46
o_matmul.2    :cycles per FMA:  0.31

Analyze

KERNEL=0

  • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
  • aprun -n1 maqao cqa ./matmul.0 100 1000 fct=kernel0
100 is a bad argument. Argument syntaxe : argument=value

1000 is a bad argument. Argument syntaxe : argument=value
(The trailing 100 and 1000 are arguments intended for matmul itself; MAQAO warns that it does not recognize them, but the CQA analysis proceeds.)

Section 1: Function: kernel0
============================

Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

Section 1.1: Binary loops in the function named kernel0
=======================================================

Section 1.1.1: Binary loop #2
=============================

The loop is defined in -1:-1--1
In the binary file, the address of the loop is: 400a08
Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level and you can prevent CQA from suggesting already used flags by adding -sox.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

2% of peak computational performance is used (0.67 out of 32.00 FLOP per cycle (1.73 GFLOPS @ 2.60GHz))

Vectorization status
--------------------
Your loop is not vectorized (all SSE/AVX instructions are used in scalar mode).
Only 12% of vector length is used.

Vectorization
-------------
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
By fully vectorizing your loop, you can lower the cost of an iteration from 3.00 to 0.38 cycles (8.00x speedup).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Two propositions:
 - Try another compiler or update/tune your current one:
 - Remove inter-iterations dependences from your loop and make it unit-stride.
  * If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly:
  * If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA):
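The AoS-versus-SoA advice can be illustrated with a minimal C sketch (the `point` structures and `scale_x_*` helpers here are hypothetical, not from the matmul source):

```c
#include <stddef.h>

/* Array of structures (AoS): x, y, z are interleaved, so a loop over x
   alone strides through memory and cannot be packed into full vectors. */
struct point_aos { float x, y, z; };

/* Structure of arrays (SoA): each field is contiguous and unit-stride. */
struct points_soa { float *x, *y, *z; };

/* With SoA the accesses are contiguous, so the compiler can vectorize
   the loop at full vector width. */
void scale_x_soa(struct points_soa *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= s;
}

/* With AoS consecutive x values are 12 bytes apart (strided access). */
void scale_x_aos(struct point_aos *p, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        p[i].x *= s;
}
```

Both functions compute the same result; only the memory layout, and hence the vectorizability, differs.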

Bottlenecks
-----------
Detected a non usual bottleneck.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.

Data dependencies
-----------------
Performance is bounded by DATA DEPENDENCIES (frequent in reduction loops).
By removing most critical dependency chains, you can lower the cost of an iteration from 3.00 to 2.00 cycles (1.50x speedup).

 - Try another compiler or update/tune your current one:
 - Remove inter-iterations dependences from your loop.
 - If not possible, break them into several independent dependency chains (if not done by your compiler with appropriate flags). For example, for a
 '+' reduction, use partial sums.
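The partial-sums transformation can be sketched in C as follows (hypothetical `sum_*` helpers; note that reassociating floating-point additions changes rounding, which is why compilers only do this themselves under flags such as `-ffast-math`):

```c
#include <stddef.h>

/* Naive reduction: each addition depends on the previous one, so the
   loop is serialized by the latency of the add unit. */
float sum_naive(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums: the dependency chains can execute in
   parallel, hiding the add latency. */
float sum_partial(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```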

FMA
---
Presence of both ADD/SUB and MUL operations.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to 
enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).
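The two expression shapes can be made concrete in C (hypothetical helper names; compiling with `-O3 -march=native` as in the KERNEL=1+ build lets the compiler emit an FMA for the first form):

```c
/* a + b*c: the MUL feeds a dependent ADD, so the pair can be contracted
   into a single fused multiply-add on FMA-capable targets. */
float fusable(float a, float b, float c)
{
    return a + b * c;
}

/* (a + b)*c: the ADD feeds the MUL, so no contraction is possible. */
float not_fusable(float a, float b, float c)
{
    return (a + b) * c;
}
```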

All innermost loops were analyzed.

KERNEL=1

  • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
  • aprun -n1 maqao cqa ./matmul.1 100 1000 fct=kernel1 --confidence-levels=gain,potential,hint
100 is a bad argument. Argument syntaxe : argument=value
1000 is a bad argument. Argument syntaxe : argument=value

Section 1: Function: kernel1
============================

Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

Section 1.1: Binary loops in the function named kernel1
=======================================================

Section 1.1.1: Binary loop #8
=============================

The loop is defined in -1:-1--1
In the binary file, the address of the loop is: 400bab
Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level and you can prevent CQA from suggesting already used flags by adding -sox.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

14% of peak computational performance is used (4.57 out of 32.00 FLOP per cycle (11.89 GFLOPS @ 2.60GHz))

Code clean check
----------------
Detected a slowdown caused by scalar integer instructions (typically used for address computation).
By removing them, you can lower the cost of an iteration from 1.75 to 1.50 cycles (1.17x speedup).

Vectorization status
--------------------
Your loop is vectorized (all SSE/AVX instructions are used in vector mode) but on 50% vector length.


Vectorization
-------------
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
By fully vectorizing your loop, you can lower the cost of an iteration from 1.75 to 0.88 cycles (2.00x speedup).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Propositions:
 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries,
  2) inform your compiler that your arrays are vector aligned:
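One way to follow both steps with GCC (as used here) is C11 `aligned_alloc` plus the `__builtin_assume_aligned` built-in that CQA itself suggests in the KERNEL=1+ run; a sketch with hypothetical helper names:

```c
#include <stdlib.h>

/* Allocate a float array on a 32-byte boundary (AVX vector width).
   C11 aligned_alloc requires the size to be a multiple of the
   alignment, hence the round-up. */
float *alloc_avx(size_t n)
{
    return aligned_alloc(32, ((n * sizeof(float) + 31) / 32) * 32);
}

/* Telling GCC the pointer is 32-byte aligned lets it use aligned
   vector loads/stores (vmovaps rather than vmovups) and drop peel
   loops. */
void scale(float *restrict v, size_t n, float s)
{
    float *a = __builtin_assume_aligned(v, 32);  /* GCC/Clang built-in */
    for (size_t i = 0; i < n; i++)
        a[i] *= s;
}
```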


Bottlenecks
-----------
Front-end is a bottleneck.



FMA
---
Presence of both ADD/SUB and MUL operations.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


Vector unaligned load/store instructions
----------------------------------------
Detected 1 suboptimal vector unaligned load/store instructions.

MOVUPS: 1 occurrences

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Use vector aligned instructions:
  1) align your arrays on 32 bytes boundaries,
  2) inform your compiler that your arrays are vector aligned:


Type of elements and instruction set
------------------------------------
2 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (four at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 8 FP arithmetical operations:
 - 4: addition or subtraction
 - 4: multiply
The binary loop is loading 36 bytes (9 single precision FP elements).
The binary loop is storing 16 bytes (4 single precision FP elements).


Arithmetic intensity
--------------------
Arithmetic intensity is 0.15 FP operations per loaded or stored byte.
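The 0.15 figure follows directly from the counts just above: 8 FP operations against 36 + 16 = 52 bytes moved per iteration.

```c
/* Arithmetic intensity = FP operations / (bytes loaded + bytes stored). */
double arithmetic_intensity(double flops, double loaded, double stored)
{
    return flops / (loaded + stored);
}
```

For loop #8: 8 / 52 ≈ 0.154, which CQA reports as 0.15; for loop #5 further down, 2 / (12 + 4) = 0.125, reported as 0.12.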


Section 1.1.2: Binary loop #5
=============================

The loop is defined in -1:-1--1
In the binary file, the address of the loop is: 400c80
Found no debug data for this function.
With GNU or Intel compilers, please recompile with -g.
With an Intel compiler you must explicitly specify an optimization level and you can prevent CQA from suggesting already used flags by adding -sox.
Alternatively, try to:
 - recompile with -debug noinline-debug-info (if using Intel compiler 13)
 - analyze the caller function (possible inlining)

4% of peak computational performance is used (1.33 out of 32.00 FLOP per cycle (3.47 GFLOPS @ 2.60GHz))

Vectorization status
--------------------
Your loop is not vectorized (all SSE/AVX instructions are used in scalar mode).
Only 12% of vector length is used.


Vectorization
-------------
Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
By fully vectorizing your loop, you can lower the cost of an iteration from 1.50 to 0.19 cycles (8.00x speedup).
Since your execution units are vector units, only a fully vectorized loop can use their full power.

Two propositions:
 - Try another compiler or update/tune your current one:
 - Remove inter-iterations dependences from your loop and make it unit-stride.
  * If your arrays have 2 or more dimensions, check whether elements are accessed contiguously and, otherwise, try to permute loops accordingly:
  * If your loop streams arrays of structures (AoS), try to use structures of arrays instead (SoA):


Bottlenecks
-----------
Front-end is a bottleneck.
Load units are a bottleneck.

Try to reduce the number of loads.
For example, provide more information to your compiler:
 - hardcode the bounds of the corresponding 'for' loop,


FMA
---
Presence of both ADD/SUB and MUL operations.

 - Pass to your compiler a micro-architecture specialization option:
  * Please read your compiler manual.
 - Try to change order in which elements are evaluated (using parentheses) in arithmetic expressions containing both ADD/SUB and MUL operations to enable your compiler to generate FMA instructions wherever possible.
For instance a + b*c is a valid FMA (MUL then ADD).
However (a+b)* c cannot be translated into an FMA (ADD then MUL).


Type of elements and instruction set
------------------------------------
2 SSE or AVX instructions are processing arithmetic or math operations on single precision FP elements in scalar mode (one at a time).


Matching between your loop (in the source code) and the binary loop
-------------------------------------------------------------------
The binary loop is composed of 2 FP arithmetical operations:
 - 1: addition or subtraction
 - 1: multiply
The binary loop is loading 12 bytes (3 single precision FP elements).
The binary loop is storing 4 bytes (1 single precision FP elements).


Arithmetic intensity
--------------------
Arithmetic intensity is 0.12 FP operations per loaded or stored byte.



All innermost loops were analyzed.

Comments (5)

  1. jg piccinali reporter
    • module load PrgEnv-cray
    • module load perftools/6.2.3
    • make CC=cc CFLAGS="-O3 -dynamic -hpl=reveal623jg.pl" KERNEL=0
    Attachment: Screen Shot 2015-05-26 at 20.41.44.png
  2. jg piccinali reporter

    KERNEL=1+

    • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
    • aprun -n 1 -N 1 -d 1 -j 1 maqao cqa ./matmul.1+ 100 1000 fct=kernel1 --confidence-levels=gain,potential,hint
    100 is a bad argument. Argument syntaxe : argument=value
    
    1000 is a bad argument. Argument syntaxe : argument=value
    
    Section 1: Function: kernel1
    ============================
    
    These loops are supposed to be defined in: /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c
    
    Section 1.1: Source loop ending at line 30
    ==========================================
    
    Composition and unrolling
    -------------------------
    It is composed of the following loops [ID (first-last source line)]:
     - 5 (29-30)
     - 8 (30-30)
    and is unrolled by 8 (including vectorization).
    
    The following loops are considered as:
     - unrolled and/or vectorized main: 8
     - peel or tail: 5
    The analysis will be displayed for the unrolled and/or vectorized loops: 8
    
    Section 1.1.1: Binary (unrolled and/or vectorized) loop #8
    ==========================================================
    
    The loop is defined in /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c:30-30
    In the binary file, the address of the loop is: 400c59
    22% of peak computational performance is used (7.11 out of 32.00 FLOP per cycle (18.49 GFLOPS @ 2.60GHz))
    
    Code clean check
    ----------------
    Detected a slowdown caused by scalar integer instructions (typically used for address computation).
    By removing them, you can lower the cost of an iteration from 2.25 to 2.00 cycles (1.12x speedup).
    
    Vectorization status
    --------------------
    Your loop is vectorized (all SSE/AVX instructions are used in vector mode) but on 75% vector length.
    
    
    Vectorization
    -------------
    Your loop is processing FP elements but is NOT OR PARTIALLY VECTORIZED and could benefit from full vectorization.
    By fully vectorizing your loop, you can lower the cost of an iteration from 2.25 to 1.97 cycles (1.14x speedup).
    Since your execution units are vector units, only a fully vectorized loop can use their full power.
    
    Propositions:
     - Use vector aligned instructions:
      1) align your arrays on 32 bytes boundaries,
      2) inform your compiler that your arrays are vector aligned:
       * use the __builtin_assume_aligned built-in.
    
    
    Bottlenecks
    -----------
    Front-end is a bottleneck.
    
    
    
    Complex instructions
    --------------------
    Detected COMPLEX INSTRUCTIONS.
    
    These instructions generate more than one micro-operation and only one of them can be decoded during a cycle and the extra micro-operations increase pressure on execution units.
    VINSERTF128: 1 occurrences
    
    
    
    Vector unaligned load/store instructions
    ----------------------------------------
    Detected 1 suboptimal vector unaligned load/store instructions.
    
    VINSERTF128: 1 occurrences
    
     - Use vector aligned instructions:
      1) align your arrays on 32 bytes boundaries,
      2) inform your compiler that your arrays are vector aligned:
       * use the __builtin_assume_aligned built-in.
    
    
    Type of elements and instruction set
    ------------------------------------
    1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
    
    
    Matching between your loop (in the source code) and the binary loop
    -------------------------------------------------------------------
    The binary loop is composed of 16 FP arithmetical operations:
     - 8: fused multiply-add
    The binary loop is loading 68 bytes (17 single precision FP elements).
    The binary loop is storing 32 bytes (8 single precision FP elements).
    
    
    Arithmetic intensity
    --------------------
    Arithmetic intensity is 0.16 FP operations per loaded or stored byte.
    
    
    
    All innermost loops were analyzed.
    
  3. jg piccinali reporter

    KERNEL=2

    • make clean; make OPTFLAGS="-O3 -g -march=native " KERNEL=2
    • mv matmul matmul.2
    • export PATH=$PATH:/apps/daint/5.2.UP02/maqao/2.1.1/bin
    • aprun -n1 maqao cqa ./matmul.2 104 1000 fct=kernel2 --confidence-levels=gain,potential,hint
    104 is a bad argument. Argument syntaxe : argument=value
    
    1000 is a bad argument. Argument syntaxe : argument=value
    
    Section 1: Function: kernel2
    ============================
    
    These loops are supposed to be defined in: /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c
    
    Section 1.1: Source loop ending at line 61
    ==========================================
    
    Composition and unrolling
    -------------------------
    It is composed of the following loops [ID (first-last source line)]:
     - 9 (60-61)
     - 12 (61-61)
    and is unrolled by 8 (including vectorization).
    
    The following loops are considered as:
     - unrolled and/or vectorized main: 12
     - peel or tail: 9
    The analysis will be displayed for the unrolled and/or vectorized loops: 12
    
    Section 1.1.1: Binary (unrolled and/or vectorized) loop #12
    ===========================================================
    
    The loop is defined in /apps/daint/5.2.UP02/maqao/VIHPS/MAQAOH/CQA/matmul/K0/kernel.c:61-61
    In the binary file, the address of the loop is: 401190
    25% of peak computational performance is used (8.00 out of 32.00 FLOP per cycle (20.80 GFLOPS @ 2.60GHz))
    
    Code clean check
    ----------------
    Detected a slowdown caused by scalar integer instructions (typically used for address computation).
    By removing them, you can lower the cost of an iteration from 2.00 to 1.75 cycles (1.14x speedup).
    
    Vectorization status
    --------------------
    Your loop is fully vectorized (all SSE/AVX instructions are used in vector mode and on full vector length).
    
    
    Bottlenecks
    -----------
    Front-end is a bottleneck.
    
    
    
    Complex instructions
    --------------------
    Detected COMPLEX INSTRUCTIONS.
    
    These instructions generate more than one micro-operation and only one of them can be decoded during a cycle and the extra micro-operations increase pressure on execution units.
    CMP: 1 occurrences
    
    
    
    Type of elements and instruction set
    ------------------------------------
    1 AVX instructions are processing arithmetic or math operations on single precision FP elements in vector mode (eight at a time).
    
    
    Matching between your loop (in the source code) and the binary loop
    -------------------------------------------------------------------
    The binary loop is composed of 16 FP arithmetical operations:
     - 8: fused multiply-add
    The binary loop is loading 68 bytes (17 single precision FP elements).
    The binary loop is storing 32 bytes (8 single precision FP elements).
    
    
    Arithmetic intensity
    --------------------
    Arithmetic intensity is 0.16 FP operations per loaded or stored byte.
    
    All innermost loops were analyzed.
    