Performance Report

General

  • compare
    • PLASMA
    • LAPACK with vendor BLAS
    • full vendor implementation
  • observations
    • LU on KNL turned out to work really well for tile sizes that are multiples of 56 (224, 336, 448) and 80 (400, 480, 560). The final top performance numbers were achieved with 20 threads for panel factorization and the following nb/ib pairs: 224/40, 336/40, 360/40, 448/40, 560/40. See the run sketch below.
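A minimal run sketch, assuming the test driver and the --dim/--nb/--ib/--mtpf flag spellings used elsewhere on this page; the matrix size is an arbitrary example:

    # LU on KNL: nb/ib = 224/40, 20 panel-factorization threads
    OMP_NUM_THREADS=68 ./test/test dgetrf --dim=10000 --nb=224 --ib=40 --mtpf=20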

Intel Xeon E5-2650 v3 aka Haswell

  • Set up GCC, MKL, and GNU/Intel OpenMP the same way as on the KNL platform described in the section below.
  • NUMA effects can be minimized with numactl --interleave=all ...
  • Compilers and additional software are installed in /sw and are available via the modules system. To view all available modules, execute module avail. To enable a module, for example GCC 7.1.0, run module load gcc/7.1.0.
  • Use Slurm to launch performance experiments. It is a two-step process (a full session sketch follows this list). First you need to allocate compute node(s):

    salloc -N1 -wa00 --time="12:00:00"

  • Then you may launch your experiment:

    srun -N1 -wa00 --time="12:00:00" test/test_dgemm

  • Type squeue to view the Slurm queue and your job IDs. Type scancel <job_id> to cancel your resource allocation.

  • Additional information is available via man saturn
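A session sketch that ties the steps above together; the node name a00, module version, and test binary are the examples from this page, and wrapping the binary in numactl is one way to apply the interleaving advice under srun:

    module load gcc/7.1.0
    salloc -N1 -wa00 --time="12:00:00"
    srun -N1 -wa00 --time="12:00:00" numactl --interleave=all test/test_dgemm
    squeue                  # find your job ID
    scancel <job_id>        # release the allocation when done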


Intel Xeon Phi Knights Landing

  • MKL 17.2
    • source /opt/intel_mkl2017/compilers_and_libraries/linux/bin/compilervars.sh intel64
    • You can put it in your .bashrc.
  • GCC 7
    • CC = /opt/local/gcc-7-20170319/bin/gcc
    • You can put it in your make.inc.
  • GOMP
    • By default, GCC (even the custom one) links against the default system GOMP in /lib64.
    • To change that, you need to tweak LDFLAGS to preempt the linker's standard behavior. Add this at the end of your LDFLAGS in make.inc: -L/opt/local/gcc-7-20170319/lib64 -Wl,-rpath,/opt/local/gcc-7-20170319/lib64
    • To make sure that the right GOMP was linked into your binaries, use ldd lib/lib*.so test/test | grep gomp. You should see a path to your custom GOMP, not /lib64/libgomp.so.
    • GOMP from GCC 6.3 gives about 405 Gflop/s for DPOTRF(), while GOMP from GCC 7 reaches about 415 Gflop/s for a matrix of size 5000.
  • MCDRAM
    • numactl -m 1 ... (bind memory allocations to NUMA node 1, which is the MCDRAM in flat mode)
  • OpenMP environment:

    export OMP_NUM_THREADS=68
    export OMP_PROC_BIND=true
    export OMP_MAX_TASK_PRIORITY=100

The last variable only matters for LU. A combined run sketch for KNL follows.
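Putting the pieces above together (all paths and settings are the ones listed in this section; the test binary is the one used on Haswell):

    # compiler and MKL environment
    source /opt/intel_mkl2017/compilers_and_libraries/linux/bin/compilervars.sh intel64
    # verify that the custom GOMP was linked, not the system one in /lib64
    ldd lib/lib*.so test/test | grep gomp
    # OpenMP settings, then run with memory bound to MCDRAM
    export OMP_NUM_THREADS=68
    export OMP_PROC_BIND=true
    export OMP_MAX_TASK_PRIORITY=100
    numactl -m 1 test/test_dgemm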


ARMv8

Building PLASMA on ARM

  • OpenMP environment:

    export OMP_NUM_THREADS=96
    export OMP_PROC_BIND=true
    export OMP_MAX_TASK_PRIORITY=100
    export SCI_OPT_GEMM=1

For LU, --nb=128, --ib=64, --mtpf=36 seems to be a good combination. However, Cray LibSci has a bug in its TRSM routine: too many concurrent calls will segfault. Workaround: use --nb=256 for matrix sizes 7000-10000, and even larger tile sizes for larger matrices (see the sketch below).
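A hedged example of the workaround; the driver invocation is assumed from the rest of this page, and the matrix sizes are illustrative:

    # default tuning for moderate sizes
    ./test/test dgetrf --dim=5000 --nb=128 --ib=64 --mtpf=36
    # larger nb avoids the LibSci TRSM segfault in the 7000-10000 range
    ./test/test dgetrf --dim=8000 --nb=256 --ib=64 --mtpf=36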


POWER8

Building PLASMA on Power8

For ESSL, use OMP_PROC_BIND=true. For GEMM, this means going from 94 Gflop/s to 477 Gflop/s.

For PLASMA, OMP_PROC_BIND=false may be better (220 Gflop/s versus 49 Gflop/s with binding on). Why? The traces suggest that the 20 OpenMP threads are pinned to hardware threads 0-19, which correspond to cores 0, 1, and 2 only (8-way hyperthreading). To fix this mismatch, set the environment variable:

    OMP_PLACES="{0}:20:8"

which will map the 20 threads to 20 distinct cores. PLASMA DGEMM() is then comparable to ESSL; a full environment sketch follows.
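A sketch of the resulting environment, assuming 20 OpenMP threads so that each thread lands on its own core, and binding left enabled so the places list takes effect; the test binary is the one used elsewhere on this page:

    # one place per core: start at hardware thread 0, 20 places, stride 8
    export OMP_NUM_THREADS=20
    export OMP_PROC_BIND=true
    export OMP_PLACES="{0}:20:8"
    test/test_dgemm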


Schedule

    routines           Haswell     Knights Landing   ARMv8      POWER8
    parallel BLAS      Mawussi     Mawussi
    parallel norms     Negin       Negin
    GESV               Jakub K.    Mawussi           Jakub K.
    POSV               Piotr       Mawussi           Piotr      Piotr
    SYSV               Ichi        Ichi              Ichi
    GBSV, PBSV         David       David
    mixed precision    Maksims     Maksims           Maksims
    matrix inversion   Sam         Sam
    least squares      Jakub S.    Mawussi
