- LAPACK with vendor BLAS
- full vendor implementation
- LU on KNL performs best with tile sizes that are multiples of 56 (224, 336, 448) or 80 (400, 480, 560). The top performance numbers were achieved with 20 threads for panel factorization and the following nb/ib pairs: 224/40, 336/40, 360/40, 448/40, 560/40.
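One of the pairs above could be passed to the tester like this (a sketch only: the `test_dgetrf` driver name and the `--dim` flag are assumptions, while `--nb`, `--ib`, and `--mtpf` are flag names that appear later in these notes):

```shell
# Hypothetical invocation: LU factorization with nb=224, ib=40,
# 20 panel-factorization threads, memory bound to MCDRAM (NUMA node 1).
numactl -m 1 ./test/test_dgetrf --dim=10000 --nb=224 --ib=40 --mtpf=20
```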
Intel Xeon E5-2650 v3 aka Haswell
- Set up GCC, MKL, and the GNU and Intel OpenMP runtimes the same way as on the KNL platform described in the section below
- NUMA effects can be minimized with
numactl --interleave=all ...
- Compilers and additional software are installed in `/sw` and are available via the `modules` system. To view all available modules, execute `module avail`. To enable a module, for example GCC 7.1.0, run `module load gcc/7.1.0`.
Use Slurm to launch performance experiments. It is a two-step process. First, allocate compute node(s):
salloc -N1 -wa00 --time="12:00:00"
Then you may launch your experiment:
srun -N1 -wa00 --time="12:00:00" test/test_dgemm
Use `squeue` to view the Slurm queue and compute job IDs. Type `scancel <job_id>` to cancel your resource allocation.
Additional information is available via
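The two interactive steps can also be combined in a single batch script (a sketch; the node name, time limit, and tester path are taken from the examples above):

```shell
#!/bin/bash
#SBATCH -N1 -w a00 --time=12:00:00
# Launch the experiment on the allocated node.
srun test/test_dgemm
```

Submit it with `sbatch`, then track it with `squeue` as described above.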
Intel Xeon Phi Knights Landing
- MKL 17.2
source /opt/intel_mkl2017/compilers_and_libraries/linux/bin/compilervars.sh intel64
- You can put it in your .bashrc.
- GCC 7
CC = /opt/local/gcc-7-20170319/bin/gcc
- You can put it in your make.inc.
- By default, GCC (even the custom build) links against the default system GOMP in /lib64
- To change that, you need to tweak LDFLAGS to preempt the linker's standard search behavior. Add this at the end of your
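A typical way to preempt the system libgomp is to put the custom GCC's library directory first on the link line and bake it into the rpath. This is a sketch of a make.inc fragment; the lib64 location is an assumption based on the GCC 7 install prefix mentioned above:

```make
# Hypothetical make.inc fragment: search the custom GCC's libgomp first,
# both at link time (-L) and at run time (-Wl,-rpath).
GCC7_LIBDIR = /opt/local/gcc-7-20170319/lib64
LDFLAGS += -L$(GCC7_LIBDIR) -Wl,-rpath,$(GCC7_LIBDIR)
```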
- To make sure that the right GOMP was linked into your binaries, use `ldd lib/lib*.so test/test | grep gomp`. You should not see `/lib64/libgomp.so` but a path to your custom GOMP.
- GOMP from GCC 6.3 gives about 405 Gflop/s for `DPOTRF()`, while GOMP from GCC 7 reaches about 415 Gflop/s for a matrix of size 5000
- Bind memory to MCDRAM (NUMA node 1 in flat mode):
numactl -m 1 ...
- OpenMP environment:
export OMP_NUM_THREADS=68
export OMP_PROC_BIND=true
export OMP_MAX_TASK_PRIORITY=100
The last one only matters for LU.
Building PLASMA on ARM1
- OpenMP environment:
export OMP_NUM_THREADS=96
export OMP_PROC_BIND=true
export OMP_MAX_TASK_PRIORITY=100
export SCI_OPT_GEMM=1
For LU, --nb=128, --ib=64, --mtpf=36 seems to be a good combination. However, Cray LibSci has a bug in its TRSM routine: too many concurrent calls will segfault. Workaround: use --nb=256 for matrix sizes 7000-10000, and even larger tiles for larger matrices.
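The workaround can be captured in a small helper (a sketch; the 512 value for very large matrices is an assumption, since the notes only say "even larger"):

```shell
# pick_nb SIZE: choose --nb to dodge the LibSci TRSM concurrency bug.
pick_nb() {
  local n=$1
  if   [ "$n" -lt 7000 ];  then echo 128   # default good combination
  elif [ "$n" -le 10000 ]; then echo 256   # per the workaround above
  else                          echo 512   # hypothetical larger tile
  fi
}
```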
For ESSL, use `OMP_PROC_BIND=true`. For GEMM, this means 94 GFLOPS -> 477 GFLOPS.
For PLASMA, perhaps `OMP_PROC_BIND=false` is better: with binding enabled, performance drops from 220 GFLOPS to 49 GFLOPS. (Why? Trace?)
The traces suggest that the 20 OpenMP threads are pinned to hardware threads 0-19, which correspond to cores 0, 1, and 2 only (8-way SMT). To fix this mismatch, add an environment variable:
which will map the 20 threads to 20 distinct cores. PLASMA DGEMM() then becomes comparable to ESSL.
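The variable itself is missing from these notes. Offered here purely as an assumption, the standard way to get the described one-thread-per-core mapping is:

```shell
# Assumption: OMP_PLACES=cores gives each of the 20 threads its own core
# instead of packing them onto the SMT siblings of cores 0-2.
export OMP_PLACES=cores
```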
| GESV | Jakub K. | Mawussi | Jakub K. |
| least squares | Jakub S. | Mawussi | |