- changed status to wontfix
Unexpected Threshold Behaviour
Hi Klaus,
I tried the following simple matrix-matrix product program. It compares the time needed for a direct call to the MKL dgemm
routine with the Blaze call. Compiled with icpc -I./blaze/ -mkl=sequential -std=c++14 -O3 -xMIC-AVX512 blaze_mkl.cpp
and the repository version of Blaze.
I did expect to see two things:
- Blaze does call MKL for large matrices as soon as the threshold is hit.
- Blaze is faster for smaller matrices.
But in this case (tested on Intel Xeon Phi Knights Landing and an i5) the MKL version is faster (at least two times), and Blaze seems to ignore my threshold (I just set it to 1 to be sure), but calls MKL for larger matrices (here if N>67
). The calls made to MKL were checked with MKL_VERBOSE=1 ./a.out
.
I fear I did make a stupid mistake somewhere, but I have been trying this for a while now. A concrete feature request would be a flow chart for the decision process when what is called.
Thanks for your time, and kind regards
Jannik
#include <mkl.h>
#define BLAZE_BLAS_MODE 1
#define BLAZE_BLAS_INCLUDE_FILE <mkl_cblas.h>
#define BLAZE_BLAS_IS_PARALLEL 0
#define BLAZE_USE_SHARED_MEMORY_PARALLELIZATION 0
#define BLAZE_TDMATDMATMULT_THRESHOLD 1
#define BLAZE_DMATDMATMULT_THRESHOLD 1
#define BLAZE_TDMATTDMATMULT_THRESHOLD 1
#define BLAZE_TDMATDMATMULT_THRESHOLD 1
// Cache size of Knights Landing
#define BLAZE_CACHE_SIZE 1024 * 1024
#include <blaze/Blaze.h>
constexpr size_t N = 64; // matrix size
constexpr size_t RUNS = 500; // how often called
using namespace blaze;
int main() {
constexpr size_t element_size = sizeof(double);
constexpr size_t lead_dim = (((N * element_size + 511) / 512) * 512 + 64) / element_size;
constexpr size_t alignment = 4096;
double* A =
(double*)_mm_malloc(N * lead_dim * sizeof(double), alignment); // row major
double* B =
(double*)_mm_malloc(N * lead_dim * sizeof(double), alignment); // column major
double* C =
(double*)_mm_malloc(N * lead_dim * sizeof(double), alignment); // column major
// Random initialization
VSLStreamStatePtr stream;
vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
vslDeleteStream(&stream);
std::cout << "Created matrices, N = " << N << ".\n";
dsecnd(); // Warm up
{
gemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, lead_dim /* lda */, B,
lead_dim /* ldb */, 0.0, C, lead_dim /* ldc */);
double total = 0.0;
for (size_t i = 0; i < RUNS; ++i) {
double start = dsecnd();
gemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, lead_dim /* lda */,
B, lead_dim /* ldb */, 0.0, C, lead_dim /* ldc */);
total += dsecnd() - start;
}
std::cout << "Time (MKL direct) = " << total << ", ";
}
std::cout << C[0] << '\n';
auto blA = CustomMatrix<double, aligned, padded, rowMajor>(A, N, N, lead_dim);
auto blB = CustomMatrix<double, aligned, padded, columnMajor>(B, N, N, lead_dim);
auto blC = CustomMatrix<double, aligned, padded, columnMajor>(C, N, N, lead_dim);
// auto blA = DynamicMatrix<double, rowMajor>(N, N);
// auto blB = DynamicMatrix<double, columnMajor>(N, N);
// auto blC = DynamicMatrix<double, columnMajor>(N, N);
{
double total = 0.0;
blC = blA * blB;
for (size_t i = 0; i < RUNS; ++i) {
double start = dsecnd();
blC = blA * blB;
total += dsecnd() - start;
}
std::cout << "A * B (Blaze) = " << total << ", ";
}
std::cout << blC(0, 0) << '\n';
_mm_free(A);
_mm_free(B);
_mm_free(C);
return 0;
}
Comments (4)
-
-
reporter Hi Klaus,
sorry for this :-( Now the threshold works as expected.
Regarding the performance difference it seems that on Xeon Phi Knights Landing the AVX512 instruction set is not used by Blaze. As far as I understand the code, Blaze has AVX512 support for the previous model Knights Corner (
BLAZE_MIC_MODE
). So I assume that is the reason for the MKL being faster. I would like to look into this and do some work, maybe. Are there plans on how to update Blaze for AVX512? For example Knights Landing supportsavx512f avx512pf avx512er avx512cd
, while upcoming/latest CPUs have other parts. How big is the workload to extend Blaze with a new SIMD instruction set?edit: My first reflex would be changing
BLAZE_MIC_MODE
toBLAZE_AVX512_MODE
and then use macros like_FEATURE_AVX512ER
, which are defined inimmintrin.h
, to replace some intrinsics.edit2: Technically, for the MIC architecture the instruction set is called Initial Many Core Instructions (IMCI). The book "Intel Xeon Phi Processor High Performance Programming" by Jim Jeffers et. al. has a nice overview for migrating code from IMCI to AVX512 (Chapter 6). It seems there is not much change needed to get a quick result. Any thoughts?
edit3: Here is a glimpse of how I would approach adding AVX512
Kind regards
Jannik
-
reporter Hi Klaus,
I added in my branch changes for all basic functions and a naive implementation for Reduction.
Kind regards
Jannik
-
Hi Jannik!
Thanks a lot for your efforts and the link. At a first glance it looks great. We will take a closer look as soon as possible, but definitely before the release of Blaze 3.2. Thanks again,
Best regards,
Klaus!
- Log in to comment
Hi Jannik!
Thanks for raising the issue. I have taken a look at your code and discovered the problem in the following four lines:
I expect that you are trying to set all four dense matrix/dense matrix multiplication thresholds to
1
. However, note that the first and fourth line are identical. this means you are unfortunately missing the threshold for the row-major matrix/column-major matrix multiplication (BLAZE_DMATTDMATMULT_THRESHOLD
), which is exactly the kind of multiplication you are using:Changing the first line to
will set the required threshold and Blaze will always call the MKL kernel.
I hope this solves the problem,
Best regards,
Klaus!