Unexpected Threshold Behaviour

Issue #115 wontfix
Jannik Schürg created an issue

Hi Klaus,

I tried the following simple matrix-matrix product program. It compares the time needed for a direct call to the MKL dgemm routine with the Blaze call. Compiled with icpc -I./blaze/ -mkl=sequential -std=c++14 -O3 -xMIC-AVX512 blaze_mkl.cpp and the repository version of Blaze.

I did expect to see two things:

  • Blaze does call MKL for large matrices as soon as the threshold is hit.
  • Blaze is faster for smaller matrices.

But in this case (tested on Intel Xeon Phi Knights Landing and an i5) the MKL version is faster (at least two times), and Blaze seems to ignore my threshold (I just set it to 1 to be sure), but calls MKL for larger matrices (here if N>67). The calls made to MKL were checked with MKL_VERBOSE=1 ./a.out.

I fear I did make a stupid mistake somewhere, but I have been trying this for a while now. A concrete feature request would be a flow chart for the decision process when what is called.

Thanks for your time, and kind regards

Jannik

#include <mkl.h>

#define BLAZE_BLAS_MODE 1
#define BLAZE_BLAS_INCLUDE_FILE <mkl_cblas.h>
#define BLAZE_BLAS_IS_PARALLEL 0

#define BLAZE_USE_SHARED_MEMORY_PARALLELIZATION 0

#define BLAZE_TDMATDMATMULT_THRESHOLD 1
#define BLAZE_DMATDMATMULT_THRESHOLD 1
#define BLAZE_TDMATTDMATMULT_THRESHOLD 1
#define BLAZE_TDMATDMATMULT_THRESHOLD 1

// Cache size of Knights Landing
#define BLAZE_CACHE_SIZE 1024 * 1024

#include <blaze/Blaze.h>

constexpr size_t N    = 64; // matrix size
constexpr size_t RUNS = 500; // how often called

using namespace blaze;

int main() {
  constexpr size_t element_size = sizeof(double);
  constexpr size_t lead_dim = (((N * element_size + 511) / 512) * 512 + 64) / element_size;
  constexpr size_t alignment = 4096;
  double* A =
      (double*)_mm_malloc(N * lead_dim * sizeof(double), alignment);  // row major
  double* B =
      (double*)_mm_malloc(N * lead_dim * sizeof(double), alignment);  // column major
  double* C =
      (double*)_mm_malloc(N * lead_dim * sizeof(double), alignment);  // column major

  // Random initialization
  VSLStreamStatePtr stream;
  vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
  vslDeleteStream(&stream);
  std::cout << "Created matrices, N = " << N << ".\n";

  dsecnd(); // Warm up

  {
    gemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
         CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, lead_dim /* lda */, B,
         lead_dim /* ldb */, 0.0, C, lead_dim /* ldc */);
    double total = 0.0;
    for (size_t i = 0; i < RUNS; ++i) {
      double start = dsecnd();
      gemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
           CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, lead_dim /* lda */,
           B, lead_dim /* ldb */, 0.0, C, lead_dim /* ldc */);
      total += dsecnd() - start;
    }
    std::cout << "Time (MKL direct) = " << total << ", ";
  }

  std::cout << C[0] << '\n';

  auto blA = CustomMatrix<double, aligned, padded, rowMajor>(A, N, N, lead_dim);
  auto blB = CustomMatrix<double, aligned, padded, columnMajor>(B, N, N, lead_dim);
  auto blC = CustomMatrix<double, aligned, padded, columnMajor>(C, N, N, lead_dim);

  // auto blA = DynamicMatrix<double, rowMajor>(N, N);
  // auto blB = DynamicMatrix<double, columnMajor>(N, N);
  // auto blC = DynamicMatrix<double, columnMajor>(N, N);

  {
    double total = 0.0;
    blC          = blA * blB;
    for (size_t i = 0; i < RUNS; ++i) {
      double start = dsecnd();
      blC          = blA * blB;
      total += dsecnd() - start;
    }

    std::cout << "A * B (Blaze) = " << total << ", ";
  }

  std::cout << blC(0, 0) << '\n';

  _mm_free(A);
  _mm_free(B);
  _mm_free(C);
  return 0;
}

Comments (4)

  1. Klaus Iglberger

    Hi Jannik!

    Thanks for raising the issue. I have taken a look at your code and discovered the problem in the following four lines:

    #define BLAZE_TDMATDMATMULT_THRESHOLD 1
    #define BLAZE_DMATDMATMULT_THRESHOLD 1
    #define BLAZE_TDMATTDMATMULT_THRESHOLD 1
    #define BLAZE_TDMATDMATMULT_THRESHOLD 1
    

    I expect that you are trying to set all four dense matrix/dense matrix multiplication thresholds to 1. However, note that the first and fourth line are identical. this means you are unfortunately missing the threshold for the row-major matrix/column-major matrix multiplication (BLAZE_DMATTDMATMULT_THRESHOLD), which is exactly the kind of multiplication you are using:

    auto blA = CustomMatrix<double, aligned, padded, rowMajor>(A, N, N, lead_dim);
    auto blB = CustomMatrix<double, aligned, padded, columnMajor>(B, N, N, lead_dim);
    auto blC = CustomMatrix<double, aligned, padded, columnMajor>(C, N, N, lead_dim);
    
    blC = blA * blB;
    

    Changing the first line to

    #define BLAZE_DMATTDMATMULT_THRESHOLD 1
    

    will set the required threshold and Blaze will always call the MKL kernel.

    I hope this solves the problem,

    Best regards,

    Klaus!

  2. Jannik Schürg reporter

    Hi Klaus,

    sorry for this :-( Now the threshold works as expected.

    Regarding the performance difference it seems that on Xeon Phi Knights Landing the AVX512 instruction set is not used by Blaze. As far as I understand the code, Blaze has AVX512 support for the previous model Knights Corner (BLAZE_MIC_MODE). So I assume that is the reason for the MKL being faster. I would like to look into this and do some work, maybe. Are there plans on how to update Blaze for AVX512? For example Knights Landing supports avx512f avx512pf avx512er avx512cd, while upcoming/latest CPUs have other parts. How big is the workload to extend Blaze with a new SIMD instruction set?

    edit: My first reflex would be changing BLAZE_MIC_MODE to BLAZE_AVX512_MODE and then use macros like _FEATURE_AVX512ER, which are defined in immintrin.h, to replace some intrinsics.

    edit2: Technically, for the MIC architecture the instruction set is called Initial Many Core Instructions (IMCI). The book "Intel Xeon Phi Processor High Performance Programming" by Jim Jeffers et. al. has a nice overview for migrating code from IMCI to AVX512 (Chapter 6). It seems there is not much change needed to get a quick result. Any thoughts?

    edit3: Here is a glimpse of how I would approach adding AVX512

    Kind regards

    Jannik

  3. Jannik Schürg reporter

    Hi Klaus,

    I added in my branch changes for all basic functions and a naive implementation for Reduction.

    Kind regards

    Jannik

  4. Klaus Iglberger

    Hi Jannik!

    Thanks a lot for your efforts and the link. At a first glance it looks great. We will take a closer look as soon as possible, but definitely before the release of Blaze 3.2. Thanks again,

    Best regards,

    Klaus!

  5. Log in to comment