Unexpected Threshold Behaviour

Hi Klaus,

I tried the following simple matrix-matrix product program. It compares the time needed for a direct call to the MKL dgemm routine with the Blaze call. Compiled with icpc -I./blaze/ -mkl=sequential -std=c++14 -O3 -xMIC-AVX512 blaze_mkl.cpp and the repository version of Blaze.

I did expect to see two things:

Blaze does call MKL for large matrices as soon as the threshold is hit.
Blaze is faster for smaller matrices.

But in this case (tested on Intel Xeon Phi Knights Landing and an i5) the MKL version is faster (at least two times), and Blaze seems to ignore my threshold (I just set it to 1 to be sure), but calls MKL for larger matrices (here if N>67). The calls made to MKL were checked with MKL_VERBOSE=1 ./a.out.

I fear I did make a stupid mistake somewhere, but I have been trying this for a while now. A concrete feature request would be a flow chart for the decision process when what is called.

Thanks for your time, and kind regards

Jannik

#include <mkl.h>

#define BLAZE_BLAS_MODE 1
#define BLAZE_BLAS_INCLUDE_FILE <mkl_cblas.h>
#define BLAZE_BLAS_IS_PARALLEL 0

#define BLAZE_USE_SHARED_MEMORY_PARALLELIZATION 0

#define BLAZE_TDMATDMATMULT_THRESHOLD 1
#define BLAZE_DMATDMATMULT_THRESHOLD 1
#define BLAZE_TDMATTDMATMULT_THRESHOLD 1
#define BLAZE_TDMATDMATMULT_THRESHOLD 1

// Cache size of Knights Landing
#define BLAZE_CACHE_SIZE 1024 * 1024

#include <blaze/Blaze.h>

constexpr size_t N    = 64; // matrix size
constexpr size_t RUNS = 500; // how often called

using namespace blaze;

int main() {
  constexpr size_t element_size = sizeof(double);
  constexpr size_t lead_dim = (((N * element_size + 511) / 512) * 512 + 64) / element_size;
  constexpr size_t alignment = 4096;
  double* A =
      (double*)_mm_malloc(N * lead_dim * sizeof(double), alignment);  // row major
  double* B =
      (double*)_mm_malloc(N * lead_dim * sizeof(double), alignment);  // column major
  double* C =
      (double*)_mm_malloc(N * lead_dim * sizeof(double), alignment);  // column major

  // Random initialization
  VSLStreamStatePtr stream;
  vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
  vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
  vslDeleteStream(&stream);
  std::cout << "Created matrices, N = " << N << ".\n";

  dsecnd(); // Warm up

  {
    gemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
         CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, lead_dim /* lda */, B,
         lead_dim /* ldb */, 0.0, C, lead_dim /* ldc */);
    double total = 0.0;
    for (size_t i = 0; i < RUNS; ++i) {
      double start = dsecnd();
      gemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
           CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0, A, lead_dim /* lda */,
           B, lead_dim /* ldb */, 0.0, C, lead_dim /* ldc */);
      total += dsecnd() - start;
    }
    std::cout << "Time (MKL direct) = " << total << ", ";
  }

  std::cout << C[0] << '\n';

  auto blA = CustomMatrix<double, aligned, padded, rowMajor>(A, N, N, lead_dim);
  auto blB = CustomMatrix<double, aligned, padded, columnMajor>(B, N, N, lead_dim);
  auto blC = CustomMatrix<double, aligned, padded, columnMajor>(C, N, N, lead_dim);

  // auto blA = DynamicMatrix<double, rowMajor>(N, N);
  // auto blB = DynamicMatrix<double, columnMajor>(N, N);
  // auto blC = DynamicMatrix<double, columnMajor>(N, N);

  {
    double total = 0.0;
    blC          = blA * blB;
    for (size_t i = 0; i < RUNS; ++i) {
      double start = dsecnd();
      blC          = blA * blB;
      total += dsecnd() - start;
    }

    std::cout << "A * B (Blaze) = " << total << ", ";
  }

  std::cout << blC(0, 0) << '\n';

  _mm_free(A);
  _mm_free(B);
  _mm_free(C);
  return 0;
}

Comments (4)