Multi-threading in Matrix-Matrix-Multiplication of very non-square matrices depends on dimensions and storage order in non-obvious ways

Issue #42 resolved
Marcin Junczys-Dowmunt created an issue

Hi, First, great work with blaze!

Not really a bug per se, but I have the following usability problem:

When multiplying two matrices of dimensions 12x500 and 500x90000 (the result has 12x90000 elements), it is quite difficult to foresee whether multi-threading will be deployed or not.

With the following code I only get single-thread performance; no multi-threading is used. Changing the storage order of one or more of the involved matrices helps with this. But even when using only one thread, depending on the storage order, the time to execute this program ranges from 7s to 22s on my machine. Shouldn't the algorithm choose between these largely equivalent but more or less optimal formulations itself? For instance, transposing all involved matrices by choosing a columnMajor storage order and swapping the dimensions seems to be the fastest single-threaded configuration, but it also fails to scale to multiple threads. This puts quite a cognitive load on the user to get decent performance.

#include <iostream>
#include <chrono>
#include <blaze/Math.h>

using namespace blaze;

int main(int argc, char** argv) {
    DynamicMatrix<float> A(12, 500);
    DynamicMatrix<float> B(500, 90000);

    DynamicMatrix<float> C;

    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    for(int i = 0; i < 100; ++i) {
        C = A * B;
    }

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::chrono::duration<double> fp_s = end - start;

    std::cerr << fp_s.count() << std::endl;

    return 0;
}

Comments (8)

  1. Klaus Iglberger

    Hi Marcin!

    Thanks a lot for pointing out this issue. We were not aware of the magnitude of the performance impact for this kind of matrix multiplication.

    The problem has two components. The first one is the underlying matrix multiplication kernel. We assume that you are using a BLAS library for your purpose. Most BLAS libraries don't guarantee optimal performance for rectangular matrices with such a row/column ratio. The choice of the storage order (rowMajor or columnMajor) makes a big difference. Unfortunately we can hardly change this fact or compensate for it within Blaze. Since you are free to choose any BLAS library, we cannot decide based on the row/column ratio alone to create a temporary with the opposite storage order.

    The matrix multiplication kernels shipped with Blaze don't provide BLAS performance yet. Also, it is not to be expected that they perform better than the BLAS kernels for rectangular matrices. However, we have already created an issue to rectify the situation (see Issue #12). When updating the Blaze kernels, we will take this issue into account.

    The second component is the parallelization. For now the parallelization is optimized for square or close-to-square matrices. Thus we have to admit that we could do better and should use the row/column ratio to decide how the matrix multiplication is distributed across the cores. For this reason, we will keep this issue open until we have improved the situation.

    In summary, you are correct that this puts a considerable burden on the user of the library. This shouldn't be the case. We will try our best to alleviate the problem with a better parallelization approach and better matrix multiplication kernels. For now, the best thing you can do is to choose the best-suited storage order manually (in your case columnMajor). We apologize for this inconvenience.
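    As a minimal sketch of the suggested workaround, the storage order can be selected via the second template parameter of DynamicMatrix (it defaults to rowMajor); the matrices here are filled with ones purely so the result is easy to check:

    ```cpp
    #include <blaze/Math.h>
    #include <cassert>

    int main() {
        // columnMajor storage for all operands, as suggested above.
        blaze::DynamicMatrix<float, blaze::columnMajor> A(12, 500, 1.0F);
        blaze::DynamicMatrix<float, blaze::columnMajor> B(500, 90000, 1.0F);

        blaze::DynamicMatrix<float, blaze::columnMajor> C = A * B;

        // Each element of C is a dot product of 500 ones with 500 ones.
        assert(C.rows() == 12UL && C.columns() == 90000UL);
        assert(C(0, 0) == 500.0F);

        return 0;
    }
    ```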

    Best regards,

    Klaus!

  2. Marcin Junczys-Dowmunt reporter

    Hi Klaus,

    No need to apologize, Blaze is great, and thanks for the quick reaction. Actually I am not using any BLAS library, because Blaze gives much better performance for these non-square matrices. OpenBLAS, for instance, seems to be broken here (I guess I should report that too), and I was also not impressed with MKL. Once I have a fitting storage order, Blaze is a lot faster than any of them.

    Maybe you should rely on square-matrix benchmarks a little less? For neural networks these kinds of non-square matrices are quite typical. This one, for instance, is a step in converting a recurrent neural network state into a set of probabilities over a vocabulary of 90,000 words. It could just as well be a million words. And it seems Blaze has quite an edge here over the established BLAS implementations.

    Best,

    Marcin

  3. Klaus Iglberger

    Hi Marcin!

    Thanks for the background info. I'm glad Blaze works out so well for this kind of problem. We will see to it that it stays this way.

    We are currently in the last stage of the next release. I will try to resolve this issue before the freeze, but I cannot promise anything.

    Best regards,

    Klaus!

  4. Klaus Iglberger

    The feature has been implemented and tested as required. It is immediately available via cloning the Blaze repository and will be officially released in Blaze 3.0.

    The matrix multiplication in the example is now parallelized independently of the chosen storage order of the matrices (i.e. rowMajor or columnMajor). Also, threads are distributed such that they best match the row/column ratio of the operation to be parallelized. Assuming four threads, in the given example this results in a 1x4 distribution of threads, with each thread computing a 12x22500 submatrix.
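    The idea of matching the thread grid to the row/column ratio can be sketched as follows. This is a hypothetical illustration, not Blaze's actual implementation (the function name `threadGrid` is made up): among all factorizations tr x tc of the thread count, pick the one whose per-thread submatrix is closest to square.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <iostream>
    #include <utility>

    // Hypothetical sketch: choose a tr x tc thread grid (tr * tc == threads)
    // such that each thread's block of the m x n result is as square as possible.
    std::pair<std::size_t, std::size_t> threadGrid(std::size_t m, std::size_t n,
                                                   std::size_t threads) {
        std::size_t bestRows = 1, bestCols = threads;
        double bestScore = 0.0;
        for (std::size_t tr = 1; tr <= threads; ++tr) {
            if (threads % tr != 0) continue;
            const std::size_t tc = threads / tr;
            const double blockRows = double(m) / double(tr);
            const double blockCols = double(n) / double(tc);
            // Aspect score in (0,1]; 1 means a perfectly square block.
            const double score =
                std::min(blockRows, blockCols) / std::max(blockRows, blockCols);
            if (score > bestScore) {
                bestScore = score;
                bestRows = tr;
                bestCols = tc;
            }
        }
        return {bestRows, bestCols};
    }

    int main() {
        // The 12x90000 result from the example: a 1x4 grid of 12x22500 blocks wins
        // over 2x2 (6x45000 blocks) and 4x1 (3x90000 blocks).
        const auto grid = threadGrid(12, 90000, 4);
        assert(grid.first == 1 && grid.second == 4);
        std::cout << grid.first << "x" << grid.second << "\n";  // prints "1x4"
        return 0;
    }
    ```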
