Thread Utilization on row/row multiplication

Issue #95 wontfix
Daniel Baker created an issue

Hi,

I'm testing out some basic functionality, and for some reason, I'm not seeing any thread utilization beyond one CPU.

I have compiled with -fopenmp and called both blaze::setNumThreads(60) and omp_set_num_threads(60) on a machine with 112 cores.

I'm computing dot products between rows of the matrix, each of length 1000, so I would expect to exceed any of the thresholds in blaze/blaze/config/Thresholds.txt.

I have all BLAS macros set to 1 except for BLAZE_BLAS_IS_PARALLEL, which is 0, as my BLAS implementation does not parallelize.

I've also attempted this using std::thread, with similar results: I see 60 threads spawned, each using 0.1% CPU.

Might you be able to point me at where I'm going wrong? I'm calling the function on a 10,000-row, 10,000-column DynamicMatrix of floats.

Here is the code I'm using.

template<typename FloatType=float>
struct TanhKernelMatrix {
    const FloatType k_;
    const FloatType c_;
    TanhKernelMatrix(FloatType k, FloatType c): k_(k), c_(c) {}
    template<typename MatrixType>
    blaze::SymmetricMatrix<MatrixType> operator()(MatrixType &a) const {
        blaze::SymmetricMatrix<MatrixType> ret(a.rows());
        // Fill the upper triangle; SymmetricMatrix mirrors (j, i) automatically.
        for(size_t i(0); i < a.rows(); ++i) {
            for(size_t j(i); j < a.rows(); ++j) {
                ret(i, j) = dot(row(a, i), row(a, j)) + c_;
            }
        }
        ret *= k_;
        return tanh(ret);
    }
};

Comments (3)

  1. Klaus Iglberger

    Hi dnbh!

    Unfortunately, the dot() function does not provide any parallelisation yet, so you will not see any speedup from spawning more threads. This is a feature we plan to provide with issue #4. However, even if it did provide parallelisation, you probably would not see any speedup, since your vectors (i.e. rows) are too small. You would need vectors of approximately 40000 elements to start seeing benefits from parallelisation. For instance, Blaze currently uses a threshold of 36000 for the dense vector addition, since only for vectors larger than this threshold do you see a benefit from multiple threads.

    In your example there is still some potential for parallelisation, though: you can compute multiple of the independent dot products in parallel. This kind of parallelisation, however, you would have to implement yourself.
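
    A minimal sketch of that idea, using plain std::thread and std::vector in place of Blaze types so it stands alone (the function and thread-partitioning scheme here are illustrative assumptions, not Blaze API): each (i, j) dot product is independent, so the outer loop can be interleaved across worker threads, and both halves of the symmetric result are written by the thread that owns row i.

    ```cpp
    #include <cmath>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Plain serial dot product of two equal-length vectors.
    double dot(const std::vector<double>& x, const std::vector<double>& y) {
        double s = 0.0;
        for(std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
        return s;
    }

    // Compute ret(i, j) = tanh(k * (dot(row_i, row_j) + c)) with the outer
    // loop split across nthreads workers. Rows are interleaved (i = t, t +
    // nthreads, ...) to balance the triangular workload.
    std::vector<std::vector<double>>
    tanh_kernel(const std::vector<std::vector<double>>& rows,
                double k, double c, unsigned nthreads) {
        const std::size_t n = rows.size();
        std::vector<std::vector<double>> ret(n, std::vector<double>(n));
        std::vector<std::thread> workers;
        for(unsigned t = 0; t < nthreads; ++t) {
            workers.emplace_back([&, t] {
                for(std::size_t i = t; i < n; i += nthreads) {
                    for(std::size_t j = i; j < n; ++j) {
                        const double v = std::tanh(k * (dot(rows[i], rows[j]) + c));
                        ret[i][j] = v;
                        ret[j][i] = v;  // mirror for symmetry; distinct elements, no race
                    }
                }
            });
        }
        for(auto& w : workers) w.join();
        return ret;
    }
    ```

    The same partitioning would carry over to the Blaze version by calling blaze::dot on blaze::row views inside the worker loop; an OpenMP "#pragma omp parallel for schedule(dynamic)" over the outer loop is an equivalent, shorter alternative.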

    I hope this explanation helps. Thanks for raising this issue,

    Best regards,

    Klaus!
