Thread Utilization on row/row multiplication
Hi,
I'm testing out some basic functionality, and for some reason, I'm not seeing any thread utilization beyond one CPU.
I have compiled with -fopenmp and called both blaze::setNumThreads(60) and omp_set_num_threads(60) on a machine with 112 cores.
I'm computing dot products between rows of a matrix, each of size 1000, so I would expect that to exceed any thresholds in blaze/blaze/config/Thresholds.txt.
I have all BLAS macros set to 1 except for "BLAZE_BLAS_IS_PARALLEL 0", as my BLAS implementation does not parallelize.
I've also attempted this using std::thread with [edit: similar results]. Actually, I see 60 threads spawned, each of which uses 0.1% CPU.
Might you be able to point me at where I'm going wrong? I'm calling the function on a 10,000-row, 10,000-column DynamicMatrix of floats.
Here is the code I'm using.
template<typename FloatType=float>
struct TanhKernelMatrix {
    const FloatType k_;
    const FloatType c_;
    template<typename MatrixType>
    blaze::SymmetricMatrix<MatrixType> operator()(MatrixType &a) const {
        blaze::SymmetricMatrix<MatrixType> ret(a.rows());
        for(size_t i(0); i < a.rows(); ++i) {
            for(size_t j(i); j < a.rows(); ++j) {
                ret(i, j) = dot(row(a, i), row(a, j)) + c_;
            }
        }
        ret *= k_;
        return tanh(ret);
    }
    TanhKernelMatrix(FloatType k, FloatType c): k_(k), c_(c){}
};
Comments (3)
- changed status to wontfix
Hi dnbh!
Unfortunately, the dot() function does not provide any parallelisation yet, so you will not see any speedup from spawning more threads. This is a feature we plan to provide with issue #4. However, even if it were parallelised, you probably would not see any speedup, since your vectors (i.e. rows) are too small. You would need vectors of approx. 40000 elements to start to see some benefit from parallelisation. For instance, Blaze currently uses a threshold of 36000 for the dense vector addition, since only vectors larger than this threshold benefit from multiple threads.
In your example there is still some potential for parallelisation, though. What you can do to speed up your computation is to compute multiple dot products in parallel. This kind of parallelisation, however, you would have to implement yourself.
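A minimal sketch of what computing multiple dot products in parallel could look like, assuming OpenMP is available. To keep it self-contained it uses a plain row-major std::vector<float> instead of Blaze types, and the function name tanhKernelMatrix and its parameters are hypothetical, not part of Blaze. The idea is to parallelise the outer loop over rows, so that each thread computes many small dot products serially:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not a Blaze function): fills the full n x n matrix
// K(i, j) = tanh(k * (dot(row_i, row_j) + c)) for a row-major n x dim buffer.
std::vector<float> tanhKernelMatrix(const std::vector<float>& a,
                                    std::size_t n, std::size_t dim,
                                    float k, float c)
{
    assert(a.size() == n * dim);
    std::vector<float> ret(n * n);

    // Iteration i writes the entries (i, j) and (j, i) for j >= i only,
    // so no two iterations write the same element.
    // Later rows have fewer entries, hence the dynamic schedule.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(dynamic)
    for (long long row = 0; row < static_cast<long long>(n); ++row) {
        const std::size_t i = static_cast<std::size_t>(row);
        const float* ri = a.data() + i * dim;
        for (std::size_t j = i; j < n; ++j) {
            const float* rj = a.data() + j * dim;
            // Serial dot product of two rows of length dim.
            const float d = std::inner_product(ri, ri + dim, rj, 0.0f);
            const float v = std::tanh(k * (d + c));
            ret[i * n + j] = v;
            ret[j * n + i] = v;  // mirror into the lower triangle
        }
    }
    return ret;
}
```

The same loop structure should carry over to a Blaze matrix by replacing the pointer arithmetic with row(a, i) and row(a, j); whether it actually scales on your machine is something you would need to measure.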
I hope this explanation helps. Thanks for raising this issue,
Best regards,
Klaus!