Efficient Row-wise and column-wise multiplication
Hi Klaus!
Large parts of my algorithm require efficient row-wise multiplication of matrix elements. Up until now I have been using the DiagonalMatrix adaptor class. My diagonal matrices need to be quite long (up to 2 million elements), which makes dense diagonal matrices infeasible, and profiling shows that the compressed diagonal adaptor generates scalar code. Since the majority of the diagonal terms are non-zero, I estimate that access to vectorized diagonal matrix multiplies would speed up my app by a factor of roughly 1.15.
My gut feeling is that using a DynamicVector to store the diagonal terms and appropriately defining assignment should allow for fully vectorized code.
Please let me know what you think. If this is too hard on the Blaze formalism, maybe one could simply add row-wise/column-wise multiplication. Also, I'm happy to help out.
Thanks for all your work,
Max
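The identity behind the request can be stated in a few lines. The following is an illustrative plain-C++ sketch (not Blaze code, and the function name is made up for this example): left-multiplying B by D = diag(d) just scales row i of B by d[i], so a length-n vector of diagonal values suffices and no dense diagonal matrix is ever needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: C = diag(d) * B for a row-major matrix B
// (rows x cols, stored as a flat vector). Row i of the result is
// simply row i of B scaled by d[i].
std::vector<double> diagScale( const std::vector<double>& d,
                               const std::vector<double>& B,
                               std::size_t rows, std::size_t cols )
{
   std::vector<double> C( rows*cols );
   for( std::size_t i = 0UL; i < rows; ++i )
      for( std::size_t j = 0UL; j < cols; ++j )
         C[ i*cols + j ] = d[i] * B[ i*cols + j ];  // row i scaled by d[i]
   return C;
}
```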
Comments (10)
-
-
reporter Hi Klaus!
Regarding the mathematical motivation, scaling rows and columns can be represented using diagonal matrix multiplication. I think alternatively this issue could also have been formulated as asking for performance enhancements for the DiagonalMatrix class. In that sense, this might be the most basic use case for #73. To me the main issue is finding the correct abstraction inside blaze, which you are certainly more equipped to do than I.
Regarding the workaround, I will try it out and see if that helps. I am still a little concerned about the performance. For example, calling scaleRows() on a columnMajor matrix does not seem optimal to me. I would imagine that scaling a submatrix with SIMDLENGTH rows at a time would be more performant. There are also a few variations here, e.g. applying component-wise multiplication to the columns with appropriate blocking. I will write a little benchmark to see if this is really an issue for me, though. Another "dumb" workaround would be to just replicate the data in each column and use a component-wise matrix multiplication (this might work for me, since my matrices tend to be tall and skinny). Lastly, it would seem to me that there should be a way to implement at least the efficient diagonal matrix multiply in an expression-template-friendly framework, i.e. I wouldn't expect an evaluation to be forced.
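As an aside, row scaling of a column-major matrix can in fact be written as a unit-stride loop, since each column is contiguous. A minimal plain-C++ sketch (hypothetical, not Blaze internals) of why this is still SIMD-friendly:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: scale row i of a column-major matrix B
// (n1 x n2, stored as a flat vector) by d[i]. Each column is
// contiguous, so the inner loop is a unit-stride element-wise
// multiply that compilers can auto-vectorize.
void scaleRowsColMajor( const std::vector<double>& d,
                        std::vector<double>& B,
                        std::size_t n1, std::size_t n2 )
{
   for( std::size_t j = 0UL; j < n2; ++j ) {
      double* col = &B[ j*n1 ];        // start of column j
      for( std::size_t i = 0UL; i < n1; ++i ) {
         col[i] *= d[i];               // unit-stride, SIMD-friendly
      }
   }
}
```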
Hope that helps. Let me know if there's anything I can do to help. I'll set up some micro-benchmarks to see what works best.
Max
-
Hi Max!
You are correct, diagonal matrix multiplication is one mathematical way to represent it. I couldn't think of another, simpler one, though, that is already available in Blaze.
You are correct, the scaleRows() function will not work particularly well for column-major matrices. However, it should give you an idea of how to define the operation manually. Please feel free to adapt it in any way you need.
Best regards,
Klaus!
-
reporter I just wanted to follow up on this issue. This is still a pretty serious gap for us.
What would be the required steps to get this operation into the expression template framework? Is it possible to simply adapt something like DMatDVecMultExpr.h and DMatDMatSchurExpr.h to scale rows or columns?
-
Hi Max!
In order to enable a vectorized matrix multiplication between a diagonal sparse matrix and a dense matrix, the required steps would be to adapt the DMatSMatMultExpr, DMatTSMatMultExpr, SMatDMatMultExpr, SMatTDMatMultExpr, TDMatSMatMultExpr, TDMatTSMatMultExpr, TSMatDMatMultExpr, and TSMatTDMatMultExpr expression classes, which cover all possible permutations of sparse and dense matrices. The effort to implement this optimization in the way you expect it is therefore significant.
It would be helpful if you indeed evaluate the optimal strategy for your operation. With more information on what exactly you require, our implementation effort could be minimized. Without this information, we would provide a vectorized scaling of rows for a row-major matrix, but would very likely not implement optimizations for scaling rows in a column-major matrix.
Best regards,
Klaus!
-
reporter Hi Klaus!
I've attached two cartoons showing what my kernels look like:
For both kernels, the tall and skinny matrices to the left of the sparse matrices are column major. In terms of an optimal implementation, I would assume that the diagonal matrix has nonzeros all along the diagonal, so the optimal implementation would be a vectorized multiplication of the diagonal entries with a given column. There might be some cache blocking to be done, but in my mind that's really a micro-optimization at the moment. If I understand correctly, for my use case it'd suffice to just implement the SMatTDMatMultExpr version.
Let me know if that's helpful enough, or if you need more information.
-
Hi Max!
Thanks a lot for posting the graphics. I think I finally understand the details of the operation you have in mind and I have a proposal of how to speed up your computations considerably. The following code snippet demonstrates how to replace the sparse matrix/dense matrix multiplication by a dense matrix Schur product:
using namespace blaze;

constexpr size_t N1( 1000UL );
constexpr size_t N2( 10UL );

DiagonalMatrix< CompressedMatrix<double,rowMajor> > A( N1 );
DynamicMatrix<double,columnMajor> B( N1, N2 );
DynamicMatrix<double,columnMajor> C( N1, N2 );

// ... Initialization of A and B

// Option 1: Perform a sparse matrix/dense matrix multiplication
C = A * B;

// Option 2: Perform a dense matrix Schur product
const DynamicVector<double,columnVector> tmp( diagonal( A ) );
C = expand( tmp, B.columns() ) % B;
On my machine, assuming N1 to be 1000 and N2 to be 10 (tall and skinny), option 2 runs 1.75 times faster than option 1. If instead of the sparse matrix I only use a DynamicVector to store the diagonal values, option 2 even runs 3.5 times faster than option 1.
Based on these results it is possible to increase the performance of the sparse matrix/dense matrix multiplication by a factor of 1.75 (at least for this example problem). However, it will not be possible to go beyond that (i.e. to achieve a factor of 3.5). The reason is that it would always be necessary to allocate memory for the dense vector, since the input is a sparse diagonal matrix. In your own interest I therefore suggest you consider using a dense vector to store the diagonal elements, if possible. You would additionally be able to omit the setup of the sparse diagonal matrix, which is considerably more costly than the setup of a dense vector, and you could hide the operation behind a regular function call:
template< typename VT, typename MT >
decltype(auto) scaleRows( const DenseVector<VT,columnVector>& lhs,
                          const DenseMatrix<MT,columnMajor>& rhs )
{
   return expand( ~lhs, (~rhs).columns() ) % (~rhs);
}
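For readers without Blaze at hand, the semantics of expand( v, n ) % B can be sketched in plain C++ (a hypothetical helper, not the Blaze implementation): v is conceptually expanded into a matrix whose n columns all equal v, and % is the element-wise (Schur) product.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of expand( v, cols ) % B for a column-major
// matrix B (rows x cols, stored as a flat vector): column j of the
// expanded matrix is v itself, so every column of B is multiplied
// element-wise by v.
std::vector<double> expandSchur( const std::vector<double>& v,
                                 const std::vector<double>& B,
                                 std::size_t rows, std::size_t cols )
{
   std::vector<double> C( rows*cols );
   for( std::size_t j = 0UL; j < cols; ++j )
      for( std::size_t i = 0UL; i < rows; ++i )
         C[ j*rows + i ] = v[i] * B[ j*rows + i ];  // column j times v
   return C;
}
```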
I hope this helps,
Best regards,
Klaus!
-
reporter Klaus!
This seems like exactly what I need. Transforming the diagonal matrix into a column vector isn't a problem at all. I'll try it out ASAP. Assuming your machine supports AVX2, a speed-up of 3.5 seems near optimal. Thanks for all the help!
Max!
-
Hi Max!
I assume that the suggested solution works for you. Does this resolve this issue?
Best regards,
Klaus!
-
- changed status to resolved
Hi Maximilian!
Thanks a lot for creating this issue. The motivation is clear and you are correct that for this case the operation should run with full vectorization. However, no mathematical operation comes to mind that would support this operation and at this point we have no idea how this operation should be expressed syntactically. For that reason we suggest that you add the following two special purpose functions to your application. With these the operation should work at maximum efficiency.
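The two functions themselves did not survive in this copy of the thread. As a loudly hypothetical sketch of what such special-purpose helpers could look like (plain C++, not the original Blaze code; names and signatures are invented for illustration):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: scale row i of a row-major matrix A
// (rows x cols, stored as a flat vector) in place by d[i].
void scaleRows( std::vector<double>& A, const std::vector<double>& d,
                std::size_t rows, std::size_t cols )
{
   for( std::size_t i = 0UL; i < rows; ++i )
      for( std::size_t j = 0UL; j < cols; ++j )
         A[ i*cols + j ] *= d[i];   // row i scaled by d[i]
}

// Hypothetical helper: scale column j of the same matrix by d[j].
void scaleColumns( std::vector<double>& A, const std::vector<double>& d,
                   std::size_t rows, std::size_t cols )
{
   for( std::size_t i = 0UL; i < rows; ++i )
      for( std::size_t j = 0UL; j < cols; ++j )
         A[ i*cols + j ] *= d[j];   // column j scaled by d[j]
}
```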
To some extent the idea resembles the idea of a band matrix (see issue #73). Still, we will keep this issue open to remind us about the feature request. We will see how these two ideas might be combined. Thanks again,
Best regards,
Klaus!