Efficient row-wise and column-wise multiplication

Issue #238 resolved
Maximilian Bremer created an issue

Hi Klaus!

Large parts of my algorithm require efficient row-wise multiplication of matrix elements. Up until now I have been using the DiagonalMatrix adaptor class. My diagonal matrices need to be quite long (up to 2 million elements), which makes using dense diagonal matrices infeasible. Profiling the code shows that the compressed diagonal adaptor generates scalar code, even though the majority of diagonal terms are non-zero. I estimate that having access to vectorized diagonal matrix multiplies would speed up my application by a factor of ~1.15.

My gut feeling is that using a DynamicVector to store the diagonal terms and appropriately defining assignment should allow for fully vectorized code.
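
Roughly, what I have in mind is the element-wise operation sketched below (illustration only, not actual Blaze code; assuming the Blaze headers are included): given the diagonal terms in a vector d and a matrix B, compute C(i,j) = d[i] * B(i,j), i.e. C = diag(d) * B, without ever materializing the diagonal matrix.

    // Sketch of the desired operation: scale row i of B by d[i]
    blaze::DynamicVector<double> d{ 1.0, 2.0, 3.0 };
    blaze::DynamicMatrix<double> B( 3UL, 2UL, 1.0 );
    blaze::DynamicMatrix<double> C( 3UL, 2UL );

    for( size_t i=0UL; i<d.size(); ++i ) {
       for( size_t j=0UL; j<B.columns(); ++j ) {
          C(i,j) = d[i] * B(i,j);  // the hope is that this can be fully vectorized
       }
    }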

Please let me know what you think. If this is too hard to fit into the Blaze formalism, maybe one could simply add row-wise/column-wise multiplication functions. Also, I'm happy to help out.

Thanks for all your work,

Max

Comments (10)

  1. Klaus Iglberger

    Hi Maximilian!

    Thanks a lot for creating this issue. The motivation is clear, and you are correct that for this case the operation should run with full vectorization. However, no mathematical operation comes to mind that would directly express it, and at this point we have no idea how it should be expressed syntactically. For that reason we suggest that you add the following two special-purpose functions to your application. With these, the operation should work at maximum efficiency.

    // Scales the i-th row of the given dense matrix A by the i-th element of x,
    // i.e. A = diag(x) * A
    template< typename MT, bool SO, typename VT, bool TF >
    void scaleRows( DenseMatrix<MT,SO>& A, const DenseVector<VT,TF>& x )
    {
       if( rows(A) != size(x) ) {
          BLAZE_THROW_INVALID_ARGUMENT( "Matrix and vector sizes do not match" );
       }

       for( size_t i=0UL; i<size(x); ++i ) {
          row( A, i ) *= (~x)[i];
       }
    }

    // Scales the j-th column of the given dense matrix A by the j-th element of x,
    // i.e. A = A * diag(x)
    template< typename MT, bool SO, typename VT, bool TF >
    void scaleColumns( DenseMatrix<MT,SO>& A, const DenseVector<VT,TF>& x )
    {
       if( columns(A) != size(x) ) {
          BLAZE_THROW_INVALID_ARGUMENT( "Matrix and vector sizes do not match" );
       }

       for( size_t j=0UL; j<size(x); ++j ) {
          column( A, j ) *= (~x)[j];
       }
    }
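
    For instance, a hypothetical call site could look like this (assuming the Blaze headers are included and a using directive for namespace blaze; the names are just for illustration):

    DynamicMatrix<double,rowMajor> A( 5UL, 3UL, 1.0 );
    DynamicVector<double> x{ 1.0, 2.0, 3.0, 4.0, 5.0 };

    scaleRows( A, x );  // row i of A is now scaled by x[i]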
    

    To some extent the idea resembles that of a band matrix (see issue #73). Still, we will keep this issue open to remind us about the feature request. We will see how these two ideas might be combined. Thanks again,

    Best regards,

    Klaus!

  2. Maximilian Bremer reporter

    Hi Klaus!

    Regarding the mathematical motivation, scaling rows and columns can be represented using diagonal matrix multiplication. I think alternatively this issue could also have been formulated as asking for performance enhancements for the DiagonalMatrix class. In that sense, this might be the most basic use case for #73. To me the main issue is finding the correct abstraction inside blaze, which you are certainly more equipped to do than I.

    Regarding the workaround, I will try it out and see if it helps. I am still a little concerned about the performance. For example, calling scaleRows() on a column-major matrix does not seem optimal to me. I would imagine that scaling a submatrix of SIMDLENGTH rows at a time would be more performant. There are also a few variations here, e.g. applying component-wise multiplication to the columns with appropriate blocking. I will write a little benchmark to see if this is really an issue for me, though. Another "dumb" workaround would be to simply replicate the diagonal data in each column and use a component-wise matrix multiplication (this might work for me, since my matrices tend to be tall and skinny); a rough sketch follows below.
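
    To make that last workaround concrete, it would look roughly like this (just a sketch with placeholder names and sizes, assuming the Blaze headers are included and a using directive for namespace blaze):

    const size_t N( 2000000UL );  // number of diagonal terms
    const size_t M( 10UL );       // number of columns (tall and skinny)

    DynamicVector<double> d( N, 1.0 );                 // diagonal terms
    DynamicMatrix<double,columnMajor> B( N, M, 1.0 );
    DynamicMatrix<double,columnMajor> D( N, M );

    for( size_t j=0UL; j<M; ++j ) {
       column( D, j ) = d;  // replicate the diagonal terms in every column
    }

    DynamicMatrix<double,columnMajor> C( D % B );  // component-wise (Schur) product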

    Lastly, it would seem to me that there should be a way to implement (at least the efficient diagonal matrix multiply) in an expression-template-friendly manner, i.e. I wouldn't expect a forced evaluation to be necessary.

    Hope that helps. Let me know if there's anything I can do to help. I'll set up some micro-benchmarks to see what works best.

    Max

  3. Klaus Iglberger

    Hi Max!

    You are correct, diagonal matrix multiplication is one mathematical way to represent it. I couldn't think of another, simpler one that is already available in Blaze, though.

    You are correct, the scaleRows() function will not work particularly well for column-major matrices. However, it should give you an idea how to define the operation manually. Please feel free to adapt it in any way you need.

    Best regards,

    Klaus!

  4. Maximilian Bremer reporter

    I just wanted to follow up on this issue. This is still a pretty serious gap for us.

    What would be the required steps to get this operation into the expression template framework? Is it possible to simply adapt something like the DMatDVecMultExpr.h and DMatDMatSchurExpr.h to scale rows or columns?

  5. Klaus Iglberger

    Hi Max!

    In order to enable a vectorized matrix multiplication between a diagonal sparse matrix and a dense matrix, the required steps would be to adapt the DMatSMatMultExpr, DMatTSMatMultExpr, SMatDMatMultExpr, SMatTDMatMultExpr, TDMatSMatMultExpr, TDMatTSMatMultExpr, TSMatDMatMultExpr, and TSMatTDMatMultExpr expression classes, which cover all possible permutations of sparse and dense matrices. The effort to implement this optimization in the way you expect it is therefore significant.

    It would be helpful if you indeed evaluate the optimal strategy for your operation. With more information on what exactly you require, our implementation effort could be minimized. Without this information we would provide a vectorized scaling of rows for a row-major matrix, but would very likely not implement optimizations for scaling rows of a column-major matrix.

    Best regards,

    Klaus!

  6. Maximilian Bremer reporter

    Hi Klaus!

    I've attached two cartoons showing what my kernels look like:

    volume_kernel.png integration.png

    For both kernels, the tall and skinny matrices to the left of the sparse matrices are column-major. In terms of the optimal implementation, I would assume that the diagonal matrix has non-zeros along the diagonal, so the optimal implementation would be a vectorized multiplication of the diagonal entries with a given column. There might be some cache blocking to be done, but that's really a micro-optimization in my mind at the moment. If I understand correctly, for my use case it would suffice to just implement the SMatTDMatMultExpr version.

    Let me know if that's helpful enough, or if you need more information.

  7. Klaus Iglberger

    Hi Max!

    Thanks a lot for posting the graphics. I think I finally understand the details of the operation you have in mind, and I have a proposal for how to speed up your computations considerably. The following code snippet demonstrates how to replace the sparse matrix/dense matrix multiplication with a dense matrix Schur product:

    using namespace blaze;
    
    constexpr size_t N1( 1000UL );
    constexpr size_t N2( 10UL );
    
    DiagonalMatrix< CompressedMatrix<double,rowMajor> > A( N1 );
    DynamicMatrix<double,columnMajor> B( N1, N2 );
    DynamicMatrix<double,columnMajor> C( N1, N2 );
    
    // ... Initialization of A and B
    
    // Option 1: Perform a sparse matrix/dense matrix multiplication
    C = A * B;
    
    // Option 2: Perform a dense matrix Schur product; expand() replicates the
    // diagonal values across B.columns() columns, and '%' denotes the
    // component-wise (Schur) product
    const DynamicVector<double,columnVector> tmp( diagonal( A ) );
    C = expand( tmp, B.columns() ) % B;
    

    On my machine, assuming N1 to be 1000 and N2 to be 10 (tall and skinny), option 2 runs 1.75 times faster than option 1. If, instead of the sparse matrix, I only use a DynamicVector to store the diagonal values, option 2 even runs 3.5 times faster than option 1.
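
    For reference, the comparison can be reproduced with a simple timing loop along the following lines (just a sketch, not the benchmark I actually used; the repetition count and the chrono-based timing are only one way to measure this):

    #include <blaze/Blaze.h>
    #include <chrono>
    #include <iostream>

    int main()
    {
       using namespace blaze;

       constexpr size_t N1  ( 1000UL );
       constexpr size_t N2  ( 10UL );
       constexpr size_t reps( 10000UL );

       DiagonalMatrix< CompressedMatrix<double,rowMajor> > A( N1 );
       DynamicMatrix<double,columnMajor> B( N1, N2, 1.0 );
       DynamicMatrix<double,columnMajor> C( N1, N2 );

       for( size_t i=0UL; i<N1; ++i ) {
          A(i,i) = double(i+1);  // fully populated diagonal
       }

       const auto t0( std::chrono::steady_clock::now() );
       for( size_t r=0UL; r<reps; ++r ) {
          C = A * B;  // Option 1: sparse matrix/dense matrix multiplication
       }
       const auto t1( std::chrono::steady_clock::now() );

       const DynamicVector<double,columnVector> tmp( diagonal( A ) );
       for( size_t r=0UL; r<reps; ++r ) {
          C = expand( tmp, B.columns() ) % B;  // Option 2: dense matrix Schur product
       }
       const auto t2( std::chrono::steady_clock::now() );

       std::cout << "Option 1: " << std::chrono::duration<double>( t1 - t0 ).count() << " s\n"
                 << "Option 2: " << std::chrono::duration<double>( t2 - t1 ).count() << " s\n"
                 << "Checksum: " << C(0UL,0UL) << "\n";
    }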

    Based on these results it is possible to increase the performance of the sparse matrix/dense matrix multiplication by a factor of 1.75 (at least for this example problem). However, it will not be possible to go beyond that (i.e. to achieve a factor of 3.5). The reason is that it would always be necessary to allocate memory for the dense vector since the input is a sparse diagonal matrix. In your own interest I therefore suggest you consider using a dense vector to store the diagonal elements, if possible. You would additionally be able to omit the setup of the sparse diagonal matrix, which is considerably more costly than the setup of a dense vector, and you could hide the operation behind a regular function call:

    // Scales the i-th row of rhs by lhs[i] by means of a lazily evaluated Schur product
    template< typename VT, typename MT >
    decltype(auto) scaleRows( const DenseVector<VT,columnVector>& lhs, const DenseMatrix<MT,columnMajor>& rhs )
    {
       return expand( ~lhs, (~rhs).columns() ) % (~rhs);
    }
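
    A hypothetical call site, reusing B and C from the snippet above but storing the diagonal values directly in a dense vector, could then simply read:

    DynamicVector<double,columnVector> d( N1, 2.0 );  // diagonal values, stored densely
    C = scaleRows( d, B );                            // C(i,j) = d[i] * B(i,j)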
    

    I hope this helps,

    Best regards,

    Klaus!

  8. Maximilian Bremer reporter

    Klaus!

    This seems like exactly what I need. Transforming the diagonal matrix into a column vector isn't a problem at all. I'll try it out ASAP. Assuming your machine has AVX2, a speed-up of 3.5 seems near optimal. Thanks for all the help!

    Max!

  9. Klaus Iglberger

    Hi Max!

    I assume that the suggested solution works for you. Does this resolve this issue?

    Best regards,

    Klaus!
