StrictlyUpperMatrix::construct has very costly check (performance)

Issue #416 wontfix

GK Palem created an issue 2021-10-07

While assigning the result of a calculation to aStrictlyUpperMatrix variable, such as below:

DynamicVector<int, blaze::columnVector> v1, v2;
StrictlyUpperMatrix<DynamicMatrix<int, blaze::rowMajor> > m;

StrictlyUpperMatrix<DynamicMatrix<int, blaze::rowMajor> > m1 = (blaze::outer(v1, v2) && m);

the code path is going through very costly runtime check as below in the StrictlyUpperMatrix::construct

inline const MT StrictlyUpperMatrix<MT,SO,true>::construct( const Matrix<MT2,SO2>& m, T )
{
   const MT tmp( *m );

   if( IsUniTriangular_v<MT2> ||
       ( !IsStrictlyUpper_v<MT2> && !isStrictlyUpper( tmp ) ) ) {
      BLAZE_THROW_INVALID_ARGUMENT( "Invalid setup of strictly upper matrix" );
   }

   return tmp;
}

The IsStrictlyUpper_v check appears to be very costly (it is validating the whole matrix).

Should this verification not be done conditionally only in debug versions, and turned off in production grade code? Looks like other parts of the code are using something like BLAZE_INTERNAL_ASSERT

Here the verification is happening using an if .. else runtime condition rather than as BLAZE_INTERNAL_ASSERT

This is happening while a DMatDMatMapExpr is being assigned to StrictlyUpperMatrix and the assignment is turning out to be a very costly affair, performance-wise. In the above example, given that one of the operands for the calculation is StrictlyUpperMatrix and it is being ANDed, we know for sure that the result of the expression would be a StrictlyUpperMatrix. But looks like there is no way to avoid this check or hint the system.

The check is fine in Debug mode, but in production mode this is very costly.

Comments (5)

GK Palem reporter
- edited description
- 2021-10-07T03:44:53+00:00
Klaus Iglberger
- changed status to wontfix
Hi GK!

Thanks for taking the time to create this issue. However, we don't consider this a bug, but a conscious design decision. In case you are certain that a given computation results in a strictly upper matrix and are willing to take the risk to destroy the invariants of a StrictlyUpperMatrix, you can use the declstrupp() function:
```
DynamicVector<int, blaze::columnVector> v1, v2;
StrictlyUpperMatrix<DynamicMatrix<int, blaze::rowMajor> > m;

StrictlyUpperMatrix<DynamicMatrix<int, blaze::rowMajor> > m1 = declstrupp(blaze::outer(v1, v2) && m);
```
This will turn the runtime check into a compile time check. However, please consider the declstrupp() function on the same level as any cast (i.e. you are completely responsible for any subsequent misbehavior).

I hope this helps,

Best regards,

Klaus!
- 2021-10-07T06:09:46+00:00
GK Palem reporter
Thank you Klaus.

Looks like declstrupp() removed the run-time if-else check.

However there is still performance bottleneck at matrix assignment.
```
StrictlyUpperMatrix<DynamicMatrix<bool, blaze::rowMajor> > m1 = blaze::declstrupp(blaze::outer(v1, v2) && m);
```
As shown in the below screenshot of the performance profile, 44% of the time is being spent in just a matrix to matrix assignment for the above equation.

Interestingly it is calling a DynamicMatrix::assignthat is copying elements of the matrix element by element in a dual for loop.

Is there anyway this can be improved? For example, a memcpy that copies the block of memory in one go? The DynamicMatrix in question is using a native type and no reason to copy element by element.

Would appreciate any info on how to improve the performance of this assignment, as this has become the critical performance bottleneck presently.
- 2021-10-09T03:48:52+00:00
Klaus Iglberger
Hi GK!

I’ve taken some time to analyze the performance. The call to DynamicMatrix::assign() performs the assignment of the right-hand side expression into the storage of the StrictlyUpperMatrix<DynamicMatrix<bool, blaze::rowMajor>>. Since that expression cannot be vectorized (logical AND (&&) currently cannot be vectorized) the element-wise assignment is necessary and correct. A memcpy() isn’t an option since the expression doesn’t contain any memory to copy from, it is evaluated on the fly. Unfortunately I’m not aware of a more direct and more efficient way to handle that expression.

I hope this helps,

Best regards,

Klaus!
- 2021-10-09T06:45:20+00:00
GK Palem reporter
Thank you Klaus.

The matrix is of Bool type with the equation involving the Boolean operations.

Hope there can be a specialised implementation for bool vectors / matrix ( packing 32 or 64 bool bits into one integer ) for logical expression evaluations (that take advantage of bit packing parallelization using long integers).
- 2021-10-09T09:49:35+00:00
Log in to comment

Assignee: –

Type: bug

Priority: critical

Status: wontfix

Votes: 0

Watchers: 1