Provide support for solving dense linear systems of equations

Issue #30 resolved
Klaus Iglberger created an issue

Description

So far the Blaze library has been missing the functionality to solve dense linear systems of equations. With the recent addition of dense matrix inversion this task can now be handled, but unfortunately not as efficiently and conveniently as possible. Thus Blaze should provide an API to compute the solution of dense linear systems of equations. The API should both provide an automatic selection of the most efficient way to solve the problem and enable a user to manually select the appropriate algorithm.

Conceptual Example

blaze::DynamicMatrix<double,blaze::columnMajor> A, B, X;
blaze::DynamicVector<double,blaze::columnVector> x, b;

// ... Resizing and initialization

// Compute the solution of the LSE with automatic selection of the decomposition algorithm
solve( A * x = b );

// Compute the solution of the LSE by means of an LU decomposition
solve<byLU>( A * x = b );

// Compute the solutions of the LSE by means of a Cholesky (LLH) decomposition
solve<byLLH>( A * X = B );

Tasks

  • design an API for computing the solution of dense linear systems of equations
  • provide all necessary LAPACK wrapper functions
  • provide a full documentation of the feature
  • ensure compatibility with all existing matrix classes
  • ensure compatibility with all existing matrix expressions
  • guarantee maximum performance for all matrix types
  • add the necessary number of test cases for the entire functionality

Comments (11)

  1. Klaus Iglberger reporter

    Thanks for the interesting link. We would be very interested in learning more about this implementation. Do you have a reference to a performance comparison between your implementation and LAPACK and/or PLASMA, including some information on the CPU used, to enable us to determine the expected performance? Since we also need other decomposition algorithms: does this work also provide implementations of the Cholesky (LLH), LDLT, and LDLH decompositions? Again, thanks a lot for the link; any help is very much appreciated.

  2. Michael Lehn

    The implementation of lu_blocked is identical to LAPACK's. So using this C++ code and calling BLAS from Intel MKL/ATLAS/OpenBLAS, you get exactly the same performance as from LAPACK with Intel MKL/ATLAS/OpenBLAS.

    The implementation of lu_blocked_recursive is more modern. Using BLAS from Intel MKL, you get the same performance as from the Intel MKL implementation of dgetrf. These recursive algorithms were only recently published by the usual suspects around Jack Dongarra. Older implementations of Intel MKL (1-2 years ago) could be beaten with this approach, so I guess they added it only recently.

  3. Michael Lehn

    And yes, for the Cholesky and QR factorizations, recursive algorithms are superior to those from traditional LAPACK. Their performance all depends on a fast matrix-matrix product. Due to the recursion, the algorithm becomes rich in matrix-matrix products whose inner dimensions are rather large and close to the outer dimensions. The lu_blocked_recursive algorithm is just one example of this class of algorithms...

    Please note that there is more than just one variant of this algorithm. In The Science of Programming Matrix Computations you get an overview of different blocked and unblocked algorithms for the LU, LLT, and QR factorizations. All the blocked variants can be turned into recursively blocked variants.
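    For reference, here is a minimal sketch (my own illustration, not the FLENS or LAPACK code) of the classic unblocked LU factorization with partial pivoting. The blocked variants group the trailing rank-1 updates of exactly this loop into large matrix-matrix products, which is where the performance comes from:

    #include <blaze/Math.h>
    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Factorizes the square matrix A in-place into a unit lower triangular L and an
    // upper triangular U (both stored in A); returns the row pivots. For brevity,
    // no singularity check is performed.
    std::vector<size_t> lu_unblocked( blaze::DynamicMatrix<double>& A )
    {
       const size_t n( A.rows() );
       std::vector<size_t> piv( n );
       for( size_t j=0UL; j<n; ++j ) {
          size_t p( j );  // Select the pivot row: largest absolute value in column j
          for( size_t i=j+1UL; i<n; ++i ) {
             if( std::abs( A(i,j) ) > std::abs( A(p,j) ) ) p = i;
          }
          piv[j] = p;
          if( p != j ) {
             for( size_t k=0UL; k<n; ++k ) std::swap( A(j,k), A(p,k) );  // Row interchange
          }
          for( size_t i=j+1UL; i<n; ++i ) {
             A(i,j) /= A(j,j);  // Multipliers, i.e. the j-th column of L
             for( size_t k=j+1UL; k<n; ++k ) {
                A(i,k) -= A(i,j) * A(j,k);  // Rank-1 update of the trailing submatrix
             }
          }
       }
       return piv;
    }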

  4. Klaus Iglberger reporter

    Thanks a lot for the links. However, it appears that integrating your LU implementation would not have many benefits for Blaze:

    • As you noted, the LAPACK implementations are constantly updated and can be expected to run at maximum performance. Replacing the LAPACK implementation with a manual implementation of the same performance would thus not improve performance (i.e. the user experience), but would introduce a high initial implementation effort and a constant maintenance effort (for continuous modernization). This effort is only justifiable if the manual implementation initially outperforms the LAPACK implementation.
    • As the vendors update their LAPACK implementations, it is to be expected that the parallelization will improve as well. Thus using LAPACK provides parallel implementations of all these algorithms without additional effort.
    • Blaze currently uses more than 30 LAPACK functions (starting from the upcoming 2.6 release), among them the functions for the LU, LLH, LDLT, LDLH, QR, RQ, QL, and LQ decompositions. Replacing a single one of these functions by a manual implementation would not help to get rid of the dependency on LAPACK. We would need manual implementations of all LAPACK functions, including the corresponding substitution, inversion, and solver routines, to be independent of LAPACK.

    Still, thanks a lot for your offer to provide us with a fast implementation of the LU decomposition. It is highly appreciated.

  5. Michael Lehn

    No problem at all. It's still useful for me, with respect to teaching, to have examples of other C++ libraries around.

    I still have two questions on how to benchmark the internal matrix-matrix multiplication (MMM) of Blaze:

    1) I saw that I can disable the external BLAS backend via the BLAZE_BLAS_MODE macro. In the benchmark program I adapted the call to the matrix product as follows:

    #if     BLAZE_BLAS_MODE==1
            blaze::dgemm(C2, A, B, alpha, beta);
    #else
            C2 = beta*C2 + alpha*A*B;
    #endif
    

    Is there an implicit overhead if I code the expression like that?

    2) What further optimization flags should I use? At the moment I compile with

    g++-5.3 -Ofast -mavx -DNDEBUG -std=c++11  -DBLAZE_BLAS_MODE=0 -I /path/to/blaze  bench_gemm.cc
    

    With this I only reach about 60 percent of Intel MKL's performance, and I am even below the micro-kernel using GCC extensions. Those are the current benchmark results for the matrix-matrix product.

    About the LAPACK dependencies: regarding these 30 LAPACK functions, I think I implemented all of them in FLENS, which is also BSD licensed. So porting them to Blaze should not be a big problem. Here is a subset of the LAPACK functions I coded.

  6. Klaus Iglberger reporter

    The expression

    C2 = beta*C2 + alpha*A*B;
    

    doesn't have a separate, manually optimized kernel. In more detail, the beta scalar prevents a direct computation. Instead, the expression is computed as

    C2 *= beta;
    C2 += alpha*A*B;
    

    There are only kernels for the following expressions:

    C2 = C2 + alpha*A*B;
    C2 += alpha*A*B;
    // ... and similar
    

    The optimization flags are fine. The performance difference is due to the lack of a specifically optimized kernel.
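    One way to adapt the benchmark so that the non-BLAS path also hits a manually optimized kernel (a sketch based on the kernels listed above; splitting off the scaling step is my suggestion, not a measured recommendation):

    #if BLAZE_BLAS_MODE == 1
            blaze::dgemm( C2, A, B, alpha, beta );
    #else
            C2 *= beta;           // Explicit scaling; no dedicated kernel needed
            C2 += alpha * A * B;  // Maps directly to a manually optimized kernel
    #endif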

    P.S.: We would appreciate it if this ticket were used for issue-related matters only (as, for instance, your initial post). This discussion no longer has anything to do with solving linear systems of equations. Thus we would be grateful if we could continue the discussion offline.

  7. Klaus Iglberger reporter

    Summary

    The feature has been implemented, tested, optimized, and documented as required. It is immediately available by cloning the Blaze repository and will be officially released in Blaze 3.7.

    Linear Systems

    The solve() function computes a solution for the given dense linear system of equations (LSE) A*x=b, where A is the given system matrix, x is the solution vector, and b is the given dense right-hand side vector:

    blaze::DynamicMatrix<double> A;  // The square general system matrix
    blaze::DynamicVector<double> b;  // The right-hand side vector
    // ... Resizing and initialization
    
    blaze::DynamicVector<double> x;  // The solution vector
    
    solve( A, x, b );   // Computing the solution x
    x = solve( A, b );  // Alternative syntax
    

    Alternatively, solve() computes a solution for the given dense LSE A*X=B, where A is the given dense system matrix, the columns of X are the solution vectors, and the columns of B are the given right-hand side vectors:

    blaze::DynamicMatrix<double> A;  // The square general system matrix
    blaze::DynamicMatrix<double> B;  // The right-hand side matrix
    // ... Resizing and initialization
    
    blaze::DynamicMatrix<double> X;  // The solution matrix
    
    solve( A, X, B );   // Computing the solutions X
    X = solve( A, B );  // Alternative syntax
    

    Both solve() functions automatically select the most suitable direct solver algorithm depending on the size and type of the given system matrix. For small matrices of up to 6x6, both functions use manually optimized kernels for maximum performance. For matrices larger than 6x6 the computation is performed by means of the most suitable LAPACK solver method (see Linear System Solver).
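    For instance, the structure information can be carried by the matrix type itself. A small sketch (my own illustration, assuming Blaze's SymmetricMatrix adaptor is used so that the symmetry is known at compile time):

    blaze::SymmetricMatrix< blaze::DynamicMatrix<double> > A;  // Symmetry is part of the type
    blaze::DynamicVector<double> b;
    // ... Resizing and initialization

    blaze::DynamicVector<double> x;

    solve( A, x, b );  // The compile-time symmetry information guides the solver selection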

    In case the type of the matrix does not provide additional compile time information about its structure (symmetric, lower, upper, diagonal, ...), the information can be provided manually by means of Declaration Operations when calling the solve() functions:

    blaze::DynamicMatrix<double> A;  // The square system matrix
    blaze::DynamicVector<double> b;  // The right-hand side vector
    // ... Resizing and initialization
    
    blaze::DynamicVector<double> x;  // The solution vector
    
    solve( declsym( A ), x, b );     // Solving the LSE with a symmetric system matrix
    solve( declherm( A ), x, b );    // Solving the LSE with a Hermitian system matrix
    solve( decllow( A ), x, b );     // Solving the LSE with a lower system matrix
    solve( declunilow( A ), x, b );  // Solving the LSE with a unilower system matrix
    solve( declupp( A ), x, b );     // Solving the LSE with an upper system matrix
    solve( decluniupp( A ), x, b );  // Solving the LSE with a uniupper system matrix
    solve( decldiag( A ), x, b );    // Solving the LSE with a diagonal system matrix
    

    For both solve() functions the computation fails if ...

    • ... the given matrix is not a square matrix;
    • ... the size of the right-hand side vector doesn't match the dimensions of the system matrix;
    • ... the number of rows of the right-hand side matrix doesn't match the dimensions of the system matrix;
    • ... the given matrix is singular and not invertible.

    In all failure cases, either a compilation error is created (if the failure can be predicted at compile time) or a std::invalid_argument exception is thrown.
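    As an illustration of the runtime failure case, here is a minimal sketch (my own example; the zero matrix is just one convenient instance of a singular system matrix):

    blaze::DynamicMatrix<double> A( 3UL, 3UL, 0.0 );  // Singular: the zero matrix is not invertible
    blaze::DynamicVector<double> b( 3UL, 1.0 );
    blaze::DynamicVector<double> x;
    
    try {
       solve( A, x, b );  // Fails at runtime since A is singular
    }
    catch( const std::invalid_argument& ex ) {
       // Handle the error; note that x may already have been modified (see below)
    }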

    Note that the solve() functions can only be used for dense matrices with float, double, complex<float> or complex<double> element type. The attempt to call the function with matrices of any other element type or with a sparse matrix results in a compile time error!

    Also note that the functions may make use of LAPACK kernels. Thus the functions can only be used if a fitting LAPACK library is available and linked to the executable. Otherwise a linker error will be created.
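    For example, with a reference LAPACK installation a build line could look as follows (an assumption for illustration, not official build documentation; the file name, include path, and library names depend on the local setup):

    g++ -O2 -std=c++14 -DNDEBUG -I /path/to/blaze solver.cc -llapack -lblas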

    Furthermore, it is important to note that it is not possible to use any kind of view on the expression object returned by the two-argument solve() functions. Also, it is not possible to access individual elements via the subscript operator or function call operator on the expression object:

    row( solve( A, b ), 2UL );  // Compilation error: Views cannot be used on a solve() expression!
    solve( A, b )[2];           // Compilation error: It is not possible to access individual elements!
    
    rows( solve( A, B ), { 2UL, 4UL } );  // Compilation error: Views cannot be used on a solve() expression!
    solve( A, B )(1,2);                   // Compilation error: It is not possible to access individual elements!
    

    Finally, please note that the solve() functions do not provide any exception safety guarantee, i.e., in case an exception is thrown, the solution vector or matrix may already have been modified.
