Provide support for solving dense linear systems of equations

Issue #30 resolved
Klaus Iglberger created an issue

Description

So far the Blaze library has been missing the functionality to solve dense linear systems of equations. With the recent addition of dense matrix inversion this task can now be handled, but unfortunately not as efficiently and conveniently as possible. Thus Blaze should provide an API to compute the solution of dense linear systems of equations. The API should both provide an automatic selection of the most efficient way to solve the problem and enable a user to manually select the appropriate algorithm.

Conceptual Example

blaze::DynamicMatrix<double,blaze::columnMajor> A, B, X;
blaze::DynamicVector<double,blaze::columnVector> x, b;

// ... Resizing and initialization

// Compute the solution of the LSE with automatic selection of the decomposition algorithm
solve( A * x = b );

// Compute the solution of the LSE by means of an LU decomposition
solve<byLU>( A * x = b );

// Compute the solutions of the LSE by means of a Cholesky (LLH) decomposition
solve<byLLH>( A * X = B );

Tasks

  • design an API for computing the solution of dense linear systems of equations
  • provide all necessary LAPACK wrapper functions
  • provide a full documentation of the feature
  • ensure compatibility with all existing matrix classes
  • ensure compatibility with all existing matrix expressions
  • guarantee maximum performance for all matrix types
  • add the necessary number of test cases for the entire functionality

Comments (11)

  1. Klaus Iglberger reporter

    Thanks for the interesting link. We would be very interested in learning more about this implementation. Do you have a reference to a performance comparison between your implementation and LAPACK and/or PLASMA, including some information on the CPU used, to enable us to determine the expected performance? Since we also need other decomposition algorithms: does this work also provide implementations of the Cholesky (LLH), LDLT, and LDLH decompositions? Again, thanks a lot for the link; any help is very much appreciated.

  2. Michael Lehn

    The implementation of lu_blocked is identical to LAPACK's. So using this C++ code and calling BLAS from Intel MKL/ATLAS/OpenBLAS, you get exactly the same performance as from LAPACK with Intel MKL/ATLAS/OpenBLAS.

    The implementation of lu_blocked_recursive is more modern. Using BLAS from Intel MKL, you get the same performance as from the Intel MKL implementation of dgetrf. These recursive algorithms were only recently published by the usual suspects around Jack Dongarra. Older implementations of Intel MKL (1-2 years ago) could be beaten with this approach, so I guess they added it only recently.

  3. Michael Lehn

    And yes, for the Cholesky and QR factorizations, recursive algorithms are superior to those from traditional LAPACK. Their performance all depends on a fast matrix-matrix product. Due to the recursion, the algorithm becomes rich in matrix-matrix products whose inner dimensions are rather large and close to the outer dimensions. The lu_blocked_recursive algorithm is just one example of this class of algorithms...

    Please note that there is more than just one variant of this algorithm. In The Science of Programming Matrix Computations you get an overview of different blocked and unblocked algorithms for the LU, LLT, and QR factorizations. All the blocked variants can be turned into recursively blocked variants.
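    For reference, here is a minimal sketch (my own illustration, not the FLENS or LAPACK code) of the classic unblocked LU factorization with partial pivoting. The blocked variants group the trailing rank-1 updates of exactly this loop into large matrix-matrix products, which is where the performance comes from:

    #include <blaze/Math.h>
    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Factorizes the square matrix A in-place into a unit lower triangular L and an
    // upper triangular U (both stored in A); returns the row pivots. For brevity,
    // no singularity check is performed.
    std::vector<size_t> lu_unblocked( blaze::DynamicMatrix<double>& A )
    {
       const size_t n( A.rows() );
       std::vector<size_t> piv( n );
       for( size_t j=0UL; j<n; ++j ) {
          size_t p( j );  // Select the pivot row: largest absolute value in column j
          for( size_t i=j+1UL; i<n; ++i ) {
             if( std::abs( A(i,j) ) > std::abs( A(p,j) ) ) p = i;
          }
          piv[j] = p;
          if( p != j ) {
             for( size_t k=0UL; k<n; ++k ) std::swap( A(j,k), A(p,k) );  // Row interchange
          }
          for( size_t i=j+1UL; i<n; ++i ) {
             A(i,j) /= A(j,j);  // Multipliers, i.e. the j-th column of L
             for( size_t k=j+1UL; k<n; ++k ) {
                A(i,k) -= A(i,j) * A(j,k);  // Rank-1 update of the trailing submatrix
             }
          }
       }
       return piv;
    }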

  4. Klaus Iglberger reporter

    Thanks a lot for the links. However, it appears that integrating your LU implementation would not have many benefits for Blaze:

    • As you noted, the LAPACK implementations are constantly updated and can be expected to run at maximum performance. Replacing the LAPACK implementation with a manual implementation of the same performance would thus not improve performance (i.e. the user experience), but would introduce a high initial implementation effort and a constant maintenance effort (for continuous modernization). This effort is only justifiable if the manual implementation initially outperforms the LAPACK implementation.
    • As the vendors update their LAPACK implementations, it is to be expected that the parallelization will improve as well. Thus using LAPACK provides parallel implementations of all these algorithms without additional effort.
    • Blaze currently uses more than 30 LAPACK functions (starting from the upcoming 2.6 release), among them the functions for the LU, LLH, LDLT, LDLH, QR, RQ, QL, and LQ decompositions. Replacing a single one of these functions by a manual implementation would not help to get rid of the dependency on LAPACK. We would need manual implementations of all LAPACK functions, including the corresponding substitution, inversion, and solver routines, to be independent of LAPACK.

    Still, thanks a lot for your offer to provide us with a fast implementation of the LU decomposition. It is highly appreciated.

  5. Michael Lehn

    No problem at all. It's still useful for me, with respect to teaching, to have examples of other C++ libraries around.

    I still have two questions on how to benchmark the internal matrix-matrix multiplication (MMM) of Blaze:

    1) I saw that I can disable the external BLAS backend via the BLAZE_BLAS_MODE macro. In the benchmark program I adapted the call to the matrix product as follows:

    #if     BLAZE_BLAS_MODE==1
            blaze::dgemm(C2, A, B, alpha, beta);
    #else
            C2 = beta*C2 + alpha*A*B;
    #endif
    

    Is there an implicit overhead if I code the expression like that?

    2) What further optimization flags should I use? At the moment I compile with

    g++-5.3 -Ofast -mavx -DNDEBUG -std=c++11  -DBLAZE_BLAS_MODE=0 -I /path/to/blaze  bench_gemm.cc
    

    With this I only reach about 60 percent of Intel MKL's performance, and I am even below the micro-kernel using GCC extensions. Those are the current benchmark results for the matrix-matrix product.

    About the LAPACK dependencies: regarding these 30 LAPACK functions, I think I implemented all of them in FLENS, which is also BSD licensed. So porting them to Blaze should not be a big problem. Here is a subset of the LAPACK functions I coded.

  6. Klaus Iglberger reporter

    The expression

    C2 = beta*C2 + alpha*A*B;
    

    doesn't have a separate, manually optimized kernel. In more detail, the beta scalar prevents a direct computation. Instead, the expression is computed as

    C2 *= beta;
    C2 += alpha*A*B;
    

    There are only kernels for the following expressions:

    C2 = C2 + alpha*A*B;
    C2 += alpha*A*B;
    // ... and similar
    

    The optimization flags are fine. The performance difference is due to the lack of a specifically optimized kernel.
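    One way to adapt the benchmark so that the non-BLAS path also hits a manually optimized kernel (a sketch based on the kernels listed above; splitting off the scaling step is my suggestion, not a measured recommendation):

    #if BLAZE_BLAS_MODE == 1
            blaze::dgemm( C2, A, B, alpha, beta );
    #else
            C2 *= beta;           // Explicit scaling; no dedicated kernel needed
            C2 += alpha * A * B;  // Maps directly to a manually optimized kernel
    #endif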

    P.S.: We would appreciate it if this ticket were used for issue-related matters only (as, for instance, your initial post). This discussion no longer has anything to do with solving linear systems of equations. Thus we would be grateful if we could continue the discussion offline.

  7. Klaus Iglberger reporter

    Summary

    The feature has been implemented, tested, optimized, and documented as required. It is immediately available by cloning the Blaze repository and will be officially released in Blaze 3.7.

    Linear Systems

    The solve() function computes a solution for the given dense linear system of equations (LSE) A*x=b, where A is the given system matrix, x is the solution vector, and b is the given dense right-hand side vector:

    blaze::DynamicMatrix<double> A;  // The square general system matrix
    blaze::DynamicVector<double> b;  // The right-hand side vector
    // ... Resizing and initialization
    
    blaze::DynamicVector<double> x;  // The solution vector
    
    solve( A, x, b );   // Computing the solution x
    x = solve( A, b );  // Alternative syntax
    

    Alternatively, solve() computes a solution for the given dense LSE A*X=B, where A is the given dense system matrix, the columns of X are the solution vectors, and the columns of B are the given right-hand side vectors:

    blaze::DynamicMatrix<double> A;  // The square general system matrix
    blaze::DynamicMatrix<double> B;  // The right-hand side matrix
    // ... Resizing and initialization
    
    blaze::DynamicMatrix<double> X;  // The solution matrix
    
    solve( A, X, B );   // Computing the solutions X
    X = solve( A, B );  // Alternative syntax
    

    Both solve() functions automatically select the most suitable direct solver algorithm depending on the size and type of the given system matrix. For small matrices of up to 6x6, both functions use manually optimized kernels for maximum performance. For matrices larger than 6x6 the computation is performed by means of the most suitable LAPACK solver method (see Linear System Solver).
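    For instance, the structure information can be carried by the matrix type itself. A small sketch (my own illustration, assuming Blaze's SymmetricMatrix adaptor is used so that the symmetry is known at compile time):

    blaze::SymmetricMatrix< blaze::DynamicMatrix<double> > A;  // Symmetry is part of the type
    blaze::DynamicVector<double> b;
    // ... Resizing and initialization

    blaze::DynamicVector<double> x;

    solve( A, x, b );  // The compile-time symmetry information guides the solver selection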

    In case the type of the matrix does not provide additional compile time information about its structure (symmetric, lower, upper, diagonal, ...), the information can be provided manually by means of Declaration Operations when calling the solve() functions:

    blaze::DynamicMatrix<double> A;  // The square system matrix
    blaze::DynamicVector<double> b;  // The right-hand side vector
    // ... Resizing and initialization
    
    blaze::DynamicVector<double> x;  // The solution vector
    
    solve( declsym( A ), x, b );     // Solving the LSE with a symmetric system matrix
    solve( declherm( A ), x, b );    // Solving the LSE with a Hermitian system matrix
    solve( decllow( A ), x, b );     // Solving the LSE with a lower system matrix
    solve( declunilow( A ), x, b );  // Solving the LSE with a unilower system matrix
    solve( declupp( A ), x, b );     // Solving the LSE with an upper system matrix
    solve( decluniupp( A ), x, b );  // Solving the LSE with a uniupper system matrix
    solve( decldiag( A ), x, b );    // Solving the LSE with a diagonal system matrix
    

    For both solve() functions the computation fails if ...

    • ... the given matrix is not a square matrix;
    • ... the size of the right-hand side vector doesn't match the dimensions of the system matrix;
    • ... the number of rows of the right-hand side matrix doesn't match the dimensions of the system matrix;
    • ... the given matrix is singular and not invertible.

    In all failure cases, either a compilation error is created (if the failure can be predicted at compile time) or a std::invalid_argument exception is thrown.
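    As an illustration of the runtime failure case, here is a minimal sketch (my own example; the zero matrix is just one convenient instance of a singular system matrix):

    blaze::DynamicMatrix<double> A( 3UL, 3UL, 0.0 );  // Singular: the zero matrix is not invertible
    blaze::DynamicVector<double> b( 3UL, 1.0 );
    blaze::DynamicVector<double> x;
    
    try {
       solve( A, x, b );  // Fails at runtime since A is singular
    }
    catch( const std::invalid_argument& ex ) {
       // Handle the error; note that x may already have been modified (see below)
    }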

    Note that the solve() functions can only be used for dense matrices with float, double, complex<float> or complex<double> element type. The attempt to call the function with matrices of any other element type or with a sparse matrix results in a compile time error!

    Also note that the functions may make use of LAPACK kernels. Thus the functions can only be used if a fitting LAPACK library is available and linked to the executable. Otherwise a linker error will be created.
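    For example, with a reference LAPACK installation a build line could look as follows (an assumption for illustration, not official build documentation; the file name, include path, and library names depend on the local setup):

    g++ -O2 -std=c++14 -DNDEBUG -I /path/to/blaze solver.cc -llapack -lblas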

    Furthermore, it is important to note that it is not possible to use any kind of view on the expression object returned by the two-argument solve() functions. Also, it is not possible to access individual elements via the subscript operator or function call operator on the expression object:

    row( solve( A, b ), 2UL );  // Compilation error: Views cannot be used on a solve() expression!
    solve( A, b )[2];           // Compilation error: It is not possible to access individual elements!
    
    rows( solve( A, B ), { 2UL, 4UL } );  // Compilation error: Views cannot be used on a solve() expression!
    solve( A, B )(1,2);                   // Compilation error: It is not possible to access individual elements!
    

    Finally, please note that the solve() functions do not provide any exception safety guarantee, i.e., in case an exception is thrown, the solution vector or matrix may already have been modified.
