Replace uBLAS backend with Eigen3
DOLFIN currently depends on both uBLAS and Eigen3, but Eigen3 is richer in features and better performing, and it provides sparse data structures.
Replace the uBLAS classes with Eigen3 for improved performance and features.
Comments (22)
-
reporter -
Some performance figures for the EigenLUSolver compared to the uBLASKrylovSolver for a 200x200 UnitSquareMesh solving Poisson:

Summary of timings          |  Average time  Total time  Reps
--------------------------------------------------------------
Eigen LU solver (cholesky)  |  0.44687       0.44687     1
Eigen LU solver (cholmod)   |  0.34772       0.34772     1
Eigen LU solver (sparselu)  |  0.91114       0.91114     1
Eigen LU solver (umfpack)   |  0.67976       0.67976     1
uBLAS gmres                 |  2.8954        2.8954      1
PETSc mumps                 |  0.9546        0.9546      1
PETSc pastix                |  1.3718        1.3718      1
PETSc petsc                 |  0.60991      0.60991      1
PETSc superlu_dist          |  0.66945      0.66945      1
PETSc umfpack               |  0.66618      0.66618      1
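For reference, a minimal sketch of how such a timing run might look from the legacy DOLFIN Python interface. The backend name "Eigen" and the LU method string "sparselu" are assumptions based on the names in this thread, not confirmed API, and the list_timings() signature varies between DOLFIN versions:

    from dolfin import *

    # Select the (proposed) Eigen linear algebra backend; "uBLAS" or "PETSc"
    # could be substituted here to time the other backends.
    parameters["linear_algebra_backend"] = "Eigen"

    mesh = UnitSquareMesh(200, 200)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v))*dx
    L = Constant(1.0)*v*dx
    bc = DirichletBC(V, Constant(0.0), "on_boundary")

    A, b = assemble_system(a, L, bc)
    uh = Function(V)

    # Time a single LU solve; the method name is a guess at the Eigen
    # backend's naming, mirroring the table above.
    t = Timer("Eigen LU solver (sparselu)")
    solver = LUSolver(A, "sparselu")
    solver.solve(uh.vector(), b)
    t.stop()

    list_timings()  # signature differs in newer DOLFIN versions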
-
This seems to compare different methods, rather than different backends. What about Eigen gmres for example?
-
uBLAS does not have an LUSolver, and the built-in solver uBLASMatrix::solve() is so slow that it does not finish in a reasonable time (minutes). I haven't implemented an EigenKrylovSolver yet.
-
reporter LU solver speed is not so interesting, because Eigen has an interface to all the 'good' LU solvers. What is interesting is how assembly and mat-vec speed compare to uBLAS.
-
They seem to be comparable. Making those LU solvers available for serial use, without requiring PETSc, is pretty useful.
Eigen: Assemble system  |  0.12315   0.12315   1
Eigen mult              |  0.7103    0.7103    1
uBLAS: Assemble system  |  0.12659   0.12659   1
uBLAS mult              |  0.57375   0.57375   1
-
reporter I'm surprised by the mat-vec. How big is the matrix? Eigen is usually reported to be many times faster than uBLAS.
-
That's 1000 iterations on a 40000x40000 matrix. Maybe my implementation needs optimising. There are also various issues around RowMajor (preferred by DOLFIN) and ColMajor (preferred by Eigen) storage.
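For context, a rough sketch of the kind of mat-vec timing loop being discussed, assuming the legacy DOLFIN Python interface and that both backends are registered as "Eigen" and "uBLAS"; the iteration count and problem size mirror this comment, and the RowMajor/ColMajor storage choice lives inside the C++ backend and is not visible at this level:

    from dolfin import *

    for backend in ["Eigen", "uBLAS"]:
        parameters["linear_algebra_backend"] = backend

        # Same Poisson setup as above: a 200x200 mesh gives roughly a
        # 40000x40000 matrix (one dof per vertex for P1 elements).
        mesh = UnitSquareMesh(200, 200)
        V = FunctionSpace(mesh, "Lagrange", 1)
        u, v = TrialFunction(V), TestFunction(V)
        A = assemble(inner(grad(u), grad(v))*dx)

        x = Function(V).vector()
        x[:] = 1.0

        t = Timer("%s mult" % backend)
        for i in range(1000):
            y = A*x  # sparse mat-vec through the selected backend
        t.stop()

    list_timings()  # signature differs in newer DOLFIN versions
-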
How come the timings for assembly are so close? Is it dominated by something else, or are the implementations that similar?
-
@logg - yes, I was thinking the same, but was typing away at EigenKrylovSolver, so I haven't had time to investigate yet...
-
Yes, the assembly time is all taken up in SystemAssembler::cell_wise_assembly(), not in the LA backends.
-
What equation is it? Make sure it's something simple like Poisson, and perhaps do the timings with the regular assembler only for the matrix to get the cost of insertion.
-
Better to use the mass matrix than Poisson.
-
Yes, even better.
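To make the suggestion concrete, a sketch of timing bare matrix insertion with the regular assembler and a mass matrix, assuming the legacy DOLFIN Python interface (mesh size arbitrary):

    from dolfin import *

    parameters["linear_algebra_backend"] = "Eigen"

    mesh = UnitSquareMesh(200, 200)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    m = u*v*dx  # mass matrix form: cheap tabulation, so insertion dominates

    t = Timer("Assemble mass matrix")
    M = assemble(m)
    t.stop()

    list_timings()  # signature differs in newer DOLFIN versions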
-
Another option is of course to use a profiler. I bet we could find some overhead in the assembler.
-
reporter @martinal We've done this before, which is why we added mesh re-ordering to improve data locality; the ongoing work in https://bitbucket.org/fenics-project/dolfin/branch/garth/fix-issue-350-dofmap-type is about getting better cache performance when fetching dofmaps.
-
I think assemble_system() is rather inefficient. Using assemble() and bc.apply() is a bit faster. Interestingly, the BC apply seems slow for uBLAS.

uBLAS: Assemble cells    |  0.028031   0.056061   2
DirichletBC apply        |  0.014973   0.029946   2
DirichletBC compute bc   |  0.005567   0.011134   2
DirichletBC init facets  |  0.0051221  0.010244   2
Eigen: Assemble cells    |  0.023649   0.047298   2
DirichletBC apply        |  0.0056521  0.011304   2
DirichletBC compute bc   |  0.0055655  0.011131   2
DirichletBC init facets  |  0.0051165  0.010233   2
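For reference, a sketch of the two code paths being compared, assuming the legacy DOLFIN Python interface and the same Poisson setup as earlier:

    from dolfin import *

    mesh = UnitSquareMesh(200, 200)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v))*dx
    L = Constant(1.0)*v*dx
    bc = DirichletBC(V, Constant(0.0), "on_boundary")

    # Path 1: assembly and BC application in one symmetric pass
    A, b = assemble_system(a, L, bc)

    # Path 2: plain assembly followed by BC application
    A2 = assemble(a)
    b2 = assemble(L)
    bc.apply(A2, b2)

Note that assemble_system() applies the Dirichlet conditions symmetrically, whereas bc.apply() only modifies rows, so the two paths are not fully interchangeable.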
-
Nice. When that's finished we can profile again and perhaps revisit ufc signatures and the UFC class.
-
reporter I'm getting quite different assembly timings. Assembling the matrix for Poisson on a 1024x1024 mesh:
uBLAS
1st assembly: 1.68275s
2nd assembly: 0.759649s
Eigen
1st assembly: 1.37295s
2nd assembly: 0.445719s
PETSc
1st assembly: 1.34851s
2nd assembly: 0.68317s
Have you used the NDEBUG flag for Eigen and turned optimisations on?
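As an aside on the first-versus-second assembly gap: the first assembly includes building the sparsity pattern and initialising the tensor, which can be separated out by reassembling into an existing matrix. A sketch assuming the legacy DOLFIN Python interface (the NDEBUG and optimisation flags mentioned above are C++ compile-time settings for the DOLFIN/Eigen build itself, typically set by a Release build, and do not appear at this level):

    from dolfin import *

    mesh = UnitSquareMesh(1024, 1024)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v))*dx

    # 1st assembly: includes building the sparsity pattern
    t1 = Timer("1st assembly")
    A = assemble(a)
    t1.stop()

    # 2nd assembly: reuse the already-initialised tensor
    t2 = Timer("2nd assembly")
    assemble(a, tensor=A)
    t2.stop()

    list_timings()  # signature differs in newer DOLFIN versions
-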
@garth-wells - clearly not... Looking good now.
-
reporter - changed status to resolved
Implemented in 9bc0423.
-
- removed milestone
Removing milestone: 1.6 (automated comment)