Performance regression in KSPSolve with BoomerAMG

Issue #734 invalid
Jack Hale created an issue

I use the following script to crudely assess the performance of builds on our HPC cluster.

https://bitbucket.org/unilucompmech/fenics-gaia-cluster/src/5a41cec13f2b0cbc492c24717c6bf354afea6f24/poisson.py?at=master&fileviewer=file-view-default

It solves a Poisson problem with CG preconditioned by BoomerAMG, using some options gleaned from @chris_richardson. Everything else (refinement, assembly, etc.) is much faster in 2016.1.0, so good work on the whole!

However, I have noticed a severe performance regression in the KSPSolve stage moving from 1.6.0 (old) to 2016.1.0 (new). I use PETSc 3.7.3 (new) with 2016.1.0 and PETSc 3.6.4 (old) with 1.6.0, but I can create a similar regression even if I compile DOLFIN 2016.1.0 against PETSc 3.6.4 on our cluster. I can also recreate a similar regression in Docker on my laptop (dual-core i7 Haswell), comparing quay.io/fenicsproject/stable:2016.1.0.r4 against quay.io/fenicsproject/stable:1.6.0. I include results below which you can reproduce in the Docker containers.

On my laptop (1 core, Docker), KSPSolve takes ~30 seconds with new FEniCS + new PETSc and ~16 seconds with old FEniCS + old PETSc.

Running on our cluster (8 cores, single node), KSPSolve takes ~100 seconds with new FEniCS + new PETSc and ~24 seconds with old FEniCS + old PETSc. I also get ~100 seconds with new FEniCS + old PETSc. I tried various MPI implementations here too (MPICH, OpenMPI, Intel MPI).
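For reference, the BoomerAMG options that appear in the PETSc option tables below are pushed into the PETSc options database via DOLFIN's PETScOptions interface before the solve. This is only a minimal sketch of that step, not the linked poisson.py itself:

    # Minimal sketch (assumed setup, not the actual poisson.py): set the
    # BoomerAMG options shown in the PETSc option tables below before the
    # solver is configured.
    from dolfin import PETScOptions

    PETScOptions.set("ksp_monitor")
    PETScOptions.set("ksp_view")
    PETScOptions.set("pc_hypre_boomeramg_print_debug", 1)
    PETScOptions.set("pc_hypre_boomeramg_strong_threshold", 0.5)
    PETScOptions.set("pc_hypre_boomeramg_agg_nl", 4)
    PETScOptions.set("pc_hypre_boomeramg_agg_num_paths", 2)
    PETScOptions.set("pc_hypre_boomeramg_truncfactor", 0.9)
    PETScOptions.set("pc_hypre_boomeramg_P_max", 5)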

quay.io/fenicsproject/stable:1.6.0

Passing options to PETSc: -ksp_monitor -ksp_view -pc_hypre_boomeramg_print_debug 1
Z Initial Nx: 10
Z Initial Ny: 10
Z Initial Nz: 20
Refinement 1...
Refinement 2...
Refinement 3...
Finished refinement
Z Degrees of freedom:                  1056321
Z Degrees of freedom per core: 1056321
Building sparsity pattern over cells [======>                         ] 21.3%
Building sparsity pattern over cells [===============>                ] 46.9%
Building sparsity pattern over cells [=====================>          ] 68.3%
Building sparsity pattern over cells [==============================> ] 93.9%
Building sparsity pattern over cells [================================] 100.0%
Assembling matrix over cells [=====>                                  ] 14.9%
Assembling matrix over cells [===========>                            ] 27.7%
Assembling matrix over cells [================>                       ] 40.5%
Assembling matrix over cells [=====================>                  ] 53.3%
Assembling matrix over cells [==========================>             ] 66.1%
Assembling matrix over cells [===============================>        ] 78.9%
Assembling matrix over cells [====================================>   ] 91.7%
Assembling matrix over cells [========================================] 100.0%
Assembling vector over cells [==========>                             ] 25.6%
Assembling vector over cells [====================>                   ] 51.2%
Assembling vector over cells [================================>       ] 81.1%
Assembling vector over cells [========================================] 100.0%
Computing sub domain markers [============>                           ] 31.8%
Computing sub domain markers [============================>           ] 70.0%
Computing sub domain markers [========================================] 100.0%
Applying boundary conditions to linear system.
<Parameter set "linear_solver" containing 8 parameter(s) and 2 nested parameter set(s)>

  linear_solver            |    type  value          range  access  change
  ------------------------------------------------------------------------
  absolute_tolerance       |  double  1e-15             []       1       0
  divergence_limit         |  double  10000             []       1       0
  error_on_nonconvergence  |    bool   true  {true, false}       1       0
  maximum_iterations       |     int  10000             []       1       0
  monitor_convergence      |    bool  false  {true, false}       1       0
  nonzero_initial_guess    |    bool  false  {true, false}       1       0
  relative_tolerance       |  double  1e-10             []       1       1
  report                   |    bool   true  {true, false}       1       0

  <Parameter set "gmres" containing 1 parameter(s) and 0 nested parameter set(s)>

    gmres    |  type  value  range  access  change
    ----------------------------------------------
    restart  |   int     30     []       1       0

  <Parameter set "preconditioner" containing 3 parameter(s) and 2 nested parameter set(s)>

    preconditioner  |    type                      value                                                    range  access  change
    -----------------------------------------------------------------------------------------------------------------------------
    report          |    bool                      false                                            {true, false}       1       0
    shift_nonzero   |  double                          0                                                       []       1       0
    structure       |  string  different_nonzero_pattern  [different_nonzero_pattern, same, same_nonzero_pattern]       1       0

    <Parameter set "ilu" containing 1 parameter(s) and 0 nested parameter set(s)>

      ilu         |  type  value  range  access  change
      -------------------------------------------------
      fill_level  |   int      0     []       1       0

    <Parameter set "schwarz" containing 1 parameter(s) and 0 nested parameter set(s)>

      schwarz  |  type  value  range  access  change
      ----------------------------------------------
      overlap  |   int      1     []       1       0
Solving linear system of size 1056321 x 1056321 (PETSc Krylov solver).
PETSc Krylov solver starting to solve 1056321 x 1056321 system.
Proc = 0    Level = 0    Build Interp Time = 0.630000
Proc = 0    Level = 0    Build Coarse Operator Time = 0.780000
Proc = 0    Level = 1    Build Interp Time = 0.810000
Proc = 0    Level = 1    Build Coarse Operator Time = 0.680000
Proc = 0    Level = 2    Build Interp Time = 0.220000
Proc = 0    Level = 2    Build Coarse Operator Time = 0.280000
Proc = 0    Level = 3    Build Interp Time = 0.050000
Proc = 0    Level = 3    Build Coarse Operator Time = 0.100000
Proc = 0    Level = 4    Coarsen Time = 0.010000
Proc = 0    Level = 4    Build Interp Time = 0.000000
Proc = 0    Level = 4    Build Coarse Operator Time = 0.020000
Proc = 0    Level = 5    Coarsen Time = 0.000000
Proc = 0    Level = 5    Build Interp Time = 0.000000
Proc = 0    Level = 5    Build Coarse Operator Time = 0.010000
Proc = 0    Level = 6    Coarsen Time = 0.000000
Proc = 0    Level = 6    Build Interp Time = 0.000000
Proc = 0    Level = 6    Build Coarse Operator Time = 0.010000
Proc = 0    Level = 7    Coarsen Time = 0.000000
Proc = 0    Level = 7    Build Interp Time = 0.000000
Proc = 0    Level = 7    Build Coarse Operator Time = 0.000000
Proc = 0    Level = 8    Coarsen Time = 0.000000
Proc = 0    Level = 8    Build Interp Time = 0.000000
Proc = 0    Level = 8    Build Coarse Operator Time = 0.000000
Proc = 0    Level = 9    Coarsen Time = 0.000000
Proc = 0    Level = 9    Build Interp Time = 0.000000
Proc = 0    Level = 9    Build Coarse Operator Time = 0.000000
Proc = 0    Level = 10    Coarsen Time = 0.000000
Proc = 0    Level = 10    Build Interp Time = 0.000000
Proc = 0    Level = 10    Build Coarse Operator Time = 0.000000
Proc = 0    Level = 11    Coarsen Time = 0.000000
Proc = 0    Level = 11    Build Interp Time = 0.000000
Proc = 0    Level = 11    Build Coarse Operator Time = 0.000000
Proc = 0    Level = 12    Coarsen Time = 0.000000
Proc = 0    Level = 12    Build Interp Time = 0.000000
Proc = 0    Level = 12    Build Coarse Operator Time = 0.000000
Proc = 0    Level = 13    Coarsen Time = 0.000000
Proc = 0    Level = 13    Build Interp Time = 0.000000
Proc = 0    Level = 13    Build Coarse Operator Time = 0.000000
  Residual norms for  solve.
  0 KSP Residual norm 5.583794485124e+01 
  1 KSP Residual norm 1.611860391018e+01 
  2 KSP Residual norm 5.305846131867e+00 
  3 KSP Residual norm 1.905306416164e+00 
  4 KSP Residual norm 7.143669810157e-01 
  5 KSP Residual norm 2.657322220626e-01 
  6 KSP Residual norm 9.758858276495e-02 
  7 KSP Residual norm 4.513775590689e-02 
  8 KSP Residual norm 2.178923305886e-02 
  9 KSP Residual norm 7.983546156009e-03 
 10 KSP Residual norm 2.613698083862e-03 
 11 KSP Residual norm 9.317568226336e-04 
 12 KSP Residual norm 3.832151753727e-04 
 13 KSP Residual norm 1.712405504300e-04 
 14 KSP Residual norm 7.479485497816e-05 
 15 KSP Residual norm 2.678717319622e-05 
 16 KSP Residual norm 9.046576049639e-06 
 17 KSP Residual norm 3.257880478991e-06 
 18 KSP Residual norm 1.141033761126e-06 
 19 KSP Residual norm 3.762399501122e-07 
 20 KSP Residual norm 1.235637444976e-07 
 21 KSP Residual norm 4.413263355892e-08 
 22 KSP Residual norm 1.755015266427e-08 
 23 KSP Residual norm 7.016483228222e-09 
 24 KSP Residual norm 2.580492440593e-09 
KSP Object:() 1 MPI processes
  type: cg
  maximum iterations=10000, initial guess is zero
  tolerances:  relative=1e-10, absolute=1e-15, divergence=10000
  left preconditioning
  using PRECONDITIONED norm type for convergence test
PC Object:() 1 MPI processes
  type: hypre
    HYPRE BoomerAMG preconditioning
    HYPRE BoomerAMG: Cycle type V
    HYPRE BoomerAMG: Maximum number of levels 25
    HYPRE BoomerAMG: Maximum number of iterations PER hypre call 1
    HYPRE BoomerAMG: Convergence tolerance PER hypre call 0
    HYPRE BoomerAMG: Threshold for strong coupling 0.5
    HYPRE BoomerAMG: Interpolation truncation factor 0.9
    HYPRE BoomerAMG: Interpolation: max elements per row 5
    HYPRE BoomerAMG: Number of levels of aggressive coarsening 4
    HYPRE BoomerAMG: Number of paths for aggressive coarsening 2
    HYPRE BoomerAMG: Maximum row sums 0.9
    HYPRE BoomerAMG: Sweeps down         1
    HYPRE BoomerAMG: Sweeps up           1
    HYPRE BoomerAMG: Sweeps on coarse    1
    HYPRE BoomerAMG: Relax down          symmetric-SOR/Jacobi
    HYPRE BoomerAMG: Relax up            symmetric-SOR/Jacobi
    HYPRE BoomerAMG: Relax on coarse     Gaussian-elimination
    HYPRE BoomerAMG: Relax weight  (all)      1
    HYPRE BoomerAMG: Outer relax weight (all) 1
    HYPRE BoomerAMG: Using CF-relaxation
    HYPRE BoomerAMG: Measure type        local
    HYPRE BoomerAMG: Coarsen type        Falgout
    HYPRE BoomerAMG: Interpolation type  classical
  linear system matrix = precond matrix:
  Mat Object:  ()   1 MPI processes
    type: seqaij
    rows=1056321, cols=1056321
    total: nonzeros=1.5585e+07, allocated nonzeros=1.5585e+07
    total number of mallocs used during MatSetValues calls =0
      not using I-node routines
PETSc Krylov solver (cg, hypre) converged in 24 iterations.
  Hypre preconditioner method: boomeramg
[MPI_AVG] Summary of timings     |    reps    wall avg    wall tot
------------------------------------------------------------------
Apply (PETScMatrix)              |       2    0.030532    0.061064
Apply (PETScVector)              |       5  5.7755e-05  0.00028878
Assemble cells                   |       2      2.9384      5.8767
Assemble exterior facets         |       1     0.27444     0.27444
Build mesh number mesh entities  |       1   3.696e-06   3.696e-06
Build sparsity                   |       2       1.178      2.3561
Delete sparsity                  |       2  1.1487e-05  2.2974e-05
DirichletBC apply                |       1      1.5554      1.5554
DirichletBC compute bc           |       1      1.5168      1.5168
DirichletBC init facets          |       1      1.4913      1.4913
Generate Box mesh                |       1   0.0018448   0.0018448
Init PETSc                       |       1   0.0015295   0.0015295
Init dof vector                  |       1   0.0070173   0.0070173
Init dofmap                      |       1      26.141      26.141
Init dofmap from UFC dofmap      |       1      1.3064      1.3064
Init tensor                      |       2    0.021542    0.043084
Krylov solver                    |       1      16.277      16.277
PETSc Krylov solver              |       1      16.276      16.276
PLAZA: Face long edge            |       3      2.6711      8.0134
PLAZA: Get tetrahedra            |  876000  4.8844e-06      4.2787
PLAZA: refine                    |       3      7.0296      21.089
SCOTCH graph ordering            |       1     0.31682     0.31682
Z Assembling...                  |       1      10.289      10.289
Z Creating mesh...               |       1   0.0024143   0.0024143
Z DirichletBC...                 |       1   0.0091192   0.0091192
Z Expression...                  |       1     0.16942     0.16942
Z Forms...                       |       1  0.00093475  0.00093475
Z FunctionSpace...               |       1      26.191      26.191
Z PETSC Options...               |       1   0.0075967   0.0075967
Z Refinement 1...                |       1     0.26389     0.26389
Z Refinement 2...                |       1      2.0922      2.0922
Z Refinement 3...                |       1      18.733      18.733
Z Solving...                     |       1      16.278      16.278
compute connectivity 0 - 1       |       3    0.015244    0.045731
compute connectivity 2 - 1       |       3      1.0456      3.1369
compute connectivity 2 - 3       |       1      5.1591      5.1591
compute entities dim = 1         |       3     0.62157      1.8647
compute entities dim = 2         |       4      4.5958      18.383
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

poisson.py on a arch-linux2-c-opt named 4fbc7e36c67a with 1 processor, by fenics Wed Aug 31 12:56:11 2016
Using Petsc Release Version 3.6.3, Dec, 03, 2015 

                         Max       Max/Min        Avg      Total 
Time (sec):           7.487e+01      1.00000   7.487e+01
Objects:              3.000e+01      1.00000   3.000e+01
Flops:                1.027e+09      1.00000   1.027e+09  1.027e+09
Flops/sec:            1.372e+07      1.00000   1.372e+07  1.372e+07
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  2.014e+07      1.00000   0.000e+00  2.014e+07
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 7.4869e+01 100.0%  1.0269e+09 100.0%  0.000e+00   0.0%  0.000e+00      100.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecTDot               48 1.0 6.8321e-02 1.0 1.01e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 10  0  0  0   0 10  0  0  0  1484
VecNorm               25 1.0 1.7067e-02 1.0 5.28e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  3095
VecCopy                2 1.0 2.6250e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                31 1.0 1.5567e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               48 1.0 7.3175e-02 1.0 1.01e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 10  0  0  0   0 10  0  0  0  1386
VecAYPX               23 1.0 4.2406e-02 1.0 4.86e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  1146
VecAssemblyBegin       5 1.0 5.9366e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         5 1.0 1.7881e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin        6 1.0 2.0504e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult               24 1.0 5.8344e-01 1.0 7.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1 70  0  0  0   1 70  0  0  0  1239
MatConvert             1 1.0 7.3389e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin       2 1.0 2.2173e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 6.0960e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 3.3379e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         1 1.0 7.1886e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                1 1.0 3.8147e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                1 1.0 3.6915e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0     0
PCApply               25 1.0 1.1789e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 16  0  0  0  0  16  0  0  0  0     0
KSPSetUp               1 1.0 1.7953e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 1.6275e+01 1.0 1.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 22100  0  0  0  22100  0  0  0    63
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0
           Index Set     6              6       108936     0
   IS L to G Mapping     6              6     25355400     0
              Vector    12             12     59172536     0
      Vector Scatter     2              2         1296     0
              Matrix     1              1    203923352     0
      Preconditioner     1              1         1488     0
       Krylov Solver     1              1         1208     0
========================================================================================================================
Average time to get PetscTime(): 3.00407e-06
#PETSc Option Table entries:
-ksp_monitor
-ksp_view
-log_summary
-pc_hypre_boomeramg_P_max 5
-pc_hypre_boomeramg_agg_nl 4
-pc_hypre_boomeramg_agg_num_paths 2
-pc_hypre_boomeramg_print_debug 1
-pc_hypre_boomeramg_strong_threshold 0.5
-pc_hypre_boomeramg_truncfactor 0.90000000000000002
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --COPTFLAGS=-O2 --CXXOPTFLAGS=-O2 --FOPTFLAGS=-O2 --with-blas-lib=/usr/lib/libopenblas.a --with-lapack-lib=/usr/lib/liblapack.a --with-c-support --with-debugging=0 --with-shared-libraries --download-suitesparse --download-scalapack --download-metis --download-parmetis --download-ptscotch --download-hypre --download-mumps --download-blacs --download-spai --download-ml --prefix=/usr/local
-----------------------------------------
Libraries compiled on Tue Jun 28 09:19:42 2016 on 77705a0c4e38 
Machine characteristics: Linux-4.4.6-coreos-x86_64-with-Ubuntu-16.04-xenial
Using PETSc directory: /tmp/petsc-3.6.3
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O2  ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: mpif90  -fPIC -Wall -Wno-unused-variable -ffree-line-length-0 -Wno-unused-dummy-argument -O2   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/tmp/petsc-3.6.3/arch-linux2-c-opt/include -I/tmp/petsc-3.6.3/include -I/tmp/petsc-3.6.3/include -I/tmp/petsc-3.6.3/arch-linux2-c-opt/include -I/usr/local/include -I/usr/include/mpich
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/tmp/petsc-3.6.3/arch-linux2-c-opt/lib -L/tmp/petsc-3.6.3/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/usr/local/lib -L/usr/local/lib -lHYPRE -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -lmpichcxx -lstdc++ -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lpthread -lscalapack -lspai -lml -lmpichcxx -lstdc++ -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -Wl,-rpath,/usr/lib -L/usr/lib -llapack -lopenblas -lparmetis -lmetis -lpthread -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lm -lmpichfort -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpichcxx -lstdc++ -lrt -lm -lz -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -ldl -lmpich -lgcc_s -ldl 
-----------------------------------------

quay.io/fenicsproject/stable:2016.1.0.r4

Passing options to PETSc: -ksp_monitor -ksp_view -pc_hypre_boomeramg_print_debug 1
Z Initial Nx: 10
Z Initial Ny: 10
Z Initial Nz: 20
Refinement 1...
Compute shared mesh entities of dimension 1.
Number of cells increased from 12000 to 96000 (700.0% increase).
Refinement 2...
Compute shared mesh entities of dimension 1.
Number of cells increased from 96000 to 768000 (700.0% increase).
Refinement 3...
Compute shared mesh entities of dimension 1.
Number of cells increased from 768000 to 6144000 (700.0% increase).
Finished refinement
Z Degrees of freedom:                  1056321
Z Degrees of freedom per core: 1056321
*** -------------------------------------------------------------------------
*** Warning: Automatic determination of degree for Expressions has been deprecated in FEniCS version 2016.1.
*** It will (likely) be removed in the next FEniCS release.
*** Specify the polynomial degree for interpolation of Expression, or provide a FiniteElement
*** -------------------------------------------------------------------------

*** -------------------------------------------------------------------------
*** Warning: Automatic determination of degree for Expressions has been deprecated in FEniCS version 2016.1.
*** It will (likely) be removed in the next FEniCS release.
*** Specify the polynomial degree for interpolation of Expression, or provide a FiniteElement
*** -------------------------------------------------------------------------

Building sparsity pattern over cells [========>                       ] 25.6%
Building sparsity pattern over cells [================>               ] 51.2%
Building sparsity pattern over cells [========================>       ] 76.8%
Building sparsity pattern over cells [================================] 100.0%
Assembling matrix over cells [=====>                                  ] 12.8%
Assembling matrix over cells [==========>                             ] 25.6%
Assembling matrix over cells [===============>                        ] 38.4%
Assembling matrix over cells [====================>                   ] 51.2%
Assembling matrix over cells [=========================>              ] 64.0%
Assembling matrix over cells [==============================>         ] 76.8%
Assembling matrix over cells [===================================>    ] 89.6%
Assembling matrix over cells [========================================] 100.0%
Assembling vector over cells [===========>                            ] 29.9%
Assembling vector over cells [=======================>                ] 59.7%
Assembling vector over cells [===================================>    ] 89.6%
Assembling vector over cells [========================================] 100.0%
Computing sub domain markers [========================================] 100.0%
Applying boundary conditions to linear system.
<Parameter set "linear_solver" containing 8 parameter(s) and 0 nested parameter set(s)>

  linear_solver            |    type    value          range  access  change
  --------------------------------------------------------------------------
  absolute_tolerance       |  double  <unset>             []       0       0
  divergence_limit         |  double  <unset>             []       0       0
  error_on_nonconvergence  |    bool  <unset>  {true, false}       0       0
  maximum_iterations       |     int  <unset>             []       0       0
  monitor_convergence      |    bool  <unset>  {true, false}       0       0
  nonzero_initial_guess    |    bool  <unset>  {true, false}       0       0
  relative_tolerance       |  double    1e-10             []       0       1
  report                   |    bool  <unset>  {true, false}       0       0
PETSc Krylov solver starting to solve 1056321 x 1056321 system.
KSP Object: 1 MPI processes
  type: cg
  maximum iterations=10000, initial guess is zero
  tolerances:  relative=1e-10, absolute=1e-50, divergence=10000.
  left preconditioning
  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: hypre
    HYPRE BoomerAMG preconditioning
    HYPRE BoomerAMG: Cycle type V
    HYPRE BoomerAMG: Maximum number of levels 25
    HYPRE BoomerAMG: Maximum number of iterations PER hypre call 1
    HYPRE BoomerAMG: Convergence tolerance PER hypre call 0.
    HYPRE BoomerAMG: Threshold for strong coupling 0.25
    HYPRE BoomerAMG: Interpolation truncation factor 0.
    HYPRE BoomerAMG: Interpolation: max elements per row 0
    HYPRE BoomerAMG: Number of levels of aggressive coarsening 0
    HYPRE BoomerAMG: Number of paths for aggressive coarsening 1
    HYPRE BoomerAMG: Maximum row sums 0.9
    HYPRE BoomerAMG: Sweeps down         1
    HYPRE BoomerAMG: Sweeps up           1
    HYPRE BoomerAMG: Sweeps on coarse    1
    HYPRE BoomerAMG: Relax down          symmetric-SOR/Jacobi
    HYPRE BoomerAMG: Relax up            symmetric-SOR/Jacobi
    HYPRE BoomerAMG: Relax on coarse     Gaussian-elimination
    HYPRE BoomerAMG: Relax weight  (all)      1.
    HYPRE BoomerAMG: Outer relax weight (all) 1.
    HYPRE BoomerAMG: Using CF-relaxation
    HYPRE BoomerAMG: Not using more complex smoothers.
    HYPRE BoomerAMG: Measure type        local
    HYPRE BoomerAMG: Coarsen type        Falgout
    HYPRE BoomerAMG: Interpolation type  classical
  linear system matrix = precond matrix:
  Mat Object:   1 MPI processes
    type: seqaij
    rows=1056321, cols=1056321
    total: nonzeros=15584961, allocated nonzeros=15584961
    total number of mallocs used during MatSetValues calls =0
      not using I-node routines
[MPI_AVG] Summary of timings      |    reps    wall avg    wall tot
-------------------------------------------------------------------
Apply (PETScMatrix)               |       2    0.027329    0.054659
Apply (PETScVector)               |       6  8.1309e-05  0.00048785
Assemble cells                    |       2      2.9439      5.8879
Assemble exterior facets          |       1     0.31685     0.31685
Build BoxMesh                     |       1   0.0024531   0.0024531
Build sparsity                    |       2        1.04      2.0801
Compute SCOTCH graph re-ordering  |       1     0.25735     0.25735
Compute connectivity 2-1          |       3     0.63623      1.9087
Compute connectivity 2-3          |       1     0.73728     0.73728
Compute entities dim = 1          |       3     0.35475      1.0643
Compute entities dim = 2          |       4       1.586      6.3439
Computed shared mesh entities     |       3  0.00013871  0.00041613
Delete sparsity                   |       2  7.3415e-06  1.4683e-05
DirichletBC apply                 |       1      0.9833      0.9833
DirichletBC compute bc            |       1     0.94495     0.94495
DirichletBC init facets           |       1     0.91345     0.91345
Init PETSc                        |       1   0.0012785   0.0012785
Init dof vector                   |       1    0.031135    0.031135
Init dofmap                       |       1      11.125      11.125
Init dofmap from UFC dofmap       |       1     0.97001     0.97001
Init tensor                       |       2    0.065852      0.1317
Krylov solver                     |       1      30.748      30.748
Number distributed mesh entities  |       1   3.196e-06   3.196e-06
PETSc Krylov solver               |       1      30.748      30.748
PLAZA: Face long edge             |       3      1.2441      3.7322
PLAZA: Get tetrahedra             |  876000   4.054e-06      3.5513
PLAZA: refine                     |       3      5.1754      15.526
SCOTCH: call SCOTCH_graphBuild    |       1   0.0011413   0.0011413
SCOTCH: call SCOTCH_graphOrder    |       1     0.18641     0.18641
Z Assembling...                   |       1      9.6341      9.6341
Z Creating mesh...                |       1   0.0032052   0.0032052
Z DirichletBC...                  |       1  0.00024782  0.00024782
Z Expression...                   |       1    0.041955    0.041955
Z Forms...                        |       1   0.0013435   0.0013435
Z FunctionSpace...                |       1      11.133      11.133
Z PETSC Options...                |       1    0.031352    0.031352
Z Refinement 1...                 |       1     0.20884     0.20884
Z Refinement 2...                 |       1      1.5389      1.5389
Z Refinement 3...                 |       1      13.781      13.781
Z Solving...                      |       1      30.749      30.749


 WARNING:   -log_summary is being deprecated; switch to -log_view


************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

poisson.py on a arch-linux2-c-opt named 5ab6fa58350d with 1 processor, by fenics Wed Aug 31 12:54:41 2016
Using Petsc Release Version 3.7.3, Jul, 24, 2016 

                         Max       Max/Min        Avg      Total 
Time (sec):           6.814e+01      1.00000   6.814e+01
Objects:              1.700e+01      1.00000   1.700e+01
Flops:                3.851e+08      1.00000   3.851e+08  3.851e+08
Flops/sec:            5.652e+06      1.00000   5.652e+06  5.652e+06
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  1.453e+07      1.00000   0.000e+00  1.453e+07
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total 
 0:      Main Stage: 6.8138e+01 100.0%  3.8511e+08 100.0%  0.000e+00   0.0%  0.000e+00      100.0%  0.000e+00   0.0% 

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flops: Max - maximum over all processors
                   Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   Avg. len: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flops in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecTDot               18 1.0 2.2055e-02 1.0 3.80e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 10  0  0  0   0 10  0  0  0  1724
VecNorm               10 1.0 7.0877e-03 1.0 2.11e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  2981
VecCopy                2 1.0 2.8162e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                22 1.0 1.4510e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               18 1.0 3.0588e-02 1.0 3.80e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 10  0  0  0   0 10  0  0  0  1243
VecAYPX                8 1.0 1.3669e-02 1.0 1.69e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  1236
VecAssemblyBegin       6 1.0 6.8188e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd         6 1.0 6.0558e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult                9 1.0 2.2938e-01 1.0 2.71e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 70  0  0  0   0 70  0  0  0  1182
MatConvert             1 1.0 7.7841e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin       2 1.0 5.2452e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         2 1.0 5.4584e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 5.0068e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries         1 1.0 1.4291e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatView                1 1.0 2.1219e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCSetUp                1 1.0 1.3331e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 20  0  0  0  0  20  0  0  0  0     0
PCApply               10 1.0 1.7107e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 25  0  0  0  0  25  0  0  0  0     0
KSPSetUp               1 1.0 3.4337e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve               1 1.0 3.0748e+01 1.0 3.85e+08 1.0 0.0e+00 0.0e+00 0.0e+00 45100  0  0  0  45100  0  0  0    13
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Viewer     1              0            0     0.
           Index Set     2              2       105880     0.
   IS L to G Mapping     4              4     16903632     0.
              Vector     7              7     59164672     0.
              Matrix     1              1    203923784     0.
      Preconditioner     1              1         1384     0.
       Krylov Solver     1              1         1232     0.
========================================================================================================================
Average time to get PetscTime(): 3.38554e-06
#PETSc Option Table entries:
-ksp_monitor
-ksp_view
-log_summary
-pc_hypre_boomeramg_agg_nl 4
-pc_hypre_boomeramg_agg_num_paths 2
-pc_hypre_boomeramg_P_max 5
-pc_hypre_boomeramg_print_debug 1
-pc_hypre_boomeramg_strong_threshold 0.5
-pc_hypre_boomeramg_truncfactor 0.90000000000000002
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --COPTFLAGS=-O2 --CXXOPTFLAGS=-O2 --FOPTFLAGS=-O2 --with-blas-lib=/usr/lib/libopenblas.a --with-lapack-lib=/usr/lib/liblapack.a --with-c-support --with-debugging=0 --with-shared-libraries --download-suitesparse --download-scalapack --download-metis --download-parmetis --download-ptscotch --download-hypre --download-mumps --download-blacs --download-spai --download-ml --prefix=/usr/local
-----------------------------------------
Libraries compiled on Sun Aug 28 11:48:54 2016 on 721c971d0d37 
Machine characteristics: Linux-4.4.6-coreos-x86_64-with-Ubuntu-16.04-xenial
Using PETSc directory: /tmp/petsc-3.7.3
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fvisibility=hidden -O2  ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -O2   ${FOPTFLAGS} ${FFLAGS} 
-----------------------------------------

Using include paths: -I/tmp/petsc-3.7.3/arch-linux2-c-opt/include -I/tmp/petsc-3.7.3/include -I/tmp/petsc-3.7.3/include -I/tmp/petsc-3.7.3/arch-linux2-c-opt/include -I/usr/local/include -I/usr/include/mpich
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/tmp/petsc-3.7.3/arch-linux2-c-opt/lib -L/tmp/petsc-3.7.3/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/usr/local/lib -L/usr/local/lib -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord -lpthread -lparmetis -lmetis -lHYPRE -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -lmpichcxx -lstdc++ -lscalapack -lml -lmpichcxx -lstdc++ -lspai -lumfpack -lklu -lcholmod -lbtf -lccolamd -lcolamd -lcamd -lamd -lsuitesparseconfig -Wl,-rpath,/usr/lib -L/usr/lib -llapack -lopenblas -lptesmumps -lptscotch -lptscotcherr -lscotch -lscotcherr -lpthread -lm -lmpichfort -lgfortran -lm -lgfortran -lm -lquadmath -lm -lmpichcxx -lstdc++ -lrt -lm -lz -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/5 -L/usr/lib/gcc/x86_64-linux-gnu/5 -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu -L/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu -ldl -lmpich -lgcc_s -ldl 
-----------------------------------------

Comments (1)

  1. Jack Hale (reporter)

    DOLFIN is now much better disciplined about parsing PETSc options: if you do not explicitly ask a solver to set_from_options(), DOLFIN will not do it on your behalf. The fix for the above is therefore to use PETScKrylovSolver instead of LinearSolver and call the set_from_options() method on the solver. This should lead to fewer cases of DOLFIN silently/unexpectedly changing PETSc options.
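
    For future readers, a minimal sketch of that fix (the names A, b and u below are placeholders for the assembled matrix, right-hand side and solution Function from the linked script):

        # Sketch only: explicitly ask DOLFIN's PETScKrylovSolver to read the
        # PETSc options database, instead of relying on LinearSolver to do it.
        from dolfin import PETScKrylovSolver

        # BoomerAMG options set earlier via PETScOptions.set(...)
        solver = PETScKrylovSolver("cg", "hypre_amg")
        solver.set_from_options()    # pick up -pc_hypre_boomeramg_* etc.
        solver.set_operator(A)       # A: assembled matrix (placeholder name)
        solver.solve(u.vector(), b)  # u, b: solution Function and RHS vector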
