Segmentation Violation while constructing a finite element space

Issue #860 resolved
Delyan Kalchev created an issue

This happens with both the stable and dev Docker images pulled today.

Here is the sample code:

from dolfin import *
from mshr import *

p0 = Point(0.0, 0.0)
p1 = Point(1.0, 1.0)
pA = Point(0.25, 0.25)
pB = Point(0.75, 0.75)

# Unit square with an interior rectangle marked as subdomain 1
dom = Rectangle(p0, p1)
dom.set_subdomain(1, Rectangle(pA, pB))
mesh = generate_mesh(dom, 50)

# Refine uniformly five times
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)

print("Constructing space...")

Z = FunctionSpace(mesh, 'CG', 5)

print("Space constructed!")

The output I get:

Number of cells increased from 7608 to 30432 (300.0% increase).
Number of cells increased from 30432 to 121728 (300.0% increase).
Number of cells increased from 121728 to 486912 (300.0% increase).
Number of cells increased from 486912 to 1947648 (300.0% increase).
Number of cells increased from 1947648 to 7790592 (300.0% increase).
Constructing space...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=59
:
system msg for write_line failure : Bad file descriptor

I don't know whether the issue lies in PETSc itself or in DOLFIN, but it seems to be provoked by what DOLFIN does.

PS: The machine I run this on has more than 190 GB of memory available. It would be very surprising if that amount of memory were insufficient...

Comments (7)

  1. Umberto Villa

    Hi Delyan,

    I believe that by default PETSc is compiled with 32-bit integers for global indices.

    How many global degrees of freedom are there in a P5 finite element space on a mesh with over 7 million cells?

    We can do the math by counting the numbers of vertices, edges, and cells in the mesh -- but my feeling is a LOT of degrees of freedom.

    Could your issue be an integer overflow?

  2. Delyan Kalchev reporter

    Hi Umberto,

    Thank you! You may have a point.

    We have 7,790,592 triangles. For degree 5 there are 21 local dofs per triangle, so even if this were a discontinuous space the global dof count would be 163,602,432, which is well within the limits of a signed 32-bit integer. However, if we have 7,790,592 dense 21x21 element matrices and try to store them all in a sparse matrix structure, that gives 3,435,651,072 non-zeros, which overflows a signed 32-bit integer (but not an unsigned one).

    I'm only speculating here, since I don't know what actually happens when a function space is constructed. My only observation is that the construction takes a lot of time and memory on fine meshes, and that the memory is released afterwards. I don't even know which component may be responsible.

    Having a large number of non-zeros in a CSR matrix can easily overflow some integer and cause exactly this kind of mess.
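
    Just to make the arithmetic above concrete, here is a quick sketch of it (the 21 local dofs come from the P5 Lagrange triangle, and the non-zero count assumes every dense 21x21 element matrix is stored separately, which is only a crude upper bound on the assembled sparsity):

    # Back-of-the-envelope estimate only, not an actual measurement
    num_cells = 7790592
    local_dofs = (5 + 1) * (5 + 2) // 2         # 21 dofs per P5 Lagrange triangle
    dg_dofs = num_cells * local_dofs            # upper bound on global dofs
    nnz_estimate = num_cells * local_dofs ** 2  # all dense 21x21 element matrices

    print(dg_dofs)                              # 163602432  < 2**31 - 1
    print(nnz_estimate)                         # 3435651072 > 2**31 - 1, < 2**32 - 1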

    Here is an experiment with a slightly simpler example:

    from dolfin import *
    from mshr import *

    p0 = Point(0.0, 0.0)
    p1 = Point(1.0, 1.0)

    dom = Rectangle(p0, p1)
    mesh = generate_mesh(dom, 50)

    mesh = adapt(mesh)
    mesh = adapt(mesh)
    mesh = adapt(mesh)
    mesh = adapt(mesh)
    mesh = adapt(mesh)

    print(repr(mesh.num_cells()))

    print("Constructing space...")

    Z = FunctionSpace(mesh, 'CG', 5)

    print("Space constructed!")
    

    A serial run results in an error:

    fenics@cf6d94b56741:~/shared$ python test.py 
    Number of cells increased from 7722 to 30888 (300.0% increase).
    Number of cells increased from 30888 to 123552 (300.0% increase).
    Number of cells increased from 123552 to 494208 (300.0% increase).
    Number of cells increased from 494208 to 1976832 (300.0% increase).
    Number of cells increased from 1976832 to 7907328 (300.0% increase).
    7907328
    Constructing space...
    [0]PETSC ERROR: ------------------------------------------------------------------------
    [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
    [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
    [0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
    [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
    [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
    [0]PETSC ERROR: to get more information on the crash.
    application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
    [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=59
    :
    system msg for write_line failure : Bad file descriptor
    

    Whereas with 4 MPI processes it finishes without any explicit errors:

    fenics@cf6d94b56741:~/shared$ mpirun -n 4 python test.py 
    Process 0: Number of cells increased from 7722 to 30888 (300.0% increase).
    Process 2: Number of cells increased from 7722 to 30888 (300.0% increase).
    Process 3: Number of cells increased from 7722 to 30888 (300.0% increase).
    Process 1: Number of cells increased from 7722 to 30888 (300.0% increase).
    Process 1: Number of cells increased from 30888 to 123552 (300.0% increase).
    Process 0: Number of cells increased from 30888 to 123552 (300.0% increase).
    Process 2: Number of cells increased from 30888 to 123552 (300.0% increase).
    Process 3: Number of cells increased from 30888 to 123552 (300.0% increase).
    Process 3: Number of cells increased from 123552 to 494208 (300.0% increase).
    Process 1: Number of cells increased from 123552 to 494208 (300.0% increase).
    Process 0: Number of cells increased from 123552 to 494208 (300.0% increase).
    Process 2: Number of cells increased from 123552 to 494208 (300.0% increase).
    Process 0: Number of cells increased from 494208 to 1976832 (300.0% increase).
    Process 2: Number of cells increased from 494208 to 1976832 (300.0% increase).
    Process 3: Number of cells increased from 494208 to 1976832 (300.0% increase).
    Process 1: Number of cells increased from 494208 to 1976832 (300.0% increase).
    Process 0: Number of cells increased from 1976832 to 7907328 (300.0% increase).
    1963906
    Constructing space...
    Process 3: Number of cells increased from 1976832 to 7907328 (300.0% increase).
    1968493
    Constructing space...
    Process 1: Number of cells increased from 1976832 to 7907328 (300.0% increase).
    1987558
    Constructing space...
    Process 2: Number of cells increased from 1976832 to 7907328 (300.0% increase).
    1987371
    Constructing space...
    Space constructed!
    Space constructed!
    Space constructed!
    Space constructed!
    

    This suggests that somewhere the number of non-zeros may overflow an integer, and that when the work is distributed over a few processes things end up working.
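
    As a rough sketch of why the split could help (assuming the cells, and hence the candidate non-zeros, are divided about evenly among the 4 processes):

    # Per-process share of the estimated non-zeros in the simpler example
    nnz_estimate = 7907328 * 21 ** 2   # about 3.49e9 in total
    per_process = nnz_estimate // 4    # 871782912
    print(per_process < 2 ** 31 - 1)   # True: each local count fits a signed 32-bit int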

    It is a pity, though, that marked subdomains are currently not supported in parallel...

  3. Jan Blechta

    The natural solution is to compile PETSc with 64-bit integers on that big machine.

    We could put in some checks for integer overflow to prevent the segfault, but the problem is that this is hard to reproduce. I tried on an 8 GB machine and got bad_alloc. If somebody finds a way to reproduce the problem, we can have a look. Otherwise I would suggest closing the issue.
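
    For reference, the index width of the installed PETSc can be checked from Python via petsc4py, assuming it is available in the container and exposes PETSc.IntType (a 64-bit build is the one configured with --with-64-bit-indices):

    import numpy
    from petsc4py import PETSc

    # 4 bytes means 32-bit PetscInt, 8 bytes means 64-bit PetscInt
    print(numpy.dtype(PETSc.IntType).itemsize * 8)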

  4. Delyan Kalchev reporter

    I'm not familiar with the internal process that takes place when a finite element space is created, or with which part of it would need 64-bit integers to get through. My concern was whether this might point to an internal issue. PETSc with 64-bit integers and/or using MPI seems like a reasonable workaround to me.
