Segmentation Violation while constructing a finite element space
This happens with both the stable and dev docker versions pulled today.
Here is the sample code:
from dolfin import *
from mshr import *

p0 = Point(0.0, 0.0)
p1 = Point(1.0, 1.0)
pA = Point(0.25, 0.25)
pB = Point(0.75, 0.75)

dom = Rectangle(p0, p1)
dom.set_subdomain(1, Rectangle(pA, pB))
mesh = generate_mesh(dom, 50)

# Five rounds of uniform refinement
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)

print("Constructing space...")
Z = FunctionSpace(mesh, 'CG', 5)
print("Space constructed!")
The output I get:
Number of cells increased from 7608 to 30432 (300.0% increase).
Number of cells increased from 30432 to 121728 (300.0% increase).
Number of cells increased from 121728 to 486912 (300.0% increase).
Number of cells increased from 486912 to 1947648 (300.0% increase).
Number of cells increased from 1947648 to 7790592 (300.0% increase).
Constructing space...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=59
:
system msg for write_line failure : Bad file descriptor
I don't know whether the issue lies in PETSc alone or in DOLFIN, but it seems to be triggered by DOLFIN's actions.
PS: The machine I run this on has more than 190 GB of memory available. It would be very surprising if that amount of memory were insufficient...
Comments (7)
-
Hi Delyan,
I believe that by default PETSc is compiled with 32-bit integers for global indices.
How many global degrees of freedom are there in a P5 finite element space on a mesh with over 7 million cells?
We can do the math by counting the vertices, edges, and cells in the mesh -- but my feeling is: a LOT of degrees of freedom.
Could your issue be an integer overflow?
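A back-of-the-envelope estimate is easy to sketch (assuming a roughly uniform planar triangulation and ignoring boundary edges; the cell count is taken from the refinement log above):

```python
C = 7_907_328                # cells reported after five refinements
E = 3 * C // 2               # each interior edge is shared by two triangles
V = E - C + 1                # Euler's formula for a planar mesh: V - E + C = 1
# P5 Lagrange on a triangle: 1 dof per vertex, 4 per edge, 6 per cell
ndofs = V + 4 * E + 6 * C
print(ndofs)                 # 98841601 -- large, but below 2**31 - 1
print(ndofs < 2**31 - 1)     # True: the dof count alone does not overflow
```

So the global dof count by itself stays within a signed 32-bit integer; the overflow, if any, must come from a larger derived quantity.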
-
reporter Hi Umberto,
Thank you! You may have a point.
We have 7,790,592 triangles. For degree 5 there are 21 local dofs per triangle, so even if it were a discontinuous space the global dof count would be 163,602,432, which is well within the limits of a signed 32-bit integer. However, if we have 7,790,592 dense 21x21 element matrices and try to store them all in a sparse matrix structure, that gives 3,435,651,072 non-zeros, which overflows a signed 32-bit integer (but not an unsigned one).
I'm only speculating here, since I don't know what happens internally when a function space is constructed. My only observation is that construction takes a lot of time and memory on fine meshes; after construction the memory is released. I don't even know which component may be responsible.
Having a large number of non-zeros in a CSR matrix can easily overflow some integer and cause exactly this kind of mess.
Here is an experiment. I now use a slightly simpler example:
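The arithmetic above can be checked directly (a pure-Python sketch; the wrap-around line mimics what a C signed 32-bit counter would hold under two's complement):

```python
cells = 7_790_592
entries = cells * 21 * 21          # one dense 21x21 block per triangle
print(entries)                     # 3435651072

INT32_MAX = 2**31 - 1
print(entries > INT32_MAX)         # True

# What a signed 32-bit counter would actually hold (two's-complement wrap):
wrapped = (entries + 2**31) % 2**32 - 2**31
print(wrapped)                     # -859316224 -- a negative "size"
```

A negative size passed on to an allocation or indexing routine would plausibly produce exactly the kind of out-of-range access PETSc reports.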
from dolfin import *
from mshr import *

p0 = Point(0.0, 0.0)
p1 = Point(1.0, 1.0)
dom = Rectangle(p0, p1)
mesh = generate_mesh(dom, 50)

mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)
mesh = adapt(mesh)

print(repr(mesh.num_cells()))
print("Constructing space...")
Z = FunctionSpace(mesh, 'CG', 5)
print("Space constructed!")
A serial run results in an error:
fenics@cf6d94b56741:~/shared$ python test.py
Number of cells increased from 7722 to 30888 (300.0% increase).
Number of cells increased from 30888 to 123552 (300.0% increase).
Number of cells increased from 123552 to 494208 (300.0% increase).
Number of cells increased from 494208 to 1976832 (300.0% increase).
Number of cells increased from 1976832 to 7907328 (300.0% increase).
7907328
Constructing space...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=59
:
system msg for write_line failure : Bad file descriptor
Whereas, with 4 MPI processes it finishes without any explicit errors:
fenics@cf6d94b56741:~/shared$ mpirun -n 4 python test.py
Process 0: Number of cells increased from 7722 to 30888 (300.0% increase).
Process 2: Number of cells increased from 7722 to 30888 (300.0% increase).
Process 3: Number of cells increased from 7722 to 30888 (300.0% increase).
Process 1: Number of cells increased from 7722 to 30888 (300.0% increase).
Process 1: Number of cells increased from 30888 to 123552 (300.0% increase).
Process 0: Number of cells increased from 30888 to 123552 (300.0% increase).
Process 2: Number of cells increased from 30888 to 123552 (300.0% increase).
Process 3: Number of cells increased from 30888 to 123552 (300.0% increase).
Process 3: Number of cells increased from 123552 to 494208 (300.0% increase).
Process 1: Number of cells increased from 123552 to 494208 (300.0% increase).
Process 0: Number of cells increased from 123552 to 494208 (300.0% increase).
Process 2: Number of cells increased from 123552 to 494208 (300.0% increase).
Process 0: Number of cells increased from 494208 to 1976832 (300.0% increase).
Process 2: Number of cells increased from 494208 to 1976832 (300.0% increase).
Process 3: Number of cells increased from 494208 to 1976832 (300.0% increase).
Process 1: Number of cells increased from 494208 to 1976832 (300.0% increase).
Process 0: Number of cells increased from 1976832 to 7907328 (300.0% increase).
1963906
Constructing space...
Process 3: Number of cells increased from 1976832 to 7907328 (300.0% increase).
1968493
Constructing space...
Process 1: Number of cells increased from 1976832 to 7907328 (300.0% increase).
1987558
Constructing space...
Process 2: Number of cells increased from 1976832 to 7907328 (300.0% increase).
1987371
Constructing space...
Space constructed!
Space constructed!
Space constructed!
Space constructed!
This suggests that somewhere the number of non-zeros overflows an integer, while after distributing the mesh over a few processes the per-process counts stay in range and things end up working.
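The per-process counts back this up (a sketch assuming a perfectly even partition of the cells, which the actual partitioner only approximates):

```python
cells_total = 7_907_328
INT32_MAX = 2**31 - 1
for nprocs in (1, 4):
    local_cells = cells_total // nprocs
    local_nnz = local_cells * 21 * 21   # dense 21x21 element blocks
    print(nprocs, local_nnz, local_nnz <= INT32_MAX)
# 1 process:  3487131648 entries -> overflows a signed 32-bit int
# 4 processes: 871782912 entries each -> fits comfortably
```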
Now, it is a pity that marked subdomains are not supported in parallel currently...
-
The natural solution is to compile PETSc with 64-bit integers on that big machine.
We could add some checks for integer overflow to prevent the segfault, but the problem is that this is hard to reproduce. I tried on an 8 GB machine and got a bad_alloc. If somebody finds a way to reproduce the problem, we can have a look. Otherwise I would suggest closing the issue.
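Such a check could look like the following (a hypothetical sketch, not DOLFIN's actual code; the helper name and error message are made up for illustration, but `--with-64-bit-indices` is PETSc's real configure flag):

```python
INT32_MAX = 2**31 - 1

def check_fits_int32(n, what="count"):
    """Fail loudly instead of letting a 32-bit PETSc build segfault."""
    if n > INT32_MAX:
        raise OverflowError(
            "%s %d exceeds the 32-bit PETSc index range; "
            "rebuild PETSc with --with-64-bit-indices" % (what, n))
    return n

check_fits_int32(163_602_432, "dof count")     # fine
# check_fits_int32(3_435_651_072, "nnz")       # would raise OverflowError
```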
-
reporter I'm not familiar with the internal process that takes place when a finite element space is created, or with which part of it would require 64-bit integers to go through. My concern was whether this might be an internal issue. PETSc with 64-bit integers and/or using MPI seems like a reasonable workaround to me.
-
reporter - changed status to resolved
Compile PETSc with 64-bit integers and/or use MPI.