Test that exec routines are OFED-safe

Issue #20 closed
Jan Blechta created an issue

Instant offers a way to configure how external commands (cmake, make, etc.) are executed. This is crucial to make it run on OFED (InfiniBand) clusters. This is related to thread-safety (although we don't care much about thread-safety). See

A new possibility is to use subprocess32.

This might be related to #4 and #10.

UPDATE: This has already been implemented but has not necessarily been tested on InfiniBand clusters.
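
For illustration, a minimal sketch of the kind of configurable wrapper this is about, dispatching on the INSTANT_SYSTEM_CALL_METHOD variable discussed in the comments below. This is not Instant's actual code (the real wrappers live in instant/output.py); the default chosen here is arbitrary:

    import os
    import subprocess

    def run_command(cmd):
        # Sketch only; dispatch on the environment variable mentioned below.
        method = os.environ.get("INSTANT_SYSTEM_CALL_METHOD", "SUBPROCESS")
        if method == "OS_SYSTEM":
            # os.system() forks and execs entirely inside the C library, so no
            # Python code runs in the child between fork() and exec().
            return os.system(cmd), ""
        elif method == "SUBPROCESS":
            # Plain subprocess; on Python 2 this runs Python code in the child
            # between fork() and exec(), which is what upsets OFED stacks.
            p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT)
            out, _ = p.communicate()
            return p.returncode, out
        else:
            raise ValueError("Unknown system call method: %s" % method)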

Comments (24)

  1. Martin Sandve Alnæs

    Do you know for sure that subprocess32 is safe? Sounds like a cleaner solution than the current Instant workarounds and environment variables.

  2. Jan Blechta reporter

    No, I don't know it for sure. But the package description mentions that it avoids any "trickery" between fork and exec and solves thread-safety issues. That sounds exactly like the conditions needed by OFED/InfiniBand.

    Anyway, I can just port system call wrappers from Instant and add subprocess32 on top as one possibility.
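
    For reference, "adding subprocess32 on top" would presumably amount to the usual fallback import; a sketch, not Instant's actual code:

    import sys
    if sys.version_info[0] < 3:
        try:
            # subprocess32 performs the fork/exec in a C extension, avoiding
            # Python-level work in the child between fork() and exec().
            import subprocess32 as subprocess
        except ImportError:
            import subprocess
    else:
        import subprocess  # Python 3's subprocess already behaves this way

    # then use it like the stock module, e.g.
    subprocess.check_call(["cmake", "--version"])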

  3. Prof Garth Wells

    Could you add a simple test program somewhere that we can use in the future? I removed some related code from Instant recently because it used a Python module that has been removed in Py3, and it was confusing what was what.

  4. Jan Blechta reporter

    It did not need to be removed.

    The test is to run any program that does JITting, preferably on more than one node, to be sure that MPI does not switch InfiniBand off. In any case you will see InfiniBand/fork warnings (at least with OpenMPI; the warning can be turned off by a parameter). If the implementation does something nasty between fork and exec (like subprocess in Py2), it is very probable that the program will segfault.

    If you're able to run the Poisson demo with InfiniBand involved while seeing no segfaults, you're probably fine.
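
    A minimal sketch of such a test, assuming the legacy DOLFIN Python interface (the file name is made up):

    # poisson_jit_test.py -- any program that JIT-compiles a form will do
    from dolfin import *
    mesh = UnitSquareMesh(64, 64)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v)) * dx          # form compilation triggers the JIT
    L = Constant(1.0) * v * dx
    bc = DirichletBC(V, Constant(0.0), "on_boundary")
    uh = Function(V)
    solve(a == L, uh, bc)
    # if every rank gets here without a segfault, the exec method survived
    info("L2 norm: %g" % uh.vector().norm("l2"))

    Run it with mpirun with the ranks spread over at least two nodes, so that the InfiniBand BTL is actually exercised.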

  5. Prof Garth Wells

    @blechta ?? How could Instant work with Py3 if it imports modules that have been removed from Py3?

  6. Jan Blechta reporter

    The method is chosen at runtime by an environment variable; see the documentation (why didn't RTD pull the last change?). Of course, INSTANT_SYSTEM_CALL_METHOD=COMMANDS will not work with Py3, but it is not the default. Is it necessary to remove it? It gives cluster admins more options to circumvent the problem. I admit I don't know whether COMMANDS helps on any implementation; I have succeeded with OS_SYSTEM.

  7. Jan Blechta reporter

    Well, OS_SYSTEM has been tested to work on some systems. subprocess32 has not been tested yet, but from the reading mentioned above I strongly believe it will work.

    Anyway, copying the code from instant/output.py and doc/sphinx/source/installation.rst is very easy. Is there a reason why it should be avoided?

  8. Jan Blechta reporter

    I can't promise it this early. Installing FEniCS on a cluster from source might not be that trivial.

    I think the only responsible solution would be to copy the code from Instant that is known to work, until subprocess32 is tested.

  9. Jan Blechta reporter

    Just heard that FEniCS runs on Shifter on our cluster. With official FEniCS containers 😃 I'll try testing it before the release.

  10. Søren Madsen

    The following Python (2.7.13) code

    from dolfin import *
    from mshr import *
    
    rank = MPI.rank(mpi_comm_world())
    geometry = Rectangle(Point(0.0, 0.0), Point(1.0, 1.0))
    mesh = generate_mesh(geometry, 200)
    V = VectorFunctionSpace(mesh, 'Lagrange', 1)
    

    crashes on a cluster (CentOS 6.5, Slurm, InfiniBand) when running on more than one node. The error is:

    [11]PETSC ERROR: ------------------------------------------------------------------------
    [11]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
    [11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
    [11]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
    [11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
    [11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
    [11]PETSC ERROR: to get more information on the crash.
    [21]PETSC ERROR: ------------------------------------------------------------------------
    

    The crash happens when creating the function space. Sometimes the code also just hangs without crashing.

    Running on only one node, the code works as expected without crashes or hangs. The following environment variables are set prior to actually running the code on the nodes:

    export INSTANT_SYSTEM_CALL_METHOD=SUBPROCESS
    export DIJITSO_SYSTEM_CALL_METHOD=$INSTANT_SYSTEM_CALL_METHOD
    SCRATCH=/scratch/$SLURM_JOB_ID
    export INSTANT_CACHE_DIR=$SCRATCH
    export DIJITSO_CACHE_DIR=$SCRATCH
    

    I have also tried OS_SYSTEM instead of SUBPROCESS, with the same result. subprocess32 IS installed on the system.

    I have tried the 2017.1 and development versions with the same result. Both built from source.

    Running "native" PETSc 3.7.6 examples across multiple nodes works fine.

    The following warning is always shown when running FEniCS jobs:

    --------------------------------------------------------------------------
    A process has executed an operation involving a call to the
    "fork()" system call to create a child process.  Open MPI is currently
    operating in a condition that could result in memory corruption or
    other system errors; your job may hang, crash, or produce silent
    data corruption.  The use of fork() (or system() or other calls that
    create child processes) is strongly discouraged.
    
    The process that invoked fork was:
    
      Local host:          [[44329,1],8] (PID 23391)
    
    If you are *absolutely sure* that your application will successfully
    and correctly survive a call to fork(), you may disable this warning
    by setting the mpi_warn_on_fork MCA parameter to 0.
    --------------------------------------------------------------------------
    

    We run OpenMPI 2.1.0.

    Please let me know if I can help test any possible fixes for this.

  11. Jan Blechta reporter

    I would suggest three things:

    1. Try without importing mshr, just with a mesh generated by DOLFIN.
    2. Try with InfiniBand off; there's an OpenMPI parameter for disabling the InfiniBand BTL (e.g. --mca btl tcp,self).
    3. Try with a debugger and post the stack trace. In interactive mode run something like
    mpirun -n 2 xterm -e gdb -ex r --args python2 test.py
    

    then type bt <return> after a process segfaults.

  12. Søren Madsen

    Thanks for the suggestions. I have tried 1 and 2 and will report back on 3 later. Do I need PETSc with debug symbols for this?

    1. Same result when not using mshr.

    2. I have tried mpiexec with --mca btl tcp,self and here it actually works if I change the FFC cache to be on a shared drive, i.e. DO NOT SET

    SCRATCH=/scratch/$SLURM_JOB_ID
    export INSTANT_CACHE_DIR=$SCRATCH
    export DIJITSO_CACHE_DIR=$SCRATCH
    

    If I run on a single node with --mca btl openib,self -np 2, I get the PETSc crash reported in the other message above.

    I'll do some more diggin' with gdb...

  13. Søren Madsen

    Hi again!

    I tried to mess with this using Python 2 a number of times and never got it working at all with openib. I also never got anything that looked like useful information out of gdb. Recently I have switched to Python 3.6.1 for everything, and the above test case now works fine :-)

    I have tried a larger program of mine, which sometimes fails with the same error as above when JITing, but other errors also come up at random places in the program:

    *** -------------------------------------------------------------------------
    *** Error:   Unable to access vector of degrees of freedom.
    *** Reason:  Cannot access a non-const vector from a subfunction.
    *** Where:   This error was encountered inside Function.cpp.
    *** Process: 19
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    
    *** -------------------------------------------------------------------------
    *** Error:   Unable to successfully call PETSc function 'KSPSolve'.
    *** Reason:  PETSc error code is: 63 (Argument out of range).
    *** Where:   This error was encountered inside /fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp.
    *** Process: 0
    *** 
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    
    *** -------------------------------------------------------------------------
    *** Error:   Unable to successfully call PETSc function 'MatAssemblyEnd'.
    *** Reason:  PETSc error code is: 63 (Argument out of range).
    *** Where:   This error was encountered inside /fenics/src/dolfin/dolfin/la/PETScMatrix.cpp.
    *** Process: 0
    *** 
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    
    *** -------------------------------------------------------------------------
    *** Error:   Unable to apply changes to sparsity pattern.
    *** Reason:  Received illegal sparsity pattern entry for row/column 2136199019, not in range [408461, 415074].
    *** Where:   This error was encountered inside SparsityPattern.cpp.
    *** Process: 62
    *** 
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    

    This is usually after having run a few iterations in an optimization loop (with dolfin-adjoint). I have tried running both with 'OS_SYSTEM' and 'SUBPROCESS'.

    I have never seen these errors when running on just a single computer node.

    I will try to come up with a simple case which fails. Stay tuned....

    Update: Just got an error with '--mca btl tcp,self'! Could it be that the original issue is fixed and new ones are popping up?
