- edited description
Test that exec routines are OFED-safe
Instant offers a way to configure how external commands (cmake, make, etc.) are executed. This is crucial to make it run on OFED (InfiniBand) clusters. This is related to thread-safety (although we don't care much about thread-safety). See
New possibility is to use subprocess32.
This might be related to #4 and #10.
UPDATE: This has already been implemented but might not necessarily have been tested on InfiniBand clusters.
Comments (24)
-
reporter -
reporter - changed milestone to 2016.2
-
We require Python 2.7, which comes with `subprocess`.
-
reporter Implementation of `subprocess` in Py2 is not OFED fork safe.
-
Do you know for sure that subprocess32 is safe? Sounds like a cleaner solution than the current Instant workarounds and environment variables.
-
reporter No, I don't know it for sure. But the package description mentions that it avoids any "trickery" between fork and exec and solves thread-safety issues. Sounds exactly like conditions needed by OFED/Infiniband.
Anyway, I can just port system call wrappers from Instant and add subprocess32 on top as one possibility.
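A minimal sketch of how such a dispatching wrapper could look (a simplified, hypothetical version, not Instant's actual code; the function name `run_command` and the `SUBPROCESS` default are assumptions):

```python
import os
import subprocess

def run_command(cmd):
    """Run a shell command via the mechanism selected by the
    INSTANT_SYSTEM_CALL_METHOD environment variable.

    Hypothetical simplified sketch: returns 0 on success and a
    nonzero value on failure for both methods."""
    method = os.environ.get("INSTANT_SYSTEM_CALL_METHOD", "SUBPROCESS")
    if method == "OS_SYSTEM":
        # Plain fork + shell exec; reportedly tolerated by some OFED
        # setups that segfault with Py2's subprocess machinery
        return os.system(cmd)
    elif method == "SUBPROCESS":
        # shell=True mirrors os.system semantics
        return subprocess.call(cmd, shell=True)
    else:
        raise ValueError("Unknown system call method: %r" % method)
```

A cluster admin would then pick the method by exporting the environment variable before submitting the job.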
-
Do you have a way of testing this on OFED/Infiniband?
Ping @johannes_ring
-
reporter Yes, I'll test it.
-
Could you add a simple test program somewhere that we can use in the future? I removed some related code recently from Instant because it used a Python module that is removed in Py3, but it was confusing what's what.
-
reporter It did not need to be removed.
The test is to run any program which does JITting, preferably on more than one node to be sure that MPI does not switch InfiniBand off. In any case you will see InfiniBand/fork warnings (at least with OpenMPI; the warning can be turned off by a parameter). If the implementation does something nasty between fork and exec (like subprocess in Py2), it is very probable that the program will segfault.
If you're able to run Poisson demo with Infiniband involved while seeing no segfaults, you're probably fine.
-
@blechta ?? How could Instant work with Py3 if it imports modules that have been removed from Py3?
-
reporter Method is chosen at runtime by an env variable, see the documentation (why did RTD not pull the last change?). Of course, `INSTANT_SYSTEM_CALL_METHOD=COMMANDS` will not work with Py3. It is not a default. Is it necessary to remove it? It gives cluster admins more options to circumvent the problem. I admit that I don't know whether `COMMANDS` helps on any implementation. I have succeeded with `OS_SYSTEM`.
-
I've pushed a branch to dijitso next that includes using subprocess32 on python 2.
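The guarded import such a branch typically relies on might look like this (a sketch, not the actual dijitso code; `subprocess32` mirrors the stdlib `subprocess` API, so it can be swapped in transparently):

```python
import sys

if sys.version_info[0] < 3:
    try:
        # subprocess32 backports Python 3's fork-safe child startup
        # (no Python-level work between fork and exec) to Python 2
        import subprocess32 as subprocess
    except ImportError:
        # Stdlib fallback; may be unsafe on OFED/InfiniBand
        import subprocess
else:
    # Python 3's subprocess already uses the safe C implementation
    import subprocess

# The rest of the code uses the selected module uniformly
output = subprocess.check_output(["echo", "hello"])
```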
-
reporter Well, `OS_SYSTEM` has been tested to work on some systems. `subprocess32` has not yet, but according to the reading mentioned above I strongly believe it will work. Anyway, copying the code from `instant/output.py` and `doc/sphinx/source/installation.rst` is very easy. Is there a reason why it should be avoided?
-
It's in dijitso master. @blechta will you test it and close this issue?
-
reporter I can't promise it this early. Installing FEniCS on a cluster from source might not be that trivial.
I think that the only responsible solution would be to copy the code from Instant known to work until subprocess32 is tested.
-
reporter Just heard that FEniCS runs on Shifter on our cluster. With official FEniCS containers I'll try testing it before the release.
-
reporter - marked as major
- edited description
- changed milestone to 2017.2
- changed title to Test that exec routines are OFED-safe
- marked as task
Did not get InfiniBand working with Shifter yet (an MPICH ABI compatibility trick might help) and I don't have an up-to-date native build on the cluster.
-
The following python (2.7.13) code

```python
from dolfin import *
from mshr import *

rank = MPI.rank(mpi_comm_world())
geometry = Rectangle(dolfin.Point(0., 0.), dolfin.Point(1.0, 1.0))
mesh = generate_mesh(geometry, 200)
V = VectorFunctionSpace(mesh, 'Lagrange', 1)
```

crashes on a cluster (CentOS 6.5, Slurm, InfiniBand) when running on more than one node. The error is

```
[11]PETSC ERROR: ------------------------------------------------------------------------
[11]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[11]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[11]PETSC ERROR: to get more information on the crash.
[21]PETSC ERROR: ------------------------------------------------------------------------
```
The crash happens when creating the function space. Sometimes the code also just hangs without crashing.
Running on only one node, the code works as expected without crashes or hangs. The following environment variables are set prior to actually running the code on the nodes:

```shell
export INSTANT_SYSTEM_CALL_METHOD=SUBPROCESS
export DIJITSO_SYSTEM_CALL_METHOD=$INSTANT_SYSTEM_CALL_METHOD
SCRATCH=/scratch/$SLURM_JOB_ID
export INSTANT_CACHE_DIR=$SCRATCH
export DIJITSO_CACHE_DIR=$SCRATCH
```
I have also tried `OS_SYSTEM` instead of `SUBPROCESS`, with the same result. `subprocess32` IS installed on the system.
I have tried 2017.1 and development versions with the same result. Both built from source.
Running "native" PETSc 3.7.6 examples across multiple nodes works fine.
The following warning is always shown when running fenics jobs:
```
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host: [[44329,1],8] (PID 23391)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
```
We run OpenMPI 2.1.0.
Please let me know if I can help test any possible fixes for this.
-
reporter I would suggest three things:
- Try without importing mshr, just with a mesh generated by DOLFIN.
- Try with infiniband off. There's a parameter for disabling InfiniBand BTL in OpenMPI.
- Try with a debugger and post the stacktrace. In interactive mode run something like

  ```shell
  mpirun -n 2 xterm -e gdb -ex r python2 test.py
  ```

  then type `bt <return>` after a process segfaults.
-
Thanks for the suggestions. I have tried 1 and 2 and will report back with 3 later. Do I need PETSc with debug symbols for this?
-
Same result when not using mshr.
-
I have tried mpiexec with `--mca btl tcp,self` and here it actually works if I change the FFC cache to be on a shared drive, i.e. DO NOT SET

```shell
SCRATCH=/scratch/$SLURM_JOB_ID
export INSTANT_CACHE_DIR=$SCRATCH
export DIJITSO_CACHE_DIR=$SCRATCH
```
If I run on a single node with `--mca btl openib,self -np 2` I get the PETSc crash reported in the other message above.
I'll do some more diggin' with gdb...
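For reference, the OpenMPI MCA options mentioned in this thread can be passed like this (illustrative command lines; the script name and process counts are placeholders):

```shell
# Force TCP transport, bypassing the InfiniBand (openib) BTL:
mpirun --mca btl tcp,self -n 2 python2 test.py

# Keep InfiniBand but silence the fork() warning only
# (this does not make fork() any safer):
mpirun --mca mpi_warn_on_fork 0 -n 2 python2 test.py
```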
-
Hi again!
I tried to mess with this using python2 a number of times and never got it working at all with openib. I also never got anything that looked like useful information with gdb. Recently, I have switched to python 3.6.1 for everything and the above test case now works fine :-)
I have tried a larger program of mine, which sometimes fails with the same error as above when JIT'ing, but other errors also come up at random places in the program:
```
*** -------------------------------------------------------------------------
*** Error:   Unable to access vector of degrees of freedom.
*** Reason:  Cannot access a non-const vector from a subfunction.
*** Where:   This error was encountered inside Function.cpp.
*** Process: 19
***
*** DOLFIN version: 2018.1.0.dev0
*** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
*** -------------------------------------------------------------------------
```

```
*** -------------------------------------------------------------------------
*** Error:   Unable to successfully call PETSc function 'KSPSolve'.
*** Reason:  PETSc error code is: 63 (Argument out of range).
*** Where:   This error was encountered inside /fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp.
*** Process: 0
***
*** DOLFIN version: 2018.1.0.dev0
*** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
*** -------------------------------------------------------------------------
```

```
*** -------------------------------------------------------------------------
*** Error:   Unable to successfully call PETSc function 'MatAssemblyEnd'.
*** Reason:  PETSc error code is: 63 (Argument out of range).
*** Where:   This error was encountered inside /fenics/src/dolfin/dolfin/la/PETScMatrix.cpp.
*** Process: 0
***
*** DOLFIN version: 2018.1.0.dev0
*** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
*** -------------------------------------------------------------------------
```

```
*** -------------------------------------------------------------------------
*** Error:   Unable to apply changes to sparsity pattern.
*** Reason:  Received illegal sparsity pattern entry for row/column 2136199019, not in range [408461, 415074].
*** Where:   This error was encountered inside SparsityPattern.cpp.
*** Process: 62
***
*** DOLFIN version: 2018.1.0.dev0
*** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
*** -------------------------------------------------------------------------
```
This is usually after having run a few iterations in an optimization loop (with dolfin-adjoint). I have tried to run both with `OS_SYSTEM` and `SUBPROCESS`.
I have never seen these errors when running on just a single computer node.
I will try to come up with a simple case which fails. Stay tuned....
Update: Just got an error with `--mca btl tcp,self`! Could the original issue be fixed, with new ones popping up?
-
reporter - removed milestone
- removed responsible
-
reporter - changed status to closed
Possibly not an issue any more. Chris recently pushed some fixes for race conditions.