Test that exec routines are OFED-safe

Issue #20 closed
Jan Blechta created an issue

Instant offers a way to configure how external commands (cmake, make, etc.) are executed. This is crucial to make it run on OFED (InfiniBand) clusters. This is related to thread-safety (although we don't care much about thread-safety). See

A new possibility is to use subprocess32.

This might be related to #4 and #10.

UPDATE: This has already been implemented but has not necessarily been tested on InfiniBand clusters.
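
For illustration, a minimal sketch of the kind of configurable wrapper this is about, dispatching on the INSTANT_SYSTEM_CALL_METHOD variable discussed in the comments below. This is not Instant's actual code (the real wrappers live in instant/output.py); the default chosen here is arbitrary:

    import os
    import subprocess

    def run_command(cmd):
        # Sketch only; dispatch on the environment variable mentioned below.
        method = os.environ.get("INSTANT_SYSTEM_CALL_METHOD", "SUBPROCESS")
        if method == "OS_SYSTEM":
            # os.system() forks and execs entirely inside the C library, so no
            # Python code runs in the child between fork() and exec().
            return os.system(cmd), ""
        elif method == "SUBPROCESS":
            # Plain subprocess; on Python 2 this runs Python code in the child
            # between fork() and exec(), which is what upsets OFED stacks.
            p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT)
            out, _ = p.communicate()
            return p.returncode, out
        else:
            raise ValueError("Unknown system call method: %s" % method)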

Comments (24)

  1. Martin Sandve Alnæs

    Do you know for sure that subprocess32 is safe? Sounds like a cleaner solution than the current Instant workarounds and environment variables.

  2. Jan Blechta reporter

    No, I don't know it for sure. But the package description mentions that it avoids any "trickery" between fork and exec and solves thread-safety issues. That sounds exactly like the conditions needed by OFED/InfiniBand.

    Anyway, I can just port system call wrappers from Instant and add subprocess32 on top as one possibility.
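
    For reference, "adding subprocess32 on top" would presumably amount to the usual fallback import; a sketch, not Instant's actual code:

    import sys
    if sys.version_info[0] < 3:
        try:
            # subprocess32 performs the fork/exec in a C extension, avoiding
            # Python-level work in the child between fork() and exec().
            import subprocess32 as subprocess
        except ImportError:
            import subprocess
    else:
        import subprocess  # Python 3's subprocess already behaves this way

    # then use it like the stock module, e.g.
    subprocess.check_call(["cmake", "--version"])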

  3. Prof Garth Wells

    Could you add a simple test program somewhere that we can use in the future? I removed some related code from Instant recently because it used a Python module that has been removed in Py3, and it was confusing what was what.

  4. Jan Blechta reporter

    It did not need to be removed.

    The test is to run any program that does JITting, preferably on more than one node, to be sure that MPI does not switch InfiniBand off. In any case you will see InfiniBand/fork warnings (at least with OpenMPI; the warning can be turned off by a parameter). If the implementation does something nasty between fork and exec (like subprocess in Py2), it is very probable that the program will segfault.

    If you're able to run the Poisson demo with InfiniBand involved while seeing no segfaults, you're probably fine.
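
    A minimal sketch of such a test, assuming the legacy DOLFIN Python interface (the file name is made up):

    # poisson_jit_test.py -- any program that JIT-compiles a form will do
    from dolfin import *
    mesh = UnitSquareMesh(64, 64)
    V = FunctionSpace(mesh, "Lagrange", 1)
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v)) * dx          # form compilation triggers the JIT
    L = Constant(1.0) * v * dx
    bc = DirichletBC(V, Constant(0.0), "on_boundary")
    uh = Function(V)
    solve(a == L, uh, bc)
    # if every rank gets here without a segfault, the exec method survived
    info("L2 norm: %g" % uh.vector().norm("l2"))

    Run it with mpirun with the ranks spread over at least two nodes, so that the InfiniBand BTL is actually exercised.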

  5. Prof Garth Wells

    @blechta ?? How could Instant work with Py3 if it imports modules that have been removed from Py3?

  6. Jan Blechta reporter

    The method is chosen at runtime by an environment variable; see the documentation (why didn't RTD pull the last change?). Of course, INSTANT_SYSTEM_CALL_METHOD=COMMANDS will not work with Py3, but it is not the default. Is it necessary to remove it? It gives cluster admins more options to circumvent the problem. I admit I don't know whether COMMANDS helps on any implementation; I have succeeded with OS_SYSTEM.

  7. Jan Blechta reporter

    Well, OS_SYSTEM has been tested to work on some systems. subprocess32 has not been tested yet, but from the reading mentioned above I strongly believe it will work.

    Anyway, copying the code from instant/output.py and doc/sphinx/source/installation.rst is very easy. Is there a reason why it should be avoided?

  8. Jan Blechta reporter

    I can't promise it this early. Installing FEniCS on a cluster from source might not be that trivial.

    I think the only responsible solution would be to copy the code from Instant that is known to work, until subprocess32 is tested.

  9. Jan Blechta reporter

    Just heard that FEniCS runs on Shifter on our cluster. With official FEniCS containers 😃 I'll try testing it before the release.

  10. Søren Madsen

    The following Python (2.7.13) code

    from dolfin import *
    from mshr import *
    
    rank = MPI.rank(mpi_comm_world())
    geometry = Rectangle(Point(0.0, 0.0), Point(1.0, 1.0))
    mesh = generate_mesh(geometry, 200)
    V = VectorFunctionSpace(mesh, 'Lagrange', 1)
    

    crashes on a cluster (CentOS 6.5, Slurm, InfiniBand) when running on more than one node. The error is:

    [11]PETSC ERROR: ------------------------------------------------------------------------
    [11]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
    [11]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
    [11]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
    [11]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
    [11]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
    [11]PETSC ERROR: to get more information on the crash.
    [21]PETSC ERROR: ------------------------------------------------------------------------
    

    The crash happens when creating the function space. Sometimes the code also just hangs without crashing.

    Running on only one node, the code works as expected without crashes or hangs. The following environment variables are set prior to actually running the code on the nodes:

    export INSTANT_SYSTEM_CALL_METHOD=SUBPROCESS
    export DIJITSO_SYSTEM_CALL_METHOD=$INSTANT_SYSTEM_CALL_METHOD
    SCRATCH=/scratch/$SLURM_JOB_ID
    export INSTANT_CACHE_DIR=$SCRATCH
    export DIJITSO_CACHE_DIR=$SCRATCH
    

    I have also tried OS_SYSTEM instead of SUBPROCESS, with the same result. subprocess32 IS installed on the system.

    I have tried the 2017.1 and development versions with the same result. Both built from source.

    Running "native" PETSc 3.7.6 examples across multiple nodes works fine.

    The following warning is always shown when running FEniCS jobs:

    --------------------------------------------------------------------------
    A process has executed an operation involving a call to the
    "fork()" system call to create a child process.  Open MPI is currently
    operating in a condition that could result in memory corruption or
    other system errors; your job may hang, crash, or produce silent
    data corruption.  The use of fork() (or system() or other calls that
    create child processes) is strongly discouraged.
    
    The process that invoked fork was:
    
      Local host:          [[44329,1],8] (PID 23391)
    
    If you are *absolutely sure* that your application will successfully
    and correctly survive a call to fork(), you may disable this warning
    by setting the mpi_warn_on_fork MCA parameter to 0.
    --------------------------------------------------------------------------
    

    We run OpenMPI 2.1.0.

    Please let me know if I can help test any possible fixes for this.

  11. Jan Blechta reporter

    I would suggest three things:

    1. Try without importing mshr, just with a mesh generated by DOLFIN.
    2. Try with InfiniBand off; there's an OpenMPI parameter for disabling the InfiniBand BTL (e.g. --mca btl tcp,self).
    3. Try with a debugger and post the stack trace. In interactive mode run something like
    mpirun -n 2 xterm -e gdb -ex r --args python2 test.py
    

    then type bt <return> after a process segfaults.

  12. Søren Madsen

    Thanks for the suggestions. I have tried 1 and 2 and will report back on 3 later. Do I need PETSc with debug symbols for this?

    1. Same result when not using mshr.

    2. I have tried mpiexec with --mca btl tcp,self and here it actually works if I change the FFC cache to be on a shared drive, i.e. DO NOT SET

    SCRATCH=/scratch/$SLURM_JOB_ID
    export INSTANT_CACHE_DIR=$SCRATCH
    export DIJITSO_CACHE_DIR=$SCRATCH
    

    If I run on a single node with --mca btl openib,self -np 2, I get the PETSc crash reported in the other message above.

    I'll do some more diggin' with gdb...

  13. Søren Madsen

    Hi again!

    I tried to mess with this using Python 2 a number of times and never got it working at all with openib. I also never got anything that looked like useful information out of gdb. Recently I have switched to Python 3.6.1 for everything, and the above test case now works fine :-)

    I have tried a larger program of mine, which sometimes fails with the same error as above when JITing, but other errors also come up at random places in the program:

    *** -------------------------------------------------------------------------
    *** Error:   Unable to access vector of degrees of freedom.
    *** Reason:  Cannot access a non-const vector from a subfunction.
    *** Where:   This error was encountered inside Function.cpp.
    *** Process: 19
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    
    *** -------------------------------------------------------------------------
    *** Error:   Unable to successfully call PETSc function 'KSPSolve'.
    *** Reason:  PETSc error code is: 63 (Argument out of range).
    *** Where:   This error was encountered inside /fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp.
    *** Process: 0
    *** 
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    
    *** -------------------------------------------------------------------------
    *** Error:   Unable to successfully call PETSc function 'MatAssemblyEnd'.
    *** Reason:  PETSc error code is: 63 (Argument out of range).
    *** Where:   This error was encountered inside /fenics/src/dolfin/dolfin/la/PETScMatrix.cpp.
    *** Process: 0
    *** 
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    
    *** -------------------------------------------------------------------------
    *** Error:   Unable to apply changes to sparsity pattern.
    *** Reason:  Received illegal sparsity pattern entry for row/column 2136199019, not in range [408461, 415074].
    *** Where:   This error was encountered inside SparsityPattern.cpp.
    *** Process: 62
    *** 
    *** DOLFIN version: 2018.1.0.dev0
    *** Git changeset:  2e7d72afc27e4f0d63be3cd5b1cc0473814645fa
    *** -------------------------------------------------------------------------
    

    This is usually after having run a few iterations in an optimization loop (with dolfin-adjoint). I have tried running both with 'OS_SYSTEM' and 'SUBPROCESS'.

    I have never seen these errors when running on just a single computer node.

    I will try to come up with a simple case which fails. Stay tuned....

    Update: Just got an error with '--mca btl tcp,self'! Could it be that the original issue is fixed and new ones are popping up?
