Including swig version in module hash

Issue #19 closed
Martin Sandve Alnæs created an issue

This issue exemplifies why removing the swig version from the module hash was not a good idea:

https://bitbucket.org/fenics-project/dolfin/issue/294/clean-instant-cache-on-version-upgrade

In addition to the example of upgrading in the above issue, similar issues can probably occur if one has more multiple installations installed in different environments with shared .instant cache directory.

I don't recall exactly what the problem was, but it was something about heavy disk access on clusters? If instant can get the swig version only once during any process (at startup or at first need) and cache it in an internal variable, will that avoid the problems?

Comments (33)

  1. Johan Hake

    The problem was that the system call tended to crash on some clusters.

    The SWIG version is cashed, the first time it is asked for. Moving the caching from the first usage to the first import, will probably not help. But the issue is hard to debug. Maybe @chris_richardson or @blechta have any more inputs?

  2. Martin Sandve Alnæs reporter

    It just seems weird that calling swig to do actual work is fine but asking for the version is a problem...

  3. Chris Richardson

    On multiple processes, the 'swig -version' was being called on every process on every run, whereas the actual parsing and compilation only takes place once on process zero. I would agree that there is still a fair bit of debugging to do...

  4. Martin Sandve Alnæs reporter

    Ok. So this is an issue where an MPI-aware instant might be able to save the day, with process zero communicating the version (or even an integer hash of various compiler versions) to the processes.

  5. Prof Garth Wells

    The JIT code is flawed in parallel. It's not possible to assign the MPI process that is responsible for JIT. Maybe fixing this would resolve this Issue?

  6. Martin Sandve Alnæs reporter

    No, this issue is that every process (previously) would call "swig -version" because every process needed the version for the module hash to import after only process zero finished the actual jit. If process zero (or whatever single process does the jit) could communicate the module hash to the others, it could include swig version and other compiler versions in the hash without the network filesystem overload.

  7. Prof Garth Wells

    @martinal Communicating the SWIG version would be easy. I wouldn't build this into Instant - better to let the calling process manage this.

  8. Martin Sandve Alnæs reporter

    Ok, so a solution would be to

    1. Let dolfin collect toolchain versions into a toolchain_hash on the master node
    2. Let dolfin communicate toolchain_hash to all processes
    3. Let dolfin pass toolchain_hash as an argument to ffc jit which uses it in JitObject

    So instant doesn't even need to be touched.

  9. Jan Blechta
    In addition to the example of upgrading in the above issue, similar
    issues can probably occur if one has more multiple installations
    installed in different environments with shared .instant cache directory.
    

    Exactly. We solve this, here in Prague, by setting INSTANT_FOO_DIR environment variables for FEniCS builds with different builds of SWIG, NumPy (taking into account not only version but distinct builds; for example with different MPI, BLAS).

    It would be logical to let the hash depend also on NumPy version.

    The problem was that the system call tended to crash on some clusters.
    

    If you mean memory corruptions caused by touching the memory between fork and exec on OFED-based hardware (InfiniBand), this seems to be solved by pull request #2 and pull request #3.

    It just seems weird that calling swig to do actual work is fine but
    asking for the version is a problem...
    

    If the constellation of stars is set up to cause a crash, then it will probably crash on first fork, which is checking SWIG version. If you hardcoded SWIG version and executable filepath, it was segfaulting elsewhere from time to time. Now, it seems OK.

  10. Jan Blechta

    @martinal

    If process zero (or whatever single process does the jit) could communicate
    the module hash to the others, it could include swig version and other compiler
    versions in the hash without the network filesystem overload.
    

    In what sense avoiding swig -version call on each process helps? Still you touch a filesystem on every process when importing a module finally.

  11. Martin Sandve Alnæs reporter

    Jan Blechta: That's what confuses me. Why is getting the swig version so special that it had to be removed?

  12. Chris Richardson

    True, there is nothing special about it, but it doesn't scale particularly well, apparently.

    I think there are still some bugs/issues on some HPC systems, whichever method you use to execute system commands. If those are triggered one time in 1000, then you have a higher chance of hitting it. As @johanhake says, it can be difficult to debug. Avoiding system calls altogether would be a good long term aim, perhaps.

  13. Jan Blechta

    @chris_richardson DOLFIN import on 1000 cores is probably much more catastrophic. But OK, I agree.

    Have you tried the new INSTANT_SYSTEM_CALL_METHOD=OS_SYSTEM. This uses C system() call which should be OFED-fork safe.

  14. Chris Richardson

    @blechta - yes I tried your patch, but it still crashes - sometimes. Reading in modules never seems to be a problem.

  15. Johan Hake
    The JIT code is flawed in parallel. It's not possible to assign the MPI 
    process that is responsible for JIT. Maybe fixing this would resolve 
    this Issue?
    

    This need to be done in DOLFIN. We could add a (global!?) parameter controlling this.

    Communicating the SWIG version would be easy.
    

    How would we do that in the present DOLFIN? AFAIK there is no interface to send a string from one process to another from the Python interface.

    If that is easy or easily fixed we can just add a function to instant to manually set the swig version. Then during start up (import of dolfin) the first process check the swig version and broadcasts it to the other processes which manually sets it.

  16. Prof Garth Wells

    @johanhake To sort out which process does the JIT compilation, it should be made possible to pass an MPI communicator, or to initialise a 'JIT compiler' object with an MPI communicator. One can image scenarios where only a sub-set of processes would work on a particular equation.

    We can easily add a function to dolfin::MPI to broadcast a string. Even better is if we can broadcast an integer for the version.

  17. Johan Hake

    @garth-wells Could be tricky to create JIT compiler object as JIT is called from several places inside the library. Such objects then need to be passed throughout the interface.

    We could introduce a global set_jit_mpi_comm. By default it is mpi_comm_world. But if petsc4py is not installed can we get the range of processes a communicator is using?

    An integer for the SWIG version should be fine.

  18. Martin Sandve Alnæs reporter

    jit isn't really called a lot of places. Searching for 'jit(' gives the jit.py contents:

    martinal@martinal-mc:~/dev/fenics/dolfin$ grep 'jit(' `find -name \*.py`
    ./site-packages/dolfin/compilemodules/jit.py:    def mpi_jit(*args, **kwargs):
    ./site-packages/dolfin/compilemodules/jit.py:            return local_jit(*args, **kwargs)
    ./site-packages/dolfin/compilemodules/jit.py:            output = local_jit(*args, **kwargs)
    ./site-packages/dolfin/compilemodules/jit.py:            output = local_jit(*args,**kwargs)
    ./site-packages/dolfin/compilemodules/jit.py:def jit(form, form_compiler_parameters=None):
    

    plus just two more calls:

    ./site-packages/dolfin/functions/functionspace.py:        ufc_element, ufc_dofmap = jit(self._ufl_element)
    ./site-packages/dolfin/fem/form.py:                                 = jit(form, form_compiler_parameters)
    

    Essentially in the constructors of FunctionSpaceBase and Form, which both have a mesh available which has a communicator.

    Further instant compilation from dolfin includes AFAIK only:

    ./site-packages/dolfin/compilemodules/compilemodule.py:    compiled_module = instant.import_module(module_name)
    ./site-packages/dolfin/compilemodules/compilemodule.py:    compiled_module = instant.build_module(\
    

    which already creates module_name by hashing various versions together with the code to compile.

    Does instant actually need to know the swig version? I suggested above that we rather pass a toolchain_hash to jit from dolfin, which ffc can include in the module hash it passes to instant. Then all modules will get a new hash after a toolchain upgrade, and the toolchain can include both swig and c++ compiler versions and whatever else we may need in the future.

  19. Martin Sandve Alnæs reporter

    However, compiled Expressions and SubDomains do not have a mesh and therefore not an mpi communicator available.

  20. Johan Hake
    jit isn't really called a lot of places.
    

    No, but the potential JIT object also need to be passed to method using Form and compile_extension_module. It is mostly solve, assemble, and adaptive_solve for Form and CompiledSubDomain and Expression for compile_extension_module

    Does instant actually need to know the swig version?
    

    Not really, but it is natural that it is instant that keep track of runtime SWIG version checking. As long as we can choose not to use it.

    The swig version was not checked in many places, I think it was in FCC hash generation and in the compilemodule in dolfin. So what we had was close to what you suggest, I guess. We just need to make sure the version check is done only on one process.

  21. Martin Sandve Alnæs reporter

    The swig version was used to check that dolfin and the jitted module uses the same swig version? That can also be checked by dolfin on import without involving instant.

  22. Jan Blechta
    However, compiled Expressions and SubDomains do not have a mesh
    and therefore not an mpi communicator available.
    

    Why could not perform the compilation rank 0 from comm_world in this case?

  23. Prof Garth Wells

    If Expressions and SubDomains need to pass messages, then they should have a communicator. We can default to the objects storing MPI_COMM_WORLD, but it should also be possible for a user to pass their own communicator.

  24. Martin Sandve Alnæs reporter
    Why could not perform the compilation rank 0 from comm_world in this case?
    

    If a subset of processes solve equation A and another subset solves equation B the processes will need to perform a different set of steps which may involve e.g. creating Expression objects that only live on those subsets of processors. Rank 0 from comm_world may not be one of the Expression owners.

  25. Martin Sandve Alnæs reporter

    I guess in general we do not know how a dolfin program with equations on different mpi process groups will look like.

  26. Johan Hake

    Several issues here. Adding better MPI awareness to the JIT process, or more fundamentally to the whole Python layer of DOLFIN and the need for keeping the SWIG version in the hash. The latter being the actual issue...

  27. Johan Hake
    The swig version was used to check that dolfin and the jitted module 
    uses the same swig version? That can also be checked by dolfin on 
    import without involving instant.
    

    And it is checked. I think the problem is now that we only raise an error, telling the user to sync the SWIG versions. We could just add to the error message that the user should run instant_clean.

    However, we still have the problem with several swig versions using the same instant dir. This can be fixed by adding back the swig version to the hash signature. To avoid problems on clusters we need to make sure that this is done only on process 0. Also this has nothing to do with JIT forms on a subset of processes. All processes needs the current SWIG version.

  28. Martin Sandve Alnæs reporter

    Yes, the subset of processes issue was introduced by Garth above and has no relation to the swig/toolchain hash issue.

  29. Log in to comment