Including swig version in module hash
This issue exemplifies why removing the swig version from the module hash was not a good idea:
https://bitbucket.org/fenics-project/dolfin/issue/294/clean-instant-cache-on-version-upgrade
In addition to the upgrade example in the issue above, similar problems can probably occur if one has multiple installations in different environments sharing a .instant cache directory.
I don't recall exactly what the problem was, but it was something about heavy disk access on clusters? If instant gets the swig version only once per process (at startup or at first need) and caches it in an internal variable, would that avoid the problems?
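A minimal sketch of this per-process caching idea in Python; the function and cache names are hypothetical, not Instant's actual API. The runner is injectable so the caching behaviour can be exercised without SWIG installed:

```python
import subprocess

_cache = {}  # module-level cache: the external command runs at most once per process


def _run_swig_version():
    """Invoke 'swig -version' and return its output, or "" if swig is absent."""
    try:
        return subprocess.check_output(["swig", "-version"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return ""


def parse_swig_version(output):
    """Extract the version number from 'swig -version' output, or None."""
    for line in output.splitlines():
        if "Version" in line:
            return line.split()[-1]
    return None


def get_swig_version(runner=_run_swig_version):
    """Return the SWIG version, calling the external command at most once."""
    if "swig" not in _cache:
        _cache["swig"] = parse_swig_version(runner())
    return _cache["swig"]
```

With this, repeated module-hash computations within one process would reuse the cached value instead of forking `swig -version` each time.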
Comments (33)
-
reporter It just seems weird that calling swig to do actual work is fine but asking for the version is a problem...
-
On multiple processes, the 'swig -version' was being called on every process on every run, whereas the actual parsing and compilation only takes place once on process zero. I would agree that there is still a fair bit of debugging to do...
-
reporter Ok. So this is an issue where an MPI-aware instant might be able to save the day, with process zero communicating the version (or even an integer hash of various compiler versions) to the processes.
-
The JIT code is flawed in parallel. It's not possible to assign the MPI process that is responsible for JIT. Maybe fixing this would resolve this Issue?
-
I guess it would involve another dependency, though - mpi4py?
-
reporter No, this issue is that every process (previously) would call "swig -version", because every process needed the version for the module hash in order to import the module after only process zero had done the actual jit. If process zero (or whichever single process does the jit) could communicate the module hash to the others, it could include the swig version and other compiler versions in the hash without the network filesystem overload.
-
@martinal Communicating the SWIG version would be easy. I wouldn't build this into Instant - better to let the calling process manage this.
-
reporter Ok, so a solution would be to
- Let dolfin collect toolchain versions into a toolchain_hash on the master node
- Let dolfin communicate toolchain_hash to all processes
- Let dolfin pass toolchain_hash as an argument to ffc jit which uses it in JitObject
So instant doesn't even need to be touched.
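A sketch of what the proposed flow could look like on the DOLFIN side, assuming an mpi4py-style communicator (mpi4py itself is an assumed optional dependency here); compute_toolchain_hash and broadcast_toolchain_hash are hypothetical names, not existing DOLFIN functions:

```python
import hashlib


def compute_toolchain_hash(versions):
    """Hash a dict of tool -> version string into a short hex digest."""
    payload = ";".join("%s=%s" % kv for kv in sorted(versions.items()))
    return hashlib.sha1(payload.encode()).hexdigest()[:12]


def broadcast_toolchain_hash(comm, versions):
    """Rank 0 computes the hash and broadcasts it, so other ranks never
    run 'swig -version' (or any other version probe) themselves.
    'comm' is any object exposing .rank and .bcast(), e.g. an mpi4py
    communicator."""
    value = compute_toolchain_hash(versions) if comm.rank == 0 else None
    return comm.bcast(value, root=0)
```

The resulting hash would then be passed down to the FFC jit call as an opaque string, exactly as step three above suggests.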
-
In addition to the upgrade example in the above issue, similar problems can probably occur if one has multiple installations in different environments sharing a .instant cache directory.
Exactly. We solve this here in Prague by setting INSTANT_FOO_DIR environment variables for FEniCS builds with different builds of SWIG and NumPy (taking into account not only the version but distinct builds, for example with different MPI or BLAS). It would be logical to let the hash also depend on the NumPy version.
The problem was that the system call tended to crash on some clusters.
If you mean memory corruptions caused by touching memory between fork and exec on OFED-based hardware (InfiniBand), this seems to be solved by pull request #2 and pull request #3.
It just seems weird that calling swig to do actual work is fine but asking for the version is a problem...
If the constellation of the stars is set up to cause a crash, then it will probably crash on the first fork, which is the SWIG version check. If you hardcoded the SWIG version and executable filepath, it segfaulted elsewhere from time to time. Now, it seems OK.
-
@martinal
If process zero (or whatever single process does the jit) could communicate the module hash to the others, it could include swig version and other compiler versions in the hash without the network filesystem overload.
In what sense does avoiding the swig -version call on each process help? You still touch the filesystem on every process when finally importing the module.
-
reporter Jan Blechta: That's what confuses me. Why is getting the swig version so special that it had to be removed?
-
True, there is nothing special about it, but it doesn't scale particularly well, apparently.
I think there are still some bugs/issues on some HPC systems, whichever method you use to execute system commands. If those are triggered one time in 1000, then you have a higher chance of hitting it. As @johanhake says, it can be difficult to debug. Avoiding system calls altogether would be a good long term aim, perhaps.
-
@chris_richardson DOLFIN import on 1000 cores is probably much more catastrophic. But OK, I agree.
Have you tried the new INSTANT_SYSTEM_CALL_METHOD=OS_SYSTEM? This uses the C system() call, which should be OFED-fork safe.
-
@blechta - yes I tried your patch, but it still crashes - sometimes. Reading in modules never seems to be a problem.
-
@chris_richardson To be sure, did you export INSTANT_SYSTEM_CALL_METHOD=OS_SYSTEM? It is still not the default in master. Could you supply some details of the crash?
As for the import disaster, I don't have sufficiently large-scale experience. Nevertheless, @jedbrown mentioned that you burn a barrel of oil starting Python, NumPy, and petsc4py on 65k cores on NFS.
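The OS_SYSTEM idea discussed above could look roughly like this. This is only an illustrative sketch of the dispatch, not Instant's actual implementation (which also has to capture command output); the environment variable name and value are the real ones from the thread:

```python
import os
import subprocess


def run_command(cmd):
    """Run a shell command, choosing the mechanism from the
    INSTANT_SYSTEM_CALL_METHOD environment variable. Returns the exit code."""
    method = os.environ.get("INSTANT_SYSTEM_CALL_METHOD", "SUBPROCESS")
    if method == "OS_SYSTEM":
        # os.system() goes through the C library's system(), which avoids a
        # Python-level fork+exec and is reportedly safer on OFED/InfiniBand.
        return os.waitstatus_to_exitcode(os.system(cmd))
    # Default path: subprocess, i.e. fork+exec driven from Python.
    return subprocess.call(cmd, shell=True)
```

Either branch returns the command's exit code, so callers need not care which mechanism was selected.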
-
The JIT code is flawed in parallel. It's not possible to assign the MPI process that is responsible for JIT. Maybe fixing this would resolve this Issue?
This needs to be done in DOLFIN. We could add a (global!?) parameter controlling this.
Communicating the SWIG version would be easy.
How would we do that in the present DOLFIN? AFAIK there is no interface in the Python layer to send a string from one process to another. If that is easy, or easily fixed, we can just add a function to instant for manually setting the SWIG version. Then during startup (import of dolfin) the first process checks the SWIG version and broadcasts it to the other processes, which set it manually.
-
@johanhake To sort out which process does the JIT compilation, it should be made possible to pass an MPI communicator, or to initialise a 'JIT compiler' object with an MPI communicator. One can imagine scenarios where only a subset of processes would work on a particular equation.
We can easily add a function to dolfin::MPI to broadcast a string. Even better would be to broadcast an integer for the version.
-
@garth-wells Could be tricky to create a JIT compiler object, as JIT is called from several places inside the library. Such an object would then need to be passed throughout the interface.
We could introduce a global set_jit_mpi_comm. By default it would be mpi_comm_world. But if petsc4py is not installed, can we get the range of processes a communicator is using?
An integer for the SWIG version should be fine.
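Broadcasting an integer instead of a string could work by packing the version triple into one int, which then ships as a single native-int broadcast. This encoding is hypothetical, not anything DOLFIN actually does:

```python
def encode_version(version_string):
    """Pack 'X.Y.Z' into one integer (8 bits each for minor and patch),
    so the version fits in a single MPI broadcast of a native int."""
    major, minor, patch = (int(p) for p in version_string.split(".")[:3])
    return (major << 16) | (minor << 8) | patch


def decode_version(packed):
    """Inverse of encode_version: recover the 'X.Y.Z' string."""
    return "%d.%d.%d" % (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)
```

The scheme assumes minor and patch numbers stay below 256, which holds for SWIG releases of the era discussed here.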
-
reporter jit isn't really called a lot of places. Searching for 'jit(' gives the jit.py contents:
martinal@martinal-mc:~/dev/fenics/dolfin$ grep 'jit(' `find -name \*.py`
./site-packages/dolfin/compilemodules/jit.py: def mpi_jit(*args, **kwargs):
./site-packages/dolfin/compilemodules/jit.py: return local_jit(*args, **kwargs)
./site-packages/dolfin/compilemodules/jit.py: output = local_jit(*args, **kwargs)
./site-packages/dolfin/compilemodules/jit.py: output = local_jit(*args,**kwargs)
./site-packages/dolfin/compilemodules/jit.py:def jit(form, form_compiler_parameters=None):
plus just two more calls:
./site-packages/dolfin/functions/functionspace.py: ufc_element, ufc_dofmap = jit(self._ufl_element)
./site-packages/dolfin/fem/form.py: = jit(form, form_compiler_parameters)
Essentially in the constructors of FunctionSpaceBase and Form, which both have a mesh available which has a communicator.
Further instant compilation from dolfin includes AFAIK only:
./site-packages/dolfin/compilemodules/compilemodule.py: compiled_module = instant.import_module(module_name)
./site-packages/dolfin/compilemodules/compilemodule.py: compiled_module = instant.build_module(\
which already creates module_name by hashing various versions together with the code to compile.
Does instant actually need to know the swig version? I suggested above that we rather pass a toolchain_hash to jit from dolfin, which ffc can include in the module hash it passes to instant. Then all modules will get a new hash after a toolchain upgrade, and the toolchain can include both swig and c++ compiler versions and whatever else we may need in the future.
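The toolchain_hash proposal could be sketched like this; module_signature is a hypothetical helper for illustration, not FFC's actual signature scheme:

```python
import hashlib


def module_signature(code, toolchain_hash):
    """Combine the generated code with a toolchain hash, so that upgrading
    SWIG or the C++ compiler changes the cached module name and stale
    modules are never re-imported."""
    h = hashlib.sha1()
    h.update(code.encode())
    h.update(toolchain_hash.encode())
    return "ffc_module_" + h.hexdigest()[:16]
```

Because the toolchain hash is just another input string, instant itself never needs to probe any tool versions; it only sees the final module name.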
-
reporter However, compiled Expressions and SubDomains do not have a mesh and therefore have no MPI communicator available.
-
jit isn't really called a lot of places.
No, but the potential JIT object also needs to be passed to methods using Form and compile_extension_module. That is mostly solve, assemble, and adaptive_solve for Form, and CompiledSubDomain and Expression for compile_extension_module.
Does instant actually need to know the swig version?
Not really, but it is natural that instant keeps track of runtime SWIG version checking, as long as we can choose not to use it.
The SWIG version was not checked in many places; I think it was in FFC hash generation and in the compilemodule in dolfin. So what we had was close to what you suggest, I guess. We just need to make sure the version check is done on only one process.
-
reporter The swig version was used to check that dolfin and the jitted module use the same swig version? That can also be checked by dolfin on import, without involving instant.
-
However, compiled Expressions and SubDomains do not have a mesh and therefore have no MPI communicator available.
Why could rank 0 from comm_world not perform the compilation in this case?
-
If Expressions and SubDomains need to pass messages, then they should have a communicator. We can default to the objects storing MPI_COMM_WORLD, but it should also be possible for a user to pass their own communicator.
-
reporter Why could rank 0 from comm_world not perform the compilation in this case?
If a subset of processes solves equation A and another subset solves equation B, the processes will need to perform different sets of steps, which may involve e.g. creating Expression objects that only live on those subsets of processes. Rank 0 of comm_world may not be one of the Expression owners.
-
reporter I guess in general we do not know what a dolfin program with equations on different MPI process groups will look like.
-
Several issues here: adding better MPI awareness to the JIT process (or, more fundamentally, to the whole Python layer of DOLFIN), and the need for keeping the SWIG version in the hash. The latter is the actual issue...
-
The swig version was used to check that dolfin and the jitted module use the same swig version? That can also be checked by dolfin on import without involving instant.
And it is checked. I think the problem now is that we only raise an error telling the user to sync the SWIG versions. We could just add to the error message that the user should run instant_clean.
However, we still have the problem of several SWIG versions using the same instant dir. This can be fixed by adding the SWIG version back to the hash signature. To avoid problems on clusters, we need to make sure this is done only on process 0. Also, this has nothing to do with JITing forms on a subset of processes; all processes need the current SWIG version.
-
reporter Yes, the subset of processes issue was introduced by Garth above and has no relation to the swig/toolchain hash issue.
-
reporter Here's my summary of this issue, rewritten to dolfin actions:
https://bitbucket.org/fenics-project/dolfin/issue/305/passing-swig-toolchain-version-hash-to-jit
-
reporter - changed status to closed
Superseded by two dolfin issues linked to in the comments here.
The problem was that the system call tended to crash on some clusters.
The SWIG version is cached the first time it is asked for, so moving the caching from first usage to first import will probably not help. But the issue is hard to debug. Maybe @chris_richardson or @blechta have more input?