Including swig version in module hash
This issue exemplifies why removing the swig version from the module hash was not a good idea:
https://bitbucket.org/fenics-project/dolfin/issue/294/clean-instant-cache-on-version-upgrade
In addition to the upgrade example in the issue above, similar problems can probably occur if one has multiple installations in different environments sharing a .instant cache directory.
I don't recall exactly what the problem was, but it was something about heavy disk access on clusters? If instant gets the swig version only once per process (at startup or at first need) and caches it in an internal variable, would that avoid the problems?
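A minimal sketch of this per-process caching idea in Python; the function and cache names are hypothetical, not Instant's actual API. The runner is injectable so the caching behaviour can be exercised without SWIG installed:

```python
import subprocess

_cache = {}  # module-level cache: the external command runs at most once per process


def _run_swig_version():
    """Invoke 'swig -version' and return its output, or "" if swig is absent."""
    try:
        return subprocess.check_output(["swig", "-version"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return ""


def parse_swig_version(output):
    """Extract the version number from 'swig -version' output, or None."""
    for line in output.splitlines():
        if "Version" in line:
            return line.split()[-1]
    return None


def get_swig_version(runner=_run_swig_version):
    """Return the SWIG version, calling the external command at most once."""
    if "swig" not in _cache:
        _cache["swig"] = parse_swig_version(runner())
    return _cache["swig"]
```

With this, repeated module-hash computations within one process would reuse the cached value instead of forking `swig -version` each time.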
Comments (33)
-
reporter It just seems weird that calling swig to do actual work is fine but asking for the version is a problem...
-
On multiple processes, the 'swig -version' was being called on every process on every run, whereas the actual parsing and compilation only takes place once on process zero. I would agree that there is still a fair bit of debugging to do...
-
reporter Ok. So this is an issue where an MPI-aware instant might be able to save the day, with process zero communicating the version (or even an integer hash of various compiler versions) to the processes.
-
The JIT code is flawed in parallel. It's not possible to assign the MPI process that is responsible for JIT. Maybe fixing this would resolve this Issue?
-
I guess it would involve another dependency, though - mpi4py?
-
reporter No, this issue is that every process (previously) would call "swig -version", because every process needed the version for the module hash in order to import the module after only process zero had done the actual jit. If process zero (or whichever single process does the jit) could communicate the module hash to the others, it could include the swig version and other compiler versions in the hash without the network filesystem overload.
-
@martinal Communicating the SWIG version would be easy. I wouldn't build this into Instant - better to let the calling process manage this.
-
reporter Ok, so a solution would be to
- Let dolfin collect toolchain versions into a toolchain_hash on the master node
- Let dolfin communicate toolchain_hash to all processes
- Let dolfin pass toolchain_hash as an argument to ffc jit which uses it in JitObject
So instant doesn't even need to be touched.
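A sketch of what the proposed flow could look like on the DOLFIN side, assuming an mpi4py-style communicator (mpi4py itself is an assumed optional dependency here); compute_toolchain_hash and broadcast_toolchain_hash are hypothetical names, not existing DOLFIN functions:

```python
import hashlib


def compute_toolchain_hash(versions):
    """Hash a dict of tool -> version string into a short hex digest."""
    payload = ";".join("%s=%s" % kv for kv in sorted(versions.items()))
    return hashlib.sha1(payload.encode()).hexdigest()[:12]


def broadcast_toolchain_hash(comm, versions):
    """Rank 0 computes the hash and broadcasts it, so other ranks never
    run 'swig -version' (or any other version probe) themselves.
    'comm' is any object exposing .rank and .bcast(), e.g. an mpi4py
    communicator."""
    value = compute_toolchain_hash(versions) if comm.rank == 0 else None
    return comm.bcast(value, root=0)
```

The resulting hash would then be passed down to the FFC jit call as an opaque string, exactly as step three above suggests.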
-
In addition to the upgrade example in the above issue, similar problems can probably occur if one has multiple installations in different environments sharing a .instant cache directory.
Exactly. We solve this here in Prague by setting INSTANT_FOO_DIR environment variables for FEniCS builds with different builds of SWIG and NumPy (taking into account not only the version but distinct builds, for example with different MPI or BLAS). It would be logical to let the hash also depend on the NumPy version.
The problem was that the system call tended to crash on some clusters.
If you mean memory corruptions caused by touching memory between fork and exec on OFED-based hardware (InfiniBand), this seems to be solved by pull request #2 and pull request #3.
It just seems weird that calling swig to do actual work is fine but asking for the version is a problem...
If the constellation of the stars is set up to cause a crash, then it will probably crash on the first fork, which is the SWIG version check. If you hardcoded the SWIG version and executable filepath, it segfaulted elsewhere from time to time. Now, it seems OK.
-
@martinal
If process zero (or whatever single process does the jit) could communicate the module hash to the others, it could include swig version and other compiler versions in the hash without the network filesystem overload.
In what sense does avoiding the swig -version call on each process help? You still touch the filesystem on every process when finally importing the module.
-
reporter Jan Blechta: That's what confuses me. Why is getting the swig version so special that it had to be removed?
-
True, there is nothing special about it, but it doesn't scale particularly well, apparently.
I think there are still some bugs/issues on some HPC systems, whichever method you use to execute system commands. If those are triggered one time in 1000, then you have a higher chance of hitting it. As @johanhake says, it can be difficult to debug. Avoiding system calls altogether would be a good long term aim, perhaps.
-
@chris_richardson DOLFIN import on 1000 cores is probably much more catastrophic. But OK, I agree.
Have you tried the new INSTANT_SYSTEM_CALL_METHOD=OS_SYSTEM? This uses the C system() call, which should be OFED-fork safe.
-
@blechta - yes I tried your patch, but it still crashes - sometimes. Reading in modules never seems to be a problem.
-
@chris_richardson To be sure, did you export INSTANT_SYSTEM_CALL_METHOD=OS_SYSTEM? It is still not the default in master. Could you supply some details of the crash?
As for the import disaster, I don't have sufficiently large-scale experience. Nevertheless, @jedbrown mentioned that you burn a barrel of oil starting Python, NumPy, and petsc4py on 65k cores on NFS.
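The OS_SYSTEM idea discussed above could look roughly like this. This is only an illustrative sketch of the dispatch, not Instant's actual implementation (which also has to capture command output); the environment variable name and value are the real ones from the thread:

```python
import os
import subprocess


def run_command(cmd):
    """Run a shell command, choosing the mechanism from the
    INSTANT_SYSTEM_CALL_METHOD environment variable. Returns the exit code."""
    method = os.environ.get("INSTANT_SYSTEM_CALL_METHOD", "SUBPROCESS")
    if method == "OS_SYSTEM":
        # os.system() goes through the C library's system(), which avoids a
        # Python-level fork+exec and is reportedly safer on OFED/InfiniBand.
        return os.waitstatus_to_exitcode(os.system(cmd))
    # Default path: subprocess, i.e. fork+exec driven from Python.
    return subprocess.call(cmd, shell=True)
```

Either branch returns the command's exit code, so callers need not care which mechanism was selected.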
-
The JIT code is flawed in parallel. It's not possible to assign the MPI process that is responsible for JIT. Maybe fixing this would resolve this Issue?
This needs to be done in DOLFIN. We could add a (global!?) parameter controlling this.
Communicating the SWIG version would be easy.
How would we do that in the present DOLFIN? AFAIK there is no interface in the Python layer to send a string from one process to another. If that is easy, or easily fixed, we can just add a function to instant for manually setting the SWIG version. Then during startup (import of dolfin) the first process checks the SWIG version and broadcasts it to the other processes, which set it manually.
-
@johanhake To sort out which process does the JIT compilation, it should be made possible to pass an MPI communicator, or to initialise a 'JIT compiler' object with an MPI communicator. One can imagine scenarios where only a subset of processes would work on a particular equation.
We can easily add a function to dolfin::MPI to broadcast a string. Even better would be to broadcast an integer for the version.
-
@garth-wells Could be tricky to create a JIT compiler object, as JIT is called from several places inside the library. Such an object would then need to be passed throughout the interface.
We could introduce a global set_jit_mpi_comm. By default it would be mpi_comm_world. But if petsc4py is not installed, can we get the range of processes a communicator is using?
An integer for the SWIG version should be fine.
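Broadcasting an integer instead of a string could work by packing the version triple into one int, which then ships as a single native-int broadcast. This encoding is hypothetical, not anything DOLFIN actually does:

```python
def encode_version(version_string):
    """Pack 'X.Y.Z' into one integer (8 bits each for minor and patch),
    so the version fits in a single MPI broadcast of a native int."""
    major, minor, patch = (int(p) for p in version_string.split(".")[:3])
    return (major << 16) | (minor << 8) | patch


def decode_version(packed):
    """Inverse of encode_version: recover the 'X.Y.Z' string."""
    return "%d.%d.%d" % (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)
```

The scheme assumes minor and patch numbers stay below 256, which holds for SWIG releases of the era discussed here.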
-
reporter jit isn't really called a lot of places. Searching for 'jit(' gives the jit.py contents:
martinal@martinal-mc:~/dev/fenics/dolfin$ grep 'jit(' `find -name \*.py`
./site-packages/dolfin/compilemodules/jit.py: def mpi_jit(*args, **kwargs):
./site-packages/dolfin/compilemodules/jit.py: return local_jit(*args, **kwargs)
./site-packages/dolfin/compilemodules/jit.py: output = local_jit(*args, **kwargs)
./site-packages/dolfin/compilemodules/jit.py: output = local_jit(*args,**kwargs)
./site-packages/dolfin/compilemodules/jit.py:def jit(form, form_compiler_parameters=None):
plus just two more calls:
./site-packages/dolfin/functions/functionspace.py: ufc_element, ufc_dofmap = jit(self._ufl_element)
./site-packages/dolfin/fem/form.py: = jit(form, form_compiler_parameters)
Essentially in the constructors of FunctionSpaceBase and Form, which both have a mesh available which has a communicator.
Further instant compilation from dolfin includes AFAIK only:
./site-packages/dolfin/compilemodules/compilemodule.py: compiled_module = instant.import_module(module_name)
./site-packages/dolfin/compilemodules/compilemodule.py: compiled_module = instant.build_module(\
which already creates module_name by hashing various versions together with the code to compile.
Does instant actually need to know the swig version? I suggested above that we rather pass a toolchain_hash to jit from dolfin, which ffc can include in the module hash it passes to instant. Then all modules will get a new hash after a toolchain upgrade, and the toolchain can include both swig and c++ compiler versions and whatever else we may need in the future.
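The toolchain_hash proposal could be sketched like this; module_signature is a hypothetical helper for illustration, not FFC's actual signature scheme:

```python
import hashlib


def module_signature(code, toolchain_hash):
    """Combine the generated code with a toolchain hash, so that upgrading
    SWIG or the C++ compiler changes the cached module name and stale
    modules are never re-imported."""
    h = hashlib.sha1()
    h.update(code.encode())
    h.update(toolchain_hash.encode())
    return "ffc_module_" + h.hexdigest()[:16]
```

Because the toolchain hash is just another input string, instant itself never needs to probe any tool versions; it only sees the final module name.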
-
reporter However, compiled Expressions and SubDomains do not have a mesh and therefore have no MPI communicator available.
-
jit isn't really called a lot of places.
No, but the potential JIT object also needs to be passed to methods using Form and compile_extension_module. That is mostly solve, assemble, and adaptive_solve for Form, and CompiledSubDomain and Expression for compile_extension_module.
Does instant actually need to know the swig version?
Not really, but it is natural that instant keeps track of runtime SWIG version checking, as long as we can choose not to use it.
The SWIG version was not checked in many places; I think it was in FFC hash generation and in the compilemodule in dolfin. So what we had was close to what you suggest, I guess. We just need to make sure the version check is done on only one process.
-
reporter The swig version was used to check that dolfin and the jitted module use the same swig version? That can also be checked by dolfin on import, without involving instant.
-
However, compiled Expressions and SubDomains do not have a mesh and therefore have no MPI communicator available.
Why could rank 0 from comm_world not perform the compilation in this case?
-
If Expressions and SubDomains need to pass messages, then they should have a communicator. We can default to the objects storing MPI_COMM_WORLD, but it should also be possible for a user to pass their own communicator.
-
reporter Why could rank 0 from comm_world not perform the compilation in this case?
If a subset of processes solves equation A and another subset solves equation B, the processes will need to perform different sets of steps, which may involve e.g. creating Expression objects that only live on those subsets of processes. Rank 0 of comm_world may not be one of the Expression owners.
-
reporter I guess in general we do not know what a dolfin program with equations on different MPI process groups will look like.
-
Several issues here: adding better MPI awareness to the JIT process (or, more fundamentally, to the whole Python layer of DOLFIN), and the need for keeping the SWIG version in the hash. The latter is the actual issue...
-
The swig version was used to check that dolfin and the jitted module use the same swig version? That can also be checked by dolfin on import without involving instant.
And it is checked. I think the problem now is that we only raise an error telling the user to sync the SWIG versions. We could just add to the error message that the user should run instant_clean.
However, we still have the problem of several SWIG versions using the same instant dir. This can be fixed by adding the SWIG version back to the hash signature. To avoid problems on clusters, we need to make sure this is done only on process 0. Also, this has nothing to do with JITing forms on a subset of processes; all processes need the current SWIG version.
-
reporter Yes, the subset of processes issue was introduced by Garth above and has no relation to the swig/toolchain hash issue.
-
reporter Here's my summary of this issue, rewritten to dolfin actions:
https://bitbucket.org/fenics-project/dolfin/issue/305/passing-swig-toolchain-version-hash-to-jit
-
reporter - changed status to closed
Superseded by two dolfin issues linked to in the comments here.
The problem was that the system call tended to crash on some clusters.
The SWIG version is cached the first time it is asked for, so moving the caching from first usage to first import will probably not help. But the issue is hard to debug. Maybe @chris_richardson or @blechta have more input?