Instant cache failures on cluster

Issue #10 new
Martin Sandve Alnæs created an issue

The instant cache mechanism does not work satisfactorily on clusters. It needs to be made more robust, but this is of course not trivial.

When I submit a batch of similar jobs with different parameters on the Abel cluster at UiO, I often get failures that look like race conditions: one process starts a build, another waits on the lock, and the waiting process then fails to import the module after the lock is released. Or maybe the locking doesn't work properly on the filesystem? Rerunning the job usually works fine.

Here's the error and trace if anyone feels up to the task. I checked that going to the cache directory and importing the module manually works.

In instant.import_module_directly: Failed to import module 'instant_module_fe6679448b399c77b84a609783ec1b952fdf67c5' from '/usit/abel/u1/martinal/.instant/cache'; ImportError:No module named instant_module_fe6679448b399c77b84a609783ec1b952fdf67c5;
Failed to import module found in cache. Modulename: 'instant_module_fe6679448b399c77b84a609783ec1b952fdf67c5';
Path: '/usit/abel/u1/martinal/.instant/cache';
ImportError:No module named instant_module_fe6679448b399c77b84a609783ec1b952fdf67c5;

Traceback (most recent call last):
  ...
  File "/usit/abel/u1/martinal/no-backup/fenics/dev-2405-install/lib/python2.7/site-packages/ffc/jitcompiler.py", line 185, in jit_form
    module = instant.import_module(jit_object, cache_dir=cache_dir)
  File "/usit/abel/u1/martinal/no-backup/fenics/dev-2405-install/lib/python2.7/site-packages/instant/cache.py", line 156, in import_module
    return check_disk_cache(modulename, cache_dir, moduleids)
  File "/usit/abel/u1/martinal/no-backup/fenics/dev-2405-install/lib/python2.7/site-packages/instant/cache.py", line 122, in check_disk_cache
    module = import_and_cache_module(path, modulename, moduleids)
  File "/usit/abel/u1/martinal/no-backup/fenics/dev-2405-install/lib/python2.7/site-packages/instant/cache.py", line 68, in import_and_cache_module
    instant_assert(module is not None, "Failed to import module found in cache. Modulename: '%s';\nPath: '%s';\n%s:%s;" % (modulename, path, type(e).__name__, e))
  File "/usit/abel/u1/martinal/no-backup/fenics/dev-2405-install/lib/python2.7/site-packages/instant/output.py", line 55, in instant_assert
    raise AssertionError(text)
AssertionError: Failed to import module found in cache. Modulename: 'instant_module_fe6679448b399c77b84a609783ec1b952fdf67c5';
Path: '/usit/abel/u1/martinal/.instant/cache';
ImportError:No module named instant_module_fe6679448b399c77b84a609783ec1b952fdf67c5;

Comments (9)

  1. Johan Hake

    Standard file locking does not work on NFS, which is ironic, as clusters are the only place we need it. The solution has been to use flufl.lock, which is NFS-safe but relies on symlinks, and symlinks are not allowed on Abel. We need a Python file-locking library that is NFS-safe and works without symlinks.
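
One symlink-free idea that is sometimes used on shared filesystems is a lock directory: `os.mkdir` either creates the directory or fails because it already exists, so whichever process creates it holds the lock. The sketch below only illustrates that idea. It is not how Instant or flufl.lock work, the `DirLock` class and its `timeout` and `poll` parameters are hypothetical, and whether `mkdir` is reliably exclusive on Abel's filesystem is an assumption that would have to be verified there.

```python
import os
import time


class DirLock:
    """Hypothetical lock based on an exclusive mkdir (no fcntl, no symlinks)."""

    def __init__(self, path, timeout=60.0, poll=0.1):
        self.path = path        # directory whose existence means "locked"
        self.timeout = timeout  # seconds to wait before giving up
        self.poll = poll        # seconds between attempts

    def acquire(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # Assumption: mkdir is exclusive on the target filesystem.
                os.mkdir(self.path)
                return
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise TimeoutError("could not acquire %s" % self.path)
                time.sleep(self.poll)

    def release(self):
        os.rmdir(self.path)

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
        return False  # do not swallow exceptions from the with-body
```

A process that dies while holding the lock leaves the directory behind, so a real implementation would also need some stale-lock policy (for example an expiry time), which this sketch leaves out.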

  2. Martin Sandve Alnæs reporter

    Right. What's the correct way to set the instant cache directory at runtime? I'd rather take the compilation hit in each job by building in the job-local workdir than worry about race conditions destroying a subset of the jobs.
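
A possible way to get a job-local cache, sketched under two assumptions: that the installed Instant consults the `INSTANT_CACHE_DIR` environment variable when it resolves its default cache directory, and that the scheduler exports a job-local scratch path as `$SCRATCH`. Neither is confirmed in this thread, and the helper name `use_job_local_instant_cache` is made up for illustration.

```python
import os
import tempfile


def use_job_local_instant_cache(base_dir=None):
    """Point INSTANT_CACHE_DIR at a fresh per-job directory (hypothetical helper)."""
    # Assumption: the scheduler exports a job-local scratch path as $SCRATCH;
    # otherwise fall back to the system temp directory.
    base_dir = base_dir or os.environ.get("SCRATCH") or tempfile.gettempdir()
    cache_dir = tempfile.mkdtemp(prefix="instant-cache-", dir=base_dir)
    # Assumption: Instant reads this variable when choosing its cache directory.
    os.environ["INSTANT_CACHE_DIR"] = cache_dir
    return cache_dir
```

This would have to run before the first JIT compilation is triggered. The traceback above also shows that `instant.import_module` accepts a `cache_dir` argument, so passing the directory explicitly through whatever sets that argument may be the more direct route.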

  3. Chris Richardson

    I'm also having severe problems with JIT compilation/locking (using flufl) when I go over 300 cores. As discussed on the mailing list, would it be possible to review the locking mechanism in general? For example, it is probably not needed when reading modules which have already been compiled successfully.

  4. Johan Hake

    I think we should register a separate issue for the file locking, and then just push a fix for that. As far as I can see, we only need file locking when copying a compiled module to cache.
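
To illustrate why only the copy step is critical, here is a sketch of a related technique, staging followed by a rename, used here in place of a lock. The module is copied to a uniquely named staging directory inside the cache and then renamed to its final name, so a reader either sees no module or a complete one and needs no lock. The function `publish_module` and the staging naming scheme are hypothetical, and the sketch assumes that renaming a directory within the cache is atomic on the filesystem in use, which would need checking on NFS.

```python
import os
import shutil
import uuid


def publish_module(build_dir, cache_dir, modulename):
    """Copy a finished build into the cache under its final name (hypothetical helper)."""
    target = os.path.join(cache_dir, modulename)
    if os.path.isdir(target):
        return target  # already published, nothing to copy

    # Stage inside cache_dir so the final rename stays on one filesystem.
    staging = target + ".tmp-" + uuid.uuid4().hex
    shutil.copytree(build_dir, staging)
    try:
        # Assumption: directory rename is atomic on this filesystem.
        os.rename(staging, target)
    except OSError:
        # Most likely another process published first; discard our copy.
        shutil.rmtree(staging, ignore_errors=True)
        if not os.path.isdir(target):
            raise  # the failure had some other cause
    return target
```

With this layout, several processes may still compile the same module at the same time, which costs some duplicated work but never exposes a half-copied module to an importing process.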
