File locking on cluster

Issue #16 resolved
Chris Richardson created an issue

File locking with flufl.lock is failing intermittently on large clusters (>300 cores). We should aim to rationalise the locking scheme, applying locking only when copying to the cache, and consider other locking options (such as using MPI::barrier()).
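For reference, a barrier-based alternative could look roughly like the sketch below. It uses mpi4py rather than the C++-style MPI::barrier(), and build_module and import_from_cache are hypothetical stand-ins for Instant's actual build and import entry points: only rank 0 compiles and populates the cache, while the other ranks wait at the barrier and then import.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    if comm.Get_rank() == 0:
        # Only one process compiles and copies the module to the cache
        module = build_module(modulename, cache_dir)

    # All ranks wait here until rank 0 has finished writing the cache
    comm.Barrier()

    if comm.Get_rank() != 0:
        # The cache is now guaranteed to be complete for this module
        module = import_from_cache(modulename, cache_dir)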

Comments (15)

  1. Martin Sandve Alnæs

I think the second lock in build.py after the call to copy_to_cache is of no use, since copy_to_cache locks while writing, and when it exits we can assume the module is there. I.e., this should be sufficient:

            # Optionally copy compiled module to cache
            if use_cache:
                module_path = copy_to_cache(module_path, cache_dir, modulename)
            # Import module and place in memory cache
            module = import_and_cache_module(module_path, modulename, moduleids)
    

In check_disk_cache in cache.py, the lock could be avoided, but care must be taken to implement a flag of some kind such that the module is known to be completely written before we try to read it. Currently there is only a check for the existence of the directory, but that directory is created at the beginning of the copy-to-cache operation and is thus open to read/write race conditions.

    Here's the function in cache.py:

    def check_disk_cache(modulename, cache_dir, moduleids):
        # Get file lock to avoid race conditions in cache
        lock = get_lock(cache_dir, modulename)
    
        # Ensure a valid cache_dir
        cache_dir = validate_cache_dir(cache_dir)
    
        # Check on disk, in current directory and cache directory
        for path in (os.getcwd(), cache_dir):
            if os.path.isdir(os.path.join(path, modulename)):
                # Found existing directory, try to import and place in memory cache
                module = import_and_cache_module(path, modulename, moduleids)
                if module:
                    instant_debug("In instant.check_disk_cache: Imported module "\
                                  "'%s' from '%s'." % (modulename, path))
                    release_lock(lock)
                    return module
                else:
                    instant_debug("In instant.check_disk_cache: Failed to import "\
                                  "module '%s' from '%s'." % (modulename, path))
    
        # All attempts failed
        instant_debug("In instant.check_disk_cache: Can't import module with modulename "\
                      "%r using cache directory %r." % (modulename, cache_dir))
        release_lock(lock)
        return None
    

    The check

    if os.path.isdir(os.path.join(path, modulename)):
    

    must be replaced with something safer, e.g.

    if os.path.isdir(os.path.join(path, modulename, "instant_finished_copying_flag_file")):
    

    where this file is written at the end of the copy-to-cache operation.
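    For illustration, a minimal sketch of writing that flag as the last step of the copy (mark_module_finished and the FINISHED_FLAG name are just suggestions, not existing Instant code):

        import os

        FINISHED_FLAG = "instant_finished_copying_flag_file"

        def mark_module_finished(cache_dir, modulename):
            # Create an empty flag file; its existence signals that the
            # module directory has been completely written to the cache.
            flag_path = os.path.join(cache_dir, modulename, FINISHED_FLAG)
            open(flag_path, "w").close()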

  2. Johan Hake

    I guess it should be possible to write an empty finished file in the cache directory once all the copying is done. However, isn't the worst thing that can happen that the import fails and the present run just regenerates the compiled module? That is exactly what happens if the test in check_disk_cache fails, so we are no worse off.

  3. Martin Sandve Alnæs

    This is where Alice and Bob enter the picture in computer science :)

    Assume that Alice starts to write the module: for instance, she creates the directory and a couple of files, but is not done when... Bob, from a different MPI job, starts to read the module, and fails spectacularly because only half of the module's files are there. Soon after, Alice completes writing the module, but alas, Bob has already stumbled and his parent MPI job has died, with thousands of CPU hours lost...

    That's the situation we're trying to avoid.

    Are you willing to gamble that importing a module that is not completely written to file does not crash? If it doesn't crash and you suddenly have garbage code in memory in a different MPI process, good luck debugging that.

    What should happen is that check_disk_cache either successfully imports a previously built and fully copied module, or returns None if the module is not there. The read lock is currently there to avoid reading until the write lock has been released. Checking for a 'finished' file should be a sufficient replacement.

  4. Johan Hake

    I guess this is the worst-case scenario. I would suspect (maybe even gamble?) that some error is raised when Bob fails to import the half-cooked module, and because the import is inside a try/except clause, the function will just return None.

  5. Martin Sandve Alnæs

    We're not adding synchronization for the best-case scenario :)

    The scenario is not that far off either: if you're starting several jobs in a loop with different parameters, the jobs are likely to access the same modules simultaneously. Because compiling takes time, the race condition is then quite likely to occur.

    AFAIK the 'finished' file approach should work fine and is trivial to implement, so no reason to gamble.

    The minor downside is this: since Bob will no longer wait for Alice to release the lock, he will compile the module himself and then discard it when he sees that Alice beat him to it. But that's just a minor performance hit.

  6. Johan Hake

    The double compilation is just a minor performance hit. However, suppose Mike, Alice, and Bob are running the same script. Mike and Alice are ahead, and both start compiling the module. Mike wins and copies the module to the cache. Bob enters the scene and tries to import the module that Mike has compiled and copied to the cache for him, when suddenly Alice finishes her compilation. She copies to the cache while Bob is reading...

    The gamble now has better odds, but I guess it is still gambling.

  7. Martin Sandve Alnæs

    That's what the write lock is for.

    • Alice sees that the module is missing and starts compiling.
    • Mike sees that the module is missing and starts compiling.
    • Alice finishes first and takes the write lock, checks that the module is still not there, copies it, writes the 'finished' file, and then releases the write lock.
    • Mike finishes and tries to take the write lock (possibly having to wait while Alice is copying); when he gets it, he sees that the module is now there and discards his own compilation result.

    If Bob looks for the module's 'finished' file at any point in this sequence, he will correctly decide whether the module is fully copied or not.
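    As a sketch of this write-side protocol (assuming get_lock/release_lock from cache.py; the flag file name and other details are illustrative, not Instant's actual code):

        import os
        import shutil

        def copy_to_cache(module_path, cache_dir, modulename):
            # Serialize writers: only one process copies at a time.
            lock = get_lock(cache_dir, modulename)
            try:
                target = os.path.join(cache_dir, modulename)
                flag = os.path.join(target, "instant_finished_copying_flag_file")
                if not os.path.isfile(flag):
                    # We won the race: copy the compiled module, then write
                    # the flag last so readers never see a partial copy
                    # (assumes a writer holding the lock always completes).
                    shutil.copytree(module_path, target)
                    open(flag, "w").close()
                # else: someone beat us to it; our compilation is discarded
            finally:
                release_lock(lock)
            return target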

  8. Johan Hake

    Ok, we are now on the same page. I can see if I am able to fix this. So:

    1. Remove second file_lock in build.py (line 541)
    2. Remove get_lock in cache.py (line 113)
    3. Improve the check for an existing, already compiled module in the cache by checking for an empty finished_copying file, instead of just checking whether the directory exists (line 173 in build.py); see the sketch below.
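    For point 3, a minimal sketch of the revised check (module_is_complete is a hypothetical helper; flag file name as suggested above):

        import os

        def module_is_complete(path, modulename):
            # A module counts as cached only once the flag file, written
            # as the very last step of copying to cache, exists.
            flag = os.path.join(path, modulename,
                                "instant_finished_copying_flag_file")
            return os.path.isfile(flag)

    check_disk_cache would then call module_is_complete(path, modulename) in place of the bare os.path.isdir check, without taking any lock.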
  9. Martin Sandve Alnæs

    Looks good! Note that there are a couple of similar get_lock patterns in build.py that are probably not needed, for the same reasons.

  10. Chris Richardson reporter

    I'll try, but my network connection is slow this week. I think the main problem was the multiple locks (with MPI) happening in cache.py. So if those are gone, the locks with process zero during the build should be fine.
