File locking on cluster
File locking is failing (sometimes) with flufl.lock on large clusters of >300 cores.
We should aim for a rationalisation of the locking scheme, with locking only applied when copying to the cache, and consider other locking options (such as using MPI::barrier()).
Comments (15)
-
I guess it should be possible to write an empty 'finished' file in the cache directory, which is created when all the copying is done. However, isn't the worst thing that can happen that the import fails and the present run just regenerates the compiled module? That is exactly what happens if the test in check_disk_cache fails, so we are equally far.
-
This is where Alice and Bob enter the picture in computer science :)
Assume that Alice starts to write the module: for instance, she creates the directory and a couple of files, but is not done when... Bob, from a different MPI job, starts to read the module, and fails spectacularly because only half of the module's files are there. Soon after, Alice completes writing the module, but alas, Bob has already stumbled and his parent MPI job has died with thousands of CPU hours lost...
That's the situation we're trying to avoid.
Are you willing to gamble that importing a module that is not completely written to file does not crash? If it doesn't crash and you suddenly have garbage code in memory in a different MPI process, good luck debugging that.
What should happen is that check_disk_cache either successfully imports a previously built and fully copied module, or returns None if the module is not there. The read lock is currently there to avoid reading until the write lock has been released. Checking for a 'finished' file should be a sufficient replacement.
-
I guess this is the worst case scenario. I would suspect (maybe even gamble?) that some error is raised when Bob fails to import the half-cooked module, and because the import is in a try/except clause, the function will just return None.
-
We're not adding synchronization for the best case scenario :)
The scenario is not that far off either, if you're starting several jobs in a loop with different parameters the jobs are likely to access the same modules simultaneously. Because compiling takes time, the race condition is then quite likely to occur.
AFAIK the 'finished' file approach should work fine and is trivial to implement, so no reason to gamble.
The minor downside is this: since Bob will no longer wait for Alice to release the lock, Bob will compile the module himself and then discard it when he sees that Alice beat him to it. But that's just a minor performance hit.
-
The double compilation is just a minor performance hit. However, suppose Mike, Alice and Bob are running the same script. Mike and Alice are in front and both start compiling the module. Mike wins and copies the module to the cache. Bob enters the scene and tries to import the module which Mike has compiled and copied to the cache for him, when suddenly Alice finishes her compilation. She copies to the cache while Bob is reading...
The gambling is against better odds but I guess it is still gambling.
-
That's what the write lock is for.
- Alice sees that the module is missing and starts compiling.
- Mike sees that the module is missing and starts compiling.
- Alice is done and takes the write lock, checks that the module is still not there, and copies, writes the 'finished' file, then releases the write lock.
- Mike is done and takes the write lock (possibly while Alice is copying), then when he gets it he sees that the module is now there and discards his own compilation result.
If Bob looks for the module 'finished' file at any point in here he will decide correctly whether the module is fully copied or not.
-
And Bob will never see Mike's work, because it never leaves Mike's tmp area.
-
Ok, we are now on the same page. I can see if I am able to fix this. So:
- Remove the second file_lock in build.py (line 541).
- Remove get_lock in cache.py (line 113).
- Improve the check for an existing, already compiled module in the cache by adding a check for an empty finished_copying file, instead of just checking if the directory exists (line 173 in build.py).
-
Looks good! Note that there are a couple of similar get_lock patterns in build.py that are probably not needed, for the same reasons.
-
@chris_richardson could you check if the pushed branch fixes the issue?
-
(reporter) I'll try - but my network connection is slow this week. I think the main problem was the multiple locks (with MPI) which were happening in cache.py. So, if they are gone, the locks with process zero during build should be fine.
-
Ok, then I just start graduating the fix.
-
- changed status to resolved
Fixed by: 9664907
-
- removed milestone
Removing milestone: 1.4 (automated comment)
I think the second lock in build.py after the call to copy_to_cache is of no use, since copy_to_cache locks while writing, and when it exits we can assume the module is there. I.e. this should be sufficient:
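Since the original snippet is not shown here, this is only a hypothetical sketch of the intended flow; the helper names (compile_fn, copy_to_cache, check_disk_cache) are stand-ins for whatever build.py actually calls:

```python
def build_module(compile_fn, copy_to_cache, check_disk_cache, cache_dir):
    """Sketch of the build flow without the second (outer) lock.

    copy_to_cache is assumed to lock internally and to write the
    'finished' marker as its last step, so when it returns, the cached
    module is guaranteed complete and a plain re-check suffices.
    """
    module = check_disk_cache(cache_dir)
    if module is not None:
        return module            # a fully copied module is already cached
    compiled_dir = compile_fn()  # compile in a private tmp area
    copy_to_cache(compiled_dir, cache_dir)  # locks while writing
    # No second lock here: copy_to_cache has already synchronized.
    return check_disk_cache(cache_dir)
```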
In check_disk_cache in cache.py, the lock could be avoided but care must be taken to properly implement a flag of some kind such that the module is known to be completely written before trying to read. Currently there is only a check for the existence of the directory, but that is created at the beginning of the copying-to-cache and will thus be open to read/write race conditions.
Here's the function in cache.py:
The check (currently just for the existence of the directory) must be replaced with something safer, e.g. a check for a 'finished' file, where this file is written at the end of the copy-to-cache operation.