Run fails if restart_dir does not exist
If restart_dir is set in the input file, the restart files are written into that directory. This is a nice feature as it keeps the top level directory a bit tidier; this is especially important when running big jobs as there is a restart file per process.
However, if the directory specified by the restart_dir parameter does not exist, then the run fails - it stops when it tries to write to a non-existent directory. This is especially annoying for big jobs that sit in the queue for ages to then just fail. Also, this happens after initialization which can be ~O(10) minutes so is a waste of compute time.
Therefore, I propose to modify the GS2 code so that if restart_dir is specified, then the program checks whether the directory exists and if not then it creates it.
This is related to but separate from issue #44 (Restarting a job overwrites the existing output file), which will be left to be fixed later.
I plan to fix this using call execute_command_line('mkdir -p restart_dir') with the correct directory name. This works whether or not the directory already exists, so we don't need to check for it first, and there is no window in which the directory could be created between the check and the creation.
Can someone comment on whether this issue should be fixed, and whether my proposed fix is an acceptable method, before I implement it? Thanks.
Comments (13)
-
-
reporter There are indeed other reasons why a run might fail, but a missing restart directory is a common one that is easily fixed. I do not propose to do anything about other failures related to writing restart files (full disk, permissions, etc).
The problem with a separate program is that one then has to remember (or be bothered) to run it! Therefore, it is preferable to have the directory created if missing within the GS2 run itself (so that it happens whether or not the pre-check program has been run).
Fair point about execute_command_line not being portable. Looks like the most portable way to do it is to use python, wrap that in C and call it from Fortran. If you think this might be an acceptable approach, I will do a separate test implementation to demonstrate the concept and check that it is indeed portable before putting it in GS2.
-
I'm not sure the python approach is necessarily portable either (and I'm not sure about the fortran->C->python chain). With regard to remembering to run the program, you could always do something like
alias submit="do_the_check *.in && sbatch"
(or whatever the correct syntax is).
-
reporter The alias is a workaround that will have to be set up by each user on each machine that they use. It would be much nicer to have it done automatically in a single place within the GS2 code.
As for portability, the python code would be something like:
    import os
    if not os.path.exists(directory):
        os.makedirs(directory)
The above assumes the directory is not created in the split second between os.path.exists and os.makedirs. This is probably a safe assumption and concurrent operations are arguably a separate issue anyway.
Fancier versions of the above are available (e.g. ones that deal with concurrent operations automatically), but these require newer versions of python, which could impact portability. The above works in old versions of python, which should be available on all but the most obscure systems.
I'm also not sure about the Fortran -> c -> py chain in that I have never tried it before, hence why I plan to try it out separately first. But it seems to be the standard portable way to call python code from Fortran.
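The window between os.path.exists and os.makedirs can be avoided without relying on a newer python: attempt the creation and treat "already exists" as success. A minimal sketch, assuming only the standard os and errno modules; the function name ensure_directory is made up for illustration and is not part of GS2:

```python
import errno
import os


def ensure_directory(directory):
    """Create directory (and any parents) if missing.

    Hypothetical helper: tolerates the directory already existing,
    including when another process creates it at the same moment.
    """
    try:
        os.makedirs(directory)
    except OSError as err:
        # EEXIST on an existing directory is fine. Anything else
        # (permissions, read-only filesystem, a plain file in the way)
        # is a real failure and is passed on to the caller.
        if err.errno != errno.EEXIST or not os.path.isdir(directory):
            raise
```

Note that this still raises on genuine failures, so whatever calls it (here, eventually, Fortran through C) would need a way to report them.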
-
Python can be fairly portable as in your example, but not very portable if you don't have python available on your machine -- this would add a hard dependency on python to GS2. You say it should be available on most machines, but it's still an extra dependency that may or may not be available/easy to use.
makedirs can raise an exception -- it's not clear how we'd handle that from Fortran, and it may lead to strange issues/unhelpful error messages.
An alternative approach, so that we don't abort if the user has forgotten to set up the directory, is:
1. Test if the directory is writable (done already)
2. If not:
   a. add a message to the error stream
   b. depending on a flag, either abort as we do now or default restart_dir to "./".
-
reporter Ah ha, yes, that's the easy way! Default to "./" if restart_dir is missing. As for the flag to abort, the default should be to use "./" so one has to specifically say "no, I want you to abort if the directory is missing" since the whole point is to prevent aborting runs just because restart_dir is missing. OK, so if we're happy with this, I can go ahead and fix this now?
-
I'd be inclined to say the default (at least for now) should be to maintain the existing behaviour -- defaulting to "./" could have unintended consequences (such as overwriting a set of restart files you were using) so we should make sure there's enough time for users to gain experience/familiarity.
-
We can have the error message and/or ingen suggest
you might want to set magic_flag = .true. to .....
-
reporter OK, fair enough, I'll do it that way. So I'm good to go now?
-
I think so -- you'll want to look for where restart_writable is used in gs2_diagnostics.fpp and diagnostics/gs2_diagnostics_new.f90.
-
reporter OK, thanks.
-
For completeness we might want to check that the restart is writable after changing restart_dir (e.g. in case we've actually run out of disk space or similar) and abort if our fallback option hasn't worked.
-
reporter Makes sense...
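The behaviour agreed in this thread (abort by default, optionally fall back to "./", then re-check the fallback) can be sketched as below. This is only an illustration of the decision logic in python, not the Fortran that would go into GS2; the flag name fallback_to_cwd and both function names are assumptions, and the writability test here simply tries to create a temporary file.

```python
import sys
import tempfile


def is_writable(directory):
    """Return True if a file can actually be created in directory."""
    try:
        # Creating (and automatically deleting) a real file catches a
        # missing directory and permission problems in one go.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


def choose_restart_dir(restart_dir, fallback_to_cwd=False):
    """Return the directory to use for restart files, or raise.

    fallback_to_cwd is a hypothetical flag; its default of False keeps
    the existing abort-on-failure behaviour.
    """
    if is_writable(restart_dir):
        return restart_dir
    sys.stderr.write("restart_dir '%s' is not writable\n" % restart_dir)
    # Re-check the fallback too: "./" may also be unusable.
    if fallback_to_cwd and is_writable("./"):
        sys.stderr.write("falling back to restart_dir = './'\n")
        return "./"
    raise RuntimeError("cannot write restart files")
```

With the flag left at its default, a missing directory still aborts, but with a message that can point the user at the flag.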
-
I'd like us to avoid execute_command_line calls when possible. They may not be that portable, for one.
There's a flag that gets GS2 to check if it can create restart files and then abort nicely if it can't -- I'd propose we could perhaps just make a small program (and/or incorporate this into ingen) that takes an input file and reports if the restart file can be written or not. You can then run this before submitting your job.
Note there are many reasons we might not be able to write a restart file (e.g. out of quota/disk space) which won't all be fixed by creating a directory.
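A stand-alone pre-submission check of the kind described above might look roughly like this. It is a sketch under several assumptions: the input file is treated as plain text and restart_dir is found with a simple regular expression rather than a real namelist parser, "./" is assumed as the default when restart_dir is unset, and all function names are invented for illustration.

```python
import re
import tempfile

# Matches an uncommented line such as:  restart_dir = "nc"
RESTART_DIR = re.compile(
    r"""^\s*restart_dir\s*=\s*['"]([^'"]*)['"]""",
    re.IGNORECASE | re.MULTILINE,
)


def find_restart_dir(text, default="./"):
    """Return the restart_dir value found in text, or default if unset."""
    match = RESTART_DIR.search(text)
    return match.group(1) if match else default


def can_write_restart(input_text):
    """Return (ok, detail) for the restart directory named in input_text."""
    directory = find_restart_dir(input_text) or "./"
    try:
        # Relative paths are resolved against the current directory,
        # so run the check from where the job will run.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as err:
        return False, "%s: %s" % (directory, err)
    return True, directory


def main(argv):
    """Check the input file named in argv[1]; return a shell exit code."""
    with open(argv[1]) as handle:
        ok, detail = can_write_restart(handle.read())
    print(("OK: " if ok else "FAIL: ") + detail)
    return 0 if ok else 1
```

Calling sys.exit(main(sys.argv)) from a script would give an exit code that a submission wrapper can test before queuing the job.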