Run fails if restart_dir does not exist

Issue #66 new
Stephen Biggs-Fox created an issue

If restart_dir is set in the input file, the restart files are written into that directory. This is a nice feature as it keeps the top level directory a bit tidier; this is especially important when running big jobs as there is a restart file per process.

However, if the directory specified by the restart_dir parameter does not exist, then the run fails - it stops when it tries to write to a non-existent directory. This is especially annoying for big jobs that sit in the queue for ages to then just fail. Also, this happens after initialization which can be ~O(10) minutes so is a waste of compute time.

Therefore, I propose to modify the GS2 code so that if restart_dir is specified, then the program checks whether the directory exists and if not then it creates it.

This is related to but separate from issue #44 (Restarting a job overwrites the existing output file), which will be left to be fixed later.

I plan to fix this using call execute_command_line('mkdir -p restart_dir') with the correct directory name. This works whether or not the directory already exists, so we don't have to check for existence first, or worry about the directory being created between the check and the mkdir.

Can someone please comment on whether this issue should be fixed, and whether my proposed fix is an acceptable method, before I implement it? Thanks.

Comments (13)

  1. David Dickinson

    I'd like us to avoid execute_command_line calls when possible. They may not be that portable for one.

    There's a flag that gets GS2 to check if it can create restart files and then abort nicely if it can't -- I'd propose we could perhaps just make a small program (and/or incorporate this into ingen) that takes an input file and reports if the restart file can be written or not. You can then run this before submitting your job.

    Note there are many reasons we might not be able to write a restart file (e.g. out of quota/disk space) which won't all be fixed by creating a directory.

  2. Stephen Biggs-Fox reporter

    There are indeed other reasons why a run might fail, but a missing restart directory is a common one that is easily fixed. I do not propose to do anything about other failures related to writing restart files (full disk, permissions, etc).

    The problem with a separate program is that one then has to remember (or be bothered) to run it! Therefore, it is preferable to have the directory created if missing within the GS2 run itself (so that it happens whether or not the pre-check program has been run).

    Fair point about execute_command_line not being portable. Looks like the most portable way to do it is to use python, wrap that in C and call it from Fortran. If you think this might be an acceptable approach, I will do a separate test implementation to demonstrate the concept and check that it is indeed portable before putting it in GS2.

  3. David Dickinson

    I'm not sure the python approach is necessarily portable either (and I'm not sure about the fortran->C->python chain). With regard to remembering to run the program, you could always do something like

    alias submit="do_the_check *.in && sbatch"

    (or whatever the correct syntax is).

  4. Stephen Biggs-Fox reporter

    The alias is a workaround that will have to be set up by each user on each machine that they use. It would be much nicer to have it done automatically in a single place within the GS2 code.

    As for portability, the python code would be something like:

    import os
    if not os.path.exists(directory):
        os.makedirs(directory)
    

    The above assumes the directory is not created in the split second between os.path.exists and os.makedirs. This is probably a safe assumption and concurrent operations are arguably a separate issue anyway.

    Fancier versions of the above are available (e.g. ones that deal with concurrent operations automatically), but these require newer versions of python, which could impact portability. The above works in old versions of python, which should be available on all but the most obscure systems.

    I'm also not sure about the Fortran -> c -> py chain in that I have never tried it before, hence why I plan to try it out separately first. But it seems to be the standard portable way to call python code from Fortran.
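One way to tolerate the directory appearing between the check and the creation, without relying on newer python features, is to attempt the creation and treat "already exists" as success. This is a minimal sketch only; the function name ensure_dir is hypothetical and nothing like it exists in GS2.

```python
import errno
import os


def ensure_dir(directory):
    """Create directory (and any parents) if missing; succeed if it already exists."""
    try:
        os.makedirs(directory)
    except OSError as err:
        # EEXIST means the path was already there (possibly created by another
        # process a moment ago). Re-raise any other error, and also the case
        # where the path exists but is not a directory.
        if err.errno != errno.EEXIST or not os.path.isdir(directory):
            raise
    return directory
```

Calling ensure_dir twice on the same path is harmless, which mirrors the behaviour of mkdir -p. Other failures (permissions, full disk) still raise an exception that the caller would have to handle.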

  5. David Dickinson

    Python can be fairly portable as in your example, but not very portable if you don't have python available on your machine -- this would add a hard dependency on python to GS2. You say it should be available on most machines, but it's still an extra dependency that may or may not be available/easy to use.

    makedirs can raise an exception -- it's not clear how we'd handle that from Fortran, may lead to strange issues/unhelpful error messages.

    An alternative approach, so that we don't abort if the user has forgotten to set up the directory:

    1. Test if the directory is writable (done already).
    2. If not:
        a. add a message to the error stream;
        b. depending on a flag, either abort as we do now or default restart_dir to "./".
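The steps above could be sketched as follows. This is python purely for illustration (the real change would be in the Fortran code), and the names resolve_restart_dir and fallback_to_cwd are hypothetical; the thread does not name the flag.

```python
import os
import sys


def resolve_restart_dir(restart_dir, fallback_to_cwd=False):
    """Return the directory to use for restart files, or raise if unusable.

    fallback_to_cwd is a hypothetical flag name, not an existing GS2 input.
    """
    # Step 1: is the requested directory present and writable?
    if os.path.isdir(restart_dir) and os.access(restart_dir, os.W_OK | os.X_OK):
        return restart_dir

    # Step 2a: add a message to the error stream.
    sys.stderr.write("restart_dir '%s' is not writable\n" % restart_dir)

    # Step 2b: either abort (the current behaviour) or fall back to "./".
    if not fallback_to_cwd:
        raise RuntimeError("cannot write restart files to '%s'" % restart_dir)
    sys.stderr.write("falling back to restart_dir = './'\n")
    return "./"
```

In this sketch the flag defaults to aborting, so existing behaviour is unchanged unless the user opts in to the fallback.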

  6. Stephen Biggs-Fox reporter

    Ah ha, yes, that's the easy way! Default to "./" if restart_dir is missing. As for the flag to abort, the default should be to use "./" so one has to specifically say "no, I want you to abort if the directory is missing" since the whole point is to prevent aborting runs just because restart_dir is missing. OK, so if we're happy with this, I can go ahead and fix this now?

  7. David Dickinson

    I'd be inclined to say the default (at least for now) should be to maintain the existing behaviour -- defaulting to "./" could have unintended consequences (such as overwriting a set of restart files you were using) so we should make sure there's enough time for users to gain experience/familiarity.

  8. David Dickinson

    We can have the error message and/or ingen suggest you might want to set magic_flag = .true. to .....

  9. David Dickinson

    I think so -- you'll want to look for where restart_writable is used in gs2_diagnostics.fpp and diagnostics/gs2_diagnostics_new.f90.

  10. David Dickinson

    For completeness we might want to check that the restart is writable after changing restart_dir (e.g. in case we've actually run out of disk space or similar) and abort if our fallback option hasn't worked.
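A rough sketch of that re-check, again in python for illustration only: writability is probed by actually creating a small file, first in the requested directory and then in the fallback, and the run aborts if neither works. The names is_writable and choose_restart_dir are hypothetical, and a tiny probe file will not catch every case (for example, a disk with too little space left for a full restart file).

```python
import os
import sys
import tempfile


def is_writable(directory):
    """Probe writability by creating (and automatically removing) a small file."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"x")
            probe.flush()
        return True
    except (OSError, IOError):
        # Missing directory, no permission, quota exceeded, and so on.
        return False


def choose_restart_dir(restart_dir, fallback="./"):
    """Try restart_dir, then the fallback; abort if neither can be written."""
    for candidate in (restart_dir, fallback):
        if is_writable(candidate):
            return candidate
        sys.stderr.write("cannot write restart files in '%s'\n" % candidate)
    raise RuntimeError("no writable location for restart files")
```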
