read_many must match save_many on nonlinear runs on machines with NETCDF_PARALLEL enabled

Issue #84 new
Ollie Beeke created an issue

I ran a simulation on ARCHER with save_many set to .true. in gs2_diagnostics_knobs but read_many left unspecified in init_g_knobs. The simulation ended prematurely when the timestep changed. I think I now understand why this happens:

When the timestep changes and in_memory=.false., restart files are written out to free up memory for calculating the new response matrix, and these files are read back in afterwards. With save_many=.true. one restart file is generated per processor, but if read_many=.false. the code then attempts to read a single restart file with no appended processor index, which fails.
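For concreteness, this is the failing combination of input-file settings (everything else omitted); per the behaviour described above, leaving read_many unset acts as if it were .false.:

```
&gs2_diagnostics_knobs
  save_many = .true.    ! each processor writes its own restart file
/

&init_g_knobs
  read_many = .false.   ! default behaviour: a single combined restart file is expected on read
/
```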

I have submitted a couple of small simulations to check whether the above is indeed the case, and will confirm once the jobs finish.

Nevertheless, I can't imagine a good reason for having save_many /= read_many in any nonlinear run, so my suggestion would be either to keep only one of the two variables, or to set save(read)_many to match read(save)_many whenever only the latter is specified in the input file. If both are specified and differ from one another, a warning could be printed, or the run could even be aborted so that CPU time is not wasted.

Furthermore, the purpose of Parallel NetCDF is unclear to me. Writing to a single restart file seems to restrict our ability to restart a simulation on a machine different from the one it was originally run on, since not every machine has the modules needed to read from or write to a single restart file. The issue raised above is another direct consequence of using Parallel NetCDF. Perhaps continued support for Parallel NetCDF should be raised as a separate issue for discussion?

Comments (4)

  1. David Dickinson

    Yes, I think your diagnosis is likely correct. I'm not sure why there are separate options for read and save, but I can think of one possible use case (there may be others): you've run on one system without parallel netCDF and want to restart elsewhere, on a system that does have it, and use a single restart file from there on. A bit convoluted, but not impossible.

    Yes, parallel netCDF is not the default because it is not universally supported. It's actually supposed to help our ability to restart in different scenarios, as it allows us to use a different number of processors than was originally used. It should be relatively straightforward to write a small program that takes a single restart file and spits out the equivalent set of many restart files for a specified number of processors, which would help avoid the problems caused by not having parallel netCDF elsewhere; a rough sketch follows.
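    Something like the sketch below, using the Python netCDF4 bindings. All restart-file details here (the name of the distributed dimension, the ".proc" filename suffix, the block decomposition, the helper name split_restart) are illustrative assumptions rather than GS2's actual conventions:

    ```python
    # Sketch of a single-file -> many-file restart converter.
    # Assumptions (not taken from GS2): the distributed dimension is called
    # "glo", per-processor files are named "<file>.<proc>", and a simple
    # contiguous block decomposition is used.  Unlimited dimensions are
    # written out as fixed-size for simplicity.
    import numpy as np
    from netCDF4 import Dataset

    def split_restart(single_file, nproc, dist_dim="glo"):
        src = Dataset(single_file, "r")
        ntot = len(src.dimensions[dist_dim])
        # Block size for each processor (remainder spread over the first few).
        counts = [ntot // nproc + (1 if p < ntot % nproc else 0) for p in range(nproc)]
        offsets = np.concatenate(([0], np.cumsum(counts)))
        for proc in range(nproc):
            lo, hi = int(offsets[proc]), int(offsets[proc + 1])
            dst = Dataset(f"{single_file}.{proc}", "w")
            # Copy dimensions, shrinking the distributed one to this block.
            for name, dim in src.dimensions.items():
                dst.createDimension(name, hi - lo if name == dist_dim else len(dim))
            # Copy every variable, slicing along the distributed dimension
            # wherever it appears.
            for name, var in src.variables.items():
                out = dst.createVariable(name, var.dtype, var.dimensions)
                if dist_dim in var.dimensions:
                    index = [slice(None)] * var.ndim
                    index[var.dimensions.index(dist_dim)] = slice(lo, hi)
                    out[...] = var[tuple(index)]
                else:
                    out[...] = var[...]
            dst.close()
        src.close()

    if __name__ == "__main__":
        # e.g. turn a hypothetical run.nc into run.nc.0 ... run.nc.63
        split_restart("run.nc", nproc=64)
    ```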

    Because the inputs live in different modules (and there are two different save_many inputs, due to the two diagnostics modules), linking them could be a bit of a pain. Also, because they are booleans, it's not really possible to detect whether they have been set (at least without writing manual code to parse the input file), as we can't give them some special value meaning "unset". I think it probably makes sense to add a namelist to gs2_save specifically for flags to do with the restart files, such as these (and the one added in PR #208).
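    Something along these lines, say; the namelist and flag names here are purely illustrative and nothing like this exists in the code yet:

    ```
    &gs2_save_knobs
      restart_many = .true.   ! a single flag that would replace both save_many and read_many
    /
    ```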

    Of course, another solution that would avoid this entirely is to default to in_memory = .true. (which, for some reason, lives in both init_knobs and reinit_knobs!). It increases memory usage slightly whilst the timestep is changed, but this shouldn't really be an issue (and the code detects when there isn't enough memory and reverts to the file-based method).
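    That is, until the default changes, a user can sidestep the problem by setting the flag in both namelists mentioned above:

    ```
    &init_knobs
      in_memory = .true.
    /

    &reinit_knobs
      in_memory = .true.
    /
    ```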

  2. Ollie Beeke reporter

    Yes, my simulation with read_many=.true. did execute successfully.
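    That is, on a machine with parallel netCDF enabled the run goes through when the two flags agree:

    ```
    &gs2_diagnostics_knobs
      save_many = .true.
    /

    &init_g_knobs
      read_many = .true.
    /
    ```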

    The convoluted use case you mentioned is difficult, though, because if you restart from the single restart file you would have to hope that no timestep changes occur during the subsequent run, else this crash would happen. I suppose you could run GS2 for one step as a glorified program to convert from one to many restart files, or vice versa! An alternative would be, as you say, to have a program that takes some number (one or many) of restart files and converts them into a different number of restart files for use on a different machine, and to do away with parallel NetCDF altogether.

    Re the inputs, I can see how this could be a bit awkward. Maybe just a warning would suffice, so that if someone does run into this error in the future they can more easily understand why.

  3. David Dickinson

    Yes, that’s what I was thinking: just a hack to convert the restart file type. I think the correct solution is probably to have a single flag in a namelist for the restart module, and to provide an additional small program that can be used to convert between the two formats.

    Another option would be to fall back to reading the many-file form if the attempt to read a single file fails.
