Output of zero-step restart does not contain exactly the same data as the output from before the restart

Issue #85 resolved
Stephen Biggs-Fox created an issue

UPDATE 2: This is fixed by PR #255. Test material is attached to the issue here. The default has not been modified - the fix only makes the input file value be respected.

UPDATE: This was solved offline with David. The problem is two-fold: (1) the option force_maxwell_reinit = .true. means that the data in the restart files is modified on load; (2) something in overrides means that even if force_maxwell_reinit = .false. in the input file, then force_maxwell_reinit = .true. is used anyway. The proposed fix is to get overrides to respect what is written in the input file and for force_maxwell_reinit = .false. to be the default.

I’m adding a new feature related to restarts. During testing, I noticed some unexpected behaviour in the existing code before my changes…

If I run a case then restart it and the restart has zero steps, I would expect phi2_by_mode from the output before the restart to be exactly the same as that after the restart since the restart should only load the restart data but not actually do anything, right?

Wrong!

The restart was created using code from commit 334a8fb4 which is about a quarter of the way from v8.0.1 to v8.0.2 (i.e. nearer to 8.0.1).

  • If I restart with that version of the code and do zero steps, phi2_by_mode only agrees within 8 decimal places. I would have expected it to be exactly the same or at least agree to about 15-16 decimal places (double precision).
  • If I restart with the latest code in next at the time of testing (commit 795a1610), the restart data only agrees within 3 decimal places!

I have checked my input file (attached) and, as far as I can see, I don’t have any crazy options that tell GS2 to re-calculate phi on restart.

So, is this a bug or desired behaviour for some reason? And is this a known issue or something new?

NB: To run the attached as described, one has to delete the opt_redist_init line from the input file when restarting with version 795a1610 as this option was removed in one of the commits in between that and 334a8fb4.

P.S. I am currently trying to reproduce this issue using the above input file on 8 nodes of Archer in 20 minutes, i.e. via the debug queue. Previously I was on 25 nodes for 24 hours. The attached input file includes avail_cpu_time = 1200 ! 20 minutes

I will also try to reproduce this for a smaller test case and using the latest code for the bit before the restart and report back. In the meantime, I would be interested to hear if anyone else knows anything about this behaviour.

Comments (12)

  1. David Dickinson

    I guess it should be possible to try reproducing this in a very small case. I’d have expected a zero step restart to give you the same data as you put in (assuming you’re running on the same machine, with the same version of gs2 and libraries etc.).

    One thing in the attached is that you’re using delt_option = “default” but when you restart you probably want “check_restart”. This will give a different timestep used for calculating the response matrix, but as you shouldn’t be doing any timesteps I wouldn’t have thought this should matter for your test.

    Rather than checking phi2 have you tried checking `phi` and `g`? You can presumably check what is in the original restart file and the new restart file that gets written at the end of the zero step run.

  2. Stephen Biggs-Fox reporter

    Yes, I'm going to try with the low resolution cyclone ITG from the tests folder. If that works, I might try even smaller.

    Yes, delt_option is default because this is the input for the initial run. When restarting I change that to check_restart. I also change ginit_option from noise to many.

    Good point about checking phi and g rather than phi2 - I will do that and report back here (probably tomorrow).

  3. Stephen Biggs-Fox reporter

    Using commit 334a8fb4 and the previously attached input file I can reproduce this via the Archer debug queue. The bit before the restart did 5 steps. After restarting for zero steps and comparing the output file and the proc 0 restart file:

    • phi2_by_mode is the same to 10 d.p. (would have expected ~16 d.p.)
    • phi is the same to 9 d.p. (applies to both _r and _i parts - again, would have expected ~16 d.p.)
    • g is the same to 20 d.p., i.e. the same to greater than double precision

    This makes me think something dodgy is happening with the loading of the field - some sort of consistency re-calculation or similar. I have only had a quick look through the code and haven’t spotted anything obvious yet but I will have a more in depth look tomorrow.
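For anyone wanting to repeat the "same to N d.p." comparison above, here is a minimal Python sketch of one way to count agreeing decimal places. The function names are hypothetical and this is not the script used for the numbers quoted; it assumes the values have already been read from the output and restart files into plain Python floats.

```python
import math


def agreeing_decimal_places(a: float, b: float):
    """Rough count of decimal places to which a and b agree.

    Returns math.inf when the two values compare exactly equal.
    """
    diff = abs(a - b)
    if diff == 0.0:
        return math.inf
    # A difference below 10**-n is treated as agreement to n decimal places.
    return max(0, math.floor(-math.log10(diff)))


def worst_agreement(reference, candidate):
    """Smallest number of agreeing decimal places over paired values."""
    return min(agreeing_decimal_places(r, c) for r, c in zip(reference, candidate))


if __name__ == "__main__":
    before = [1.0, 0.5]
    after = [1.0, 0.5 + 3e-9]
    print(worst_agreement(before, after))
```

Note that decimal places measure absolute difference, so for values much smaller or larger than 1 a relative error is usually the more meaningful number.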

  4. David Dickinson

    Yes that was my initial thought as there are some potential flags that could be set that could result in the fields being recalculated, but it seems you have avoided those as far as I can tell.

    You could try doing:

    1. two zero step restarts one after the other – do the two zero step restarts agree with each other exactly?

    2. two zero step restarts from the same initial restart file – do they agree exactly? If not then it suggests some non-deterministic behaviour (uninitialised data/round off error).

    How many dp do you get in the netcdf file? What’s the size of phi2_by_mode, phi and g (i.e. what’s the max relative error in each data type)? Have you tried looking at plots of the relative error – is this relatively uniform or is the error localised in certain regions?
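The max relative error question could be checked along the following lines. This is only a sketch: `max_relative_error` is a hypothetical helper, and it assumes the two data sets (e.g. phi from the original restart file and from the zero-step restart) have already been flattened into equal-length sequences of floats.

```python
def max_relative_error(reference, candidate):
    """Return (largest relative error, index where it occurs).

    The index is -1 when every pair of values is exactly equal.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    worst, worst_index = 0.0, -1
    for i, (ref, new) in enumerate(zip(reference, candidate)):
        if ref == new:
            continue  # exactly equal, contributes no error
        # Fall back to the absolute error where the reference value is zero.
        scale = abs(ref) if ref != 0.0 else 1.0
        err = abs(new - ref) / scale
        if err > worst:
            worst, worst_index = err, i
    return worst, worst_index


if __name__ == "__main__":
    error, index = max_relative_error([1.0, 2.0, 4.0], [1.0, 2.5, 4.0])
    print(error, index)
```

Returning the index as well as the error gives a starting point for seeing whether the discrepancy is localised. A result of `(0.0, -1)` for two zero-step restarts from the same file would indicate they agree exactly.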

  5. Stephen Biggs-Fox reporter

    Good ideas and questions. I will look into those today and report back here. I'm going to try to reproduce the problem with a case small enough to run in a few seconds on one core on my local machine - easier for debugging. FYI - Yesterday I was doing some basic tests of python, NetCDF and Fortran to confirm the correct number of decimal places of accuracy in double precision - these tests show that it is definitely possible for a double to be read by Fortran from NetCDF, written back to another NetCDF by Fortran and then for both NetCDFs to be read into python and both numbers to be exactly equal to double precision. So, there is no fundamental reason why GS2 cannot do this correctly, i.e. GS2 is doing something weird and I'm going to find it!
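The principle behind that round-trip test can be illustrated with a small stand-in. This is not the NetCDF/Fortran test described above: it uses Python's standard `struct` module to write a value as an 8-byte IEEE 754 double and read it back, on the assumption that the file format stores doubles in that binary form without any conversion.

```python
import struct


def round_trip(value: float) -> float:
    """Serialise a float as an 8-byte little-endian double and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


if __name__ == "__main__":
    samples = [0.1, 1.0 / 3.0, 2.0 ** -40, 1.2345678901234567e-8]
    # Every sample should come back exactly equal, not merely close.
    print(all(round_trip(x) == x for x in samples))
```

Because no arithmetic is done on the value, nothing is rounded, which is why any difference after a zero-step restart points to a recalculation rather than to I/O.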

  6. Stephen Biggs-Fox reporter

    OK, I’m homing in on the problem. Haven’t quite found it yet but I’m getting close.

    I have reproduced the problem using a smaller problem viz. the input file attached previously (as input.in) but with the grid sizes from the low resolution cyclone ITG in the GS2 nonlinear tests directory. I have run this locally on one core without MPI using the latest version of the code in next (commit e6ec0879). This reproduces the problem (and runs in about 30 seconds).

    I then created an even smaller problem in single mode (the above is in box mode) using default parameters for almost everything apart from a handful of parameters to turn off unnecessary diagnostics and actually save the restart files. I have edited the original post to attach this as tiny.in. This runs in a fraction of a second on one core of my local machine, again without MPI and using the latest code in next. This does NOT reproduce the problem. Therefore, the problem is to do with one of the input options in input.in. So far I have tested ginit_option, phiinit and use_old_diagnostics - it’s none of them. I will test the remaining differences tomorrow (home time now). I’m sure that if I keep going like this I will find the problem parameter soon enough. Expect a further update tomorrow.

  7. Stephen Biggs-Fox reporter

    Respect force_maxwell_reinit input file value

    This was done to fix a bug where force_maxwell_reinit was always true, even if the input file said false. This had the consequence that the fields were always reset on restart, which meant that a restarted run was not exactly the same as an equivalent run that went straight through without restarting. This also made testing features / bugfixes related to restarts very difficult.

    Fixes #85

    → <<cset c1e688ec9ce4>>

  8. David Dickinson
    • changed status to open

    I've reopened this to reflect the fact that the PR that fixes this is not yet approved and merged. We can mark this as resolved once the fix goes in.
