scheduling MoL_PostStep in Post_Recover_Variables

Issue #1256 closed
Roland Haas created an issue

this is not ideal since some routines in MoL_PostStep will always change results (eg. Con2Prim). We already removed MoL_PostStep from the past timelevels since the prolongation cannot be handled properly.

Originally (MoL rev 137, was introduced to enforce boundary conditions and recompute the ADM variables. Since then the policy has changed and the ADM variables are supposed to be checkpointed (rule of thumb is that anything with more than one timelevels should be checkpointed).

When recovering from a checkpoint the boundary conditions are currently satisfied since CarpetIOHDF5 reads in everything that it can, incl. data in ghost and boundary zones. This method ensures that bit identical data is read in. Unfortunately running MoL_PostStep then changes values.

It seems as if MoL_PostStep is never required so should not be run.

The attached trivial patch adds a parameter to MoL to disable running in Post_Recover_Variables.

Note that this is not yet are request for review but a work item since more testing is required to decide if anything breaks when MoL_PostStep is not run.

Tests done with GRHydro (and changes so that eg. the atmosphere mask is identical in ghost zones [currently it is not valid in ghost zones but also never checked in ghost zones]) gives bit identical results even when changing the number of processors while recovering. This is a very simple test though.

Keyword: MoL

Comments (16)

  1. Erik Schnetter
    • removed comment

    There are two different groups, MoL_PostStep and MoL_PostStepModify. PostStep is supposed to be a pure projection operation that can run at many times, while PostStepModify is supposed to run only during evolution, e.g. to enforce constraints, at well-defined times and as part of the evolution scheme.

    Could it be that Con2Prim should be run in PostStepModify instead?

    While PostStep may not be required for ADMBase, there are other cases where it is required, e.g. in unigrid simulations where ADMBase is not evolved, or in other cases where some variables depend on others.

  2. Roland Haas reporter
    • removed comment

    Con2Prim is an oddball routine that does not really behave nicely unfortunately. Depending on the EOS used it might even modify the conservatives.

    With the MoL scheduling as at is currently, I don't think that Con2Prim can run in PostStepModify. The reason is that Con2Prim needs to run in postregrid where currently poststepmodify does not seem to run. Essentially con2prim needs to run whenever the conservative varialbles (ie. dens, scon, tau, namely GRHydro's evolved variables change). In that sense it is similar to say the adm constraints. The problem is that calling con2prim multiple times changes the result.

    For not calling mol_poststep in post_recover_variables the argument is that we want to restore the state of the simulation bit identical to what it was at checkpoint time. So if we recompute something (admbase variables, constraints, primitives) we must be sure that we will get precisely the same values as we got at checkpoint time.

    Also since we already know that we cannot call MoL_PostStep on the past timelevels (since SYNC does not work), calling it on the current timelevel only is strange. Any constraints that might need to be applied should already have been applied at the time the checkpoint was written.

    To my manner of thinking, checkpoint/restore should not call any additional functions but just restore the simulation to precisely the state it was in when it was checkpointed, so if the constraints/quantities that MoL_PostStep enforces were wrong at checkpointing time, then they should still be wrong at restore time (since there are routines that run between mol_poststep and the point where the simulation is checkpointed and that might modify the variables). Variables should be either checkpointed or if one wants to be clever and recompute them after checkpointing then the routine to recompute should be scheduled explicitly in POST_RECOVER_VARIABLES (based on the argument that the thorn apparently wants to know about checkpointing so can be expected to take special measures).

    Is there a case where we require MoL_PostStep because it is the only way to fill in data (when recovering on a different number processors?)? Or was it put there just to be sure? Note that it was not always scheduled in POST_RECOVER_VARIABLES, this was done in changeset:137/CactusNumerical/MoL.

    Schedule MoL_PostStep also at post_recover_variables.  This ensures
    that boundary conditions, ADM variables etc. are correct after

    and that the comment in the schedule file says

    ##############################################################                                           ###  Schedule the PostStep parts after recovery            ###
    ###  so that symmetries are automatically done correctly.  ###
    ###  We may want to change this later as it could be       ###
    ###  expensive, but it's simplest for the moment.          ###

    Currently we checkpoint the ADMBase variables.

    Currently the recovery routine reads in everything incl. ghost points (and ghost points are checkpointed). Currently it is possible to have space_mask and grhydro::atmosphere_mask differ in ghost zones and interior points (since GRHydro uses it to keep track of some information of values in eg densrhs which is only valid in the interior anyway and so never bothers to synchronize it). Adding the SYNC makes it possible to get bit identical recovery from checkpoints (if one leaves out mol_poststep).

  3. Frank Löffler
    • removed comment

    Replying to [comment:2 rhaas]:

    Depending on the EOS used it might even modify the conservatives.

    It really shouldn't do that. Do you know why it might?

    The problem is that calling con2prim multiple times changes the result.

    Yes - because it overwrites parts of its inputs. Would it be unreasonable to change that? Yes, it might mean more storage, but it might also mean saving a lot of development time and headache.

  4. Roland Haas reporter
    • removed comment

    Frank: resetting to atmosphere modifies the conservatives (even for say a gamma law). For the hot eos all kinds of things happen apparently (heat is injected, we might want to revert to use temperature instead of eps).

    Right now on top of my head, I don't know what would need to be changed. The idea would be to keep around the primitives from the last CCTK_POSTSTEP as initial guesses? Such a feature would be nice, but remember the xkcd comic you send around :-) I suspect that this would not help when con2prim modifies the conservatives of course.

  5. Erik Schnetter
    • removed comment

    Regarding con2prim: Although the name does not indicate it, I think this routine is also the place to modify the conserved variables to ensure they are consistent with a primitive state. Such a conserved-modifying step is required during evolution anyway.

    However, I am then worried about calling con2prim after regridding. Assuming you regrid (but do nothing to the grid structure), wouldn't calling con2prim in postregrid then change the state of the simulation? Assuming that regridding would only update the process decomposition (which is feasible, and is how load balancing would be implemented): wouldn't this modify the state?

    con2prim needs to be a projection. It can modify the conserved state, but calling con2prim twice in a row should be a no-op. Why is this not the case? Would calling prim2con right after con2prim help? If you run con2prim after prim2con, would at least this be a no-op? If not, I don't see how one could reasonably achieve consistency during an evolution unless con2prim is called only in PostStepModify.

    Regarding calling PostStep in PostRecoverVariables: Once this works for you, let's try it. You are right that all non-checkpointed variables should explicitly be set in PostRecoverVariables anyway.

  6. Frank Löffler
    • removed comment

    Con2Prim currently uses the old values of the primitives as input to calculate the new values. Calling it multiple times would potentially change the primitives each time, because a minimum number of iterations is done by default, by default at least 1 iteration. We either need to change this and thus, make a second call to Con2Prim a noop because the solution is already present (but this might be difficult if other routines change this, like atmosphere treatment), or we need to decouple the initial guesses from the output primitives. Given that we would actually like to have the latter for con2prim anyway (save the initial guesses to have separate tries for con2prims) I would like the second approach more.

  7. Wolfgang Kastaun
    • removed comment

    For the thermal branch of Whisky, the situation is the following

    -C2P does not need an initial guess, when called with the same input, the output is always the same. -C2P has a certain accuracy (given by a parameter), since it uses numerical root finding. Because of this, and only because of this, it is not an exact projection operator. It is however close to beeing one. -C2P does not just compute primitives from conserved variables, but also checks for unphysical input and tries to correct it under certain conditions, or sets the result to nan. Disentangling those two functionalities is possible but would be a performance problem. -C2P is also responsible for setting the artificial atmosphere. This could be decoupled in theory. -C2P does not set any additional masks.

    Therefore, our requirement are

    -C2P should only be called when the conserved variables have been updated. -It should not be called twice. -If it is called twice in certain cornercases, this may under no circumstance depend on arbitrary factors like restarts. -Running it twice is not an accuracy problem, but it is bad for debugging because it prevents us from noticing other bugs that slighltly change results during restart. -During checkpointing, the primitives and conserved variables should just be read from checkpoint file, including ghost/boundary points. There is no need to run C2P again. -Regridding of the conserved vars should be regarded like evolution of them, and C2P has to be run to keep things consistent.

    I also don't see why checkpointing and regridding have to trigger the sheduling of the same routines. Both things are different and should be treated seperately. I support the view that checkpointing should just reload stuff from file. If there are variables that can be recomputed exactly, one can do so instead, but this should be an explicit optimization. The standard should be to save and restore.

  8. Erik Schnetter
    • removed comment

    Regridding may occur at "arbitrary" times, and the postregrid bin is executed on the affected levels, and maybe other levels as well. Ensuring that postregrid is not executed everywhere was always only an optimisation, it was not for ensuring that e.g. con2prim would not run too often. And as mentioned above, re-balancing the load would also count as regridding.

    It may thus make sense to have two versions of con2prim; one to be called during evolution only (which may modify the conserved state), and one that does not do so.

  9. Wolfgang Kastaun
    • removed comment

    I disagree with the previous comment. There is no version of con2prim which may not modify the conserved state. This would be a version which may leave primitives and conserved variables in an inconsistent state. The reason is that there are unphysical values of the conserved variables which cannot be represented by primitive variables. Those corner cases appear frequently at NS surfaces. Another example is with realistic EOSs at low temperatures and high densities, where the energy is basically the degeneracy energy. The smallest numerical error can lead to conserved energy densities becoming smaller than in the zero-temperature limit. For the thermal branch, we have a clear policy how to adjust such cases (and when to abort the run). Those adjustments are really a piece of the evolution that has to be taken seriously. Therefore it has to change the conserved variables sometimes. In any case, I do not see any reason not to call C2P on all points after any modification of the conserved variables. We do not need two C2P, but another schedule group for recomputing dependent variables that are not checkpointed to save disk space or memory.

  10. Erik Schnetter
    • removed comment

    I understand the need for C2P to project back to a consistent state during evolution. However, there are other times than evolution where it may be acceptable to (temporarily) remain in an inconsistent state.

    Leaving checkpointing and recovery aside:

    Carpet may call the postregrid bin on some levels, and it is difficult for the application to control at what times and on which levels this is called. If you insist that C2P is always allowed to modify the state, then C2P may only be called during evolution. This includes postrestrict, but excludes postregrid (and excludes calling it after recovery).

  11. Log in to comment