Error in synchronisation after restriction

Create issue
Issue #1499 closed
Ian Hinder created an issue

Carpet does not currently restrict into ghost points, and it does not synchronise after restricting, as synchronisation is expected to be performed by user thorns in MoL_PostStep at the same time as application of boundary conditions, and MoL_PostStep is scheduled by MoL in CCTK_POSTRESTRICT, which occurs after restriction.

Unfortunately, CCTK_POSTRESTRICT is traversed in the order coarse to fine, whereas restriction happens fine to coarse, so the synchronisation applied by the user thorns does not occur in the correct order. This leads to incorrect data on the coarse grid. I noticed this when comparing output in 3D between identical simulations run on different numbers of processes.

The simple fix is to synchronise after restricting on each level; this ensures that the result is correct, but introduces a performance penalty due to the additional sync. One optimisation is possible. Carpet currently does not restrict into ghost zones, leading to the requirement of a synchronisation after restriction. Carpet can be made to restrict into ghost zones, but only if the restriction operator has a single-point stencil (e.g. point-copying used in vertex-centered mesh refinement). If this is the case, the sync after restriction is no longer necessary.

The branch ianhinder/restrictsync implements these changes.

(aside: the CSS styling on git.carpetcode.org has been broken for a while)

  1. Carpet: Add restriction sync test Test data generated on 1 process. Test fails on 2 processes due to lack of synchronisation in Carpet after restriction.
  2. Carpet: Sync after restriction on each level Synchronising in POSTRESTRICT (e.g. in MoL_PostStep) is not sufficient, as there it happens coarse-to-fine, whereas it needs to happen fine-to-coarse, like restriction. This introduces an additional sync of all restricted variables, which will have a performance impact. The coarse grid was, however, incorrect before. test_restrict_sync now passes on 1, 2 and 4 processes. It fails on 8 processes due to an additional blank line in the output which the test system does not tolerate.
  3. Carpet, CarpetLib: Restrict into ghost zones and skip sync after restrict if not using higher order restriction test_restrict_sync still passes on 1 and 2 processes

Notes:

  • We could enable (3) via a parameter since it is an optimisation. However, Erik believes the optimisation is always correct, and all the tests continue to pass, so I am tentatively proposing that no additional parameter is required.
  • Can we make this sort of problem easier to detect? e.g. by poisoning the ghost points which are not set by restriction? When relying on user thorns to do something, can we poison the corresponding points first?

Thanks to Erik for helping to diagnose the problem and suggesting possible fixes. ET test results unchanged on 1 and 2 processes.

Comments on the commits?

Keyword: backport

Comments (22)

  1. Erik Schnetter
    • removed comment

    The code in dh.cc now contains two if statements checking for higher order restriction -- once with a question mark operator (introduced by this patch), and then in an existing if statement overwriting an existing variable. I suggest to rewrite this such that there is only a single if statement with an else branch, so that the variable "needrecv" is set exactly once.

    The parameter use_higher_order_restriction seems undefined in Carpet -- did you forget to commit param.ccl?

    After these changes okay to commit.

    Somewhat related: The current nearby comment

                // NOTE: change in behaviour (affects only outer boundaries I think)!!!!
                // NOTE: b/c of this we need a low-level sync after the restrict
    

    is wrong: We not only need to sync, but also need to apply boundary conditions (including symmetries) after each restriction! The current sync in Restrict.cc is not sufficient for this. Instead, we should call MoL_PostStep from Evolve.cc... But we should probably have a test case for higher order restriction first.

  2. Ian Hinder reporter
    • changed status to resolved
    • removed comment

    use_higher_order_restriction is an existing CarpetLib parameter which is shared by Carpet. I didn't change this - it was already referred to in this code. I just added another reference to it. I wondered about that comment as well.

    Erik has merged my commits and fixed some conflicts with the "apply after release" branch which was also merged. Closing ticket.

  3. Roland Haas
    • removed comment

    We do have test cases for higher order restriction: CarpetExtra/CarpetProlongateTest contains tests test_cc_horest_o[35]. Are those the ones you were thinking of?

  4. Ian Hinder reporter
    • removed comment

    Do those tests have 3 refinement levels? I wasn't able to reproduce the problem with 2. Since we are now syncing when using higher order restriction, whereas before we weren't, ideally the existing tests should now be failing as the data is different (and correct). This suggests the original tests didn't trigger this problem (e.g. only 2 levels) or the output didn't catch the problem region (the finest grid region of the coarse grid, in a 3-level simulation). It would be good to modify the test I just added (test_restrict_sync) to use higher order restriction, and check that it fails before these fixes and passes afterwards. Do you want to do that?

  5. Roland Haas
    • removed comment

    Well the tests did fail after Erik's changes but then the prolongation tests are not doing what normal thorn is supposed to do, ie they are not scheduling MoL_PostStep after the restrict. So data was regennerated (carpet-6-init-355-gb3bc3f5 "update test data").

  6. Erik Schnetter
    • changed status to open
    • removed comment

    Reopening, since restriction may still be wrong for higher order restriction operators.

  7. Ian Hinder reporter
    • removed comment

    I ran all the ET tests on my commits on datura, and they all passed, so it must have been something that changed in the merge, or in the "after release" branch that was also merged in.

  8. Ian Hinder reporter
    • removed comment

    Replying to [comment:6 rhaas]:

    Well the tests did fail after Erik's changes but then the prolongation tests are not doing what normal thorn is supposed to do, ie they are not scheduling MoL_PostStep after the restrict. So data was regennerated (carpet-6-init-355-gb3bc3f5 "update test data").

    I have now gone through several levels of being confused about what the tests did, what you said and why, and what Erik did or didn't do :) In summary:

    • In September, Erik removed the sync after restriction, because thorns are supposed to sync in postrestrict/mol_poststep anyway
    • This caused CarpetProlongateTest to fail, because it deliberately doesn't do this sync
    • Therefore, the test data was regenerated, and these tests started to pass again
    • When Erik added back in the sync after restriction, which we now realise is needed because the sync in postrestrict doesn't happen in the right order, the test started failing again.
    • This test is not part of the Einstein Toolkit; the thorns CarpetProlongateTest and CarpetIntegrateTest are only in the "proposed" branch. We test them as part of the Jenkins EinsteinToolkitProposed job, but not as part of the main EinsteinToolkit job.

    I think it should be possible to revert the test data change commit and have the tests pass again.

  9. Roland Haas
    • removed comment

    I am attaching a thorn with a sample testsuite to demonstrate the problem with higher order restriction.

  10. Roland Haas
    • removed comment

    An interesting variation is to ask for output along z=0.5,y=0 ie:

    IOASCII::out1D_x                        = yes
    IOASCII::out1D_xline_z                  = 0.5
    

    which shows that a 2 processor run violates the symmetry in x->-x in the Gaussian pulse at the points (+/-0.25, 0.25,0 .75)

  11. Erik Schnetter
    • removed comment

    Thanks. I pushed changes to Carpet that correct higher order restriction operators. With these changes, the output described above is symmetric. I also pushed the test case as CarpetExtra/HighOrderWaveTest.

  12. Ian Hinder reporter
    • removed comment

    I have prepared two patches for the ET_2013_11 release of Carpet. The first adds a test case which fails on 2 processes due to the bug. The second reverts the commit which removed the sync, causing the test to pass. I believe that this fixes the regression, unless there was some other commit which causes some other problem. Notes: I have not addressed the application of boundary conditions for high order restriction operators since this was broken in the previous release as well. If people think this is very important and safe, please implement and test the patch, but I don't really have more time to spend on this, and I think an announcement that cell-centering with boundaries near refinement boundaries is not yet working is sufficient. I have not included the change to restrict into ghost zones, or the optimisation which avoids the sync for vertex-centered code. As such, the performance should be the same as the last release. The trunk will be faster, but this is a less trivial change. The commit which removed the sync was from May 2014, not September as we thought in the telecon. This means that anyone using the trunk since then could potentially have been affected by this. All tests pass on Datura with the new code.

    OK to backport to the release?

  13. Frank Löffler
    • removed comment

    I don't see a reason why not to apply it, but I really like some input from Erik. I seem to recall he agreed, but then we can wait for him to reply here as well. The patch is pretty trivial, and only reinserts something that was taken out as optimization, but I didn't look at the test.

  14. Erik Schnetter
    • changed status to open
    • removed comment

    Higher order restriction operators are somewhat new, and I would not address them when backporting. I would keep the backports as simple as possible; adding the additional sync seems to do just that. Ok to backport from me.

  15. Roland Haas
    • changed status to open
    • removed comment

    The release test was violating the CFL condition and blowing up. Results were thus highly compiler dependent and failed on the (gcc based) test system. I attach a patch that change the timestep and passes on both 1 and 2 processes using gcc and intel and on intel and AMD machines. Undoing the sync fix makes the test fail on 2 processes.

  16. Log in to comment