- changed status to open
- removed comment
Error in synchronisation after restriction
Carpet does not currently restrict into ghost points, and it does not synchronise after restricting, as synchronisation is expected to be performed by user thorns in MoL_PostStep at the same time as application of boundary conditions, and MoL_PostStep is scheduled by MoL in CCTK_POSTRESTRICT, which occurs after restriction.
Unfortunately, CCTK_POSTRESTRICT is traversed in the order coarse to fine, whereas restriction happens fine to coarse, so the synchronisation applied by the user thorns does not occur in the correct order. This leads to incorrect data on the coarse grid. I noticed this when comparing output in 3D between identical simulations run on different numbers of processes.
The simple fix is to synchronise after restricting on each level; this ensures that the result is correct, but introduces a performance penalty due to the additional sync. One optimisation is possible. Carpet currently does not restrict into ghost zones, leading to the requirement of a synchronisation after restriction. Carpet can be made to restrict into ghost zones, but only if the restriction operator has a single-point stencil (e.g. point-copying used in vertex-centered mesh refinement). If this is the case, the sync after restriction is no longer necessary.
The branch ianhinder/restrictsync implements these changes.
(aside: the CSS styling on git.carpetcode.org has been broken for a while)
- Carpet: Add restriction sync test Test data generated on 1 process. Test fails on 2 processes due to lack of synchronisation in Carpet after restriction.
- Carpet: Sync after restriction on each level Synchronising in POSTRESTRICT (e.g. in MoL_PostStep) is not sufficient, as there it happens coarse-to-fine, whereas it needs to happen fine-to-coarse, like restriction. This introduces an additional sync of all restricted variables, which will have a performance impact. The coarse grid was, however, incorrect before. test_restrict_sync now passes on 1, 2 and 4 processes. It fails on 8 processes due to an additional blank line in the output which the test system does not tolerate.
- Carpet, CarpetLib: Restrict into ghost zones and skip sync after restrict if not using higher order restriction test_restrict_sync still passes on 1 and 2 processes
Notes:
- We could enable (3) via a parameter since it is an optimisation. However, Erik believes the optimisation is always correct, and all the tests continue to pass, so I am tentatively proposing that no additional parameter is required.
- Can we make this sort of problem easier to detect? e.g. by poisoning the ghost points which are not set by restriction? When relying on user thorns to do something, can we poison the corresponding points first?
Thanks to Erik for helping to diagnose the problem and suggesting possible fixes. ET test results unchanged on 1 and 2 processes.
Comments on the commits?
Keyword: backport
Comments (21)
-
reporter -
- removed comment
The code in dh.cc now contains two if statements checking for higher order restriction -- once with a question mark operator (introduced by this patch), and then in an existing if statement overwriting an existing variable. I suggest to rewrite this such that there is only a single if statement with an else branch, so that the variable "needrecv" is set exactly once.
The parameter use_higher_order_restriction seems undefined in Carpet -- did you forget to commit param.ccl?
After these changes okay to commit.
Somewhat related: The current nearby comment
// NOTE: change in behaviour (affects only outer boundaries I think)!!!! // NOTE: b/c of this we need a low-level sync after the restrict
is wrong: We not only need to sync, but also need to apply boundary conditions (including symmetries) after each restriction! The current sync in Restrict.cc is not sufficient for this. Instead, we should call MoL_PostStep from Evolve.cc... But we should probably have a test case for higher order restriction first.
-
reporter - changed status to resolved
- removed comment
use_higher_order_restriction is an existing CarpetLib parameter which is shared by Carpet. I didn't change this - it was already referred to in this code. I just added another reference to it. I wondered about that comment as well.
Erik has merged my commits and fixed some conflicts with the "apply after release" branch which was also merged. Closing ticket.
-
- removed comment
We do have test cases for higher order restriction: CarpetExtra/CarpetProlongateTest contains tests test_cc_horest_o[35]. Are those the ones you were thinking of?
-
reporter - removed comment
Do those tests have 3 refinement levels? I wasn't able to reproduce the problem with 2. Since we are now syncing when using higher order restriction, whereas before we weren't, ideally the existing tests should now be failing as the data is different (and correct). This suggests the original tests didn't trigger this problem (e.g. only 2 levels) or the output didn't catch the problem region (the finest grid region of the coarse grid, in a 3-level simulation). It would be good to modify the test I just added (test_restrict_sync) to use higher order restriction, and check that it fails before these fixes and passes afterwards. Do you want to do that?
-
- removed comment
Well the tests did fail after Erik's changes but then the prolongation tests are not doing what normal thorn is supposed to do, ie they are not scheduling MoL_PostStep after the restrict. So data was regennerated (carpet-6-init-355-gb3bc3f5 "update test data").
-
- changed status to open
- removed comment
Reopening, since restriction may still be wrong for higher order restriction operators.
-
reporter - removed comment
I ran all the ET tests on my commits on datura, and they all passed, so it must have been something that changed in the merge, or in the "after release" branch that was also merged in.
-
reporter - removed comment
Replying to [comment:6 rhaas]:
Well the tests did fail after Erik's changes but then the prolongation tests are not doing what normal thorn is supposed to do, ie they are not scheduling MoL_PostStep after the restrict. So data was regennerated (carpet-6-init-355-gb3bc3f5 "update test data").
I have now gone through several levels of being confused about what the tests did, what you said and why, and what Erik did or didn't do :) In summary:
- In September, Erik removed the sync after restriction, because thorns are supposed to sync in postrestrict/mol_poststep anyway
- This caused CarpetProlongateTest to fail, because it deliberately doesn't do this sync
- Therefore, the test data was regenerated, and these tests started to pass again
- When Erik added back in the sync after restriction, which we now realise is needed because the sync in postrestrict doesn't happen in the right order, the test started failing again.
- This test is not part of the Einstein Toolkit; the thorns CarpetProlongateTest and CarpetIntegrateTest are only in the "proposed" branch. We test them as part of the Jenkins EinsteinToolkitProposed job, but not as part of the main EinsteinToolkit job.
I think it should be possible to revert the test data change commit and have the tests pass again.
-
- removed comment
I am attaching a thorn with a sample testsuite to demonstrate the problem with higher order restriction.
-
- removed comment
An interesting variation is to ask for output along z=0.5,y=0 ie:
IOASCII::out1D_x = yes IOASCII::out1D_xline_z = 0.5
which shows that a 2 processor run violates the symmetry in x->-x in the Gaussian pulse at the points (+/-0.25, 0.25,0 .75)
-
- removed comment
Thanks. I pushed changes to Carpet that correct higher order restriction operators. With these changes, the output described above is symmetric. I also pushed the test case as CarpetExtra/HighOrderWaveTest.
-
reporter - removed comment
I have prepared two patches for the ET_2013_11 release of Carpet. The first adds a test case which fails on 2 processes due to the bug. The second reverts the commit which removed the sync, causing the test to pass. I believe that this fixes the regression, unless there was some other commit which causes some other problem. Notes: * I have not addressed the application of boundary conditions for high order restriction operators since this was broken in the previous release as well. If people think this is very important and safe, please implement and test the patch, but I don't really have more time to spend on this, and I think an announcement that cell-centering with boundaries near refinement boundaries is not yet working is sufficient. * I have not included the change to restrict into ghost zones, or the optimisation which avoids the sync for vertex-centered code. As such, the performance should be the same as the last release. The trunk will be faster, but this is a less trivial change. * The commit which removed the sync was from May 2014, not September as we thought in the telecon. This means that anyone using the trunk since then could potentially have been affected by this. * All tests pass on Datura with the new code.
OK to backport to the release?
-
reporter - changed status to open
- removed comment
-
- removed comment
I don't see a reason why not to apply it, but I really like some input from Erik. I seem to recall he agreed, but then we can wait for him to reply here as well. The patch is pretty trivial, and only reinserts something that was taken out as optimization, but I didn't look at the test.
-
- changed status to open
- removed comment
Higher order restriction operators are somewhat new, and I would not address them when backporting. I would keep the backports as simple as possible; adding the additional sync seems to do just that. Ok to backport from me.
-
- changed status to open
- removed comment
The release test was violating the CFL condition and blowing up. Results were thus highly compiler dependent and failed on the (gcc based) test system. I attach a patch that change the timestep and passes on both 1 and 2 processes using gcc and intel and on intel and AMD machines. Undoing the sync fix makes the test fail on 2 processes.
-
- changed status to open
- removed comment
-
reporter - changed status to open
- removed comment
Please apply.
-
- changed status to resolved
- removed comment
committed as hash 58cf9af1453b23e92da26830f638d02e26d699e6 "Carpet: reduce timestep in restriction to due to CFL limit" into the release branch. Trunk uses a different test that is not affected by the issue.
-
- removed comment
- Log in to comment