CarpetIOHDF5 recover failure with manual topology
I have been trying to debug why some runs I was performing could not recover from a checkpoint file, but would otherwise proceed as normal.
I have attached a minimal parfile showing the problem. A small grid is manually distributed over 8 processors, and the run terminates at iteration 2. An attempt to recover fails with NaNs in grid::x. If the manual topology section is commented out, no problems are seen.
The issue seems to be that, with manual topology, a region_t structure has its map entry incorrectly set.
What happens is, in bool gh::recompose there is the check
bool const do_recompose = level_did_change(rl);
In level_did_change, the level is considered to have changed because the new region_t is
region_t(extent=([41,0,0]:[80,10,10]:[1,1,1]/[41,0,0]:[80,10,10]/[40,11,11]/4840),outer_boundaries=[[0,1,1],[1,1,1]],map=51,processor=1)
while the old is
region_t(extent=([41,0,0]:[80,10,10]:[1,1,1]/[41,0,0]:[80,10,10]/[40,11,11]/4840),outer_boundaries=[[0,1,1],[1,1,1]],map=0,processor=1)
The only difference is that the new map is 51.
If I add a line to SplitRegions_AsSpecified in Carpet/src/Recompose.cc to force the map entry to be zero, then everything seems to work.
Without the change, Carpet recomposes the grid but never calls the postregrid functions, hence the NaNs in grid::x.
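To illustrate why a stale map value alone is enough to make the level look changed, here is a minimal sketch. Region, its 1D extent, and this version of level_did_change are simplified stand-ins invented for the example, not Carpet's actual types or code. The sketch assumes only that regions are compared field by field, as the two region_t printouts above suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for region_t. A 1D extent [lo, hi]
// replaces the 3D bounding box; the other fields mirror the printout.
struct Region {
  int lo, hi;
  int map;
  int processor;
};

// Field-by-field comparison (an assumption about how regions are compared).
bool operator==(const Region& a, const Region& b) {
  return a.lo == b.lo && a.hi == b.hi && a.map == b.map &&
         a.processor == b.processor;
}

// Hypothetical analogue of level_did_change: the level counts as changed
// as soon as any region differs from its old counterpart.
bool level_did_change(const std::vector<Region>& old_regs,
                      const std::vector<Region>& new_regs) {
  if (old_regs.size() != new_regs.size()) return true;
  for (std::size_t i = 0; i < old_regs.size(); ++i) {
    if (!(old_regs[i] == new_regs[i])) return true;
  }
  return false;
}

// Same extent and processor as in the report; only the map differs.
void example() {
  const std::vector<Region> old_regs = {{41, 80, 0, 1}};
  const std::vector<Region> new_regs = {{41, 80, 51, 1}};
  assert(level_did_change(old_regs, new_regs));   // stale map => "changed"
  assert(!level_did_change(old_regs, old_regs));  // identical => unchanged
}
```

In this sketch the extent and processor agree, so the differing map is the only thing that flips the result.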
Comments (7)
- changed status to open
-
I could reproduce this and fix it (I think). What initially confused me was that the example parfile as provided actually passes, because it does not use the manual topology setting (it is commented out):
#Carpet::processor_topology = "manual"
#Carpet::processor_topology_3d_x = 8
#Carpet::processor_topology_3d_y = 1
#Carpet::processor_topology_3d_z = 1
Setting reg.map = 0 was the correct solution for you. It turns out that the routine did not set reg.map at all, and the constructor of region_t also did not initialize (poison, really) its map member, leading to the use of uninitialized values (often zero).

The pull request https://bitbucket.org/eschnett/carpet/pull-requests/29/intiialize-region_t-structure-members/diff contains code to initialize region_t::map and pseudoregion_t::component to -1, which is an invalid value (and is detected by consistency checks). It also changes the “manual”, “along-z” and “along-dir” splitting routines to initialize reg.map from reg0.map, i.e. the map of the superregion being passed in (which will be zero in your case). This is in line with how other splitting methods (e.g. “automatic”) handle maps, and also with how reg0 is otherwise used in those routines.

All tests in the testsuite (and the test parfile of course) pass with those patches.
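The idea behind the two changes can be sketched as follows. This is not the code from the pull request: PoisonedRegion, split_along_x, and the 1D extent are hypothetical names and simplifications chosen for illustration, and the assert stands in for the consistency checks mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for region_t: members default to -1, an invalid
// value, so a forgotten assignment is caught instead of silently used.
struct PoisonedRegion {
  int lo = 0, hi = 0;  // 1D extent standing in for the 3D bounding box
  int map = -1;        // poison value
  int processor = -1;  // poison value
};

// Hypothetical splitting routine: cut the superregion reg0 into nprocs
// contiguous pieces and copy the map from reg0 into every piece.
std::vector<PoisonedRegion> split_along_x(const PoisonedRegion& reg0,
                                          int nprocs) {
  assert(nprocs > 0);
  assert(reg0.map >= 0);  // stand-in consistency check against the poison
  const int npoints = reg0.hi - reg0.lo + 1;
  const int chunk = (npoints + nprocs - 1) / nprocs;  // ceiling division
  std::vector<PoisonedRegion> regs;
  for (int p = 0; p < nprocs; ++p) {
    PoisonedRegion reg;
    reg.lo = reg0.lo + p * chunk;
    reg.hi = std::min(reg.lo + chunk - 1, reg0.hi);
    reg.map = reg0.map;  // inherit the map from the superregion
    reg.processor = p;
    regs.push_back(reg);
  }
  return regs;
}
```

With a superregion covering [0, 80] on map 0 and two processors, every piece in this sketch ends up with map 0, so a later field-by-field comparison against the checkpointed regions would no longer see a difference in the map.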
-
@Erik Schnetter @Yosef Zlochower please review.
-
Unless objected I will push the bugfix after 2019-12-16.
-
- changed status to resolved
Here is the patch that “fixed” the problem for me.