CarpetIOHDF5 recover failure with manual topology

Issue #2279 resolved
Former user created an issue

I have been trying to debug why some runs I was performing could not recover from a checkpoint file, but would otherwise proceed as normal.

I attached a minimal parfile showing the problem. A small grid is manually distributed over 8 processors, and the run terminates at iteration 2. An attempt to recover fails with NaNs in grid::x. If the manual topology section is commented out, no problems are seen.

The issue seems to be that with manual topology a region_t structure has its map entry set incorrectly.

What happens is that in bool gh::recompose there is the check

bool const do_recompose = level_did_change(rl);

In level_did_change, the level is considered to have changed because the new region_t is

region_t(extent=([41,0,0]:[80,10,10]:[1,1,1]/[41,0,0]:[80,10,10]/[40,11,11]/4840),outer_boundaries=[[0,1,1],[1,1,1]],map=51,processor=1)

while the old is

region_t(extent=([41,0,0]:[80,10,10]:[1,1,1]/[41,0,0]:[80,10,10]/[40,11,11]/4840),outer_boundaries=[[0,1,1],[1,1,1]],map=0,processor=1)

The only difference is that the new map is 51.

If I add a line to SplitRegions_AsSpecified in Carpet/src/Recompose.cc to force the map entry to zero, then everything seems to work.

Without the change, Carpet recomposes the grid but never calls the postregrid functions, hence the NaNs in grid::x.
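The comparison problem can be sketched in isolation. This is a minimal C++ illustration, not Carpet's actual code: region_sketch and level_did_change_sketch are hypothetical stand-ins, the extent is reduced to one dimension, and it assumes that the real comparison treats two regions as equal only when every member, including map, matches.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for Carpet's region_t. The real
// structure also carries a 3D extent and the outer-boundary flags.
struct region_sketch {
  int lower;      // 1D stand-in for the lower bound of the extent
  int upper;      // 1D stand-in for the upper bound of the extent
  int map;        // map index; 0 for a single-map run
  int processor;  // owning process
};

// Assumed semantics: regions are equal only if all members agree.
inline bool operator==(const region_sketch &a, const region_sketch &b) {
  return a.lower == b.lower && a.upper == b.upper && a.map == b.map &&
         a.processor == b.processor;
}

// Hypothetical analogue of level_did_change: the level counts as changed
// as soon as the old and new region lists differ in any member.
inline bool level_did_change_sketch(const std::vector<region_sketch> &olds,
                                    const std::vector<region_sketch> &news) {
  return !(olds == news);
}
```

With regions that differ only in map (0 versus 51, as in the output quoted above), level_did_change_sketch reports a change; once the map is forced to 0, the two lists compare equal.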

Comments (7)

  1. Yosef Zlochower

    Here is the patch that “fixed” the problem for me.

     diff --git a/Carpet/src/Recompose.cc b/Carpet/src/Recompose.cc
    index 0fb0b72..7083813 100644
    --- a/Carpet/src/Recompose.cc
    +++ b/Carpet/src/Recompose.cc
    @@ -1170,6 +1170,7 @@ static void SplitRegions_AsSpecified(cGH const *const cctkGH,
             obnd[0] &= clb == rlb0;
             obnd[1] &= cub == rub0;
             proc = c;
    +        reg.map = 0;
    
             pseudoregion_t preg(reg.extent, c);
             subtreesx.AT(i) = new ipfulltree(preg);
    

  2. Roland Haas

    I could reproduce this and fix it (I think). What initially confused me was that the example parfile as provided actually passed, because it does not use the manual topology setting (it is commented out):

    #Carpet::processor_topology       = "manual"
    #Carpet::processor_topology_3d_x  = 8
    #Carpet::processor_topology_3d_y  = 1
    #Carpet::processor_topology_3d_z  = 1
    

    Setting reg.map = 0 was the correct solution for you. It turns out that the routine did not set reg.map at all, and the constructor of region_t also did not initialize (poison, really) its map member, leading to the use of uninitialized values (often zero).

    The pull request:
    https://bitbucket.org/eschnett/carpet/pull-requests/29/intiialize-region_t-structure-members/diff

    contains code to initialize region_t::map and pseudoregion_t::component to -1, which is an invalid value (and is detected by consistency checks). It also changes the “manual”, “along-z” and “along-dir” splitting routines to initialize reg.map from reg0.map, i.e. the map of the superregion being passed in (which will be zero in your case). This is in line with how other splitting methods (e.g. “automatic”) handle maps, and with how reg0 is otherwise used in those routines.

    All tests in the testsuite (and the test parfile of course) pass with those patches.
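The two ideas described in the pull request (poisoning with an invalid value, and inheriting the map from the superregion) can be sketched as follows. This is only an illustration under stated assumptions, not the code from the pull request: region_sketch, invariant and make_child_sketch are hypothetical names, and giving processor the same -1 default is an assumption made here for symmetry.

```cpp
#include <cassert>

// Hypothetical simplified region type; not Carpet's actual region_t.
struct region_sketch {
  int map = -1;        // invalid sentinel, so a forgotten assignment is detectable
  int processor = -1;  // assumption: same sentinel used here for symmetry

  // Hypothetical consistency check: a usable region needs a valid map.
  bool invariant() const { return map >= 0; }
};

// Hypothetical splitting helper: the child region takes its map from the
// superregion reg0 instead of leaving the member unset.
inline region_sketch make_child_sketch(const region_sketch &reg0, int proc) {
  region_sketch reg;   // map starts out as the -1 sentinel
  reg.map = reg0.map;  // inherit the map of the superregion
  reg.processor = proc;
  return reg;
}
```

A default-constructed region_sketch fails invariant(), whereas a child built from a superregion with map 0 passes it and carries map 0.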
