N13CarpetRegrid211snap_coarseE does not hold after being enforced in cell centering

Create issue
Issue #797 closed
Roland Haas created an issue

I have a run (on Caltech's shc and on lonestar) which runs fine for about 300M or so and then fails with a property enforcement error inside of CarpetRegrid.

Attached please find stdout with Carpet::veryverbose=yes, CarpetRegrid2::veryverbose=yes as well as the parameter file.

It takes about 1hr to reproduce this starting from the checkpoints on shc. Most of this time is IO (the checkpoint is iteration 133120 failure occurs at preregrid in iteration 133185).

Turning off cell centering makes the error go away. Turning on refluxing makes the run fails (much) earlier with a boxes not contained error (re-running this on shc right now, the XSEDE clusters are just too slow).

In case you have access to shc (or zwicky) and would look at the actual output files, the path is: /panfs/ds06/sxs/rhaas/cactus/simulations/noreflux/output-0003

Keyword:

Comments (22)

  1. Roland Haas reporter
    • removed comment

    Some carefully place printfs in property.cc show that for the regriddin in question CarpetRegrid2 has put a small (2 cells high) strip of cells between two boxes, ie. the two boxes for the NS are touching in one corner:

    +---------+

    1
    ___

    +-----+ |

    2
    __+----+
    3

    +----------+

    Region "2" is the small one. In numbers from the code:

    ([22112,28896,23904]:[35744,28960,33952]:[64,64,64]/[345,451,373]:[558,452,530]/[214,2,158]/67624)

    Snaptocoarse uses a contracted_for(coarse_base).expand_for(fine_base).expand(buffer_with % reffact) scheme to align the active (non-buffer) regions. Outputting the strip, then after contraction and then after expand_for and finally the expand (by 1 as it turns out) I get:

    aa: ([22112,28896,23904]:[35744,28960,33952]:[64,64,64]/[345,451,373]:[558,452,530]/[214,2,158]/67624) aa.expand(): ([22112,28896,23904]:[35744,28960,33952]:[64,64,64]/[345,451,373]:[558,452,530]/[214,2,158]/67624) aa.expand().contracted_for(): ([22208,28992,24000]:[35648,28864,33856]:[128,128,128]/[173,226,187]:[278,225,264]/[106,0,78]/0) aa.expand().contracted_for().expanded_for(): ([2464,2464,2464]:[2400,2400,2400]:[64,64,64]/[38,38,38]:[37,37,37]/[0,0,0]/0) aa.expand().contracted_for().expanded_for().expand(): ([2400,2400,2400]:[2464,2464,2464]:[64,64,64]/[37,37,37]:[38,38,38]/[2,2,2]/8)

    Notice that after expanded_for() the box is (a) empty and (b) has suddenly moved (to what is the lower corner of baseextent(0,rl) for this level [2464]). The final expand then turns it into a non-empty box in the wrong spot. This is what eventually triggers the property-does-not-hold assert (the test checks a difference between snapped and current grid to be empty when &'ed with baseextent).

    I'll try and understand why expanded_for() moves the box and if the logic for alignment should also work with very small strips.

  2. Roland Haas reporter
    • changed status to open
    • removed comment

    The issue seems to be an explicit check in expand_for for an empty() *this. Since all the other manipulation functions also seem to contain at least BBBOX_ASSERTs on emptyness in at least some direction, simply removing the test seems like a bad idea. Right I am testing (successfully so far!) the following code, which is just the code pieces from the various bbbox member functions expanded out.

    1. if(0) this fails for very thin strips where cotracted_for creates a temporary empty box snapped |= bb. expand(reffact-1 - buffers % reffact). contracted_for(cbase). expanded_for(base). expand(buffers % reffact);
    2. else same math as above but avoid in particular expanded_for's set-to-empty ivect lower = bb.lower(); ivect upper = bb.upper(); ivect stride = bb.stride(); cout << "aa: " << ibbox(lower,upper,stride) << endl; expand(reffact-1 - buffers % reffact) { lower -= (reffact-1 - buffers[0] % reffact) * stride; upper += (reffact-1 - buffers[1] % reffact) * stride; } cout << "aa.expand(): " << ibbox(lower,upper,stride) << endl; contracted_for(cbase) { stride = cbase.stride(); ivect loff = ((lower - cbase.lower()) % stride + stride) % stride; ivect uoff = ((upper - cbase.lower()) % stride + stride) % stride; lower += (stride - loff) % stride; go inwards upper -= uoff; } cout << "aa.expand().contracted_for(): " << ibbox(lower,upper,stride) << endl; expanded_for(base) { stride = base.stride(); ivect loff = ((lower - base.lower()) % stride + stride) % stride; ivect uoff = ((upper - base.lower()) % stride + stride) % stride; lower -= loff; go outwards upper += (stride - uoff) % stride; } cout << "aa.expand().contracted_for().expanded_for(): " << ibbox(lower,upper,stride) << endl; expand(buffers % reffact); { lower -= (buffers[0] % reffact) * stride; upper += (buffers[1] % reffact) * stride; } cout << "aa.expand().contracted_for().expanded_for().expand(): " << ibbox(lower,upper,stride) << endl; snapped |= ibbox(lower,upper,stride);
    3. endif

    Erik: I don't quite trust my understanding of the "reffact-1 - buffers % reffact" of the algorithm. Do you have an idea if it would be possible to rewrite the order of operations so that it never generates an empty box?

    Otherwise the algorithm seems fine, ie. even for a very tiny strip it produces eg. a left edge that is a the same location as that of a thick strip with the same left edge initially.

  3. Erik Schnetter
    • removed comment

    Thanks for debugging this so far.

    Instead of performing these calculations manually, there may be an easier way. You could enlarge the bbox initially by one coarse grid spacing in each direction, and shrink it by the same amount in the end. This avoids empty bboxes in intermediate steps.

    I believe increasing the first expand amount by reffact and reducing the last expand amount by reffact should do this.

    What do you mean by "creates a temporary"? Do you mean "returns an empty bbox"?

    You could also add a check that bb is not empty before the calculation, and that its result is also not empty.

  4. Roland Haas reporter
    • removed comment

    Yes, I meant "returns an empty box". It's actually in the posted text but since the code snippet is not word-wrapped one has to scroll to the right to see it.

    I'll give the suggestion of adding/subtracting on reffact a try. There seemed to be sufficiently many cases (vertex/cell centered, reffact even/odd, reffact > 2, location of the edge) to make asking for some help a good idea. I'll also add a test for empty at the very beginning in the current calculation and compare both variants (code both and compare the resulting boxes).

    Turns out this would have been so much easier to find had I turned on CARPET_DEBUG :-) We live and learn it seems.

    I think I'll also pick up my footer of asking for this trac spam issue to be fixed again (would be nice to fix this before the next release) :-)

  5. Erik Schnetter
    • removed comment

    You want to check for empty bboxes at both the beginning and at the end. You will need to introduce a temporary for this.

  6. anonymous
    • removed comment

    I'd have simply skipped the whole chunk of code if the input box was empty. (testing the Trac captchas, so not logged in)

  7. anonymous
    • removed comment

    No, this does not suffice.

    (1) The input bbox will never be empty (hence the first assert). (2) Checking that the result is not empty will ensure that this bug does not re-appear.

  8. Roland Haas reporter
    • assigned issue to
    • removed comment

    I have a patch for this that I am currently testing to see that it does not change the constructed boxes. Essentially it follows Erik's suggestion of expanding and contracting the bbox by a coarse cell size.

  9. Erik Schnetter
    • removed comment

    I now think that generating empty bboxes is fine. They are generated if the domain contains "spurious" buffer zones, i.e. buffer zones that are not needed. For example, if you have 9 buffer zones and create a 5^3 refined region far away from anything else, then this 5^3 region does not contribute anything -- it has no interior, it consists only of buffer zones that will be set from the next coarser grid, nothing from there will be restricted back to the coarse grid.

    I am also working on a patch...

  10. Roland Haas reporter
    • removed comment

    With the patch applied as is, it aborts with an assert():

    arrangements/Carpet/CarpetRegrid2/src/property.cc:286: ibset CarpetRegrid2::snap_coarse::snapped_regions(const gh&, const dh&, const CarpetRegrid2::level_boundary&, const std::vector<bboxset<int, 3> >&, int): Assertion `original_region <= base' failed if I then comment out the assert(), it no longer fails but the grid structure created is incorrect (only a single box on refinement level 1 when there should be two).

  11. Erik Schnetter
    • removed comment

    Can you try the attached, updated patch? reflux2.par now passes, i.e. complains about an EOS problem instead of aborting with an assertion failure.

  12. Roland Haas reporter
    • removed comment

    I tired reflux2.par which does indeed pass. I'll give it a try with a large simulation which will hopefully start during the week but otherwise the patch seems good to go to me.

  13. Roland Haas reporter
    • removed comment

    Sigh. A full simulation found an assert:

    /work/00945/rhaas/Zelmani/arrangements/Carpet/CarpetRegrid2/src/property.cc:303: bboxset<int, 3> CarpetRegrid2::snap_coarse::snapped_regions(const gh &, const dh &, const CarpetRegrid2::level_boundary &, const std::vector<bboxset<int, 3>, std::allocator<bboxset<int, 3>>> &, int): Assertion `snapped_region >= original_region' failed during the first actual regrid of the simulation.

  14. Roland Haas reporter
    • removed comment

    I attach my currently in use code in 0001-CarpetRegrid2-do-not-create-temporary-empty-boxes-in.patch (well amost, the real code in use is in real-code.patch but you do not want to commit that to carpet).

  15. Roland Haas reporter
    • removed comment

    attached parameter file and stdout, stderr of run that failed. Most likely will not run for anyone else though.

  16. Log in to comment