inconsistent boxes during refluxing very late in the run

Issue #800 closed
Roland Haas created an issue

This is the second half of #797. It uses the same parameter files, but with the Refluxing thorn. Eventually I get:

srcbbox: ([25248,27424,26272]:[28704,30432,29088]:[64,64,64]/[394,428,410]:[448,475,454]/[55,48,45]/118800)
dstbbox: ([28736,27840,26688]:[28736,28096,28992]:[128,128,128]/[224,217,208]:[224,219,226]/[1,3,19]/57)
regbbox: ([28736,27840,26688]:[28736,28096,28992]:[128,128,128]/[224,217,208]:[224,219,226]/[1,3,19]/57)
srcbbox: ([25248,27424,31328]:[28704,30432,34144]:[64,WARNING level 0 in thorn CarpetLib processor 80 host shc054
  (line 174 of /home/rhaas/cactus/Zelmani/arrangements/Carpet/CarpetLib/src/restrict_3d_vc_rf2.cc): 
  -> Internal error: region extent is not contained in array extent
srcbbox: ([25248,27424,23712]:[28704,30432,26592]:[64,64,64]/[394,428,370]:[448,475,415]/[55,48,46]/121440)
dstbbox: ([28736,27840,24512]:[28736,28096,25792]:[128,128,128]/[224,217,191]:[224,219,201]/[1,3,11]/33)
regbbox: ([28736,27840,24512]:[28736,28096,25792]:[128,128,128]/[224,217,191]:[224,219,201]/[1,3,11]/33)
64,64]/[394,428,489]:[448,475,533]/[55,48,45]/118800)
dstbbox: ([28736,27840,31680]:[28736,28096,33344]:[128,128,128]/[224,217,247]:[224,219,260]/[1,3,14]/42)
regbbox: ([28736,27840,31680]:[28736,28096,33344]:[128,128,128]/[224,217,247]:[224,219,260]/[1,3,14]/42)
WARNING level 0 in thorn CarpetLib processor 82 host shc053
  (line 174 of /home/rhaas/cactus/Zelmani/arrangements/Carpet/CarpetLib/src/restrict_3d_vc_rf2.cc): 
  -> Internal error: region extent is not contained in array extent
WARNING level 0 in thorn CarpetLib processor 79 host shc055
  (line 174 of /home/rhaas/cactus/Zelmani/arrangements/Carpet/CarpetLib/src/restrict_3d_vc_rf2.cc): 
  -> Internal error: region extent is not contained in array extent
srcbbox: ([25248,27424,28768]:[28704,30432,31648]:[64,64,64]/[394,428,449]:[448,475,494]/[55,48,46]/121440)
dstbbox: ([28736,27840,29120]:[28736,28096,29888]:[128,128,128]/[224,217,227]:[224,219,233]/[1,3,7]/21)
regbbox: ([28736,27840,29120]:[28736,28096,29888]:[128,128,128]/[224,217,227]:[224,219,233]/[1,3,7]/21)
WARNING level 0 in thorn CarpetLib processor 81 host shc054
  (line 174 of /home/rhaas/cactus/Zelmani/arrangements/Carpet/CarpetLib/src/restrict_3d_vc_rf2.cc): 
  -> Internal error: region extent is not contained in array extent

which happens during refluxing

INFO (Carpet): [ml=0][rl=2][tl=0] Evolution/PostRestrict at iteration 83712 time 196.2
INFO (Carpet): [ml=0][rl=2][tl=0] Scheduling CCTK_POSTRESTRICT
INFO (Carpet): [ml=0][rl=2][tl=0] Level mode call at CCTK_POSTRESTRICT to Refluxing::Refluxing_CorrectState
INFO (Refluxing): Refluxing at iteration 83712 on level 2 of 4
INFO (Refluxing): Refluxing on level 2:
INFO (Carpet): [ml=0][rl=2][tl=0] Entering singlemap mode
INFO (Carpet): [ml=0][rl=2][m=0][tl=0] Entering local mode
INFO (Carpet): [ml=0][rl=2][m=0][c=0,lc=0][tl=0] Leaving local mode
INFO (Carpet): [ml=0][rl=2][m=0][tl=0] Leaving singlemap mode
INFO (Carpet): [ml=0][rl=2][tl=0] Leaving level mode
INFO (Carpet): [ml=0][tl=0] Entering level mode
INFO (Carpet): [ml=0][rl=3][tl=0] Entering singlemap mode
INFO (Carpet): [ml=0][rl=3][m=0][tl=0] Entering local mode
INFO (Carpet): [ml=0][rl=3][m=0][c=0,lc=0][tl=0] Leaving local mode
INFO (Carpet): [ml=0][rl=3][m=0][tl=0] Leaving singlemap mode
INFO (Carpet): [ml=0][rl=3][tl=0] Leaving level mode
INFO (Carpet): [ml=0][tl=0] Entering level mode
INFO (CarpetLib): About to MPI_Isend to processor 1 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 2 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 3 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 7 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 8 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 9 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Waitall
INFO (CarpetLib): Finished MPI_Waitall
INFO (CarpetLib): About to MPI_Waitall
INFO (CarpetLib): Finished MPI_Waitall
INFO (Carpet): [ml=0][rl=2][tl=0] Entering singlemap mode
INFO (Carpet): [ml=0][rl=2][m=0][tl=0] Entering local mode
INFO (Refluxing): Refluxing on level 2 map 0 component 0 direction 0 face 0: [25,25,25]:[25,31,35]
INFO (Refluxing): Refluxing on level 2 map 0 component 0 direction 1 face 0: [25,25,25]:[33,25,35]
INFO (Refluxing): Refluxing on level 2 map 0 component 0 direction 2 face 0: [25,25,25]:[33,31,25]
INFO (Carpet): [ml=0][rl=2][m=0][c=0,lc=0][tl=0] Leaving local mode
INFO (Carpet): [ml=0][rl=2][m=0][tl=0] Leaving singlemap mode
INFO (Carpet): [ml=0][rl=2][tl=0] Leaving level mode
INFO (Carpet): [ml=0][tl=0] Entering level mode
INFO (Carpet): [ml=0][rl=3][tl=0] Entering singlemap mode
INFO (Carpet): [ml=0][rl=3][m=0][tl=0] Entering local mode
INFO (Carpet): [ml=0][rl=3][m=0][c=0,lc=0][tl=0] Leaving local mode
INFO (Carpet): [ml=0][rl=3][m=0][tl=0] Leaving singlemap mode
INFO (Carpet): [ml=0][rl=3][tl=0] Leaving level mode
INFO (Carpet): [ml=0][tl=0] Entering level mode
INFO (Carpet): [ml=0][rl=2][tl=0] Entering singlemap mode
INFO (Carpet): [ml=0][rl=2][m=0][tl=0] Entering local mode
INFO (Carpet): [ml=0][rl=2][m=0][c=0,lc=0][tl=0] Leaving local mode
INFO (Carpet): [ml=0][rl=2][m=0][tl=0] Leaving singlemap mode
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "GRHYDRO::DENS" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "GRHYDRO::SCON" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "GRHYDRO::TAU" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] ProlongateGroups
INFO (CarpetLib): About to MPI_Irecv from processor 1 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 6 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 36 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 39 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 48 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Waitall
INFO (CarpetLib): Finished MPI_Waitall
INFO (CarpetLib): About to MPI_Waitall
INFO (CarpetLib): Finished MPI_Waitall
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroups
INFO (CarpetLib): About to MPI_Irecv from processor 1 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 2 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 3 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 7 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 8 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 9 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 10 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Isend to processor 1 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 2 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 3 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 7 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 8 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 9 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Isend to processor 10 for type double
INFO (CarpetLib): Finished MPI_Isend
INFO (CarpetLib): About to MPI_Waitall
INFO (CarpetLib): Finished MPI_Waitall
INFO (CarpetLib): About to MPI_Waitall
INFO (CarpetLib): Finished MPI_Waitall
INFO (Carpet): [ml=0][rl=2][tl=0] Level mode call at MoL_PostStep to ADMBase::ADMBase_Boundaries
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "ADMBASE::LAPSE" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "ADMBASE::DTLAPSE" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "ADMBASE::SHIFT" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "ADMBASE::DTSHIFT" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "ADMBASE::METRIC" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] SyncGroup "ADMBASE::CURV" iteration=83712 time=196.2
INFO (Carpet): [ml=0][rl=2][tl=0] ProlongateGroups
INFO (CarpetLib): About to MPI_Irecv from processor 1 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 6 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 36 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 39 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Irecv from processor 48 for type double
INFO (CarpetLib): Finished MPI_Irecv
INFO (CarpetLib): About to MPI_Waitall

The full stdout and stderr files are about 250MB for this (because of all the verbosity).

I agree that this is not a very helpful error report. Unfortunately I don't know how to distill more useful information out of it :-(

Keyword: Refluxing

Comments (7)

  1. Roland Haas reporter
    • removed comment

    I did some initial debugging on this (hampered by my unfamiliarity with CarpetLib's internals). So far the only extra piece of information is that the failing function is a `CarpetLib::restrict_3d_vc_rf2<double, 0, 1, 1>`, called (with some intermediaries) from reflux_all, where both source and destination boxes have (integer) coordinates that look cell centred (though that might be ok).

  2. Roland Haas reporter
    • removed comment

    I did some debugging on this and have come up with a set of patches to actually evolve past the point of failure. Since the error happens in the mixed vertex-cell-centered restriction routine used during refluxing, I changed the containedness test that triggers the assert to shift the boxes by half a stride in the directions in which the boxes are vertex centered. I.e. it undoes the shift that happened when the sendrecv boxes are set up in dh.cc's regrid function (in the refluxing section). What sounds fishy about this to me is that it seems to indicate that the dstbox was at the very edge of the source box, so I am not sure if refluxing is actually happening at the correct location. Looking at the boxes that are distributed between the processors I have that the error was triggered by

    srcbbox: ([23904,27552,28768]:[29088,30304,34144]:[64,64,64]/[373,430,449]:[454,473,533]/[82,44,85]/306680)
    dstbbox: ([29120,29632,29120]:[29120,29888,32576]:[128,128,128]/[227,231,227]:[227,233,254]/[1,3,28]/84)
    regbbox: ([29120,29632,29120]:[29120,29888,32576]:[128,128,128]/[227,231,227]:[227,233,254]/[1,3,28]/84)

    in component 27 (processor 27), which is the source in the following pseudoregion:

    fast_ref_refl_sendrecv_0_0: [(send:(ext:([29088,29600,29088]:[29088,29920,32608]:[64,64,64]/[454,462,454]:[454,467,509]/[1,6,56]/336),c:27),recv:(ext:([29120,29632,29120]:[29120,29888,32576]:[128,128,128]/[227,231,227]:[227,233,254]/[1,3,28]/84),c:24)),...

    However, the sendrecv actually is at the edge of two components, namely components 27 and 29:

    ml=0 rl=3 c=27 dh::light_dboxes:{ exterior: ([23904,27552,28768]:[29088,30304,34144]:[64,64,64]/[373,430,449]:[454,473,533]/[82,44,85]/306680) owned: ([24096,27744,28960]:[28896,30112,33952]:[64,64,64]/[376,433,452]:[451,470,530]/[76,38,79]/228152) interior: ([24096,27744,28960]:[28896,30112,33952]:[64,64,64]/[376,433,452]:[451,470,530]/[76,38,79]/228152) active_size: 153020 }

    ml=0 rl=3 c=29 dh::light_dboxes:{ exterior: ([28768,27552,28768]:[34016,30304,34144]:[64,64,64]/[449,430,449]:[531,473,533]/[83,44,85]/310420) owned: ([28960,27744,28960]:[33824,30112,33952]:[64,64,64]/[452,433,452]:[528,470,530]/[77,38,79]/231154) interior: ([28960,27744,28960]:[33824,30112,33952]:[64,64,64]/[452,433,452]:[528,470,530]/[77,38,79]/231154) active_size: 155050 }

    So it would seem that this is not actually a refinement boundary but just an inter-processor boundary, and that refluxing should not happen here at all (the interface coordinate is 29056). On the other hand, the logic in dh.cc that computes the interfaces seems sound (modulo the issue that I would have expected more instances where face==1 interfaces are shifted right and face==0 interfaces are shifted left, but that does not actually change any of the sendrecv created; I assume that the ghost zones help).

    I attach my patches that remove the warning but that might well not fix that underlying problem if my reading of the box layout is correct.
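As a cross-check of the box layout discussed in this comment, here is a minimal Python sketch (not CarpetLib code) that classifies the send box against components 27 and 29. The coordinates are copied from the fast_ref_refl_sendrecv_0_0 and dh::light_dboxes output above. The helper names `contains` and `classify` are hypothetical, strides are ignored, and "ghost" is used loosely for "inside exterior but not fully inside owned".

```python
# Boxes are (lower, upper) pairs of integer coordinate triples,
# copied from the log output quoted in this comment.

def contains(outer, inner):
    """True if inner lies within outer in every direction (strides ignored)."""
    (olo, ohi), (ilo, ihi) = outer, inner
    return all(ol <= il and ih <= oh
               for ol, oh, il, ih in zip(olo, ohi, ilo, ihi))

# send extent of the pseudoregion (source component is c:27)
send_ext = ((29088, 29600, 29088), (29088, 29920, 32608))

components = {
    27: {"exterior": ((23904, 27552, 28768), (29088, 30304, 34144)),
         "owned":    ((24096, 27744, 28960), (28896, 30112, 33952))},
    29: {"exterior": ((28768, 27552, 28768), (34016, 30304, 34144)),
         "owned":    ((28960, 27744, 28960), (33824, 30112, 33952))},
}

def classify(box, comp):
    """Simplified classification: 'owned', 'ghost' (exterior only) or 'outside'."""
    if contains(comp["owned"], box):
        return "owned"
    if contains(comp["exterior"], box):
        return "ghost"
    return "outside"

status = {c: classify(send_ext, boxes) for c, boxes in components.items()}
print(status)  # {27: 'ghost', 29: 'owned'}
```

Under these simplifications the send box sits on the outermost x layer of component 27's exterior while lying entirely inside the owned region of component 29, which matches the observation that the sendrecv is at the edge of the two components.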

  3. Roland Haas reporter
    • changed status to open
    • removed comment

    I finally managed to produce a parameter file that triggers the same error during initial data. Run with:

    OMP_NUM_THREADS=1 mpirun -n 8 cactus_sim --redirect=oe reflux.par

    The failure should happen in process 5 with

    srcbbox: ([37,85,5]:[95,111,31]:[2,2,2]/[18,42,2]:[47,55,15]/[30,14,14]/5880)
    dstbbox: ([96,96,12]:[96,100,24]:[4,4,4]/[24,24,3]:[24,25,6]/[1,2,4]/8)
    regbbox: ([96,96,12]:[96,100,24]:[4,4,4]/[24,24,3]:[24,25,6]/[1,2,4]/8)
    WARNING level 0 in thorn CarpetLib processor 5 host horizon.tapir.caltech.edu
      (line 183 of /mnt/data/rhaas/postdoc/gr/Zelmani/arrangements/Carpet/CarpetLib/src/restrict_3d_vc_rf2.cc):
      -> Internal error: region extent is not contained in array extent

    and indeed it is using a point at the very edge of a box for refluxing. Refluxing seems to be called for at this point (96,96,z), so this is fine, and even using the very outer point is fine since no interpolation needs to be done in the x direction. So the patches above would actually seem to fix the problem rather than just hide it. It might still be safer to use the points from the neighbouring component, which are non-ghost points, rather than the ghost points currently used.
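The effect of the half-stride shift on the containedness test can be illustrated with the numbers from the warning above. This is a rough Python sketch, not the actual patch: the function `within` is hypothetical, it only compares lower and upper bounds (no stride or alignment checks), and the sign of the shift is an assumption chosen because the destination face sits just above the upper x bound of the source box.

```python
def within(src, dst, shift=(0, 0, 0)):
    """Bounds-only test: does dst, moved by shift, fit inside src?

    src and dst are (lower, upper) pairs of integer coordinate triples.
    """
    (slo, shi), (dlo, dhi) = src, dst
    return all(sl <= dl + s and dh + s <= sh
               for sl, sh, dl, dh, s in zip(slo, shi, dlo, dhi, shift))

# Small reproducer: source stride 2, destination stride 4
src_small = ((37, 85, 5), (95, 111, 31))
dst_small = ((96, 96, 12), (96, 100, 24))
half_small = 2 // 2   # half of the source (fine) stride

# Production run (first srcbbox/dstbbox pair of the report): source stride 64
src_big = ((25248, 27424, 26272), (28704, 30432, 29088))
dst_big = ((28736, 27840, 26688), (28736, 28096, 28992))
half_big = 64 // 2

for src, dst, half in [(src_small, dst_small, half_small),
                       (src_big, dst_big, half_big)]:
    unshifted = within(src, dst)
    # shift only in x, the direction that is vertex centred here
    shifted = within(src, dst, shift=(-half, 0, 0))
    print(unshifted, shifted)  # False True (for both cases)
```

In both cases the destination box lies exactly half a fine stride beyond the upper x bound of the source box, so the unshifted test fails and the shifted one passes.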

    The grid structure visualizes relatively nicely by using the output of

    for((rl=0;$rl<2;rl++)) ; do
      for((c=0;$c<8;c++)) ; do
        if [ $rl -eq 0 ] && [ $c -eq 0 ] ; then
          printf 'plot '
        else
          printf ','
        fi
        printf '"<gawk '\''$1==0 && $4==%d && $3==%d'\'' ./reflux/weight.xy.asc" u 6:7 t "r%dc%d"' $c $rl $rl $c
      done
    done
    echo ""

  4. Roland Haas reporter
    • changed component to Carpet
    • removed comment

    Erik: is it ok to apply patches 0002 and 0003? (0001 is already applied.) They affect the box that is used to check for "containedness" in the vc restriction case (i.e. for refluxing). They undo the shift by half a cell that is done in regrid.cc in the vertex centered direction, since refluxing is an injection, not an average over neighbours.
