#405 Merged at 7bd90f0
  1. dcollins4096

The fields divB and gradPhi are currently allocated in Grid_PrepareGrid. . This gets called on all grids on all tasks, which leads to memory errors. This PR moves the allocation to Grid_AllocateGrids, which only gets called on the grid's task.

Since this changes the allocation sequence, I issued PR 404 which ensures that divB and gradPhi are allocated for all possible initializers.

Comments (25)

  1. Brian OShea

    This PR fails the test suite vs. the tip of enzo-dev. The tests that run work just fine. However, the following tests die with segmentation faults:


    The error message I get on my laptop (gcc compiler stack) is the same for all of these, and is as follows:

     [starbuck:94287] *** Process received signal ***
    [starbuck:94287] Signal: Segmentation fault: 11 (11)
    [starbuck:94287] Signal code: Address not mapped (1)
    [starbuck:94287] Failing at address: 0x0
    [starbuck:94287] [ 0] 0   libsystem_platform.dylib            0x00007fff959da52a _sigtramp + 26
    [starbuck:94287] [ 1] 0   ???                                 0x00007fc452032400 0x0 + 140481166255104
    [starbuck:94287] [ 2] 0   enzo.exe                            0x000000010a7fcd3b _ZN4grid14MHDRK2_1stStepEPP6fluxesiiP16ExternalBoundary + 3387
    [starbuck:94287] [ 3] 0   enzo.exe                            0x000000010a7c8624 _Z15EvolveLevel_RK2P11TopGridDataPP19LevelHierarchyEntryidP16ExternalBoundaryP18ImplicitProblemABCdPP15SiblingGridList + 1780
    [starbuck:94287] [ 4] 0   enzo.exe                            0x000000010a7c8c85 _Z15EvolveLevel_RK2P11TopGridDataPP19LevelHierarchyEntryidP16ExternalBoundaryP18ImplicitProblemABCdPP15SiblingGridList + 3413
    [starbuck:94287] [ 5] 0   enzo.exe                            0x000000010a4fcde7 _Z15EvolveHierarchyR14HierarchyEntryR11TopGridDataP16ExternalBoundaryP18ImplicitProblemABCPP19LevelHierarchyEntryd + 2439
    [starbuck:94287] [ 6] 0   enzo.exe                            0x000000010a492407 main + 3079
    [starbuck:94287] [ 7] 0   libdyld.dylib                       0x00007fff8c9e95ad start + 1
    [starbuck:94287] *** End of error message ***

    Note that MHDZeldovichPancake (the only 1D test that fails) seg faults after 252 time steps, whereas the 2D tests that fail die after either the first or second time steps.

    1. Brian OShea

      Actually, I have a quick followup on this, since I'm also looking at PR #364. Did you successfully run the test suite, @dcollins4096 ? It's not enabled on your BitBucket account by default, it seems. Does this PR require something from a previous PR in order to run successfully?

  2. dcollins4096 author

    That's right, sorry I wasn't clear in the first note:

    This test requires the other PR, PR 404 to run.

    It seemed like it would be much easier to review if they were two separate PRs, since PR 404 is literally the same change 47 times.

    On my desktop, both PR405 and PR404 pass the push suite, provided PR404 is pulled first.

    1. Brian OShea

      Oh, hilarious. OK, in that case I am not at all concerned about this one. :-) Thanks for clearing this up! (I'll also verify it, but i imagine it's just fine.)

        1. dcollins4096 author

          Oh, fascinating. I just looked for "Fail" in test_results, but these only threw errors.

          I do see that seg fault. I'll dig into it.

  3. Philipp Grete

    @dcollins4096 is there a specific reason you added the allocations to src/enzo/hydro_rk/Grid_MHDRK2_1stStep.C?

    I see how they fix the segfaults, but my first impression is that the allocations there are out of place, i.e., at that place in the code the variables should have already been allocated, which would suggest that a fix at an earlier time (in the code flow) would be more appropriate. If you agree I'd be happy and have a look where that 'earlier time' might be.

    1. dcollins4096 author

      That's an excellent question.

      I looked into exactly that, allocating it earlier, which is what prompted PR 404. I spent a bunch of time yesterday looking into all the places that I missed for the allocation (which caused the seg faults), and it became clear that was more effort than it was worth. It turns out that divB and gradPhi aren't filled with anything meaningful and aren't used until they're computed in the RK solver. They don't get interpolated or moved or communicated or copied from one grid to another. (This implies that they're not persistent between sub-grid timesteps, for what that's worth, which I didn't realize until now) . So it's not really necessary to align the allocation of divB with all the points where BaryonField comes into existence. (There are several of them: I identified two, but there might be more)

      The upshot is allocating them on demand is equivalent to their usage.

      We should discuss if this is the behavior that divB and gradPhi should have, but that's going to require more careful testing.

      1. Philipp Grete

        I'd argue for digging a bit deeper. Rereading Brian's comment that "Note that MHDZeldovichPancake (the only 1D test that fails) seg faults after 252 time steps" indicates that something gets falsely allocated (or pointed to) early on. So it could be worth looking at this now as it might prevent other problems down the road.

        1. dcollins4096 author

          The MHDZeldovichPancake doesn't refine until 252 steps. The test failed then because the new subgrids didn't have divB allocated. Brian's comment happened before I put the allocate in Grid_MHDRK2_1stStep.C: after I put in the allocate, the tests pass. So you're correct that something was getting falsely allocated, but I fixed it, it was not getting allocated.

          I did dig deeper, and there is no purpose in having divB exist outside of those routines, if we're going to keep the functionality of that field the same. The fields only get used in Grid_MHD3D, where they're filled, Grid_MHDSourceTerms where they're added to the fluxes, and the destructor, where they're deleted. So there isn't any danger of anything else happening.

        2. dcollins4096 author

          We could, though, promote it to a proper BaryonField. . This would be I think straight forward, though it would possibly change the behavior. This would allow for future users who for whatever reason need it to be defined before the hydro step.

          1. Philipp Grete

            I'll need to have a closer look at the Dedner AMR structure again. If those fields aren't used beyond the machinery you mention, then it might be a good idea to remove those lines of code from Grid_AllocateGrids and keep them only at hydro_rk/Grid_MHDRK2_1stStep.C (or similar) . This way there's only one (clean) place where they are allocated and not having them as a BaryonField would also highlight that they are not supposed to be touched by other routines.

            1. dcollins4096 author

              I think that is the best solution.

              I went ahead and made that change. It passes the AMR tests mentioned above, I'm running the push suite now. But the AMR Dedner tests are the place where it'll break.

  4. Brian OShea

    @dcollins4096 just checking - is this ready to be merged now? Also, would you mind merging with the tip of enzo-dev to make sure that it still passes all of the tests?

    1. Brian OShea

      Just following up on my own comment: I merged this PR with the tip of enzo-dev and ran it against the tip. The push suite passes with flying colors. I'm going to hit approve; @pgrete , could you sign off as well? And maybe @gbryan could take a quick look as well?

  5. dcollins4096 author

    I did run the push suite as well, and it passed push suite. I also changed the two recent comments from @ngoldbaum and @gbryan . Thanks!

    edit to add, I meant I ran the push suite on a locally-merged version, vs. the current tip.

  6. Philipp Grete

    I went through the Dedner machinery in more detail and am now suggesting we talk about this in more detail next week at the workshop.

    Depending on whether people actually use UseDivergenceCleaning (not the Dedner method, but the active deprojection method) we could remove those variables completely to clean up the code as they are not required at all (if I haven't missed anything fundamental) for the mixed (hyperbolic/parabolic) Dedner-style divergence cleaning.

    They are only required if the EGLM-MHD formulation is used, which (a) never happens (see Grid_MHDSourceTerms.C) given that there's no #define DEDNER_SOURCE and (b) the Dedner paper itself recommends using the GLM-MHD formulation as the additional source terms in the EGLM-MHD formulation cause a "significant loss of conservation".

      1. Brian OShea

        I think it makes sense to hold off on doing anything else with this PR until we discuss next week!

  7. dcollins4096 author

    Upon further discussion, the divB and gradPhi terms are only used in the MHD source terms, which are fully commented out in the code. So those two fields are allocated and filled, but never used. This is consistent with the statements in the original Dedner et al 2002 paper, which states that these terms are problematic. So we’ve removed the terms completely.

    This does not affect the main hyperbolic treatment, which only requires the Phi field, which is treated as a full BaryonField. It also does not affect the poisson solvers cleaning (`UseDivergenceCleaning`), which generates it’s own divergence field.

      1. Philipp Grete

        I also think that it's ready to be merged once Dave confirms that it successfully passes the test suite. I can confirm that it still compiles with CUDA (even though not with the current week of code for other unrelated reasons - working on those next).