Problems with Ghost Exchange

Issue #93 resolved
Michel de Messieres created an issue

hi,

We’ve been using MOAB for a couple of years and have some questions regarding the way exchange_ghost_cells works. Perhaps this is a bug with MOAB.

I have attached one example we are working with, a 2d mesh where we want 2 ghost layers.

Partitioning on 8 procs and requesting 2 ghost layers with bridge_dim 0 and addl_ents 0, I find that there are a few spots where the ghost layer is only 1 cell thick.

Experimenting, I found I could resolve this specific issue if I call the ghost exchange twice:

  1st time with bridge_dim 0 and addl_ents 1
  2nd time with bridge_dim 1 and addl_ents 1

However, I think the original single call (bridge_dim 0, addl_ents 0) should work.

I created a simple test derived directly from the HelloParMOAB.cpp test (attached as moabTest.cxx). Then I added a bit of error checking to catch the spots where the layer is 1 cell thick instead of the expected 2.

So running the test as follows will use bridge_dim 0 and addl_ents 0:

mpiexec -np 8 ./moabTest.exe mesh.h5m 0

which results in the following output:

Rank 2 has cell ID 150 at ( 0.071 -0.008 0.000) which is part of a bad ghost layer.
Rank 4 has cell ID 150 at ( 0.071 -0.008 0.000) which is part of a bad ghost layer.
Rank 4 has cell ID 169 at ( 0.060 -0.043 0.000) which is part of a bad ghost layer.
Rank 4 has cell ID 168 at ( 0.059 -0.062 0.000) which is part of a bad ghost layer.
Bad cells: 4

But then I run with two exchange calls (first bridge_dim 0, addl_ents 1; then bridge_dim 1, addl_ents 1), with the last argument switching between the ghost exchange sequences:

mpiexec -np 8 ./moabTest.exe mesh.h5m 1

which results in correct ghost layers, and this particular case works fine for our application:

Bad cells: 0

The bad cells listed above are consistent with what we see in our application. For example, see the attached png outputs, which plot the ghost layers for rank 2 and rank 4; I marked the bad cells with arrows.

My first priority is to establish whether the first case should work, or whether there is in fact a reason the extra exchange calls are needed.

We run an older branch of MOAB, but trying the most recent code gives the same issues. I also attached the original .msh file. I generate the h5m through MOAB: we open the file and then save it again in h5m format after partitioning.

Many thanks for any assistance.

Comments (11)

  1. Iulian Grindeanu

On the branch https://bitbucket.org/fathomteam/moab/branch/iulian07/ghost_layers_issue there is a new example in which 2 ghost layers are created, in a loop over the number of layers:

    https://bitbucket.org/fathomteam/moab/src/5a90404e6704c84e9146c6dc59f537d068eb6faa/examples/basic/HelloParMOAB2d.cpp?at=iulian07%2Fghost_layers_issue

It still has some issues when partitions are too thin and there is crossover to different partitions (the errors abort the code in debug mode).

  2. Iulian Grindeanu

    Hi Michel, I introduced an option to check and correct thin layers after ghosting; it is on this branch, https://bitbucket.org/fathomteam/moab/branch/iulian07/ghost_layers_issue;

It might solve the issues you are seeing. I am still debating how or when to use the option automatically; the test parallel/ghost_thin_layers.cpp shows 2 possible ways to call the method "manually".

It should be called after the first ghost-layer exchange, and after that you call the exchange for a second ghost layer (when you need 2 layers of ghosting). Basically, when partition layers are thin, because of the way ghosts propagate, the ghosting call should be made in a loop over the number of layers needed. If you need 2 layers and you know the partitions are thin, it should be like this: 1) call the ghost exchange for layer 1; 2) call pcomm->correct_thin_ghost_layers(); 3) call the ghost exchange for layer 2; etc.

  3. Michel de Messieres

hi Iulian. The parallel/ghost_thin_layers.cpp test doesn't seem to match the description you gave. Can you confirm that is the correct example? In the first test I only see one call to exchange_ghost_cells, and it comes after the call to correct_thin_ghost_layers(), not before, as I would expect.

    So I have a few questions:

(1) Can you confirm what the pattern would look like for 2 layers? Would it be like this:

    pcomm->exchange_ghost_cells(...);
    pcomm->correct_thin_ghost_layers();
    pcomm->exchange_ghost_cells(...);
    

    or like this:

    pcomm->exchange_ghost_cells(...);
    pcomm->correct_thin_ghost_layers();
    pcomm->exchange_ghost_cells(...);
    pcomm->correct_thin_ghost_layers();
    

    or as in the example:

    pcomm->correct_thin_ghost_layers();
    pcomm->exchange_ghost_cells(...);
    

(2) Can you confirm how many layers I would request in the above calls? Would I request 2 layers both times, 1 the first time and 2 the second time, or 1 both times? Assume I want two final layers.

Note: Trying a few tests with the new code, I found I could remove some of the errors in my test case described above, but not all of them. However, I'm not sure I'm calling the exchange sequence correctly.

    Thanks.

  4. Iulian Grindeanu

    Hi Michel,

When you need 2 layers, the correct sequence is:

     pcomm->exchange_ghost_cells( 2,  // int ghost_dim, 2 for 2d topology
                                  0,  // int bridge_dim (you could use a different bridge dim)
                                  1,  // int num_layers: first call with 1 layer
                                  ... );
     pcomm->correct_thin_ghost_layers();
     pcomm->exchange_ghost_cells( 2,  // int ghost_dim
                                  0,  // int bridge_dim
                                  2,  // int num_layers: second call with 2 layers
                                  ... );

If you needed 3 layers, you would call correct_thin_ghost_layers() again after the second ghost call, before calling the ghost exchange for 3 layers.
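Generalizing, the pattern for N layers amounts to a loop. A sketch only, in the same elided style as the snippet above: the `...` stands for the remaining exchange_ghost_cells arguments (which depend on the MOAB version), and `nlayers` is a hypothetical variable for the number of layers wanted, so this is not compilable as-is.

```cpp
// Sketch: request nlayers ghost layers on a 2d mesh by ghosting in a loop,
// correcting thin layers between rounds.  '...' elides the remaining
// exchange_ghost_cells arguments; nlayers is a hypothetical variable.
for (int k = 1; k <= nlayers; ++k) {
    pcomm->exchange_ghost_cells( 2,  // int ghost_dim
                                 0,  // int bridge_dim
                                 k,  // int num_layers this round
                                 ... );
    if (k < nlayers)                 // correct between rounds, not after the last
        pcomm->correct_thin_ghost_layers();
}
```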

I know it is pretty convoluted, but in the cases I tried it produced good ghosts of the thickness I needed.

I introduced a reading option too that does this (it was easier to test). So the first test does the following: ghost 1 layer as part of reading, then call correct_thin_ghost_layers(), then ghost for 2 layers.

The second test (test_read_with_thin_ghost_layer) does the same thing via the reading option (so it first ghosts, then calls correct_thin_ghost_layers() as part of the reading process), then calls ghosting a second time for 2 layers.

    in both cases I dump the files existing on each core, and I look to see if there are any errors

Your test executable calls resolve_shared_ents at least 2 times; that is not recommended. resolve_shared_ents should be called only once (usually it is done by the reading process).

    also, to test your file you can just do

    mpiexec -np 8 ./ghost_thin_layers mesh.h5m

Then plot the files (there are 8 files created) for each test.

    If you find an example that does not create 2 ghost layers correctly, please attach to the issue;

I still do not like this solution; suggestions are welcome :) The actual correct_thin_ghost_layers method does not seem very expensive, but I would hate to make it the default. Also, when there is no encroaching, calling directly with 2 layers works fine: no thin-layer correction needed, no ghosting loop needed.

  5. Iulian Grindeanu

In the long term, if partitions are "thin", which can happen a lot in strong scaling studies, there is no way to avoid ghosting in a loop over the number of layers; so the correct and most general solution is to call ghosting in a loop. The cost of the new method introduced is relatively small, but not negligible: it involves another communication step between neighbors. If partitions are thin and multiple ghost layers are needed, the solution itself will need a lot of communication between neighbors.

  6. Michel de Messieres

    This is working for me in the test case. I'm going to apply the fix and test some more complex meshes. I will post any issues. Thanks!

  7. Vijay M

Checking to see if this has been resolved. @Iulian Grindeanu, the thin ghost PR has been merged to master, no?
