parallel read of simple file fails

Issue #27 resolved
Nico Schlömer created an issue

When trying to read this simple file with this moab-hello-world code, all I'm getting is

$ mpiexec -n 2 ./par_test
 global rank:1 color:0 rank:1 of 2 processors
 global rank:0 color:0 rank:0 of 2 processors
Reading file ./rectanglesmall-2.h5m
 with options: PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS
 on 2 processors on 1 communicator(s)

and the executable chokes on mb->load_file().

Comments (15)

  1. Iulian Grindeanu

    Thanks for reporting. One of the parts in the partition is empty, and we need to handle that case. If you repartition, it works (you will have 1 tri per part :)). The fix should be part of refactoring the partitioning; there are several open issues there, including mesh migration and partitioning after loading in parallel.

  2. Nico Schlömer reporter

    > if you repartition, it works.

    This is the partition I get from

    mbpart 2 -m ML_KWAY rectanglesmall.h5m rectanglesmall-2.h5m
    

    On the other hand,

    mbpart 2 -m ML_RB rectanglesmall.h5m rectanglesmall-2.h5m
    

    appears to work better.

  3. Vijay M

    Is this resolved? It looks like the cause was a bad partition.

    Iulian, it would be good to check, as a post-processing step in mbpart, whether any of the parts are empty and, if so, print a detailed warning message. Please add a test for this too.
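
    A minimal sketch of such a post-processing check, written as standalone Python rather than against the MOAB API. The function names, the list-of-part-ids input, and the warning text are all hypothetical; mbpart's real data structures and messages may look quite different.

```python
from collections import Counter


def find_empty_parts(part_of_element, num_parts):
    """Return the sorted ids of parts that own no elements.

    part_of_element[i] is the part id assigned to element i
    (a hypothetical representation of the partitioner output).
    """
    counts = Counter(part_of_element)
    return [p for p in range(num_parts) if counts[p] == 0]


def empty_part_warning(part_of_element, num_parts):
    """Build a warning string, or return None if every part is non-empty."""
    empty = find_empty_parts(part_of_element, num_parts)
    if not empty:
        return None
    return (
        "WARNING: %d of %d parts are empty (part ids: %s); "
        "a parallel read of this partition may fail."
        % (len(empty), num_parts, ", ".join(map(str, empty)))
    )


# Situation similar to the one reported here: a mesh of two triangles,
# both assigned to part 0, so part 1 is empty.
print(empty_part_warning([0, 0], 2))
```

    The same check could serve as the basis for a regression test: partition a tiny mesh and assert that no part comes back empty.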

  4. Iulian Grindeanu

    We do give a warning at partitioning time, but we do not check when we load the file. Navamita had the same problem yesterday, and I realized only very late that some of the parts were empty (3 out of 512 parts).

    When we loaded on 256 procs, it worked, because every proc had something (we distribute 2 parts per task, and the empty parts were not adjacent).
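
    The effect can be illustrated with a small standalone sketch. It assumes that parts are handed out in contiguous blocks (rank r gets parts r*k to (r+1)*k - 1), which is an assumption made for illustration and not a statement about how the reader actually assigns parts. The function name and the part sizes are made up.

```python
def ranks_without_elements(part_sizes, num_ranks):
    """Return the ranks that would receive no elements at all.

    part_sizes[p] is the number of elements in part p. Parts are
    assumed to be distributed in contiguous, equally sized blocks.
    """
    num_parts = len(part_sizes)
    if num_parts % num_ranks != 0:
        raise ValueError("number of parts must be a multiple of number of ranks")
    k = num_parts // num_ranks  # parts per rank
    return [
        r for r in range(num_ranks)
        if sum(part_sizes[r * k:(r + 1) * k]) == 0
    ]


# Eight parts, two of them (1 and 4) empty and not adjacent.
part_sizes = [4, 0, 3, 5, 0, 2, 6, 1]

print(ranks_without_elements(part_sizes, 8))  # one part per rank
print(ranks_without_elements(part_sizes, 4))  # two parts per rank
```

    With one part per rank, the ranks holding the empty parts have nothing to read. With two parts per rank, each empty part is paired with a non-empty one, so every rank has data.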

  5. Vijay M

    Iulian wrote a fix to check for empty parts. Nico, please test with the branch iulian07/empty_part to see whether you now get a proper error instead of a hang.

    Separately, I think we should validate each part when we generate partition data with mbpart. This should be pretty simple and would avoid incidents such as the one above.

  6. Nico Schlömer reporter

    For some reason, I cannot check out the branch:

    $ git checkout iulian07/<tab><tab>
    iulian07/fix_tupleulong           iulian07/hdf5parallel             iulian07/sequence_factor_option 
    iulian07/ghost_shared_sets        iulian07/parallel_fixes           iulian07/sharedbuild
    

    I do see it though

    $ git branch -a | grep empty
      remotes/github/iulian07/empty_part
      remotes/origin/iulian07/empty_part
    

    No idea what's going wrong there.

  7. Vijay M

    Can you do a "git fetch origin" and try again? You can also try being explicit with

    git checkout -b iulian07/empty_part origin/iulian07/empty_part
    
  8. Nico Schlömer reporter

    Cloned from scratch and tested again. Result:

    [0]MOAB ERROR: --------------------- Error Message ------------------------------------
    [0]MOAB ERROR: ReadHDF5 Failure: Attempt reading an empty dataset on proc 0!
    [0]MOAB ERROR: read_set_data() line 2406 in ReadHDF5.cpp
    --------------------------------------------------------------------------
    MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
    with errorcode 1.
    
    NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
    You may or may not see output from other processes, depending on
    exactly when Open MPI kills them.
    --------------------------------------------------------------------------
    

    The process still doesn't abort, but I guess the above message is enough to make anyone hit CTRL+C.

  9. Vijay M

    The MPI_Abort should have stopped the execution. In any case, the error message will now help catch these cases.

    I also think we need to do this as a post-processing check during mbpart. I'll submit a separate PR for that.

  10. Vijay M

    @nschloe Is this closed for now with the changes in iulian07/empty_part? We will also create a separate PR that adds a post-processing check to make sure mbpart does not generate invalid partitions.
