parallel read of simple file fails
When trying to read this simple file with this moab-hello-world code, all I'm getting is
$ mpiexec -n 2 ./par_test
global rank:1 color:0 rank:1 of 2 processors
global rank:0 color:0 rank:0 of 2 processors
Reading file ./rectanglesmall-2.h5m
with options: PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS
on 2 processors on 1 communicator(s)
and the executable chokes on mb->load_file().
Comments (15)
-
Thanks for reporting. One of the parts in the partition is empty, and we need to fix the case where a part is empty. If you repartition, it works (you will have 1 tri per part :). The fix should be part of refactoring the partitioning; there are several open issues there, including mesh migration and partitioning after loading in parallel.
-
reporter "if you repartition, it works."
This is the partition I get from
mbpart 2 -m ML_KWAY rectanglesmall.h5m rectanglesmall-2.h5m
On the other hand,
mbpart 2 -m ML_RB rectanglesmall.h5m rectanglesmall-2.h5m
appears to work better.
-
Is this resolved? It looks like the cause was a bad partition.
Iulian, it would be good to check, as a post-processing step in mbpart, whether any of the parts are empty and, if so, print a detailed warning message. Add a test for this too.
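A check of this kind can be sketched independently of MOAB. The snippet below is only an illustration in Python, assuming the partition is available as a plain list that maps each element to a part index; `find_empty_parts` and its arguments are hypothetical names, not part of mbpart.

```python
from collections import Counter

def find_empty_parts(part_of_element, num_parts):
    """Return the sorted list of part indices that own no elements."""
    counts = Counter(part_of_element)
    # Counter returns 0 for part indices that never appear
    return [p for p in range(num_parts) if counts[p] == 0]

# Toy input: two triangles, both assigned to part 0, so part 1 is empty
assignment = [0, 0]
empty = find_empty_parts(assignment, 2)
if empty:
    print(f"warning: {len(empty)} of 2 parts are empty: {empty}")
```

A real check would read the part assignment from the partitioned file rather than from a hand-written list.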
-
We do give a warning at partitioning time, but we do not check when we load the file. Navamita had the same problem yesterday, and I realized only very late that some of the parts were empty (3 out of 512 parts).
When we loaded on 256 processes, it worked, because every process had something (we distribute 2 parts per task, and the empty parts were not adjacent).
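The arithmetic behind this can be illustrated with a small sketch. It assumes that parts are handed out to ranks in contiguous blocks (an assumption consistent with the description above, not a statement about how the MOAB reader is implemented), and the positions of the three empty parts are made up for illustration. `ranks_without_elements` is a hypothetical helper.

```python
def ranks_without_elements(part_sizes, num_ranks):
    """Assign parts to ranks in contiguous blocks and return the ranks
    whose assigned parts contain no elements at all."""
    num_parts = len(part_sizes)
    base, extra = divmod(num_parts, num_ranks)
    starved = []
    start = 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional part
        count = base + (1 if rank < extra else 0)
        if sum(part_sizes[start:start + count]) == 0:
            starved.append(rank)
        start += count
    return starved

# 512 parts, 3 of them empty at non-adjacent (made-up) positions
sizes = [10] * 512
for p in (5, 100, 301):
    sizes[p] = 0

print(ranks_without_elements(sizes, 512))  # 1 part per rank  -> [5, 100, 301]
print(ranks_without_elements(sizes, 256))  # 2 parts per rank -> []
```

With one part per rank, three ranks end up with nothing to read. With two parts per rank, each empty part is paired with a non-empty neighbour, so every rank has data.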
-
Iulian wrote a fix to check for empty parts. Nico, please test with branch iulian07/empty_part to see whether you now get a proper error instead of a hang.
Separately, I think we should validate each part when we generate partition data with mbpart. This should be pretty simple actually and will avoid incidents such as above.
-
reporter For some reason, I cannot check out the branch:
$ git checkout iulian07/<tab><tab>
iulian07/fix_tupleulong iulian07/hdf5parallel iulian07/sequence_factor_option iulian07/ghost_shared_sets iulian07/parallel_fixes iulian07/sharedbuild
I do see it though
$ git branch -a | grep empty
remotes/github/iulian07/empty_part
remotes/origin/iulian07/empty_part
No idea what's going wrong there.
-
Can you do a "git fetch origin" and try again? You can also try being explicit with
git checkout -b iulian07/empty_part origin/iulian07/empty_part
-
reporter Cloned from scratch and tested again. Result:
[0]MOAB ERROR: --------------------- Error Message ------------------------------------
[0]MOAB ERROR: ReadHDF5 Failure: Attempt reading an empty dataset on proc 0!
[0]MOAB ERROR: read_set_data() line 2406 in ReadHDF5.cpp
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
The process still doesn't abort but I guess the above message is enough to make anyone CTRL+C.
-
The MPI_Abort should have stopped the execution. Yes, the helpful error message will now be useful to catch these cases.
I also think we need to do this as a post-processing check during mbpart. I'll submit a separate PR for that.
-
@nschloe Is this closed for now with the changes in iulian07/empty_part? We will also create a separate PR that adds a post-processing check to make sure mbpart does not generate invalid partitions.
-
reporter - changed status to resolved
I guess it's fine to resolve this one.