Parallel performance

Issue #61 new
Former user created an issue

Are there any tweaks on using MOAB at scale? I am having two major issues:

  1. mbpart (incl. reordering) crashes somewhere beyond 400 million tets
  2. MOAB gets very slow at scale: reading 341 million tets (*.h5m, reordered) takes ~2 minutes on 1K nodes and >13 minutes on 9K nodes

Tried different partitioning strategies for 1); all of them segfault with an unrealistic malloc request:

Zoltan_Malloc (from ../../src/phg/phg_build_calls.c,1782) No space on proc 0 - number of bytes requested = 18446744068676611040
Zoltan_Malloc (from ../../src/phg/phg_build_calls.c,1783) No space on proc 0 - number of bytes requested = 18446744068676611040
[0] Zoltan ERROR in Zoltan_Get_Hypergraph_From_Queries (line 766 of ../../src/phg/phg_build_calls.c):  Error
[0] Zoltan ERROR in Zoltan_PHG_Build_Hypergraph (line 131 of ../../src/phg/phg_build.c):  Error getting hypergraph from application
[0] Zoltan ERROR in Zoltan_PHG (line 291 of ../../src/phg/phg.c):  Error building hypergraph.
[0] Zoltan ERROR in Zoltan_PHG (line 502 of ../../src/phg/phg.c):  Memory error.
[0] Zoltan ERROR in Zoltan_LB (line 482 of ../../src/lb/lb_balance.c):  Partitioning routine returned code -2.
Partitioner failed!
  Error code: MB_FAILURE (16)
  Error message: No error
Zoltan_Malloc (from ../../src/zz/zz_coord.c,157) No space on proc 0 - number of bytes requested = 18446744057587912352
[0] Zoltan ERROR in Zoltan_Get_Coordinates (line 159 of ../../src/zz/zz_coord.c):  Memory error
[c400-105.stampede.tacc.utexas.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
/tmp/slurmd/job8000960/slurm_script: line 10: 67020 Segmentation fault      (core dumped) moab/mpi/bin/mbpart 2048 tet4_extreme.msh tet4_extreme_2048_rcb.h5m -z RCB --reorder -T

Given the moderate file size of 13 GB in 2), I would expect a load time of a few seconds.

Comments (8)

  1. Iulian Grindeanu

    The error in Zoltan_Malloc is not ours, it is from zoltan; how big is the machine?
    Is there a way we can get the file to play with it? We have a machine here with 1 TB of allocatable memory on one task (compute001).
    In general, we try to create some partitions before writing such big files; how did you create that mesh file? Is there a way to write it partitioned? Or write separate files, then do parallel merge, and create partitions along the way?

    mbpart is a serial process; we need to improve it to run in parallel for large files. We have started to implement a trivial partition with which we can load the file in parallel; then we should call Zoltan or ParMETIS in parallel for mesh migration/partitioning.

    Most of our IO scaling tests revolve around the GenLargeMesh example; there, the mesh is partitioned during creation, and it has a very favorable partition/numbering.

    Also, at that scale (9K tasks), it is important to have a good HDF5 implementation. We have also tested a very simple dataset write directly with HDF5; it does not scale "weakly": every task writes the same amount of data, yet increasing the number of tasks increases the total wall time (it should stay constant for perfect weak scaling).
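
    For reference, here is a minimal sketch of the kind of weak-scaling test meant here: every rank writes the same fixed amount of data collectively through HDF5's MPI-IO driver. The file name, dataset name, and per-rank size are illustrative, not what MOAB's writer actually uses.

    #include <hdf5.h>
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      const hsize_t per_rank = 1 << 20;             // fixed amount of data per rank (weak scaling)
      std::vector<double> buf(per_rank, (double)rank);

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);      // access the file through the MPI-IO driver
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
      hid_t file = H5Fcreate("weak_scale.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      hsize_t dims = per_rank * (hsize_t)nprocs;    // global dataset grows with the number of ranks
      hid_t filespace = H5Screate_simple(1, &dims, NULL);
      hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      hsize_t offset = per_rank * (hsize_t)rank;    // each rank writes one contiguous slab
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &per_rank, NULL);
      hid_t memspace = H5Screate_simple(1, &per_rank, NULL);

      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);     // request a collective write
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

      double t0 = MPI_Wtime();
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf.data());
      double t1 = MPI_Wtime();
      if (rank == 0) printf("write time with %d ranks: %f s\n", nprocs, t1 - t0);

      H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
      H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
      MPI_Finalize();
      return 0;
    }

    With perfect weak scaling the reported time would stay flat as the number of ranks grows; in our tests it increases instead.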

    What kind of machine do you run your tests on? On Mira/Vesta (BG/Q machines), it is important to have the bglockless environment variable set; otherwise IO is very slow.

    IO time depends a lot on partitioning/numbering; also, if you have lower-dimensional entities (faces, edges), that part does not scale well. If you have only primary entities (tetras), it should scale better.

  2. Alex

    Iulian, sorry for the delay..

    Is there a way we can get the file to play with it? 
    

    Yes, sure. I can share both the gmsh configs and the generated mesh. Just let me know a place that is convenient for you, maybe on Mira?

    The error in Zoltan_Malloc is not ours, it is from zoltan; how big is the machine?
    

    I tried it on a couple of large-memory machines, but couldn't get the job done anywhere. Memory sizes ranged from around 50 GB to 1 TB. Not sure more memory would help; dividing 18446744057587912352 by 1 TB still gives a pretty big number :-)

    In general, we try to create some partitions before writing such big files; how did you create that mesh file? Is there a way to write it partitioned? Or write separate files, then do parallel merge, and create partitions along the way?
    

    I am generating the mesh with gmsh, which, in theory, could also do some partitioning on the fly for testing. I had trouble with mbconvert in the past, which is one of the reasons I switched to mbpart. I can't remember the problem exactly; I think it was related to reordering the mesh for MOAB's parallel reader.

    Most of our IO scaling tests revolve around the GenLargeMesh example; there, the mesh is partitioned during creation, and it has a very favorable partition/numbering.
    

    Hm, I am using similar configs (PARALLEL=READ_PART;PARALLEL_RESOLVE_SHARED_ENTS;PARTITION=PARALLEL_PARTITION;) for reading the mesh. FYI, this is how I am generating the h5m file: srun -n 1 -p normal -t 24:00:00 mbpart 1500 tet4_10.msh tet4_10_1500.h5m --reorder -m ML_KWAY -t -T 2>&1 | tee tet4_10_1500.h5m.log
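
    As a sanity check, this is roughly what the read side looks like in code; just a minimal sketch, assuming MOAB built with MPI and HDF5, and using one of the file names above as a placeholder:

    #include "moab/Core.hpp"
    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // Same read options as in the runs described above
      const char* opts = "PARALLEL=READ_PART;"
                         "PARALLEL_RESOLVE_SHARED_ENTS;"
                         "PARTITION=PARALLEL_PARTITION";

      moab::Core mb;
      double t0 = MPI_Wtime();
      moab::ErrorCode rval = mb.load_file("tet4_10_1500.h5m", 0, opts);
      double t1 = MPI_Wtime();

      if (rval != moab::MB_SUCCESS && rank == 0)
        std::cerr << "load_file failed" << std::endl;
      if (rank == 0)
        std::cout << "load_file took " << (t1 - t0) << " s" << std::endl;

      MPI_Finalize();
      return 0;
    }

    The read timings quoted in this thread are essentially the wall time of that load_file call.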

    Also, at that scale (9K tasks), it is important to have a good HDF5 implementation. We have also tested a very simple dataset write directly with HDF5; it does not scale "weakly": every task writes the same amount of data, yet increasing the number of tasks increases the total wall time (it should stay constant for perfect weak scaling).
    

    Hm, what does a good HDF5 implementation mean? For the 9K runs I simply used what MOAB's configure script built automatically. Now I am providing my own HDF5 installation (1.10.1); I also tried a center installation on Stampede 2 today (1.8.16): no improvement. Interestingly, the initialization (load_file) for a small mesh (38,446,016 tets, 1.5 GB) on Stampede 2 is slow (30 seconds) using 500 nodes / ranks. Using 1500 nodes, MOAB simply got stuck in load_file for 20 minutes, at which point I aborted the job. All of these are strong-scaling issues: is there maybe something which, for some strange reason, performs worse when the number of elements per node gets too low?

    What kind of machine do you run your tests on? On Mira/Vesta (BG/Q machines), it is important to have the bglockless environment variable set; otherwise IO is very slow.
    

    Uhm, I ran simulations on Theta, Cori Phase II and Stampede 2.

    IO time depends a lot on partitioning/numbering; also, if you have lower-dimensional entities (faces, edges), that part does not scale well. If you have only primary entities (tetras), it should scale better.
    

    I see. I guess it will help once you have my meshes.

  3. Iulian Grindeanu

    We have access to most of these machines, except Stampede. Where is Stampede? Can you put the mesh somewhere we can access it, on Theta or Mira (or Cori, or Edison)?

    That error from Zoltan must be some overflow of a long int: number of bytes requested = 18446744068676611040

    (gdb) p /x 18446744068676611040
    $1 = 0xfffffffed4036be0
    

    This looks to me like a negative number.
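
    To illustrate the mechanism (the counts below are made up; they are only chosen so the arithmetic reproduces the value from the log): a byte count that goes negative through a signed 32-bit intermediate and is then handed to an allocator taking size_t shows up as an enormous unsigned request.

    #include <cstdio>
    #include <cstddef>

    int main() {
      long long true_count = 3665849724LL;       // hypothetical number of ids/pins, larger than 2^31
      int count32 = (int)true_count;             // wraps to -629117572 on two's-complement systems
      long long bytes = (long long)count32 * 8;  // multiplied by a sizeof: -5032940576
      size_t request = (size_t)bytes;            // reinterpreted as unsigned: 18446744068676611040
      printf("count32 = %d\nbytes = %lld\nrequest = %zu\n", count32, bytes, request);
      return 0;
    }

    So the "number of bytes requested" in the log is consistent with a signed integer overflow somewhere in the hypergraph build once the mesh gets this large.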

  4. Alex
    We have access to most of these machines, except Stampede. Where is Stampede?
    

    Great; Stampede is in Austin.

    Can you put the mesh somewhere we can access it, on Theta or Mira (or Cori, or Edison)?
    

    Yeah sure, the meshes are here:

    [breuera@miralac2 iulian]$ pwd
    /gpfs/mira-fs0/projects/GMSeismicSim/alex_work/iulian
    

    Let me know if you have trouble reaching them.

  5. Iulian Grindeanu

    Hi Alex, sorry for the silence, I was out; I do not have permission for the folder:

    [grindean@miralac2 projects]$ cd GMSeismicSim/
    -bash: cd: GMSeismicSim/: Permission denied
    

    Can you copy it somewhere else? I think it will survive for a few days in /tmp, until I can copy it. How big is the file?

  6. Alex

    No, that didn't work:

    No space left on device
    

    It's a couple of files, altogether maybe 100-200 GB.

    Copied the small example and the config I used to generate the larger mesh to /tmp for now:

    /tmp/iulian
    

    Should be good enough for the time being.

  7. Iulian Grindeanu

    Thanks. The part folder has only a small geo file; the stampede2_small folder has msh and h5m files on the order of 2 GB, which I assume are complete. It is enough to get me started; I will take a look at them.

    mbpart crashed only on the bigger file, is that true? The reorder step might have issues.

    Did you try using the metis option directly (-m ML_KWAY, for example)?

  8. Alex
    mbpart crashed only on the bigger file, is that true? The reorder step might have issues.
    

    Yes, it never crashed on my small files.

    Did you try using the metis option directly (-m ML_KWAY, for example)?
    

    Uhm, no. I didn't know that could make a difference.
