Evaluate Zoltan Parallel Hypergraph Partitioning for SpMV communication reduction

Issue #140 resolved
Moritz Kreutzer created an issue

http://www.cs.sandia.gov/~egboman/papers/IPDPS06.pdf

Some nice overview slides I just found: http://www.doc.ic.ac.uk/~wjk/hypergraph-slides.pdf

More information about available software can be found here: http://bmi.osu.edu/umit/software.html

Comments (19)

  1. Moritz Kreutzer reporter

    This should also alleviate the problems with global re-ordering of unsymmetric matrices.

  2. Jonas Thies

    Achim says that in CRESTA, Zoltan was found to be the only software that scaled beyond about 30k MPI processes, although that was for mesh partitioning, not sparse matrix reordering, I think. I will look into the CRESTA white papers some time to see what they actually did for sparse matrices.

  3. Moritz Kreutzer reporter

    Example output which shows that it seems to work:

    mpirun -np 2 ./test/zoltan -m ../test/test_zoltan.mtx -w 1
    
    [Rank 0] A (orig):
    1.10e+01  .         .         .         .         .         .         .         .         .         .         1.20e+01  .         .         .         .         
    .         2.10e+01  2.20e+01  .         2.30e+01  .         .         .         2.40e+01  .         .         .         .         .         .         .         
    3.10e+01  .         3.20e+01  .         .         .         .         .         .         .         .         3.30e+01  .         .         3.40e+01  .         
    .         .         .         4.10e+01  .         .         .         .         .         .         .         .         .         4.20e+01  .         .         
    .         .         5.10e+01  .         5.20e+01  .         .         .         .         .         .         .         .         .         .         .         
    .         .         6.10e+01  .         .         6.20e+01  .         .         .         .         .         .         .         .         .         .         
    7.10e+01  .         .         .         .         .         7.20e+01  .         .         .         7.30e+01  .         .         .         .         .         
    .         .         .         8.10e+01  .         .         .         8.20e+01  .         8.30e+01  8.40e+01  .         .         .         .         .         
    
    [Rank 1] A (orig):
    .         .         9.10e+01  .         .         .         .         .         9.20e+01  .         .         .         .         .         .         .         
    .         .         .         .         .         .         .         .         .         1.01e+02  1.02e+02  .         1.03e+02  .         .         .         
    .         .         .         1.11e+02  .         .         .         .         .         1.12e+02  1.13e+02  .         .         .         .         .         
    .         .         1.21e+02  .         .         1.22e+02  .         .         1.23e+02  .         .         1.24e+02  .         .         .         .         
    .         .         .         .         .         .         .         .         .         .         1.31e+02  .         1.32e+02  .         .         .         
    .         .         .         .         .         .         .         1.41e+02  .         1.42e+02  .         .         .         1.43e+02  .         .         
    1.51e+02  .         .         .         .         .         .         .         1.52e+02  .         .         1.53e+02  .         .         1.54e+02  .         
    1.61e+02  .         .         .         .         .         1.62e+02  .         .         .         1.63e+02  .         .         .         .         1.64e+02  
    
    Rank [1] A=
    .         .         .         .         .         .         .         .         4.10e+01  .         .         .         .         4.20e+01  .         .         
    .         .         .         .         .         .         .         .         .         1.01e+02  1.02e+02  .         1.03e+02  .         .         .         
    .         .         .         .         .         .         .         .         1.11e+02  1.12e+02  1.13e+02  .         .         .         .         .         
    7.10e+01  .         .         .         .         .         .         .         .         .         7.30e+01  7.20e+01  .         .         .         .         
    .         .         .         .         .         .         .         .         .         .         1.31e+02  .         1.32e+02  .         .         .         
    .         .         .         .         .         .         .         .         .         1.42e+02  .         .         .         1.43e+02  1.41e+02  .         
    .         .         .         .         .         .         .         .         8.10e+01  8.30e+01  8.40e+01  .         .         .         8.20e+01  .         
    1.61e+02  .         .         .         .         .         .         .         .         .         1.63e+02  1.62e+02  .         .         .         1.64e+02  
    
    
    Rank [0] A=
    1.10e+01  .         .         .         .         .         1.20e+01  .         .         .         .         .         .         .         .         .         
    .         2.10e+01  2.20e+01  2.40e+01  2.30e+01  .         .         .         .         .         .         .         .         .         .         .         
    3.10e+01  .         3.20e+01  .         .         .         3.30e+01  3.40e+01  .         .         .         .         .         .         .         .         
    .         .         9.10e+01  9.20e+01  .         .         .         .         .         .         .         .         .         .         .         .         
    .         .         5.10e+01  .         5.20e+01  .         .         .         .         .         .         .         .         .         .         .         
    .         .         6.10e+01  .         .         6.20e+01  .         .         .         .         .         .         .         .         .         .         
    .         .         1.21e+02  1.23e+02  .         1.22e+02  1.24e+02  .         .         .         .         .         .         .         .         .         
    1.51e+02  .         .         1.52e+02  .         .         1.53e+02  1.54e+02  .         .         .         .         .         .         .         .         
    [Rank 0] y=Ax (orig):
    2.300000e+01
    9.000000e+01
    1.300000e+02
    8.300000e+01
    1.030000e+02
    1.230000e+02
    2.160000e+02
    3.300000e+02
    
    [Rank 1] y=Ax (orig):
    1.830000e+02
    3.060000e+02
    3.360000e+02
    4.900000e+02
    2.630000e+02
    4.260000e+02
    6.100000e+02
    6.500000e+02
    
    [Rank 1] y=Ax:
    1.830000e+02
    3.060000e+02
    3.360000e+02
    4.900000e+02
    2.630000e+02
    4.260000e+02
    6.100000e+02
    6.500000e+02
    
    [Rank 0] y=Ax:
    2.300000e+01
    9.000000e+01
    1.300000e+02
    8.300000e+01
    1.030000e+02
    1.230000e+02
    2.160000e+02
    3.300000e+02
    
  4. Jonas Thies

    This is a bit off-topic, but I noticed that on Piz Daint there is a module cray-trilinos which contains Zoltan. It would be nice if we could just type module load cray-trilinos, add -DGHOST_USE_ZOLTAN, and have it work. Right now, however, the module file doesn't tell us where to find the libraries (I don't know how their module system works, and $CRAY_TRILINOS_DIR contains installations for all kinds of CPUs).

  5. Jonas Thies

    To avoid the gather/scatter, why not just loop over the export list and Isend every entry, then loop over the import list and Irecv every entry? You know how many elements you'll receive from each process, and the order in which they arrive does not matter as long as you get the new GIDs.
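
    A minimal sketch of this point-to-point exchange (the names export_gid, export_proc, import_gid and the counts are illustrative assumptions, not the actual GHOST code; GIDs are taken to be long long):

    #include <mpi.h>
    #include <stdlib.h>

    /* Send every exported GID to its destination rank and receive the
     * expected number of GIDs; the arrival order is irrelevant. */
    void exchange_gids(const long long *export_gid, const int *export_proc,
                       int nexport, long long *import_gid, int nimport,
                       MPI_Comm comm)
    {
        MPI_Request *req = malloc((nexport + nimport) * sizeof(MPI_Request));
        int r = 0;

        for (int i = 0; i < nexport; i++)   /* one Isend per export entry */
            MPI_Isend((void *)&export_gid[i], 1, MPI_LONG_LONG,
                      export_proc[i], 0, comm, &req[r++]);

        for (int i = 0; i < nimport; i++)   /* one Irecv per expected entry */
            MPI_Irecv(&import_gid[i], 1, MPI_LONG_LONG, MPI_ANY_SOURCE,
                      0, comm, &req[r++]);

        MPI_Waitall(r, req, MPI_STATUSES_IGNORE);
        free(req);
    }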

  6. Moritz Kreutzer reporter

    In the current implementation, I only have export lists because I have set RETURN_LISTS to PARTS as documented here (that's what Karen advised on the mailing list): http://www.cs.sandia.gov/zoltan/ug_html/ug_alg.html
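
    For reference, the relevant Zoltan setup looks roughly like this (a sketch only; the query-callback registration and error checking are elided, and it assumes Zoltan_Initialize has already been called):

    #include <mpi.h>
    #include <zoltan.h>

    /* Sketch: configure Zoltan for hypergraph partitioning. With
     * RETURN_LISTS=PARTS, the export lists carry the new part of every
     * local object and no import lists are produced. */
    static struct Zoltan_Struct *setup_partitioner(MPI_Comm comm)
    {
        struct Zoltan_Struct *zz = Zoltan_Create(comm);
        Zoltan_Set_Param(zz, "LB_METHOD", "HYPERGRAPH");
        Zoltan_Set_Param(zz, "HYPERGRAPH_PACKAGE", "PHG");
        Zoltan_Set_Param(zz, "RETURN_LISTS", "PARTS");
        /* ...register the ZOLTAN_NUM_OBJ_FN, ZOLTAN_OBJ_LIST_FN,
         * ZOLTAN_HG_SIZE_CS_FN and ZOLTAN_HG_CS_FN callbacks here,
         * then call Zoltan_LB_Partition()... */
        return zz;
    }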

    Then, I followed the advice of Erik on the Zoltan mailing list:

    I believe the part assignment from Zoltan is sufficient to redistribute your matrix, but if you need an explicit permutation vector, you should sort the part arrays and use the permutation induced by the sort. This is easiest to do in serial. There is no parallel sort in Zoltan.
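
    A minimal serial sketch of that sort-induced permutation (assuming the part assignments of all rows have been gathered to one rank; all names are illustrative):

    #include <stdlib.h>

    static const int *g_part;  /* part assignment, indexed by old row id */

    /* Order old row ids by part; ties are broken by old id, so the sort
     * is effectively stable and the result deterministic. */
    static int by_part(const void *a, const void *b)
    {
        int ia = *(const int *)a, ib = *(const int *)b;
        if (g_part[ia] != g_part[ib]) return g_part[ia] - g_part[ib];
        return ia - ib;
    }

    /* After the call, perm[i] is the old index of the row that ends up
     * at new global position i. */
    void induced_permutation(const int *part, int n, int *perm)
    {
        for (int i = 0; i < n; i++) perm[i] = i;
        g_part = part;
        qsort(perm, n, sizeof(int), by_part);
    }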
    

    In an earlier version, I replaced each outgoing row one by one with an incoming row, which failed when the numbers of imports and exports did not match.

    What is still a bit unclear to me is how we decide where exactly a row goes: the import and export lists only specify the process/part, but not where exactly within that part the row should be placed.

    Anyway: the current implementation seems to yield a good permutation; I will post some benchmarks tomorrow...

  7. Jonas Thies

    Once your version is running, we can make a branch to try to make it more scalable; I think the whole point of the import/export lists is to facilitate this. What I wrote above should work for just getting the global indices, but if you try to build the matrix from a row function, you will want to know the new global position of some arbitrary index. One way that is maybe not optimal, but only involves point-to-point communication, is:

    • ask the MPI rank that used to own the row where it was sent
    • ask the new owner what the local index is
    • compute the new global offset from the owner's rank and local offset

    and create a lookup table so you don't have to ask over and over again for the same column indices.
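
    The last step boils down to an exclusive prefix sum over the new per-rank row counts; a sketch (the function and variable names are made up for illustration):

    #include <mpi.h>
    #include <stdlib.h>

    /* Gather each rank's new local row count once and build an exclusive
     * prefix sum; afterwards, new_gid = offset[owner] + local_idx. */
    long long *build_rank_offsets(int nrows_local, MPI_Comm comm)
    {
        int nranks;
        MPI_Comm_size(comm, &nranks);

        int *counts = malloc(nranks * sizeof(int));
        MPI_Allgather(&nrows_local, 1, MPI_INT, counts, 1, MPI_INT, comm);

        long long *offset = malloc((nranks + 1) * sizeof(long long));
        offset[0] = 0;
        for (int r = 0; r < nranks; r++)
            offset[r + 1] = offset[r] + counts[r];
        free(counts);
        return offset;
    }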

  8. Jonas Thies

    No one ever used Zoltan because it's slow, but the partitioning seemed promising. There are probably better ways to implement the interface, but for now we can close this issue.
