Wiki

Clone wiki

HipMCL / Home

Overview

HipMCL is a high-performance parallel algorithm for large-scale network clustering. HipMCL parallelizes popular Markov Cluster (MCL) algorithm that has been shown to be one of the most successful and widely used algorithms for network clustering. It is based on random walks and was initially designed to detect families in protein-protein interaction networks. Despite MCL’s efficiency and multi-threading support, scalability remains a bottleneck as it fails to process networks of several hundred million nodes and billion edges in an affordable running time. HipMCL overcomes all of these challenges by developing massively-parallel algorithms for all components of MCL. HipMCL can be 1000 times faster than the original MCL without any information loss. It can easily cluster a network of ~75 million nodes with ~68 billion edges in ~2.4 hours using ~2000 nodes of Cori supercomputer at NERSC. It also supports GPUs and can cluster the mentioned network in 15 minutes using 1K nodes of Summit at ORNL. HipMCL is developed in C++ language and uses standard OpenMP and MPI libraries for shared- and distributed-memory parallelization.

Sample Data

We provided several examples for single-node and multi-node runs in HipMCL/test folder. For large scale experiments, sample networks can be downloaded from http://portal.nersc.gov/project/m1982/HipMCL/

See the getting started section for some test examples. For other examples, see the HipMCL/test directory.

Developers

HipMCL is developed by

  • Ariful Azad, Indiana University. contact: azad AT iu.edu
  • Oguz Selvitopi, Lawrence Berkeley National Laboratory. contact: roselvitopi AT lbl.gov
  • Aydın Buluç, Lawrence Berkeley National Laboratory. contact: abuluc AT lbl.gov

HipMCL collaborators:

  • Georgios Pavlopoulos (JGI),
  • Nikos Kyrpides (JGI)
  • Christos Ouzounis (CERTH).

Citations

Ariful Azad, Georgios A. Pavlopoulos, Christos A. Ouzounis, Nikos C. Kyrpides, and Aydın Buluç. HipMCL: A high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research, 2018. doi: https://doi.org/10.1093/nar/gkx1313.

Oguz Selvitopi, Md Taufique Hussain, Ariful Azad, and Aydın Buluç. Optimizing high performance Markov clustering for pre-exascale architectures. In Proceedings of the IPDPS, 2020.

Compilation

Dependencies: HipMCL depends on standard OpenMP and MPI libraries. OpenMP is usually integrated in most compilers. Please install MPI if it is not installed in your system. For example, you can install MPICH from https://www.mpich.org/.

To compile HipMCL on a workstation, all you need to do is to execute the following two commands, in the given order, inside the HipMCL directory:

#!bash
  cmake .
  make

The executable files are stored in the bin folder. You can run simple tests using the following command from the main directory:

#!bash
  ctest -V 

Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the main directory. For example, the following commands will compile HipMCL on a workstation.

ln -s makefile-mac Makefile 
make hipmcl

To compile with GPU support, checkout the "gpu" branch and use the above cmake commands to compile. This will produce the executable "hipmcl-gpu" under "bin". The GPU version requires CUDA 10 and a compiler compatible with c++14 standard. The GPU version is capable of utilizing all present GPUs on a node. It has the same options as the CPU version and a few more options related to GPUs (see options below).

Compilation at NERSC

We provided a makefile (named makefile-nersc) for NERSC systems. The same makefile works for both Edison and Cori. The following commands will compile HipMCL on Edison and Cori.

export CRAYPE_LINK_TYPE=static
module swap craype-haswell craype-mic-knl (if you want to run on the KNL partition)
ln -s makefile-nersc Makefile 
make hipmcl

Compilation at Summit with GPU support

The necessary module files to compile with GPU support on Summit machine are loaded as follows:

module load cuda cmake gcc/8.1.1 boost

Getting started

Running HipMCL on a workstation:

Assume that HipMCL is downloaded to "HipMCL" directory. The complete sequence of commands needed to run an example:

  1. Compile HipMCL (see above for compilation instruction). The executable file is located in HipMCL/bin directory.
  2. Go to HipMCL/test/small_scale_examples directory. Command: cd test/small_scale_examples
  3. In HipMCL/test/small_scale_examples directory, execute an example script. Command: run: bash sevenvertex.sh
  4. The output of HipMCL will be written in a file named sevenvertex.triples.hipmcl in HipMCL/data directory.

To run your own dataset, create a copy of sevenvertex.sh (if your file is in labeled triple format; see below for input format options). Then change the following in the copied script: 1. HIPMCL_EXE: path to HipMCL executable relative your current location 2. IN_FILE: name and path to your input data file 3. OUT_FILE: name and path to your output data file 4. See below to learn how to set different input parameters. Only two options are mandatory (a) -M input_file_name and (b) -I inflation_parameter.

Running HipMCL on a NERSC system (Edison/Cori):

Assume that HipMCL is downloaded to "HipMCL" directory. The complete sequence of commands needed to run an example:

  1. Compile HipMCL (see above for compilation instruction). The executable file is located in HipMCL/bin directory.
  2. Go to HipMCL/test/NERSC_examples directory. Command: cd test/NERSC_examples
  3. In HipMCL/test/NERSC_examples directory, execute an example script.
  • For edison use: sbatch virus_nersc_edison.sh
  • For cori (KNL) use: sbatch virus_nersc_cori_KNL.sh

The output of HipMCL will be written in a file named vir_vs_vir_30_50length.indexed.triples.hipmcl in HipMCL/data directory.

To run your own dataset, create a copy of virus_nersc_edison.sh (if your file is in labeled triple format; see below for input format options). Then change the following in the copied script: 1. HIPMCL_EXE: path to HipMCL executable relative your current location 2. IN_FILE: name and path to your input data file 3. OUT_FILE: name and path to your output data file 4. See below to learn how to set different input parameters. Only two options are mandatory (a) -M input_file_name and (b) -I inflation_parameter.

Several other examples are provided in the test directory (for both desktop and NERSC runs). The data directory has several small example networks. We uploaded several medium- and large-scale networks here: http://portal.nersc.gov/project/m1982/HipMCL/. These network can be used to test scripts in test/NERSC_examples folder. Note: HipMCL only support square process grid at this moment (a limitation imposed by the CombBLAS library).

Options


-M <ifname> : input file name (mandatory)

The input graph is passed to HipMCL in a file named <ifname>. HipMCL currently supports two input formats: (a) labeled triples (default) (b) matrix market. These formats are briefly discussed below. See example at the bottom of this page for details.

Labeled triple format: Labeled triple format is the default choice in HipMCL. In this format, each line encodes an edge of the graph. An edge is represented by a triple (source_vertex, destination_vertex, edge_weight). Source and destination vertices are presented by string labels and edge weights are represented by floating point numbers. Three fields in a triple is separated by white space. We show an example for for a graph with seven vertices and 12 edges.

#!c++
vertex_1    vertex_2    0.34
vertex_1    vertex_4    1.50
vertex_2    vertex_5    0.67
vertex_2    vertex_7    1.41
vertex_3    vertex_6    2.15
vertex_4    vertex_1    0.55
vertex_4    vertex_3    0.87
vertex_5    vertex_6    1.75
vertex_6    vertex_3    1.4
vertex_7    vertex_3    0.75
vertex_7    vertex_4    0.25
vertex_7    vertex_5    1

Matrix market format: A matrix market file starts with a mandatory header that specifies the type of the matrix. The header is followed by several lines of optional comments . A comment begins with "%". After the optional comments, there is another mandatory data header containing three integers denoting the number of rows, columns and nonzero entries in the matrix. Since the input to HipMCL is a graph, the number of rows and columns are always equal, denoting the number of vertices in the network. The rest of the file lists edges of the graph, one line for each edge. Each edge is represented by three numbers: the source vertex, the destination vertex and the weight of the edge. In contrast to the labeled triples format, matrix market only allow integer vertex identifiers. This is useful when vertex labels are converted to numeric ids in a preprocessing step to reduce input file size. This format is also popular in scientific computing community. A large collection of graphs and sparse matrices are already available in the matrix market format, such as the The University of Florida Sparse Matrix Collection. The same graph shown above with labeled triples format is shown in matrix market format below:

#!c++
%%MatrixMarket matrix coordinate real general
%comments
7   7   12
4   6   0.34
4   2   1.50
6   5   0.67
6   3   1.41
1   7   2.15
2   4   0.55
2   1   0.87
5   7   1.75
7   1   1.4
3   1   0.75
3   2   0.25
3   5   1
Here, the header denotes that the file contains a symmetric matrix where the nonzero values are stored in floating point numbers. See the matrix market website for further detail.

--matrix-market : if provided, the input file is in the matrix market format (default: the file is in labeled triples format)

-base <0|1> : index of the first vertex [optional, default: 1]

This option is only used with the matrix market format. Sets the start vertex index in the input file. By default, the vertices are numbered starting with 1.

-o <ofname> : output file name (optional, default: <ifname>.hipmcl)

The output clusters from HipMCL will be saved in a file named <ofname>. If this option is not supplied by an user, HipMCL will save the output in <ifname>.hipmcl file, where <ifname> is the name of the input file.

In the output file, each cluster is encoded by list of vertices belonging to the cluster and stored in a single line of the file. Vertex labels (or numeric ids in the matrix market format) are separated by white space.

Assume that HipMCL obtained two clusters from the graph described above. A possible output file will look like the following:

#!c++
vertex_2 vertex_3 vertex_5 vertex_6 vertex_7 
vertex_1 vertex_4 

In the current version of HipMCL, we do not sort clusters by the their sizes. In future releases, we will have the option to keep the clusters sorted by their sizes.


-I <num> : inflation parameter (mandatory)

Sets the main inflation value to <num>. This parameter controls the granularity of clusters. It is usually chosen somewhere in the range [1.2-5.0]. -I 5.0 will tend to result in fine-grained clusterings, and -I 1.2 will tend to result in very coarse grained clusterings.


-p <num> : cutoff for pruning (optional, default: 0.0001 )

Set the hard threshold for pruning. After squaring the matrix, entries that are smaller than the cutoff are removed, resulting in a columns with at most 1/cutoff entries.

-pct <pct> recovery percentage (optional, default: 90)

Sets the recovery percentage. See the description of the recovery number.

-R <int> : recovery number (optional, default: 1400)

Sets the recover number. Recovery is applied in a column if the the sum of all remaining entries is less than <pct>/100 and the number of remaining entries is less than the recovery number.

-S <int> : selection number (optional, default: 1100)

If recovery was not necessary and a pruned column has more than S entries, the selection step prunes the column further to at most S entries. It is possible that the recovery condition is satisfies again after a column goes through selection. In that case, recovery is performed again on this column. The latter situation will not occur if R <= S.

--preprune : if provided, apply prune/select/recovery before the first iteration (needed when dense columns are present) (default: don't preprune. However, if the average nonzero per column is larger than max{S,R}, prepruning is still applied by default)


-rand <0|1> : randomly permute vertices for load balance (optional, default: 0 (no permute))

This option asks HipMCL to randomly permute vertices for load balance. By default, HipMCL does not permute vertices.

--remove-isolated : if provided, remove isolated vertices (default: don't remove isolated vertices)


-phases <int> : number of phases (optional, default: 1)

Sets the number of phases used by HipMCL. In each phase, HipMCL expands 1/phase part of the matrix. Using more phases reduces the memory requirement of the expansion step; however, it increasing the running time. Hence, we recommend to set the -per-process-mem option so that HipMCL can dynamically compute the number of phases depending on the available memory. This option is ignored if -per-process-mem is supplied.

-per-process-mem <int> : available memory per process in GB (optional, default: 0GB)

This option is used to adaptively compute the number of phases. If not provided, the number of phases supplied in the parameter <phases> is used. We highly recommend to supply this option so that HipMCL can optimize the number of phases and avoid unnecessary communication. Assume that HipMCL is run on system with 64GB memory per node. The per process memory is calculated by diving 64 by number of processes per node. Examples:

  • 4 MPI processes running in 4 nodes: -per-process-mem 64
  • 4 MPI processes running in 1 nodes: -per-process-mem 16
  • 16 MPI processes running in 8 nodes: -per-process-mem 32

Options helpful for debugging


--show : if provided, show matrices after major steps (optional, default: do not show matrices)


-lspgemm <nsparse|rmerge|bhsparse|cpu|hybrid> : SpGEMM algorithm to run. nsparse, rmerge, bhpsarse are variants of algorithms for GPU and cpu is the algorithm for CPU. hybrid makes a selection between those according to certain criteria (default: hybrid).

--nrounds <int> : Number of samples to use in probabilistic memory estimation (default: 7).

Updated