Wiki
Clone wikiHipMCL / Home
Overview
HipMCL is a high-performance parallel algorithm for large-scale network clustering. HipMCL parallelizes popular Markov Cluster (MCL) algorithm that has been shown to be one of the most successful and widely used algorithms for network clustering. It is based on random walks and was initially designed to detect families in protein-protein interaction networks. Despite MCL’s efficiency and multi-threading support, scalability remains a bottleneck as it fails to process networks of several hundred million nodes and billion edges in an affordable running time. HipMCL overcomes all of these challenges by developing massively-parallel algorithms for all components of MCL. HipMCL can be 1000 times faster than the original MCL without any information loss. It can easily cluster a network of ~75 million nodes with ~68 billion edges in ~2.4 hours using ~2000 nodes of Cori supercomputer at NERSC. It also supports GPUs and can cluster the mentioned network in 15 minutes using 1K nodes of Summit at ORNL. HipMCL is developed in C++ language and uses standard OpenMP and MPI libraries for shared- and distributed-memory parallelization.
Sample Data
We provided several examples for single-node and multi-node runs in HipMCL/test folder. For large scale experiments, sample networks can be downloaded from http://portal.nersc.gov/project/m1982/HipMCL/
See the getting started section for some test examples. For other examples, see the HipMCL/test directory.
Developers
HipMCL is developed by
- Ariful Azad, Indiana University. contact: azad AT iu.edu
- Oguz Selvitopi, Lawrence Berkeley National Laboratory. contact: roselvitopi AT lbl.gov
- Taufique Hussain, Indiana University. contact: mth AT indiana.edu
- Aydın Buluç, Lawrence Berkeley National Laboratory. contact: abuluc AT lbl.gov
HipMCL collaborators:
- Georgios Pavlopoulos (JGI),
- Nikos Kyrpides (JGI)
- Christos Ouzounis (CERTH).
Citations
Ariful Azad, Georgios A. Pavlopoulos, Christos A. Ouzounis, Nikos C. Kyrpides, and Aydın Buluç. HipMCL: A high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research, 2018. doi: https://doi.org/10.1093/nar/gkx1313.
Oguz Selvitopi, Md Taufique Hussain, Ariful Azad, and Aydın Buluç. Optimizing high performance Markov clustering for pre-exascale architectures. In Proceedings of the IPDPS, 2020.
Compilation
Dependencies: HipMCL depends on standard OpenMP and MPI libraries. OpenMP is usually integrated in most compilers. Please install MPI if it is not installed in your system. For example, you can install MPICH from https://www.mpich.org/.
To compile HipMCL on a workstation, all you need to do is to execute the following two commands, in the given order, inside the HipMCL directory:
#!bash
cmake .
make
The executable files are stored in the bin folder. You can run simple tests using the following command from the main directory:
#!bash
ctest -V
Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the main directory. For example, the following commands will compile HipMCL on a workstation.
ln -s makefile-mac Makefile make hipmcl
To compile with GPU support, checkout the "gpu" branch and use the above cmake commands to compile. This will produce the executable "hipmcl-gpu" under "bin". The GPU version requires CUDA 10 and a compiler compatible with c++14 standard. The GPU version is capable of utilizing all present GPUs on a node. It has the same options as the CPU version and a few more options related to GPUs (see options below).
Compilation at NERSC
We provided a makefile (named makefile-nersc) for NERSC systems. The following commands will compile HipMCL on Perlmutter.
ln -s makefile-nersc Makefile make hipmcl
Compilation at Summit with GPU support
The necessary module files to compile with GPU support on Summit machine are loaded as follows:
module load cuda cmake gcc/8.1.1 boost
Getting started
Running HipMCL on a workstation:
Assume that HipMCL is downloaded to "HipMCL" directory. The complete sequence of commands needed to run an example:
- Compile HipMCL (see above for compilation instruction). The executable file is located in HipMCL/bin directory.
- Go to HipMCL/test/small_scale_examples directory. Command:
cd test/small_scale_examples
- In HipMCL/test/small_scale_examples directory, execute an example script. Command:
run: bash sevenvertex.sh
- The output of HipMCL will be written in a file named
sevenvertex.triples.hipmcl
in HipMCL/data directory.
To run your own dataset, create a copy of sevenvertex.sh (if your file is in labeled triple format; see below for input format options). Then change the following in the copied script:
1. HIPMCL_EXE: path to HipMCL executable relative your current location
2. IN_FILE: name and path to your input data file
3. OUT_FILE: name and path to your output data file
4. See below to learn how to set different input parameters. Only two options are mandatory (a) -M input_file_name
and (b) -I inflation_parameter
.
Running HipMCL on a NERSC system (Perlmutter):
Assume that HipMCL is downloaded to "HipMCL" directory. The complete sequence of commands needed to run an example:
- Compile HipMCL (see above for compilation instruction). The executable file is located in HipMCL/bin directory.
- Go to HipMCL/test/NERSC_examples directory. Command:
cd test/NERSC_examples
- In HipMCL/test/NERSC_examples directory, execute an example script.
- For Perlmutter CPU-only system use:
sbatch virus_nersc_perlmutter_cpu.sh
The output of HipMCL will be written in a file named vir_vs_vir_30_50length.indexed.triples.hipmcl
in HipMCL/data directory.
To run your own dataset, create a copy of virus_nersc_edison.sh (if your file is in labeled triple format; see below for input format options). Then change the following in the copied script:
1. HIPMCL_EXE: path to HipMCL executable relative your current location
2. IN_FILE: name and path to your input data file
3. OUT_FILE: name and path to your output data file
4. See below to learn how to set different input parameters. Only two options are mandatory (a) -M input_file_name
and (b) -I inflation_parameter
.
Several other examples are provided in the test directory (for both desktop and NERSC runs). The data directory has several small example networks. We uploaded several medium- and large-scale networks here: http://portal.nersc.gov/project/m1982/HipMCL/. These network can be used to test scripts in test/NERSC_examples folder. Note: HipMCL only support square process grid at this moment (a limitation imposed by the CombBLAS library).
Options
Options related to input/output files
-M <ifname> : input file name (mandatory)
The input graph is passed to HipMCL in a file named <ifname>. HipMCL currently supports two input formats: (a) labeled triples (default) (b) matrix market. These formats are briefly discussed below. See example at the bottom of this page for details.
Labeled triple format: Labeled triple format is the default choice in HipMCL. In this format, each line encodes an edge of the graph. An edge is represented by a triple (source_vertex, destination_vertex, edge_weight). Source and destination vertices are presented by string labels and edge weights are represented by floating point numbers. Three fields in a triple is separated by white space. We show an example for for a graph with seven vertices and 12 edges.
#!c++ vertex_1 vertex_2 0.34 vertex_1 vertex_4 1.50 vertex_2 vertex_5 0.67 vertex_2 vertex_7 1.41 vertex_3 vertex_6 2.15 vertex_4 vertex_1 0.55 vertex_4 vertex_3 0.87 vertex_5 vertex_6 1.75 vertex_6 vertex_3 1.4 vertex_7 vertex_3 0.75 vertex_7 vertex_4 0.25 vertex_7 vertex_5 1
Matrix market format: A matrix market file starts with a mandatory header that specifies the type of the matrix. The header is followed by several lines of optional comments . A comment begins with "%". After the optional comments, there is another mandatory data header containing three integers denoting the number of rows, columns and nonzero entries in the matrix. Since the input to HipMCL is a graph, the number of rows and columns are always equal, denoting the number of vertices in the network. The rest of the file lists edges of the graph, one line for each edge. Each edge is represented by three numbers: the source vertex, the destination vertex and the weight of the edge. In contrast to the labeled triples format, matrix market only allow integer vertex identifiers. This is useful when vertex labels are converted to numeric ids in a preprocessing step to reduce input file size. This format is also popular in scientific computing community. A large collection of graphs and sparse matrices are already available in the matrix market format, such as the The University of Florida Sparse Matrix Collection. The same graph shown above with labeled triples format is shown in matrix market format below:
#!c++ %%MatrixMarket matrix coordinate real general %comments 7 7 12 4 6 0.34 4 2 1.50 6 5 0.67 6 3 1.41 1 7 2.15 2 4 0.55 2 1 0.87 5 7 1.75 7 1 1.4 3 1 0.75 3 2 0.25 3 5 1
--matrix-market : if provided, the input file is in the matrix market format (default: the file is in labeled triples format)
-base <0|1> : index of the first vertex [optional, default: 1]
This option is only used with the matrix market format. Sets the start vertex index in the input file. By default, the vertices are numbered starting with 1.
-o <ofname> : output file name (optional, default: <ifname>.hipmcl)
The output clusters from HipMCL will be saved in a file named <ofname>. If this option is not supplied by an user, HipMCL will save the output in <ifname>.hipmcl file, where <ifname> is the name of the input file.
In the output file, each cluster is encoded by list of vertices belonging to the cluster and stored in a single line of the file. Vertex labels (or numeric ids in the matrix market format) are separated by white space.
Assume that HipMCL obtained two clusters from the graph described above. A possible output file will look like the following:
#!c++ vertex_2 vertex_3 vertex_5 vertex_6 vertex_7 vertex_1 vertex_4
In the current version of HipMCL, we do not sort clusters by the their sizes. In future releases, we will have the option to keep the clusters sorted by their sizes.
Options related to inflation
-I <num> : inflation parameter (mandatory)
Sets the main inflation value to <num>. This parameter controls the granularity of clusters. It is usually chosen somewhere in the range [1.2-5.0]. -I 5.0 will tend to result in fine-grained clusterings, and -I 1.2 will tend to result in very coarse grained clusterings.
Options related to pruning
-p <num> : cutoff for pruning (optional, default: 0.0001 )
Set the hard threshold for pruning. After squaring the matrix, entries that are smaller than the cutoff are removed, resulting in a columns with at most 1/cutoff entries.
-pct <pct> recovery percentage (optional, default: 90)
Sets the recovery percentage. See the description of the recovery number.
-R <int> : recovery number (optional, default: 1400)
Sets the recover number. Recovery is applied in a column if the the sum of all remaining entries is less than <pct>/100 and the number of remaining entries is less than the recovery number.
-S <int> : selection number (optional, default: 1100)
If recovery was not necessary and a pruned column has more than S entries, the selection step prunes the column further to at most S entries. It is possible that the recovery condition is satisfies again after a column goes through selection. In that case, recovery is performed again on this column. The latter situation will not occur if R <= S.
--preprune : if provided, apply prune/select/recovery before the first iteration (needed when dense columns are present) (default: don't preprune. However, if the average nonzero per column is larger than max{S,R}, prepruning is still applied by default)
Options related to preprocessing
-rand <0|1> : randomly permute vertices for load balance (optional, default: 0 (no permute))
This option asks HipMCL to randomly permute vertices for load balance. By default, HipMCL does not permute vertices.
--remove-isolated : if provided, remove isolated vertices (default: don't remove isolated vertices)
Options related to HipMCL optimizations
-phases <int> : number of phases (optional, default: 1)
Sets the number of phases used by HipMCL. In each phase, HipMCL expands 1/phase part of the matrix. Using more phases reduces the memory requirement of the expansion step; however, it increasing the running time. Hence, we recommend to set the -per-process-mem option so that HipMCL can dynamically compute the number of phases depending on the available memory. This option is ignored if -per-process-mem is supplied.
-per-process-mem <int> : available memory per process in GB (optional, default: 0GB)
This option is used to adaptively compute the number of phases. If not provided, the number of phases supplied in the parameter <phases> is used. We highly recommend to supply this option so that HipMCL can optimize the number of phases and avoid unnecessary communication. Assume that HipMCL is run on system with 64GB memory per node. The per process memory is calculated by diving 64 by number of processes per node. Examples:
- 4 MPI processes running in 4 nodes: -per-process-mem 64
- 4 MPI processes running in 1 nodes: -per-process-mem 16
- 16 MPI processes running in 8 nodes: -per-process-mem 32
-layers <int> : number of layers to use in matrix multiplication (optional, default: 1)
This option is to use 3D matrix multiplication routine. Number of layers l
needs to be chosen in such a way so that both p
and p/l
is a square number where p
is the total number of processes
Options helpful for debugging
--show : if provided, show matrices after major steps (optional, default: do not show matrices)
Options related to GPUs
-lspgemm <nsparse|rmerge|bhsparse|cpu|hybrid> : SpGEMM algorithm to run. nsparse, rmerge, bhpsarse are variants of algorithms for GPU and cpu is the algorithm for CPU. hybrid makes a selection between those according to certain criteria (default: hybrid).
--nrounds <int> : Number of samples to use in probabilistic memory estimation (default: 7).
Updated