Wiki

Clone wiki

HipMCL / HipMCL manual

Overview

HipMCL is a high-performance parallel algorithm for large-scale network clustering. HipMCL parallelizes popular Markov Cluster (MCL) algorithm that has been shown to be one of the most successful and widely used algorithms for network clustering. It is based on random walks and was initially designed to detect families in protein-protein interaction networks. Despite MCL’s efficiency and multi-threading support, scalability remains a bottleneck as it fails to process networks of several hundred million nodes and billion edges in an affordable running time. HipMCL overcomes all of these challenges by developing massively-parallel algorithms for all components of MCL. HipMCL can be x1000 times faster than the original MCL without any information loss. It can easily cluster a network of ~75 million nodes with ~68 billion edges in ~2.4 hours using ~2000 nodes of Cori supercomputer at NERSC. HipMCL is developed in C++ language and uses standard OpenMP and MPI libraries for shared- and distributed-memory parallelization.

Developers

HipMCL is actively developed by Ariful Azad and Aydın Buluç from Lawrence Berkeley National Laboratory (LBNL). We closely collaborate with scientists from the Computational Research Division, Joint Genome Institute and NERSC at LBNL.

Citation

A paper describing HipMCL is under review for a journal publication at this time. Please contact us for updated citation.

Compilation

To compile HipMCL on a workstation, all you need to do is to execute the following two commands, in the given order, inside the HipMCL directory:

  cmake .
  make

The executable files are stored in the bin folder. You can run simple tests using the following command from the main directory:

  ctest -V 

Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the main directory. Those sample makefiles have the following format: makefile-machine. (example: makefile-mac)

Getting started

Options


-M <ifname> : input file name (mandatory)

The input graph is passed to HipMCL in a file named <ifname>. HipMCL currently supports two input formats: (a) labeled triples (default) (b) matrix market. These formats are briefly discussed below. See example at the bottom of this page for details.

Labeled triple format: Labeled triple format is the default choice in HipMCL. In this format, each line encodes an edge of the graph. An edge is represented by a triple (source_vertex, destination_vertex, edge_weight). Source and destination vertices are presented by string labels and edge weights are represented by floating point numbers. Three fields in a triple is separated by white space. We show an example for for a graph with seven vertices and 12 edges.

vertex_1    vertex_2    0.34
vertex_1    vertex_4    1.50
vertex_2    vertex_5    0.67
vertex_2    vertex_7    1.41
vertex_3    vertex_6    2.15
vertex_4    vertex_1    0.55
vertex_4    vertex_3    0.87
vertex_5    vertex_6    1.75
vertex_6    vertex_3    1.4
vertex_7    vertex_3    0.75
vertex_7    vertex_4    0.25
vertex_7    vertex_5    1

Matrix market format: A matrix market file starts with a mandatory header that specifies the type of the matrix. The header is followed by several lines of optional comments . A comment begins with "%". After the optional comments, there is another mandatory data header containing three integers denoting the number of rows, columns and nonzero entries in the matrix. Since the input to HipMCL is a graph, the number of rows and columns are always equal, denoting the number of vertices in the network. The rest of the file lists edges of the graph, one line for each edge. Each edge is represented by three numbers: the source vertex, the destination vertex and the weight of the edge. In contrast to the labeled triples format, matrix market only allow integer vertex identifiers. This is useful when vertex labels are converted to numeric ids in a preprocessing step to reduce input file size. This format is also popular in scientific computing community. A large collection of graphs and sparse matrices are already available in the matrix market format, such as the The University of Florida Sparse Matrix Collection. The same graph shown above with labeled triples format is shown in matrix market format below:

%%MatrixMarket matrix coordinate real general
%comments
7   7   12
4   6   0.34
4   2   1.50
6   5   0.67
6   3   1.41
1   7   2.15
2   4   0.55
2   1   0.87
5   7   1.75
7   1   1.4
3   1   0.75
3   2   0.25
3   5   1
Here, the header denotes that the file contains a symmetric matrix where the nonzero values are stored in floating point numbers. See the matrix market website for further detail.

--matrix-market : if provided, the input file is in the matrix market format (default: the file is in labeled triples format)

-base <0|1> : index of the first vertex [optional, default: 1]

This option is only used with the matrix market format. Sets the start vertex index in the input file. By default, the vertices are numbered starting with 0.

-o <ofname> : output file name (optional, default: <ifname>.hipmcl)

The output clusters from HipMCL will be saved in a file named <ofname>. If this option is not supplied by an user, HipMCL will save the output in <ifname>.hipmcl file, where <ifname> is the name of the input file.

In the output file, each cluster is encoded by list of vertices belonging to the cluster and stored in a single line of the file. Vertex labels (or numeric ids in the matrix market format) are separated by white space.

Assume that HipMCL obtained two clusters from the graph described above. A possible output file will look like the following:

vertex_2 vertex_3 vertex_5 vertex_6 vertex_7 
vertex_1 vertex_4 

In the current version of HipMCL, we do not sort clusters by the their sizes. In future releases, we will have the option to keep the clusters sorted by their sizes.


-I <num> : inflation parameter (mandatory)

Sets the main inflation value to <num>. This parameter controls the granularity of clusters. It is usually chosen somewhere in the range [1.2-5.0]. -I 5.0 will tend to result in fine-grained clusterings, and -I 1.2 will tend to result in very coarse grained clusterings.


-p <num> : cutoff for pruning (optional, default: 1/10000)

Set the hard threshold for pruning. After squaring the matrix, entries that are smaller than the cutoff are removed, resulting in a columns with at most 1/cutoff entries.

-pct <pct> recovery percentage (optional, default: 90)

Sets the recovery percentage. See the description of the recovery number.

-R <int> : recovery number (optional, default: 1400)

Sets the recover number. Recovery is applied in a column if the the sum of all remaining entries is less than <pct>/100 and the number of remaining entries is less than the recovery number.

-S <int> : selection number (optional, default: 1100)

If recovery was not necessary and a pruned column has more than S entries, the selection step prunes the column further to at most S entries. It is possible that the recovery condition is satisfies again after a column goes through selection. In that case, recovery is performed again on this column. The latter situation will not occur if R <= S.


-rand <0|1> : randomly permute vertices for load balance (optional, default: 0 (no permute))

This option asks HipMCL to randomly permute vertices for load balance. By default, HipMCL does not permute vertices.

--remove-isolated : if provided, remove isolated vertices (default: don't remove isolated vertices)


-phases <int> : number of phases (optional, default: 1)

Sets the number of phases used by HipMCL. In each phase, HipMCL expands 1/phase part of the matrix. Using more phases reduces the memory requirement of the expansion step; however, it increasing the running time. Hence, we recommend to set the -per-process-mem option so that HipMCL can dynamically compute the number of phases depending on the available memory. This option is ignored if -per-process-mem is supplied.

-per-process-mem <int> : available memory per process in GB (optional, default: 0GB)

This option is used to adaptively compute the number of phases. If not provided, the number of phases supplied in the parameter <phases> is used. We highly recommend to supply this option so that HipMCL can optimize the number of phases and avoid unnecessary communication. Assume that HipMCL is run on system with 64GB memory per node. The per process memory is calculated by diving 64 by number of processes per node. Examples:

  • 4 MPI processes running in 4 nodes: -per-process-mem 64
  • 4 MPI processes running in 1 nodes: -per-process-mem 16
  • 16 MPI processes running in 8 nodes: -per-process-mem 32

Options helpful for debugging


--show : if provided, show matrices after major steps (optional, default: do not show matrices)

Examples


A simple graph with seven vertices

At first we consider a simple graph with seven vertices and twelve edges as shown in the input/output section. This graph in both labeled triple format (sevenvertexgraph.txt) and matrix market format (sevenvertex.mtx) are provided in the data directory.

A bare-bone HipMCL run could be:

./hipmcl -M sevenvertexgraph.txt -I 2 -o sevenvertex.hipmcl
Output of this run is shown below:
vertex_2 vertex_3 vertex_5 vertex_6 vertex_7 
vertex_1 vertex_4 

The same file in matrix market format can be run as follows:

./hipmcl -M sevenvertex.mtx --matrix-market -I 2 -o sevenvertex.hipmcl

Output of this run is shown below:

1 3 5 6 7 
2 4  

Example with moderate-size networks

Example to cluster a network named graph1.mtx on the NERSC/Edison system with 16 nodes and 24 threads per node.

srun -N 16 -n 16 -c 24  ./hipmcl -M graph1.mtx -per-process-mem 64 -o graph1.hipmcl

Example with bigger networks

Updated