biosubg: Subgraph extraction in directed networks using Monte Carlo simulations

biosubg is a Java implementation of MCWalk, an algorithm based on directed random walks using Monte Carlo simulations. It links sets of predefined source and target nodes, quantifies their relation, and extracts relevant sub-networks. The program, compiled with Oracle Java 8, is provided as a .jar file (biosubg.jar).

Usage: java -jar biosubg.jar <params_file>

where <params_file> is a file that specifies all simulation parameters, as well as the input and output files. If no <params_file> is given, simparams.txt is used by default; if that file cannot be found either, the program halts.

A sample simparams.txt file is provided in the repository. Each parameter is explained in the file itself, in comments introduced by "#" (the parser treats anything after this character as a comment).
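As an illustration of the comment rule, the following minimal sketch strips "#" comments and collects the remaining settings. It is not biosubg's own parser: the class name ParamsSketch, the method parse, and in particular the KEY=VALUE separator are assumptions made for this example, so check the sample simparams.txt for the actual layout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParamsSketch {
    // Hypothetical helper: drops everything after "#" on each line and
    // reads the rest as KEY=VALUE. The "=" separator is an assumption.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String content = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (content.isEmpty()) {
                continue; // blank line or comment-only line
            }
            int eq = content.indexOf('=');
            if (eq < 0) {
                continue; // not a KEY=VALUE line under this assumption
            }
            String key = content.substring(0, eq).trim();
            String value = content.substring(eq + 1).trim();
            params.put(key, value);
        }
        return params;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "# total number of runs",
            "RUNS=1000   # trailing comment",
            "STEPS=50");
        System.out.println(parse(sample));
    }
}
```

To read a real file, the lines could be loaded with java.nio.file.Files.readAllLines before being passed to parse.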

The parameters set in the parameter file are described below:

  • RUNS: Total number of runs
  • STEPS: Maximum number of steps per run
  • THRES_MIN: Minimum node threshold for subgraph output
  • THRES_MAX: Maximum node threshold for subgraph output
  • THRES_STEP: Step by which the threshold value is increased; leave default=0 for 10 steps per order of magnitude
  • MAX_PATHS: Discover k-shortest paths (EXPERIMENTAL!); leave default=0 to ignore
  • MAX_HOPS: Discover k-shortest paths (EXPERIMENTAL!); leave default=0 to ignore
  • MARK_SHORTEST: Indicates whether the shortest paths will be discovered
  • PRUNE_AFTER: Indicates whether the extracted subgraphs should be pruned in the final step
  • NET_FILE: Network file to be analysed in .sif format (nodeA <relationship type> nodeB)
  • SOURCE_FILE: List of source nodes
  • TARGET_FILE: List of target nodes; if it is the same as the list of source nodes, all nodes are treated as seeds
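
To give a feel for how RUNS and STEPS could bound a Monte Carlo simulation of directed random walks, here is a minimal sketch. It is not the MCWalk implementation shipped in biosubg.jar: the class WalkSketch, the method visitCounts, the plain adjacency map (instead of JUNG or JGraphT), and the rule of counting only walks that reach a target are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class WalkSketch {
    // Runs `runs` random walks of at most `maxSteps` steps along directed edges.
    // Every node visit on a walk that reaches a target is counted (assumed scoring).
    static Map<String, Integer> visitCounts(Map<String, List<String>> adj,
                                            List<String> sources, Set<String> targets,
                                            int runs, int maxSteps, long seed) {
        Random rng = new Random(seed);
        Map<String, Integer> counts = new HashMap<>();
        for (int r = 0; r < runs; r++) {
            // Start each run at a uniformly chosen source node.
            String node = sources.get(rng.nextInt(sources.size()));
            List<String> path = new ArrayList<>();
            path.add(node);
            boolean reached = false;
            for (int s = 0; s < maxSteps; s++) {
                List<String> next = adj.getOrDefault(node, Collections.<String>emptyList());
                if (next.isEmpty()) {
                    break; // dead end: no outgoing edges
                }
                node = next.get(rng.nextInt(next.size()));
                path.add(node);
                if (targets.contains(node)) {
                    reached = true;
                    break;
                }
            }
            if (reached) {
                for (String n : path) {
                    counts.merge(n, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> adj = new HashMap<>();
        adj.put("A", Arrays.asList("B", "D")); // D is a dead end
        adj.put("B", Arrays.asList("C"));
        Set<String> targets = new HashSet<>(Arrays.asList("C"));
        System.out.println(visitCounts(adj, Arrays.asList("A"), targets, 1000, 10, 42L));
    }
}
```

In this toy setting, nodes whose count falls below a chosen threshold would be left out of the extracted subgraph, which is the role the THRES_* parameters play conceptually.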

Dependency matrix output files sorted by gene expression and by traversal frequency are marked with the exp and freq suffixes, respectively. Subgraphs are written to the networks subdirectory. Depending on whether the traversal score of edges or of vertices is used to construct the network, output files are marked with an e or v suffix, respectively.

Preparation of input files: biosubg accepts as input:

  1. A network file to be analysed in .sif format. Each line in the SIF file specifies a source node, a relationship type (or edge type) and one or more target nodes, e.g. nodeA <relationship type> nodeB. For a more detailed description of the .sif format, please refer to the [Cytoscape User Manual](http://wiki.cytoscape.org/Cytoscape_User_Manual/Network_Formats).
  2. A text file with a list of source nodes.
  3. A text file with a list of target nodes. Note that the list of target nodes may be the same as the list of source nodes, in which case the nodes will be treated as seeds.
  4. A text file in which the input and output files, as well as the simulation parameters, are specified. This file is given on the command line; otherwise the default name simparams.txt is used (the file is described above).
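
As a quick way to check a network file before running biosubg, the SIF line layout from item 1 (source node, relationship type, one or more target nodes) can be expanded into directed edges as sketched below. This is an illustrative helper, not code from biosubg: the class SifSketch and the method parseLine are hypothetical, and the sketch assumes whitespace-separated fields with no spaces inside node names.

```java
import java.util.ArrayList;
import java.util.List;

public class SifSketch {
    // Expands one SIF line into edges of the form {source, relation, target}.
    // A line with fewer than three fields (e.g. an isolated node) yields no edges.
    static List<String[]> parseLine(String line) {
        List<String[]> edges = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return edges;
        }
        String[] tok = trimmed.split("\\s+"); // assumption: no spaces in names
        for (int i = 2; i < tok.length; i++) {
            edges.add(new String[] {tok[0], tok[1], tok[i]});
        }
        return edges;
    }

    public static void main(String[] args) {
        // "pp" is only an example relationship type.
        for (String[] e : parseLine("nodeA pp nodeB nodeC")) {
            System.out.println(e[0] + " -[" + e[1] + "]-> " + e[2]);
        }
    }
}
```

A line with several targets therefore contributes one directed edge per target, all sharing the same source and relationship type.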

biosubg uses the Jung and JGraphT graph libraries and is available under the MIT license.