README

To get all the tools and data, assuming a Unix-like environment, use the git command to create a local repository:

git clone git@bitbucket.org:phyloviz/popsim-analysis.git
cd popsim-analysis

The popsim-analysis directory should contain the following files:

build.xml  graphics-cumulative-seb.R  lib/  mlst-datasets/  README.md  src/

Note the build.xml file, which the Apache Ant tool reads for compilation instructions. To compile all the code, simply run:

ant

MLST datasets

MLST datasets can be downloaded from public databases (e.g. mlst.net, pubmlst.org). The package net.phyloviz.soap provides the means to list the MLST datasets available in the public databases and to download selected ones. It downloads both the sequence type (ST) profiles and the allelic sequences for each housekeeping gene.

Besides the nine datasets already available in the mlst-datasets/ subdirectory, one can download additional ones. To list the available ones, one can run the following command:

java -cp "bin/.:lib/*" net.phyloviz.soap.Download 0

Each line corresponds to a dataset and consists of an identifier, an acronym and a description. To download a particular dataset, one needs to specify the corresponding identifier and an output directory, as in the following example:

java -cp "bin/.:lib/*" net.phyloviz.soap.Download 120 ./mlst-datasets/

Tools for the analysis of MLST datasets

The package net.phyloviz.nlvgraph provides a way to compute the nLV graph, which links all STs that differ in up to n alleles. It also enables the computation of the compactness and clustering indexes.
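
As a rough illustration of what linking STs by allelic differences means, the sketch below counts the loci at which two profiles differ and keeps every pair that differs at between 1 and n loci. It is not the net.phyloviz.nlvgraph code: the class NlvGraphSketch, its method names and the sample profiles are all hypothetical, and treating the level as an upper bound ("up to n") is an assumption based on the TreeStats examples in this README.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the net.phyloviz.nlvgraph implementation. */
final class NlvGraphSketch {

    /** Number of loci at which two allelic profiles differ (Hamming distance). */
    static int allelicDifferences(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("profiles must have the same number of loci");
        }
        int diff = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                diff++;
            }
        }
        return diff;
    }

    /** Returns one edge {i, j, d} (row indices i < j) for every pair of profiles with 1 <= d <= n. */
    static List<int[]> nlvEdges(int[][] profiles, int n) {
        List<int[]> edges = new ArrayList<>();
        for (int i = 0; i < profiles.length; i++) {
            for (int j = i + 1; j < profiles.length; j++) {
                int d = allelicDifferences(profiles[i], profiles[j]);
                if (d >= 1 && d <= n) {
                    edges.add(new int[] {i, j, d});
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Hypothetical 7-locus profiles, invented for illustration.
        int[][] profiles = {
            {1, 1, 1, 1, 1, 1, 1},
            {1, 1, 1, 1, 1, 1, 2},
            {1, 1, 1, 1, 1, 3, 2},
            {5, 5, 5, 5, 5, 5, 5}
        };
        for (int[] e : nlvEdges(profiles, 1)) {
            System.out.println("profile " + e[0] + " - profile " + e[1]
                    + " (" + e[2] + " allelic difference(s))");
        }
    }
}
```

The pairwise comparison is quadratic in the number of STs, which is acceptable for a sketch but may not reflect how the package itself builds the graph.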

The package net.phyloviz.msn provides a way to compute the number of possible minimum spanning trees (MSTs) contained in a given nLV graph and associated statistics.

Counting Trees

The TreeStats tool takes four parameters, in this order: the level (the number of allelic differences allowed between linked STs), the size of the thread pool for parallel computation, the BURST rule level to be considered, and the file with the profiles.

For example, to count the number of trees between STs differing at a single locus, going up to the STID BURST rule level, one can use the following command:

java -cp "bin/.:lib/*:lib/netlib/*" net.phyloviz.msn.TreeStats 1 4 5 mlst-datasets/efaecium/efaecium.st.csv

To count the number of trees between STs connected by up to two allelic differences, going up to the TLV BURST rule level, one can use the following command:

java -cp "bin/.:lib/*:lib/netlib/*" net.phyloviz.msn.TreeStats 2 4 3 mlst-datasets/efaecium/efaecium.st.csv

Counting Trees for generic networks

To count the number of trees and compute the edge statistics (SEB) for a generic network, you only need a file with the edge list of the network and the maximum node identifier. For example, for a network "simplenet" with 5 nodes, you could run:

java -cp "bin/.:lib/*:lib/netlib/*" net.phyloviz.msn.TreeStatsGeneric simplenet 5

Here we assume that node identifiers start at 1. If they start at 0, replace the 5 with 4.
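
For intuition about what is being counted, one standard way to count the spanning trees of an unweighted network is Kirchhoff's matrix-tree theorem: the count equals the determinant of the graph Laplacian with one row and one column removed. The sketch below applies it to a small edge list. It is not the TreeStatsGeneric implementation, and the "u v" per line edge list format, the class SpanningTreeCountSketch and the sample ring network are assumptions made for illustration. It also shows how the maximum node identifier can be read off the edge list, assuming identifiers start at 1.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the net.phyloviz.msn.TreeStatsGeneric implementation. */
final class SpanningTreeCountSketch {

    /** Parses one "u v" line (assumed format); returns null for blank lines. */
    static int[] parseEdge(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] t = trimmed.split("\\s+");
        return new int[] {Integer.parseInt(t[0]), Integer.parseInt(t[1])};
    }

    /** Largest node identifier that appears in the edge list. */
    static int maxNodeId(List<String> edgeLines) {
        int max = 0;
        for (String line : edgeLines) {
            int[] e = parseEdge(line);
            if (e != null) {
                max = Math.max(max, Math.max(e[0], e[1]));
            }
        }
        return max;
    }

    /** Counts spanning trees of the undirected graph on nodes 1..maxId (matrix-tree theorem). */
    static double countSpanningTrees(List<String> edgeLines, int maxId) {
        double[][] lap = new double[maxId][maxId];
        for (String line : edgeLines) {
            int[] e = parseEdge(line);
            if (e == null || e[0] == e[1]) {
                continue; // skip blank lines and self-loops
            }
            int u = e[0] - 1;
            int v = e[1] - 1;
            lap[u][u] += 1;
            lap[v][v] += 1;
            lap[u][v] -= 1;
            lap[v][u] -= 1;
        }
        int m = maxId - 1; // ignore the last row and column of the Laplacian
        double det = 1.0;
        for (int col = 0; col < m; col++) {
            int pivot = col; // partial pivoting for numerical stability
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(lap[r][col]) > Math.abs(lap[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(lap[pivot][col]) < 1e-12) {
                return 0.0; // singular minor: the graph is disconnected
            }
            if (pivot != col) {
                double[] tmp = lap[pivot];
                lap[pivot] = lap[col];
                lap[col] = tmp;
                det = -det;
            }
            det *= lap[col][col];
            for (int r = col + 1; r < m; r++) {
                double f = lap[r][col] / lap[col][col];
                for (int c = col; c < m; c++) {
                    lap[r][c] -= f * lap[col][c];
                }
            }
        }
        return det;
    }

    public static void main(String[] args) {
        // Hypothetical 5-node ring; a ring with k nodes has k spanning trees.
        List<String> lines = Arrays.asList("1 2", "2 3", "3 4", "4 5", "5 1");
        int maxId = maxNodeId(lines);
        System.out.println("maximum node identifier: " + maxId);
        System.out.println("spanning trees: " + Math.round(countSpanningTrees(lines, maxId)));
    }
}
```

Plain double arithmetic loses exactness quickly as networks grow, so anything beyond a toy example would need logarithms or exact arithmetic.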

Plotting the results

The script graphics-cumulative-seb.R uses the result files from the TreeStats program and plots the cumulative distribution of the Spanning Edge Betweenness (SEB) of all edges, in the forest of all CCs, calculated at the SLV level by the goeBURST algorithm.

Inside the script, one can specify the directory that contains the datasets (topdirectory variable) and the extension of the files containing the edge statistics for each dataset (extension variable).

It produces one plot per dataset, each in its own PDF file AllCCsEdges-cumulative-normalized-<dataset>.pdf. An additional PDF file, AllCCsEdges-AllDatasets-cumulative.pdf, is also generated, containing an overlay of the cumulative SEB distributions of all datasets.
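
To make the notion of a cumulative distribution concrete, the sketch below computes, for a list of per-edge values, the fraction of edges whose value is at most a given threshold. It is only an illustration, written in Java rather than R: the class CumulativeSebSketch and the sample values are hypothetical, and the exact normalization applied by graphics-cumulative-seb.R is not described here, so this should not be read as a port of the script.

```java
import java.util.Arrays;

/** Illustrative sketch only; not a port of graphics-cumulative-seb.R. */
final class CumulativeSebSketch {

    /** Fraction of the values that are less than or equal to x (empirical CDF at x). */
    static double fractionAtMost(double[] values, double x) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        int count = 0;
        for (double v : values) {
            if (v <= x) {
                count++;
            }
        }
        return count / (double) values.length;
    }

    public static void main(String[] args) {
        // Hypothetical per-edge values in [0, 1], invented for illustration.
        double[] seb = {0.2, 1.0, 0.5, 1.0};
        // Evaluate the cumulative fraction at each distinct value, in increasing order.
        double[] thresholds = Arrays.stream(seb).distinct().sorted().toArray();
        for (double x : thresholds) {
            System.out.println(x + "\t" + fractionAtMost(seb, x));
        }
    }
}
```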