To get all the tools and data, assuming a Unix-like environment, one should use the
git command to create a local repository:
git clone firstname.lastname@example.org:phyloviz/popsim-analysis.git
cd popsim-analysis
The popsim-analysis directory should contain the following files:
build.xml graphics-cumulative-seb.R lib/ mlst-datasets/ README.md src/
build.xml is the file that the Apache Ant tool reads for compilation instructions.
To compile all the code, simply run:
MLST datasets can be downloaded from public databases (e.g. mlst.net, pubmlst.org).
net.phyloviz.soap provides means to list the MLST datasets available in the public databases and to download the selected ones.
It downloads both the sequence type (ST) profiles and the allelic sequences from each housekeeping gene.
Besides the nine datasets already available in the mlst-datasets/ subdirectory, one can download additional ones.
To list the available ones, one can run the following command:
java -cp bin/.:lib/* net.phyloviz.soap.Download 0
Each line corresponds to a dataset and consists of an identifier, an acronym and a description. To download a particular dataset, one needs to specify the corresponding identifier and an output directory, as in the following example:
java -cp bin/.:lib/* net.phyloviz.soap.Download 120 ./mlst-datasets/
Tools for the analysis of MLST datasets
net.phyloviz.nlvgraph provides a way to compute a graph linking all STs sharing n-allelic differences. It also enables the computation of the compactness and clustering indexes.
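As an illustration of the idea behind the nLV graph, here is a minimal Python sketch that links STs by the number of loci at which their allelic profiles differ. It is not the code of net.phyloviz.nlvgraph: the function names, the in-memory profile format and the example profiles are hypothetical, and the choice to link pairs differing at between 1 and n loci (rather than exactly n) is an assumption.

```python
from itertools import combinations


def allelic_distance(p, q):
    """Number of loci at which two allelic profiles differ (Hamming distance)."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(p, q))


def nlv_edges(profiles, n):
    """Return (st_a, st_b, distance) for every ST pair differing at 1..n loci.

    Assumption: "nLV graph" is read here as "up to n allelic differences".
    """
    edges = []
    for (sa, pa), (sb, pb) in combinations(sorted(profiles.items()), 2):
        d = allelic_distance(pa, pb)
        if 1 <= d <= n:
            edges.append((sa, sb, d))
    return edges


# Hypothetical 7-locus profiles, keyed by ST identifier.
profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (1, 1, 1, 1, 1, 1, 2),  # differs from ST 1 at one locus
    3: (1, 1, 1, 1, 1, 3, 2),  # differs from ST 1 at two loci, from ST 2 at one
    4: (5, 6, 7, 8, 1, 1, 1),  # differs from ST 1 at four loci
}

print(nlv_edges(profiles, 1))  # [(1, 2, 1), (2, 3, 1)]
print(nlv_edges(profiles, 2))  # [(1, 2, 1), (1, 3, 2), (2, 3, 1)]
```

The pairwise comparison is quadratic in the number of STs, which is acceptable for a sketch but is one reason a dedicated tool is preferable for full datasets.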
net.phyloviz.msn provides a way to compute the number of possible minimum spanning trees (MSTs) contained in a given nLV graph and associated statistics.
The tool needs some parameters, namely the level or the number of allelic differences between STs, the size of the thread pool for parallel computation, the BURST rule level to be considered, and the file with profiles.
For example, to count the number of trees between STs differing at a single locus, going up to the STID BURST rule level, one can use the following command:
java -cp bin/.:lib/*:lib/netlib/* net.phyloviz.msn.TreeStats 1 4 5 mlst-datasets/efaecium/efaecium.st.csv
and to count the number of trees between STs connected up to two allelic differences, going up to the TLV BURST rule level, one can use the following command:
java -cp bin/.:lib/*:lib/netlib/* net.phyloviz.msn.TreeStats 2 4 3 mlst-datasets/efaecium/efaecium.st.csv
Counting Trees for generic networks
To count the number of trees and compute edge statistics (SEB) for a generic network, you only need a file with the edge list of the network and the maximum node identifier. For example, for a network "simplenet" with 5 nodes, you could run:
java -cp bin/.:lib/*:lib/netlib/* net.phyloviz.msn.TreeStatsGeneric simplenet 5
Here we assume that node identifiers start at 1. If they start at 0, you should replace the 5 with 4.
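To make the two quantities concrete, the following Python sketch counts the spanning trees of a small unweighted network with Kirchhoff's matrix-tree theorem and derives, for each edge, the fraction of spanning trees that contain it. It is an illustration only and does not reproduce TreeStatsGeneric: the function names and the example network are hypothetical, the edge list is kept in memory rather than read from a file, and reading SEB as "fraction of spanning trees containing the edge" assumes an unweighted, connected, simple graph with node identifiers starting at 1.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def count_spanning_trees(edges, max_node_id):
    """Matrix-tree theorem: any cofactor of the graph Laplacian."""
    n = max_node_id  # nodes are assumed to be numbered 1..max_node_id
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u - 1][u - 1] += 1
        lap[v - 1][v - 1] += 1
        lap[u - 1][v - 1] -= 1
        lap[v - 1][u - 1] -= 1
    minor = [row[1:] for row in lap[1:]]  # drop first row and column
    return int(determinant(minor))


def spanning_edge_betweenness(edges, max_node_id):
    """Fraction of spanning trees containing each edge (graph must be connected)."""
    total = count_spanning_trees(edges, max_node_id)
    seb = {}
    for i, edge in enumerate(edges):
        without = edges[:i] + edges[i + 1:]
        trees_without = count_spanning_trees(without, max_node_id)
        seb[edge] = Fraction(total - trees_without, total)
    return seb


# Hypothetical 5-node network: a triangle (1, 2, 3) with a tail 3-4-5.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
print(count_spanning_trees(edges, 5))  # 3
seb = spanning_edge_betweenness(edges, 5)
print(seb[(1, 2)], seb[(3, 4)])  # 2/3 1
```

Each triangle edge appears in two of the three spanning trees, while the two tail edges are bridges and appear in all of them. Exact rational arithmetic is used so that the counts are not affected by floating-point rounding.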
Plotting the results
graphics-cumulative-seb.R uses the result files from the TreeStats program and plots the cumulative distribution of the Spanning Edge Betweenness (SEB) of all edges, in the forest of all CCs, calculated at the SLV level by the goeBURST algorithm.
Inside the script, one can specify the directory containing the datasets (the topdirectory variable) and the extension of the file containing the edge statistics for each dataset.
It produces one plot per dataset, each in its own PDF file.
An additional PDF file, AllCCsEdges-AllDatasets-cumulative.pdf, is also generated, containing an overlay of the cumulative distributions of the Spanning Edge Betweenness for all datasets.
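For readers unfamiliar with cumulative distributions, the sketch below computes the empirical cumulative distribution of a list of SEB values, which is the kind of curve the plots show. It is written in Python for illustration and is not a translation of graphics-cumulative-seb.R; the function name and the SEB values are hypothetical, and the format of the TreeStats result files is not modelled.

```python
def ecdf(values):
    """Return (value, fraction of observations <= value) for each distinct value."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered, start=1):
        if points and points[-1][0] == v:
            points[-1] = (v, i / n)  # repeated value: keep the highest fraction
        else:
            points.append((v, i / n))
    return points


# Hypothetical SEB values, one per edge.
seb_values = [1.0, 0.5, 0.5, 0.25, 1.0, 0.75, 0.25, 1.0]

for value, fraction in ecdf(seb_values):
    print(f"SEB <= {value:.2f}: {fraction:.3f}")
```

A curve that rises late, close to SEB = 1, indicates that most edges appear in nearly all trees.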