1. Mark Howison
  2. gabi


gabi / phix-test /

Filename Size Date modified Message
600 B
901 B
473 B
474 B
474 B
333 B
492 B
485 B
483 B
483 B
483 B
545 B
112 B
3.4 KB
5.4 KB
2.7 KB
160.1 KB
167.2 KB

This directory contains scripts to regenerate analyses in the paper:

Howison M, Zapata F, Edwards EJ, Dunn CW. (2013) Bayesian genome assembly and assessment by Markov Chain Monte Carlo sampling. arXiv:1308.1388

To run these scripts, you will need to have installed GABI and the following software (note: matching the version number may not be critical, but we put them here to record what we have successfully tested):

If you would like to regenerate the 2000 read-pair subset, run the 00-subset-data.sh script. This will download a very large dataset from Illumina (two 3.5GB files) and run it through the filter_illumina tool from BioLite. Alternatively, you can skip this step and use the subset.*.fq.gz files that we have alread generated and included in this directory. It will also download the reference assembly NC_001422 from the NCBI Nucleotide Database (also included in the repo as phix-reference-NC_001422.fa).

The 01-assemble.sh script will constuct the assembly graph and run Velvet and SGA on the subset. The assembly graph output should have two clusters (cluster0.graphml and cluster1.graphml) which are reverse complements of each other, and should each have 110 nodes and 148 edges. The script will also generate figures (compare?.pdf) showing which edges in the assembly graph are also present in the Velvet and SGA assemblies and the NCBI reference sequence.

The 02[a-c]-sample.sh scripts are the most computationally intensive and run sample from three independent chains. They includes directives for the SLURM cluster management system. You may need to adapt these to your own cluster environment. By default, we use cluster0.graphml from above and run 20,000 sample iterations, with an estimated runtime of ~7 hours and peak memory usage of 10GB for each chain, each running on an 8-core Intel Xeon E5540 (2.53Ghz) node. If you want to run a quick test, you could reduce the iterations to 1,000, although this will probably fail to converge. On our cluster, we would launch the three scipts with:

sbatch 02a-sample.sh && sleep 60
sbatch 02b-sample.sh
sbatch 02c-sample.sh

Once the sampling has finished, use the 03-report.sh script to generate a report in the directory report-X where X is the current time. You can compare this report against the sample one provided in the report-1399918695 directory. Figures 3 and 5a-b in the paper are from this report.

The remaining scripts 04 through 08 run a study of different priors, and are also computationally intensive, with an estimated runtime of ~7 hours and peak memory usage of 10GB for each run. We would submit these on our cluster with:

sbatch 04-sample-nodata.sh
sbatch 05-sample-flatprior.sh
sbatch 06-sample-scale5.sh
sbatch 07-sample-scale3.sh
sbatch 08-sample-scale1.sh

Once this sampling has finished, use the 09-priors-plot.sh script to generate Figure 4 in the paper.

Seed selection

The seeds used in the test scripts were chosen at random using the same logic as the random.seed() function:

long(random._hexlify(random._urandom(16)), 16)