This tutorial has two sections. The first describes an assembly run of the Agalma transcriptome pipeline using a sample read data set. The second describes a run of the phylogeny pipeline on a set of sample transcriptomes.
You can create a shell script that has just the commands you need to execute
from this document with the command:

    grep '^    ' TUTORIAL.md | sed -e 's/^    //' > tutorial.sh
Let's start by creating an organized directory hierarchy in your home directory for storing Agalma data:

    mkdir -p agalma/analyses
    mkdir -p agalma/data
    mkdir -p agalma/scratch
We'll also specify a path for a local BioLite database to store the catalog of data and the diagnostics from our runs:
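For example, a minimal sketch of such a setting, assuming your BioLite version accepts a `database` key in the `BIOLITE_RESOURCES` variable (the same variable used for resource limits later in this tutorial; check your `biolite.cfg` for the exact key name):

```shell
# Assumption: BioLite reads the catalog/diagnostics database path from a
# 'database' key in BIOLITE_RESOURCES; adjust to match your installation.
export BIOLITE_RESOURCES="database=$HOME/agalma/biolite.sqlite"
```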
Agalma distinguishes between three types of storage locations:
- a permanent location for raw data ("data")
- a permanent location for the final output from analyses ("analyses")
- a temporary location for the intermediate files from analyses ("scratch")
In a production system, you would want to make sure that "analyses" and "data" are backed up, but "scratch" doesn't need to be backed up because its files are typically only used over the course of a single analysis and can be regenerated.
The 'testdata' command will unpack all of the sample data sets used in the tutorials:
- 4 subsets of paired-end Illumina HiSeq reads from the siphonophore species: Agalma elegans, Craseoa lathetica, Physalia physalis, Nanomia bijuga
- 2 subsets of single-end Illumina GAIIx reads from two different specimens of the species Nanomia bijuga
- 6 transcriptome subsets, from Agalma assemblies of these four siphonophore species, plus an extra one for Nanomia bijuga and a public/external one for the species Nematostella vectensis
Let's run this command to download the sample data to the "data" directory:

    cd agalma/data
    agalma testdata
You should now see the test data files in the "data" directory.
Now we'll create a catalog entry in the BioLite database for the Agalma elegans data set:

    agalma catalog insert --paths SRX288285_1.fq SRX288285_2.fq --species "Agalma elegans" --ncbi_id 316166 --itis_id 51383
The catalog command automatically recognizes the Illumina CASAVA 1.8 header in
the data and assigns a catalog ID (in this case `HWI-ST625-73-C0JUVACXX-7`,
which we will use below).
Before running a pipeline, you may want to adjust the concurrency and memory
to match your computer. System performance may lag if you allocate the
full available memory to Agalma, so you should leave at least 2 GB unallocated.
If, for example, you have an 8-core machine with 24 GB of free memory, you
could allocate all the processors and 20 GB of the memory with:

    export BIOLITE_RESOURCES="threads=8,memory=20G"

You can run this small 25K read data set with fewer resources, for instance
using only 2 threads and 4 GB of memory.
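Following the same pattern, a reduced setting for this tutorial might look like the line below (the exact values are only a suggestion):

```shell
# Modest allocation suitable for the small tutorial data set; note that this
# overwrites any keys previously set in BIOLITE_RESOURCES.
export BIOLITE_RESOURCES="threads=2,memory=4G"
```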
Now run the full transcriptome pipeline on the test data from the scratch directory, and specify 'analyses' as the permanent output directory, with:

    cd ../scratch
    agalma transcriptome --id HWI-ST625-73-C0JUVACXX-7 --outdir ../analyses
To generate reports for the transcriptome run, use:

    cd ..
    agalma report --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7
    agalma resources --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7
You will now have a report with diagnostics for all of the transcriptome stages run above located at:
Another report with detailed resource utilization will be located at:
This analysis builds a species tree for 5 taxa using small transcriptome subsets, and requires roughly 7 GB of RAM and 10 minutes of runtime on an 8-core Intel Xeon E5540 node.
Let's define a default value for the output directory so that you don't have
to keep retyping the `--outdir` parameter, and also a path for the Agalma
database that will store the sequences and homology information:
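A sketch of these two settings, assuming output defaults are taken from an `outdir` key in `BIOLITE_RESOURCES` and the Agalma database path from an `AGALMA_DB` environment variable (both names are assumptions; verify them against your version's documentation):

```shell
# Assumptions: 'outdir' is a recognized BIOLITE_RESOURCES key, and agalma
# reads its sequence/homology database path from AGALMA_DB.
export BIOLITE_RESOURCES="outdir=$HOME/agalma/analyses"
export AGALMA_DB="$HOME/agalma/data/agalma.sqlite"
```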
Rather than assemble all of the datasets, we are going to use the fasta files from already assembled transcriptomes.
First, we'll catalog our data:

    cd data
    agalma catalog insert --id SRX288285 --paths SRX288285_1.fq SRX288285_2.fq --species "Agalma elegans" --ncbi_id 316166
    agalma catalog insert --id SRX288432 --paths SRX288432_1.fq SRX288432_2.fq --species "Craseoa lathetica" --ncbi_id 316205
    agalma catalog insert --id SRX288431 --paths SRX288431_1.fq SRX288431_2.fq --species "Physalia physalis" --ncbi_id 168775
    agalma catalog insert --id SRX288430 --paths SRX288430_1.fq SRX288430_2.fq --species "Nanomia bijuga" --ncbi_id 168759
    agalma catalog insert --id JGI_NEMVEC --paths JGI_NEMVEC.fa --species "Nematostella vectensis" --ncbi_id 45351
Each assembled transcriptome has to be 'post-assembled' to create protein translations (by longest open reading frame) and annotations for BLASTX hits against SwissProt:

    cd ../scratch
    agalma postassemble --id SRX288285 --assembly ../data/SRX288285.fa
    agalma postassemble --id SRX288430 --assembly ../data/SRX288430.fa
    agalma postassemble --id SRX288431 --assembly ../data/SRX288431.fa
    agalma postassemble --id SRX288432 --assembly ../data/SRX288432.fa
    agalma postassemble --id JGI_NEMVEC --external
For JGI_NEMVEC, we didn't have to specify the path to the assembly because the
catalog path already points to the assembly file, but we do have to specify
that it is an external assembly with the `--external` flag. For the
transcriptomes produced by Agalma, the path in the catalog points to the read
sequences, which are used to estimate coverage during post-assembly.
Next, each assembly is loaded separately into the Agalma database using the
load pipeline. Here is an example of loading the transcriptomes for the four
sample data sets that were assembled with Agalma and the external assembly
from JGI. Load automatically identifies the most recent run of postassemble
for each catalog ID:

    agalma load --id SRX288285
    agalma load --id SRX288430
    agalma load --id SRX288431
    agalma load --id SRX288432
    agalma load --id JGI_NEMVEC
Now that you have loaded all of the data sets, call the homologize pipeline
on the list of run IDs from each of the load commands. A quick way to find
the run IDs for all of the load commands you have run is to use the
diagnostics command:

    export LOAD_IDS=$(agalma diagnostics runid -n load)
    agalma homologize --id PhylogenyTest $LOAD_IDS
You may need to adjust the load IDs if you have run other analyses, or if you
only want to use some of the data you have loaded. The value PhylogenyTest
given to `--id` specifies the analysis ID, and will be used to pass results
through several commands below. This ID should be unique for each phylogenetic
analysis. Also note that the list $LOAD_IDS is passed directly to the
homologize pipeline without any flags. Run the command with `-h` if you need
more information.
The homologize pipeline will write a collection of gene clusters back into
the Agalma database.
Now call this sequence of pipelines, inserting the analysis ID each time, to build gene trees for each of the clusters, and finally to assemble a supermatrix with a single concatenated sequence of genes for each taxon.

    agalma multalign --id PhylogenyTest
    agalma genetree --id PhylogenyTest
    agalma treeprune --id PhylogenyTest
    agalma multalign --id PhylogenyTest
    agalma supermatrix --id PhylogenyTest
    agalma speciestree --id PhylogenyTest --raxml_flags="-o Nematostella_vectensis"
The supermatrix will be located in:
and the phylogenetic tree in:
To generate reports for the phylogeny run, use:

    cd ..
    agalma report --id PhylogenyTest --outdir reports/PhylogenyTest
    agalma resources --id PhylogenyTest --outdir reports/PhylogenyTest
    agalma phylogeny_report --id PhylogenyTest --outdir reports/PhylogenyTest
You will now have a report with diagnostics for all of the phylogeny stages run above, including images of the supermatrices and the final tree, located at:
Another report with detailed resource utilization will be located at:
A line graph showing the reduction in the number of genes at each stage will be located at:
You can view a list of all the runs from this tutorial with the 'diagnostics' command:

    agalma diagnostics list
Agalma's expression pipeline maps reads against an assembly and estimates the read count for each transcript in the assembly (at the gene and isoform level). Multiple read files can be mapped against each assembly, accommodating multiple treatments and replicates for each species. The read counts are exported as a JSON file that, once parsed, can be analyzed using tools such as edgeR. If the assembly has also been included in a phylogenetic analysis, then gene trees and a species tree can be exported along with the count data, allowing for phylogenetic analyses of gene expression. For background information on this type of analysis, see Dunn et al., 2013.
This tutorial presents a minimal example of an expression analysis. An assembly is imported, reads are mapped to the assembly, and count data (without gene phylogenies) are exported.
First, catalog a pre-assembled set of transcripts for the species Nanomia bijuga that is provided in the test data:

    agalma catalog insert --id Nanomia_bijuga --paths data/Nanomia_bijuga.fa --species "Nanomia bijuga" --ncbi_id 168759 --itis_id 51389
As with the external JGI_NEMVEC assembly above, you have to run postassemble
with the `--external` flag, and then load the annotated transcripts into the
Agalma database with the load pipeline:

    agalma postassemble --id Nanomia_bijuga --external
    agalma load --id Nanomia_bijuga
Next, catalog two single-end Illumina data sets for two different individuals (specimen-1 and specimen-2), but in the same treatment (in this case, the same tissue type: gastrozooids):

    agalma catalog insert --id SRX033366 --paths data/SRX033366.fq --species "Nanomia bijuga" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-1
    agalma catalog insert --id SRX036876 --paths data/SRX036876.fq --species "Nanomia bijuga" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-2
Estimate their expression counts by running the expression pipeline for each
read data set against the provided assembly:

    agalma expression --id SRX033366 Nanomia_bijuga
    agalma expression --id SRX036876 Nanomia_bijuga
Individual reports can be generated for both of the expression runs:

    agalma report --id SRX033366 --outdir report-SRX033366
    agalma report --id SRX036876 --outdir report-SRX036876
Finally, bundle the expression count estimates into a JSON file that can be imported into R for further analysis:

    IDS=$(agalma diagnostics runid -n load)
    agalma export_expression --load $IDS >export.json
In this short example, the JSON file only includes the expression levels. If it were run on assemblies following a full phylogenomic analysis, it would also include the species tree and gene trees (with bootstrap values).