Overview

Agalma is developed by the Dunn Lab at Brown University.

See TUTORIAL for an example of how to use Agalma with a sample dataset. Please consult the FAQ and Troubleshooting sections below if you have any questions or problems.

Overview of Agalma

Agalma is a set of analysis pipelines for phylogenomic analyses. It builds alignments of homologous genes and preliminary species trees from genomic and transcriptome data. Agalma includes support for transcriptome assembly (paired-end Illumina data), and can also import gene predictions from other sources (eg, assembled non-Illumina transcriptomes or gene models from annotated genomes). Please carefully read the details on Data requirements before proceeding with your analyses.

Agalma provides a completely automated analysis workflow and records rich diagnostics. You can evaluate these diagnostics to spot problems and examine the success of your analyses, the quality of the original data, and the appropriateness of the default parameters. You can then rerun subsets of the pipelines with optimized parameters as needed.

The main functionality of this workflow is to:

  • assess read quality with the FastQC package
  • remove clusters in which one or both reads have Illumina adapters (resulting from small inserts)
  • remove clusters where one or both reads is of low mean quality
  • randomize the sequences in the same order in both pairs to make obtaining random subsets easy
  • assemble and annotate rRNA sequences based on a subassembly of the data
  • remove clusters in which one or both reads map to rRNA sequences
  • create a larger subassembly of the dataset to profile expression of transcripts and assess the distribution of sequencing effort
  • create HTML reports that summarize the diagnostics collected during analyses of each transcriptome
  • load the assembled sequences into a database
  • load gene predictions from other sources into the database
  • identify homologous sequences across multiple species based on nucleotide sequence similarity
  • create nucleotide and protein alignments for each set of homologous sequences through simultaneous alignment and translation
  • build gene trees for each set of aligned homologous sequences
  • identify orthologs through topological analysis of the gene trees
  • create nucleotide and protein alignments for each set of orthologs
  • construct and analyze supermatrices of orthologs to build species phylogenies
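The pair-order randomization step above can be illustrated with a short sketch. This is not Agalma's implementation, just the idea: shuffle both files with one shared permutation, so that the reads at the same position in the two outputs remain mates.

```python
import random

def shuffle_pairs(forward_reads, reverse_reads, seed=0):
    """Shuffle two lists of paired reads with the same permutation,
    so forward_reads[i] and reverse_reads[i] stay paired afterwards."""
    assert len(forward_reads) == len(reverse_reads)
    order = list(range(len(forward_reads)))
    random.Random(seed).shuffle(order)
    return ([forward_reads[i] for i in order],
            [reverse_reads[i] for i in order])

fwd, rev = shuffle_pairs(["f1", "f2", "f3", "f4"], ["r1", "r2", "r3", "r4"])
# Pairing is preserved: read k of the forward file still faces read k of the
# reverse file, even though the overall order is now random.
assert all(f[1:] == r[1:] for f, r in zip(fwd, rev))
```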

The workflow is optimized to reduce RAM and computational requirements, as well as the disk space used. It logs detailed stats about computer resource utilization to help you understand what type of computational resources you need.

Agalma is built on top of BioLite, a bioinformatics framework written in Python/C++ that automates the collection and reporting of diagnostics, tracks provenance, and provides lightweight tools for building out customized analysis pipelines.

Agalma is named after a clade of siphonophores. Siphonophores are our favorite animals. Please visit siphonophores.org and creaturecast.org to learn more about them.

Quick Install

First, install BioLite using the Quick Install instructions at:

https://bitbucket.org/caseywdunn/biolite

Once BioLite is installed, install Agalma as follows:

On Mac OS X 10.8:

brew install agalma
agalma blastdb

On Ubuntu 12.04, 13.04:

wget https://bitbucket.org/caseywdunn/agalma/raw/master/install_agalma_ubuntu.sh
sudo sh install_agalma_ubuntu.sh
sudo agalma blastdb

After installing, run the built-in test:

mkdir ~/tmp
cd ~/tmp
agalma test

This will check that your installation works. It runs through a series of test analyses on some very small data sets. This takes about 22 minutes on a 2 core 2 GHz Ubuntu machine (ie, an m1.large Amazon EC2 general purpose Ubuntu instance).

Once the test completes without error, run the TUTORIAL. This will familiarize you with the use of Agalma and clarify best practices.

We have tested Agalma on the above operating systems and several others (including CentOS 6.3). It does not work on earlier versions of these operating systems without customization. It likely works on other operating systems as well, though we have not tested them.

See INSTALL for more detailed instructions, system requirements, and for how to configure Agalma after installing it.

Data requirements

The data requirements for raw transcriptome reads that are to be assembled by Agalma are:

  • Input data must be in FASTQ format.
  • The quality scores must be encoded with an ASCII offset of 33. Older Illumina files may use an offset of 64 or 63, in which case they would need to be transformed to 33.
  • Data are paired, with one file for the forward reads and one for the reverse reads. Each file must have a paired read in the other file, and they must be in the exact same order.
  • FASTQ headers are in Casava 1.6 or Casava 1.8 format. A definition and example of each is shown below.

Casava 1.6 FASTQ header:

@HWI-ST625:3:1:1330:2071#0/1
@<instrument>:<lane>:<tile>:<x-pos>:<y-pos>#<index number>/<read>

Casava 1.8 FASTQ header:

@HWI-ST625:73:C0JUVACXX:7:1101:1403:1923 1:N:0:TGACCA
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
  • Reads that did not pass the Illumina filter must be removed already. If the files from the sequencing center include reads that did not pass filter, they can be removed with the following command (assuming a Casava 1.8+ header):

    grep -A 3 '^@.*[^:]*:N:[^:]*:' in.fastq | grep -v '^--$' > out.fastq

  • The pipeline does not trim sequences, i.e. remove a particular region of each read. If you have low quality ends, these should be trimmed before feeding the data to the pipeline. Likewise, if you need to remove inline indices or other sequences do it before you pass the data to the pipeline.
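As an aside, re-encoding an older offset-64 file to the required offset of 33 amounts to shifting every quality character by the difference of the offsets. The sketch below only illustrates the idea; Agalma itself does not perform this conversion, so use an established read-processing tool for production data:

```python
def convert_quality(qual_line, from_offset=64, to_offset=33):
    """Re-encode one FASTQ quality string from one ASCII offset to another."""
    return "".join(chr(ord(c) - from_offset + to_offset) for c in qual_line)

def convert_fastq(lines, from_offset=64, to_offset=33):
    """Convert the quality lines (every 4th line) of a FASTQ file."""
    return [convert_quality(line, from_offset, to_offset) if i % 4 == 3 else line
            for i, line in enumerate(lines)]

# 'h' (ASCII 104) encodes Q40 at offset 64; Q40 at offset 33 is 'I' (ASCII 73).
assert convert_fastq(["@r1", "ACGT", "+", "hhhh"]) == ["@r1", "ACGT", "+", "IIII"]
```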

Using Agalma

All of your interactions with Agalma will be through the command line program agalma. This program can execute a variety of commands. To see the available commands, just type agalma without any arguments:

agalma

usage: agalma COMMAND [ARGS]

This is a wrapper script for the various components that come
with agalma, a suite of tools for de novo assembly and annotation
of transcriptomes from paired-end sequence data. The following
commands are available:

...

Meta-pipelines:
  preassemble    [sanitize, insert_size, remove_rrna]
  transcriptome  [preassemble, assemble]

...

To print a help message for a specific command, use:
  agalma COMMAND -h

As indicated by the help message, you can get help on particular commands with the -h flag. For example, the following shows all the options available for the transcriptome pipeline:

agalma transcriptome -h

The following sections briefly describe the most important commands in a typical analysis.

Registering your data in the 'catalog'

BioLite maintains a 'catalog' stored in an SQLite database of metadata associated with your raw Illumina data, including:

  • A unique ID that you make up to reference this data set.
  • Paths to the FASTQ files containing the raw forward and reverse reads.
  • The species name, NCBI ID, and ITIS ID.
  • The sequencing center where the data was collected.

You can insert a new catalog entry with the command:

agalma catalog insert -h

usage: catalog insert [-h] [-i ID] [-p [PATHS [PATHS ...]]] [-s SPECIES]
                      [-n NCBI_ID] [-d ITIS_ID] [-e EXTRACTION_ID]
                      [-l LIBRARY_ID] [-b LIBRARY_TYPE] [-t TISSUE]
                      [-q SEQUENCER] [-c SEQ_CENTER] [--note NOTE]
                      [--sample_prep SAMPLE_PREP]

By default, this will use the first four fields of the Illumina header in PATH1 as the catalog ID. You can manually override this by specifying --id ID.
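For example, deriving the default ID from a Casava 1.8 header could look like the sketch below. The exact separator Agalma uses when joining the fields is an implementation detail; the underscore here is only illustrative.

```python
def default_catalog_id(fastq_header, nfields=4, sep="_"):
    """Join the first `nfields` colon-separated fields of an Illumina FASTQ
    header into a catalog ID (the separator is chosen for illustration)."""
    # Strip the leading '@' and any comment after the first space.
    name = fastq_header.lstrip("@").split()[0]
    return sep.join(name.split(":")[:nfields])

assert (default_catalog_id("@HWI-ST625:73:C0JUVACXX:7:1101:1403:1923 1:N:0:TGACCA")
        == "HWI-ST625_73_C0JUVACXX_7")
```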

We strongly suggest that you specify all fields if they are available. This doesn't have to be done all at once: the insert command acts as an update for subsequent calls on an existing catalog ID. That is, if you do a subsequent insert with an existing catalog ID, that catalog record will be updated with any new information you have specified.

Enter the NCBI_ID for your species, if one exists. If not, then leave it blank. Enter the ITIS_ID for your species, if it exists. If not, then enter the ITIS id for the least inclusive clade that includes your species (eg, the genus). Entering both of these IDs at the outset provides a chance to double check the validity and spelling of your species name, and facilitates downstream analyses.

You can view all of your entries with:

agalma catalog all

When executing BioLite pipelines, you can simply use the catalog ID rather than typing the full paths to the raw data.

Overview of the transcriptome assembly pipeline

transcriptome
|- preassemble
|  |- sanitize
|  |- insert_size
|  +- remove_rrna
|- assemble
|- postassemble
+- report

Each of the component pipelines is described in the sections below.

The top-level transcriptome pipeline will prepare your raw RNA-Seq data for assembly, then assemble it using a range of subsets and kmer sizes.

Once you have entered your dataset in the catalog, you can run the entire pipeline with:

agalma transcriptome -i CATALOG_ID -o OUTDIR

The current working directory will be used for temporary files (some of them as large as your input data), and all output that will be kept permanently will be written to the specified OUTDIR.

If the pipeline fails at any stage, you can correct the problem and restart the pipeline from that stage with:

agalma transcriptome --restart --stage N

Diagnostics will be logged to a tab-separated text file called 'diagnostics.txt' in the working directory. Once the pipeline completes, this text file is merged into the global diagnostics SQLite database, which you can browse with the diagnostics command:

agalma diagnostics -h
usage: diagnostics [-h]
                   {list,all,id,run,merge,delete,hide,unhide,programs} ...

For example, to view a listing of all past runs, use no options:

agalma diagnostics list

To view the detailed diagnostics for a particular run, use:

agalma diagnostics run RUN_ID

You can also run the pipeline components individually. The restarting and diagnostics features exist in all the pipelines described below. To see all of the parameters available for a given pipeline, including the default values, use the -h option. This will also display a full list of the stages in the pipeline with short descriptions of each.

sanitize

This pipeline filters out reads of low mean quality (default threshold: 28), reads with adapter sequences, and reads with highly skewed base composition. It then randomizes the order of the reads in the files to make it simple to get random subsets of read pairs in later analyses. The output of the pipeline includes two FASTQ files with the 'sanitized' postfix.

To run a dataset with CATALOG_ID through the entire sanitize pipeline, use:

agalma sanitize -i CATALOG_ID -o OUTDIR
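The mean-quality criterion can be sketched as follows. This illustrates the concept only, not Agalma's code; the default threshold of 28 comes from the description above:

```python
def mean_quality(qual_line, offset=33):
    """Mean Phred score of a read's quality string (offset-33 encoding)."""
    return sum(ord(c) - offset for c in qual_line) / len(qual_line)

def keep_pair(fwd_qual, rev_qual, threshold=28):
    """Keep a read pair only if both mates pass the mean-quality threshold,
    since discarding one mate alone would break the pairing."""
    return min(mean_quality(fwd_qual), mean_quality(rev_qual)) >= threshold

assert keep_pair("IIII", "IIII")      # 'I' encodes Q40: keep the pair
assert not keep_pair("IIII", "$$$$")  # '$' encodes Q3: discard the pair
```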

insert_size

This pipeline uses a small subassembly and bowtie mapping to estimate the mean and variance of the insert size of your input data. Once these have been estimated and logged in the global diagnostics database, future pipeline runs on the same catalog ID will automatically use the estimates when calling oases and bowtie.
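The mapping itself is done by bowtie; the statistics step afterwards amounts to summarizing the observed template lengths of mapped pairs, roughly as in this sketch:

```python
from statistics import mean, stdev

def insert_size_stats(template_lengths):
    """Summarize insert size from the template lengths of mapped read pairs."""
    return mean(template_lengths), stdev(template_lengths)

# Hypothetical template lengths observed from five mapped pairs.
m, s = insert_size_stats([290, 300, 310, 305, 295])
assert m == 300 and s > 0
```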

remove_rrna

After sanitizing your raw data and estimating its insert size, you can assemble a subset of the data to identify and exclude reads that map to known ribosomal RNA. The output of the pipeline includes two FASTQ files with the '.norrna' postfix.

To reduce computation time, Agalma provides a set of curated rRNA sequences that includes only metazoan sequences. If you are working with another clade of organisms, please refer to the FAQ for instructions on how to configure your system.

assemble

The assemble pipeline performs another filtering stage at a higher quality threshold (default: 33). It then runs a sweep of the Oases assembler over multiple subsets of the input data and kmer sizes. In the last stage, it runs the Trinity assembler to generate the final transcriptome assembly.
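The sweep amounts to enumerating combinations of read subsets and kmer sizes and assembling each one. The sketch below only illustrates the enumeration; the subset sizes and kmer values shown are made up, not Agalma's defaults:

```python
from itertools import product

def assembly_sweep(subset_sizes, kmer_sizes):
    """Enumerate every (subset size, kmer size) combination to assemble."""
    return list(product(subset_sizes, kmer_sizes))

# Hypothetical sweep: two data subsets crossed with three kmer sizes.
jobs = assembly_sweep([10_000_000, 20_000_000], [31, 41, 51])
assert len(jobs) == 6 and (10_000_000, 31) in jobs
```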

postassemble

This pipeline cleans the assembled transcripts to remove any rRNA or vector sequences, then selects a single exemplar transcript for each locus. For Oases, the exemplar is the transcript with the highest confidence score (as calculated by Oases); in the case of a tie, the first transcript with that score is chosen. The same criterion is used for Trinity, except that Trinity does not calculate a confidence score, so the number of paths joined to form the transcript is used as the score instead. Raw reads are mapped back to the exemplars to estimate coverage and assign RPKM values. Finally, transcripts are annotated with blastx hits against SwissProt. To reduce computation time, Agalma provides a filtered version of SwissProt that includes only metazoan sequences. If you are working with another clade of organisms, please refer to the FAQ for instructions on how to configure your system.
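RPKM is a standard normalization: reads mapped to a transcript, scaled by transcript length (in kilobases) and total mapped reads (in millions). A sketch of the formula (Agalma's exact bookkeeping may differ):

```python
def rpkm(reads_mapped, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return (reads_mapped * 1e9) / (transcript_length_bp * total_mapped_reads)

# 500 reads on a 2 kb transcript out of 10 million mapped reads -> RPKM of 25.
assert rpkm(500, 2000, 10_000_000) == 25.0
```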

If you plan to use gene predictions from other sources (eg, assembled non-Illumina transcriptomes or gene models from annotated genomes) in downstream analyses in Agalma (e.g. the phylogeny pipeline), these predictions need to be run through the postassemble pipeline. Because these predictions should already include a single exemplar per locus (you need to select exemplars beforehand), you need to skip the exemplar selection stage when using this pipeline for such data sets:

agalma postassemble --id CATALOG_ID --skip exemplars

See TUTORIAL for an example of how to use Agalma with gene predictions from other sources.

report

This tool generates HTML reports of datasets that have been processed with an Agalma pipeline, using a dataset's CATALOG_ID to search the global diagnostics database:

agalma report --id CATALOG_ID

By default, this will output the HTML files and subdirectories in the current working directory. Or you can output them to a specific OUTDIR with:

agalma report --id CATALOG_ID --outdir OUTDIR

Subsequent phylogenetic analyses require that all gene sequences to be considered are loaded into the local Agalma database. The load command takes care of this process.

load

This pipeline parses assembled sequences and loads them into Agalma's database. Annotations for sequences that were assembled by Agalma are parsed into the database as well. Annotations for gene predictions from other sources are also parsed; however, these data sets use a generic 'non-Agalma' header, so you will need to pass the --generic flag in these cases:

agalma load --id CATALOG_ID --generic

See TUTORIAL for an example of how to use Agalma with gene predictions from other sources.

Overview of the phylogeny pipeline

Once assemblies for multiple species are loaded into the local Agalma database, a phylogenomic analysis encompasses the following pipelines:

homologize
|- multalign
|- genetree
|- treeprune
|- multalign
+- genetree

homologize

This pipeline identifies homologous sequences across the datasets that have been loaded into the Agalma database. It performs an all-by-all TBLASTX at a stringent threshold; hits that score above a given threshold are used as edges between transcripts, which together form a graph. The graph is broken into clusters (i.e., connected components) that contain homologous gene sequences.
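The clustering step is an ordinary connected-component search. The sketch below illustrates it on a toy graph; it is not Agalma's implementation, and the transcript names, scores, and threshold are made up:

```python
from collections import defaultdict

def homology_clusters(transcripts, hits, threshold):
    """Cluster transcripts into connected components of the graph whose
    edges are all-by-all hits scoring at or above `threshold`."""
    graph = defaultdict(set)
    for a, b, score in hits:
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    seen, clusters = set(), []
    for t in transcripts:
        if t in seen:
            continue
        stack, component = [t], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

hits = [("t1", "t2", 90), ("t2", "t3", 85), ("t4", "t5", 80), ("t1", "t4", 10)]
clusters = homology_clusters(["t1", "t2", "t3", "t4", "t5"], hits, threshold=50)
# The weak t1-t4 hit falls below the threshold, so two clusters remain.
assert sorted(map(sorted, clusters)) == [["t1", "t2", "t3"], ["t4", "t5"]]
```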

multalign

This pipeline applies sampling and length filters to each cluster of homologous sequences. Then it uses another all-by-all TBLASTX alignment of sequences within each cluster to trim sequence ends not included in any HSP. Finally, it performs multiple sequence alignment for each cluster using MACSE, a translation-aware aligner that accounts for frameshifts and stop codons (see doi:10.1371/journal.pone.0022594).

genetree

This pipeline builds gene trees for each set of homologous sequences. It builds a maximum likelihood phylogenetic tree with RAxML.

treeprune

This pipeline prunes each gene tree to include only one representative sequence per taxon when sequences form a monophyletic group (here called 'monophyly masking'). It then prunes the monophyly-masked tree into maximally inclusive subtrees with no more than one sequence per taxon (here called 'paralogy pruning'). The pruned trees are then re-entered as clusters in the Agalma database.

multalign

In the second pass, this pipeline performs multiple alignment on the clusters generated from the pruned trees. Using the --supermatrix flag, it concatenates the multiple alignments together into a supermatrix, with one sequence per taxon.
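Concatenation into a supermatrix can be sketched as below: per-gene alignments keyed by taxon are joined end to end, and a taxon missing from a gene is padded with gap characters. This is an illustration of the idea, not Agalma's code:

```python
def build_supermatrix(alignments):
    """Concatenate per-gene alignments ({taxon: aligned sequence}) into a
    supermatrix, padding taxa missing from a gene with gap characters."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in alignments:
        # All rows within one gene alignment have the same width.
        width = len(next(iter(aln.values())))
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, "-" * width)
    return supermatrix

gene1 = {"A": "ATG", "B": "ATC"}    # taxon C missing from gene1
gene2 = {"A": "GGAA", "C": "GCAA"}  # taxon B missing from gene2
assert build_supermatrix([gene1, gene2]) == {
    "A": "ATGGGAA", "B": "ATC----", "C": "---GCAA"}
```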

genetree

Finally, the genetree pipeline is used a second time to build a phylogenetic tree from the supermatrix.

Running a pipeline without the catalog

It is possible to run the transcriptome pipeline without first cataloging the data, though we strongly discourage this. To bypass the catalog and specify the paths to your forward and reverse FASTQ files, use:

agalma transcriptome -f FASTQ1 FASTQ2 -o OUTDIR

Without an ID, the pipeline will use the default 'NoID' when writing diagnostics. If you use the -i and -f options together, the ID doesn't have to exist in the catalog, but will be used for naming the diagnostics.

Updating and uninstalling

Agalma is under active development. This means that new updates are not always compatible with analyses that have already been run. We make every effort to avoid changes to the catalog database structure, so there is usually no need to re-catalog data when you install a new version. However, you may need to rerun already completed analyses if you want to generate new reports or use existing data with new versions of pipelines.

To uninstall after installing with Homebrew on OS X, use:

brew remove agalma

To uninstall after installing from the source tarball, run the command:

make uninstall

from the source directory where you built and installed Agalma.

Troubleshooting

Please read the entire README.md file prior to use, and then again if you encounter any problems. Unfortunately we cannot support all operating systems, all types of data (eg old Illumina file formats), or integration with other tools that are not already part of the core Agalma pipeline (such as external programs for filtering reads). We are very grateful for bugs you report and suggestions on clarifying the documentation, but please understand that we cannot help you use Agalma in ways that we have not yet tested or on operating systems we do not provide details for (including older versions of those operating systems).

Reporting problems

If you have successfully run the TUTORIAL, your data meet all the specified criteria, you are using Agalma as described, and you have reread the entire README.md, but you still encounter a problem, please submit the issue to us via the issue tracker. This will require a Bitbucket account. When reporting a bug, attach the diagnostics file generated by the failed analysis and provide a detailed description of the problem.

Please do not e-mail us directly with bugs. The issue tracker allows everyone to see what bugs have been flagged so that the same issue isn't raised repeatedly. If you have the same problem as someone else, feel free to add a comment to the existing issue. The issue tracker allows us to track problems much more efficiently than we can with e-mail and will also help us get to your bug more quickly.

Problems installing BioLite and Agalma

I am having trouble with Homebrew on OS X

  • Make sure you have the most current version of Homebrew installed. Older versions will not work.

  • Run brew doctor. If you get any errors or warnings, fix them and then rerun brew doctor. Don't ignore warnings - these can be critical.

  • The most common problems we see with brew are due to permissions. Do not run brew with sudo. Run all brew commands from a single account.

  • Homebrew is incompatible with other package managers, such as Fink and MacPorts. Don't use Homebrew if either of these is installed.

Problems using Agalma

Before you tackle your own data, it is essential that you walk through the TUTORIAL. This will validate your installation, and familiarize you with how the tool is intended to be used.

Once you have successfully run the tutorial to completion, review the section on Data requirements. Make sure your data satisfy all of these requirements.

When you analyze your own data, adhere as closely as possible to the tutorial. Make use of the catalog, and don't skip pipelines.

When the above advice is followed, the most common problems we see fall into the following categories.

Poor data quality

If there are problems with a sequencing run, many of the reads may be of low quality, or the tail ends of all reads may be of low quality. If the tails of all reads are of low quality, many of them may fail the Agalma quality filters. Though it may still be possible to use poor quality data with Agalma, we cannot provide support for this.

Take a look at your read quality with FastQC or a similar tool. If the read ends are of low quality, trim them prior to cataloging and analyzing the data with Agalma. Be sure that your trimming tool does not eliminate some reads from one paired file and other reads from the other paired file. A fixed-length trim (eg, lop off everything after the 80th nucleotide for all reads) is usually an appropriate way to deal with trimming.
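A fixed-length trim applied identically to both files keeps the pairing intact because no record is ever dropped. A sketch of the idea (the 80 nt cutoff is just an example):

```python
def trim_record(record, length=80):
    """Trim one FASTQ record (header, sequence, plus line, quality) to a
    fixed length; applying it to both paired files keeps them in sync."""
    header, seq, plus, qual = record
    return (header, seq[:length], plus, qual[:length])

trimmed = trim_record(("@read1", "A" * 100, "+", "I" * 100), length=80)
assert len(trimmed[1]) == 80 and len(trimmed[3]) == 80
```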

Incompatible data

Agalma has been extensively tested with paired reads of length 100bp from the Illumina HiSeq 2000 platform. It has only had minimal to no testing with other types of data, including single-end data, shorter read pair data (especially shorter than 45bp) and data from other sequencing technologies.

Insufficient system resources

If you run out of RAM or disk space, Agalma will fail. Depending on the way your system is configured, you may or may not receive a clear error message indicating this as the cause of the failure. To check available disk space on a Unix system, use the df command. To check available RAM, you can run top in another terminal while an Agalma pipeline is running.

These analyses are computationally intensive, and can therefore take some time.

Frequently Asked Questions (FAQ)

Do you have plans to implement feature X?

Our current priorities for Agalma are to rigorously test and optimize a core phylogenomics workflow. Agalma is very modular, and once we are satisfied with the core functionality we will add additional pipelines that use alternative methods for particular steps, such as orthology evaluation.

Can I add new features to Agalma?

Please do! We encourage you to fork the repository, implement your new feature, and, once it is working, send us a pull request so that we can incorporate it into the master branch. Please take a look at the BioLite documentation to better understand our development model. In most cases it makes sense to implement new features as their own pipelines.

Can I skip pipelines?

We strongly advise you to avoid skipping pipelines and to follow the instructions in the TUTORIAL as closely as possible. Otherwise, it is likely that Agalma will not work properly, or will generate unexpected errors.

Can I skip stages within pipelines?

Yes, you can skip stages within pipelines when absolutely necessary (eg, skip exemplar selection in postassemble with transcriptomes not generated within Agalma), or when restarting particular pipelines and you do not need to repeat certain stages. Notice that stages within pipelines are numbered starting with 0, so skipping stage 0 actually skips the first stage of the pipeline. You can see all the stages (with their respective numbers) for a pipeline by typing:

agalma COMMAND -h

I am not working with Metazoa, how can I configure the databases needed by Agalma?

Agalma automatically installs the set of curated rRNA sequences to:

/usr/local/share/agalma

and the optimized databases to:

/usr/local/share/agalma/blastdb

You can customize your databases depending on the clade of organisms you are working with, and configure the paths to these databases in your configuration file located at:

/usr/local/share/agalma/agalma.cfg

Can I run a bootstrap phylogenetic analysis?

Yes, all the options available in RAxML are available in Agalma. These options can be passed as optional arguments to the genetree pipeline with the flag --raxml_flags. For example, for a rapid bootstrap analysis, you would type:

--raxml_flags "-x random_number -# number_of_replicates"

Note that the quotes are necessary.

I am getting an exception error, what should I do?

Sometimes you can get exception errors from 3rd party software that Agalma uses extensively (e.g. BLAST). Although these errors could be actual bugs in those programs, we have run extensive testing and it is unlikely you will run into bugs if you use Agalma as indicated here and in the TUTORIAL. There are a couple of things you can do to check why the error may have happened:

  • Take a look at the files that are generated at each stage of the pipeline and make sure your files are not empty. If they are empty, make sure your data meet the requirements in Agalma, and make sure you are following the instructions in the TUTORIAL to run Agalma.
  • Make sure you are using the versions of the 3rd party software that come with BioLite; see the BioLite INSTALL instructions.

Citing

The agalma cite command should print an up-to-date list of citations.

BioLite is still under development, and is an experimental tool that should be used with care. Please cite:

Howison, M., Sinnott-Armstrong, N. A., & Dunn, C. W. (2012). BioLite, a lightweight bioinformatics framework with automated tracking of diagnostics and provenance. In Proceedings of the 4th USENIX Workshop on the Theory and Practice of Provenance (TaPP '12).

We have not yet published a paper describing Agalma. In the meantime, please cite the BioLite paper.

Agalma and BioLite make use of many other programs that do much of the heavy lifting of the analyses. Please be sure to credit these essential components as well.

Funding

This software has been developed with support from the following US National Science Foundation grants:

PSCIC Full Proposal: The iPlant Collaborative: A Cyberinfrastructure-Centered Community for a New Plant Biology (Award Number 0735191)

Collaborative Research: Resolving old questions in Mollusc phylogenetics with new EST data and developing general phylogenomic tools (Award Number 0844596)

Infrastructure to Advance Life Sciences in the Ocean State (Award Number 1004057)

The Brown University Center for Computation and Visualization has been instrumental to the development of Agalma.

License

Copyright (c) 2012-2013 Brown University. All rights reserved.

Agalma is distributed under the GNU General Public License version 3. For more information, see LICENSE or visit: http://www.gnu.org/licenses/gpl.html