Agalma is developed by the Dunn Lab at Brown University.

See TUTORIAL for an example of how to use Agalma with a sample dataset. Please consult the FAQ and Troubleshooting sections below if you have any questions or problems.

Overview of Agalma

Agalma is a set of analysis pipelines for phylogenomic analyses. It builds alignments of homologous genes and preliminary species trees from genomic and transcriptome data. Agalma includes support for transcriptome assembly (paired-end Illumina data), and can also import gene predictions from other sources (eg, assembled non-Illumina transcriptomes or gene models from annotated genomes). Please carefully read the details on Data requirements before proceeding with your analyses.

Agalma provides a completely automated analysis workflow, and records rich diagnostics. You can then evaluate these diagnostics to spot problems and examine the success of your analyses, the quality of the orignal data, and the appropriateness of the default parameters. You can then rerun subsets of the pipelines with optimized parameters as needed.

The main functionality of this workflow is to:

  • assess read quality with the FastQC package
  • remove clusters in which one or both reads have Illumina adapters (resulting from small inserts)
  • remove clusters where one or both reads is of low mean quality
  • assemble and annotate rRNA sequences based on a subassembly of the data
  • remove clusters in which one or both reads map to rRNA sequences
  • create a larger subassembly of the dataset to profile expression of transcripts and assess the distribution of sequencing effort
  • create HTML reports that summarize the diagnostics collected during analyses of each transcriptome
  • load the assembled sequences into a database
  • load gene predictions from other sources into the database
  • identify homologous sequences across multiple species based on sequence similarity
  • create nucleotide and protein alignments for each set of homologous sequences
  • build gene trees for each set of aligned homologous sequences
  • identify orthologs through topological analysis of the gene trees
  • create nucleotide and protein alignments for each set of orthologs
  • construct and analyze supermatrices of orthologs to build species phylogenies

The workflow is optimized to reduce RAM and computational requirements, as well as the disk space used. It logs detailed stats about computer resource utilization to help you understand what type of computational resources you need.

Agalma is built on top of BioLite, a bioinformatics framework written in Python/C++ that automates the collection and reporting of diagnostics, tracks provenance, and provides lightweight tools for building out customized analysis pipelines.

Agalma is named after a clade of siphonophores. Siphonophores are our favorite animals. Please visit and to learn more about them.

Quick Install

These instructions describe how to install the latest release of Agalma. We have tested Agalma on OS X 10.9, Ubuntu 14.04, and CentOS 6.5 (with Python 2.7 installed). It may not work on earlier versions of these operating systems without customization. It likely works on other operating systems as well, though we have not tested them.

For more information on dependencies, other operating systems, installation from other sources, and installation of development versions, please see the INSTALL file.

These instructions require pip, the Python package manager. If you don't have pip, you may be able to install it with easy_install:

sudo easy_install pip

or with your OS package manager, e.g. on Ubuntu:

sudo apt-get install -y python-pip python-dev

Install dependencies

Agalma has many dependencies on external programs. We recommend installing them using Bioinformatics Brew, a cross-platform package manager for bioinformatics tools. Once BiB is installed, run:

bib install -f biolite-tools/0.4.0 blast/2.2.29+ bowtie/1.0.0 bowtie2/2.1.0 fastqc/0.10.1 gblocks/0.91b mafft/7.130 mcl/12-135 parallel/20130922 raxml/7.7.6 rsem/1.2.9 samtools/0.1.19 sratoolkit/ transdecoder/r2012-08-15 trinity/r20140413p1

To use alternate phylogeny pipelines, you may also need to run:

bib install -f macse/0.9b1 oma/0.99x

On Ubuntu, pre-install these packages:

sudo apt-get install -y python-matplotlib python-lxml sqlite3
sudo cpan -i FindBin URI::Escape

Install Agalma

Install Agalma with pip:

sudo pip install agalma

On OS X you must run:

export CFLAGS=-Qunused-arguments
export CPPFLAGS=-Qunused-arguments
sudo -E pip install agalma

The export commands are due to a known bug in Apple's most recent command line tools (StackOverflow post).

Test Agalma

After installing, run the built-in test:

mkdir ~/tmp
cd ~/tmp
agalma test

This will check to see that your installation works. It runs through a series of test analyses on some very small data sets. This takes about 24 minutes on a 2 core 2 GHz Ubuntu machine (ie, an C3 Large Amazon EC2 general purpose Ubuntu instance).

Once the test completes without error, run the TUTORIAL. This will familiarize you with the use of Agalma and clarify best practices.

Data Requirements

The data requirements for raw transcriptome reads that are to be assembled by Agalma are:

  • Input data must be in FASTQ format.
  • The quality scores must be encoded with an ASCII offset of 33. Older Illumina files may use an offset of 64 or 63, in which case they would need to be transformed to 33.
  • Data are paired, with one file for the forward reads and one for the reverse reads. Each file must have a paired read in the other file, and they must be in the exact same order.
  • FASTQ headers are in Casava 1.6 or Casava 1.8 format. A definition and example of each is shown below.

Casava 1.6 FASTQ header:

@<instrument>:<lane>:<tile>:<x-pos>:<y-pos>#<index number>/<read>

Casava 1.8 FASTQ header:

@HWI-ST625:73:C0JUVACXX:7:1101:1403:1923 1:N:0:TGACCA
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<indexsequence>
  • Reads that did not pass the Illumina filter must be removed already. If the files from the sequencing center include reads that did not pass filter, they can be removed with the following command (assuming a Casava 1.8+ header):

    grep -A 3 '^@.[^:]:N:[^:]*:' in.fastq | grep -v '^--$' > out.fastq

  • The pipeline does not trim sequences, i.e. remove a particular region of each read. If you have low quality ends, these should be trimmed before feeding the data to the pipeline. Likewise, if you need to remove inline indices or other sequences do it before you pass the data to the pipeline.

Using Agalma

All of your interactions with Agalma will be through the command line program agalma. This program can execute a variety of commands. To see the available commands, just type agalma without any arguments:


usage: agalma COMMAND [ARGS]

This is a wrapper script for the various components that come
with agalma, a suite of tools for de novo assembly and annotation
of transcriptomes from paired-end sequence data. The following
commands are available:


  preassemble    [sanitize, insert_size, remove_rrna]
  transcriptome  [preassemble, assemble, postassemble]


To print a help message for a specific command, use:
  agalma COMMAND -h

As indicated by the help message, you can get help on particular commands with the -h flag. For example, the following shows all the options available for the transcriptome pipeline:

agalma transcriptome -h

The following sections briefly describe the most important commands in a typical analysis.

Registering your data in the 'catalog'

BioLite maintains a 'catalog' stored in an SQLite database of metadata associated with your raw Illumina data, including:

  • A unique ID that you make up to reference this data set.
  • Paths to the FASTQ files containing the raw forward and reverse reads.
  • The species name, NCBI ID, and ITIS ID.
  • The sequencing center where the data was collected.

You can insert a new catalog entry with the command:

agalma catalog insert -h

usage: catalog insert [-h] [-i ID] [-p [PATHS [PATHS ...]]] [-s SPECIES]
                      [-n NCBI_ID] [-d ITIS_ID] [-e EXTRACTION_ID]
                      [-l LIBRARY_ID] [-b LIBRARY_TYPE] [-t TISSUE]
                      [-q SEQUENCER] [-c SEQ_CENTER] [--note NOTE]
                      [--sample_prep SAMPLE_PREP]

By default, this will use the first four fields of the Illumina header in PATH1 as the catalog ID. You can manually override this by specifying --id ID.

We strongly suggest that you specify all fields if they are available. This doesn't have to be done at once - the insert command acts like an update for subsequent calls on an existing catalog ID. That is, if you do a subsequent insert with an existing catalog ID, that catalog record will be updated with any new information you have specified.

Enter the NCBI_ID for your species, if one exists. If not, then leave it blank. Enter the ITIS_ID for your species, if it exists. If not, then enter the ITIS id for the least inclusive clade that includes your species (eg, genus). Entering both of these IDs at the outset provides a chance to double check the validity and spelling of your species name, and facilitates downstream analyses.

You can view all of your entries with:

agalma catalog all

When executing BioLite pipelines, you can simply use the catalog ID rather than typing the full paths to the raw data.

Overview of the transcriptome assembly pipeline

|- preassemble
|  |- sanitize
|  |- insert_size
|  +- remove_rrna
|- assemble
|- postassemble
+- report

Each of the component pipelines is described in the sections below.

The top-level transcriptome pipeline will prepare your raw RNA-Seq data for assembly, then assemble it using a range of subset sizes.

Once you have entered your dataset in the catalog, you can run the entire pipeline with:

agalma transcriptome -i CATALOG_ID -o OUTDIR

The current working directory will be used for temporary files (some of them as large as your input data), and all output that will be kept permanently will be written to the specified OUTDIR.

If the pipeline fails at any stage, you can correct the problem and restart the pipeline from that stage with:

agalma transcriptome --restart --stage N

Diagnostics will be logged to a tab-separated text file called 'diagnostics.txt' in the working directory. Once the pipeline completes, this text file is merged into the global diagnostics SQLite database, which you can browse with the diagnostics command:

agalma diagnostics -h
usage: diagnostics [-h]
                   {list,all,id,run,merge,delete,hide,unhide,programs} ...

For example, to view a listing of all past runs, use no options:

agalma diagnostics list

To view the detailed diagnostics for a particular run, use:

agalma diagnostics run RUN_ID

You can also run the pipeline components individually. The restarting and diagnostics features exists in all the pipelines described below. To see all of the parameters available for a given pipeline, including the default values, use the -h option. This will also display a full list of the stages in the pipeline with short descriptions of each.


This pipeline filters low quality reads (default mean: 28), reads with adapter sequences, and reads with highly skewed base composition. The output of the pipeline includes two FASTQ files with the 'sanitized' postfix.

To run a dataset with CATALOG_ID through the entire sanitize pipeline, use:

agalma sanitize -i CATALOG_ID -o OUTDIR


This pipeline uses a small subassembly and bowtie mapping to estimate the mean and variance of the insert size of your input data. Once these have been estimated and logged in the global diagnostics database, future pipeline runs on the same catalog ID will automatically use the estimates where they are relevant.


After sanitizing your raw data and estimating its insert size, you can assemble a subset of the data to identify and exclude reads that map to known ribosomal RNA. The output of the pipeline includes two FASTQ files with the '.norrna' postfix.

To reduce computation time, Agalma provides a set of curated rRNA sequences that includes only metazoan sequences. If you are working with another clade of organisms, please refer to the FAQ for instructions on how to configure your system.


The assemble pipeline performs another filtering stage at a higher quality threshold (default mean: 33). Then it runs Trinity assembler to generate the transcriptome assembly.


This pipeline cleans the assembled transcripts to remove any rRNA or vector sequences. Vector sequences could include untrimmed adapters or plasmids (we sometimes find sequences in our data for the protein expression vectors used to manufacture the sample preparation enzymes). The longest likely coding region per transcript is identified and annotated using blastp against SwissProt. Raw reads are mapped back to the transcripts to estimate coverage and assign TPKM values. The TPKM values are used downstream to choose exemplar transcripts (those with the highest TPKM). Finally, transcripts are annotated with blastx hits against SwissProt. The agalma blastdb command will install a complete, single-file copy of SwissProt for this purpose. To reduce computation time, it is also possible to use a filtered SwissProt database that includes a narrower subset of species, such as only metazoan or viridiplantae sequences (see Subsetting swissprot in the wiki).

If you plan to use gene predictions from other sources (eg, assembled non-Illumina transcriptomes or gene models from annotated genomes) in downstream analyses in Agalma (e.g. the phylogeny pipeline), these predictions need to be run through the postassemble pipeline to generate translations. See TUTORIAL for an example of how to use Agalma with gene predictions from other sources.


This tool generates HTML reports of datasets that have been processed with an Agalma pipeline, using a dataset's CATALOG_ID to search the global diagnostics database:

agalma report --id CATALOG_ID

By default, this will output the HTML files and subdirectories in the current working directory. Or you can output them to a specific OUTDIR with:

agalma report --id CATALOG_ID --outdir OUTDIR

Subsequent phylogenetic analyses require that all gene sequences to be considered are loaded into the local Agalma database. The load command takes care of this process.


This pipeline parses assembled sequences and loads them into Agalma's database. Annotations for sequences that were assembled by Agalma or for gene predictions from other sources are also parsed.

Overview of the phylogeny pipeline

Once assemblies for multiple species are loaded into the local Agalma database, a phylogenomic analysis will typically be conducted with the following sequence of pipelines:



This pipeline identifies homologous sequences across datasets that have been loaded into the Agalma database. It performs an all-by-all BLAST search at a stringent threshold, and the hits which match above a given score are used as edges between two transcripts which then form a graph. The graph is broken into clusters (i.e., connected components) that contain homologous gene sequences.


This pipeline applies sampling and length filters to each cluster of homologous sequences. Then, it performs multiple sequence alignment for each cluster using MAFFT (E-INS-i algorithm). Finally, the alignments are cleaned up with Gblocks.

Alternatively, clusters can be aligned using MACSE, a translation aware multiple sequence aligner that accounts for frameshifts and stop codons. To do this, use the multalignx pipeline. For further details, please read agalma multalignx -h.


This pipeline builds gene trees for each set of homologous sequences. It builds a maximum likelihood phylogenetic tree with RAxML.


This pipeline prunes each gene tree to include only one representative sequence per taxon when sequences form a monophyloetic group (here called 'monophyly masking'). It then prunes the monophyly-masked tree into maximally inclusive subtrees with no more than one sequence per taxon (here called 'paralogy pruning'). The pruned trees are then re-entered as clusters in the Agalma database.


In the second pass, this pipeline performs multiple alignment on the clusters generated from the pruned trees.


This pipeline concatenates the multiple alignments together into a supermatrix, with one sequence per taxon. It also creates a supermatrix with a given proportion of gene occupancy.


Finally, the speciestree pipeline is used to build a maximum likelihood species tree from the supermatrix.

Overview of the expression pipeline

Agalma's expression pipeline maps reads against an assembly and estimates the read count for each transcript in the assembly (at the gene and isoform level). Multiple read files can be mapped against each assembly, accommodating multiple treatments and replicates for each species.


This pipeline maps expression-only dataset for a single individual/treatment to an assembly, estimates the read count for each transcript in the assembly (at the gene and isoform level), and loads these counts into the Agalma database.


This utility packages a full phylogenomic analysis with its accompanying expression analyses into a single JSON files containing the expression counts, gene trees, and species tree. For more background on phylogenetic analysis of gene expression, see Dunn et al., 2013.

Running a pipeline without the catalog

It is possible to run the transcriptome pipeline without first cataloging the data, though we strongly discourage this. To bypass the catalog and specify the paths to your forward and reverse FASTQ files, use:

agalma transcriptome -f FASTQ1 FASTQ2 -o OUTDIR

Without an ID, the pipeline will use the default 'NoID' when writing diagnostics. If you use the -i and -f options together, the ID doesn't have to exist in the catalog, but will be used for naming the diagnostics.

Updating and uninstalling

Agalma is under active development. This means that new updates are not always compatiable with analyses that have already been run. We make every effort to avoid changes to the catalog database structure, so there is usually not a need to re-catalog data when you install a new version. However, you may need to rerun already completed analyses if you want to generate new reports or use existing data with new versions of pipelines.

To upgrade, we recommend uninstalling the previous version, then following the most recent install instructions.

To uninstall with pip, use:

pip uninstall agalma


Please read the entire file prior to use, and then again if you encounter any problems. Unfortunately we cannot support all operating systems, all types of data (eg old Illumina file formats), or integration with other tools that are not already part of the core Agalma pipeline (such as external programs for filtering reads). We are very grateful for bugs you report and suggestions on clarifying the documentation, but please understand that we cannot help you use Agalma in ways that we have not yet tested or on operating systems we do not provide details for (including older versions of those operating systems).

Reporting problems

If you have successfully run the TUTORIAL, your data meet all the specified criteria, you are using Agalma as described, and you have reread the entire, but still encounter a problem, please submit the issue to us with the issue tracker. This will require a Bitbucket account. When reporting a bug, attach the diagnostics file generated by the failed analysis and provide a detailed description of the problem.

Please do not e-mail us directly with bugs. The issue tracker allows everyone to see what bugs have been flagged so that the same issue isn't raised repeatedly. If you have the same problem as someone else, feel free to add a comment to the existing issue. The issue tracker allows us to track problems much more efficiently than we can with e-mail and will also help us get to your bug more quickly.

Problems using Agalma

Before you tackle your own data, it is essential that you walk through the TUTORIAL. This will validate your installation, and familiarize you with how the tool is intended to be used.

Once you have successfully run the tutorial to completion, review the section on Data requirements. Make sure your data satisfy all the of these requirements.

When you analyze your own data, adhere as closely as possible to the tutorial. Make use of the catalog, and don't skip pipelines.

When the above advice are followed, the most common problems we see fall into the following categories.

Poor data quality

If there are problems with a sequencing run, many of the reads may be of low quality, or the tail end of all reads are of low quality. If the tail of all reads is low quality, many of them may fall the Agalma quality filters. Though it may still be possible to use poor quality data with Agalma, we cannot provide support for this.

Take a look at your read quality with FastQC or a similar tool. If the read ends are of low quality, trim them prior to cataloging and analyzing them with Agalma. Be sure that your trimming tool does not eliminate some reads from one paired file and other reads from the other paired file. Fixed-length trim (eg, lop off everything after the 80th nucleotide for all reads) is usually an appropriate way to deal with trimming.

Incompatible data

Agalma has been extensively tested with paired reads of length 100bp from the Illumina HiSeq 2000 platform. It has only had minimal to no testing with other types of data, including single-end data, shorter read pair data (especially shorter than 45bp) and data from other sequencing technologies.

Insufficient system resources

If you run out of RAM or disk space, Agalma will fail. Depending on the way your system is configured, you may or may not receive a clear error message indicating this as the cause of the failure. To check available disk space on a Unix system, use the df command. To check available RAM, you can run top in another terminal while an Agalma pipeline is running.

These analyses are computationally intensive, and can therefore take some time.

Frequently Asked Questions (FAQ)

Do you have plans to implement feature X?

Our current priorities for Agalma are to rigorously test and optimize a core phylogenomics workflow. Agalma is very modular, and once we are satisfied with the core functionality we will add additional pipelines that use alternative methods for particular steps, such as orthology evaluation.

Can I add new features to Agalma?

Please do! We encourage you to fork the repository, implement your new feature, and, once it is working, send us a pull request so that we can incorporate it into the master branch. Please take a look at the BioLite documentation to better understand our development model. In most cases it makes sense to implement new features as their own pipelines.

Can I skip pipelines?

We strongly advice you to avoid skipping pipelines and follow the instructions in the TUTORIAL as closely as possible. Otherwise, it is likely that Agalma will not work properly, or will generate unexpected errors.

Can I skip stages within pipelines?

Yes, you can skip stages within pipelines when absolutely necessary (eg, skip exemplar selection in postassemble with transcriptomes not generated within agalma), or when restarting particular pipelines and you do not need to repeat certain stages. Notice that stages within pipelines are numbered starting with 0. So skiping stage 0 actually skips the first stage of the pipeline. You can see all the stages (with their respective numbers) for a pipeline typing

agalma COMMAND -h

I am not working with Metazoa, how can I configure the databases needed by Agalma?

The BLAST databases automatically installed with Agalma include UniVec, a ribosomal RNA subset of nt, and customized SwissProt database (to include the organelle field in the FASTA header). These should be appropriate for working with any species. The databases are installed to the blastdb subdirectory of the Agalma python module.

Agalma also includes a default set of curated ribosomal RNA sequences in the data/rRNA-animal.fasta file of the Agalma python module. This is a very narrowly sampled set that is suitable for our own research. Another included file data/rRNA-angiosperms.fasta was created for a different project. You should create a custom set of curated rRNA sequences appropriate for the clade of organisms you are studying, and configure the path to this file with the rrna_fasta entry in the config/agalma.cfg file in the python module, or at runtime by setting rrna_fasta=... in the BIOLITE_RESOURCES environment variable.

How to create a new diagnostics database with the catalog from a previous database?

You can copy your catalog to a new diagnostics database. From your old database, do

catalog export "*" >

This creates a shell script with catalog commands. Then setup your BioLite config to point to a new diagnostics database and run the shell commands:


This will populate the new database.

Can I run a bootstrap phylogenetic analysis?

Yes, all the options avalaible in RAxML are available in Agalma. These options can be passed as optional arguments to the genetree and speciestree pipelines with the flag --raxml_flags. For example, for a rapid bootstrap analysis, you would need to type:

--raxml_flags "-x random_number  -# number_of_replicates"

Note that the quotes are necessary.

I am getting an exception error, what should I do?

Sometimes you can get expection errors from 3rd party software that Agalma uses extensively (e.g. BLAST). Although these errors could be actual bugs in those programs, we have run extensive testing and it is unlikely you will run into bugs if you use Agalma as indicated here and in the TUTORIAL. There are a couple of things you can do to check why the error may have happened:

  • Take a look at the files that are generated at each stage of the pipeline and make sure your files are not empty. If they are empty, make sure your data meet the requirements in Agalma, and make sure you are following the instructions in the TUTORIAL to run Agalma.
  • Make sure you are using the versions of the 3rd party software that come with BioLite; see the BioLite INSTALL instructions.


The agalma cite command prints a list of citations, including the citation for Agalma:

Dunn CW, Howison M, Zapata F. 2013. Agalma: an automated phylogenomics workflow. BMC Bioinformatics 14(1): 330. doi:10.1186/1471-2105-14-330

Agalma and BioLite makes use of many other programs that do much of the heavy lifting of the analyses. Please be sure to credit these essential components as well.


This software has been developed with support from the following US National Science Foundation grants:

PSCIC Full Proposal: The iPlant Collaborative: A Cyberinfrastructure-Centered Community for a New Plant Biology (Award Number 0735191)

Collaborative Research: Resolving old questions in Mollusc phylogenetics with new EST data and developing general phylogenomic tools (Award Number 0844596)

Infrastructure to Advance Life Sciences in the Ocean State (Award Number 1004057)

The Brown University Center for Computation and Visualization has been instrumental to the development of Agalma.


Copyright (c) 2012-2014 Brown University. All rights reserved.

Agalma is distributed under the GNU General Public License version 3. For more information, see LICENSE or visit: