Export data for phylogenetic analysis of gene expression

Issue #103 resolved
Casey Dunn
repo owner created an issue

Agalma currently produces annotated assemblies, gene trees, species trees, and expression data. An important use case is the phylogenetic analysis of gene expression data, as outlined in http://icb.oxfordjournals.org/content/53/5/847 . These statistical analyses will likely be included within Agalma some day, but for now will be executed externally. This requires that Agalma export the data needed for the external analyses

There are three things that we need to export from Agalma:

  • The count data produced for each gene in each sample. These counts are derived by mapping reads from a sample to the reference gene sequences for the corresponding species (usually generated by transcriptome assembly).

  • The gene trees. These may not be congruent with the species tree, due to gene duplication, gene loss, phylogenetic inference error, and missing genes that weren't sequenced. Each tip corresponds to a reference gene sequence, and should have a label that corresponds to the gene name used for the count data. Each tree will include genes for one or more species (usually more than one), and a gene tree may include more than one sequence per species.

  • Metadata that is sufficient to specify the experimental design. This includes data on which species, individual, and treatment each sample corresponds to, as well as an estimate of the species tree (with branch lengths).

To keep in mind:

  • Not all genes with expression data will be in the gene trees, since many genes are filtered out according to taxon sampling criteria etc. Rather than export the expression data that intersect with the phylogenetic results, all expression data should be exported. This is because the expression counts are not normalized, and normalization may require the full set of counts.


  • The ideal would be to have some sort of generalized class model for these data which could be mapped to data structures in both python and R. json comes to mind since it is readable, compact, and widely used.

  • Since we will need multiple representations of the data structure (in python in Agalma, in the file format, and in the R class), we should be explicit about where the data model is actually specified. It may make sense to do this as far upstream as possible, i.e. in python in Agalma.

Comments (7)

  1. Casey Dunn reporter

    Mark and I just talked about the contents of the file.

    At the top level there are three components:

    • Species tree
    • Gene trees, a set of gene trees that each have bootstrap support values and duplication/ speciation annotations
    • Expression data, one chunk per species (see details below)

    Each expression chunk should have the following (c is column annotation vector, r is row annotation vector):

    • Species name
    • Expression matrix
    • Sample (c), e.g. library id
    • Individuals (c)
    • Treatments (c)
    • Library preparation method (c)
    • Gene_id (r), corresponds to gene id in the gene trees (within species name space)
    • Human name (r), from blast result
    • Molecule type (r) protein coding, rRNA, unknown etc
    • Genome (r) nuclear, mito, unknown, etc
    • Length (r) effective gene length from rsem
    • isoform (r)
    • Gene ontology (r) a set of all gene ontology terms, including parent nodes, derived form blast hits
  2. Log in to comment