Export data for phylogenetic analysis of gene expression
Agalma currently produces annotated assemblies, gene trees, species trees, and expression data. An important use case is the phylogenetic analysis of gene expression data, as outlined in http://icb.oxfordjournals.org/content/53/5/847 . These statistical analyses will likely be included within Agalma some day, but for now will be executed externally. This requires that Agalma export the data needed for the external analyses
There are three things that we need to export from Agalma:
The count data produced for each gene in each sample. These counts are derived by mapping reads from a sample to the reference gene sequences for the corresponding species (usually generated by transcriptome assembly).
The gene trees. These may not be congruent with the species tree, due to gene duplication, gene loss, phylogenetic inference error, and missing genes that weren't sequenced. Each tip corresponds to a reference gene sequence, and should have a label that corresponds to the gene name used for the count data. Each tree will include genes for one or more species (usually more than one), and a gene tree may include more than one sequence per species.
Metadata that is sufficient to specify the experimental design. This includes data on which species, individual, and treatment each sample corresponds to, as well as an estimate of the species tree (with branch lengths).
To keep in mind:
- Not all genes with expression data will be in the gene trees, since many genes are filtered out according to taxon sampling criteria etc. Rather than export the expression data that intersect with the phylogenetic results, all expression data should be exported. This is because the expression counts are not normalized, and normalization may require the full set of counts.
The ideal would be to have some sort of generalized class model for these data which could be mapped to data structures in both python and R. json comes to mind since it is readable, compact, and widely used.
Since we will need multiple representations of the data structure (in python in Agalma, in the file format, and in the R class), we should be explicit about where the data model is actually specified. It may make sense to do this as far upstream as possible, i.e. in python in Agalma.