Issue #69 resolved

Add ability to catalog and map expression-only datasets

Casey Dunn
repo owner created an issue

Add ability to catalog a dataset and map it to an existing assembly. This is the first step in implementing within the Agalma workflow the analyses described at:


A typical use case would be:

  • The investigator would assemble paired-end long read data for multiple species and perform phylogenetic analyses, as is currently possible.
  • Replicated short-read datasets will be available for several tissues for a subset of the species included in the phylogenetic analyses.
  • Each of these expression datasets will be mapped to the full set of isoforms from the corresponding species (as specified by the load_id), and the abundance of each isoform and gene in the dataset will be determined with RSEM.
  • These counts will be loaded into an expression table in the database for subsequent analysis. This table will include columns for the reference sequence name (gene names should be the same as those in the trees), the id of the dataset of mapped reads, and the count.
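As a rough illustration of the last step, the sketch below turns gene-level results into (reference name, dataset id, count) rows ready for loading. It assumes the tab-delimited layout of an RSEM `.genes.results` file with `gene_id` and `expected_count` columns; the sample data, the `read_gene_counts` helper, and the dataset id `tissueA_rep1` are hypothetical and not part of Agalma.

```python
import csv
import io

# Made-up sample in the assumed layout of an RSEM *.genes.results file.
SAMPLE_GENES_RESULTS = (
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
    "gene1\tt1,t2\t1500.00\t1350.00\t120.00\t35.20\t40.10\n"
    "gene2\tt3\t900.00\t750.00\t0.00\t0.00\t0.00\n"
)


def read_gene_counts(handle, dataset_id):
    """Yield (reference name, dataset id, count) tuples from a results file.

    Hypothetical helper: `handle` is any open text file in the layout above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for record in reader:
        yield (record["gene_id"], dataset_id, float(record["expected_count"]))


# Hypothetical dataset id for one tissue replicate.
rows = list(read_gene_counts(io.StringIO(SAMPLE_GENES_RESULTS), "tissueA_rep1"))
for row in rows:
    print(row)
```

The same pattern would apply to isoform-level results by reading the transcript identifier column instead of `gene_id`.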

Comments (4)

  1. Casey Dunn reporter

    Implementing this feature will require the creation of a new database table that records expression results. These come in two flavors: gene expression (an estimate of the aggregated expression of all splice variants of a gene; one value per gene) and transcript expression (expression measured independently for each splice variant). RSEM outputs these two sets of values as different files.

    I propose that we add a single table to agalma.sqlite for all expression results, where each row contains one expression measurement. Here is a proposed schema:

    CREATE TABLE expression (
       run_id INTEGER,
       catalog_id VARCHAR(256),
       locus INTEGER,
       transcript INTEGER,
       confidence FLOAT,
       expression FLOAT,
       note TEXT);

    Rows with transcript greater than 0 would correspond to transcript-specific expression. Gene expression could be stored as transcript number 0 (or maybe -1 if there is a risk of transcript indexing starting at 0).

    The parameters used and the reference database mapped against should all be associated with the run_id.

    catalog_id would indicate what sequences were mapped.

    Note that this table is close to a subset of the sequences table, but in this case there can be multiple rows for the same sequence.
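A minimal sketch of how the proposed table might be used, assuming the transcript-number-0 convention for gene-level rows. It uses an in-memory SQLite database rather than agalma.sqlite, and the run ids, catalog ids, locus numbers, and expression values are all made up for illustration.

```python
import sqlite3

GENE_LEVEL = 0  # assumed sentinel: transcript number 0 marks gene-level rows

# In-memory database for illustration; the proposal targets agalma.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE expression (
        run_id INTEGER,
        catalog_id VARCHAR(256),
        locus INTEGER,
        transcript INTEGER,
        confidence FLOAT,
        expression FLOAT,
        note TEXT)
""")

# Hypothetical measurements: the same locus appears in several rows,
# once per run at gene level and once per transcript.
rows = [
    (1, "HYPOTHETICAL-READS-A", 10, GENE_LEVEL, None, 150.0, None),
    (1, "HYPOTHETICAL-READS-A", 10, 1, None, 100.0, None),
    (1, "HYPOTHETICAL-READS-A", 10, 2, None, 50.0, None),
    (2, "HYPOTHETICAL-READS-B", 10, GENE_LEVEL, None, 90.0, None),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Gene-level expression: one value per locus per run.
gene_rows = conn.execute(
    "SELECT run_id, locus, expression FROM expression "
    "WHERE transcript = ? ORDER BY run_id",
    (GENE_LEVEL,),
).fetchall()

# Transcript-level expression: rows with a transcript number above the sentinel.
transcript_rows = conn.execute(
    "SELECT run_id, locus, transcript, expression FROM expression "
    "WHERE transcript > ? ORDER BY run_id, transcript",
    (GENE_LEVEL,),
).fetchall()

print(gene_rows)
print(transcript_rows)
```

Keeping both flavors in one table means a single filter on the transcript column separates them; switching the sentinel to -1 would only require changing `GENE_LEVEL`.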
