This reference repository contains genome annotation files relevant to my projects. Some files are omitted because of file size constraints, but they can usually be regenerated or downloaded again quite easily.

  • cellranger (ignored)
  • Ensembl
    • contains genomes from Ensembl
    • contains tx2gene* (transcript-gene mappings) and txdb_*.sqlite files
      • used with R processing of kallisto objects via tximport package
    • kallisto indexes are ignored but can be rebuilt from the *.cdna* files via kallisto index
    • files are generally directly from Ensembl
    • files with *_hs* or *_mm* are human and mouse related, respectively
      • created via parsing of raw Ensembl files via the R script create-tx2gene.R
    • see below for more details
  • adapters
    • contains a FASTA file with potential adapters for use with the cutadapt tool

Ensembl/ Information

The Ensembl data consists of the cDNA and GTF annotation files taken directly from the Ensembl website. Chrom sizes are derived from a combination of R parsing (see below) and the relevant assembly website (see Human and Mouse sources).

The script create-tx2gene.R parses the GTF files and the chrom sizes files (the assembly sequence length files, *chrom_sizes.txt.gz), finds the sequence names that match between the two, and assembles the chrom names and lengths into a chrom_sizes data.frame. This data.frame is used together with the GTF files to create a TxDb sqlite database within R. The TxDb is then parsed further for the mapping between 'TXNAME' and 'GENEID', for use with downstream gene summarisation tools such as the R package tximport. Finally, the txdb and tx2gene objects are saved into this folder as sqlite and tsv/rda files, respectively.
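
The matching and mapping steps described above can be illustrated with a small sketch. This is not the create-tx2gene.R script itself, which is written in R and goes through a TxDb object. It is a plain Python approximation of the same idea. The function names parse_gtf_tx2gene and match_chrom_sizes are hypothetical, the example records are made up, and the chrom sizes file is assumed to hold one whitespace-separated "name length" pair per line, which may not match the real *chrom_sizes.txt.gz layout.

```python
import re

# Matches GTF attribute pairs such as: gene_id "ENSG00000000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_tx2gene(gtf_lines):
    """Collect a transcript -> gene mapping and the sequence names seen in the GTF."""
    tx2gene = {}
    seqnames = set()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF rows have 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            tx2gene[attrs["transcript_id"]] = attrs["gene_id"]
            seqnames.add(fields[0])
    return tx2gene, seqnames


def match_chrom_sizes(size_lines, seqnames):
    """Keep lengths only for sequences that also appear in the GTF.

    Assumes each line is "<name> <length>"; lines that do not fit are skipped.
    """
    sizes = {}
    for line in size_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit() and parts[0] in seqnames:
            sizes[parts[0]] = int(parts[1])
    return sizes


# Made-up example records, for illustration only.
gtf = [
    '#!genome-build example\n',
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\tensembl\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'MT\tensembl\ttranscript\t5\t50\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n',
]
sizes_txt = ["1\t1000\n", "MT\t200\n", "scaffold_9\t42\n"]

tx2gene, seqnames = parse_gtf_tx2gene(gtf)
chrom_sizes = match_chrom_sizes(sizes_txt, seqnames)
```

Here scaffold_9 is dropped from chrom_sizes because no transcript in the example GTF sits on it, which mirrors the "searches for matches between the two" step. In the R workflow the resulting two-column transcript/gene table is what tximport expects as its tx2gene argument.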

Relevant sources:



For the above EBI links, click on the 'sequence report' link (alternatively 'regions', although this covers only scaffolds) to get the sequence lengths for each chromosome/scaffold.