Overview

Introduction

This is a comparative analysis of whole-genome alignments and mapping tools. We considered chained blastz {Schwartz, 2003, PMID:12529312} alignments available from the UCSC browser {Dreszer, 2012, PMID:22086951}. To deal with segmental duplications in either species, we applied the netting heuristic to the swapped human-mouse chains to obtain reciprocal chains {Kent, 2003, PMID:14500911}. We also considered the 12-way mammalian whole-genome alignments from the EPO pipeline {Paten, 2008, PMID:18849524}, available from Ensembl version 65 {Flicek, 2013, PMID:23203987}, and derived the map by extracting only the human and mouse alignments.

For mappers we considered the following tools:

species_mapper.pl
A Perl script from K. Beal (http://www.ebi.ac.uk/~kbeal/species_mapper/).
pslMap
Part of the UCSC tools download. You will also need bedToPsl for bed-to-psl conversion (http://hgdownload.soe.ucsc.edu/admin/exe/).
bnMapper.py
Part of the bx-python library. You need to install the whole library to use the tool. bx-python also comes with an out_to_chain.py script for EPO-to-chain conversion (https://bitbucket.org/james_taylor/bx-python/).

To run the pipelines you will need fetchChromSizes; alternatively, place hg19.chrom.sizes and mm9.chrom.sizes in the base directory.

The goals are (1) to define a pipeline for extracting bijective maps from both alignments, (2) to map a set of features using various combinations of tools and alignments, and (3) to compare the results from each.

The base directory contains features (in originalPeaks), alignment data (in EPO and UCSC), statistics on the mapping results (in stats), and some format conversion scripts that facilitate parsing by all mappers. Details are given in each subdirectory.

Alignments and peaks are restricted to chromosomal coordinates, so contigs are filtered out. This is done transparently for epo_547_hs_mm_12way_mammals_65.out and through formatBed.awk for peaks. Descriptive statistics for each alignment use the following terms.

mappable
means a 1:1 correspondence
tinserts
target bases corresponding to gaps in the query
qinserts
query bases corresponding to gaps in the target
tambiguous
target bases mapping to more than one base on the query
qambiguous
query bases mapping to more than one base on the target
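
As an illustration of the first three terms, here is a minimal Python sketch that tallies them from a file in the UCSC chain format, where each data line holds "size dt dq" and the last line of a chain holds only "size". The function name chain_stats and the toy chain are made up for this example, and this is not necessarily how the counts in this repository were produced. Every chain is counted independently, so bases covered by overlapping chains are counted more than once, and the ambiguous categories are not computed.

```python
import io

def chain_stats(lines):
    """Sum aligned and gap bases over all chains in a UCSC chain file.

    Hypothetical helper for illustration only.
    """
    mappable = tinserts = qinserts = 0
    for line in lines:
        fields = line.split()
        # Blank lines, chain headers and comments carry no block data.
        if not fields or fields[0] == "chain" or fields[0].startswith("#"):
            continue
        mappable += int(fields[0])      # ungapped block: 1:1 aligned bases
        if len(fields) == 3:            # the last block of a chain has no gaps
            tinserts += int(fields[1])  # dt: target bases opposite a query gap
            qinserts += int(fields[2])  # dq: query bases opposite a target gap
    return {"mappable": mappable, "tinserts": tinserts, "qinserts": qinserts}

# Toy chain (not real data): three blocks separated by two gaps.
example = """chain 1000 chr1 500 + 0 55 chr2 400 + 10 67 1
20 5 0
15 0 7
15
"""

print(chain_stats(io.StringIO(example)))
```

For a real file, the same function can be called on an open file handle, for example one opened on a decompressed .over.chain file.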

EPO alignments

Alignments (epo_547_hs_mm_12way_mammals_65.out) can be converted to the chain format needed to run pslMap and bnMapper.py. Use out2chain.py for this conversion. When you run bnMapper.py, another file (epo_547_hs_mm_12way_mammals_65.chain.pkl) is created for fast loading in later runs (bnMapper only).

It makes sense to discard all cases in which one of the species has no homologous region; this is already implemented in out2chain.py. Furthermore, to ensure at least partial consistency, genomic alignment blocks with more than one genomic region in both species should be removed.

one2one.awk, many2one.awk, and one2many.awk extract the corresponding genomic alignment block ids from the EPO (.out) file. These ids can then be passed to filter_out.py to extract the actual alignments.
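
The classification behind these three scripts can be sketched in Python as follows. This is only an illustration of the idea, not a port of the awk scripts: the input is assumed to be already reduced to (block id, role) pairs, the function name classify_blocks is hypothetical, and the target:query reading of "many2one" and "one2many" is an assumption inferred from the statistics table for these alignments.

```python
from collections import Counter

def classify_blocks(regions):
    """Classify alignment blocks by how many regions each side contributes.

    regions: iterable of (block_id, role) pairs, where role is "target"
    or "query". Hypothetical input layout, not the EPO .out format.
    """
    counts = {}
    for block_id, role in regions:
        counts.setdefault(block_id, Counter())[role] += 1

    classes = {}
    for block_id, c in counts.items():
        t, q = c["target"], c["query"]
        if t == 0 or q == 0:
            classes[block_id] = "unpaired"   # no homologous region on one side
        elif t == 1 and q == 1:
            classes[block_id] = "one2one"
        elif q == 1:
            classes[block_id] = "many2one"   # assumed: many target, one query
        elif t == 1:
            classes[block_id] = "one2many"   # assumed: one target, many query
        else:
            classes[block_id] = "many2many"  # more than one region on both sides
    return classes

# Toy input with made-up block ids.
toy = [
    ("b1", "target"), ("b1", "query"),
    ("b2", "target"), ("b2", "target"), ("b2", "query"),
    ("b3", "target"), ("b3", "query"), ("b3", "query"),
    ("b4", "target"),
]
print(classify_blocks(toy))
```

Blocks labelled "unpaired" or "many2many" here correspond to the two kinds of blocks that the text above recommends removing.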

alignment   mappable   tinserts   qinserts   tambiguous  qambiguous
Max Cov     778902145  967209929  798125012  0           0
One : One   776756538  963337224  794863393  0           0
Many : One  0          0          0          0           3273990
One : Many  0          0          0          2118837     0

UCSC chains

Alignment data is downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToMm9.over.chain.gz and http://hgdownload.cse.ucsc.edu/goldenPath/mm9/liftOver/mm9ToHg19.over.chain.gz.

alignment                mappable    tinserts    qinserts
hg19ToMm9.over.OO.chain  803794582   1859134164  1430181564
hg19ToMm9.over.chain     1009840531  2306666812  1818211763
mm9ToHg19.over.OO.chain  821458867   1450078287  1801024288
mm9ToHg19.over.chain     1005764924  1782329675  2316044574

Running time

Running time for GATA1_CD36shbrg1_hg19_peaks and Gata1_Mel_mm9_peaks

peaks                        mapper          time (sec)
GATA1_CD36shbrg1_hg19_peaks  species_mapper  > 400
GATA1_CD36shbrg1_hg19_peaks  bnMapper        70
GATA1_CD36shbrg1_hg19_peaks  pslMap          15
Gata1_Mel_mm9_peaks          species_mapper  > 200
Gata1_Mel_mm9_peaks          bnMapper        80
Gata1_Mel_mm9_peaks          pslMap          15

Times are not reported for the other two files.

Mapping statistics for peaks are in the stats directory; they show the number of bases/elements mapped, along with Venn diagrams of the agreement between mappers.
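
To make the notion of agreement concrete, here is a minimal Python sketch of how the region counts of such a Venn diagram could be derived from the sets of peaks each mapper managed to map. The function name venn_counts and the peak ids are hypothetical, and the actual files in stats may have been produced differently. The sketch only compares which elements were mapped, not whether the mapped coordinates coincide.

```python
def venn_counts(mapped):
    """Count peaks in each Venn region.

    mapped: dict of mapper name -> set of peak ids that mapper placed.
    Returns a dict keyed by the tuple of mappers sharing those peaks.
    """
    names = sorted(mapped)
    all_ids = set().union(*mapped.values())
    regions = {}
    for peak in all_ids:
        # The region is identified by exactly which mappers placed this peak.
        key = tuple(n for n in names if peak in mapped[n])
        regions[key] = regions.get(key, 0) + 1
    return regions

# Made-up peak ids, for illustration only.
toy_mapped = {
    "bnMapper": {"p1", "p2", "p3"},
    "pslMap": {"p2", "p3", "p4"},
    "species_mapper": {"p3"},
}
for mappers, n in sorted(venn_counts(toy_mapped).items()):
    print(" & ".join(mappers), n)
```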

TFos enrichment of 1:1 alignment regions for EPO and UCSC

These computations were done by R. Sandstrom and can be consulted at http://www.mouseencode.org/mouse_analysis/rsandstrom/mousehuman-public.html

Databases integrating human/mouse TFos

We're still trying to publish the databases; please write to gertidenas@gmail.com to obtain copies privately.