1. James Taylor
  2. bx-python

Wiki

Clone wiki

bx-python / bnMapper

Documentation

This page describes how to use bnMapper for mapping genomic features. For example, after installing bx-python, from the bx-python (base) directory

odenas@bxlab05:$ bnMapper.py test_data/epo_tests/hpeaks.bed test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.chain 
INFO:bx.align._epo:parsed 5 elements from test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.chain
INFO:bx.align.epo:pckling to test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.chain.pkl
INFO:root:indexing 5 chains ...
INFO:root:loading from test_data/epo_tests/hpeaks.bed ...
INFO:root:transforming (5) elements ...
chr16	85453122	85453194	peak1
chr16	85454203	85454219	peak2
chr16	85454227	85454258	peak2
chr16	85454261	85454286	peak2
chr16	88576704	88576752	peak4
INFO:root:DONE!

Setup

If you haven't done so, install numpy and bx-python (instructions on how to install bx-python are here HowToInstall). Upon successful installation you should find, among other scripts, bnMapper.py and out_to_chain.py in the bin directory under your installation path. You can run both scripts without arguments or with the -h option to get a usage message.

Out to Chain coversion

The out_to_chain.py script is a simple format converter that expects in input an EPO alignment file and converts it into the UCSC chain format (http://genome.ucsc.edu/goldenPath/help/chain.html).

If you run the script with th -h option you should get

odenas@bxlab05:$ out_to_chain.py -h
usage: out_to_chain.py [-h] [--species SPECIES SPECIES] --chrsizes CHRSIZES
                       CHRSIZES [-o FILE]
                       input

EPO alignments (.out) to .chain converter.

positional arguments:
  input                 File to process

optional arguments:
  -h, --help            show this help message and exit
  --species SPECIES SPECIES
                        Names of target and query species (respectively) in
                        the alignment (default: ['homo_sapiens',
                        'mus_musculus'])
  --chrsizes CHRSIZES CHRSIZES
                        Chromosome sizes for the given species. (default:
                        None)
  -o FILE, --output FILE
                        Output file (default: stdout)

The --species option expects two arguments that should match the species denomination on the .out alignment file. The order of the species names will be reflected on the .chain file.

The --chrsizes expects two arguments. These should be paths to chromosome sizes for the first and the second argument of the --species respectively. The files with chromosome sizes have a line per each chromosome/contig with the name of the chromosome/contig and the size in bp. separated by a space. You can use the fetchChromSizes utility to get them from the UCSC database or can also connect directly to the database like so (this will fetch hg19 chromosome sizes)

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" | tr -d "|" > hg19.genome

The output defaults on the standard output, but can also be specified with the --output option.

Feature mapping

The bnMapper.py script uses an alignment file in the chain format to map genomic features from the target species of the chain file to the query species of the chain file. The script is tuned for mapping relatively short features (typically chip-seq peaks), so features that span multiple chains or chromosomes will be dropped silently. In the future there might be options that can control this behavior.

If you run the script with the -h option you should get

odenas@bxlab05:$ bnMapper.py -h
usage: bnMapper.py [-h] [-f {BED4,BED12}] [-o FILE] [-t FLOAT] [-s] [-g GAP]
                   [-v {info,debug,silent}]
                   input [input ...] alignment

Map features from the target species to the query species of a chain alignment
file. This is intended for mapping relatively short features such as Chip-Seq
peaks on TF binding events. Features that get mapped on different chromosomes
or that span multiple chains are silently filtered out.

positional arguments:
  input                 Input to process. If more than a file is specified,
                        all files will be mapped and placed on --output, which
                        should be a directory.
  alignment             Alignment file (.chain or .pkl)

optional arguments:
  -h, --help            show this help message and exit
  -f {BED4,BED12}, --format {BED4,BED12}
                        Output format. (default: BED4)
  -o FILE, --output FILE
                        Output file. Mandatory if more than one file in input.
                        (default: stdout)
  -t FLOAT, --threshold FLOAT
                        Mapping threshold i.e., |elem| * threshold <=
                        |mapped_elem| (default: 0.0)
  -s, --screen          Only report elements in the alignment (without
                        mapping). -t has not effect here (TODO) (default:
                        False)
  -g GAP, --gap GAP     Ignore elements with an insertion/deletion of this or
                        bigger size. (default: -1)
  -v {info,debug,silent}, --verbose {info,debug,silent}
                        Verbosity level (default: info)

The script will load the features from the input and use the alignment (in .chain) format to do the mapping from the target to the query of the alignment file. The alignment file is typically quite large so it takes a while to load it. To speed this up, the script will create a cached version of the alignment that is faster to load subsequently. So, after the first run

odenas@bxlab05:$ bnMapper.py from_features.bed alignment.chain -o to_features.bed

You will find a file named alignment.chain.pkl. The script will warn you every time it loads a cached alignment file just as a reminder that it might not be up to date if you re-generated the alignment.

You can choose between a BED12 and BED4 output format. The former will have a single (possibly gaped) to-element for each from-element that could be mapped. The latter will preserve names, but might have more than one to-element for each mapped from-element.

The --screen option will simply output the features that can be potentially mapped, without actually mapping them.

The --threshold and the --gap options are filters. The first, one will suppress all features of which only a small fraction can be mapped. The second, will suppress all features that when mapped contain large gaps (as a consequence of insertions in the query genome).

The --output option specifies the output location. If this is a directory, the mapper will create there an output file with the same for each corresponding input file. If you have files f_1.bed, f_2.bed, f_3.bed to map from human to mouse and run

odenas@bxlab05:$ bnMapper.py f_1.bed f_2.bed f_3.bed human_mouse_alignment.chain -o out_dir

you will get out_dir/f_1.bed, out_dir/f_2.bed, out_dir/f_3.bed. This is much faster than running the bnMapper three times as the alignment file is loaded just once.

Examples

The files for the following examples are on the bx-python/test_data/epo_tests directory. For example, in my system I have

odenas@bxlab05:$ pwd
/Users/odenas/src/all_bx/bx-python
odenas@bxlab05:$ ls
MANIFEST.in           distribute_setup.py   lib/                  setup.cfg             test_data/
build/                distribute_setup.pyc  script_tests/         setup.py
dist/                 doc/                  scripts/              src/
odenas@bxlab05:$ ls test_data/epo_tests/
epo_547_hs_mm_12way_mammals_65.chain  hpeaks.bed                            hpeaks.mapped.nopeak2.bed4
epo_547_hs_mm_12way_mammals_65.out    hpeaks.mapped.bed12                   mm9.chrom.sizes
hg19.chrom.sizes                      hpeaks.mapped.bed4
odenas@bxlab05:$ 

We will start with the conversion of the EPO alignments (epo_547_hs_mm_12way_mammals_65.out) to the chain format.

 
out_to_chain.py --species homo_sapiens mus_musculus --chrsizes test_data/epo_tests/hg19.chrom.sizes test_data/epo_tests/mm9.chrom.sizes --output epo.HM.chain test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.out

This command is saying, extract the alignments of homo_sapiens and mus_musculus whose chromosome sizes are test_data/epo_tests/hg19.chrom.sizes and test_data/epo_tests/mm9.chrom.sizes respectively. Build the chain file with homo_sapiens as the target species and write the output on epo.HM.chain. To test the output check that epo.HM.chain is identical to test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.chain

odenas@bxlab05:$ diff -q test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.chain epo.HM.chain 
odenas@bxlab05:$ 

Now we can use epo.HM.chain to map features from homo_sapiens to mus_musculus. If we wanted to map on the other direction we would have to produce another file, say epo.MH.chain like so

 
out_to_chain.py --species mus_musculus homo_sapiens --chrsizes test_data/epo_tests/mm9.chrom.sizes test_data/epo_tests/hg19.chrom.sizes --output epo.MH.chain test_data/epo_tests/epo_547_hs_mm_12way_mammals_65.out

To map the features on test_data/epo_tests/hpeaks.bed we simply type (in BED4 and BED12 respectively)

odenas@bxlab05:$ bnMapper.py test_data/epo_tests/hpeaks.bed epo.HM.chain 
INFO:bx.align._epo:parsed 5 elements from epo.HM.chain
INFO:bx.align.epo:pckling to epo.HM.chain.pkl
INFO:root:indexing 5 chains ...
INFO:root:loading from test_data/epo_tests/hpeaks.bed ...
INFO:root:transforming (5) elements ...
chr16	85453122	85453194	peak1
chr16	85454203	85454219	peak2
chr16	85454227	85454258	peak2
chr16	85454261	85454286	peak2
chr16	88576704	88576752	peak4
INFO:root:DONE!
odenas@bxlab05:$ bnMapper.py -fBED12 test_data/epo_tests/hpeaks.bed epo.HM.chain 
WARNING:bx.align.epo:loading pickled file epo.HM.chain.pkl ...
INFO:root:indexing 5 chains ...
INFO:root:loading from test_data/epo_tests/hpeaks.bed ...
INFO:root:transforming (5) elements ...
chr16	85453122	85453194	peak1	1000	+	85453122	85453194	0,0,0	1	72	0
chr16	85454203	85454286	peak2	1000	+	85454203	85454286	0,0,0	3	16,31,25	0,24,58
chr16	88576704	88576752	peak4	1000	+	88576704	88576752	0,0,0	1	48	0
INFO:root:DONE!

You can supress the logging information by using the -vsilent option

odenas@bxlab05:$ bnMapper.py -vsilent test_data/epo_tests/hpeaks.bed epo.HM.chain 
chr16	85453122	85453194	peak1
chr16	85454203	85454219	peak2
chr16	85454227	85454258	peak2
chr16	85454261	85454286	peak2
chr16	88576704	88576752	peak4
odenas@bxlab05:$ bnMapper.py -vsilent -fBED12 test_data/epo_tests/hpeaks.bed epo.HM.chain 
chr16	85453122	85453194	peak1	1000	+	85453122	85453194	0,0,0	1	72	0
chr16	85454203	85454286	peak2	1000	+	85454203	85454286	0,0,0	3	16,31,25	0,24,58
chr16	88576704	88576752	peak4	1000	+	88576704	88576752	0,0,0	1	48	0

or you can increase it with the -vdebug option. In the picture below you can notice the original peaks on the human genome and the mapped ones on the mouse genome. The alignment coverage track indicates matchin blocks and insertions in that species. Since peak3 falls in a human insertion, it is not mapped on mouse. On the other hand, peak2 is mapped in three blocks on mouse. The two gaps correspond to two insertions in mouse, whether the middle block contains two joined parts that spanned an insertion in human.

Browser view of human and mouse peaks

To limit the mapping in peaks that have taken only a limited amount of indels, one can use the --gap and --threshold options like so

odenas@bxlab05:$ bnMapper.py -g9 -t0.7 ./test_data/epo_tests/hpeaks.bed epo.HM.chain
WARNING:bx.align.epo:loading pickled file epo.HM.chain.pkl ...
INFO:root:indexing 5 chains ...
INFO:root:loading from ./test_data/epo_tests/hpeaks.bed ...
INFO:root:transforming (5) elements ...
chr16	85453122	85453194	peak1
chr16	88576704	88576752	peak4
INFO:root:DONE!

This is filtering out all peaks that incurr a gap (insertion in mouse) of more than 9 bases as well as peaks of which less than 70% of bases are mapped.

Note: relation with pslMap

In case you are looking for consistent results between pslMap (utility from UCSC that performs mappings of generic PSL alignments) and bnMapper, you need to be careful at least on the (chain) alignments you choose to use. If your features are short enough and do not overlap with duplications on neither species and if you run pslMap with the -swapMap options, you should get the similar results. For example, to map human peaks into mouse you need to use the mm9ToHg19.over.chain alignment file for both pslMap (with the -swapMap option) and bnMapper. This because bnMapper maps only on the target-to-query direction (target and query as defined in the chain file).

Notice, however, that bnMapper and pslMap are different, specially when features overlap with duplications on either species or when they span multiple chains. Thus, you are not guarantied to always get the same results.

Updated