
Overview

This repo contains code for building Affymetrix probe set alignment files (link.psl format) that can be displayed in Integrated Genome Browser.

This repo also contains an example link.psl file that the Loraine Lab group made for the U133 human genome array. It is named "GPL570.HG-U133_Plus_2.link.psl.gz" and resides in the "results" folder.

To get the file, either clone this repository or download the whole repository as a "zip" file and unpack it.

Note that this link.psl file has an important companion tabix index file named GPL570.HG-U133_Plus_2.link.psl.gz.tbi. (File extension ".tbi" means "tabix index.") If you move these files to a new location, always keep GPL570.HG-U133_Plus_2.link.psl.gz and its .tbi index file in the same folder.
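If you move the files around in a script, a quick check like the following can confirm that the index still sits next to the data file. This is a minimal sketch; the function name has_tabix_index is made up for illustration and is not part of this repo.

```python
from pathlib import Path


def has_tabix_index(link_psl_gz):
    """Return True if the data file and '<data file>.tbi' are in the same folder.

    Hypothetical helper for illustration only; not part of this repo.
    """
    data = Path(link_psl_gz)
    # The index keeps the full data file name and appends ".tbi".
    index = data.with_name(data.name + ".tbi")
    return data.is_file() and index.is_file()
```

For example, has_tabix_index("results/GPL570.HG-U133_Plus_2.link.psl.gz") should return True in a fresh clone, assuming both files are present in the "results" folder.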

Also, see: Visualizing probe sets in IGB

Instructions

To make link.psl files for an Affymetrix array:

  1. Get probe set target sequences (fasta file) from Affymetrix. Target means: contains subsequences identical to probes. Depending on the array, most target sequences were originally mRNA records from Genbank or were assembled from ESTs from dbEST.
  2. Get a copy of the reference genome in 2bit format if possible. If it is not available in 2bit, get it in fasta and convert it using faToTwoBit. Reference genomes in 2bit format are available from igbquickload.org/quickload.
  3. Align the probe set target sequences onto the reference genome using blat. Use a reasonable maximum intron size parameter to avoid spurious alignments.
  4. Get the probe sequence file (tab-delimited) from Affymetrix.
  5. Run makeLinkPsl.py to make a link.psl file you can open in IGB. Give it the output from blat, the probe set target sequences fasta file, and the tab-delimited probe sequence file from Affymetrix.
  6. Sort, compress, and index the link.psl file using sort, bgzip, and tabix.
  7. Test by opening the file in IGB. See the IGB User's Guide for probe set alignment images.

Running blat

Blat is an alignment tool written by Jim Kent that aligns mRNA and EST sequences onto a reference genome. You can get it from the UCSC Genome Bioinformatics Web site. Google to find a copy.

To run blat, get a copy of the reference genome in 2bit format. If you can't find a 2bit file, make one from a fasta file using faToTwoBit, written by Jim Kent and distributed through UCSC. Google to find a copy.

Note: Many genome sequences in 2bit format are also available from the IGBQuickLoad.org Web site. These include many plant genomes not supported by UCSC.

Here's an example script for running blat:

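#!/bin/bash
# Align probe set target sequences onto the reference genome with blat.
G=H_sapiens_Dec_2013               # genome version name
D="$G.2bit"                        # reference genome in 2bit format
Q=HG-U133_Plus_2.target.fa         # probe set target sequences (fasta)
PSL=GPL570.HG-U133_Plus_2.psl      # blat output (PSL format)
MI=50000                           # maximum intron size
blat -noTrimA -maxIntron="$MI" -noHead -minIdentity=95 -dots=100 "$D" "$Q" "$PSL"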
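
With -noHead, blat writes one alignment per line as 21 tab-separated PSL fields. To inspect the output before passing it on, a parser along these lines may help. This is a sketch only: the names PslRecord and parse_psl_line are made up for illustration, and this is not the code makeLinkPsl.py uses.

```python
from collections import namedtuple

# The 21 standard PSL columns, in order.
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}

# Hypothetical record type for illustration; not part of this repo.
PslRecord = namedtuple("PslRecord", PSL_FIELDS)


def parse_psl_line(line):
    """Parse one headerless PSL line into a PslRecord."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PSL_FIELDS):
        raise ValueError("expected 21 PSL fields, got %d" % len(values))
    parsed = []
    for name, value in zip(PSL_FIELDS, values):
        if name in LIST_FIELDS:
            # Comma-separated lists end with a trailing comma, e.g. "60,40,".
            parsed.append([int(x) for x in value.split(",") if x])
        elif name in TEXT_FIELDS:
            parsed.append(value)
        else:
            parsed.append(int(value))
    return PslRecord(*parsed)
```

Here the "q" fields describe the probe set target sequence (the blat query) and the "t" fields describe the reference genome sequence.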

Running makeLinkPsl.py

Once you've run blat, use makeLinkPsl.py to make the link.psl file.

Note: makeLinkPsl.py requires BioPython's SeqIO module.

The script requires:

  • blat output - alignments in PSL format mapping probe set target sequences onto a reference genome
  • probe sequences file (tab-delimited format from Affymetrix)
  • probe set target sequences file (fasta format)

After creating the link.psl file, sort it, compress it with bgzip, and index it with tabix.

For example:

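#!/bin/bash
# Build the link.psl file, sort it by sequence name and start position, then compress it.
makeLinkPsl.py -p data/HG-U133_Plus_2.probe_tab.gz -f data/HG-U133_Plus_2.target.gz \
  -b data/GPL570.HG-U133_Plus_2.psl.gz -q .95 | sort -k14,14 -k16,16n | bgzip -c > GPL570.HG-U133_Plus_2.link.psl.gz
# Index on sequence name (column 14), start (column 16), and end (column 17).
tabix -s 14 -b 16 -e 17 GPL570.HG-U133_Plus_2.link.psl.gz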
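
The sort keys match the columns given to tabix: column 14 is the reference sequence name and column 16 is the start position, compared as a number. The Python sketch below shows the same ordering for a list of lines already in memory. The function names are made up for illustration, and plain string comparison may order sequence names differently from sort under some locales.

```python
def link_psl_sort_key(line):
    """Return (sequence name, numeric start), like sort -k14,14 -k16,16n.

    Hypothetical helper for illustration only; not part of this repo.
    """
    fields = line.rstrip("\n").split("\t")
    # Columns 14 and 16 are at 0-based indexes 13 and 15.
    return (fields[13], int(fields[15]))


def sort_link_psl_lines(lines):
    """Return the lines ordered by sequence name, then by start position."""
    return sorted(lines, key=link_psl_sort_key)
```

Comparing the start as an integer matters: as text, "100" would sort before "20".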

About link.psl

The original link.psl format was developed at Affymetrix to enable Integrated Genome Browser to display the location of probes selected from probe set target sequences aligned onto a genome.

The original version of the format contained four sections, each starting with a track line header. The first section contained blat output in PSL format indicating the alignment of target sequences onto a reference genome. The second section contained a modified version of the PSL format indicating the location of individual probes relative to their target sequences. The third and fourth sections are no longer used; you can safely ignore them.

When you use IGB to open a link.psl file, the IGB code uses the two mappings (probe-to-target and target-to-genome) to map the probe sequences onto the genome.

The IGB code is able to handle complexities such as deletions and insertions in the target-to-genome alignment, probes that overlap along the target sequence, and so on.
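
To illustrate how the two mappings combine, here is a minimal sketch that projects a probe's interval on the target sequence onto the genome through the aligned blocks of a PSL record. It is not IGB's code. It assumes a plus-strand alignment and 0-based, half-open coordinates, and the function name probe_to_genome is made up for illustration.

```python
def probe_to_genome(probe_start, probe_end, block_sizes, q_starts, t_starts):
    """Map a probe interval on the target sequence onto genome coordinates.

    Hypothetical helper; assumes a plus-strand alignment.
    block_sizes, q_starts, and t_starts are the PSL block lists, where
    q_starts are positions on the probe set target sequence and t_starts
    are positions on the genome. Returns a list of (start, end) pieces.
    """
    pieces = []
    for size, q_start, t_start in zip(block_sizes, q_starts, t_starts):
        # Part of the probe that falls inside this aligned block, if any.
        lo = max(probe_start, q_start)
        hi = min(probe_end, q_start + size)
        if lo < hi:
            offset = t_start - q_start
            pieces.append((lo + offset, hi + offset))
    return pieces


# Made-up alignment: two blocks of 60 and 40 bases with a 50-base gap
# on the genome between them.
print(probe_to_genome(50, 75, [60, 40], [0, 60], [200, 310]))
# prints [(250, 260), (310, 325)]
```

A probe that spans the boundary between two blocks comes back as two genomic pieces, which is one of the complexities mentioned above.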

Later, we modified the format to enable random access via byte-level HTTP requests, which supports partial data loading into IGB from IGBQuickLoad sites. For this, we use tabix and bgzip block compression. The two alignment sections are now combined.
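
A byte-level request is an ordinary HTTP request with a Range header. The sketch below only builds such a request and does not send it. It is not IGB's code; the function name and any URL you pass in are hypothetical, and in practice the byte offsets would come from the tabix index.

```python
from urllib.request import Request


def build_range_request(url, first_byte, last_byte):
    """Build an HTTP request for bytes first_byte..last_byte (inclusive).

    Hypothetical helper for illustration only; not part of this repo.
    """
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("invalid byte range")
    return Request(url, headers={"Range": "bytes=%d-%d" % (first_byte, last_byte)})
```

Passing the returned request to urllib.request.urlopen would fetch only that slice of the compressed file, provided the server supports range requests.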

The first 21 fields of link.psl now contain the probe set target sequence alignment in PSL format (the output of blat), and the next 21 fields contain the probe alignments for that target sequence.
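
As a sketch of that layout, the following splits one line into its two 21-field halves. It assumes the line is tab-delimited and has at least 42 fields; the function name split_link_psl_line is made up for illustration.

```python
PSL_WIDTH = 21  # number of fields in one PSL record


def split_link_psl_line(line):
    """Split a link.psl line into (target alignment fields, probe fields).

    Hypothetical helper; assumes a tab-delimited line with at least 42 fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2 * PSL_WIDTH:
        raise ValueError("expected at least 42 fields, got %d" % len(fields))
    target_alignment = fields[:PSL_WIDTH]
    probe_alignment = fields[PSL_WIDTH:2 * PSL_WIDTH]
    return target_alignment, probe_alignment
```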