Wiki

Clone wiki

jra-src / Home

Joint Read Aligner (JRA) User Manual

1. What is JRA?

JRA is a split-alignment tool to infer structural variants (currently implemented for deletions only) from large short-read datasets. The idea of JRA in 3 simple pictures:

Alt text

2. Download and Installation

You can choose from one of the following ways to get JRA running on your computer.

Binary

Download a 64-bit Linux stand-alone executable from here . Make sure to add to $PATH the path to this executable. GLIBC >= 2.19 is required. You can check your version by executing ldd --version.

Install using pip

TODO

Manually run from source tree

Clone the JRA source code :

git clone https://bitbucket.org/jointreadalignment/jra-src

or click here to download the source as a zipped file.

Next, manually install all the Python dependencies listed in requirements.txt. Then the jra.py script in the top folder will launch JRA.

3. Quick Usage Guide

To run JRA in its default mode, you need to provide it with 3 things: (1) DNA reads (fasta/fastq format), (2) reference genome (fasta format), (3) pairwise local alignments of the reads to the reference (currently only accepts alignments from LAST. Click here for instructions to install LAST). This is the default JRA workflow:

lastdb -uNEAR -C2 db refGenome.fa
lastal -d96 -Q1 -j1 -i1 db reads.fastq | last-split -m 0.7 -n > LASTAlignments.maf
jra -Q1 reads.fastq refGenome.fa LASTAlignments.maf > out.sam

Note1: If you are running JRA directly from the source tree, jra in the third line corresponds to jra.py in the project top folder.
Note2: For huge datasets (eg. human sequencing projects), see Simple hacks section for a faster recipe.

The first two lines are LAST commands. lastdb constructs an index of the reference genome. lastal computes read-to-reference local alignments, which will act as seeds for JRA. last-split computes error probabilities of the local alignments.

The last line runs JRA to perform the actual joint alignment.

To learn about all the options appearing here, see the JRA advanced usage guide below.

4. JRA Output format

The output is in SAM-like format. For each predicted deletion, JRA reports along with the alignment, its error probability, and the sequence reconstructed in the deletion locus. Consider an example output line.

5   0   chr1    186825927   0.002   40M403D31M  *   0   87 AGGTTAGCAGAAATTTGTCAGTTAAAGAGAGAGAAAATGATGGCAGTCAGTAATAAGGCTGACTGCCTTAT  *   9250.0  35473,10625635,29547858,30292642

Where's what each field means:

Field Meaning Example Remarks
1 ID 5 arbitrarily assigned ID for the deletion event
2 unused 0 for SAM format compatibility. value fixed to 0
3 ref sequence name chr1 -
4 1-based leftmost position where alignment starts 186825927 -
5 Joint-alignment error probability 0.002 Use this as a measure of confidence of the structural variant described by the alignment
6 CIGAR string 56M403D31M See SAM format for description of CIGAR
7 unused * for SAM format compatibility. value fixed to *
8 unused 0 for SAM format compatibility. value fixed to 0
9 length of reconstructed sequence 87 -
10 reconstructed sequence GTC ... TAT reconstruction of the sampled genome in the deletion locus
11 unused * for SAM format compatibility. value fixed to *
12 joint alignment score 9250.0 do not use this score for ranking/filtering purposes. use Field 5 instead
13 supporting reads 35473,10625635,29547858,30292642 names(IDs) of reads which were jointly aligned to infer this deletion

5. JRA Advanced Usage Guide

5.1 LAST options

LAST maintains its own full-fledged documentation here. We only show options relevant to JRA.

lastdb

This LAST module creates an index (suffix array) of the reference.

option Meaning JRA recommended/mandated setting Remarks
-u spaced-seed pattern NEAR Suitable when reference and query are closely related.
-C construct child table along with suffix array 2 Makes alignments faster, but uses more memory (e.g. for human genome, suffix array only = 15GB, with child table = 20G). Do not provide this argument, if memory is an issue.

lastal

This is the LAST module that computes local alignments of query and reference sequences.

option Meaning JRA recommended/mandated setting Remarks
-Q query sequences file format 0 if fasta, 1 if fastq --
-j gapless alignments 1 (mandatory) JRA can currently only process gapless alignments.
-i batch size 1 (mandatory) ensures that alignments appear in the same order as reads
-r score for match 6 this is the default value if lastdb was created using -uNEAR option
-q mismatch cost 18 this is the default value if lastdb was created using -uNEAR option
-d minimum score of local alignment 96 recommended value to go with -r 6 and -q 18 scheme. This allows matches with length as short as 16.

last-split

This is the LAST module that computes the error probability of a local alignment, which is a crucial piece of information that JRA uses.

option Meaning JRA recommended/mandated setting Remarks
-n list all original local alignments mandatory switch Without this switch, last-split computes a split alignment based on a single read.
-m mismap probability between 0.6 to 0.8 Local alignment with error probability more than this value is not reported. Higher values increase sensitivity of downstream JRA, but also decrease specificity.

5.2 JRA options

option Meaning Possible Values Default value Remarks
-Q 1 if reads are in fastq format, else 0 0 or 1 0 --
-d minimum deletion size integer in range (1,inf) 20 JRA will only search for deletions at least this size
-g scoring scheme between reference and sampled genomes File describing scoring scheme defaults scheme is described in defaultScores/scoreGenome.mat See below for file format.
-s scoring scheme modeling sequencing errors File describing scoring scheme defaultScores/scoreSequencer.mat See below for file format.
-m local alignments from LAST that have mismap probability higher than this value are filtered out by JRA float in range (0,1) 0.6 Use same value as for -m option in last-split. Higher values increase sensitivity at the cost of longer running time and decreased specificity.
-e filter out joint-alignments with error probability higher than this value (0,1) 0.01 higher values increase sensitivity but decrease specifity of JRA

5.3 Scoring scheme description format

JRA uses two scoring schemes: the first to model evolutionary divergence between the reference and sampled genome, and the second to model sequencer errors.

The reference-to-sample scheme consists of a substitution matrix, affine gap penalty, and a split cost for structural variant. These should be specified in a text file as shown below, and provided to JRA with -g option. If no scoring scheme is provided, values shown below are used. Rows correspond to the reference and the columns to the query.

A  C  G  T
A 15 -80 -80 -80
C -80 15 -80 -80
G -80 -80 15 -80
T -80 -80 -80 15

delExist -105
delExtend -40
insExist -105
instExtend -40
Split -500

The scoring scheme for sequencing errors consists of a substitution matrix and a linear gap cost (to be implemented). These should be specified in a text file as shown below, and provided to JRA as -s option. If no scoring scheme is provided, values shown below are used.

A  C  G  T
A 15 -15 -15 -15
C -15 15 -15 -15
G -15 -15 15 -15
T -15 -15 -15 15


delExist 0
delExtend -30
insExist 0
instExtend -30

6. Simple hacks

6.1 To go faster

Python's file reading routines are painfully slow. This might be a bottleneck for large datasets. The filter-maf-alignments.sh script provided in the scripts folder filters out uninteresting alignments from last-split, resulting in significantly faster running times for the downstream JRA. You just need to modify the second line in the workflow shown in the Quick Usage guide.

lastal -d96 -Q1 -j1 db reads.fastq | last-split -m 0.7 -n | filter-maf-alignment.sh  - 0.7 > LASTAlignments.maf

Make sure that the second argument to fiter-maf-alignment.sh matches the -m value provided to last-split.

6.2 Workflow for paired-end reads

JRA can incorporate paired-end information to improve its split alignments. Use the UpdateMismapProbFromMate.py in the scripts/ folder in the following manner:

lastal -d96 -Q1 -j1 -i1 db reads1.fastq | last-split -m 0.8 -n | filter-maf-alignment.sh  - 0.7 > mate1Alignments.maf
lastal -d96 -Q1 -j1 -i1 db reads2.fastq | last-split -m 0.8 -n | filter-maf-alignment.sh  - 0.7 > mate2Alignments.maf
UpdateMismapProbFromMate.py -1 mate1Alignments.maf -2 mate2Alignments.maf -o mate1Updated.maf --fmean FRAGMENTMEAN --readlen READLENGTH --fsd FRAGMENT_SIZE_STANDARD_DEVIATION -a ALPHA
UpdateMismapProbFromMate.py -1 mate2Alignments.maf -2 mate1Alignments.maf -o mate2Updated.maf --fmean FRAGMENTMEAN --readlen READLENGTH --fsd FRAGMENT_SIZE_STANDARD_DEVIATION -a QUANTILE_LEVEL_FOR_NORMAL_DISTR_N(0,1)
cat mate1Updated.maf mate2Updated.maf > alignments.maf
cat reads1.fastq reads2.fastq > allReads.fastq
jra -Q1 allReads.fastq refGenome.fa alignments.maf > out.sam

Please note that this is an unoptimized, experimental setup only.

7. Troubleshooting

TODO

8. Publication-related

8.1 Paper

Please use the following article to cite us or to explore the nitty-gritty details of JRA:

Anish M.S. Shrestha, Martin C. Frith, Kiyoshi Asai, Hugues Richard: Jointly aligning a group of DNA reads improves accuracy of identifying large deletions, Nucleic Acids Research 2017. download

8.2 Benchmarking dataset

Datasets used in the paper for benchmarking are available for download here.

9. Contact us

We would love to hear your complaints, questions, comments, suggestions. Please reach out to us at:

  • Anish Shrestha: anish (ATmark) edu.k.u-tokyo.ac.jp
  • Hugues Richard: hugues.richard (ATmark) upmc.fr

Updated