Joint Read Aligner (JRA) User Manual

1. What is JRA?

JRA is a split-alignment tool to infer structural variants (currently implemented for deletions only) from large short-read datasets. The idea of JRA in 3 simple pictures:

Alt text

2. Download and Installation

You can choose from one of the following ways to get JRA running on your computer.

Binary

Download a 64-bit Linux stand-alone executable from here . Make sure to add to $PATH the path to this executable. GLIBC >= 2.19 is required. You can check your version by executing ldd --version.

Install using pip

TODO

Manually run from source tree

Clone the JRA source code :

git clone https://bitbucket.org/jointreadalignment/jra-src

or click here to download the source as a zipped file.

Next, manually install all the Python dependencies listed in requirements.txt. Then the jra.py script in the top folder will launch JRA.

3. Quick Usage Guide

To run JRA in its default mode, you need to provide it with 3 things: (1) DNA reads (fasta/fastq format), (2) reference genome (fasta format), (3) pairwise local alignments of the reads to the reference (currently only accepts alignments from LAST. Click here for instructions to install LAST). This is the default JRA workflow:

lastdb -uNEAR -C2 db refGenome.fa
lastal -d96 -Q1 -j1 -i1 db reads.fastq | last-split -m 0.7 -n > LASTAlignments.maf
jra -Q1 reads.fastq refGenome.fa LASTAlignments.maf > out.sam

Note1: If you are running JRA directly from the source tree, jra in the third line corresponds to jra.py in the project top folder.
Note2: For huge datasets (eg. human sequencing projects), see Simple hacks section for a faster recipe.

The first two lines are LAST commands. lastdb constructs an index of the reference genome. lastal computes read-to-reference local alignments, which will act as seeds for JRA. last-split computes error probabilities of the local alignments.

The last line runs JRA to perform the actual joint alignment.

To learn about all the options appearing here, see the JRA advanced usage guide below.

4. JRA Output format

The output is in SAM-like format. For each predicted deletion, JRA reports along with the alignment, its error probability, and the sequence reconstructed in the deletion locus. Consider an example output line.

5   0   chr1    186825927   0.002   40M403D31M  *   0   87 AGGTTAGCAGAAATTTGTCAGTTAAAGAGAGAGAAAATGATGGCAGTCAGTAATAAGGCTGACTGCCTTAT  *   9250.0  35473,10625635,29547858,30292642

Where's what each field means:

Field	Meaning	Example	Remarks
1	ID	5	arbitrarily assigned ID for the deletion event
2	unused	0	for SAM format compatibility. value fixed to 0
3	ref sequence name	chr1	-
4	1-based leftmost position where alignment starts	186825927	-
5	Joint-alignment error probability	0.002	Use this as a measure of confidence of the structural variant described by the alignment
6	CIGAR string	56M403D31M	See SAM format for description of CIGAR
7	unused	*	for SAM format compatibility. value fixed to *
8	unused	0	for SAM format compatibility. value fixed to 0
9	length of reconstructed sequence	87	-
10	reconstructed sequence	GTC ... TAT	reconstruction of the sampled genome in the deletion locus
11	unused	*	for SAM format compatibility. value fixed to *
12	joint alignment score	9250.0	do not use this score for ranking/filtering purposes. use Field 5 instead
13	supporting reads	35473,10625635,29547858,30292642	names(IDs) of reads which were jointly aligned to infer this deletion

5. JRA Advanced Usage Guide

5.1 LAST options

LAST maintains its own full-fledged documentation here. We only show options relevant to JRA.

lastdb

This LAST module creates an index (suffix array) of the reference.

option	Meaning	JRA recommended/mandated setting	Remarks
-u	spaced-seed pattern	NEAR	Suitable when reference and query are closely related.
-C	construct child table along with suffix array	2	Makes alignments faster, but uses more memory (e.g. for human genome, suffix array only = 15GB, with child table = 20G). Do not provide this argument, if memory is an issue.

lastal

This is the LAST module that computes local alignments of query and reference sequences.

option	Meaning	JRA recommended/mandated setting	Remarks
-Q	query sequences file format	0 if fasta, 1 if fastq	--
-j	gapless alignments	1 (mandatory)	JRA can currently only process gapless alignments.
-i	batch size	1 (mandatory)	ensures that alignments appear in the same order as reads
-r	score for match	6	this is the default value if lastdb was created using -uNEAR option
-q	mismatch cost	18	this is the default value if lastdb was created using -uNEAR option
-d	minimum score of local alignment	96	recommended value to go with -r 6 and -q 18 scheme. This allows matches with length as short as 16.

last-split

This is the LAST module that computes the error probability of a local alignment, which is a crucial piece of information that JRA uses.

option	Meaning	JRA recommended/mandated setting	Remarks
-n	list all original local alignments	mandatory switch	Without this switch, last-split computes a split alignment based on a single read.
-m	mismap probability	between 0.6 to 0.8	Local alignment with error probability more than this value is not reported. Higher values increase sensitivity of downstream JRA, but also decrease specificity.

5.2 JRA options

option	Meaning	Possible Values	Default value	Remarks
-Q	1 if reads are in fastq format, else 0	0 or 1	0	--
-d	minimum deletion size	integer in range (1,inf)	20	JRA will only search for deletions at least this size
-g	scoring scheme between reference and sampled genomes	File describing scoring scheme	defaults scheme is described in defaultScores/scoreGenome.mat	See below for file format.
-s	scoring scheme modeling sequencing errors	File describing scoring scheme	defaultScores/scoreSequencer.mat	See below for file format.
-m	local alignments from LAST that have mismap probability higher than this value are filtered out by JRA	float in range (0,1)	0.6	Use same value as for -m option in last-split. Higher values increase sensitivity at the cost of longer running time and decreased specificity.
-e	filter out joint-alignments with error probability higher than this value	(0,1)	0.01	higher values increase sensitivity but decrease specifity of JRA

5.3 Scoring scheme description format

JRA uses two scoring schemes: the first to model evolutionary divergence between the reference and sampled genome, and the second to model sequencer errors.

The reference-to-sample scheme consists of a substitution matrix, affine gap penalty, and a split cost for structural variant. These should be specified in a text file as shown below, and provided to JRA with -g option. If no scoring scheme is provided, values shown below are used. Rows correspond to the reference and the columns to the query.

A  C  G  T
A 15 -80 -80 -80
C -80 15 -80 -80
G -80 -80 15 -80
T -80 -80 -80 15

delExist -105
delExtend -40
insExist -105
instExtend -40
Split -500

The scoring scheme for sequencing errors consists of a substitution matrix and a linear gap cost (to be implemented). These should be specified in a text file as shown below, and provided to JRA as -s option. If no scoring scheme is provided, values shown below are used.

A  C  G  T
A 15 -15 -15 -15
C -15 15 -15 -15
G -15 -15 15 -15
T -15 -15 -15 15


delExist 0
delExtend -30
insExist 0
instExtend -30

6. Simple hacks

6.1 To go faster

Python's file reading routines are painfully slow. This might be a bottleneck for large datasets. The filter-maf-alignments.sh script provided in the scripts folder filters out uninteresting alignments from last-split, resulting in significantly faster running times for the downstream JRA. You just need to modify the second line in the workflow shown in the Quick Usage guide.

lastal -d96 -Q1 -j1 db reads.fastq | last-split -m 0.7 -n | filter-maf-alignment.sh  - 0.7 > LASTAlignments.maf

Make sure that the second argument to fiter-maf-alignment.sh matches the -m value provided to last-split.

6.2 Workflow for paired-end reads

JRA can incorporate paired-end information to improve its split alignments. Use the UpdateMismapProbFromMate.py in the scripts/ folder in the following manner:

lastal -d96 -Q1 -j1 -i1 db reads1.fastq | last-split -m 0.8 -n | filter-maf-alignment.sh  - 0.7 > mate1Alignments.maf
lastal -d96 -Q1 -j1 -i1 db reads2.fastq | last-split -m 0.8 -n | filter-maf-alignment.sh  - 0.7 > mate2Alignments.maf
UpdateMismapProbFromMate.py -1 mate1Alignments.maf -2 mate2Alignments.maf -o mate1Updated.maf --fmean FRAGMENTMEAN --readlen READLENGTH --fsd FRAGMENT_SIZE_STANDARD_DEVIATION -a ALPHA
UpdateMismapProbFromMate.py -1 mate2Alignments.maf -2 mate1Alignments.maf -o mate2Updated.maf --fmean FRAGMENTMEAN --readlen READLENGTH --fsd FRAGMENT_SIZE_STANDARD_DEVIATION -a QUANTILE_LEVEL_FOR_NORMAL_DISTR_N(0,1)
cat mate1Updated.maf mate2Updated.maf > alignments.maf
cat reads1.fastq reads2.fastq > allReads.fastq
jra -Q1 allReads.fastq refGenome.fa alignments.maf > out.sam

Please note that this is an unoptimized, experimental setup only.

7. Troubleshooting

TODO

Please use the following article to cite us or to explore the nitty-gritty details of JRA:

Anish M.S. Shrestha, Martin C. Frith, Kiyoshi Asai, Hugues Richard: Jointly aligning a group of DNA reads improves accuracy of identifying large deletions, Nucleic Acids Research 2017. download

9. Contact us

We would love to hear your complaints, questions, comments, suggestions. Please reach out to us at:

Anish Shrestha: anish (ATmark) edu.k.u-tokyo.ac.jp
Hugues Richard: hugues.richard (ATmark) upmc.fr

Wiki

jra-src / Home

Joint Read Aligner (JRA) User Manual

1. What is JRA?

2. Download and Installation

Binary

Install using pip

Manually run from source tree

3. Quick Usage Guide

4. JRA Output format

5. JRA Advanced Usage Guide

5.1 LAST options

lastdb

lastal

last-split

5.2 JRA options

5.3 Scoring scheme description format

6. Simple hacks

6.1 To go faster

6.2 Workflow for paired-end reads

7. Troubleshooting

9. Contact us

Joint Read Aligner (JRA) User Manual

1. What is JRA?

2. Download and Installation

Binary

Install using pip

Manually run from source tree

3. Quick Usage Guide

4. JRA Output format

5. JRA Advanced Usage Guide

5.1 LAST options

lastdb

lastal

last-split

5.2 JRA options

5.3 Scoring scheme description format

6. Simple hacks

6.1 To go faster

6.2 Workflow for paired-end reads

7. Troubleshooting

8. Publication-related

9. Contact us