Wiki
Clone wikijra-src / Home
Joint Read Aligner (JRA) User Manual
1. What is JRA?
JRA is a split-alignment tool to infer structural variants (currently implemented for deletions only) from large short-read datasets. The idea of JRA in 3 simple pictures:
2. Download and Installation
You can choose from one of the following ways to get JRA running on your computer.
Binary
Download a 64-bit Linux stand-alone executable from here .
Make sure to add to $PATH the path to this executable.
GLIBC >= 2.19 is required. You can check your version by executing ldd --version
.
Install using pip
TODO
Manually run from source tree
Clone the JRA source code :
git clone https://bitbucket.org/jointreadalignment/jra-src
Next, manually install all the Python dependencies listed in requirements.txt. Then the jra.py script in the top folder will launch JRA.
3. Quick Usage Guide
To run JRA in its default mode, you need to provide it with 3 things: (1) DNA reads (fasta/fastq format), (2) reference genome (fasta format), (3) pairwise local alignments of the reads to the reference (currently only accepts alignments from LAST. Click here for instructions to install LAST). This is the default JRA workflow:
lastdb -uNEAR -C2 db refGenome.fa lastal -d96 -Q1 -j1 -i1 db reads.fastq | last-split -m 0.7 -n > LASTAlignments.maf jra -Q1 reads.fastq refGenome.fa LASTAlignments.maf > out.sam
jra
in the third line corresponds to jra.py
in the project top folder. Note2: For huge datasets (eg. human sequencing projects), see Simple hacks section for a faster recipe.
The first two lines are LAST commands. lastdb
constructs an index of the reference genome. lastal
computes read-to-reference local alignments, which will act as seeds for JRA. last-split
computes error probabilities of the local alignments.
The last line runs JRA to perform the actual joint alignment.
To learn about all the options appearing here, see the JRA advanced usage guide below.
4. JRA Output format
The output is in SAM-like format. For each predicted deletion, JRA reports along with the alignment, its error probability, and the sequence reconstructed in the deletion locus. Consider an example output line.
5 0 chr1 186825927 0.002 40M403D31M * 0 87 AGGTTAGCAGAAATTTGTCAGTTAAAGAGAGAGAAAATGATGGCAGTCAGTAATAAGGCTGACTGCCTTAT * 9250.0 35473,10625635,29547858,30292642
Where's what each field means:
Field | Meaning | Example | Remarks |
---|---|---|---|
1 | ID | 5 | arbitrarily assigned ID for the deletion event |
2 | unused | 0 | for SAM format compatibility. value fixed to 0 |
3 | ref sequence name | chr1 | - |
4 | 1-based leftmost position where alignment starts | 186825927 | - |
5 | Joint-alignment error probability | 0.002 | Use this as a measure of confidence of the structural variant described by the alignment |
6 | CIGAR string | 56M403D31M | See SAM format for description of CIGAR |
7 | unused | * | for SAM format compatibility. value fixed to * |
8 | unused | 0 | for SAM format compatibility. value fixed to 0 |
9 | length of reconstructed sequence | 87 | - |
10 | reconstructed sequence | GTC ... TAT | reconstruction of the sampled genome in the deletion locus |
11 | unused | * | for SAM format compatibility. value fixed to * |
12 | joint alignment score | 9250.0 | do not use this score for ranking/filtering purposes. use Field 5 instead |
13 | supporting reads | 35473,10625635,29547858,30292642 | names(IDs) of reads which were jointly aligned to infer this deletion |
5. JRA Advanced Usage Guide
5.1 LAST options
LAST maintains its own full-fledged documentation here. We only show options relevant to JRA.
lastdb
This LAST module creates an index (suffix array) of the reference.
option | Meaning | JRA recommended/mandated setting | Remarks |
---|---|---|---|
-u | spaced-seed pattern | NEAR | Suitable when reference and query are closely related. |
-C | construct child table along with suffix array | 2 | Makes alignments faster, but uses more memory (e.g. for human genome, suffix array only = 15GB, with child table = 20G). Do not provide this argument, if memory is an issue. |
lastal
This is the LAST module that computes local alignments of query and reference sequences.
option | Meaning | JRA recommended/mandated setting | Remarks |
---|---|---|---|
-Q | query sequences file format | 0 if fasta, 1 if fastq | -- |
-j | gapless alignments | 1 (mandatory) | JRA can currently only process gapless alignments. |
-i | batch size | 1 (mandatory) | ensures that alignments appear in the same order as reads |
-r | score for match | 6 | this is the default value if lastdb was created using -uNEAR option |
-q | mismatch cost | 18 | this is the default value if lastdb was created using -uNEAR option |
-d | minimum score of local alignment | 96 | recommended value to go with -r 6 and -q 18 scheme. This allows matches with length as short as 16. |
last-split
This is the LAST module that computes the error probability of a local alignment, which is a crucial piece of information that JRA uses.
option | Meaning | JRA recommended/mandated setting | Remarks |
---|---|---|---|
-n | list all original local alignments | mandatory switch | Without this switch, last-split computes a split alignment based on a single read. |
-m | mismap probability | between 0.6 to 0.8 | Local alignment with error probability more than this value is not reported. Higher values increase sensitivity of downstream JRA, but also decrease specificity. |
5.2 JRA options
option | Meaning | Possible Values | Default value | Remarks |
---|---|---|---|---|
-Q | 1 if reads are in fastq format, else 0 | 0 or 1 | 0 | -- |
-d | minimum deletion size | integer in range (1,inf) | 20 | JRA will only search for deletions at least this size |
-g | scoring scheme between reference and sampled genomes | File describing scoring scheme | defaults scheme is described in defaultScores/scoreGenome.mat | See below for file format. |
-s | scoring scheme modeling sequencing errors | File describing scoring scheme | defaultScores/scoreSequencer.mat | See below for file format. |
-m | local alignments from LAST that have mismap probability higher than this value are filtered out by JRA | float in range (0,1) | 0.6 | Use same value as for -m option in last-split. Higher values increase sensitivity at the cost of longer running time and decreased specificity. |
-e | filter out joint-alignments with error probability higher than this value | (0,1) | 0.01 | higher values increase sensitivity but decrease specifity of JRA |
5.3 Scoring scheme description format
JRA uses two scoring schemes: the first to model evolutionary divergence between the reference and sampled genome, and the second to model sequencer errors.
The reference-to-sample scheme consists of a substitution matrix, affine gap penalty, and a split cost for structural variant. These should be specified in a text file as shown below, and provided to JRA with -g option. If no scoring scheme is provided, values shown below are used. Rows correspond to the reference and the columns to the query.
A C G T A 15 -80 -80 -80 C -80 15 -80 -80 G -80 -80 15 -80 T -80 -80 -80 15 delExist -105 delExtend -40 insExist -105 instExtend -40 Split -500
The scoring scheme for sequencing errors consists of a substitution matrix and a linear gap cost (to be implemented). These should be specified in a text file as shown below, and provided to JRA as -s option. If no scoring scheme is provided, values shown below are used.
A C G T A 15 -15 -15 -15 C -15 15 -15 -15 G -15 -15 15 -15 T -15 -15 -15 15 delExist 0 delExtend -30 insExist 0 instExtend -30
6. Simple hacks
6.1 To go faster
Python's file reading routines are painfully slow. This might be a bottleneck for large datasets.
The filter-maf-alignments.sh
script provided in the scripts folder filters out uninteresting alignments from last-split, resulting in significantly faster running times for the downstream JRA. You just need to modify the second line in the workflow shown in the Quick Usage guide.
lastal -d96 -Q1 -j1 db reads.fastq | last-split -m 0.7 -n | filter-maf-alignment.sh - 0.7 > LASTAlignments.maf
fiter-maf-alignment.sh
matches the -m value provided to last-split
.
6.2 Workflow for paired-end reads
JRA can incorporate paired-end information to improve its split alignments. Use the UpdateMismapProbFromMate.py
in the scripts/ folder in the following manner:
lastal -d96 -Q1 -j1 -i1 db reads1.fastq | last-split -m 0.8 -n | filter-maf-alignment.sh - 0.7 > mate1Alignments.maf lastal -d96 -Q1 -j1 -i1 db reads2.fastq | last-split -m 0.8 -n | filter-maf-alignment.sh - 0.7 > mate2Alignments.maf UpdateMismapProbFromMate.py -1 mate1Alignments.maf -2 mate2Alignments.maf -o mate1Updated.maf --fmean FRAGMENTMEAN --readlen READLENGTH --fsd FRAGMENT_SIZE_STANDARD_DEVIATION -a ALPHA UpdateMismapProbFromMate.py -1 mate2Alignments.maf -2 mate1Alignments.maf -o mate2Updated.maf --fmean FRAGMENTMEAN --readlen READLENGTH --fsd FRAGMENT_SIZE_STANDARD_DEVIATION -a QUANTILE_LEVEL_FOR_NORMAL_DISTR_N(0,1) cat mate1Updated.maf mate2Updated.maf > alignments.maf cat reads1.fastq reads2.fastq > allReads.fastq jra -Q1 allReads.fastq refGenome.fa alignments.maf > out.sam
7. Troubleshooting
TODO
8. Publication-related
Please use the following article to cite us or to explore the nitty-gritty details of JRA:
Anish M.S. Shrestha, Martin C. Frith, Kiyoshi Asai, Hugues Richard: Jointly aligning a group of DNA reads improves accuracy of identifying large deletions, Nucleic Acids Research 2017. download
9. Contact us
We would love to hear your complaints, questions, comments, suggestions. Please reach out to us at:
- Anish Shrestha: anish (ATmark) edu.k.u-tokyo.ac.jp
- Hugues Richard: hugues.richard (ATmark) upmc.fr
Updated