no junction or junction_aa in AIRR/ChangeO format
Issue #118
resolved
I updated to the latest ChangeO and airr-standards code, and MakeDB.py runs fine but I'm not getting the junction sequence. Maybe this is because I'm using IgBlast 1.8.0, or does IgBlast have to be called differently now? The junction also isn't showing up in the ChangeO output format.
Here's IgBlast output for one sequence:
# IGBLASTN
# Query: IS9DOBB01AS0UE
# Database: /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_IG_V.fna /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/hu
man_IG_D.fna /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_IG_J.fna
# Domain classification requested: imgt
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches, if present,
are separated by a comma.
IGHV4-59*08 IGHD3-10*01 IGHJ4*02 VH No In-frame Yes +
# V-(D)-J junction details based on top germline gene matches (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could
be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
AGACA TGAAGCAACGGGC ACTATGATTCGGG CAGCCACTACCCC CTTTG
# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR1-IMGT 1 54 54 54 0 0 100
CDR1-IMGT 55 78 24 24 0 0 100
FR2-IMGT 79 129 51 50 1 0 98
CDR2-IMGT 130 150 21 21 0 0 100
FR3-IMGT 151 264 114 109 5 0 95.6
CDR3-IMGT (germline) 265 272 8 8 0 0 100
Total N/A N/A 272 266 6 0 97.8
# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, query gi, query acc., query acc.ver, query length, subject id, subject ids, subject gi, subject gis, subject acc., subject acc.ver, subject accs., subject length, q. start, q. end, s. sta
rt, s. end, query seq, subject seq, evalue, bit score, score, alignment length, % identity, identical, mismatches, positives, gap opens, gaps, % positives, query/sbjct frames, query frame, sbjct frame, BTOP
# 9 hits found
V IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHV4-59*08 IGHV4-59*08 0 0 IGHV4-59*08 IGHV4-59*08 IGHV4-59*08 293 1 272 22 293
GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGATATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGGGTCGAGTCACCATGTCAGTAGACACGTCCA
AGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCTTTTACTACTGTGCGAGACA GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTAT
ATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGACA 1.43e-115 406 260 272 97.794 266
6 266 0 0 97.79 1/1 1 1 125AG45GA13GA66TG1TG2CT14
V IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHV4-59*01 IGHV4-59*01 0 0 IGHV4-59*01 IGHV4-59*01 IGHV4-59*01 293 1 270 22 291
GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGATATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGGGTCGAGTCACCATGTCAGTAGACACGTCCA
AGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCTTTTACTACTGTGCGAGA GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTA
CAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGA 9.31e-113 397 254 270 97.037 262 8 262
0 0 97.04 1/1 1 1 125AG45GA13GA53CT2AG9TG1TG2CT12
V IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHV4-4*08 IGHV4-4*08 0 0 IGHV4-4*08 IGHV4-4*08 IGHV4-4*08 293 1 270 22 291
GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGATATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGGGTCGAGTCACCATGTCAGTAGACACGTCCA
AGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCTTTTACTACTGTGCGAGA GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATAC
CAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCCGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGA 8.07e-112 394 252 270 96.667 261 9 261
0 0 96.67 1/1 1 1 125AG9TAAC34GA13GA2AC63TG1TG2CT12
D IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHD3-10*01 IGHD3-10*01 0 0 IGHD3-10*01 IGHD3-10*01 IGHD3-10*01 31 286 298 6 18
ACTATGATTCGGG ACTATGGTTCGGG 0.22 19.9 10 13 92.308 12 1 12 0 0 92.31 1/1 1 1 6AG6
D IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHD3-22*01 IGHD3-22*01 0 0 IGHD3-22*01 IGHD3-22*01 IGHD3-22*01 31 286 293 6 13
ACTATGAT ACTATGAT 3.2 16.1 8 8 100.000 8 0 8 0 0 100.00 1/1 1 1 8
D IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHD3-16*01 IGHD3-16*01 0 0 IGHD3-16*01 IGHD3-16*01 IGHD3-16*01 37 288 294 5 11
TATGATT TATGATT 12 14.1 7 7 100.000 7 0 7 0 0 100.00 1/1 1 1 7
J IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHJ4*02 IGHJ4*02 0 0 IGHJ4*02 IGHJ4*02 IGHJ4*02 48 312 354 5 47
CTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA CTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA 3.06e-20 83.4 43 43 100.000 43 0 43 0 0 100.00 1/1 1
1 43
J IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHJ4*01 IGHJ4*01 0 0 IGHJ4*01 IGHJ4*01 IGHJ4*01 48 312 354 5 47
CTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA CTTTGACTACTGGGGCCAAGGAACCCTGGTCACCGTCTCCTCA 1.67e-18 77.6 40 43 97.674 42 1 42 0 0 97.67 1/1 1
1 18GA24
J IS9DOBB01AS0UE 0 IS9DOBB01AS0UE IS9DOBB01AS0UE 354 IGHJ4*03 IGHJ4*03 0 0 IGHJ4*03 IGHJ4*03 IGHJ4*03 48 312 354 5 47
CTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA CTTTGACTACTGGGGCCAAGGGACCCTGGTCACCGTCTCCTCA 9.08e-17 71.8 37 43 95.349 41 2 41 0 0 95.35 1/1 1
1 18GA2AG21
and the AIRR output:
sequence_id sequence rev_comp productive v_call d_call j_call c_call sequence_alignment germline_alignment junction junction_aa v_score v_cigar d_score d_cigar
j_score j_cigar c_score c_cigar vj_in_frame stop_codon cdr1 cdr2 cdr3 fwr1 fwr2 fwr3 fwr4 v_identity v_evalue d_identity d_evalue j_identity j_eval
ue v_sequence_start v_sequence_end v_germline_start v_germline_end d_sequence_start d_sequence_end d_germline_start d_germline_end j_sequence_start j_sequence_en
d j_germline_start j_germline_end junction_length np1_length np2_length
IS9DOBB01AS0UE GGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGATATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGGGTCGAGTCACCATGTCAGT
AGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCTTTTACTACTGTGCGAGACATGAAGCAACGGGCACTATGATTCGGGCAGCCACTACCCCCTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA F T IGHV4-59*08 IGHD3-
10*01 IGHJ4*02 .....................GGCCCA...GGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATC............AGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGATATATCTATTA
CAGT.........GGGAGCACCAACTACAACCCCTCCCTCAAG...GGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCTTTTACTACTGTGCGAGACATGAAGCAACGGGCACTATGATTCGGGCAGCCACTACCCCCTTTGACTACTGGGGCCAGGG
AACCCTGGTCACCGTCTCCTCA 406.0 21N125=1X45=1X13=1X66=1X1=1X2=1X14= 19.9 5N285S6=1X6= 83.4 4N311S43= T F GGTGGCTCCATC............AGT
AGTTACTAC ATCTATTACAGT.........GGGAGCACC .....................GGCCCA...GGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCT TGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGATAT AACTACAACCCCTCCCTC
AAG...GGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCTTTTACTACTGT 0.9779399999999999 1.43e-115 0.9230800000000001 0.22 1.0 3.06e-20 0
272 0 320 285 298 5 18 311 354 4 47 13 13
Comments (5)
-
-
Also, the input fasta file as well? I can test it on my end.
-
reporter It is IgBlast 1.8 so maybe I'm missing some flag so that it produces that output. Do you see any difference between my
outfmt
and yours?igblastn -num_threads 1 -query 01.fasta -ig_seqtype Ig -organism human -auxiliary_data /work/01114/vdj/lonestar/../common/igblast-db/db/optional_file/human_gl.aux -germline_db_V /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_IG_V.fna -germline_db_D /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_IG_D.fna -germline_db_J /work/0 1114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_IG_J.fna -domain_system imgt -outfmt "7 qseqid qgi qacc qaccver qlen sseqid sallseqid sgi sallgi sacc saccver sallacc slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos frames qframe sframe btop"
-
reporter - changed status to resolved
I figured it out. Needed updated
optional_file
files for the database. -
Ah, okay. Just for future reference, this is how we run igblast:
Which boils down to:
export IGDATA=~/share/igblast igblastn -query input.fasta -out output.fmt7 -num_threads 3 \ -germline_db_V ${IGDATA}/database/imgt_human_ig_v \ -germline_db_D ${IGDATA}/database/imgt_human_ig_d \ -germline_db_J ${IGDATA}/database/imgt_human_ig_j \ -auxiliary_data ${IGDATA}/optional_file/human_gl.aux \ -ig_seqtype Ig -organism human -domain_system imgt \ -outfmt '7 std qseq sseq btop'"
- Log in to comment
Well, I've been banging on the code a lot lately. It's possible I messed something up, but I don't see an issue with my test set.
Could you attach or email a TSV of the output for me? What's pasted into the issue is all jumbled up (line wrap, tabs to spaces).
IgBLAST 1.8 should work fine, though there is a potential bug in it (some fields are occasionally missing). But... That doesn't look like output from 1.8. It doesn't have the
Sub-region
section. For example: