KeyError: 'V_SEQ_LENGTH'
Hello, I'm getting the following error when running MakeDb.py. Interestingly, this seems only to occur when running on T-cell receptor aligned output. Will continue digging...
(env2)dkoppstein@master:/dkoppstein/150521SG_v1.8_round2$ source /dkoppstein/150521SG_v1.8_round2/env/env2/bin/activate && python -m pdb `which MakeDb.py` igblast -i intermediate/150521SG_150520-1-P2_S2_L001_TR.fmt7 -s intermediate/150521SG_150520-1-P2_S2_L001_assembled_atleast-2.fasta --fail --outname 150521SG_150520-1-P2_S2_L001_TR --outdir intermediate
> /dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py(4)<module>()
-> """
(Pdb) c
START> MakeDB
ALIGNER> IgBlast
ALIGN_RESULTS> 150521SG_150520-1-P2_S2_L001_TR.fmt7
SEQ_FILE> 150521SG_150520-1-P2_S2_L001_assembled_atleast-2.fasta
NO_PARSE> False
PROGRESS> 00:16:37 [# ] 5% ( 3,440) 0.3 minTraceback (most recent call last):
File "/usr/lib/python2.7/pdb.py", line 1314, in main
pdb._runscript(mainpyfile)
File "/usr/lib/python2.7/pdb.py", line 1233, in _runscript
self.run(statement)
File "/usr/lib/python2.7/bdb.py", line 400, in run
exec cmd in globals, locals
File "<string>", line 1, in <module>
File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 4, in <module>
"""
File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 516, in parseIgBlast
no_parse=no_parse, out_args=out_args)
File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 421, in writeDb
for record in db_gen:
File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 223, in readIgBlast
db_gen['N1_LENGTH'] = int(d_align[8]) - int(db_gen['V_SEQ_LENGTH']) - int(db_gen['V_SEQ_START'])
KeyError: 'V_SEQ_LENGTH'
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py(223)readIgBlast()
-> db_gen['N1_LENGTH'] = int(d_align[8]) - int(db_gen['V_SEQ_LENGTH']) - int(db_gen['V_SEQ_START'])
(Pdb) db_gen
{'JUNCTION': 'GGTCACCGTCTCCTCAGGG', 'IN_FRAME': 'F', 'D_CALL': 'TRBD1*01,TRBD2*02', 'J_CALL': 'None', 'STOP': 'T', 'FUNCTIONAL': 'F', 'SEQUENCE_INPUT': 'AGCTCTGGGAGAGGAGCCCAGGACTGGGATTCCGAGGTGTTTCCATTCAGTGATCTGCACTGAACACAGAGGACTCGCCATGGAGTTTGGGCTGAGCTGGGTTTTCCGTGTTGCTATTTTAAAAGGTGTCCAGTTTGAGGTGCGTCTGGTGGAGTCGGGGGGAGACTTGGTCCTGAGAGGGTCCCAGGGACGTTGCGGGGCGCCACCTCTTGCCTGGGGTCCTGGCACTGTTGTCACAATGTGACAGCTGGTTCGAGTCCTGGGGCCATGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCC', 'SEQUENCE_ID': 'GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM', 'D_SEQ_START': '292', 'V_CALL': 'None', 'JUNCTION_LENGTH': 19}
(Pdb) d_align
0 D
1 GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=...
2 TRBD1*01
3 100.00
4 5
5 0
6 0
7 0
8 292
9 296
10 5
11 9
12 8.4
13 10.4
Name: 2, dtype: object
(Pdb) db_gen
{'JUNCTION': 'GGTCACCGTCTCCTCAGGG', 'IN_FRAME': 'F', 'D_CALL': 'TRBD1*01,TRBD2*02', 'J_CALL': 'None', 'STOP': 'T', 'FUNCTIONAL': 'F', 'SEQUENCE_INPUT': 'AGCTCTGGGAGAGGAGCCCAGGACTGGGATTCCGAGGTGTTTCCATTCAGTGATCTGCACTGAACACAGAGGACTCGCCATGGAGTTTGGGCTGAGCTGGGTTTTCCGTGTTGCTATTTTAAAAGGTGTCCAGTTTGAGGTGCGTCTGGTGGAGTCGGGGGGAGACTTGGTCCTGAGAGGGTCCCAGGGACGTTGCGGGGCGCCACCTCTTGCCTGGGGTCCTGGCACTGTTGTCACAATGTGACAGCTGGTTCGAGTCCTGGGGCCATGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCC', 'SEQUENCE_ID': 'GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM', 'D_SEQ_START': '292', 'V_CALL': 'None', 'JUNCTION_LENGTH': 19}
Corresponding entry in the .fmt7 file:
# IGBLASTN 2.2.29+
# Query: GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM
# Database: db/TR.V.human.F+ORF+infrP.ungapped.fasta db/TR.D.human.F+ORF+infrP.ungapped.fasta db/TR.J.human.F+ORF+infrP.ungapped.fasta
# Domain classification requested: imgt
# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches having the same score and percent identity, if present, are separated by a comma.
TRGVA*01 TRBD1*01,TRBD2*02 N/A VB Yes N/A No +
# V-(D)-J junction details based on top germline gene matches (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
GGTCA CCGTCTCCT CAGGG N/A N/A
# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 4 hits found
V GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRGVA*01 75.00 40 8 1
2 245 282 16 55 0.17 26.8
V GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRBV6-8*01 65.22 46 16 0
0 54 99 196 241 1.5 23.7
D GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRBD1*01 100.00 5 0 0
0 292 296 5 9 8.4 10.4
D GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRBD2*02 100.00 5 0 0
0 294 298 10 14 8.4 10.4
Comments (11)
-
-
Account Deleted Thanks for the info...
Is their a plan/schedule to make it TCR- compliant ? just wondering about a timeline so that we can make a plan to go back to our old parsing option / look into other options, etc.
alternatively, can we help ?
-
Because IgBlast doesn't actually have truly tabular output (fmt7 is barely tabular), a parser must be custom built from looking at example output. I believe the parser is designed to look for modular blocks from the fmt7 output, if you could figure out from a small sample output what the blocks are for which to search, I think the code could be fairly easily extended to have a TCR parser as well. Honestly, I'm not sure when we will get around to it.
-
I think the problem is the allele regex in DbCore. I suspect "TRGVA*01" won't be recognized as a valid allele, because the "A" should be a number. I've got some errands to run, but I can take a look at it when I get back in a couple hours and see if I can fix it.
Do you mind attaching a copy of the fmt7 output and the input sequence (in fasta format)... Just the offending entry. I'll add it to the test cases. I see David included the fmt7 output above, but the tab characters get converted to spaces by the issue system.
-
reporter - attached input.fmt7
- attached input.fasta
Hi Jason, here are the files. Thanks for looking into this. Am at a conference at the moment but will be happy to do what I can to help this evening or tomorrow.
Cheers David
-
-
assigned issue to
-
assigned issue to
-
- changed status to resolved
I updated the regex patterns in DbCore to match TRGVA*01, which as near as I can tell is the only TCR gene name with a letter (A) for the gene number. So hopefully this doesn't break anything.
The test case seems to work now, so I'd give it a try. MakeDb needs some better error checking and some reoganization to make it easier to test, but this might take awhile to get to. I'll try to plink at it this week or next week.
If you hit anymore cases that don't work, just send the fmt7 and fasta entries my way and I'll see what I can do.
-
- changed status to open
Oops... I should leave the issue open until it's tested.
-
Account Deleted thanks!
-
reporter - changed status to resolved
Hi Jason, thanks for fixing this so quickly. I confirm that it now runs successfully for our TCR data (I guess it was just that one strange TCR allele).
-
Great. Yeah, sort of non-obvious that the allele call was the issue though... the V alignment is needed to calculate the N positions, so if it fails in finding the V, then later steps fail. It definitely needs more checks to avoid a chain of fails.
- Log in to comment
I don't think the igblast parser is ready for TCR yet. The above error is because the parser specifically looks for N1 nucleotide length, which isn't something to be found in TCR igblast results.