KeyError: 'V_SEQ_LENGTH'

Issue #30 resolved
David Koppstein created an issue

Hello, I'm getting the following error when running MakeDb.py. Interestingly, this seems only to occur when running on T-cell receptor aligned output. Will continue digging...

(env2)dkoppstein@master:/dkoppstein/150521SG_v1.8_round2$ source /dkoppstein/150521SG_v1.8_round2/env/env2/bin/activate && python -m pdb `which MakeDb.py` igblast -i intermediate/150521SG_150520-1-P2_S2_L001_TR.fmt7 -s intermediate/150521SG_150520-1-P2_S2_L001_assembled_atleast-2.fasta --fail --outname 150521SG_150520-1-P2_S2_L001_TR --outdir intermediate
> /dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py(4)<module>()
-> """
(Pdb) c
        START> MakeDB
      ALIGNER> IgBlast
ALIGN_RESULTS> 150521SG_150520-1-P2_S2_L001_TR.fmt7
     SEQ_FILE> 150521SG_150520-1-P2_S2_L001_assembled_atleast-2.fasta
     NO_PARSE> False

PROGRESS> 00:16:37 [#                   ]   5% ( 3,440) 0.3 minTraceback (most recent call last):
  File "/usr/lib/python2.7/pdb.py", line 1314, in main
    pdb._runscript(mainpyfile)
  File "/usr/lib/python2.7/pdb.py", line 1233, in _runscript
    self.run(statement)
  File "/usr/lib/python2.7/bdb.py", line 400, in run
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
  File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 4, in <module>
    """
  File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 516, in parseIgBlast
    no_parse=no_parse, out_args=out_args)
  File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 421, in writeDb
    for record in db_gen:
  File "/dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py", line 223, in readIgBlast
    db_gen['N1_LENGTH'] = int(d_align[8]) - int(db_gen['V_SEQ_LENGTH']) - int(db_gen['V_SEQ_START'])
KeyError: 'V_SEQ_LENGTH'
 Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /dkoppstein/src/bitbucket.org/javh/changeo/MakeDb.py(223)readIgBlast()
-> db_gen['N1_LENGTH'] = int(d_align[8]) - int(db_gen['V_SEQ_LENGTH']) - int(db_gen['V_SEQ_START'])
(Pdb) db_gen
{'JUNCTION': 'GGTCACCGTCTCCTCAGGG', 'IN_FRAME': 'F', 'D_CALL': 'TRBD1*01,TRBD2*02', 'J_CALL': 'None', 'STOP': 'T', 'FUNCTIONAL': 'F', 'SEQUENCE_INPUT': 'AGCTCTGGGAGAGGAGCCCAGGACTGGGATTCCGAGGTGTTTCCATTCAGTGATCTGCACTGAACACAGAGGACTCGCCATGGAGTTTGGGCTGAGCTGGGTTTTCCGTGTTGCTATTTTAAAAGGTGTCCAGTTTGAGGTGCGTCTGGTGGAGTCGGGGGGAGACTTGGTCCTGAGAGGGTCCCAGGGACGTTGCGGGGCGCCACCTCTTGCCTGGGGTCCTGGCACTGTTGTCACAATGTGACAGCTGGTTCGAGTCCTGGGGCCATGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCC', 'SEQUENCE_ID': 'GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM', 'D_SEQ_START': '292', 'V_CALL': 'None', 'JUNCTION_LENGTH': 19}
(Pdb) d_align
0                                                     D
1     GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=...
2                                              TRBD1*01
3                                                100.00
4                                                     5
5                                                     0
6                                                     0
7                                                     0
8                                                   292
9                                                   296
10                                                    5
11                                                    9
12                                                  8.4
13                                                 10.4
Name: 2, dtype: object
(Pdb) db_gen
{'JUNCTION': 'GGTCACCGTCTCCTCAGGG', 'IN_FRAME': 'F', 'D_CALL': 'TRBD1*01,TRBD2*02', 'J_CALL': 'None', 'STOP': 'T', 'FUNCTIONAL': 'F', 'SEQUENCE_INPUT': 'AGCTCTGGGAGAGGAGCCCAGGACTGGGATTCCGAGGTGTTTCCATTCAGTGATCTGCACTGAACACAGAGGACTCGCCATGGAGTTTGGGCTGAGCTGGGTTTTCCGTGTTGCTATTTTAAAAGGTGTCCAGTTTGAGGTGCGTCTGGTGGAGTCGGGGGGAGACTTGGTCCTGAGAGGGTCCCAGGGACGTTGCGGGGCGCCACCTCTTGCCTGGGGTCCTGGCACTGTTGTCACAATGTGACAGCTGGTTCGAGTCCTGGGGCCATGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCC', 'SEQUENCE_ID': 'GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM', 'D_SEQ_START': '292', 'V_CALL': 'None', 'JUNCTION_LENGTH': 19}

Corresponding entry in the .fmt7 file:

# IGBLASTN 2.2.29+
# Query: GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM
# Database: db/TR.V.human.F+ORF+infrP.ungapped.fasta db/TR.D.human.F+ORF+infrP.ungapped.fasta db/TR.J.human.F+ORF+infrP.ungapped.fasta
# Domain classification requested: imgt

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches having the same score and percent identity, if present, are separated by a comma.
TRGVA*01        TRBD1*01,TRBD2*02       N/A     VB      Yes     N/A     No      +

# V-(D)-J junction details based on top germline gene matches (V end, V-D junction, D region, D-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
GGTCA   CCGTCTCCT       CAGGG   N/A     N/A

# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, gaps, q. start, q. end, s. start, s. end, evalue, bit score
# 4 hits found
V       GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRGVA*01        75.00   40      8       1
       2       245     282     16      55      0.17    26.8
V       GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRBV6-8*01      65.22   46      16      0
       0       54      99      196     241     1.5     23.7
D       GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRBD1*01        100.00  5       0       0
       0       292     296     5       9       8.4     10.4
D       GTCATCCTTTGTTGTTT_GTTCGAGT|CONSCOUNT=2|PRCONS=0814-bs0-PCR2-IgM TRBD2*02        100.00  5       0       0
       0       294     298     10      14      8.4     10.4

Comments (11)

  1. Namita Gupta

    I don't think the igblast parser is ready for TCR yet. The above error is because the parser specifically looks for N1 nucleotide length, which isn't something to be found in TCR igblast results.

  2. Former user Account Deleted

    Thanks for the info...

    Is their a plan/schedule to make it TCR- compliant ? just wondering about a timeline so that we can make a plan to go back to our old parsing option / look into other options, etc.

    alternatively, can we help ?

  3. Namita Gupta

    Because IgBlast doesn't actually have truly tabular output (fmt7 is barely tabular), a parser must be custom built from looking at example output. I believe the parser is designed to look for modular blocks from the fmt7 output, if you could figure out from a small sample output what the blocks are for which to search, I think the code could be fairly easily extended to have a TCR parser as well. Honestly, I'm not sure when we will get around to it.

  4. Jason Vander Heiden

    I think the problem is the allele regex in DbCore. I suspect "TRGVA*01" won't be recognized as a valid allele, because the "A" should be a number. I've got some errands to run, but I can take a look at it when I get back in a couple hours and see if I can fix it.

    Do you mind attaching a copy of the fmt7 output and the input sequence (in fasta format)... Just the offending entry. I'll add it to the test cases. I see David included the fmt7 output above, but the tab characters get converted to spaces by the issue system.

  5. David Koppstein reporter

    Hi Jason, here are the files. Thanks for looking into this. Am at a conference at the moment but will be happy to do what I can to help this evening or tomorrow.

    Cheers David

  6. Jason Vander Heiden

    I updated the regex patterns in DbCore to match TRGVA*01, which as near as I can tell is the only TCR gene name with a letter (A) for the gene number. So hopefully this doesn't break anything.

    The test case seems to work now, so I'd give it a try. MakeDb needs some better error checking and some reoganization to make it easier to test, but this might take awhile to get to. I'll try to plink at it this week or next week.

    If you hit anymore cases that don't work, just send the fmt7 and fasta entries my way and I'll see what I can do.

  7. David Koppstein reporter

    Hi Jason, thanks for fixing this so quickly. I confirm that it now runs successfully for our TCR data (I guess it was just that one strange TCR allele).

  8. Jason Vander Heiden

    Great. Yeah, sort of non-obvious that the allele call was the issue though... the V alignment is needed to calculate the N positions, so if it fails in finding the V, then later steps fail. It definitely needs more checks to avoid a chain of fails.

  9. Log in to comment