MakeDb crashes with sequence id

Issue #92 resolved
Scott Christley created an issue

I found this sequence because it was causing a crash in repsum. While building a test case from it, I ran into this MakeDb crash by accident. The sequence id is definitely the cause: if I change it, MakeDb runs fine.

input file with just a single sequence

>repsum_issue_36
GCAGTTTGTCTGACCCCCTGCTAACTGCAAGCCTCCAGGTCCAGTCTGATTCCATTCTTA

Here is the igblast output

# IGBLASTN 2.2.29+
# Query: repsum_issue_36
# Database: /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_TR_V.fna /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_TR_D.fna /work/01114/vdj/lonestar/../common/igblast-db/db/10_05_2016//human/ReferenceDirectorySet/human_TR_J.fna
# Domain classification requested: imgt

# Note that your query represents the minus strand of a V gene and has been converted to the plus strand. The sequence positions refer to the converted sequence. 

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand).  Multiple equivalent top matches having the same score and percent identity, if present, are separated by a comma.
TRAV7*01        TRAJ9*01        VA      No      Out-of-frame    No      -

# V-(D)-J junction details based on top germline gene matches (V end, V-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
TGGAC   N/A     CTGGA   

# Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FR3-IMGT        2       22      21      17      4       0       81
Total   N/A     N/A     21      17      4       0       81

# Hit table (the first field indicates the chain type of the hit)
# Fields: query id, query gi, query acc., query acc.ver, query length, subject id, subject ids, subject gi, subject gis, subject acc., subject acc.ver, subject accs., subject length, q. start, q. end, s. start, s. end, query seq, subject seq, evalue, bit score, score, alignment length, % identity, identical, mismatches, positives, gap opens, gaps, % positives, query/sbjct frames, query frame, sbjct frame, BTOP
# 6 hits found
V       reversed|repsum_issue_36        0       reversed|repsum_issue_36        reversed|repsum_issue_36        60      TRAV7*01        TRAV7*01        0       0       TRAV7*01        TRAV7*01        TRAV7*01        274     2       22      202     222     AAGAATGGAATCAGACTGGAC   AAGAATGGAAGCAGCTTGTAC   0.59    22.1    13      21      80.95   17      4       17      0       0       80.95   1/1     1       1       10TG3ACCT2GT2
V       reversed|repsum_issue_36        0       reversed|repsum_issue_36        reversed|repsum_issue_36        60      TRDV3*01        TRDV3*01        0       0       TRDV3*01        TRDV3*01        TRDV3*01        290     50      59      145     136     AGACAAACTG      AGACAAACTG         15   17.4    10      10      100.00  10      0       10      0       0       100.00  1/1     1       1       10
V       reversed|repsum_issue_36        0       reversed|repsum_issue_36        reversed|repsum_issue_36        60      TRAV8-7*01      TRAV8-7*01      0       0       TRAV8-7*01      TRAV8-7*01      TRAV8-7*01      290     23      32      135     126     CTGGAGGCTT      CTGGAGGCTT         15   17.4    10      10      100.00  10      0       10      0       0       100.00  1/1     1       1       10
J       reversed|repsum_issue_36        0       reversed|repsum_issue_36        reversed|repsum_issue_36        60      TRAJ9*01        TRAJ9*01        0       0       TRAJ9*01        TRAJ9*01        TRAJ9*01        61      23      32      8       17      CTGGAGGCTT      CTGGAGGCTT      0.18    20.3    10      10      100.00  10      0       10      0       0       100.00  1/1     1       1       10
J       reversed|repsum_issue_36        0       reversed|repsum_issue_36        reversed|repsum_issue_36        60      TRAJ24*02       TRAJ24*02       0       0       TRAJ24*02       TRAJ24*02       TRAJ24*02       63      31      38      24      31      TTGCAGTT        TTGCAGTT        2.8     16.4    8       8       100.00  8       0       8       0       0       100.00  1/1     1       1       8
J       reversed|repsum_issue_36        0       reversed|repsum_issue_36        reversed|repsum_issue_36        60      TRAJ47*01       TRAJ47*01       0       0       TRAJ47*01       TRAJ47*01       TRAJ47*01       57      52      59      13      20      ACAAACTG        ACAAACTG        2.8     16.4    8       8       100.00  8       0       8       0       0       100.00  1/1     1       1       8
# BLAST processed 1 queries

and the command line for MakeDb

MakeDb.py igblast -s $1 -i $2 -r $VDJ_DB_ROOT/human/ReferenceDirectorySet/TR_VDJ.fna --regions --scores

Let me know if you need the germline db file. Here is the stack trace:

  File "/scratch/01114/vdj/vdj/job-6157425024944893465-242ac11c-0001-007-igblast_test/bin/MakeDb.py", line 4, in <module>
    __import__('pkg_resources').run_script('changeo==0.3.4.999', 'MakeDb.py')
  File "/opt/apps/gcc5_2/python3/3.5.1/lib/python3.5/site-packages/pkg_resources/__init__.py", line 735, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/apps/gcc5_2/python3/3.5.1/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1652, in run_script
    exec(code, namespace, namespace)
  File "/scratch/01114/vdj/vdj/job-6157425024944893465-242ac11c-0001-007-igblast_test/lib/python3.5/site-packages/changeo-0.3.4.999-py3.5.egg/EGG-INFO/scripts/MakeDb.py", line 555, in <module>
    args.func(**args_dict)
  File "/scratch/01114/vdj/vdj/job-6157425024944893465-242ac11c-0001-007-igblast_test/lib/python3.5/site-packages/changeo-0.3.4.999-py3.5.egg/EGG-INFO/scripts/MakeDb.py", line 289, in parseIgBLAST
    no_parse=no_parse, partial=partial, out_args=out_args)
  File "/scratch/01114/vdj/vdj/job-6157425024944893465-242ac11c-0001-007-igblast_test/lib/python3.5/site-packages/changeo-0.3.4.999-py3.5.egg/EGG-INFO/scripts/MakeDb.py", line 122, in writeDb
    for i, record in enumerate(db, start=1):
  File "/scratch/01114/vdj/vdj/job-6157425024944893465-242ac11c-0001-007-igblast_test/lib/python3.5/site-packages/changeo-0.3.4.999-py3.5.egg/changeo/Parsers.py", line 1096, in __next__
    db = self.parseSections(sections)
  File "/scratch/01114/vdj/vdj/job-6157425024944893465-242ac11c-0001-007-igblast_test/lib/python3.5/site-packages/changeo-0.3.4.999-py3.5.egg/changeo/Parsers.py", line 1021, in parseSections
    db['SEQUENCE_INPUT'] = str(self.seq_dict[query].seq)
KeyError: 'psum_issue_36'

Comments (5)

  1. Jason Vander Heiden

    That should be enough, thanks. Off the top of my head, I see no good reason why it would truncate the sequence id to psum_issue_36 unless there are special characters hidden in the id. I'll take a look tomorrow.
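
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the IgBLAST hit table reports the query as `reversed|repsum_issue_36`, and the failing key is `psum_issue_36`. That is exactly what Python's `str.lstrip` produces when it is given `'reversed|'`, because `lstrip` treats its argument as a set of characters to strip and not as a literal prefix. The leading `re` of `repsum` is made of characters from that set, so it gets stripped too. Whether `Parsers.py` actually calls `lstrip` here is not shown in this thread; the helper names below are hypothetical and only illustrate the pitfall.

```python
# Hypothetical sketch: not taken from changeo's Parsers.py.
PREFIX = "reversed|"


def strip_prefix_buggy(query_id):
    # str.lstrip removes any leading characters found in the set
    # {'r', 'e', 'v', 's', 'd', '|'}, not the literal string "reversed|".
    return query_id.lstrip(PREFIX)


def strip_prefix_exact(query_id):
    # Remove the literal prefix only when it is actually present.
    if query_id.startswith(PREFIX):
        return query_id[len(PREFIX):]
    return query_id


query = "reversed|repsum_issue_36"
print(strip_prefix_buggy(query))  # psum_issue_36
print(strip_prefix_exact(query))  # repsum_issue_36
```

If this is the cause, any sequence id whose first characters are drawn from `r`, `e`, `v`, `s`, `d`, or `|` would be affected whenever IgBLAST reverse-complements the query, which would also explain why renaming the sequence avoids the crash.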