MakeDb.py igblast fails with KeyError: '# Query:'

Issue #186 resolved
Tartu Immunology created an issue

I’ve been using the Docker container from the tutorial to align my own data. The following runs fine on the file I generated with presto:

%%bash
AssignGenes.py igblast \
-s results/Sample_1_L001-C_atleast-2.fastq \
-b /usr/local/share/igblast --organism human \
--loci ig --format blast --outdir results/igblast --nproc 8

The resulting file looks like:

# IGBLASTN
# Query: 
# Database: /usr/local/share/igblast/database/imgt_human_ig_v /usr/local/share/igblast/database/imgt_human_ig_d /usr/local/share/igblast/database/imgt_human_ig_j
# Domain classification requested: imgt

# V-(D)-J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand, V Frame shift).  Multiple equivalent top matches, if present, are separated by a comma.
IGHV6-1*01  IGHD6-13*01,IGHD6-25*01 IGHJ4*02    VH  No  In-frame    Yes +   No

# V-(D)-J junction details based on top germline gene matches (V end, V-D junction, D region, D-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself
AGAGA   TC  GGTATAGCAGC CT  CTTTG   

Now, when I run

%%bash
sudo mkdir -p results/changeo
sudo MakeDb.py igblast \
-s results/Sample_1_L001-C_atleast-2.fastq -i results/igblast/Sample_1_L001-C_atleast-2_igblast.fmt7 \
--format airr \
-r /usr/local/share/germlines/imgt/human/vdj/ --outdir results/changeo \
--outname Sample_1

it fails with

Traceback (most recent call last):
  File "/usr/local/bin/MakeDb.py", line 897, in <module>
    args.func(**args_dict)
  File "/usr/local/bin/MakeDb.py", line 542, in parseIgBLAST
    output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count,
  File "/usr/local/bin/MakeDb.py", line 274, in writeDb
    for i, record in enumerate(records, start=1):
  File "/usr/local/bin/MakeDb.py", line 541, in <genexpr>
    germ_iter = (addGermline(x, references, amino_acid=amino_acid) for x in parse_iter)
  File "/usr/local/lib/python3.9/site-packages/changeo/IO.py", line 1531, in __next__
    db = self.parseSections(sections)
  File "/usr/local/lib/python3.9/site-packages/changeo/IO.py", line 1438, in parseSections
    db['sequence_input'] = str(self.sequences[query].seq)
KeyError: '# Query:'

Is that an issue with the input fastq file? Here’s a snippet:

@1CTCTCATT|PRCONS=IGHM|CONSCOUNT=155|DUPCOUNT=2
SOME SEQUENCE
+
{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{

The last command I ran to get this file was based on

CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
    --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C

from the presto tutorial.

Am I missing some step? Thanks in advance!
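
The `# Query:` line in the IgBLAST output shown above has no sequence name after it, and the key in the `KeyError` is that same bare text. A minimal sketch for checking a whole fmt7 file for blank query names follows. The function name `count_blank_queries` is hypothetical and not part of changeo, and the check relies only on the `# Query:` line format visible in this report.

```python
def count_blank_queries(fmt7_path):
    """Return (total, blank) counts of '# Query:' lines in an IgBLAST fmt7 file."""
    prefix = "# Query:"
    total = 0
    blank = 0
    with open(fmt7_path) as handle:
        for line in handle:
            if line.startswith(prefix):
                total += 1
                # The text after the colon is the query name; count it as
                # blank when nothing but whitespace follows.
                if not line[len(prefix):].strip():
                    blank += 1
    return total, blank
```

Calling `count_blank_queries("results/igblast/Sample_1_L001-C_atleast-2_igblast.fmt7")` returns both counts. If every query is blank, the problem likely lies in what was passed to IgBLAST rather than in MakeDb.py itself.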

Comments (6)

  1. Jason Vander Heiden

    Greetings,

    It looks like the name of the query is missing from the igblast output, so it’s not able to find the matching record in the sequence file. The sequence headers look fine to me. But, IIRC, igblast doesn’t accept fastq input, so it’s probably trying to find the > delimiter for the header and missing it.

    Could you pass AssignGenes a fasta instead of fastq file and see if that fixes the problem? There’s a script in the docker container, so you can just do:

    fastq2fasta.py Sample_1_L001-C_atleast-2.fastq
    AssignGenes.py igblast \
        -s Sample_1_L001-C_atleast-2.fasta \
        -b /usr/local/share/igblast --organism human \
        --loci ig --format blast --outdir results/igblast --nproc 8
    

    I recall this popping up before, but I don’t think we added any checks to terminate the task if we detect fastq input. The associated issue (filed in the wrong repo) is here:

    https://bitbucket.org/kleinstein/immcantation/issues/74/assigngenespy-input-file

    Let me know if this doesn’t fix it and we can take a closer look.

  2. Tartu Immunology reporter

    Thanks, Jason!

    I ended up converting the file to fasta after running presto. I guess the only (minor) issue is that it’s not documented how presto output should be fed to changeo.

  3. Jason Vander Heiden

    Yeah, we should at least update the docs and probably sniff the input file to make sure it’s in the correct format, because this error isn’t remotely informative.

  4. ssnn

    AssignGenes now works with fastq files. It will generate the fasta file needed for IgBLAST. We have also improved the documentation in pRESTO.
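
For setups where the input still has to be fasta, the "sniff the input file" idea from this thread could be sketched roughly as below. This is an illustration only, not how AssignGenes implements its check. The names `sniff_sequence_format` and `require_fasta` are hypothetical, and the sketch assumes that the first non-blank line of a fasta file starts with `>` and that of a fastq file starts with `@`.

```python
def sniff_sequence_format(path):
    """Guess 'fasta' or 'fastq' from the first non-blank line, else None."""
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return None  # first record line matches neither format
    return None  # empty file


def require_fasta(path):
    """Raise a readable error early instead of failing later on a KeyError."""
    detected = sniff_sequence_format(path)
    if detected != "fasta":
        raise ValueError(
            f"{path}: expected fasta input, detected {detected!r}; "
            "convert the file to fasta before running the aligner"
        )
    return path
```

Only the first record is inspected, so the check is cheap but will not catch a file that becomes malformed further down.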
