Assign Genes skips several rows of sequences

Issue #179 resolved
anw01758 created an issue

I have a fasta file with >5000 sequences that I am trying to use Assign Genes on. I have successfully run the program with smaller files. When I run the command, it runs without error but the resulting file has ~100 less sequences than the input file. By manually examining the files based on sequence ID I can see that the missing lines occur ~700 sequences into the file.

I enter the following into the command line: AssignGenes.py igblast -s pull_491_seq_255_257_5P_LP_2.fasta -b ~/share/igblast --organism mouse --loci ig --format airr

I then create the data table in R Studio Server using: pull_491_seq_255_257_5P_LP_2_igblast <- readChangeoDb("pull_491_seq_255_257_5P_LP_2_igblast.tsv")

I have attached the fasta and tsv files in question.

Comments (2)

  1. Jason Vander Heiden

    Greetings @anw01758 ,

    The first instance of the problem appears to be sequence 735, which starts with . This causes igblast to spit out the error:

    WORKER: T1 BATCH # 8 CEXCEPTION: CFastaReader: Near line 1470, there's a line that doesn't look like plausible data, 
    but it's not marked as defline or comment. (m_Pos = 1470)
    

    Internal . character seem to be fine (it ignores) them, but it can’t handle leading . characters.

    This worked for me:

    $ cat pull_491_seq_255_257_5P_LP_2.fasta | sed "s/\./-/g" > new.fasta
    $ AssignGenes.py igblast -s new.fasta -b ~/share/igblast --organism mouse --loci ig --format airr -o test.tsv
    $ grep ">" pull_491_seq_255_257_5P_LP_2.fasta | wc -l
        5362
    $ tail -n +2 test.tsv | wc -l
        5362
    

    It doesn’t seem to mind - characters. I’m guessing deleting the . or replacing them with N would also work fine. (I didn’t test that.)

    I only see 3 sequences with leading . and 74 failing sequences, so my guess is that igblast batches I/O and fails the entire batch of sequences in the read block when it hits an exception.

    We aren’t currently passing igblast’s warning messages to the user, but it seems like we should be because that would’ve made this more clear. Maybe via a --log argument? I’ll spawn an issue for that.

  2. Log in to comment