Assign Genes skips several rows of sequences
I have a FASTA file with >5000 sequences that I am trying to run Assign Genes on. I have successfully run the program with smaller files. The command runs without error, but the resulting file has ~100 fewer sequences than the input file. By manually comparing sequence IDs between the two files, I can see that the missing lines occur ~700 sequences into the file.
I enter the following into the command line: AssignGenes.py igblast -s pull_491_seq_255_257_5P_LP_2.fasta -b ~/share/igblast --organism mouse --loci ig --format airr
I then create the data table in RStudio Server using: pull_491_seq_255_257_5P_LP_2_igblast <- readChangeoDb("pull_491_seq_255_257_5P_LP_2_igblast.tsv")
I have attached the fasta and tsv files in question.
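For anyone who wants to locate dropped records without checking IDs by hand, here is a minimal Python sketch. It assumes the output is an AIRR-format TSV with a `sequence_id` column, and that the ID written there matches the first whitespace-delimited token of each FASTA header (that matching rule is an assumption and may need adjusting). The function names `fasta_ids` and `missing_ids` are hypothetical helpers, not part of Change-O.

```python
import csv
import io


def fasta_ids(handle):
    """Yield the ID of each FASTA record (first token after '>')."""
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the ID is the first whitespace-delimited token.
            yield header.split()[0] if header else ""


def missing_ids(fasta_handle, tsv_handle, id_column="sequence_id"):
    """Return (record_number, id) pairs present in the FASTA but not the TSV."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    seen = {row[id_column] for row in reader}
    return [
        (number, seq_id)
        for number, seq_id in enumerate(fasta_ids(fasta_handle), start=1)
        if seq_id not in seen
    ]


if __name__ == "__main__":
    # Tiny in-memory demonstration; real use would open the two files instead.
    demo_fasta = io.StringIO(">s1 desc\nACGT\n>s2\nAC\nGT\n>s3\nTTTT\n")
    demo_tsv = io.StringIO("sequence_id\tsequence\ns1\tACGT\ns3\tTTTT\n")
    for number, seq_id in missing_ids(demo_fasta, demo_tsv):
        print(number, seq_id)
```

With real data, the two `io.StringIO` objects would be replaced by `open(...)` handles on the FASTA and TSV files. The record numbers make it easy to see where in the input the gap starts.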
Comments (2)
- changed status to resolved
No news is good news.
Greetings @anw01758,
The first instance of the problem appears to be sequence 735, which starts with… This causes igblast to spit out an error. Internal `.` characters seem to be fine (it ignores them), but it can’t handle leading `.` characters.
This worked for me: …
It doesn’t seem to mind `-` characters. I’m guessing deleting the `.` or replacing them with `N` would also work fine. (I didn’t test that.)
I only see 3 sequences with leading `.` and 74 failing sequences, so my guess is that igblast batches I/O and fails the entire batch of sequences in the read block when it hits an exception.
We aren’t currently passing igblast’s warning messages to the user, but it seems like we should be, because that would’ve made this clearer. Maybe via a `--log` argument? I’ll spawn an issue for that.
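As a rough illustration of the kind of cleanup described above, here is a minimal Python sketch that rewrites only the leading `.` characters of each FASTA record and leaves internal ones alone. It is not the command used in this thread. The function name `fix_leading_dots` is hypothetical. The default replacement is `-`, which igblast reportedly tolerates; passing `"N"` or `""` (deletion) corresponds to the untested alternatives.

```python
import io


def fix_leading_dots(in_handle, out_handle, replacement="-"):
    """Copy a FASTA stream, replacing leading '.' characters in each record.

    Returns the number of records that were modified.
    """
    in_leading_run = False  # True until the first non-'.' character of a record
    fixed_records = set()
    record_number = 0
    for line in in_handle:
        if line.startswith(">"):
            record_number += 1
            in_leading_run = True
        elif in_leading_run:
            body = line.rstrip("\r\n")
            rest = body.lstrip(".")
            n_dots = len(body) - len(rest)
            if n_dots:
                # Keep the original line ending, swap only the leading dots.
                line = replacement * n_dots + rest + line[len(body):]
                fixed_records.add(record_number)
            # A line made up only of dots means the leading run continues.
            in_leading_run = rest == ""
        out_handle.write(line)
    return len(fixed_records)


if __name__ == "__main__":
    # In-memory demonstration; real use would pass open file handles.
    source = io.StringIO(">a\n..AC.GT\n>b\nACGT\n")
    cleaned = io.StringIO()
    count = fix_leading_dots(source, cleaned)
    print(count, "record(s) modified")
    print(cleaned.getvalue(), end="")
```

The cleaned file could then be passed to AssignGenes.py in place of the original. Whether `N` or deletion is preferable depends on how the downstream alignment should treat those positions, which this sketch does not try to decide.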