Assign Genes skips several rows of sequences

Comments (2)

Jason Vander Heiden
Greetings @anw01758 ,

The first instance of the problem appears to be sequence 735, which starts with …. This causes igblast to spit out the error:
```
WORKER: T1 BATCH # 8 CEXCEPTION: CFastaReader: Near line 1470, there's a line that doesn't look like plausible data, 
but it's not marked as defline or comment. (m_Pos = 1470)
```
Internal `.` characters seem to be fine (it ignores them), but it can’t handle leading `.` characters.

This worked for me:
```
$ cat pull_491_seq_255_257_5P_LP_2.fasta | sed "s/\./-/g" > new.fasta
$ AssignGenes.py igblast -s new.fasta -b ~/share/igblast --organism mouse --loci ig --format airr -o test.tsv
$ grep ">" pull_491_seq_255_257_5P_LP_2.fasta | wc -l
    5362
$ tail -n +2 test.tsv | wc -l
    5362
```
It doesn’t seem to mind `-` characters. I’m guessing deleting the `.` characters or replacing them with `N` would also work fine. (I didn’t test that.)
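
One caveat with the `sed` command above: it rewrites every `.` in the file, including any in the `>` header lines. If you want the headers left alone, here is a minimal Python sketch of the same replacement restricted to sequence lines. The function names and the `replacement` parameter are made up for illustration; they are not part of AssignGenes.py or igblast.

```python
def sanitize_fasta_lines(lines, replacement="-"):
    """Yield FASTA lines with '.' replaced in sequence lines only.

    Header lines (starting with '>') pass through unchanged.
    Hypothetical helper, not part of any existing tool.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.replace(".", replacement)


def sanitize_fasta(in_path, out_path, replacement="-"):
    """Write a sanitized copy of the FASTA file at in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(sanitize_fasta_lines(src, replacement))
```

Calling `sanitize_fasta("pull_491_seq_255_257_5P_LP_2.fasta", "new.fasta")` should give the same sequences as the `sed` version. Passing `replacement="N"` or `replacement=""` would cover the other two options, which are still untested.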

I only see 3 sequences with leading . and 74 failing sequences, so my guess is that igblast batches I/O and fails the entire batch of sequences in the read block when it hits an exception.
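
To check that guess on your own data, something like the sketch below can list which records have a leading `.` and which records never made it into the output. It assumes the AIRR TSV has a `sequence_id` column and that the ID is the first whitespace-delimited token of the FASTA header; the helper names are hypothetical.

```python
import csv


def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """Assumption: the ID is the first whitespace-delimited header token."""
    parts = header.split(maxsplit=1)
    return parts[0] if parts else ""


def leading_dot_ids(fasta_lines):
    """IDs of records whose sequence starts with '.'."""
    return [record_id(h) for h, seq in fasta_records(fasta_lines)
            if seq.startswith(".")]


def missing_ids(fasta_lines, tsv_lines):
    """IDs present in the FASTA input but absent from the AIRR TSV."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    seen = {row["sequence_id"] for row in reader}
    return [record_id(h) for h, _ in fasta_records(fasta_lines)
            if record_id(h) not in seen]
```

Both functions accept open file handles. If the batching guess is right, the missing IDs should sit in contiguous runs in the input, each run containing at least one of the leading-`.` records.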

We aren’t currently passing igblast’s warning messages to the user, but it seems like we should be, because that would’ve made this clearer. Maybe via a `--log` argument? I’ll spawn an issue for that.
- 2021-05-02T20:23:08+00:00
Jason Vander Heiden
- changed status to resolved
No news is good news.
- 2021-10-24T21:50:01+00:00