Error during BuildConsensus
We're trying pRESTO to process UMI-Barcoded Ion Torrent PGM BCR mRNA (400bp) (which is indel error-prone like Roche 454 pyrosequencing). Reads are single-ended and all reversely oriented (start with 12bp UMI with fixed nucleotides inserted, and then C-region primer sequences). We seem to have no problem in QC, UMI annotation and primer masking (both sides) steps.
During BuildConsensus step, frequently but not always (probably depends on datasets) this error happens in our run. Error cases always show 'float division by zero' messages. I have to manually kill the process after this error.
BuildConsensus.py -s plusVJ_primers-pass.fastq --bf BARCODE --pf PRIMER -n 5 --prcons 0.6 --maxerror 0.2 --maxgap 0.5 --outname plus --log BC5.log
START> BuildConsensus
FILE> plusVJ_primers-pass.fastq
BARCODE_FIELD> BARCODE
MIN_COUNT> 5
MIN_FREQUENCY> 0.6
MIN_QUALITY> 0
MAX_GAP> 0.5
PRIMER_FIELD> PRIMER
PRIMER_FREQUENCY> 0.6
MAX_ERROR> 0.2
MAX_DIVERSITY> None
DEPENDENT> False
COPY_FIELDS> None
COPY_ACTIONS> None
NPROC> 24
PROGRESS> 02:58:03 |# | 5% ( 2,782) 3.6 minError processing sequence set with ID: CGGGATAGCTGTGGGA
PID 176: Error in sibling process detected. Cleaning up.
PID 151: Error in sibling process detected. Cleaning up.
Process Process-6:
Traceback (most recent call last):
File "/usr/lib64/python3.5/multiprocessing/process.py", line 252, in _bootstrap
self.run()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/bin/BuildConsensus.py", line 129, in processBCQueue
consensus = cons_func(seq_list, **cons_args)
File "/usr/lib/python3.5/site-packages/presto/Sequence.py", line 520, in qualityConsensus
for c in qual_set}
File "/usr/lib/python3.5/site-packages/presto/Sequence.py", line 520, in <dictcomp>
for c in qual_set}
ZeroDivisionError: float division by zero
PID 157: Error in sibling process detected. Cleaning up.
PID 175: Error in sibling process detected. Cleaning up.
PID 163: Error in sibling process detected. Cleaning up.
PID 173: Error in sibling process detected. Cleaning up.
PID 154: Error in sibling process detected. Cleaning up.
Enabling '--nproc 1' option made no improvement.
Here's our system environment. We're happy to provide more information if you need.
OS: CentOS Linux release 7.3.1611
Docker : version 1.12.6, build 3a094bd/1.12.6
Image: kleinstein/immcantation:1.7.0
Comments (7)
-
-
reporter Thank you for your quick reply! I've sent you an email with example files.
-
Thanks. I got the test data. The problem was caused by some UMI groups having positions where the quality score was 0 in all of the sequences (the tail end was all N characters with quality 0).
I added a fix to the version of presto at the tip of the default branch in the bitbucket repo.
As it looks like you are using the docker container, I'm rebuilding the
kleinstein/immcantation:devel
image right now, and you can get the fix by using that image. The image hasn't finished building yet. It should be done by 3pm your time.I'll try to make an official release of presto with the fix sometime next week, but I have some testing to do on the MaskPrimers changes first.
-
reporter Great. Just in case, I've sent you another example file which causes the same error. This time no N characters are included but there ARE some sequences with characters with quality 0, which passed QC step because average quality score is above the threshold. I hope you just try this file too. No problem for me on deadlines.
Thank you again!
-
Sure thing. I ran the new file you sent and it worked fine.
However, the consensus sequences are poor quality with a lot of Ns in them due to the abundance of indels. With
--freg 0.6
BuildConsensus requires 60% of the reads to have the same non-N character or it'll assign an N. So you get a lot of situations like the following:ID> CATTATTACGTCGTGG SEQCOUNT> 9 INSEQ1> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA INSEQ2> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA INSEQ3> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA INSEQ4> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA INSEQ5> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA INSEQ6> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA INSEQ7> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA INSEQ8> TCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAAC INSEQ9> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA CONSENSUS> TCGCACAGTAATAGACGGCCGTGTCCGCGACGNNNNNNNNGNNNNNCNNNNGGNNNAN QUALITY> {{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{!!!!!!!!{!!!!!{!!!!{{!!!{
Where a series of frameshifts causes a lot of positions to be Ns in the consensus.
I would run AlignSets-muscle on the files before running BuildConsensus.
The results looked a lot better when I did this (I'll email you example output separately).
-
reporter Now we've tried the kleinstein/immcantation:devel image and everything works. I also appreciate your helpful advice on the AlignSets tool. This also works well on our data.
Thank you very much for all of this!
-
- changed status to resolved
No problem at all.
Error fixed in 344222d.
- Log in to comment
Greetings @hkoda,
I'll take a look at this. Would you be able to email me an example file that fails? (jason.vanderheiden at yale.edu).
My guess is that there's a sequence that's entirely N characters, or something similar, in the data that we aren't accounting for correctly. I'm not sure without looking at it though.