Error during BuildConsensus

Issue #60 resolved
Koda Hirotomo created an issue

We're trying pRESTO to process UMI-Barcoded Ion Torrent PGM BCR mRNA (400bp) (which is indel error-prone like Roche 454 pyrosequencing). Reads are single-ended and all reversely oriented (start with 12bp UMI with fixed nucleotides inserted, and then C-region primer sequences). We seem to have no problem in QC, UMI annotation and primer masking (both sides) steps.

During BuildConsensus step, frequently but not always (probably depends on datasets) this error happens in our run. Error cases always show 'float division by zero' messages. I have to manually kill the process after this error.

BuildConsensus.py -s plusVJ_primers-pass.fastq --bf BARCODE --pf PRIMER -n 5 --prcons 0.6 --maxerror 0.2 --maxgap 0.5 --outname plus --log BC5.log
           START> BuildConsensus
            FILE> plusVJ_primers-pass.fastq
   BARCODE_FIELD> BARCODE
       MIN_COUNT> 5
   MIN_FREQUENCY> 0.6
     MIN_QUALITY> 0
         MAX_GAP> 0.5
    PRIMER_FIELD> PRIMER
PRIMER_FREQUENCY> 0.6
       MAX_ERROR> 0.2
   MAX_DIVERSITY> None
       DEPENDENT> False
     COPY_FIELDS> None
    COPY_ACTIONS> None
           NPROC> 24

PROGRESS> 02:58:03 |#                   |   5% ( 2,782) 3.6 minError processing sequence set with ID: CGGGATAGCTGTGGGA
PID 176:  Error in sibling process detected. Cleaning up.
PID 151:  Error in sibling process detected. Cleaning up.
Process Process-6:
Traceback (most recent call last):
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/bin/BuildConsensus.py", line 129, in processBCQueue
    consensus = cons_func(seq_list, **cons_args)
  File "/usr/lib/python3.5/site-packages/presto/Sequence.py", line 520, in qualityConsensus
    for c in qual_set}
  File "/usr/lib/python3.5/site-packages/presto/Sequence.py", line 520, in <dictcomp>
    for c in qual_set}
ZeroDivisionError: float division by zero
PID 157:  Error in sibling process detected. Cleaning up.
PID 175:  Error in sibling process detected. Cleaning up.
PID 163:  Error in sibling process detected. Cleaning up.
PID 173:  Error in sibling process detected. Cleaning up.
PID 154:  Error in sibling process detected. Cleaning up.

Enabling '--nproc 1' option made no improvement.

Here's our system environment. We're happy to provide more information if you need.

OS: CentOS Linux release 7.3.1611
Docker : version 1.12.6, build 3a094bd/1.12.6
Image: kleinstein/immcantation:1.7.0

Comments (7)

  1. Jason Vander Heiden

    Greetings @hkoda,

    I'll take a look at this. Would you be able to email me an example file that fails? (jason.vanderheiden at yale.edu).

    My guess is that there's a sequence that's entirely N characters, or something similar, in the data that we aren't accounting for correctly. I'm not sure without looking at it though.

  2. Jason Vander Heiden

    Thanks. I got the test data. The problem was caused by some UMI groups having positions where the quality score was 0 in all of the sequences (the tail end was all N characters with quality 0).

    I added a fix to the version of presto at the tip of the default branch in the bitbucket repo.

    As it looks like you are using the docker container, I'm rebuilding the kleinstein/immcantation:devel image right now, and you can get the fix by using that image. The image hasn't finished building yet. It should be done by 3pm your time.

    I'll try to make an official release of presto with the fix sometime next week, but I have some testing to do on the MaskPrimers changes first.

  3. Koda Hirotomo reporter

    Great. Just in case, I've sent you another example file which causes the same error. This time no N characters are included but there ARE some sequences with characters with quality 0, which passed QC step because average quality score is above the threshold. I hope you just try this file too. No problem for me on deadlines.

    Thank you again!

  4. Jason Vander Heiden

    Sure thing. I ran the new file you sent and it worked fine.

    However, the consensus sequences are poor quality with a lot of Ns in them due to the abundance of indels. With --freg 0.6 BuildConsensus requires 60% of the reads to have the same non-N character or it'll assign an N. So you get a lot of situations like the following:

           ID> CATTATTACGTCGTGG
     SEQCOUNT> 9
       INSEQ1> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
       INSEQ2> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
       INSEQ3> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
       INSEQ4> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
       INSEQ5> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
       INSEQ6> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
       INSEQ7> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
       INSEQ8> TCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAAC
       INSEQ9> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
    CONSENSUS> TCGCACAGTAATAGACGGCCGTGTCCGCGACGNNNNNNNNGNNNNNCNNNNGGNNNAN
      QUALITY> {{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{!!!!!!!!{!!!!!{!!!!{{!!!{
    

    Where a series of frameshifts causes a lot of positions to be Ns in the consensus.

    I would run AlignSets-muscle on the files before running BuildConsensus.

    The results looked a lot better when I did this (I'll email you example output separately).

  5. Koda Hirotomo reporter

    Now we've tried the kleinstein/immcantation:devel image and everything works. I also appreciate your helpful advice on the AlignSets tool. This also works well on our data.

    Thank you very much for all of this!

  6. Log in to comment