Huge processing times for CollapseSeq

Issue #89 new
Santiago Revale created an issue

Hi there!

I had been running pRESTO on MiSeq samples successfully. Lately, I've received a few NextSeq runs to put through the pipeline and, while most of the samples were processed without any issues, for a few of them the CollapseSeq step alone took a really long time (33-62 hours).

What puzzled me the most was that three samples took a similar time (33-37 hours) while one took 62 hours. I tried looking at the numbers to figure out whether there was a pattern in which samples take longer (e.g., more raw reads, longer runtimes), but I couldn't find one. Here are the numbers I collected:

SAMPLE  Running Time     raw_reads  contributing_reads  unique_sequences    unique_cdr3
Sample1     33:41:44     6,256,670           4,779,720           737,838        581,965
Sample2     34:29:56     3,418,984           2,797,692           638,508        452,911
Sample3     37:34:06    10,758,170           8,810,811           715,579        497,400
Sample4     62:36:16     3,501,513           2,783,129           885,801        691,839

The only thing that makes Sample4 stand out is that it has more unique sequences than the others, although the difference is not proportional to the time it took to process them.
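As a rough back-of-envelope check (my own assumption, not anything from the pRESTO docs): if duplicate removal with ambiguous characters allowed (`-n 5`) degrades toward pairwise comparison of unique sequences, runtime would grow roughly quadratically in the unique-sequence count. Plugging in the numbers from the table above:

```python
# Back-of-envelope scaling check using the figures from the table above.
# Assumption (hypothetical): runtime scales ~quadratically with the number
# of unique sequences if N-tolerant matching forces pairwise comparisons.

samples = {
    "Sample1": {"unique": 737_838, "hours": 33 + 41 / 60 + 44 / 3600},
    "Sample4": {"unique": 885_801, "hours": 62 + 36 / 60 + 16 / 3600},
}

u_ratio = samples["Sample4"]["unique"] / samples["Sample1"]["unique"]
t_ratio = samples["Sample4"]["hours"] / samples["Sample1"]["hours"]

print(f"unique-sequence ratio:       {u_ratio:.2f}")     # ~1.20
print(f"predicted quadratic ratio:   {u_ratio ** 2:.2f}")  # ~1.44
print(f"observed runtime ratio:      {t_ratio:.2f}")     # ~1.86
```

Even a quadratic model underestimates the observed ratio (1.44 predicted vs. 1.86 observed), so the number of unique sequences alone doesn't seem to explain it; perhaps the count of reads containing N's also matters?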

I would really appreciate any tips or advice on what could be going on here, so that in the future I can anticipate when this might happen, or at least explain why it did.

Here is some additional info:

# pRESTO version: 0.6.0 (from the Docker Hub image immcantation/suite:4.0.0)

# Command used
CollapseSeq.py \
  -s "Sample4_consensus-pass.fasta" \
  -n 5 \
  --uf BARCODE C_CALL \
  --cf CONSCOUNT \
  --act sum \
  --inner \
  --outname "Sample4"

Thank you very much in advance.

Cheers!