Running out of memory with CollapseSeq

Issue #63 resolved
Christoph Kreer created an issue

Hey there, I am trying to run CollapseSeq from the pRESTO toolkit on a 16 GB pre-processed FASTQ file from a 2x300 bp MiSeq run. It should contain around 17,000,000 sequences, but running the script crashes my computer. I run everything on a late 2015 iMac with 16 GB RAM and a 2.8 GHz Intel Core i5. When I check the Activity Monitor I see that the RAM is exhausted and the script occupies >50 GB. Is there any workaround or update that does not load all data into memory? Thanks a lot

Christoph

Comments (6)

  1. Jason Vander Heiden

    Hi @ckreer,

    Unfortunately, no, there isn't a way for CollapseSeq to work on-disk. 50 GB is a bit extreme for a 16 GB file. Performance and memory usage are something we've wanted to improve in CollapseSeq for a long time now. Not sure when that will happen though.

    I think the easiest workaround for now would be to split the file into separate pieces, then run CollapseSeq on them individually, then merge those results for a final pass.

    For example:

    # Split file into 3M record sets
    SplitSeq.py count -s large.fastq -n 3000000
    # Remove duplicates in each part
    CollapseSeq.py -s large_part*.fastq -n 0 --inner --keepmiss
    # Merge the results of each part
    cat *collapse-unique.fastq > merge.fastq
    # Collapse the collapsed set
    CollapseSeq.py -s merge.fastq --inner
    

    Using -n 0 will save a lot of time (ignoring Ns is a lot faster), and --keepmiss will retain sequences with Ns in the first pass.

  2. Christoph Kreer reporter

    Dear Jason,

    Thank you so much for the quick response and the solution with the splitting. Just one final question: Does the "nested collapsing approach" preserve the original duplicate count in the header from the first round collapsing?

  3. Jason Vander Heiden

    Ah, that's a good question. No, not by default. You can add that by adjusting the headers a little before the final collapse:

    # Rename DUPCOUNT annotation
    ParseHeaders.py rename -s merge.fastq -f DUPCOUNT -k XCOUNT
    # Sum XCOUNT for each duplicate
    CollapseSeq.py -s merge_reheader.fastq --inner --cf XCOUNT --act sum
    

    ParseHeaders-rename changes the name of the DUPCOUNT annotation from the split collapse so you can work with it separately.

    The --cf argument to CollapseSeq tells it to copy the annotation field into the final retained sequence for all duplicates. --act sum tells it to sum the list of values in that field. That should get you the total duplicate counts in the XCOUNT annotation after both steps.
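
The arithmetic behind the two-pass count can be sketched with a toy example. This is not pRESTO: it uses plain text lines in place of FASTQ records and sort/uniq/awk in place of CollapseSeq, and the function names collapse_counts and sum_counts are hypothetical, chosen only for illustration. It assumes exact-match duplicates with no Ns.

```shell
# Toy stand-in for the split / collapse / merge / sum workflow.
# Assumption: one sequence per line, exact-match duplicates only.

collapse_counts() {
    # First pass: print "count<TAB>sequence" for each distinct input line
    LC_ALL=C sort | uniq -c | awk '{print $1 "\t" $2}'
}

sum_counts() {
    # Second pass: sum the first-pass counts for each sequence
    awk -F '\t' '{total[$2] += $1} END {for (s in total) print total[s] "\t" s}' \
        | LC_ALL=C sort -t "$(printf '\t')" -k2,2
}

# Two "parts" of a larger file, collapsed independently
part1=$(printf 'ACGT\nACGT\nTTGA\n' | collapse_counts)
part2=$(printf 'ACGT\nTTGA\nTTGA\nGGCC\n' | collapse_counts)

# Merge the per-part results and sum the counts per sequence
merged=$(printf '%s\n%s\n' "$part1" "$part2" | sum_counts)
printf '%s\n' "$merged"
```

Here ACGT appears twice in the first part and once in the second, so the summed count is 3, the same value a single pass over the unsplit data would give. Counting the merged records without summing would report 2 instead, which is why the first-pass counts have to be carried along and summed.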

  4. Christoph Kreer reporter

    Dear Jason,

    Splitting, collapsing, renaming and collapsing, as you suggested, did the trick.

    Thanks a lot!

  5. Jason Vander Heiden

    Yay!

    I'm going to leave this issue open for a bit to remind me to look at CollapseSeq's memory usage. We'll probably have to redo the whole algorithm to get better performance, but there might be something simple I can do for the memory issue. Have to check.

  6. Jason Vander Heiden

    A rewrite of CollapseSeq is in progress, so I'm going to close this. We'll have to re-evaluate on the new algorithm.
