Running out of memory with CollapseSeq
Hey there, I am trying to run CollapseSeq from the pRESTO toolkit on a 16 GB pre-processed FASTQ file from a 2x300 bp MiSeq run. It should contain around 17,000,000 sequences, but running the script crashes my computer. I run everything on a late 2015 iMac with 16 GB RAM and a 2.8 GHz Intel Core i5. When I check the Activity Monitor I see that the RAM is exhausted and the script occupies >50 GB. Is there any workaround or update that does not load all the data into memory? Thanks a lot
Christoph
Comments (6)
-
-
reporter Dear Jason,
Thank you so much for the quick response and the solution with the splitting. Just one final question: Does the "nested collapsing approach" preserve the original duplicate count in the header from the first round collapsing?
-
Ah, that's a good question. No, not by default. You can add that by adjusting the headers a little before the final collapse:
# Rename DUPCOUNT annotation
ParseHeaders.py rename -s merge.fastq -f DUPCOUNT -k XCOUNT
# Sum XCOUNT for each duplicate
CollapseSeq.py -s merge_reheader.fastq --inner --cf XCOUNT --act sum
ParseHeaders-rename changes the name of the DUPCOUNT annotation from the split collapse so you can work with it separately. The --cf argument to CollapseSeq tells it to copy the annotation field into the final retained sequence for all duplicates. --act sum tells it to sum the list of values in that field. That should get you the total duplicate counts in the XCOUNT annotation after both steps.
-
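To illustrate why the counts from the first round have to be summed, here is a minimal Python sketch of the nested collapse logic. It is not how CollapseSeq is implemented; the `collapse` helper and the toy sequences are hypothetical, and the integer stored with each sequence simply plays the role of the DUPCOUNT/XCOUNT annotation.

```python
from collections import defaultdict


def collapse(records):
    """Collapse (sequence, count) pairs, summing the counts of identical sequences.

    Hypothetical helper for illustration only; the count stands in for the
    DUPCOUNT/XCOUNT header annotation.
    """
    totals = defaultdict(int)
    for seq, count in records:
        totals[seq] += count
    return dict(totals)


# First pass: each piece of the split file is collapsed on its own.
# Every raw read starts with a count of 1.
chunk_a = collapse([("ACGT", 1), ("ACGT", 1), ("TTGA", 1)])
chunk_b = collapse([("ACGT", 1), ("TTGA", 1), ("TTGA", 1)])

# Merge the per-chunk results, carrying the first-pass counts along.
merged = list(chunk_a.items()) + list(chunk_b.items())

# Final pass: counting merged records alone would only see one record per
# sequence per chunk, so the carried counts are summed instead.
final = collapse(merged)
print(final)
```

Each sequence appears only once per chunk in the merged input, so a plain duplicate count in the final pass would report the number of chunks it occurred in, not the number of original reads. Summing the carried counts recovers the totals.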
reporter Dear Jason,
Splitting, collapsing, renaming and collapsing, as you suggested, did the trick.
Thanks a lot!
-
Yay!
I'm going to leave this issue open for a bit to remind me to look at CollapseSeq's memory usage. We'll probably have to redo the whole algorithm to get better performance, but there might be something simple I can do for the memory issue. Have to check.
-
- changed status to resolved
A rewrite of CollapseSeq is in progress, so I'm going to close this. We'll have to re-evaluate with the new algorithm.
Hi @ckreer,
Unfortunately, no, there isn't a way for CollapseSeq to work on-disk. 50 GB is a bit extreme for a 16 GB file. Performance and memory usage are something we've wanted to improve in CollapseSeq for a long time now. Not sure when that will happen though.
I think the easiest workaround for now would be to split the file into separate pieces, then run CollapseSeq on them individually, then merge those results for a final pass.
For example:
Using -n 0 will save a lot of time (ignoring Ns is a lot faster), and --keepmiss will retain sequences with Ns in the first pass.
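For the splitting step of the workaround described above, the following is a rough Python sketch of streaming a FASTQ file into fixed-size pieces without loading it all into memory. It is only an illustration: `chunk_fastq` is a hypothetical helper, it assumes the common four-lines-per-record FASTQ layout, and pRESTO's own SplitSeq tool is probably the more natural choice in practice.

```python
from itertools import islice


def chunk_fastq(lines, records_per_chunk):
    """Yield lists of FASTQ lines holding at most records_per_chunk records each.

    Hypothetical helper; assumes every record spans exactly 4 lines
    (header, sequence, '+', qualities) with no wrapped sequences.
    """
    it = iter(lines)
    lines_per_chunk = 4 * records_per_chunk
    while True:
        # islice pulls only one chunk's worth of lines from the stream
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            return
        yield chunk


# Small in-memory demo: 5 records split into pieces of 2 records
demo = []
for i in range(5):
    demo += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]

pieces = list(chunk_fastq(demo, 2))
print([len(p) // 4 for p in pieces])
```

With a real file, the same generator can be fed an open file handle and each yielded chunk written to its own output file, so that only one piece is held in memory at a time.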