Method to reduce the memory needed when running a huge file
Issue #32
on hold
Hi, I'm running vcontact2 on a 2.4 GB protein .faa file and it fails with a MemoryError: Unable to allocate 9.01 TiB for an array with shape (1113079, 1113079) and data type float64. Are there any methods to reduce the memory needed when running such a huge protein file? Here is the code:
# Activate the conda environment
source activate vContact2

# Generate the gene-to-genome mapping from the Prodigal protein predictions
vcontact2_gene2genome -p prot_vir.faa \
  -o g2g.csv \
  -s 'Prodigal-FAA'

# Run vConTACT2 against the merged viral RefSeq 201 database
vcontact2 --raw-proteins prot_vir.faa \
  --rel-mode 'Diamond' \
  --proteins-fp g2g.csv \
  --db 'ProkaryoticViralRefSeq201-Merged' \
  --pcs-mode MCL \
  --vcs-mode ClusterONE \
  --c1-bin /lustre/home/liutang/01software/MAVERICLab-vcontact2-aaa065683c99/bin/cluster_one-1.0.jar \
  --output-dir vcontact2_ref201
Thank you.
Comments (2)
- changed status to on hold
An update is planned that will resolve this issue; unfortunately, it won't land within the next couple of weeks.
This is a very good question, and one I've tried to solve. There are some technical limitations that would require a more skilled developer to resolve: vConTACT2 builds a dense all-vs-all similarity matrix, so memory grows quadratically with the number of input genomes.
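To make that concrete, the 9.01 TiB in the error message is exactly what a dense float64 matrix of that shape requires. A quick back-of-the-envelope check in plain shell (awk here is just a calculator, nothing vcontact2-specific):

# 1113079 x 1113079 cells, 8 bytes each (float64), converted to TiB (2^40 bytes)
awk 'BEGIN { printf "%.2f TiB\n", 1113079 ^ 2 * 8 / 2 ^ 40 }'
# -> 9.01 TiB

Because the matrix is square, halving the number of input genomes cuts the requirement roughly fourfold, which is why dereplication helps so much.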
For most large datasets (usually 500K+ genomes), I've been advising users to dereplicate their genomes first, using ClusterGenomes (an app available on CyVerse.org), dRep, or the support scripts that ship with CheckV; a sketch of the CheckV route follows.
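Here is a minimal sketch of that CheckV-style dereplication, assuming anicalc.py and aniclust.py have been downloaded from the CheckV repository and that genomes.fna holds the nucleotide genomes (dereplicate the nucleotide sequences first, then regenerate the protein .faa with Prodigal). The file names are placeholders, and the 95% ANI / 85% coverage thresholds are the defaults suggested in the CheckV documentation, not values from this thread:

# All-vs-all nucleotide comparison of the input genomes
makeblastdb -in genomes.fna -dbtype nucl -out genomes_db
blastn -query genomes.fna -db genomes_db \
  -outfmt '6 std qlen slen' -max_target_seqs 10000 \
  -out blast.tsv -num_threads 8

# Compute pairwise ANI from the BLAST hits, then cluster at ~species level
python anicalc.py -i blast.tsv -o ani.tsv
python aniclust.py --fna genomes.fna --ani ani.tsv --out clusters.tsv \
  --min_ani 95 --min_tcov 85 --min_qcov 0

clusters.tsv then lists one representative genome per cluster; feeding only those representatives into vcontact2 shrinks the matrix accordingly.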
Sorry, that's not much of an answer. Moving forward, we hope to replace a portion of the code with another method better suited to huge datasets.
Cheers,
Ben