Method to reduce the memory needed when running a huge file

Issue #32 on hold
Former user created an issue

Hi, I'm running vcontact2 on a 2.4G protein .faa file and I get a MemoryError: Unable to allocate 9.01 TiB for an array with shape (1113079, 1113079) and data type float64. Are there any ways to reduce the memory needed when running such a large protein file? Here is the code:

source activate vContact2
vcontact2_gene2genome -p prot_vir.faa \
                      -o g2g.csv \
                      -s 'Prodigal-FAA'

vcontact2 --raw-proteins prot_vir.faa \
          --rel-mode 'Diamond' \
          --proteins-fp g2g.csv \
          --db 'ProkaryoticViralRefSeq201-Merged' \
          --pcs-mode MCL \
          --vcs-mode ClusterONE \
          --c1-bin /lustre/home/liutang/01software/MAVERICLab-vcontact2-aaa065683c99/bin/cluster_one-1.0.jar \
          --output-dir vcontact2_ref201

Thank you.
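For context, the size of the failing allocation follows directly from the error message: a dense 1,113,079 x 1,113,079 array of 8-byte floats requires roughly 9 TiB. A quick back-of-the-envelope check (a hypothetical one-liner using bc, not part of the vConTACT2 workflow):

echo "scale=2; 1113079 * 1113079 * 8 / 1024^4" | bc
# prints 9.01 -- the TiB needed for the dense all-vs-all matrix named in the traceback

Because the requirement grows with the square of the number of items, shrinking the input set (see the comments below) reduces it quadratically.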

Comments (2)

  1. Ben Bolduc

    This is a very good question and one that I’ve tried to solve. There are some technical limitations that would require a more skilled developer to help solve.

    For most large datasets (usually 500K+ genomes), I’ve been advising users to dereplicate their genomes using ClusterGenomes (an app available on CyVerse.org), dRep, or the support scripts that ship with CheckV (a minimal dRep sketch follows the comments below).

    Sorry, that's not much of an answer. Moving forward, we hope to replace a portion of the code with another method that's better suited to huge datasets.

    Cheers,

    Ben

  2. Ben Bolduc

    An update is planned that will resolve this issue. Unfortunately, it's not going to be ready within the next couple of weeks.
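As a concrete illustration of the dereplication route suggested in the first comment, here is a minimal sketch using dRep and Prodigal. This is not part of vConTACT2, and it assumes the input genomes are already split into one FASTA file per genome under a hypothetical genomes/ directory; the dereplicated set is then re-run through gene prediction before vcontact2_gene2genome:

# Hypothetical pre-processing sketch: dereplicate genomes, then rebuild the protein .faa
dRep dereplicate drep_out -g genomes/*.fasta
# dRep writes the winning genomes to drep_out/dereplicated_genomes/
cat drep_out/dereplicated_genomes/*.fasta > derep_genomes.fna
# re-predict proteins on the reduced genome set
prodigal -i derep_genomes.fna -a prot_vir_derep.faa -p meta
# then re-run vcontact2_gene2genome and vcontact2 on prot_vir_derep.faa as in the original commands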
