vcontact freezes at "Building the cluster and profiles" step
First of all, thanks for this nice application. I have vcontact2 0.9.15 and I downgraded pandas to 0.25.3, so it works with no problems on the examples and on real data. However, I am now trying to use it with a large number of genomes (~780,000). Unfortunately, each time I run vcontact, it stalls at the "Building the cluster and profiles" step and never finishes. If I repeat the run using --blast-fp, the same thing happens. I saw on the Wiki page that there is a limit of approximately 1 million genomes, which I have not reached. So, how can I get it to finish this step?
By the way, I noticed that this step is not multithreaded. Is there a way to make it multithreaded and hence faster?
Comments (5)
- changed status to on hold
On hold until the release of the next major version.
-
RESOLVED: See solution below
Hey Ben,
thanks for this tool. I experience a similar issue even with just one genome. I am using vContact2 0.9.17. Once it starts the protein clustering it prints this and freezes:
-------------------------------Protein clustering-------------------------------
INFO:vcontact2: Loading proteins...
INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
DEBUG:vcontact2: Read 268229 proteins from genes2genome.csv.
DEBUG:vcontact2: File merged.self-diamond.tab_mcl20.clusters exists and will be used. Use -f to overwrite.
INFO:vcontact2: Building the cluster and profiles (this may take some time...)
If it fails, try re-running using --blast-fp flag and specifiying merged.self-diamond.tab (or merged.self-blastp.tab)
What I am puzzled about is how it can read 268229 proteins from my genes2genome.csv even though the file contains only 84 proteins. I used
vcontact2_gene2genome -p proteins.faa -o genes2genome.csv -s MetaGeneMark
to generate the file from the MetaGeneMark output. Furthermore, it says that merged.self-diamond.tab_mcl20.clusters exists, but I don’t see it in the output folder… Any ideas about that?
Thanks,
Aaron
SOLUTION:
Somehow it attempted to load merged.self-diamond.tab_mcl20.clusters from the folder I ran the script from. However, this file was corrupted because I had interrupted an earlier run prematurely. So instead of creating the file, it loaded the corrupted one and then ran in an infinite loop on it. Setting --force resolved the problem.
-
Thanks for using the tool and I’m glad you were able to solve it. vConTACT2 will attempt to find an old run, if possible. The restart mechanism is a bit buggy, so I recommend fresh runs (i.e. remove the directory created) after any sort of “failure.” The only exceptions are the --blast-fp option and the “_pcs.csv” and “_contigs.csv” files generated as intermediates. Those are good checkpoint files for moving forward.
-
Yes, the main issue was that it tried to access the file in the directory I executed the script from and not in the specified output directory. So I wasn’t aware that it attempted to reuse an earlier version …
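For anyone hitting the same thing, here is a minimal sketch of a pre-run check for leftover cluster files. It assumes the stale file matches the pattern `*_mcl*.clusters` (as in the log above) and that it may sit either in the directory you launch from or in your output directory. The function name and the `vcontact_output` directory are made up for illustration and are not part of vConTACT2.

```python
from pathlib import Path


def find_stale_cluster_files(*directories):
    """Return leftover MCL cluster files found directly inside the given directories."""
    stale = []
    for directory in directories:
        directory = Path(directory)
        if not directory.is_dir():
            continue  # skip directories that do not exist yet
        stale.extend(sorted(directory.glob("*_mcl*.clusters")))
    return stale


if __name__ == "__main__":
    # "vcontact_output" is a placeholder for whatever output directory you use.
    for path in find_stale_cluster_files(Path.cwd(), "vcontact_output"):
        print(f"Leftover cluster file: {path} ({path.stat().st_size} bytes)")
```

If this prints anything after an interrupted run, deleting those files (or the whole output directory) before restarting avoids reusing a possibly truncated file.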
Hi Ali,
The estimate of ~1 million is more a “rough approximation” than a hard limit. It depends more on how sparse your data is. If a large percentage of your genomes share a lot of genes with each other, that will consume more memory (and CPU cycles) than a dataset in which only a few genomes share a lot of genes, even when the total number of genomes is the same.
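To make the sparsity point concrete, here is a toy sketch. It is not how vConTACT2 computes anything internally; it only counts how many genome pairs share at least one protein cluster (PC), which is a rough proxy for how much bookkeeping the aggregation has to do. The genome and PC names are invented.

```python
from itertools import combinations


def count_linked_pairs(genome_to_pcs):
    """Count genome pairs that share at least one protein cluster (PC)."""
    return sum(
        1
        for (_, pcs_a), (_, pcs_b) in combinations(genome_to_pcs.items(), 2)
        if pcs_a & pcs_b
    )


# Two toy datasets with the same number of genomes but different gene sharing.
sparse = {"g1": {"pc1"}, "g2": {"pc2"}, "g3": {"pc3"}, "g4": {"pc3"}}
dense = {"g1": {"pc1", "pc2"}, "g2": {"pc1", "pc2"}, "g3": {"pc1"}, "g4": {"pc2"}}

print(count_linked_pairs(sparse))  # 1 linked pair
print(count_linked_pairs(dense))   # 5 linked pairs
```

Both datasets have four genomes, but the dense one has five times as many linked pairs to account for.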
This particular step is a combination of CPU-heavy table accounting and memory. At the same time, it’s building an aggregated data table. At some point, I think after it consumes all available memory, it’ll hit the [slower] swap space on your local disk. Since there’s no error message, I’m assuming it’s fit everything into memory and is doing its aggregation.
Regarding multi-threading - yes - there is, and since the release of vConTACT2 I’ve been working on a number of significant upgrades. The ETA for this new version is early summer. I wish it was sooner, but I’ve been juggling a few other projects and haven’t had the time to focus on pushing this out.
In the meantime, my best recommendation is to de-replicate your viral genomes at a minimum of 95% identity over 85% alignment length. You can use dRep, ClusterGenomes or CD-HIT (among others).
Cheers,
Ben
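As a rough sketch of the de-replication suggestion above, the snippet below builds a `cd-hit-est` command line with a 95% identity threshold and 85% coverage. Mapping "85% alignment length" to `-aS` (coverage of the shorter sequence) is my assumption, the file names are placeholders, and the helper function is hypothetical. Check the CD-HIT documentation before using this on real data, especially for long genomes.

```python
import shlex


def build_cdhit_command(input_fasta, output_fasta, identity=0.95, coverage=0.85, threads=0):
    """Build (but do not run) a cd-hit-est command for de-replicating nucleotide sequences."""
    return [
        "cd-hit-est",
        "-i", str(input_fasta),
        "-o", str(output_fasta),
        "-c", str(identity),   # sequence identity threshold
        "-aS", str(coverage),  # alignment coverage of the shorter sequence (assumed mapping)
        "-M", "0",             # no memory limit
        "-T", str(threads),    # 0 = use all available CPUs
        "-d", "0",             # use the sequence name up to the first space in the .clstr file
    ]


if __name__ == "__main__":
    # Placeholder file names; replace with your own genome FASTA files.
    cmd = build_cdhit_command("viral_genomes.fna", "viral_genomes.derep.fna")
    print(shlex.join(cmd))
```

The script only prints the command so you can inspect it before running it yourself.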