vcontact freezes at "Building the cluster and profiles" step

Issue #15 on hold
Ali Hassan Elbehery created an issue

First of all, thanks for this nice application. I have vcontact2 0.9.15 and I downgraded pandas to 0.25.3, so it works with no problems on the examples and on real data. However, I am now trying to use it with a large number of genomes (~780,000). Unfortunately, each time I run vcontact, it stalls at the "Building the cluster and profiles" step, takes ages, and never finishes. If I repeat it using --blast-fp, the same happens. I saw on the Wiki page that there is an approximately 1 million genome limit, which has not been reached yet. So, what can I do to make it finish this step?

By the way, I noticed that this step is not multithreaded, so is there a way to make it multithreaded and hence faster?

Comments (5)

  1. Ben Bolduc

    Hi Ali,

    The estimate of ~1 million is more of a “rough approximation” than a hard limit. It depends more on how sparse your data is. If a large percentage of your genomes share a lot of genes, that will consume more memory (and CPU cycles) than if only a few genomes share a lot of genes, even when the total number of genomes is the same.

    This particular step is a combination of CPU-heavy table accounting and memory. At the same time, it’s building an aggregated data table. At some point, I think after it consumes all available memory, it’ll hit the [slower] swap space on your local disk. Since there’s no error message, I’m assuming it’s fit everything into memory and is doing its aggregation.

    Regarding multi-threading - yes - there is, and since the release of vConTACT2 I’ve been working on a number of significant upgrades. The ETA for this new version is early summer. I wish it was sooner, but I’ve been juggling a few other projects and haven’t had the time to focus on pushing this out.

    In the meantime, my best recommendation is to de-replicate your viral genomes at a minimum of 95% identity over 85% alignment length. You can use dRep, ClusterGenomes or CD-HIT (among others).

    Cheers,

    Ben
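To make the de-replication criterion above concrete, here is a minimal Python sketch of a greedy, longest-first grouping. It is not how dRep, ClusterGenomes or CD-HIT are implemented, and in practice one of those tools should do the work. The function name `dereplicate`, the hit tuple layout, and the choice to measure the 85% coverage on the shorter genome of each pair are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical hit layout: (query_id, subject_id, percent_identity, aligned_length)
Hit = Tuple[str, str, float, int]


def dereplicate(
    lengths: Dict[str, int],
    hits: Iterable[Hit],
    min_identity: float = 95.0,
    min_coverage: float = 0.85,
) -> Dict[str, List[str]]:
    """Group genomes greedily, longest first.

    Each genome that is still unassigned becomes a representative and
    absorbs every unassigned genome linked to it by a passing hit.
    Returns {representative_id: [member_ids, representative first]}.
    """
    # Keep only pairs that pass both thresholds.
    # Assumption: coverage is measured on the shorter genome of the pair.
    linked: Dict[str, Set[str]] = {}
    for query, subject, identity, aligned_length in hits:
        if query == subject:
            continue
        shorter = min(lengths[query], lengths[subject])
        if identity >= min_identity and aligned_length / shorter >= min_coverage:
            linked.setdefault(query, set()).add(subject)
            linked.setdefault(subject, set()).add(query)

    clusters: Dict[str, List[str]] = {}
    assigned: Set[str] = set()
    # Longest genomes first; ties broken by ID so the result is deterministic.
    for genome in sorted(lengths, key=lambda g: (-lengths[g], g)):
        if genome in assigned:
            continue
        members = [genome]
        assigned.add(genome)
        for other in sorted(linked.get(genome, ())):
            if other not in assigned:
                members.append(other)
                assigned.add(other)
        clusters[genome] = members
    return clusters
```

Only the representatives (the dictionary keys) would then be passed on to vConTACT2, which is what shrinks the input.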

  2. Aaron Pfennig

    RESOLVED: See solution below

    Hey Ben,

    thanks for this tool. I experience a similar issue even with just one genome. I am using vContact2 0.9.17. Once it starts the protein clustering it prints this and freezes:

    -------------------------------Protein clustering-------------------------------
    INFO:vcontact2: Loading proteins...
    INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
    DEBUG:vcontact2: Read 268229 proteins from genes2genome.csv.
    DEBUG:vcontact2: File merged.self-diamond.tab_mcl20.clusters exists and will be used. Use -f to overwrite.
    INFO:vcontact2: Building the cluster and profiles (this may take some time...)
    If it fails, try re-running using --blast-fp flag and specifiying merged.self-diamond.tab (or merged.self-blastp.tab)
    

    What I am puzzled about is how it can read 268229 proteins from my genes2genome.csv even though it contains only 84 proteins. I used vcontact2_gene2genome -p proteins.faa -o genes2genome.csv -s MetaGeneMark to generate the file from MetaGeneMark output. Furthermore, it says that merged.self-diamond.tab_mcl20.clusters exists but I don’t see it in the output folder…

    Any ideas about that?

    Thanks,
    Aaron
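The log line "Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping" suggests that the protein count printed right after it covers the merged table, reference proteins included, rather than the user file alone. To check what the user file itself contains, a small sketch like the following could be used. It assumes the file has a header row with `protein_id` and `contig_id` columns; the function name `summarize_gene2genome` is made up for this example.

```python
import csv
from collections import Counter
from typing import Dict


def summarize_gene2genome(path: str) -> Dict[str, int]:
    """Count rows, unique proteins and contigs in a gene-to-genome CSV.

    Assumes a header row containing 'protein_id' and 'contig_id'.
    """
    protein_counts: Counter = Counter()
    contigs = set()
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            protein_counts[row["protein_id"]] += 1
            contigs.add(row["contig_id"])
    return {
        "rows": sum(protein_counts.values()),
        "unique_proteins": len(protein_counts),
        "contigs": len(contigs),
        # Protein IDs that appear on more than one row
        "duplicated_protein_ids": sum(1 for n in protein_counts.values() if n > 1),
    }
```

If `unique_proteins` matches the expected number for the user genome, the larger figure in the log most likely reflects the added reference database and not a problem with the input file.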

    SOLUTION:
    Somehow it attempted to load merged.self-diamond.tab_mcl20.clusters from the folder I ran the script from. However, this file was corrupted since I had interrupted an earlier run prematurely. So instead of creating this file, it loaded the existing one and then ran in an infinite loop on it. Setting --force resolved the problem.

  3. Ben Bolduc

    Thanks for using the tool and I’m glad you were able to solve it. vConTACT2 will attempt to find an old run, if possible. The restart mechanism is a bit buggy, so I recommend fresh runs (i.e. remove the directory created) after any sort of “failure.” The only exception is using --blast-fp or the “_pcs.csv” and “_contigs.csv” files generated as intermediates. Those are good checkpoint files for moving forward.

  4. Aaron Pfennig

    Yes, the main issue was that it tried to access the file in the directory I executed the script from and not in the specified output directory. So I wasn’t aware that it attempted to reuse an earlier version …
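Since the leftover file was picked up from the directory the command was launched in, it can help to look there for intermediates before starting a new run. The sketch below is only an illustration: the function name `find_leftover_intermediates` is made up, and the default pattern is simply derived from the file names that appear in the log above (merged.self-diamond.tab, merged.self-blastp.tab and the _mcl20.clusters file).

```python
from pathlib import Path
from typing import List


def find_leftover_intermediates(
    directory: str = ".", pattern: str = "merged.self-*"
) -> List[Path]:
    """Return regular files in `directory` whose names match `pattern`, sorted."""
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())


if __name__ == "__main__":
    # List candidates in the current working directory with their sizes,
    # so a truncated file from an interrupted run is easy to spot.
    for path in find_leftover_intermediates():
        print(f"{path.name}\t{path.stat().st_size} bytes")
```

Anything it lists can then be moved or deleted by hand, or the run can be repeated with the overwrite option mentioned in the log.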
