Protein clustering comparisons

Issue #20 resolved
Former user created an issue

Hello,

I would like to place the viral populations from my study in the context of known viruses by clustering their predicted proteins with predicted proteins from viral sequences in public databases: the bacterial and archaeal viral genomes from the NCBI RefSeq database (v75, June 2016) and also IMG/VR v.2.0, using vContact2. This is basically what was done in "Host-linked soil viral ecology along a permafrost thaw gradient" (Nat. Microb., 2018).

I simply concatenated the viral contigs from my study with the viral contigs from the IMG/VR database, then ran Prodigal and vcontact2_gene2genome before running vContact2. However, I get an error during the vContact2 run, so I assume this is not the right way to do it.

Do you have any suggestions for comparing my viral contigs against RefSeq and also another database using vContact2?

Thank you. Clément

Comments (4)

  1. Clement Coclet

    Hello again,

    @Ben Bolduc I looked at the error file and found this error:

    “Searching alignments... slurmstepd: error: *** JOB 5934167 ON c452-084 CANCELLED AT 2020-06-28T00:10:56 DUE TO TIME LIMIT ***”

    So I assume my method is right, but how can I avoid this error (caused by the large IMG/VR2 DB)?

    Thank you.

  2. Ben Bolduc

    Hi Clément,

    The methodology you’re using is correct: combine your viral contigs + IMG/VR, predict genes with Prodigal, then run gene2genome → vConTACT2. Ensure you’re using Diamond (and not blastp) to run the protein comparison. Blastp will take a week, if not longer, at the scales you’re working at.
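A minimal sketch of the workflow described above, written as a Python helper that only builds the three command lines so they can be inspected before submission. The flag names are recalled from the Prodigal and vConTACT2 documentation and should be checked against each tool's `--help`; the file names, the `REFERENCE_DB_NAME` placeholder, and the ClusterONE jar path are assumptions made for illustration.

```python
import shlex
from pathlib import Path


def build_pipeline(contigs_fna, workdir, db, c1_jar):
    """Return the three command lines (as argv lists) for the workflow.

    contigs_fna: combined FASTA (study contigs + IMG/VR contigs).
    db: reference database name accepted by `vcontact2 --db`
        (placeholder here; check `vcontact2 --help` for valid values).
    c1_jar: path to the ClusterONE jar (hypothetical path).
    """
    workdir = Path(workdir)
    proteins = workdir / "proteins.faa"    # hypothetical file name
    g2g = workdir / "gene2genome.csv"      # hypothetical file name
    return [
        # 1. Predict proteins on the combined contigs (metagenomic mode).
        ["prodigal", "-i", str(contigs_fna), "-a", str(proteins), "-p", "meta"],
        # 2. Map each predicted protein back to its contig.
        ["vcontact2_gene2genome", "-p", str(proteins), "-o", str(g2g),
         "-s", "Prodigal-FAA"],
        # 3. Cluster with Diamond (not blastp) for the protein comparison.
        ["vcontact2", "--raw-proteins", str(proteins), "--proteins-fp", str(g2g),
         "--rel-mode", "Diamond", "--db", db,
         "--pcs-mode", "MCL", "--vcs-mode", "ClusterONE",
         "--c1-bin", str(c1_jar),
         "--output-dir", str(workdir / "vcontact2_out")],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in build_pipeline("combined_contigs.fna", "run1",
                              db="REFERENCE_DB_NAME",
                              c1_jar="cluster_one-1.0.jar"):
        print(shlex.join(cmd))
```

To actually run a step, each list can be passed to `subprocess.run(cmd, check=True)` once the paths and database name have been verified.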

    It’s not a mistake per se with vConTACT2. The system you’re using is running into a walltime limit. If you’re using CyVerse, you’re limited to a 48-hr runtime. If using KBase… it might be 5 days (I’d need to double-check). Depending on where the job gets canceled (at what point during vConTACT2 processing), you may be able to jump-start a new run using the intermediate files from the failed run. One “checkpoint” would be the creation of the “vConTACT_pcs”, “_contigs”, and “_profiles” files. You can use those 3 files as the legacy input files. Another checkpoint is the creation of the diamond (*.dmnd) file.
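The checkpoint check described above could be automated with a small script like the following. It is only a sketch: the glob patterns are built from the file names mentioned in this comment (the exact extensions are an assumption), and the function names and return labels are hypothetical.

```python
from pathlib import Path

# Patterns derived from the checkpoint names mentioned above; the exact
# file extensions are an assumption, hence the trailing wildcard.
CHECKPOINT_PATTERNS = {
    "pcs": "vConTACT_pcs*",
    "contigs": "vConTACT_contigs*",
    "profiles": "vConTACT_profiles*",
    "diamond_db": "*.dmnd",
}


def find_checkpoints(output_dir):
    """Map each checkpoint name to the first matching file, or None."""
    out = Path(output_dir)
    found = {}
    for name, pattern in CHECKPOINT_PATTERNS.items():
        matches = sorted(out.glob(pattern)) if out.is_dir() else []
        found[name] = matches[0] if matches else None
    return found


def resume_hint(found):
    """Suggest where a new run could restart from.

    'legacy'  : all three legacy input files exist
    'diamond' : only the Diamond database was built
    'restart' : nothing reusable was found
    """
    if all(found[key] for key in ("pcs", "contigs", "profiles")):
        return "legacy"
    if found["diamond_db"]:
        return "diamond"
    return "restart"


if __name__ == "__main__":
    # "vcontact2_out" is a hypothetical output directory of the failed run.
    print(resume_hint(find_checkpoints("vcontact2_out")))
```

The script only reports which checkpoint is available; how to feed those files back into a new run should be taken from the vConTACT2 help output.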

    I’m not sure if you’re using IMG/VR’s isolates or UViGs, or their complete collection, but I expect the run to take several days. If you can use either of the above 2 checkpoints, that would greatly help. If neither of those checkpoint files is created, then the only other alternative I can suggest w/out modifying your input data would be to install vConTACT2 on your local machine/cluster and run it with a longer limit. As a last resort, you might be able to de-replicate some of your genomes at 97% identity across 80-85% genome length. That might help, but I can’t say if that’ll dramatically reduce the numbers.

    Cheers,

    Ben

  3. Ben Bolduc

    Large-scale analyses might be more challenging on CyVerse and/or KBase, considering there are walltime limits linked to the vConTACT2 app(s) themselves. If you encounter such issues, the recommendations are to 1) reduce your input genomes via de-replication or 2) install vConTACT2 on a local HPC and run the job with longer walltime limits.
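To illustrate the de-replication idea (97% identity across roughly 80% of genome length, as suggested in the earlier comment), here is a minimal greedy sketch. It assumes the pairwise hits have already been computed by an all-vs-all nucleotide comparison with a tool of your choice, and it measures coverage against the shorter genome of each pair; both are assumptions, and the function name and input layout are hypothetical.

```python
def dereplicate(lengths, hits, min_identity=97.0, min_coverage=0.80):
    """Greedy de-replication that keeps the longest genome of each group.

    lengths: {genome_id: length_bp}
    hits:    iterable of (genome_a, genome_b, identity_pct, aligned_bp)
             from a precomputed all-vs-all comparison (assumed input).
    Coverage is aligned_bp divided by the shorter genome (an assumption).
    """
    redundant = {}
    for a, b, identity, aligned_bp in hits:
        if a == b or identity < min_identity:
            continue
        shorter = min(lengths[a], lengths[b])
        if aligned_bp / shorter >= min_coverage:
            redundant.setdefault(a, set()).add(b)
            redundant.setdefault(b, set()).add(a)

    kept, removed = [], set()
    # Visit genomes from longest to shortest (ties broken by id).
    for genome in sorted(lengths, key=lambda g: (-lengths[g], g)):
        if genome in removed:
            continue
        kept.append(genome)
        removed.update(redundant.get(genome, ()))
    return kept


if __name__ == "__main__":
    # Toy data: B is a near-duplicate of A; C only partially matches A.
    lengths = {"A": 40000, "B": 38000, "C": 12000}
    hits = [("A", "B", 98.5, 35000), ("A", "C", 99.0, 5000)]
    print(dereplicate(lengths, hits))
```

In the toy data, B is dropped because it aligns to A at 98.5% identity over about 92% of its length, while C is kept because the match covers under half of it.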

  4. Clement Coclet

    Hi Ben,

    Thank you very much for your reply. I think the best solution is to install vContact2 on a local HPC.
