Protein clustering comparisons

Clement Coclet

Hello again,

@Ben Bolduc I checked the error file and found this error:

“Searching alignments... slurmstepd: error: *** JOB 5934167 ON c452-084 CANCELLED AT 2020-06-28T00:10:56 DUE TO TIME LIMIT ***”

So I assume my method is correct, but how can I avoid this error (caused by the large IMG/VR2 DB)?

Thank you.

2020-07-02T10:04:53+00:00

Ben Bolduc

Hi Clément,

The methodology you’re using is correct: combine your viral contigs with IMG/VR, predict genes with Prodigal, then run gene2genome → vConTACT2. Make sure you’re using Diamond (and not blastp) to run the protein comparison. Blastp will take a week, if not longer, at the scales you’re working at.

It’s not a mistake per se with vConTACT2. The system you’re using is running into a walltime limit. If you’re using CyVerse, you’re limited to a 48-hr runtime. If using KBase… it might be 5 days (I’d need to double-check). Depending on where the job gets canceled (at what point during vConTACT2 processing), you may be able to jump-start a new run using the intermediate files from the failed run. One “checkpoint” is the creation of the “vConTACT_pcs”, “_contigs” and “_profiles” files. You can use those 3 files as the legacy input files. Another checkpoint is the creation of the Diamond (*.dmnd) file.
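To see which of these checkpoints a cancelled run reached, something like the following Python sketch may help. It only inspects the output directory of the failed run and does not call vConTACT2 itself. The glob patterns (`vConTACT_pcs*`, `vConTACT_contigs*`, `vConTACT_profiles*`, `*.dmnd`) are assumptions based on the file names mentioned above, and `find_checkpoints` is a hypothetical helper name, so adjust them to whatever your run actually wrote.

```python
from pathlib import Path

# Assumed file-name patterns: the thread only names the "vConTACT_pcs",
# "_contigs" and "_profiles" files and the "*.dmnd" Diamond file.
LEGACY_PATTERNS = {
    "pcs": "vConTACT_pcs*",
    "contigs": "vConTACT_contigs*",
    "profiles": "vConTACT_profiles*",
}
DIAMOND_PATTERN = "*.dmnd"


def find_checkpoints(output_dir):
    """Report which checkpoint files exist and are non-empty in output_dir."""
    out = Path(output_dir)

    def first_nonempty(pattern):
        # Return the first matching file with a size above zero, else None.
        for path in sorted(out.glob(pattern)):
            if path.is_file() and path.stat().st_size > 0:
                return path
        return None

    legacy = {name: first_nonempty(pat) for name, pat in LEGACY_PATTERNS.items()}
    return {
        "legacy_files": legacy,
        # All three legacy files are needed to restart from that checkpoint.
        "legacy_complete": all(p is not None for p in legacy.values()),
        "diamond_db": first_nonempty(DIAMOND_PATTERN),
    }
```

If `legacy_complete` is true, the three files can be passed as the legacy inputs; if only `diamond_db` is set, the run stopped somewhere after the Diamond file was built. A file cut off mid-write by the scheduler would still pass this size check, so it is worth inspecting the files before reusing them.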

I’m not sure if you’re using IMG/VR’s isolates or UViGs, or their complete collection, but I expect the run to take several days. If you can use either of the above 2 checkpoints, that would greatly help. If neither of those checkpoint files has been created, then the only other alternative I can suggest w/out modifying your input data would be to install vConTACT2 on your local machine/cluster and run it with a longer time limit. As a last resort, you might be able to de-replicate some of your genomes at 97% identity across 80-85% genome length. That might help, but I can’t say if that’ll dramatically reduce the numbers.
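For the de-replication idea, here is a rough Python sketch of one way to apply the 97% identity / 80% length thresholds once pairwise comparisons are available. It is a simple greedy approach and not necessarily what any particular de-replication tool does. The input format (a dict of genome lengths plus `(genome_a, genome_b, percent_identity, aligned_bp)` tuples), the function name `dereplicate`, and measuring coverage against the shorter genome of each pair are all assumptions made for illustration.

```python
def dereplicate(lengths, hits, min_identity=97.0, min_coverage=0.80):
    """Greedy de-replication: keep the longest genome of each similar group.

    lengths: dict mapping genome name -> length in bp
    hits: iterable of (genome_a, genome_b, percent_identity, aligned_bp)
    Returns (representatives, assigned), where assigned maps every genome
    to the representative that stands in for it.
    """
    # Record which genome pairs pass both thresholds (symmetric relation).
    similar = {}
    for a, b, identity, aligned_bp in hits:
        if a == b:
            continue
        # Assumption: coverage is relative to the shorter genome of the pair.
        shorter = min(lengths[a], lengths[b])
        if identity >= min_identity and aligned_bp / shorter >= min_coverage:
            similar.setdefault(a, set()).add(b)
            similar.setdefault(b, set()).add(a)

    representatives = []
    assigned = {}
    # Visit genomes from longest to shortest (name breaks ties) so the
    # longest member of a group becomes its representative.
    for genome in sorted(lengths, key=lambda g: (-lengths[g], g)):
        rep = next((r for r in representatives if r in similar.get(genome, ())), None)
        if rep is None:
            representatives.append(genome)
            assigned[genome] = genome
        else:
            assigned[genome] = rep
    return representatives, assigned
```

Only the sequences in `representatives` would then go into gene prediction and vConTACT2, and `assigned` lets you map the dropped genomes back onto the resulting clusters afterwards.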

Cheers,

Ben

2020-07-02T15:52:22+00:00

Ben Bolduc

changed status to resolved

Large-scale analyses can be challenging on CyVerse and/or KBase because the vConTACT2 app(s) themselves have walltime limits. If you encounter such issues, the recommendations are to 1) reduce your input genomes via de-replication, or 2) install vConTACT2 on a local HPC and run the job with a longer walltime limit.

2020-07-08T23:43:59+00:00

Clement Coclet

Hi Ben,

Thank you very much for your reply. I think the best solution is to install vConTACT2 on a local HPC.

2020-07-09T07:59:34+00:00
