all my viruses absent from output files vContact2 in CyVerse

Issue #47 new
pdalcinmartins created an issue

Hi Ben & Sullivan lab,

I have been trying to use vContact2 on CyVerse, without success, after several attempts with v0.9.8 and v0.9.19.

My input is a fasta file containing contigs that were VirSorted, CheckV’ed and VirSorted again as in https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-btv8nn9w

To generate my mapping file I have used vContact2-Gene2Genome_1.1.0 and vContact-Gene2Contig_1.0.1 after talking to Adjie

Jobs do run, but my viruses are all absent from the output files.

I would appreciate any advice on which vContact2 version and which mapping-file app will work on CyVerse.

If you want, I can share my files with you on CyVerse; just let me know!

Thank you,

Paula

Comments (5)

  1. Josue Rodriguez Ramos

    Hey Paula and all -

    Just wanted to chime in and say that I seem to be having the same issue. I also followed the SOP Paula posted above. I am not missing all of my viruses, but I am missing 6 of them. They show up in the merged.faa, vConTACT_contigs.csv, vConTACT_proteins.csv, and vConTACT_profiles.csv files. The proteins for the missing viruses also seem to be passed into the modules_mcl_5.0.clusters and vConTACT_pcs.csv files, and even into the c1.ntw and c1.clusters files, just not into genome_by_genome_overview.csv.

    There doesn’t seem to be any consistency in what is missing (assembly methods vary, their ID names vary, # of proteins vary, genome length varies). Based off of the c1.clusters file - it would also seem that the novelty of these varies - as I have a few that cluster within known Clostridium phages and others that would be considered novel genera (no singletons/outliers based on that file). 2/6 of the missing viruses are in a cluster with each other.
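
    For anyone wanting to repeat this check, here is a minimal Python sketch of the comparison described above: it collects genome IDs from the contigs table and reports those with no row in the final overview CSV. The column names `contig_id` and `Genome` are assumptions about the CSV headers, not something confirmed in this thread, so adjust them to match your files. The function names are illustrative only.

```python
import csv


def read_ids(path, column):
    """Return the set of non-empty values in one column of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {path}")
        return {row[column].strip() for row in reader if row[column]}


def missing_from_overview(contigs_csv, overview_csv,
                          contig_column="contig_id",   # assumed header name
                          genome_column="Genome"):     # assumed header name
    """List genome IDs present in the contigs table but absent from the overview."""
    expected = read_ids(contigs_csv, contig_column)
    reported = read_ids(overview_csv, genome_column)
    return sorted(expected - reported)
```

    Calling `missing_from_overview("vConTACT_contigs.csv", "genome_by_genome_overview.csv")` would return the IDs that have no overview row. Note that the contigs table may also contain reference genomes, so filter the result to your own IDs if needed.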

    I called my genes using prodigal (from DRAMv output) and changed the parameter accordingly when I ran vContact2-Gene2Genome_1.1.0. I used vContact2 v0.9.8 on CyVerse.

    Any idea what might be happening?

  2. Ben Bolduc

    Hi Josue and Paula,

    If you could share files with me on CyVerse, that would help. And if possible, use 0.9.19 or above.

    One of the challenging issues here is - as Josue pointed out - that there isn’t any consistency in what’s being dropped. I have introduced a number of fixes specifically focused on reducing the number of genomes dropped during the final overview. While that has been mostly successful, there are still a number of users who have encountered this issue.

    I do try to limit how much I request user data (as I can understand concerns for the sensitive nature of research data) - but given that I cannot reproduce missing genomes in any of my testing data - it’s hard to identify the root issue.

    Part of the difficulty is that, due to limitations in ClusterONE, vConTACT2 must interrogate multiple input and intermediate files, cross-compare each of them, and summarize the data to give VC #s, cluster status, taxonomic info, overlap/outlier data, etc.

    -Ben

  3. Josue Rodriguez Ramos

    Hey Ben!

    Thanks for the response. I’m happy to share my data folder if it would help you get to the bottom of it. I shared it with user “bbolduc-iplant-2015”. Let me know if you need it shared to any other user as well.

    -Josué

  4. Ben Bolduc

    Hi Josué,

    Thank you for providing data, I’ll run it on my end and will see if I can’t find the appropriate fix.

    -Ben

  5. Ben Bolduc

    Hi Josué,

    Thank you again for sharing the data with me. I was at least able to confirm what I noticed in your genome-by-genome file.

    I’ve looked over your data and it appears that many of your contigs are clustered. If you download the genome-by-genome file and sort by your genomes, check the “VC Status” column. If they’re “Clustered”, that means they were placed into a genus-level group. They are not, however, assigned to an order/family/genus, so you’ll see Unassigned, Unassigned, Unassigned. We did not want users to blindly use the taxonomies as authoritative, but it seems we confused everyone in the process.
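
    As a rough sketch of the check described above (in Python, standard library only): tally the “VC Status” values for your own genomes in the genome-by-genome file, and list any of your IDs that have no row at all. The `Genome` column name and the function name are assumptions for illustration; “VC Status” is the column mentioned above.

```python
import csv
from collections import Counter


def vc_status_summary(overview_csv, my_genomes,
                      genome_column="Genome",      # assumed header name
                      status_column="VC Status"):  # column named in this thread
    """Count VC Status values for the given genome IDs and list IDs with no row."""
    wanted = set(my_genomes)
    counts = Counter()
    seen = set()
    with open(overview_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            genome = (row.get(genome_column) or "").strip()
            if genome in wanted:
                seen.add(genome)
                counts[(row.get(status_column) or "").strip()] += 1
    # IDs in `wanted` that never appeared are the truly missing genomes.
    return counts, sorted(wanted - seen)
```

    Genomes counted under “Clustered” are present in a genus-level group even if their taxonomy columns read Unassigned; only the IDs in the second return value are actually absent from the file.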

    The latest version on CyVerse is 0.9.19, but I’ll be pushing 0.10.0 to Bioconda and CyVerse whenever I can find some time.

    If you still don’t see your genomes, we can follow up via email.

    -Ben
