question about VC Subcluster

Issue #33 resolved
Yingli Zhou created an issue

Hi,

In ‘genome_by_genome_overview.csv’, there are contigs that do not belong to any subcluster. But there are also some subclusters that have only one viral contig. What is the difference between the two situations (like in the attached fig)?

I also found some contigs that belong to two subclusters (Overlap). Why is that?

Could you explain the ‘VC status’ and ‘VC subcluster’ for me?

I also wonder whether I can run vcontact2 without the reference database, if I just want to cluster my own contigs.

Thanks in advance

Comments (6)

  1. Ben Bolduc

    VC subcluster is a term used to describe the post-ClusterONE clustering refinement vContact2 does. ClusterONE gets the initial clusters, but then vContact2 goes in and “fixes” them by ensuring that VCs adhere to ICTV genera. So for every VC subcluster (VC_963_0), there’s the initial cluster (VC_963) and then the subcluster (the trailing _0).
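As a rough illustration of that naming, here is a minimal Python sketch that splits a subcluster label into its initial cluster and subcluster index. It assumes labels always follow the VC_<cluster>_<subcluster> pattern seen in VC_963_0; the helper name `split_subcluster` is made up for this example and is not part of vContact2.

```python
def split_subcluster(label):
    """Split a label such as 'VC_963_0' into ('VC_963', 0).

    Assumption: the text after the last underscore is the subcluster
    index and everything before it is the initial ClusterONE cluster.
    """
    parent, _, sub = label.rpartition("_")
    return parent, int(sub)


parent, sub = split_subcluster("VC_963_0")
print(parent, sub)
```

Two contigs labelled VC_963_0 and VC_963_1 would then share the same initial cluster (VC_963) but sit in different subclusters.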

    “VC Status” refers to what the overall placement of the genome was.

    There are multiple different statuses:

    Clustered: high-confidence clustering, which we argue is roughly equivalent to an ICTV genus

    Singleton: Had few or no gene similarities against other genomes. Most don’t even make it into the network

    Overlap: Genomes sharing overlap with other genome(s) from multiple VCs. Often, these viruses have shared core genes, or a large portion of their genome has a conserved region that is shared amongst many.

    Outlier: Had some genes shared with other genomes, but ClusterONE wasn’t confident enough to place them within a particular VC. We suspect these are related to the VCs they’re connected to (within the network), but not at the genus level. Probably, at the sub-family or family level though.

    Clustered/Singleton: A weird category. These are genomes that ClusterONE clustered into the same VC. However, when running a distance-based threshold based on the placement of ICTV/NCBI reference genomes, vContact2 decides that they are not in the same genus and therefore moves them to a subcluster. But when that genome goes to the new subcluster, there are no other genomes that get moved to that new subcluster, so it’s “alone.” Hence, why it’s a singleton. But not really, because it was clustered. It’s just that its cluster got split.
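To see how these statuses show up in your own output, a small Python sketch like the one below can tally the status column and list subclusters that contain a single genome. The column names "VC Status" and "VC Subcluster" are taken from the question and may be spelled differently in your genome_by_genome_overview.csv; the demo rows and the `summarize` helper are invented for illustration.

```python
import csv
import io
from collections import Counter


def summarize(rows, status_col="VC Status", sub_col="VC Subcluster"):
    """Count genomes per status and list subclusters holding one genome.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    The column names are assumptions; adjust them to match your file.
    """
    rows = list(rows)
    status_counts = Counter(r[status_col] for r in rows)
    # Genomes with an empty subcluster field (e.g. singletons) are skipped.
    sub_sizes = Counter(r[sub_col] for r in rows if r[sub_col])
    lonely = sorted(s for s, n in sub_sizes.items() if n == 1)
    return status_counts, lonely


# Made-up rows standing in for genome_by_genome_overview.csv.
demo = io.StringIO(
    "Genome,VC Status,VC Subcluster\n"
    "contig_1,Clustered,VC_963_0\n"
    "contig_2,Clustered,VC_963_0\n"
    "contig_3,Clustered/Singleton,VC_963_1\n"
    "contig_4,Singleton,\n"
)
status_counts, lonely = summarize(csv.DictReader(demo))
print(status_counts)
print(lonely)
```

For a real run, replace `demo` with `open("genome_by_genome_overview.csv", newline="")`. In the demo, contig_3 is the "Clustered/Singleton" case (a subcluster of one), while contig_4 has no subcluster at all.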

    Hope that helps!

  2. Matthew DeMaere

    Perhaps I have just missed it, but it would be great if this information was put into the Wiki.

  3. Ben Bolduc

    Thanks for the question and the advice. VC Statuses have been added to the wiki (I might add a more thorough explanation later).

  4. Zongzhi Wu

    I have two questions.

    1. I also want to know if I can use vcontact2 to cluster only the viral contigs from my datasets, without the refseq databases provided by vcontact2.

    2. Or can I even cluster my own viral contigs together with all refseq DNA viruses, including eukaryotic viruses, other than the refseq databases provided by vcontact2?
