To generate a taxonomic count table

Issue #22 on hold
Former user created an issue

Dear vConTACT2 team, first of all, thanks for your great work on vConTACT2. I am wondering how I can get a taxonomic count table, similar to an OTU table. So far I have 'genome_by_genome_overview.csv' and 'viral_cluster_overview.csv'. Does each line in 'genome_by_genome_overview.csv' represent a specific virus, or one member of a group (viral cluster)? Could I treat each line as a taxonomic unit and use the 'Members' column of 'viral_cluster_overview.csv' to sum the read counts for each VC and generate a count table? Or could you show me a better way to generate a taxonomic count table based on the vConTACT results? I am looking forward to your reply. Thanks a lot!

genome_by_genome_overview.csv:

,Genome,Order,Family,Genus,VC,VC Status,Size,VC Subcluster,VC Subcluster Size,Quality,Adj P-value,Topology Confidence Score,Genera in VC,Families in VC,Orders in VC,Genus Confidence Score
0,Achromobacter~phage~83-24,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered,2,VC_0_0,2,0.1952,0.95226825,0.1859,1,1,1,1.0
1,Achromobacter~phage~JWAlpha,Caudovirales,Podoviridae,Jwalphavirus,8_1,Clustered,11,VC_8_1,11,0.4755,1.0,0.4755,3,1,1,0.9818

viral_cluster_overview.csv:

,,VC,Size,Internal Weight,External Weight,Quality,P-value,Min Dist,Max Dist,Total Dist,Below Thres,Taxon Prediction Score,Avg Dist,Genera,Families,Orders,Members
0,VC_0_0,2,155.06242197581085,639.2083422889191,0.1952261482510373,0.04773175196323191,1.7320508075688772,1.7320508075688772,1,1,1.0,1.7320508075688772,1,1,1,"Achromobacter~phage~83-24,Achromobacter~phage~JWX"
1,VC_1000_0,5,16.007331291300495,12.280942590828833,0.5658645471971648,0.3717981656013571,1.7320508075688772,2.6457513110645907,10,10,1.0,2.3080226590546964,1,1,1,"k141_1143022_length_14828_cov_72.0000,k141_1292517_length_10485_cov_134.5822,k141_1980014_length_12453_cov_102.0362,k141_4986945_length_9470_cov_84.1939,k141_767706_length_6153_cov_139.2982"

Comments (2)

  1. Ben Bolduc

    Hello,

    To get a taxonomic count, it’s a little more complicated than simply taking the lines from the genome-by-genome file and counting up a taxon column.

    The easiest way I can think to do this would be to:

    1. Sort and group by each VC (VC_10_0, VC_11_0, VC_12_1, VC_12_2…)
    2. Identify the “majority rules” Order, Family, Genus for each VC

      1. So VC_10_0 could have 5 of Caudovirales and 2 unknown - that VC would be a Caudovirales VC.
      2. Some VCs may not have a majority, or even more important, might have multiple genera - you can either call those mixed or go with the majority
    3. Count up the members of each VC, using the majority taxon to describe that VC

      1. VC_10_0 would have 7 counts towards Caudovirales
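The steps above can be sketched in Python using only the standard library. This is a minimal sketch, not part of vConTACT2: the function names `majority_label` and `taxon_counts`, the `"Mixed"` label, the set of values treated as unknown, and the sample rows are all assumptions made for illustration. It also assumes the `VC_x_y` labels live in the `VC Subcluster` column, as in the excerpt posted in the question.

```python
import csv
import io
from collections import Counter, defaultdict

# Values treated as "no taxonomy" (an assumption; adjust to your own file).
UNKNOWN = {"", "Unassigned", "unknown"}

# Made-up rows in the layout of genome_by_genome_overview.csv,
# reduced to the columns this sketch needs.
SAMPLE = """Genome,Order,VC Subcluster
g1,Caudovirales,VC_10_0
g2,Caudovirales,VC_10_0
g3,Caudovirales,VC_10_0
g4,Caudovirales,VC_10_0
g5,Caudovirales,VC_10_0
contig_1,Unassigned,VC_10_0
contig_2,Unassigned,VC_10_0
g6,Caudovirales,VC_11_0
g7,Ligamenvirales,VC_11_0
contig_3,Unassigned,
"""


def majority_label(labels, unknown=UNKNOWN):
    """Return the taxon held by more than half of the known labels."""
    known = Counter(label for label in labels if label not in unknown)
    if not known:
        return "Unassigned"
    label, n = known.most_common(1)[0]
    # No strict majority among known labels: call the VC "Mixed".
    return label if n * 2 > sum(known.values()) else "Mixed"


def taxon_counts(rows, rank="Order", vc_column="VC Subcluster"):
    """Count VC members per taxon, labelling each VC by majority rule."""
    # Step 1: group the taxon labels of all genomes by VC.
    by_vc = defaultdict(list)
    for row in rows:
        vc = row[vc_column].strip()
        if vc:  # genomes without a VC are skipped
            by_vc[vc].append(row[rank].strip())
    # Steps 2 and 3: every member of a VC counts towards its majority taxon.
    counts = Counter()
    for labels in by_vc.values():
        counts[majority_label(labels)] += len(labels)
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = taxon_counts(rows)
print(dict(counts))
```

With the sample rows, VC_10_0 contributes 7 counts to Caudovirales (5 known plus 2 unknown members), and VC_11_0 is labelled "Mixed". Passing `rank="Family"` or `rank="Genus"` would apply the same rule at another rank, provided that column is present in the input. These are genome counts; a read-based table would need per-genome read counts from a separate mapping step, which vConTACT2 does not produce.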

    A future update to vConTACT will “fix” this annoying issue for users.

    Thanks for your use of the tool!

    Cheers,

    Ben
