Linking contigID to VC

Issue #35 new
Susheel Bhanu Busi created an issue

Dear @Ben Bolduc ,

I was wondering if there are any intermediate files which link the original contigID to the VC. What is the best way to go about this?

For example, my gene2genome.csv looks like this:

protein_id,contig_id,keywords
GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_2,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_3,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_4,GL_6_Down_contig_1783590,None_provided

However, in the merged_df.csv file, the contig_id column looks like this:

,pos,contig_id,proteins,origin,order,family,genus
0,0,Acholeplasma~virus~L2,16,RefSeq-94,,Plasmaviridae,Plasmavirus
1,1,Acholeplasma~virus~MV-L51,4,RefSeq-94,,Inoviridae,Plectrovirus
2,2,Achromobacter~phage~83-24,61,RefSeq-94,Caudovirales,Siphoviridae,Jwxvirus
3,3,Achromobacter~phage~JWAlpha,91,RefSeq-94,Caudovirales,Podoviridae,Jwalphavirus

Is there a file that links these two?

Thank you for your help with this!

Comments (15)

  1. Ben Bolduc

    Hi @Susheel Bhanu Busi ,

    The contig_id should be identical, with the exception of “~”, which are used to replace spaces in the contig IDs. In fact, gene2genome gets merged with the reference taxonomy table using the same headers.

    If there are any differences (outside of ~), please let me know!

    Cheers,

    Ben
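
The "~" substitution described above can be checked directly. The following is a minimal sketch, not part of vConTACT2: it assumes the two CSV layouts quoted in this thread, uses short inline excerpts in place of the real files, and the helper names `normalize` and `contig_ids` are made up for illustration. It reports which user contig IDs from gene2genome.csv are absent from merged_df.csv once spaces are replaced with "~".

```python
import csv
import io

# Inline excerpts copied from the thread; in practice, open the real files
# (e.g. open("gene2genome.csv", newline="")) instead of io.StringIO.
GENE2GENOME = """protein_id,contig_id,keywords
GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_2,GL_6_Down_contig_1783590,None_provided
"""

MERGED_DF = """,pos,contig_id,proteins,origin,order,family,genus
0,0,Acholeplasma~virus~L2,16,RefSeq-94,,Plasmaviridae,Plasmavirus
1,1,Acholeplasma~virus~MV-L51,4,RefSeq-94,,Inoviridae,Plectrovirus
"""


def normalize(contig_id):
    """Replace spaces with '~', the substitution described in the thread."""
    return contig_id.replace(" ", "~")


def contig_ids(handle, column="contig_id"):
    """Collect the normalized, de-duplicated IDs found in one CSV column."""
    return {normalize(row[column]) for row in csv.DictReader(handle)}


user_ids = contig_ids(io.StringIO(GENE2GENOME))
merged_ids = contig_ids(io.StringIO(MERGED_DF))

# User contigs with no matching row in merged_df.csv.
missing = sorted(user_ids - merged_ids)
print(f"{len(missing)} of {len(user_ids)} user contigs are absent from merged_df.csv")
for cid in missing:
    print(cid)
```

Note that `head` only shows the first rows of a file, so scanning the whole file this way is a more reliable test than inspecting the top of the output.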

  2. Susheel Bhanu Busi reporter

    Dear @Ben Bolduc ,

    Please see the example above. The merged_df.csv file **does not** contain the original contig ID at all.

    My contig IDs are represented in the gene2genome.csv file, which looks like this:

    protein_id,contig_id,keywords
    GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided
    GL_6_Down_contig_1783590_2,GL_6_Down_contig_1783590,None_provided
    GL_6_Down_contig_1783590_3,GL_6_Down_contig_1783590,None_provided
    GL_6_Down_contig_1783590_4,GL_6_Down_contig_1783590,None_provided
    

    Or does it use the keywords column of gene2genome.csv to merge with the reference taxonomy table? If so, could the None_provided values I supplied be causing the missing information?

    Other relevant files may be the following too:

    head vConTACT_contigs.csv
    contig_id,proteins
    Acholeplasma virus L2,16
    Acholeplasma virus MV-L51,4
    Achromobacter phage 83-24,61
    Achromobacter phage JWAlpha,91
    Achromobacter phage JWF,118
    Achromobacter phage JWX,67
    

    Thanks again for your prompt help!

  3. Ben Bolduc

    It looks like your gene2genome file is correctly formatted, and the keywords column is fine with “None_provided.”

    The vConTACT_contigs.csv file also looks good. It’s only getting a count of the number of proteins from each identified genome.

    Merged_df.csv should contain all the references (w/ ~), taxonomy, and your genomes with only the number of proteins identified.

    Could you provide the run log? Usually, an improperly formatted input file gets noticed somewhere along the way, but it sounds like the run completed successfully. Does your proteins.faa file include the protein_id as headers?

    -Ben
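
The header question above can be answered with a quick consistency check. This is a rough sketch under stated assumptions: the FASTA content is a hypothetical stand-in (the thread never shows the real proteins.faa), the sequences are invented placeholders, and `fasta_ids` and `csv_protein_ids` are illustrative helper names, not vConTACT2 functions. It compares the first token of each FASTA header against the protein_id column of gene2genome.csv.

```python
import csv
import io

# gene2genome.csv excerpt copied from the thread.
GENE2GENOME = """protein_id,contig_id,keywords
GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_2,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_3,GL_6_Down_contig_1783590,None_provided
"""

# HYPOTHETICAL proteins.faa content: headers and sequences are placeholders.
PROTEINS_FAA = """>GL_6_Down_contig_1783590_1 some description
MKTAYIAK
>GL_6_Down_contig_1783590_2
MSLLTEVE
"""


def fasta_ids(handle):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def csv_protein_ids(handle):
    """Return the set of values in the protein_id column."""
    return {row["protein_id"] for row in csv.DictReader(handle)}


faa = fasta_ids(io.StringIO(PROTEINS_FAA))
g2g = csv_protein_ids(io.StringIO(GENE2GENOME))

only_in_csv = sorted(g2g - faa)  # listed in gene2genome.csv, no FASTA record
only_in_faa = sorted(faa - g2g)  # FASTA record, no gene2genome.csv row

print("Missing from proteins.faa:", only_in_csv)
print("Missing from gene2genome.csv:", only_in_faa)
```

Two empty lists would indicate that the two input files agree on protein IDs.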

  4. Ben Bolduc

    I do see that your genomes were dropped (fewer than 650?, though I can’t tell exactly how many user sequences).

    Could you update your vContact2 installation to 0.9.21? If you installed via conda, you can activate the environment, then

    git clone https://bitbucket.org/MAVERICLab/vcontact2
    cd vcontact2 && pip install . --upgrade
    

    And, also if possible, could you use a more recent database (i.e., v201)?

    I’m trying to figure out if it’s an issue that was resolved in the past few updates or an issue with the database.

  5. Susheel Bhanu Busi reporter

    Hi @Ben Bolduc,

    Thank you - I’ve updated vcontact2, but I do have a question about how to call the updated database.

    Typically, my command for the database looks like this:

    --db 'ProkaryoticViralRefSeq94-Merged'
    

    However, the newer databases seem to have the following naming structure:

     ViralRefSeq-prokaryotes-v201.faa.gz
     ViralRefSeq-prokaryotes-v201.Merged-reference.csv 
     ViralRefSeq-prokaryotes-v201.protein2contig.csv
    

    So, which of the following is my database command supposed to look like?

    --db 'ProkaryoticViralRefSeq201-Merged'
    
    (OR)
    --db 'ViralRefSeq-prokaryotes-v201.Merged'
    

    Thanks again for your help with this!

  6. Susheel Bhanu Busi reporter

    Hey @Ben Bolduc, I’ve updated vcontact2 and finished a run as well. The log file is attached for your perusal. Here is confirmation that I have the most recent version of the tool:

    (vcontact2) [sbusi@access1 nomis_viruses]$ vcontact2 -vv
    
    ============================This is vConTACT2 0.9.21============================
    

    Even after this run, neither the merged_df.csv nor the genome_by_genome_overview.csv output file seems to contain the original contigIDs. Please see below:

    (vcontact2) 0 [sbusi@access1 nomis_viruses]$ head /scratch/users/sbusi/nomis_viruses/results/vcontact2_output/GL_R10_GL11_UP_3/genome_by_genome_overview.csv
    ,Genome,Order,Family,Genus,VC,VC Status,Size,VC Subcluster,VC Subcluster Size,Quality,Adj P-value,Topology Confidence Score,Genera in VC,Families in VC,Orders in VC,Genus Confidence Score
    0,Achromobacter~phage~83-24,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered,2,VC_0_0,2,0.1875,0.95227493,0.1785,1,1,1,1.0
    1,Achromobacter~phage~JWAlpha,Caudovirales,Podoviridae,Jwalphavirus,7_0,Clustered,20,VC_7_0,20,0.5809,1.0,0.5809,7,1,1,0.8421
    2,Achromobacter~phage~JWF,Caudovirales,Siphoviridae,Unassigned,16_0,Clustered,2,VC_16_0,2,1.0,1.0,1.0,2,1,1,1.0
    3,Achromobacter~phage~JWX,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered,2,VC_0_0,2,0.1875,0.95227493,0.1785,1,1,1,1.0
    4,Achromobacter~phage~phiAxp-1,Caudovirales,Siphoviridae,Unassigned,1_0,Clustered,12,VC_1_0,12,0.7038,1.0,0.7038,5,1,1,1.0
    5,Achromobacter~phage~phiAxp-2,Caudovirales,Siphoviridae,Unassigned,19_0,Clustered,5,VC_19_0,5,0.0904,0.96238552,0.087,3,1,1,1.0
    6,Achromobacter~phage~phiAxp-3,Caudovirales,Podoviridae,Jwalphavirus,7_0,Clustered,20,VC_7_0,20,0.5809,1.0,0.5809,7,1,1,0.8421
    7,Acidianus~bottle-shaped~virus,Unassigned,Ampullaviridae,Ampullavirus,22_0,Clustered,3,VC_22_0,3,1.0,1.0,1.0,1,1,1,1.0
    8,Acidianus~bottle-shaped~virus~2,Unassigned,Ampullaviridae,Ampullavirus,22_0,Clustered,3,VC_22_0,3,1.0,1.0,1.0,1,1,1,1.0
    
    (vcontact2) 1 [sbusi@access1 nomis_viruses]$ head /scratch/users/sbusi/nomis_viruses/results/vcontact2_output/GL_R10_GL11_UP_3/merged_df.csv
    ,pos,contig_id,proteins,origin,order,family,genus
    0,0,Acholeplasma~virus~L2,16,RefSeq-201,,Plasmaviridae,Plasmavirus
    1,1,Acholeplasma~virus~MV-L51,4,RefSeq-201,Tubulavirales,Inoviridae,Plectrovirus
    2,2,Achromobacter~phage~83-24,61,RefSeq-201,Caudovirales,Siphoviridae,Jwxvirus
    3,3,Achromobacter~phage~JWAlpha,91,RefSeq-201,Caudovirales,Podoviridae,Jwalphavirus
    4,4,Achromobacter~phage~JWF,118,RefSeq-201,Caudovirales,Siphoviridae,
    5,5,Achromobacter~phage~JWX,67,RefSeq-201,Caudovirales,Siphoviridae,Jwxvirus
    6,6,Achromobacter~phage~phiAxp-1,64,RefSeq-201,Caudovirales,Siphoviridae,
    7,7,Achromobacter~phage~phiAxp-2,86,RefSeq-201,Caudovirales,Siphoviridae,
    8,8,Achromobacter~phage~phiAxp-3,80,RefSeq-201,Caudovirales,Podoviridae,Jwalphavirus
    

    Any other ideas on how best to retrieve these? It does look like the contigIDs are not being attached somewhere in the step that merges with the database.

    Thank you!!
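
For the lookup itself, one option is to search the whole genome_by_genome_overview.csv for each contig rather than reading its first rows. The sketch below is an illustration only, not vConTACT2 code: it assumes the `Genome` and `VC` columns shown in the output above, uses two reference rows from this thread (trimmed to the first seven columns) as inline sample data, and `link_contigs_to_vc` is a hypothetical helper name. A contig with no matching row maps to `None`.

```python
import csv
import io

# Two rows copied from the thread, trimmed to the first seven columns.
# In practice, open the real genome_by_genome_overview.csv instead.
OVERVIEW = """,Genome,Order,Family,Genus,VC,VC Status
0,Achromobacter~phage~83-24,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered
1,Achromobacter~phage~JWAlpha,Caudovirales,Podoviridae,Jwalphavirus,7_0,Clustered
"""


def link_contigs_to_vc(contigs, overview_handle):
    """Map each contig ID to its VC value, or None when it has no row.

    Spaces are replaced with '~' before matching, following the naming
    convention described earlier in the thread.
    """
    vc_by_genome = {
        row["Genome"]: row["VC"] for row in csv.DictReader(overview_handle)
    }
    return {c: vc_by_genome.get(c.replace(" ", "~")) for c in contigs}


queries = ["GL_6_Down_contig_1783590", "Achromobacter phage JWAlpha"]
links = link_contigs_to_vc(queries, io.StringIO(OVERVIEW))

for contig, vc in links.items():
    print(contig, "->", vc if vc is not None else "no row found")
```

If every user contig comes back as `None` on the full file, that would support the suspicion that the user genomes were dropped before clustering rather than renamed.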

  7. Susheel Bhanu Busi reporter

    Hey @Miguel Ángel Salazar: as you see above, I still had the error.

    @Ben Bolduc any updates on this front?

    Thank you!

  8. luoxiao

    Hi @yujie zhao, I ran into the same problem and have not been able to resolve it. Have you resolved it (linking contig ID to VC)?
