Linking contigID to VC
Dear @Ben Bolduc ,
I was wondering if there are any intermediate files which link the original contigID to the VC. What is the best way to go about this?
For example, my gene2genome.csv looks like so:
protein_id,contig_id,keywords
GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_2,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_3,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_4,GL_6_Down_contig_1783590,None_provided
However, in the merged_df.csv file, the contig_id looks like so:
,pos,contig_id,proteins,origin,order,family,genus
0,0,Acholeplasma~virus~L2,16,RefSeq-94,,Plasmaviridae,Plasmavirus
1,1,Acholeplasma~virus~MV-L51,4,RefSeq-94,,Inoviridae,Plectrovirus
2,2,Achromobacter~phage~83-24,61,RefSeq-94,Caudovirales,Siphoviridae,Jwxvirus
3,3,Achromobacter~phage~JWAlpha,91,RefSeq-94,Caudovirales,Podoviridae,Jwalphavirus
Is there a file that links these two?
Thank you for your help with this!
Comments (15)
-
-
reporter Dear @Ben Bolduc ,
Please see the example above. The merged_df.csv file **does not** contain the original contig ID at all. My contigIDs are represented in the gene2genome.csv file, which looks like this:
protein_id,contig_id,keywords
GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_2,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_3,GL_6_Down_contig_1783590,None_provided
GL_6_Down_contig_1783590_4,GL_6_Down_contig_1783590,None_provided
Or does it use the keyword part of the gene2genome.csv to merge with the reference taxonomy table, which, given that I had None_provided, may be causing the missing information?
Other relevant files may be the following too:
head vConTACT_contigs.csv
contig_id,proteins
Acholeplasma virus L2,16
Acholeplasma virus MV-L51,4
Achromobacter phage 83-24,61
Achromobacter phage JWAlpha,91
Achromobacter phage JWF,118
Achromobacter phage JWX,67
Thanks again for your prompt help!
-
It looks like your gene2genome file is correctly formatted, and the keywords column is fine with “None_provided.”
The vConTACT_contigs.csv file also looks good. It’s only getting a count of the number of proteins from each identified genome.
Merged_df.csv should contain all the references (w/ ~), taxonomy, and your genomes with only the number of proteins identified.
Could you provide the run log? Usually, an improperly formatted input file gets noticed somewhere along the way, but it sounds like the run completed successfully. Does your proteins.faa file include the protein_id as headers?
-Ben
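One way to answer the header question above is to compare the IDs in proteins.faa against the protein_id column of gene2genome.csv. The sketch below is only an illustration, not part of vConTACT2: the helper names are made up, and it assumes the ID is the first whitespace-delimited token of each FASTA header, which may not match how the tool itself parses headers.

```python
import csv
import io


def fasta_ids(handle):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def gene2genome_protein_ids(handle):
    """Return the set of values in the protein_id column of a gene2genome CSV."""
    return {row["protein_id"] for row in csv.DictReader(handle)}


def compare_ids(faa_handle, g2g_handle):
    """Return (IDs only in the FASTA, IDs only in gene2genome)."""
    in_faa = set(fasta_ids(faa_handle))
    in_g2g = gene2genome_protein_ids(g2g_handle)
    return in_faa - in_g2g, in_g2g - in_faa


# Tiny in-memory stand-ins (the sequences are placeholders). With real files,
# pass open("proteins.faa") and open("gene2genome.csv", newline="") instead.
demo_faa = io.StringIO(
    ">GL_6_Down_contig_1783590_1 # 1 # 300\nMKV\n"
    ">GL_6_Down_contig_1783590_2\nMLT\n"
)
demo_g2g = io.StringIO(
    "protein_id,contig_id,keywords\n"
    "GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided\n"
    "GL_6_Down_contig_1783590_3,GL_6_Down_contig_1783590,None_provided\n"
)

only_faa, only_g2g = compare_ids(demo_faa, demo_g2g)
print("only in FASTA:", sorted(only_faa))
print("only in gene2genome:", sorted(only_g2g))
```

If both lists come back empty on the real files, the two inputs at least agree with each other on protein IDs.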
-
reporter - attached slurm-2037236.iris-092-nomis.vcontact2.out.txt
- attached slurm-2037236.iris-092-nomis.vcontact2.err.txt
Thank you for looking into this!
-
I do see that your genomes were dropped (fewer than 650?, though I can’t tell exactly how many user sequences).
Could you update your vContact2 installation to 0.9.21? If you installed via conda, you can activate the environment, then run:
git clone https://bitbucket.org/MAVERICLab/vcontact2
cd vcontact2 && pip install . --upgrade
And, also if possible, could you use a more recent database (i.e., v201)?
I’m trying to figure out if it’s an issue that was resolved in the past few updates or an issue with the database.
-
reporter Hi @Ben Bolduc
Thank you - I’ve updated vcontact2, but I do have a question about how to call the updated database. Typically, my command for the database looks like this:
--db 'ProkaryoticViralRefSeq94-Merged'
However, the newer databases seem to have the following naming structure:
ViralRefSeq-prokaryotes-v201.faa.gz
ViralRefSeq-prokaryotes-v201.Merged-reference.csv
ViralRefSeq-prokaryotes-v201.protein2contig.csv
So, which of the following is my database command supposed to look like?
--db 'ProkaryoticViralRefSeq201-Merged'
(OR)
--db 'ViralRefSeq-prokaryotes-v201.Merged'
Thanks again for your help with this!
-
reporter - attached slurm-2279809.out.txt
Log file from updated vcontact2 with v201 database
-
reporter Hey @Ben Bolduc I’ve updated vcontact2 and finished a run as well. The log file is attached for your perusal. Here is confirmation that I have the most recent version of the tool:
(vcontact2) [sbusi@access1 nomis_viruses]$ vcontact2 -vv
============================This is vConTACT2 0.9.21============================
Even after this run, neither the merged_df.csv nor the genome_by_genome_overview.csv file seems to have the original contigIDs. Please see below:
(vcontact2) 0 [sbusi@access1 nomis_viruses]$ head /scratch/users/sbusi/nomis_viruses/results/vcontact2_output/GL_R10_GL11_UP_3/genome_by_genome_overview.csv
,Genome,Order,Family,Genus,VC,VC Status,Size,VC Subcluster,VC Subcluster Size,Quality,Adj P-value,Topology Confidence Score,Genera in VC,Families in VC,Orders in VC,Genus Confidence Score
0,Achromobacter~phage~83-24,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered,2,VC_0_0,2,0.1875,0.95227493,0.1785,1,1,1,1.0
1,Achromobacter~phage~JWAlpha,Caudovirales,Podoviridae,Jwalphavirus,7_0,Clustered,20,VC_7_0,20,0.5809,1.0,0.5809,7,1,1,0.8421
2,Achromobacter~phage~JWF,Caudovirales,Siphoviridae,Unassigned,16_0,Clustered,2,VC_16_0,2,1.0,1.0,1.0,2,1,1,1.0
3,Achromobacter~phage~JWX,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered,2,VC_0_0,2,0.1875,0.95227493,0.1785,1,1,1,1.0
4,Achromobacter~phage~phiAxp-1,Caudovirales,Siphoviridae,Unassigned,1_0,Clustered,12,VC_1_0,12,0.7038,1.0,0.7038,5,1,1,1.0
5,Achromobacter~phage~phiAxp-2,Caudovirales,Siphoviridae,Unassigned,19_0,Clustered,5,VC_19_0,5,0.0904,0.96238552,0.087,3,1,1,1.0
6,Achromobacter~phage~phiAxp-3,Caudovirales,Podoviridae,Jwalphavirus,7_0,Clustered,20,VC_7_0,20,0.5809,1.0,0.5809,7,1,1,0.8421
7,Acidianus~bottle-shaped~virus,Unassigned,Ampullaviridae,Ampullavirus,22_0,Clustered,3,VC_22_0,3,1.0,1.0,1.0,1,1,1,1.0
8,Acidianus~bottle-shaped~virus~2,Unassigned,Ampullaviridae,Ampullavirus,22_0,Clustered,3,VC_22_0,3,1.0,1.0,1.0,1,1,1,1.0
(vcontact2) 1 [sbusi@access1 nomis_viruses]$ head /scratch/users/sbusi/nomis_viruses/results/vcontact2_output/GL_R10_GL11_UP_3/merged_df.csv
,pos,contig_id,proteins,origin,order,family,genus
0,0,Acholeplasma~virus~L2,16,RefSeq-201,,Plasmaviridae,Plasmavirus
1,1,Acholeplasma~virus~MV-L51,4,RefSeq-201,Tubulavirales,Inoviridae,Plectrovirus
2,2,Achromobacter~phage~83-24,61,RefSeq-201,Caudovirales,Siphoviridae,Jwxvirus
3,3,Achromobacter~phage~JWAlpha,91,RefSeq-201,Caudovirales,Podoviridae,Jwalphavirus
4,4,Achromobacter~phage~JWF,118,RefSeq-201,Caudovirales,Siphoviridae,
5,5,Achromobacter~phage~JWX,67,RefSeq-201,Caudovirales,Siphoviridae,Jwxvirus
6,6,Achromobacter~phage~phiAxp-1,64,RefSeq-201,Caudovirales,Siphoviridae,
7,7,Achromobacter~phage~phiAxp-2,86,RefSeq-201,Caudovirales,Siphoviridae,
8,8,Achromobacter~phage~phiAxp-3,80,RefSeq-201,Caudovirales,Podoviridae,Jwalphavirus
Any other ideas on how best to retrieve these? It does look like somewhere in the merging with database step the contigIDs are not being attached.
Thank you!!
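Since head only shows the first rows of merged_df.csv, which here are all references, it may help to tally the whole file by its origin column before concluding that the user contigs are absent. This is a minimal sketch, not part of vConTACT2: the function name is hypothetical, the third demo row is invented for illustration, and how user genomes are labelled in origin is not stated in this thread, so empty values are simply reported as <empty>.

```python
import csv
import io
from collections import Counter


def rows_per_origin(handle):
    """Count the rows of a merged_df.csv-style table by their 'origin' value."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # An empty origin is reported under a visible placeholder label.
        counts[row["origin"] or "<empty>"] += 1
    return counts


# In-memory stand-in. The first two rows are copied from the output above;
# the last row is HYPOTHETICAL and only shows what a non-reference row
# with an empty origin would look like to this tally.
demo_merged = io.StringIO(
    ",pos,contig_id,proteins,origin,order,family,genus\n"
    "0,0,Acholeplasma~virus~L2,16,RefSeq-201,,Plasmaviridae,Plasmavirus\n"
    "1,1,Acholeplasma~virus~MV-L51,4,RefSeq-201,Tubulavirales,Inoviridae,Plectrovirus\n"
    "2,2,GL_6_Down_contig_1783590,4,,,,\n"
)

origin_counts = rows_per_origin(demo_merged)
for origin, n in sorted(origin_counts.items()):
    print(origin, n)
```

With the real file, pass open("merged_df.csv", newline="") instead of the demo handle. If every row falls under a RefSeq origin, that would be consistent with the user genomes having been dropped before this table was written.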
-
Hi @Susheel Bhanu Busi
Were you able to solve this? The same is happening to me.
-
reporter Hey @Miguel Ángel Salazar: as you see above, I still had the error.
@Ben Bolduc any updates on this front?
Thank you!
-
reporter @Ben Bolduc Would you happen to have any updates on this?
Thank you!
-
Hi @Ben Bolduc I am having the same issue, any updates? Thanks!
-
Hi @yujie zhao, I ran into the same problem (linking contig ID to VC) and have not resolved it. Have you resolved it?
-
I also have this issue @Ben Bolduc
-
@yujie zhao @luoxiao did you ever resolve this?
-
Hi @Susheel Bhanu Busi ,
The contig_id should be identical, with the exception of “~”, which is used to replace spaces in the contig IDs. In fact, gene2genome gets merged with the reference taxonomy table using the same headers.
If there are any differences (outside of ~), please let me know!
Cheers,
Ben
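Going by the reply above (IDs identical except that spaces become “~”), a lookup from gene2genome.csv into genome_by_genome_overview.csv could be sketched as follows. This is an illustration under that assumption, not vConTACT2 code: the function names and the "my phage 1" rows are hypothetical, and the demo keeps only the first few overview columns.

```python
import csv
import io


def to_vcontact_id(contig_id):
    """Apply the renaming described above: spaces in a contig ID become '~'."""
    return contig_id.replace(" ", "~")


def link_contigs_to_vc(g2g_handle, overview_handle):
    """Map each contig_id in gene2genome to its VC in the overview table.

    Contigs that do not appear in the overview's Genome column map to None.
    """
    contigs = {row["contig_id"] for row in csv.DictReader(g2g_handle)}
    vc_by_genome = {
        row["Genome"]: row["VC"] for row in csv.DictReader(overview_handle)
    }
    return {c: vc_by_genome.get(to_vcontact_id(c)) for c in sorted(contigs)}


# In-memory stand-ins. The "my phage 1" contig and its overview row are
# HYPOTHETICAL; the Achromobacter row is taken from the output in this thread.
demo_g2g = io.StringIO(
    "protein_id,contig_id,keywords\n"
    "GL_6_Down_contig_1783590_1,GL_6_Down_contig_1783590,None_provided\n"
    "my phage 1_1,my phage 1,None_provided\n"
)
demo_overview = io.StringIO(
    ",Genome,Order,Family,Genus,VC,VC Status\n"
    "0,Achromobacter~phage~83-24,Caudovirales,Siphoviridae,Jwxvirus,0_0,Clustered\n"
    "1,my~phage~1,,,,7_0,Clustered\n"
)

links = link_contigs_to_vc(demo_g2g, demo_overview)
for contig, vc in links.items():
    print(contig, vc)
```

Contigs that come back as None are the ones worth reporting, since by the description above they should have been present under the same name.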