Issue in creating gene-to-genome mapping file
Hello!
I need help in making gene-to-genome mapping file for running VContact2
As per below command, proteins file I'm using is output file from VIBRANT and i used a script as mentioned here (https://github.com/AnantharamanLab/VIBRANT/issues/29#issuecomment-700921371) to generate the gene-to-genome mapping file but while passing these inputs to below command i'm getting errors.
vcontact2 --raw-proteins
[proteins file
]--rel-mode ‘Diamond’ --proteins-fp
[gene-to-genome mapping file
]--db 'ProkaryoticViralRefSeq97-Merged' --pcs-mode MCL --vcs-mode ClusterONE --c1-bin
[path to ClusterONE
]--output-dir
[target output directory
]
Below is how gene to genome file (first) made from above script and the proteins file (second) from VIBRANT output looks like.
Thank you!
Comments (2)
-
-
Hi Paula,
@Matthew is correct, it looks like the identifiers are incorrect. The fasta file you attached would need to have the spaces between its ID + description replaced with another character. The description field of a fasta file is anything after the first space.
protein_id contig_id keywords MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_1 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_2 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_3 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_4 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None Note: The contig_id doesn’t change, only the proteins (protein_id) associated with it change.
VIBRANT seems to be adding _N to the genome ID/description. It also seems to be adding the gene position(1…12762) as well as the reading frame. The gene2genome accessory script won’t be able to handle a format with the extra information within the description, so I’m trying to think of the easiest way to fix the gene-to-genome file and the fasta.
Let’s try:
- Replace the spaces in your input file for VIBRANT and re-run VIBRANT.
- In one of the output directories, “VIBRANT_results_*” there’s a file, “VIBRANT_genbank_table_*.tsv” with the following headers: “protein, scaffold, accession, name”.
- Import that into your favorite table editor
- Replace “protein” with “protein_id”, replace “scaffold” with “contig_id”, replace “accession” with “keywords” and delete the “name” column.
Use that new *.tsv file as the gene-to-genome file and the *.phages_combined.faa file as the protein input for vContact2.
-Ben
- Log in to comment
I am not a developer of vcontact2 but from what you’ve supplied here, a mistake I can see is to do with sequence identifiers. The standard FASTA file format considers the first whitespace as delimiting the id from the description. Your protein_id column should be split and only the first part kept. ie. “MH552500.1”.
I have seen code which does not respect the file format and takes the entire line, but this will invariably cause problems with any downstream tool.