Issue in creating gene-to-genome mapping file

Issue #42 new
Paula Istvan created an issue

Hello!

I need help in making gene-to-genome mapping file for running VContact2

As per below command, proteins file I'm using is output file from VIBRANT and i used a script as mentioned here (https://github.com/AnantharamanLab/VIBRANT/issues/29#issuecomment-700921371) to generate the gene-to-genome mapping file but while passing these inputs to below command i'm getting errors.

vcontact2 --raw-proteins[proteins file]--rel-mode ‘Diamond’ --proteins-fp[gene-to-genome mapping file]--db 'ProkaryoticViralRefSeq97-Merged' --pcs-mode MCL --vcs-mode ClusterONE --c1-bin[path to ClusterONE]--output-dir[target output directory]

Below is how gene to genome file (first) made from above script and the proteins file (second) from VIBRANT output looks like.

Thank you!

Comments (2)

  1. Matthew DeMaere

    I am not a developer of vcontact2 but from what you’ve supplied here, a mistake I can see is to do with sequence identifiers. The standard FASTA file format considers the first whitespace as delimiting the id from the description. Your protein_id column should be split and only the first part kept. ie. “MH552500.1”.

    I have seen code which does not respect the file format and takes the entire line, but this will invariably cause problems with any downstream tool.

  2. Ben Bolduc

    Hi Paula,

    @Matthew is correct, it looks like the identifiers are incorrect. The fasta file you attached would need to have the spaces between its ID + description replaced with another character. The description field of a fasta file is anything after the first space.

    protein_id contig_id keywords
    MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_1 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None
    MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_2 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None
    MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_3 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None
    MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome_4 MH552500.1_CrAssphage_sp._isolate_ctcc615_complete_genome None

    Note: The contig_id doesn’t change, only the proteins (protein_id) associated with it change.

    VIBRANT seems to be adding _N to the genome ID/description. It also seems to be adding the gene position(1…12762) as well as the reading frame. The gene2genome accessory script won’t be able to handle a format with the extra information within the description, so I’m trying to think of the easiest way to fix the gene-to-genome file and the fasta.

    Let’s try:

    1. Replace the spaces in your input file for VIBRANT and re-run VIBRANT.
    2. In one of the output directories, “VIBRANT_results_*” there’s a file, “VIBRANT_genbank_table_*.tsv” with the following headers: “protein, scaffold, accession, name”.
    3. Import that into your favorite table editor
    4. Replace “protein” with “protein_id”, replace “scaffold” with “contig_id”, replace “accession” with “keywords” and delete the “name” column.

    Use that new *.tsv file as the gene-to-genome file and the *.phages_combined.faa file as the protein input for vContact2.

    -Ben

  3. Log in to comment