How do I generate the input files for a vContact run?

Issue #18 resolved
Former user created an issue

Hello,

I would like to use this tool to generate a taxonomically oriented clustering of viral particles I recovered from a metagenomic dataset.

I was able to install it and run it with the example dataset, but I wonder how the inputs were generated.

--raw-proteins and --proteins-fp

The first seems to be a translation of the nucleotide sequences into amino acids. Should I consider multiple translation frames?

And the second is a CSV file linking protein names to genome names. How do I get that for my data?

Thanks a lot for the input. We are really looking forward to using this tool in our lab.

Best regards, Rodolfo

Comments (4)

  1. Ben Bolduc

    Hi Rodolfo,

    The --raw-proteins file is indeed an amino-acid translation of the nucleotide sequences. I often use Prodigal to generate it, but any gene prediction software would work.

    prodigal -i viral_genomes.fna -o viral_genomes.genes -a viral_genomes.faa -p meta
    

    Once done, you’ll need to create the “Gene to Genome” mapping file (--proteins-fp), which, as you stated, links the protein names to their genomes. I have included a naive wrapper, vcontact2_gene2genome.py, that will convert the gene predictions to this mapping file. The wrapper can also handle MetaGeneMark and some outputs from NCBI. Users of CyVerse - an NSF-funded cyberinfrastructure - can use the “vContact2-Gene2Genome” app to generate this file. And anyone using KBase will have their genomes automatically processed.

    vcontact2_gene2genome -p viral_genomes.faa -o viral_genomes_g2g.csv -s 'Prodigal-FAA'
    

    That should work for your specific situation. However, you’re free to create the mapping file however you want. All that’s required is a 3-column table with the headers “genome_id, protein_id, keywords.” You can also create keyword annotations with the “keywords” column, and those will be aggregated and summarized in one of the vcontact2 outputs. But very few people use keywords, and even fewer look at those outputs.
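    For anyone who prefers to build the mapping file by hand, here is a minimal sketch in Python. It is not the bundled wrapper. It assumes Prodigal-style FASTA headers, where each protein ID is the contig name followed by an underscore and a gene number, and it takes the column names and their order from the description above as written; compare them against the example file shipped with the tool before relying on it. The function names are hypothetical.

```python
import csv

# Column names and order as described in this thread (an assumption;
# check them against the example gene-to-genome file before use).
G2G_COLUMNS = ("genome_id", "protein_id", "keywords")


def faa_to_g2g_rows(faa_lines):
    """Build mapping rows from the lines of a Prodigal-style protein FASTA.

    Prodigal names each protein "<contig>_<gene number>", so the genome ID
    is recovered by dropping the final underscore-separated field.
    """
    rows = []
    for line in faa_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # empty header, nothing to map
        protein_id = tokens[0]
        genome_id = protein_id.rsplit("_", 1)[0]
        rows.append(
            {"genome_id": genome_id, "protein_id": protein_id, "keywords": ""}
        )
    return rows


def write_g2g(rows, handle, columns=G2G_COLUMNS):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

    To use it, open the .faa file for reading and the .csv file for writing, then pass the first handle to faa_to_g2g_rows and the result to write_g2g. Headers from other gene predictors follow different naming schemes, so the rsplit step would need to change for them.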

    Cheers,

    Ben

  2. Zhanwen Cheng

    Hi Ben, I am using Prodigal and vcontact2_gene2genome to generate g2g.csv from your previously published ‘GOV2_viral_populations_larger_than_5KB_or_circular.fasta’. I noticed that there were 488131 viral contigs in the FASTA file, but Prodigal predicted genes for only 452963 of them, so only those can be used as input to vcontact2. What should I do about the 35168 viral contigs with no predictions?

  3. Adhip Mukhopadhyay

    Hello

    I am new to CyVerse and am using the vContact2-Gene2Genome 1.1.0 app: https://de.cyverse.org/apps/agave/vContact2-Gene2Genome-1.1.0u1.

    I submitted a job to create the “Gene to Genome” mapping file about 24 hours ago. The analysis id is 7bcd271c-0f95-4469-9028-ddd09f4b94ea-007.

    The status is still showing ‘submitted’, but the info panel is showing ‘2021-06-13 14:37:35 - FINISHED; 2021-06-13 14:37:35 - Job completed successfully’.

    The ‘protein.csv’ file in the output directory is also empty.

    Please help me to resolve the issue.

    Thanks

    Adhip
