Missing genomes in output
I've run the latest version of the tool with the test data and it completed without errors, but some genomes are missing from the output.
There are 575 user-supplied genomes in the test file:
$ cut -f2 MAVERICLab-vcontact2-a3541dd53c3e/test_data/proteins.csv -d ',' | sed 1d | sort -u | wc -l
575
But only 246 are found in the viral_cluster_overview file:
$ grep -o 'VIR' viral_cluster_overview.csv | wc -l
246
And the genome_by_genome file is not present:
$ ls
c1.clusters modules_mcl_5.0_modules.pandas
c1.ntw modules_mcl_5.0_pcs.pandas
merged.dmnd sig1.0_mcl2.0_clusters.csv
merged.faa sig1.0_mcl2.0_contigs.csv
merged.self-diamond.tab sig1.0_mcl2.0_modsig1.0_modmcl5.0_minshared3_link_mod_cluster.csv
merged.self-diamond.tab.abc sig1.0_mcl5.0_minshared3_modules.csv
merged.self-diamond.tab.mci vConTACT_contigs.csv
merged.self-diamond.tab_mcl20.clusters vConTACT_pcs.csv
merged.self-diamond.tab_mcxload.tab vConTACT_profiles.csv
merged_df.csv vConTACT_proteins.csv
modules.ntwk viral_cluster_overview.csv
modules_mcl_5.0.clusters
Any help would be appreciated
Comments (4)
-
-
Thank you for the feedback, and thanks a bunch for figuring out what caused it. I thought an earlier update resolved all the “leaky contigs” from the pipeline, but it appears I was incorrect. I have identified the offending line but need to work out the patch details.
Thanks again!
-
- changed status to on hold
On hold until the fix is released. In the meantime, a workaround is to ensure that no contig name is a substring of another (for example, OTU-4832 is contained in OTU-48322).
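For anyone who wants to check their input before running, here is a minimal Python sketch of that workaround. It is not part of vConTACT2; the function name `find_nested_ids` is made up for illustration, and it assumes the problem is triggered whenever one contig ID occurs anywhere inside another.

```python
def find_nested_ids(contig_ids):
    """Return (shorter, longer) pairs where one contig ID is contained in another.

    Hypothetical helper, not part of vConTACT2. Assumes a plain substring
    match is the right test for the clash described in this issue.
    """
    # Deduplicate, then sort by length so the shorter ID always comes first.
    unique = sorted(set(contig_ids), key=lambda s: (len(s), s))
    clashes = []
    for i, short in enumerate(unique):
        for long in unique[i + 1:]:
            if short in long:
                clashes.append((short, long))
    return clashes


if __name__ == "__main__":
    # contig_id column from the example reported in this issue
    ids = ["OTU-48322", "OTU-48322", "OTU-4832", "OTU-4832"]
    for short, long in find_nested_ids(ids):
        print(f"{short} is contained in {long}")
```

The comparison is quadratic in the number of distinct IDs, which should be fine for a few hundred or a few thousand genomes.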
-
- changed status to resolved
Fix implemented in 0.9.9. Please update to this version (or later).
Please re-open if the problem persists in a recent version! Thanks!
-
Simon Roux helped me out and found the bug causing this issue. It turns out that vConTACT2 has a problem handling distinct contig identifiers when one identifier is contained within another. See below for an illustrative example:
My original input file looked something like this:
protein_id contig_id keywords
OTU-48322_1 OTU-48322 dummy
OTU-48322_2 OTU-48322 dummy
OTU-4832_1 OTU-4832 dummy
OTU-4832_2 OTU-4832 dummy
Changing to this solved the issue:
protein_id contig_id keywords
OTU-48322_1 OTU-48322 dummy
OTU-48322_2 OTU-48322 dummy
OTU-04832_1 OTU-04832 dummy
OTU-04832_2 OTU-04832 dummy
So having both OTU-4832 and OTU-48322 in the original input file was the problem: the first identifier is a prefix of the second.
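In case it helps others, the renaming can be scripted. This is only a rough sketch of the zero-padding I did by hand. The helper name `pad_numeric_suffix` is my own, and it assumes every contig ID is a shared prefix followed by digits (as in `OTU-4832`) and that each protein ID starts with its contig ID.

```python
import re


def pad_numeric_suffix(name, width):
    """Zero-pad the trailing run of digits in `name` to `width` characters.

    Hypothetical helper for illustration. Names without trailing digits
    are returned unchanged.
    """
    match = re.fullmatch(r"(.*?)(\d+)", name)
    if match is None:
        return name
    stem, digits = match.groups()
    return stem + digits.zfill(width)


if __name__ == "__main__":
    # (protein_id, contig_id) pairs from the example above
    rows = [
        ("OTU-48322_1", "OTU-48322"),
        ("OTU-48322_2", "OTU-48322"),
        ("OTU-4832_1", "OTU-4832"),
        ("OTU-4832_2", "OTU-4832"),
    ]
    # Pad to the longest numeric suffix found (assumes every ID ends in digits).
    width = max(len(re.search(r"\d+$", contig).group()) for _, contig in rows)
    for protein, contig in rows:
        new_contig = pad_numeric_suffix(contig, width)
        # Assumes the protein ID begins with the contig ID; keep the rest as-is.
        new_protein = new_contig + protein[len(contig):]
        print(new_protein, new_contig)
```

Once all IDs share one prefix and have the same number of digits, they are all the same length, so no ID can be contained in a different one.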