Missing genomes in output

Issue #6 resolved
Stephen Nayfach created an issue

I've run the latest version of the tool with the test data and it completed without errors, but some genomes are missing from the output.

There are 575 user-supplied genomes in the test file:

$ cut -f2 MAVERICLab-vcontact2-a3541dd53c3e/test_data/proteins.csv -d ',' | sed 1d | sort -u | wc -l
575

But only 246 are found in the viral_cluster_overview file:

$ grep -o 'VIR' viral_cluster_overview.csv | wc -l
246

And the genome_by_genome file is not present:

$ ls
c1.clusters modules_mcl_5.0_modules.pandas
c1.ntw modules_mcl_5.0_pcs.pandas
merged.dmnd sig1.0_mcl2.0_clusters.csv
merged.faa sig1.0_mcl2.0_contigs.csv
merged.self-diamond.tab sig1.0_mcl2.0_modsig1.0_modmcl5.0_minshared3_link_mod_cluster.csv
merged.self-diamond.tab.abc sig1.0_mcl5.0_minshared3_modules.csv
merged.self-diamond.tab.mci vConTACT_contigs.csv
merged.self-diamond.tab_mcl20.clusters vConTACT_pcs.csv
merged.self-diamond.tab_mcxload.tab vConTACT_profiles.csv
merged_df.csv vConTACT_proteins.csv
modules.ntwk viral_cluster_overview.csv
modules_mcl_5.0.clusters

Any help would be appreciated

Comments (4)

  1. Stephen Nayfach

    Simon Roux helped me out and found the bug causing this issue. It turns out that vConTACT2 has a problem handling distinct contig identifiers when one identifier is contained within another (for example, OTU-4832 within OTU-48322). Here is an illustrative example:

    My original input file looked something like this:

    protein_id contig_id keywords
    OTU-48322_1 OTU-48322 dummy
    OTU-48322_2 OTU-48322 dummy
    OTU-4832_1 OTU-4832 dummy
    OTU-4832_2 OTU-4832 dummy

    Changing to this solved the issue:

    protein_id contig_id keywords
    OTU-48322_1 OTU-48322 dummy
    OTU-48322_2 OTU-48322 dummy
    OTU-04832_1 OTU-04832 dummy
    OTU-04832_2 OTU-04832 dummy

    So in the original input file having both OTU-4832 and OTU-48322 was problematic for some reason.
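
    The thread does not show the offending line, so the following Python sketch is only an illustration of how this kind of collision can arise. It assumes, hypothetically, that proteins are assigned to a contig by substring matching on the identifier; `substring_lookup` and `exact_lookup` are made-up helper names and are not part of vConTACT2.

```python
# Illustrative only: NOT the vConTACT2 code. It assumes (hypothetically) that
# proteins were matched to contigs by substring rather than by exact equality.

# (protein_id, contig_id) rows taken from the example input above.
rows = [
    ("OTU-48322_1", "OTU-48322"),
    ("OTU-48322_2", "OTU-48322"),
    ("OTU-4832_1", "OTU-4832"),
    ("OTU-4832_2", "OTU-4832"),
]


def substring_lookup(contig, rows):
    """Fragile: returns every protein whose ID contains the contig ID."""
    return [protein for protein, _ in rows if contig in protein]


def exact_lookup(contig, rows):
    """Robust: returns only proteins whose contig column equals the contig ID."""
    return [protein for protein, owner in rows if owner == contig]


# "OTU-4832" is a prefix of "OTU-48322", so the substring match also picks up
# the proteins that belong to the longer identifier.
print(substring_lookup("OTU-4832", rows))
print(exact_lookup("OTU-4832", rows))
```

    Under this assumption, the substring match returns all four proteins for OTU-4832, while the exact match returns only its own two. Renaming OTU-4832 to OTU-04832 removes the overlap, which is consistent with the workaround that fixed the input.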

  2. Ben Bolduc

    Thank you for the feedback, and thanks a bunch for figuring out what caused it. I thought an earlier update resolved all the “leaky contigs” from the pipeline, but it appears I was incorrect. I have identified the offending line but need to work out the patch details.

    Thanks again!

  3. Ben Bolduc

    On hold until fix released. In the meantime, a workaround is to ensure contig names are not subsets of each other.
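
    For anyone applying this workaround before upgrading, a small check along these lines may help spot problem names. It is a minimal sketch, assuming a comma-separated proteins file with a `contig_id` column as in the test data; `find_nested_ids` and `read_contig_ids` are hypothetical helper names, not vConTACT2 functions.

```python
import csv
import io


def find_nested_ids(contig_ids):
    """Return (shorter, longer) pairs where one contig ID is contained in another."""
    # Sort by length so each ID is only compared against IDs at least as long.
    ids = sorted(set(contig_ids), key=lambda s: (len(s), s))
    pairs = []
    for i, short in enumerate(ids):
        for longer in ids[i + 1:]:
            if short in longer:
                pairs.append((short, longer))
    return pairs


def read_contig_ids(handle, column="contig_id"):
    """Read the contig column from an open CSV handle (column name is an assumption)."""
    return [row[column] for row in csv.DictReader(handle)]


# Small in-memory stand-in for a proteins.csv file.
sample = io.StringIO(
    "protein_id,contig_id,keywords\n"
    "OTU-48322_1,OTU-48322,dummy\n"
    "OTU-48322_2,OTU-48322,dummy\n"
    "OTU-4832_1,OTU-4832,dummy\n"
    "OTU-4832_2,OTU-4832,dummy\n"
)

for short, longer in find_nested_ids(read_contig_ids(sample)):
    print(f"{short!r} is contained in {longer!r}")
```

    To run it on a real file, replace the in-memory sample with `open("proteins.csv", newline="")`. The pairwise comparison is quadratic in the number of distinct contigs, which should be fine for a few hundred genomes but may be slow for very large inputs.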

  4. Ben Bolduc

    Fix implemented in 0.9.9. Please update to this version (or later).

    Please re-open if the problem persists in a recent version! Thanks!
