A lot of data is discarded during clustering

Issue #8 resolved
ZHENG Xiaoxuan created an issue

Hello, thank you for creating this tool. When I used the software to cluster my data (about 13,000 overlapping groups) together with other data (from three databases, totaling about 7,000 overlapping groups), I found that more than 10,000 overlapping groups were discarded, most of them from the other three databases. My original goal was to find the differences between my data and other people's data, but now most of the other people's data has been dropped. Is this normal? How can I correct this?

Comments (7)

  1. Ben Bolduc

    Thanks for reporting this and thanks for using vConTACT2!

    I agree that the loss of so many sequences is concerning. In typical datasets, most of the input data should fall within a cluster. In situations where more data is “discarded”, it’s often in highly overlapping datasets, though these are identified as overlapping genomes, or in sparse datasets where most genomes have little/no similarity to other genomes.

    I have a few Qs to figure out what might be going on here.

    1. Are you using the latest version of vConTACT2, 0.9.10+? An earlier version of vConTACT2 contained a bug where certain sequence names could cause other sequences to be ignored in the final output.
    2. Are all (or most) of your sequences in the genome_to_genome_overview.csv file? If they are not in the file, then they’re being dropped and I need to figure out where.
    3. If your sequences are in genome_to_genome_overview, are they classified as singleton, overlap, or outlier? If they’re overlap or outlier, then they’re technically classified into a cluster, but the genomes either share too much or too little w/ surrounding genomes to be confidently placed (at the genus level) into a single VC. If they’re a singleton, they weren’t identified by ClusterONE as being in any sort of cluster.
    4. Are your sequences in the c1.ntw and c1.clusters output files? If they aren’t, then there’s likely an issue with sequence naming, or a 3rd party tool had an issue with the files. If they are in those output files, then somewhere vConTACT2 either dropped them or classified them as not clustered. If, as I mentioned above, vConTACT2 dropped them, then that’s an issue I need to fix.

    I’m basically trying to figure out whether your data were dropped or simply weren’t clustered (and were classified as something else). One of the most common reasons for few/no genomes being clustered is the formatting of the input files. If the ~2K reference genomes are clustered and present in the network file, whereas all your 13K genomes aren’t in the network, that’s usually a file parsing problem. If your sequences are in the network but not in the final output files, then vConTACT2 might have goofed up somewhere.
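A rough way to answer questions 2 and 4 is to check which of your genome names appear anywhere in each output file. The sketch below is an illustration only, not part of vConTACT2: the function names and the example genome names are hypothetical, and it makes no assumption about column layout. It simply splits each file on commas and whitespace, so names that themselves contain spaces or commas will not match.

```python
import re
from pathlib import Path


def names_in_text(text):
    """Collect every comma- or whitespace-separated token found in the text."""
    return {tok.strip('"') for tok in re.split(r"[,\s]+", text) if tok}


def presence_report(expected, files):
    """Map each file label to the sorted list of expected names missing from it.

    `files` maps a label (for example a file name) to that file's full text.
    """
    report = {}
    for label, text in files.items():
        found = names_in_text(text)
        report[label] = sorted(set(expected) - found)
    return report


def load_if_exists(path):
    """Return the file's text, or an empty string if the file is absent."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    # Hypothetical genome names; replace with the IDs from your own input.
    expected = ["my_contig_1", "my_contig_2"]
    outputs = ("genome_to_genome_overview.csv", "c1.ntw", "c1.clusters")
    files = {name: load_if_exists(name) for name in outputs}
    for label, missing in presence_report(expected, files).items():
        print(f"{label}: {len(missing)} of {len(expected)} names missing")
```

Names missing from c1.ntw as well as from the overview file point toward an input or parsing problem; names present in c1.ntw but missing from the overview file point toward something going wrong later in the run.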

    -Ben

  2. ZHENG Xiaoxuan reporter

    Hi Ben! Thank you very much for your kind help. My answers to your questions are as follows:

    1. The version of the software I use is vcontact2 0.9.10.
    2. Q: Are all (or most) of your sequences in the genome_to_genome_overview.csv file? A: No.
    3. Q: If your sequences are in genome_to_genome_overview, are they classified as singleton, overlap, or outlier? A: Yes.
    4. Q: Are your sequences in the c1.ntw and c1.clusters output files? A: No. vContact2 dropped many singletons before.

    Thank you again and looking forward to your reply.

  3. Ben Bolduc

    So it appears that - as you said - most of your sequences (~75%) aren’t making it anywhere into the analysis. Since most of your sequences aren’t even in the c1.ntw file, could you send me a few lines of your proteins.csv file and any log file you have for the run? (bolduc.10 at osu.edu)

    Considering your data is dropped prior to c1.ntw, it’s either an issue with parsing or they’re dropped during the initial analysis of the PCs.
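Before sending files, a quick sanity check of the gene-to-genome mapping file can rule out the simplest parsing problems. The sketch below is not vConTACT2 code; the function name is hypothetical, and the expected header (`protein_id,contig_id,keywords`) is an assumption that should be checked against the vConTACT2 documentation for your version.

```python
import csv
import io
from collections import Counter

# Assumed header of the gene-to-genome mapping file; verify against the docs.
EXPECTED_HEADER = ["protein_id", "contig_id", "keywords"]


def check_gene_to_genome(text):
    """Return (problems, proteins_per_contig) for the mapping file's text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    problems = []
    if [h.strip() for h in header] != EXPECTED_HEADER:
        problems.append(f"unexpected header: {header}")

    proteins_per_contig = Counter()
    for line_no, row in enumerate(reader, start=2):
        if len(row) < 2:
            problems.append(f"line {line_no}: fewer than 2 columns")
            continue
        protein_id, contig_id = row[0], row[1]
        # Stray spaces around IDs are a common cause of names failing to match.
        if protein_id != protein_id.strip() or contig_id != contig_id.strip():
            problems.append(f"line {line_no}: leading/trailing whitespace in an ID")
        proteins_per_contig[contig_id.strip()] += 1
    return problems, proteins_per_contig
```

Comparing the keys of `proteins_per_contig` with the genome names you expected shows whether any genomes are already absent from the input, before any clustering has taken place.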

    Thanks again for submitting this - I’m always trying to push the limits of the tool and finding these bugs.

  4. ZHENG Xiaoxuan reporter

    Hi Ben, I'm glad to receive your reply! Sorry for replying so late. I have attached an example of my protein.csv file and the error log to the email.

    Best wishes.

    Zheng

     
