The more sequences I input, the fewer of them cluster with refseq207

Issue #64 new
ray created an issue

Dear authors:

I tested vcontact2 using either part of my sequences or the whole set. I expected only small differences between the results, but they differed greatly: the more sequences I input, the fewer of my sequences cluster with refseq207.

Is there something wrong with my command?

time vcontact2 --raw-proteins cat_virome.faa \
  --rel-mode 'Diamond' \
  --proteins-fp cat_virome_map.csv \
  --db 'ProkaryoticViralRefSeq207-Merged' \
  --pcs-mode MCL --vcs-mode ClusterONE \
  -t 60 \
  --c1-bin /data/db/MAVERICLab-vcontact2-34ae9c466982/bin/cluster_one-1.0.jar \
  --output-dir est90_only.vContact2-refseq207

Thank you!

Comments (1)

  1. ray reporter

    For example, take a fixed subset of 700 sequences. With 1,000 input sequences, 30 sequences in this subset cluster with at least one refseq genome. With 10,000 input sequences, only 15 sequences in the same subset do. With 250,000 input sequences, only 2 do. What is more, the 2 are not all among the 15, and the 15 are not all among the 30!
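
The comparison described above could be checked with something like the following minimal sketch. It assumes each run has already been reduced to a mapping from genome ID to cluster ID. The function name `clustered_with_reference`, the IDs, and the toy data are all hypothetical and are not part of vcontact2 or its output format.

```python
from typing import Dict, Set


def clustered_with_reference(
    cluster_of: Dict[str, str],
    reference_ids: Set[str],
    subset: Set[str],
) -> Set[str]:
    """Return the members of `subset` that share a cluster with a reference genome.

    `cluster_of` maps genome ID -> cluster ID (hypothetical, pre-parsed input).
    Genomes with an empty or missing cluster ID count as unclustered.
    """
    # Clusters that contain at least one reference genome.
    clusters_with_ref = {
        cluster for genome, cluster in cluster_of.items()
        if genome in reference_ids and cluster
    }
    # Subset members whose cluster is one of those.
    return {g for g in subset if cluster_of.get(g) in clusters_with_ref}


# Toy data (made up): the same three sequences in a small and a large run.
refs = {"ref_A", "ref_B"}
subset = {"s1", "s2", "s3"}
run_small = {"ref_A": "VC_1", "ref_B": "VC_2", "s1": "VC_1", "s2": "VC_2", "s3": "VC_9"}
run_large = {"ref_A": "VC_1", "ref_B": "VC_5", "s1": "VC_3", "s2": "VC_7", "s3": "VC_5"}

hits_small = clustered_with_reference(run_small, refs, subset)
hits_large = clustered_with_reference(run_large, refs, subset)

print("shared by both runs:", sorted(hits_small & hits_large))
print("only in small run:  ", sorted(hits_small - hits_large))
print("only in large run:  ", sorted(hits_large - hits_small))
print("large run nested in small run:", hits_large <= hits_small)
```

If the set from the larger run is not a subset of the set from the smaller run, that matches the non-nested behaviour reported above.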
