vConTACT2 0.9.13 questions

Issue #11 resolved
Anastasia created an issue

Dear vConTACT2 team,

Thank you for creating this powerful tool! I am currently learning how to use vConTACT2 0.9.13 and have encountered several issues; could you please look into them if possible? The test data provided with vConTACT2 were used in all cases described below.

I. When I am running vConTACT2 with default parameters, the run completes successfully:

vcontact \
    --raw-proteins ${vc_path}/test_data/VIRSorter_genomes.faa \
    --proteins-fp ${vc_path}/test_data/VIRSorter_genomes_g2g.csv \
    --output-dir 'vc_test' \
    --threads 10 \
    --c1-bin ${vc_path}/bin/cluster_one-1.0.jar

But the output file “genome_by_genome_overview.csv” looks a little bit unusual:

  • Non-empty values in columns “VC” and “VC Subcluster” are identical, except a “VC_” prefix is added to the values in the latter column.

  • Values in columns “Size” and “VC Subcluster Size” are identical.

  • While VC identifiers in the “VC” column consist of two numbers (e.g. 218_1), VC identifiers provided in the “VC Status” column contain only a single number (e.g. Overlap (VC_161/VC_218)).

Here is a fragment of the “genome_by_genome_overview.csv” table:

        VC                     VC.Status Size VC.Subcluster VC.Subcluster.Size
253  218_1           Clustered/Singleton    1      VC_218_1                  1
359  218_2           Clustered/Singleton    1      VC_218_2                  1
406  218_0                     Clustered    2      VC_218_0                  2
449  218_0                     Clustered    2      VC_218_0                  2
1923       Overlap (VC_60/VC_161/VC_218)   NA                               NA
1924             Overlap (VC_161/VC_218)   NA                               NA

Is it possible that instead of information about VCs, information about VC subclusters is reported in the “VC” and “Size” columns?

II. I have tried to modify clustering parameters as shown below:

vcontact \
    --raw-proteins ${vc_path}/test_data/VIRSorter_genomes.faa \
    --proteins-fp ${vc_path}/test_data/VIRSorter_genomes_g2g.csv \
    --output-dir 'vc_test' \
    --threads 10 \
    --pcs-mode 'MCL' \
    --pc-inflation 1.5 \
    --vcs-mode 'MCL' \
    --vc-inflation 1.5

But the run failed with an error message:

ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'numpy.float64' object cannot be interpreted as an integer

III. I have tried to use reference database “ProkaryoticViralRefSeq97-Merged”:

vcontact \
    --raw-proteins ${vc_path}/test_data/VIRSorter_genomes.faa \
    --proteins-fp ${vc_path}/test_data/VIRSorter_genomes_g2g.csv \
    --output-dir 'vc_test' \
    --db 'ProkaryoticViralRefSeq97-Merged' \
    --threads 10 \
    --c1-bin ${vc_path}/bin/cluster_one-1.0.jar

But the run failed with an error:

FileNotFoundError: [Errno 2] File b'<<PATH>>/conda_envs/vContact2/lib/python3.7/site-packages/vcontact/data/ViralRefSeq-prokaryotes-v97.Merged-reference.csv' does not exist

Kind regards,

Anastasia

Comments (3)

  1. Ben Bolduc

    Hi Anastasia,

    Regarding point 1.

    The genome_by_genome file aggregates results from several files, plus a few internal calculations. Rather than cleaning up the table, I’ve kept a few columns so I can spot any irregularities during processing. “VC” and “VC Subcluster” should be identical (apart from the “VC_” prefix, which you noted) - I’m actually merging a few tables using them as keys. This happens to carry “Size” and “VC Subcluster Size” along with it.
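    The relationship between the paired columns can be checked directly. The sketch below is only an illustration: the rows are copied from the table fragment in the issue, the header names are the ones quoted in the issue text (the real file may spell them differently), and `mismatched_rows` is a hypothetical helper, not part of vConTACT2.

```python
import csv
import io

# Rows copied from the fragment in the issue. Header names follow the issue
# text and are an assumption about the real CSV; NA cells are left empty.
SAMPLE = """VC,VC Status,Size,VC Subcluster,VC Subcluster Size
218_1,Clustered/Singleton,1,VC_218_1,1
218_2,Clustered/Singleton,1,VC_218_2,1
218_0,Clustered,2,VC_218_0,2
218_0,Clustered,2,VC_218_0,2
,Overlap (VC_60/VC_161/VC_218),,,
,Overlap (VC_161/VC_218),,,
"""


def mismatched_rows(handle):
    """Return rows where the VC/Size columns disagree with the subcluster columns."""
    bad = []
    for row in csv.DictReader(handle):
        vc = row["VC"]
        if not vc:  # overlap rows carry no VC identifier, so nothing to compare
            continue
        same_id = row["VC Subcluster"] == "VC_" + vc
        same_size = row["Size"] == row["VC Subcluster Size"]
        if not (same_id and same_size):
            bad.append(row)
    return bad


print(mismatched_rows(io.StringIO(SAMPLE)))
```

    An empty list means every row with a VC identifier has matching subcluster columns, which is what the merge described above would produce.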

    I could dig up the earlier issue somewhere on the tracker, but in short: “VC” has two numbers if Clustered. The 1st denotes the original ClusterONE cluster, the 2nd is the “subcluster.” This subcluster represents a 2nd pass across those original clusters, in which vConTACT2 refines them to more closely approximate ICTV’s genus-level groups. In vConTACT1, these would have been multi-genera VCs (those containing multiple genera). These aren’t good, so we decided to add a 2nd pass in vConTACT2.

    The “VC Status,” however, gives the type of membership a genome has. It can be Clustered (above), Singleton (effectively, no matches to anything else), Outlier (attached to a VC, but not statistically significant enough to confidently place with that VC), and Overlap (where a genome could not be definitively placed in one VC or another). These basically follow our understanding of viral sequence space: we can confidently place it taxonomically (clustered), we have little/no idea about it (singleton), we kinda know something about it but without enough confidence (outlier), or there are too many “overlapping” genes to definitively place it in a VC (overlap).
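    As a rough illustration of the two identifier styles, the sketch below splits a “VC” value into its cluster and subcluster numbers and pulls the cluster numbers out of an Overlap status. The helper names `split_vc` and `overlap_clusters` are made up for this example, and the string formats are assumed from the values shown in the issue.

```python
import re


def split_vc(vc_id):
    """Split an identifier such as '218_1' into (cluster, subcluster)."""
    cluster, subcluster = vc_id.split("_")
    return int(cluster), int(subcluster)


def overlap_clusters(status):
    """Return the cluster numbers named in an 'Overlap (...)' status, else []."""
    if not status.startswith("Overlap"):
        return []
    return [int(n) for n in re.findall(r"VC_(\d+)", status)]


print(split_vc("218_1"))
print(overlap_clusters("Overlap (VC_161/VC_218)"))
print(overlap_clusters("Clustered"))
```

    Note that the Overlap status names only the first-pass cluster numbers, which is why it shows a single number per VC.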

    Regarding point 2.

    This is a bug. --vcs-mode should almost always be ClusterONE, simply because it is superior to MCL in nearly all aspects. A few MCL argument combinations trigger this bug (hindsight is 20/20), and I keep squashing them. Eventually, MCL will be removed entirely from the VC selection, leaving only ClusterONE.
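    For context, the message in the traceback is Python's standard complaint when a floating-point value reaches code that requires an integer. The sketch below reproduces that class of error with a plain `float` standing in for `numpy.float64` so that it runs without NumPy. It is not vConTACT2's code and says nothing about where the bug actually sits; `make_range` is a hypothetical stand-in.

```python
def make_range(n):
    """Build a range of n items; range() accepts only integer arguments."""
    return range(n)


try:
    make_range(3.0)  # a float, even a whole-valued one, is rejected
except TypeError as exc:
    message = str(exc)

print(message)

# Converting explicitly to int is the usual remedy for this class of error.
print(len(make_range(int(3.0))))
```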

    Regarding point 3.

    This was a bug fixed in either 0.9.14 or 0.9.15. Updating should fix this problem. I’ll get around to updating the Conda version early next week.

    Cheers,

    Ben

  2. Anastasia reporter

    Dear Ben,

    Thank you so much for your detailed reply! Now I understand the structure of the genome_by_genome table; sorry for my initial misunderstanding!

    Kind regards,
    Anastasia
