Unexpected Differences in iPHoP Results: Disappearing GTDB Hits in Custom Databases

Issue #82 closed
ruixian sun created an issue

Thank you very much for your software! I encountered some problems when using different custom databases on the same set of viruses.

As I understand it, the "add_to_db" command lets users build a custom database that combines the publicly available bacterial and archaeal sequences from GTDB with sequences provided by the user. I built two such custom databases and used both to predict hosts for the same set of viruses. I understood that there would be differences in the results because I provided different MAGs, but I noticed that some differences occurred for the GTDB sequences themselves.

For example, for virus seq ‘D01_NODE_107_length_70881_cov_12.285220__full’, the ‘Host_prediction_to_genome_m90’ output for database1 has these hits:

Virus   Host genome Host taxonomy   Main method Confidence score    Additional methods
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_900474185.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185    iPHoP-RF    95.7    None
D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003209165.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003209165    iPHoP-RF    95.4    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014279755.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014279755    iPHoP-RF    95.4    None
D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_013911515.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp013911515    iPHoP-RF    94.1    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000737595.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737595    iPHoP-RF    94.1    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000161795.2  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000161795    iPHoP-RF    92.4    None
D01_NODE_107_length_70881_cov_12.285220__full   D71_bin.2   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   iPHoP-RF    91.4    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014279895.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002724845    iPHoP-RF    90.1    None

But all of those hits disappeared in the database2 result.

I also tried the standard db ‘Sept_2021_pub_rw’; the results for ‘D01_NODE_107_length_70881_cov_12.285220__full’ are shown below:

Virus   Host genome Host taxonomy   Main method Confidence score    Additional methods
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000161795.2  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000161795    iPHoP-RF    95.4    None
D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003209165.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003209165    iPHoP-RF    94.1    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_900474185.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185    iPHoP-RF    93.1    None
D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003211535.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003211535    iPHoP-RF    92.1    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014279815.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_C;s__Synechococcus_C sp014279815    iPHoP-RF    92.1    None
D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_013911515.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp013911515    iPHoP-RF    91.4    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014279755.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014279755    iPHoP-RF    90.8    None
D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000737595.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737595    iPHoP-RF    90.5    None

To check whether the host sequence was included in the custom db, I picked one host sequence, ‘RS_GCF_900474185.1’, and looked it up in the database information of database2 using the command:

(iphop) ruixian@lucky-PowerEdge-R740:~$ grep "RS_GCF_900474185.1" "/media/backup2/iphop_database_dec2023/db_infos/Host_Genomes.tsv"
GCF_900474185.1 GTDB    GTDB_repr       Synechococcus sp. UW69  RS_GCF_900474185.1      d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185

These cross-database discrepancies in GTDB hits were common and not limited to cases where database1 had hits that database2 lacked; rather, differences appeared among all three databases. I am wondering why this happens. And given these differences, could I merge the results from the two databases? Thank you very much!

Best,

Ruixian

Comments (4)

  1. Simon Roux repo owner

    Hi,

    I’m not sure I see the discrepancies here. To help clarify what we are looking at in these “Host_prediction_to_genome_m90” files: these are not “hits”, they are confidence scores for each candidate phage-host pair that take into account all hits and multiple methods, i.e. the score for phage D01_NODE_107_length_70881_cov_12.285220__full and host genome RS_GCF_000161795.2 depends not just on this host genome, but on all other host genomes present in the database. Since this context of additional host genomes is different between the default database (which includes GTDB + public genomes and bins from IMG/GEM/MGnify) and your custom database (which includes GTDB + your own bins), it is expected that the scores between these two runs will be slightly different (in this case, 95.4 vs 92.4).

    One more important thing to keep in mind is that iPHoP was optimized to predict the genus of the host, not the exact species. So in that case, both runs (using either database) give you the same answer: the most likely host genus for D01_NODE_107_length_70881_cov_12.285220__full, according to iPHoP, is g__Synechococcus_E. In terms of how to merge the results from different runs with different databases, we typically recommend taking the prediction with the highest score across runs for each input phage (although in that case, this would be g__Synechococcus_E anyway).

    Hope it helps !

    Best,

    Simon
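
The merging advice above (keep, for each input phage, the prediction with the highest score across runs) could be sketched roughly as follows. This is only an illustration and not part of iPHoP: the helper names `genus_of` and `best_per_virus` are hypothetical, the rows are typed in by hand from the tables in this thread rather than read from the output files, and the column names are assumed to match the table headers shown above.

```python
from typing import Dict, Iterable

Row = Dict[str, str]

VIRUS = "D01_NODE_107_length_70881_cov_12.285220__full"
LINEAGE = ("d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;"
           "o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;")


def genus_of(taxonomy: str) -> str:
    """Return the g__ rank of a GTDB-style taxonomy string ('' if absent)."""
    for rank in taxonomy.split(";"):
        if rank.startswith("g__"):
            return rank
    return ""


def best_per_virus(runs: Dict[str, Iterable[Row]]) -> Dict[str, Row]:
    """For each virus, keep the single row with the highest confidence
    score across all runs, and record which run it came from."""
    best: Dict[str, Row] = {}
    for run_name, rows in runs.items():
        for row in rows:
            score = float(row["Confidence score"])
            current = best.get(row["Virus"])
            if current is None or score > float(current["Confidence score"]):
                best[row["Virus"]] = {**row, "Run": run_name}
    return best


# Top row of each run, copied from the tables in this thread.
runs = {
    "database1": [{
        "Virus": VIRUS,
        "Host genome": "RS_GCF_900474185.1",
        "Host taxonomy": LINEAGE + "s__Synechococcus_E sp900474185",
        "Confidence score": "95.7",
    }],
    "Sept_2021_pub_rw": [{
        "Virus": VIRUS,
        "Host genome": "RS_GCF_000161795.2",
        "Host taxonomy": LINEAGE + "s__Synechococcus_E sp000161795",
        "Confidence score": "95.4",
    }],
}

merged = best_per_virus(runs)
top = merged[VIRUS]
print(top["Run"], top["Host genome"], genus_of(top["Host taxonomy"]))
```

Since iPHoP is optimized at the genus level, comparing `genus_of(...)` across runs is probably more meaningful than comparing the individual host genomes.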

  2. ruixian sun reporter

    Thank you very much for your answer!!!

    As I mentioned in the question, the ‘Host_prediction_to_genome_m90.csv’ files from the standard database and custom database1 give me similar results for virus contig D01_NODE_107_length_70881_cov_12.285220__full, while custom database2 gives me no result.

    I think I understand what you mean! So I went back to the ‘Wdir/All_combined_scores.csv’ of custom db2 to check. The hits for this virus contig are shown below:

    Virus   Repr host   Repr host taxonomy  Repr host genus method  score   FDR
    D01_NODE_107_length_70881_cov_12.285220__full   XB_MAG929   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.955999997 0.102
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_013911515.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp013911515    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.945999995 0.118
    D01_NODE_107_length_70881_cov_12.285220__full   XB_MAG351   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.943999995 0.121
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000012625.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000012625    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.934   0.137
    D01_NODE_107_length_70881_cov_12.285220__full   XB_MAG472   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.932000004 0.14
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014279755.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014279755    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.926000014 0.149
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_900474185.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.926000014 0.149
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003209165.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003209165    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.912000038 0.169
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014279895.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002724845    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.904000051 0.181
    D01_NODE_107_length_70881_cov_12.285220__full   PRE_2021_D71_bin.2  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.898000062 0.189
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000161795.2  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000161795    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.898000062 0.189
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000737575.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737575    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.884000085 0.207
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_002170825.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002170825    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.874000102 0.22
    D01_NODE_107_length_70881_cov_12.285220__full   XS_MAG272   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.870000094 0.225
    D01_NODE_107_length_70881_cov_12.285220__full   DS_MAG135   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.866000086 0.229
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003210735.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003210735    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.866000086 0.229
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003210795.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003210795    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.866000086 0.229
    D01_NODE_107_length_70881_cov_12.285220__full   PRES17F07_bin.180   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.856000066 0.241
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014280175.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp004212765    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.854000062 0.243
    D01_NODE_107_length_70881_cov_12.285220__full   XS_MAG292   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.850000054 0.248
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000195975.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000195975    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.844000041 0.254
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_014280195.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014280195    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.836000025 0.263
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000737595.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737595    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.830000013 0.27
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_002691345.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002691345    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.80399996  0.298
    D01_NODE_107_length_70881_cov_12.285220__full   DS_MAG273   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__   d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.801999956 0.3
    D01_NODE_107_length_70881_cov_12.285220__full   GB_GCA_003211535.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003211535    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.733999819 0.383
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000515235.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000515235    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.733999819 0.383
    D01_NODE_107_length_70881_cov_12.285220__full   RS_GCF_000737535.1  d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737535    d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E   iPHoP-RF    0.725999802 0.394
    

    As I understand it, the ‘Confidence score’ presented in ‘Host_prediction_to_genome_m90.csv’ is calculated from the ‘score’ and ‘FDR’ shown in ‘Wdir/All_combined_scores.csv’ (as described in your article in the ‘Integrating iPHoP classifiers and RaFAH into a final host prediction’ section). Therefore, the virus contig D01_NODE_107_length_70881_cov_12.285220__full does in fact have hits in the results from custom database2, but they are filtered out and do not appear in the final output. (Am I right?)

    Again, thank you very much for your answer. I will set up a lower cutoff (maybe --min_score 85) and have a try!

  3. Simon Roux repo owner

    Right that’s correct, it seems like these scores are here but slightly below 90 (0.121 FDR for instance should translate into a score of 87.9). This can happen for instance if you have some contamination in your MAGs, i.e. if a MAG includes core genes from one taxon, but accessory genes (including viral genes) from another, then it will lead to conflicting signal and a much lower score for all hits for any virus with hits to this contaminated MAG.

    Best,

    Simon
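
The one worked example in the reply above (an FDR of 0.121 translating into a score of 87.9) is consistent with a conversion of the form 100 × (1 − FDR). The sketch below assumes that relationship; it is inferred from that single example and not taken from the iPHoP source code. The function names are hypothetical, and the (host, FDR) pairs are copied by hand from the ‘All_combined_scores.csv’ excerpt in this thread.

```python
from typing import List, Tuple


def fdr_to_confidence(fdr: float) -> float:
    """Assumed conversion: confidence = 100 * (1 - FDR), rounded to 0.1."""
    return round(100.0 * (1.0 - fdr), 1)


def passing(rows: List[Tuple[str, float]],
            min_score: float = 90.0) -> List[Tuple[str, float]]:
    """Keep (host, confidence) pairs whose confidence reaches min_score."""
    kept = []
    for host, fdr in rows:
        confidence = fdr_to_confidence(fdr)
        if confidence >= min_score:
            kept.append((host, confidence))
    return kept


# A few (Repr host, FDR) pairs from the custom database2 excerpt.
rows = [
    ("XB_MAG929", 0.102),
    ("GB_GCA_013911515.1", 0.118),
    ("XB_MAG351", 0.121),
    ("RS_GCF_000012625.1", 0.137),
    ("RS_GCF_000737535.1", 0.394),
]

print(passing(rows, min_score=90.0))  # nothing reaches the default cutoff
print(passing(rows, min_score=85.0))  # the lower cutoff mentioned above
```

Under this assumption even the best hit for this contig (FDR 0.102) lands just under 90, which would explain why it is missing from the m90 output, while a cutoff of 85 would retain the strongest candidates.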
