Unexpected Differences in iPHoP Results: Disappearing GTDB Hits in Custom Databases
Thank you very much for your software! I encountered some problems when using different custom databases on the same set of viruses.
According to my understanding, the "add_to_db" command helps users create their own custom database, which includes publicly available bacterial and archaeal sequences from GTDB plus sequences provided by the user. I built two such custom databases (database1 and database2) and used both to predict the hosts for the same set of viruses. I expected differences in the results because of the different MAGs I provided, but I noticed that some differences occurred in the GTDB sequences themselves.
For example, for virus seq ‘D01_NODE_107_length_70881_cov_12.285220__full’, the ‘Host_prediction_to_genome_m90’ file for database1 has these hits:
Virus,Host genome,Host taxonomy,Main method,Confidence score,Additional methods
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_900474185.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185,iPHoP-RF,95.7,None
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003209165.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003209165,iPHoP-RF,95.4,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014279755.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014279755,iPHoP-RF,95.4,None
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_013911515.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp013911515,iPHoP-RF,94.1,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000737595.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737595,iPHoP-RF,94.1,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000161795.2,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000161795,iPHoP-RF,92.4,None
D01_NODE_107_length_70881_cov_12.285220__full,D71_bin.2,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,iPHoP-RF,91.4,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014279895.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002724845,iPHoP-RF,90.1,None
But all of those hits disappeared in the database2 result.
I also tried the standard db ‘Sept_2021_pub_rw’; the results for ‘D01_NODE_107_length_70881_cov_12.285220__full’ are shown below:
Virus,Host genome,Host taxonomy,Main method,Confidence score,Additional methods
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000161795.2,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000161795,iPHoP-RF,95.4,None
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003209165.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003209165,iPHoP-RF,94.1,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_900474185.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185,iPHoP-RF,93.1,None
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003211535.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003211535,iPHoP-RF,92.1,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014279815.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_C;s__Synechococcus_C sp014279815,iPHoP-RF,92.1,None
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_013911515.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp013911515,iPHoP-RF,91.4,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014279755.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014279755,iPHoP-RF,90.8,None
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000737595.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737595,iPHoP-RF,90.5,None
To check whether the host sequences were included in the custom db, I picked one host genome, ‘RS_GCF_900474185.1’, and looked it up in the database information of database2 using the command:
(iphop) ruixian@lucky-PowerEdge-R740:~$ grep "RS_GCF_900474185.1" "/media/backup2/iphop_database_dec2023/db_infos/Host_Genomes.tsv"
GCF_900474185.1 GTDB GTDB_repr Synechococcus sp. UW69 RS_GCF_900474185.1 d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185
These cross-database discrepancies in GTDB hits were common and not limited to cases where database1 had hits that database2 lacked; differences appeared among all three databases. I am wondering why this happens. And given these differences, can I merge the results from two databases? Thank you very much!
Best,
Ruixian
Comments (4)
-
repo owner -
reporter Thank you very much for your answer!!!
As I mentioned in the question, the ‘Host_prediction_to_genome_m90.csv’ files from the standard database and custom database1 give similar results for virus contig ‘D01_NODE_107_length_70881_cov_12.285220__full’, while custom database2 gives no result.
I think I understand what you mean! So I went back to check the ‘Wdir/All_combined_scores.csv’ of custom database2. The hits for this virus contig are shown below:
Virus,Repr host,Repr host taxonomy,Repr host genus,method,score,FDR
D01_NODE_107_length_70881_cov_12.285220__full,XB_MAG929,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.955999997,0.102
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_013911515.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp013911515,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.945999995,0.118
D01_NODE_107_length_70881_cov_12.285220__full,XB_MAG351,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.943999995,0.121
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000012625.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000012625,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.934,0.137
D01_NODE_107_length_70881_cov_12.285220__full,XB_MAG472,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.932000004,0.14
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014279755.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014279755,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.926000014,0.149
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_900474185.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp900474185,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.926000014,0.149
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003209165.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003209165,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.912000038,0.169
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014279895.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002724845,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.904000051,0.181
D01_NODE_107_length_70881_cov_12.285220__full,PRE_2021_D71_bin.2,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.898000062,0.189
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000161795.2,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000161795,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.898000062,0.189
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000737575.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737575,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.884000085,0.207
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_002170825.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002170825,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.874000102,0.22
D01_NODE_107_length_70881_cov_12.285220__full,XS_MAG272,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.870000094,0.225
D01_NODE_107_length_70881_cov_12.285220__full,DS_MAG135,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.866000086,0.229
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003210735.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003210735,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.866000086,0.229
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003210795.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003210795,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.866000086,0.229
D01_NODE_107_length_70881_cov_12.285220__full,PRES17F07_bin.180,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.856000066,0.241
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014280175.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp004212765,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.854000062,0.243
D01_NODE_107_length_70881_cov_12.285220__full,XS_MAG292,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.850000054,0.248
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000195975.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000195975,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.844000041,0.254
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_014280195.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp014280195,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.836000025,0.263
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000737595.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737595,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.830000013,0.27
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_002691345.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp002691345,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.80399996,0.298
D01_NODE_107_length_70881_cov_12.285220__full,DS_MAG273,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.801999956,0.3
D01_NODE_107_length_70881_cov_12.285220__full,GB_GCA_003211535.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp003211535,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.733999819,0.383
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000515235.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000515235,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.733999819,0.383
D01_NODE_107_length_70881_cov_12.285220__full,RS_GCF_000737535.1,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E sp000737535,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;f__Cyanobiaceae;g__Synechococcus_E,iPHoP-RF,0.725999802,0.394
According to my understanding, the ‘Confidence score’ presented in ‘Host_prediction_to_genome_m90.csv’ is calculated from the 'score' and 'FDR' shown in ‘Wdir/All_combined_scores.csv’ (as described in your article in the ‘Integrating iPHoP classifiers and RaFAH into a final host prediction’ section). Therefore, the virus contig ‘D01_NODE_107_length_70881_cov_12.285220__full’ does in fact have hits in the results from custom database2, but they are filtered out and do not appear in the final output. (Am I right?)
Again, thank you very much for your answer. I will set a lower cutoff (maybe --min_score 85) and give it a try!
-
repo owner Right, that’s correct: the scores are there but fall slightly below 90 (an FDR of 0.121, for instance, should translate into a score of 87.9). This can happen if, for instance, you have some contamination in your MAGs: if a MAG includes core genes from one taxon but accessory genes (including viral genes) from another, it will produce a conflicting signal and a much lower score for all hits for any virus with hits to this contaminated MAG.
Best,
Simon
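The filtering discussed above can be illustrated with a minimal sketch. It assumes the confidence score is simply 100 × (1 − FDR), which is inferred from the single example given in this thread (FDR 0.121 → 87.9) rather than taken from the iPHoP source code. The helper names `fdr_to_confidence` and `passing_hits` are hypothetical, and the FDR values are the top database2 rows from the ‘Wdir/All_combined_scores.csv’ excerpt earlier in the thread.

```python
# Assumption: confidence score = 100 * (1 - FDR), inferred from the example
# in this thread (FDR 0.121 -> score 87.9); not taken from the iPHoP source.

def fdr_to_confidence(fdr: float) -> float:
    """Convert an FDR value to a 0-100 confidence score, rounded to 1 decimal."""
    return round(100.0 * (1.0 - fdr), 1)


def passing_hits(hits, min_score=90.0):
    """Return (host, score) pairs whose converted score reaches min_score."""
    scored = [(host, fdr_to_confidence(fdr)) for host, fdr in hits]
    return [(host, score) for host, score in scored if score >= min_score]


# Top database2 rows for D01_NODE_107 as (host genome, FDR),
# copied from the All_combined_scores.csv excerpt in this thread.
db2_hits = [
    ("XB_MAG929", 0.102),
    ("GB_GCA_013911515.1", 0.118),
    ("XB_MAG351", 0.121),
    ("RS_GCF_000012625.1", 0.137),
    ("XB_MAG472", 0.14),
    ("RS_GCF_014279755.1", 0.149),
    ("RS_GCF_900474185.1", 0.149),
    ("GB_GCA_003209165.1", 0.169),
]

# With the default cutoff of 90, even the best hit (FDR 0.102) is dropped.
print(passing_hits(db2_hits, min_score=90.0))

# With a cutoff of 85, the hits with FDR <= 0.15 are kept.
for host, score in passing_hits(db2_hits, min_score=85.0):
    print(host, score)
```

Under this assumption, the best database2 hit sits just under the default cutoff, which is consistent with the virus having no rows in the final output while still having hits in the intermediate file.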
-
repo owner - changed status to closed
Hi,
I’m not sure I see the discrepancies here. To help clarify what we are looking at in these “Host_prediction_to_genome_m90” files: these are not “hits”, they are confidence scores for each candidate phage-host pair that take into account all hits and multiple methods. That is, the score for phage D01_NODE_107_length_70881_cov_12.285220__full and host genome RS_GCF_000161795.2 depends not just on this host genome, but on all other host genomes present in the database. Since this context of additional host genomes differs between the default database (which includes GTDB + public genomes and bins from IMG/GEM/MGnify) and your custom database (which includes GTDB + your own bins), it is expected that the scores from these two runs will be slightly different (in this case, 95.4 vs 92.4).
One more important thing to keep in mind is that iPHoP was optimized to predict the genus of the host, not the exact species. So in that case, both runs (using either database) give you the same answer: the most likely host genus for D01_NODE_107_length_70881_cov_12.285220__full, according to iPHoP, is g__Synechococcus_E. In terms of how to merge the results from different runs with different databases, we typically recommend taking the prediction with the highest score across runs for each input phage (although in that case, this would be g__Synechococcus_E anyway).
Hope it helps!
Best,
Simon
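The merging recommendation above (keep, for each input phage, the prediction with the highest score across runs) could be sketched as follows. This is only an illustration: `best_per_virus` and `genus_of` are hypothetical helpers, and the column names are assumed to match the headers shown in this thread. The two in-memory files each hold one row taken from the database1 and standard-database results quoted in the question; in practice you would open the real ‘Host_prediction_to_genome_m90.csv’ files instead.

```python
import csv
import io


def genus_of(taxonomy: str) -> str:
    """Return the g__ rank from a GTDB-style taxonomy string ('' if absent)."""
    for rank in taxonomy.split(";"):
        if rank.startswith("g__"):
            return rank
    return ""


def best_per_virus(*runs):
    """Merge runs, keeping for each virus the row with the highest score."""
    best = {}
    for run in runs:
        for row in csv.DictReader(run):
            score = float(row["Confidence score"])
            current = best.get(row["Virus"])
            if current is None or score > float(current["Confidence score"]):
                best[row["Virus"]] = row
    return best


# Column names assumed from the tables shown in this thread.
HEADER = ("Virus,Host genome,Host taxonomy,Main method,"
          "Confidence score,Additional methods\n")
VIRUS = "D01_NODE_107_length_70881_cov_12.285220__full"
TAX = ("d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__PCC-6307;"
       "f__Cyanobiaceae;g__Synechococcus_E;s__Synechococcus_E ")

# One row per run, copied from the results quoted in the question.
run_db1 = io.StringIO(
    HEADER + f"{VIRUS},RS_GCF_900474185.1,{TAX}sp900474185,iPHoP-RF,95.7,None\n")
run_std = io.StringIO(
    HEADER + f"{VIRUS},RS_GCF_000161795.2,{TAX}sp000161795,iPHoP-RF,95.4,None\n")

merged = best_per_virus(run_db1, run_std)
for virus, row in merged.items():
    print(virus, row["Host genome"], genus_of(row["Host taxonomy"]),
          row["Confidence score"])
```

Since iPHoP is optimized for genus-level prediction, comparing `genus_of(...)` across runs is probably more informative than comparing individual host genomes; here both runs point to g__Synechococcus_E.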