Question about add_to_db

Issue #59 closed
Alyzza created an issue

Hi Simon,

I just created a custom database using the add_to_db function, and I also deleted some MAGs from db_infos/Host_Genomes.tsv of the custom database (this is based on your advice to my colleague a few months ago to remove the MAGs that we would like iPHoP to ignore during host prediction).

I understand that the new database should “include GTDB genomes and the additional MAGs provided by the user, but not the GEM or IMG genomes.” So I was wondering why my custom database still contains GEM and IMG MAGs (although not as many). I also removed only 114 MAGs, so I wanted to ask how exactly add_to_db works, to help me figure out whether I made a mistake when creating the custom database. Below is how the custom database roughly compares to the original iPHoP database.

Best,

Alyzza

$ cut -f2 $WORK/databases/iphop_db/Sept_2021_pub_rw/db_infos/Host_Genomes.tsv | sort | uniq -c | sort -nr
  52515 GEM
  47894 GTDB
  21372 IMG
      1 Source

$ cut -f2 $WORK/databases/iphop_db/new_db_Sept_2021_pub_rw/db_infos/Host_Genomes.tsv | sort | uniq -c | sort -nr
  42692 GTDB
  24347 GEM
  18358 IMG
    614 Additional
      1 Source
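
To put the two listings side by side, here is a minimal Python sketch that tallies the second (Source) column of a Host_Genomes.tsv file and reports the per-source change. The function names are hypothetical and not part of iPHoP; the numbers are the ones printed above.

```python
from collections import Counter


def count_sources(lines):
    """Count values of the second tab-separated column, skipping the header row."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    next(rows, None)  # drop the header line (its second field is "Source")
    return Counter(row[1] for row in rows if len(row) > 1)


def count_changes(original, custom):
    """Return custom minus original for every source seen in either database."""
    sources = sorted(set(original) | set(custom))
    return {src: custom.get(src, 0) - original.get(src, 0) for src in sources}


# Counts copied from the two listings above.
original = Counter({"GEM": 52515, "GTDB": 47894, "IMG": 21372})
custom = Counter({"GTDB": 42692, "GEM": 24347, "IMG": 18358, "Additional": 614})
changes = count_changes(original, custom)
```

With real files, `count_sources(open(path))` would produce the counters directly from each Host_Genomes.tsv.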

Comments (3)

  1. Simon Roux repo owner

    Hi Alyzza,

    Sorry, the statement “include GTDB genomes and the additional MAGs provided by the user, but not the GEM or IMG genomes.” is a bit misleading and needs a better explanation.

    What happens is that, for each candidate host genome, iPHoP needs either the genome itself or a representative genome from the same species to be present in the bacteria/archaea trees. This is because iPHoP relies heavily on the phylogenetic distance between hosts when considering multiple hits (for instance, if a phage has a blast hit to two bacteria, iPHoP uses the phylogenetic distance between these two bacteria to determine whether the hits are consistent and point to the same host, or inconsistent and point to a false-positive / non-useful signal). In practice, when you run “add_to_db”, iPHoP asks you to provide the results from gtdbtk de_novo, which are bacteria/archaea trees including the GTDB references and your new MAGs (if they could be included in the tree). GEM / IMG references that have a GTDB representative in the tree are kept, because we can still calculate phylogenetic distances from this representative. On the other hand, all the GEM / IMG genomes that represented new branches in the tree of the original database are removed, because these genomes are not in the new tree you generated with gtdbtk de_novo.
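
    The keep-or-remove rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not iPHoP's actual code: the function name, the genome IDs, and the idea of a simple genome-to-representative dictionary are all hypothetical.

```python
def split_references(rep_of, tree_leaves):
    """Partition genomes into those kept and those dropped from a new database.

    rep_of: dict mapping each genome ID to the ID of its species representative
            (a genome that represents itself maps to its own ID).
    tree_leaves: set of genome IDs present in the new gtdbtk de_novo trees.
    """
    kept, dropped = set(), set()
    for genome, rep in rep_of.items():
        # A genome is usable if it, or its species representative, is in the tree.
        if genome in tree_leaves or rep in tree_leaves:
            kept.add(genome)
        else:
            dropped.add(genome)
    return kept, dropped


# Toy IDs invented for illustration only.
tree_leaves = {"GTDB_A", "GTDB_B", "user_MAG_1"}
rep_of = {
    "GTDB_A": "GTDB_A",          # GTDB reference, in the new tree
    "user_MAG_1": "user_MAG_1",  # user MAG placed in the new tree
    "GEM_1": "GTDB_A",           # same species as a GTDB reference: kept
    "IMG_7": "IMG_7",            # was its own branch in the original tree: dropped
}
kept, dropped = split_references(rep_of, tree_leaves)
```

    In this toy case, GEM_1 survives because its representative is still a leaf of the new tree, while IMG_7 does not.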

    As for removing lines from “Host_Genomes.tsv”, this is relatively straightforward: when parsing the results of the different tools (blast, PHP, WisH, etc.), iPHoP verifies that it has information (taxonomy, tree representative, etc.) for each host genome. If it sees a host genome without information (i.e. a host genome not listed in “Host_Genomes.tsv”), it simply ignores that genome throughout.
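
    As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the genome ID is the first column of “Host_Genomes.tsv” and represents hits as simple tuples; the header name "Host_genome", the hit layout, and the function names are hypothetical and do not reflect iPHoP's internals.

```python
import csv
import io


def load_known_hosts(handle):
    """Read host genome IDs from a tab-separated table (assumed first column)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    return {row[0] for row in reader if row}


def filter_hits(hits, known_hosts):
    """Keep only hits whose host genome (second tuple field) is listed in the table."""
    return [hit for hit in hits if hit[1] in known_hosts]


# Toy table and hits invented for illustration; GEM_2 has been removed from the table.
host_table = io.StringIO("Host_genome\tSource\nGTDB_A\tGTDB\nGEM_1\tGEM\n")
known = load_known_hosts(host_table)
hits = [("virus_1", "GTDB_A", 95.0), ("virus_1", "GEM_2", 90.0)]
kept_hits = filter_hits(hits, known)
```

    Here the hit to GEM_2 is silently discarded because that genome has no row in the table, which is the effect of deleting a line from the file.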

    I hope this clarifies things at least a bit. Let me know if you have any other questions!

    Best,

    Simon
