[Question] Is it possible to ONLY include custom genomes?

Issue #66 closed
Josh L. Espinoza created an issue

For example, let’s say I have a custom database with the following:

20 eukaryotic genomes, proteins, CDS, taxonomic classifications, and GFF

100 prokaryotic genomes, proteins, CDS, taxonomic classifications, and GFF

Is it possible to ONLY include these genomes? If so, what would be needed to build this database instead of adding to it?

Comments (4)

  1. Simon Roux repo owner

    Hi Josh,

    Technically it may be feasible, however I don’t know if the results would be meaningful. iPHoP host database relies on a large-scale phylogeny, which we get by leveraging the GTDB-tk framwork. You could technically build a tree with just your genomes and recreate all GTDB-tk files, but since the tool has only been trained on large and comprehensive phylogenies, I don’t know how it would behave with only 100 genomes (i.e. I worry that the tool needs all the other genomes, if only as “decoy” / background).

    Also, an important note: iPHoP only works for bacteriophages and archaeal viruses. It has not been tested or validated for eukaryotic viruses, so I don’t know what including eukaryotic genomes in the host database would do.

    Hope it helps a little!

    Best,

    Simon

  2. Josh L. Espinoza reporter

    Thank you! This is very useful. Your reasoning makes perfect sense. Regarding the microbial database, is it trained on all of GTDB?

  3. Simon Roux repo owner

    Correct, the iPHoP host database is essentially GTDB supplemented with some publicly available MAG collections. We are in the process of finalizing a new version of the database to keep up with GTDB updates, which should be released around November.
