Can I achieve virus host prediction using only my own MAGs?

Issue #51 closed
w22 created an issue

Can I use the command iphop add_to_db without specifying the path to the original iphop database directory?

Comments (12)

  1. Simon Roux repo owner

    Hi,

    The short answer is no: you can add MAGs to the default database, but you can’t perform host prediction using only your own MAGs (or rather, you would have to build the entire database from scratch with just your MAGs, which is not what “add_to_db” does).

    Best,

    Simon

  2. w22 reporter

    Many thanks. If possible, could you please provide information or commands about building the entire database with just my own MAGs from scratch?

  3. Simon Roux repo owner

    I would actually discourage any user from attempting this. iPHoP is sensitive to the scope of the database: if the database is too reduced and does not cover a broad diversity of bacteria and archaea, I can’t guarantee that host prediction will work as expected, i.e. that the predictions will be correct and that the estimated FDR will be accurate. This is why we provide the option to add custom MAGs to the existing database, rather than creating a database from scratch.

    Is there any specific reason why “add_to_db” does not work in your case?

  4. w22 reporter

    In fact, I have about 30,000 MAGs from the same environment, and I don’t know if that is enough. MAGs from the same environment might be more helpful and accurate for predicting the hosts of the viruses in these samples, right?

  5. w22 reporter

    And if I want to add the 30,000 MAGs to the existing database, can I split the huge task into smaller tasks? For example, the 30,000 MAGs would be divided into 10 groups of 3,000 MAGs, and each group would be added in turn.

  6. Simon Roux repo owner

    The number seems good, but it’s more about the diversity (i.e. making sure there are representatives of as many taxa as possible). Unfortunately, you can’t divide these 30,000 MAGs into smaller batches if you want all of them to be considered at once: you will need to first build a tree with these additional ~30,000 MAGs using GTDB-tk, and then use “add_to_db”.

    On the other hand, you could technically split them into smaller groups (I would recommend by taxon, e.g. by phylum or class), and then process each group separately, i.e. build a custom iPHoP database for each taxon (or group of taxa). It’s not ideal, but I think it should work.
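
The per-taxon split suggested above could be sketched as follows, assuming a GTDB-Tk-style summary table is available for the MAGs (tab-separated, with `user_genome` and `classification` header columns). The function name `split_by_phylum` and the output layout are hypothetical and are not part of iPHoP or GTDB-Tk:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one list of MAG IDs per phylum.
# Assumes a tab-separated table whose header row contains the columns
# "user_genome" and "classification" (GTDB-style lineage, ranks separated by ";").
split_by_phylum() {
    local summary="$1" out_dir="$2"
    mkdir -p "$out_dir"
    awk -F'\t' -v out="$out_dir" '
        NR == 1 {
            # Locate the two columns by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "user_genome") g = i
                if ($i == "classification") c = i
            }
            next
        }
        g && c {
            # Genomes without a phylum-level assignment are grouped together.
            phylum = "unclassified"
            n = split($c, ranks, ";")
            for (j = 1; j <= n; j++) {
                if (ranks[j] ~ /^p__./) phylum = ranks[j]
            }
            print $g > (out "/" phylum ".txt")
        }
    ' "$summary"
}
```

Each resulting list (e.g. `p__Proteobacteria.txt`) names the MAGs that would go into one group; building the tree and running “add_to_db” for each group is still done separately, as described above.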

  7. w22 reporter

    Following the commands in ‘Adding bacterial and/or archaeal MAGs to the host database’, I tested the ‘add_to_db’ module in iPHoP.

    The command I used was:

    iphop add_to_db --num_threads 48 --fna_dir Wetland_MAGs --gtdb_dir Wetland_MAGs_GTDB-tk_results --out_dir Sept_2021_pub_rw_w_Wetland_hosts --db_dir ~/00database/06iphop/Sept_2021_pub_rw
    

    However, I ran into a problem at the step “[3] Load new host genomes in blast database...”:

    Starting
    [1] Get a list of genomes to import...
    [2] Import information from GTDBtk trees...
    Reading Wetland_MAGs_GTDB-tk_results/gtdbtk.ar122.decorated.tree
    Reading Wetland_MAGs_GTDB-tk_results/gtdbtk.bac120.decorated.tree
    [3] Load new host genomes in blast database...
    
    Building a new DB, current time: 08/07/2023 12:20:29
    New DB name:   /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
    New DB title:  /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes.fna
    Sequence type: Nucleotide
    Deleted existing Nucleotide BLAST database named /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
    Keep MBits: T
    Maximum file size: 1000000000B
    BLAST Database error: No alias or index file found for nucleotide database [/home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes] in search path [/home/wuzongzhi2022phd/00database/Data_test_add_to_db::]
    

    There seems to be an issue with makeblastdb. I checked the version of blastn in the conda env ‘iphop_env’: it is 2.12.0+.

  8. w22 reporter

    I then tried to run ‘makeblastdb’ directly in the ‘Host_Genomes’ directory, but it also failed.

    makeblastdb -in New_host_genomes.fna -out New_host_genomes -dbtype nucl
    

    The output of the above command:

    Building a new DB, current time: 08/07/2023 12:28:19
    New DB name:   /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
    New DB title:  New_host_genomes.fna
    Sequence type: Nucleotide
    Deleted existing Nucleotide BLAST database named /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
    Keep MBits: T
    Maximum file size: 1000000000B
    Bus error (core dumped)
    
