Can I achieve virus host prediction using only my own MAGs?
Can I use the command iphop add_to_db without specifying the path to the original iPHoP database directory?
Comments (12)
-
repo owner -
reporter Many thanks. If possible, could you please provide information or commands for building the entire database from scratch using only my own MAGs?
-
repo owner I would actually discourage any user from attempting this. iPHoP is sensitive to the scope of the database, i.e. if the database is too reduced and does not cover a broad diversity of bacteria and archaea, I can’t guarantee that the host prediction will work as expected, i.e. that the prediction will be correct and that the estimated FDR will be accurate. This is why we provide the option to add custom MAGs to the existing database, rather than creating a database from scratch.
Is there any specific reason why “add_to_db” does not work in your case?
-
reporter In fact, I have about 30,000 MAGs from the same environment, and I don’t know if that is enough. MAGs from the same environment might be more helpful and accurate for predicting the hosts of the viruses in these samples, right?
-
reporter And if I want to add the 30,000 MAGs to the existing database, can I split the huge task into smaller tasks? For example, the 30,000 MAGs would be divided into 10 groups of 3,000 MAGs, and the smaller tasks, each adding 3,000 MAGs, would be run in turn.
-
repo owner The number seems good, but it’s more about the diversity (i.e. make sure there are representatives of as many taxa as possible). Unfortunately, you can’t divide these 30,000 MAGs into smaller batches if you want all of them to be considered at once: you will need to first build a tree with all ~30,000 additional MAGs using GTDB-tk, and then use “add_to_db”.
On the other hand, you could technically split them into smaller groups (I would recommend by taxon, e.g. by phylum or class), and then process each group separately, i.e. build a custom iPHoP database for each taxon (or group of taxa). It’s not ideal, but I think it should work.
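The per-taxon split suggested above could be sketched as follows. This is only a minimal illustration, not part of iPHoP: it assumes the MAGs have already been classified with GTDB-tk and that the summary table has the tab-separated columns user_genome and classification (check your own GTDB-tk output before relying on this). The function name group_by_phylum and the example genome IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_phylum(summary_tsv_text):
    """Map each phylum label (e.g. 'p__Proteobacteria') to a list of genome IDs.

    Assumes a GTDB-tk-style summary table with 'user_genome' and
    'classification' columns (an assumption, verify against your files).
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(summary_tsv_text), delimiter="\t")
    for row in reader:
        ranks = row["classification"].split(";")
        # Take the phylum-level rank; genomes without one go to a catch-all group.
        phylum = next(
            (r for r in ranks if r.startswith("p__") and len(r) > 3),
            "unassigned",
        )
        groups[phylum].append(row["user_genome"])
    return dict(groups)


# Hypothetical example rows, for illustration only.
example = (
    "user_genome\tclassification\n"
    "MAG_001\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria\n"
    "MAG_002\td__Bacteria;p__Firmicutes;c__Bacilli\n"
    "MAG_003\td__Bacteria;p__Proteobacteria;c__Alphaproteobacteria\n"
    "MAG_004\tUnclassified Bacteria\n"
)

for phylum, genomes in sorted(group_by_phylum(example).items()):
    print(phylum, len(genomes), genomes)
```

Each resulting group would still need its own GTDB-tk tree and its own “add_to_db” run, as described above; the sketch only decides which MAG goes into which group.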
-
reporter Following the commands in ‘Adding bacterial and/or archaeal MAGs to the host database’, I tested the ‘add_to_db’ module in iPHoP.
The command I used was:
iphop add_to_db --num_threads 48 --fna_dir Wetland_MAGs --gtdb_dir Wetland_MAGs_GTDB-tk_results --out_dir Sept_2021_pub_rw_w_Wetland_hosts --db_dir ~/00database/06iphop/Sept_2021_pub_rw
However, I ran into an issue at step “[3] Load new host genomes in blast database...”:
Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Reading Wetland_MAGs_GTDB-tk_results/gtdbtk.ar122.decorated.tree
Reading Wetland_MAGs_GTDB-tk_results/gtdbtk.bac120.decorated.tree
[3] Load new host genomes in blast database...
Building a new DB, current time: 08/07/2023 12:20:29
New DB name: /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
New DB title: /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes.fna
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
Keep MBits: T
Maximum file size: 1000000000B
BLAST Database error: No alias or index file found for nucleotide database [/home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes] in search path [/home/wuzongzhi2022phd/00database/Data_test_add_to_db::]
There seems to be an issue with makeblastdb. I checked the version of blastn in the conda env ‘iphop_env’: it is 2.12.0+.
-
reporter I then tried running ‘makeblastdb’ directly in the ‘Host_Genomes’ directory, but that also failed.
makeblastdb -in New_host_genomes.fna -out New_host_genomes -dbtype nucl
The output of the above command:
Building a new DB, current time: 08/07/2023 12:28:19
New DB name: /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
New DB title: New_host_genomes.fna
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/wuzongzhi2022phd/00database/Data_test_add_to_db/Sept_2021_pub_rw_w_Wetland_hosts/db/Host_Genomes/New_host_genomes
Keep MBits: T
Maximum file size: 1000000000B
Bus error (core dumped)
-
reporter However, I found that other commands such as blastn work fine. Thanks for your advice!
-
repo owner Hi,
“Bus error” when attempting to build a blast database suggests to me that you ran out of either memory or disk space (see https://stackoverflow.com/questions/212466/what-is-a-bus-error-is-it-different-from-a-segmentation-fault). This unfortunately makes sense for 30,000 MAGs: you will have to use a relatively large node to prepare any database (including a blast one).
Best,
Simon
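As a rough pre-flight check for the disk-space half of this diagnosis, one could compare the total size of the input FASTA files with the free space on the output volume before launching the build. The sketch below is only an illustration under stated assumptions: the function names and the safety_factor value are hypothetical, the real space needed by makeblastdb and iPHoP is not known from this thread, and memory usage is not checked at all.

```python
import shutil
import tempfile
from pathlib import Path


def total_fasta_bytes(fna_dir, suffixes=(".fna", ".fa", ".fasta")):
    """Sum the sizes of uncompressed FASTA files directly inside fna_dir."""
    return sum(
        p.stat().st_size
        for p in Path(fna_dir).iterdir()
        if p.is_file() and p.suffix in suffixes
    )


def has_enough_disk(fna_dir, out_dir, safety_factor=3.0):
    """Return (ok, needed_bytes, free_bytes) for the volume holding out_dir.

    safety_factor is an arbitrary, assumed multiplier on the input size,
    not a documented requirement of makeblastdb or iPHoP.
    """
    needed = total_fasta_bytes(fna_dir) * safety_factor
    free = shutil.disk_usage(out_dir).free
    return free >= needed, needed, free


# Tiny self-contained demo on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.fna").write_text(">contig_1\nACGT\n")
    (Path(tmp) / "notes.txt").write_text("not a FASTA file, ignored")
    ok, needed, free = has_enough_disk(tmp, tmp)
    print(f"enough space: {ok} (need ~{needed:.0f} B, free {free} B)")
```

In practice fna_dir would be the directory passed to --fna_dir and out_dir an existing directory on the volume where the new database will be written; a passing check does not rule out the out-of-memory explanation.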
-
reporter Thanks for your kind suggestions!
-
repo owner - changed status to closed
Answered
Hi,
The short answer is no: you can add MAGs to the default database, but you can’t perform host prediction using only your own MAGs (or rather, you would have to build the entire database from scratch with just your MAGs, which is not what “add_to_db” does).
Best,
Simon