add MAGs to db
Hi, thank you for developing this tool. I am trying to add 120 MAGs to the DB, but the process has been running for more than 12 h. Is this normal? I have attached part of the log file below; I can't find any error in it and everything seems OK.
I ran the following command on a node with 20 CPUs and 300 GB RAM:
iphop add_to_db --fna_dir /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/final_bins/ --gtdb_dir /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/gtdb_classify_iphop/infer/ --out_dir /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage --db_dir /beegfs/work/workspace/ws/ho_kezau83-conda-0/iphop_db/Sept_2021_pub -t 20
[3] Load new host genomes in blast database...
Created nucleotide BLAST (alias) database /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/Host_Genomes/Host_Genomes with 14573964 sequences
[4] Get CRISPR arrays from new MAGs and add to database...
python /beegfs/work/workspace/ws/ho_kezau83-conda-0/conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/CRISPR/identify_crispr.folder.py -i /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/final_bins/ -o /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/Tmp_CRISPRs
python /beegfs/work/workspace/ws/ho_kezau83-conda-0/conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/CRISPR/get_crispr_database.py -d /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/Tmp_CRISPRs
Count total new spacers -> 1215
We have new spacers, we add to the existing db
Created nucleotide BLAST (alias) database /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/All_CRISPR_spacers_nr_clean with 1399345 sequences
[5] Add new genomes to WIsH database...
[6] Add new genomes to VHM database...
Loading custom packages...
Load existing database
Running Host Db building function
Thank you for your help.
Johan Sebastián
Comments (8)
-
repo owner -
Hi!
I've encountered a similar situation.
In my case, I tried to add 960 MAGs to the database using 24 CPUs and 300 GB RAM. No error or warning has been reported so far, but the software has been running for more than two weeks and has not reported any new progress.
Here is the code that I used:
iphop add_to_db --fna_dir /bins --gtdb_dir /gtdbtk_infer --out_dir /iphop_MAGs_db --db_dir /iphop_db/Aug_2023_pub_rw -t 24
Here is the running information:
Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Reading /gtdbtk_infer/gtdbtk.ar53.decorated.tree
Reading /gtdbtk_infer/gtdbtk.bac120.decorated.tree
[3] Load new host genomes in blast database...
Created nucleotide BLAST (alias) database /iphop_MAGs_db/db/Host_Genomes/Host_Genomes with 23287350 sequences
[4] Get CRISPR arrays from new MAGs and add to database...
python /home/zzhou/miniconda3/envs/iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/identify_crispr.folder.py -i bins/ -o /iphop_MAGs_db/db/Tmp_CRISPRs
python /home/zzhou/miniconda3/envs/iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/get_crispr_database.py -d /iphop_MAGs_db/db/Tmp_CRISPRs
[5] Add new genomes to WIsH database...
[6] Add new genomes to VHM database...
~ ~ ~ ~
Building a new DB, current time: 04/15/2024 09:24:35
New DB name: /iphop_MAGs_db/db/Host_Genomes/New_host_genomes
New DB title: /iphop_MAGs_db/db/Host_Genomes/New_host_genomes.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 140394 sequences in 17.4646 seconds.
~ ~ ~
I appreciate any help or suggestions. Thanks a lot.
Best wishes,
Ryan
-
repo owner Hi !
Do you have any other file/directory in your MAG folder that is not a fasta file ? We have seen this happen with the script stuck at this step (“Add new genomes to VHM database”) when there are non-fasta files / directories in the same folder as the MAGs, and the script does not know how to handle it.
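For anyone who wants to check this before launching `iphop add_to_db`, here is a minimal sketch that lists anything in the MAG folder that is a sub-directory or does not look like a FASTA file. The function name `find_unexpected_entries` and the list of suffixes are assumptions made for illustration; they are not part of iPHoP, and the suffixes should be adjusted to however your bins are named.

```python
from pathlib import Path

# Assumption: these suffixes identify FASTA files in your MAG folder.
# This is NOT taken from iPHoP; adjust it to match your own file naming.
FASTA_SUFFIXES = {".fna", ".fa", ".fasta"}


def find_unexpected_entries(mag_dir, suffixes=FASTA_SUFFIXES):
    """Return entries in mag_dir that are not regular files with a FASTA suffix.

    Sub-directories, hidden files, logs, checksums, etc. all end up in the
    returned list, sorted by path.
    """
    unexpected = []
    for entry in sorted(Path(mag_dir).iterdir()):
        # Anything that is not a regular file (e.g. a sub-folder) is flagged,
        # as is any file whose suffix is not in the allowed set.
        if not entry.is_file() or entry.suffix.lower() not in suffixes:
            unexpected.append(entry)
    return unexpected
```

Calling `find_unexpected_entries("/path/to/your/bins")` returns an empty list when the folder contains only FASTA files with the suffixes above; anything it returns is worth moving out of the folder before re-running.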
-
Yes, I checked the folder containing MAGs and found sub-folders within it. After removing the sub-folders, I reran the script, and it worked!
Thank you very much for your prompt response.
-
repo owner Perfect, thanks for the update !
-
repo owner - changed status to resolved
-
Hi Simon,
I’m having a similar issue with hanging on the VHM database step:
iphop add_to_db --fna_dir $MAG_dir --gtdb_dir $gtdb-tk_dir --out_dir May_2024_w_TR_hosts --db_dir /blastdb/iphop-db-aug23-rw/Aug_2023_pub_rw
Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Reading /workspace/rweed/TR_phage_redo/06_HOSTS/TR_2021_B05_GTDB-tk_results/gtdbtk.ar53.decorated.tree
Reading /workspace/rweed/TR_phage_redo/06_HOSTS/TR_2021_B05_GTDB-tk_results/gtdbtk.bac120.decorated.tree
ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db_infos/Translate_genus_to_full_taxo.tsv': File exists
ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/rafah_data': File exists
[3] Load new host genomes in blast database...
Building a new DB, current time: 05/06/2024 16:22:12
New DB name: /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/New_host_genomes
New DB title: /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/New_host_genomes.fna
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/New_host_genomes
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 67197 sequences in 2.08247 seconds.
Created nucleotide BLAST (alias) database /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/Host_Genomes with 23214153 sequences
[4] Get CRISPR arrays from new MAGs and add to database...
We will use the existing /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Tmp_CRISPRs/All_additional_spacers.nr.fasta
ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db_infos/All_CRISPR_array_size.tsv': File exists
ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db_infos/All_CRISPR_spacers_nr_clean.metrics.csv': File exists
[5] Add new genomes to WIsH database...
Skipping generation of wish models because /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/rewish_models_extra/Batch_extra.pkl is already here - please remove the file if you want to re-run this part
[6] Add new genomes to VHM database...
I don’t have any additional files in the MAG directory, but I wonder if this may be hanging because of the “File exists” warnings? I’m trying to add additional MAGs to a database that I had already added MAGs to (I was hoping to be able to add MAGs iteratively). Is this supported?
Thanks for any advice you might have!
-Roo
-
repo owner Hi !
Iteratively adding MAGs is not supported at this point. You will need to gather all your MAGs in a single folder, run them through a single GTDB-tk de novo run, and add them in a single step to the original iPHoP database.
Best, Simon
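As a rough illustration of the "single folder" part, the sketch below copies FASTA files from several source folders into one new folder and stops if two MAGs share a file name. The function name `gather_mags` and the suffix list are hypothetical choices for this example, not anything provided by iPHoP. It only prepares the folder; the combined set still needs its own GTDB-tk de novo run before `iphop add_to_db`.

```python
import shutil
from pathlib import Path

# Assumption: these suffixes identify your MAG FASTA files; adjust as needed.
FASTA_SUFFIXES = {".fna", ".fa", ".fasta"}


def gather_mags(source_dirs, target_dir, suffixes=FASTA_SUFFIXES):
    """Copy FASTA files from every folder in source_dirs into target_dir.

    Only regular files with a matching suffix are copied, so sub-folders and
    other files are left behind. Raises FileExistsError if two source folders
    contain a file with the same name, rather than silently overwriting it.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        for fasta in sorted(Path(source).iterdir()):
            if not fasta.is_file() or fasta.suffix.lower() not in suffixes:
                continue
            destination = target / fasta.name
            if destination.exists():
                raise FileExistsError(f"Duplicate MAG file name: {fasta.name}")
            shutil.copy2(fasta, destination)
            copied.append(destination)
    return copied
```

Copying rather than symlinking is a deliberate choice here, since the source does not say how `add_to_db` treats symbolic links in the MAG folder.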
Right, so adding MAGs to the db is still a bit experimental and can take quite a long time. I agree that I don’t see an error just yet, so for now I would say let it run if possible, and see if it finishes?