add MAGs to db

Issue #3 resolved
Former user created an issue

Hi, thank you for developing this tool. I am trying to add 120 MAGs to the DB, but the process has been running for more than 12 h. Is this normal? I attached part of the log file, in which I don't find any error and everything seems OK.

I ran the following command on a node with 20 CPUs and 300 GB RAM:

iphop add_to_db --fna_dir /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/final_bins/ --gtdb_dir /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/gtdb_classify_iphop/infer/ --out_dir /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage --db_dir /beegfs/work/workspace/ws/ho_kezau83-conda-0/iphop_db/Sept_2021_pub -t 20
[3] Load new host genomes in blast database...
Created nucleotide BLAST (alias) database /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/Host_Genomes/Host_Genomes with 14573964 sequences
[4] Get CRISPR arrays from new MAGs and add to database...
python /beegfs/work/workspace/ws/ho_kezau83-conda-0/conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/CRISPR/identify_crispr.folder.py -i /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/final_bins/ -o /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/Tmp_CRISPRs
python /beegfs/work/workspace/ws/ho_kezau83-conda-0/conda/envs/iphop_env/lib/python3.8/site-packages/iphop/utils/CRISPR/get_crispr_database.py -d /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/Tmp_CRISPRs
Count total new spacers -> 1215
We have new spacers, we add to the existing db
Created nucleotide BLAST (alias) database /beegfs/work/workspace/ws/ho_kezau83-silage_timetrial-0/Sept_2021_pub_w_silage/db/All_CRISPR_spacers_nr_clean with 1399345 sequences
[5] Add new genomes to WIsH database...

[6] Add new genomes to VHM database...
Loading custom packages...
Load existing database
Running Host Db building function

Thank you for your help.

Johan Sebastián

Comments (8)

  1. Simon Roux repo owner

    Right, so adding MAGs to the db is still a bit experimental and can take quite long. I agree that I don’t see an error just yet, so for now I would say to let it run if possible and see if it finishes.

  2. Zhengyuan Zhou

    Hi!

    I've encountered a similar situation.

    In my case, I tried to add 960 MAGs to the database using 24 CPUs and 300 GB RAM. No error or warning has been reported so far, but the software has been running for more than two weeks and hasn’t reported any new progress.

    Here is the command that I used:

    iphop add_to_db --fna_dir /bins --gtdb_dir /gtdbtk_infer --out_dir /iphop_MAGs_db --db_dir /iphop_db/Aug_2023_pub_rw -t 24
    

    Here is the running information:

    Starting
    [1] Get a list of genomes to import...
    [2] Import information from GTDBtk trees...
    Reading /gtdbtk_infer/gtdbtk.ar53.decorated.tree
    Reading /gtdbtk_infer/gtdbtk.bac120.decorated.tree
    [3] Load new host genomes in blast database...
    Created nucleotide BLAST (alias) database /iphop_MAGs_db/db/Host_Genomes/Host_Genomes with 23287350 sequences
    [4] Get CRISPR arrays from new MAGs and add to database...
    python /home/zzhou/miniconda3/envs/iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/identify_crispr.folder.py -i bins/ -o /iphop_MAGs_db/db/Tmp_CRISPRs
    python /home/zzhou/miniconda3/envs/iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/get_crispr_database.py -d /iphop_MAGs_db/db/Tmp_CRISPRs
    [5] Add new genomes to WIsH database...

    [6] Add new genomes to VHM database...

    Building a new DB, current time: 04/15/2024 09:24:35
    New DB name:   /iphop_MAGs_db/db/Host_Genomes/New_host_genomes
    New DB title:  /iphop_MAGs_db/db/Host_Genomes/New_host_genomes.fna
    Sequence type: Nucleotide
    Keep MBits: T
    Maximum file size: 1000000000B
    Adding sequences from FASTA; added 140394 sequences in 17.4646 seconds.

    I appreciate any help or suggestions. Thanks a lot.

    Best wishes,

    Ryan

  3. Simon Roux repo owner

    Hi!

    Do you have any other files or directories in your MAG folder that are not FASTA files? We have seen the script get stuck at this step (“Add new genomes to VHM database”) when there are non-FASTA files or sub-directories in the same folder as the MAGs, because the script does not know how to handle them.
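
A quick way to run this check before launching `iphop add_to_db` is sketched below. It is not part of iPHoP: the function name `find_non_fasta_entries` and the list of accepted suffixes are assumptions made for illustration, so adjust the suffixes to whatever your MAG files actually use.

```python
from pathlib import Path

# Assumption: illustrative suffixes only; iPHoP's own expectations may differ.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta")


def find_non_fasta_entries(mag_dir, suffixes=FASTA_SUFFIXES):
    """Return the sorted names of entries in mag_dir that are
    sub-directories or that do not end with one of the given suffixes."""
    offenders = []
    for entry in Path(mag_dir).iterdir():
        # Flag sub-folders and anything without a FASTA-like suffix
        # (hidden files such as .DS_Store have no suffix and are flagged too).
        if entry.is_dir() or entry.suffix.lower() not in suffixes:
            offenders.append(entry.name)
    return sorted(offenders)
```

Calling `find_non_fasta_entries("/path/to/final_bins")` returns an empty list when the folder contains only FASTA files; anything it does return is a candidate to move out of the folder before re-running.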

  4. Zhengyuan Zhou

    Yes, I checked the folder containing MAGs and found sub-folders within it. After removing the sub-folders, I reran the script, and it worked!

    Thank you very much for your prompt response.

  5. Louise Weed

    Hi Simon,

    I’m having a similar issue with hanging on the VHM database step:

    iphop add_to_db --fna_dir $MAG_dir --gtdb_dir $gtdb-tk_dir --out_dir May_2024_w_TR_hosts --db_dir /blastdb/iphop-db-aug23-rw/Aug_2023_pub_rw
    
    Starting
    [1] Get a list of genomes to import...
    [2] Import information from GTDBtk trees...
    Reading /workspace/rweed/TR_phage_redo/06_HOSTS/TR_2021_B05_GTDB-tk_results/gtdbtk.ar53.decorated.tree
    Reading /workspace/rweed/TR_phage_redo/06_HOSTS/TR_2021_B05_GTDB-tk_results/gtdbtk.bac120.decorated.tree
    ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db_infos/Translate_genus_to_full_taxo.tsv': File exists
    ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/rafah_data': File exists
    [3] Load new host genomes in blast database...
    Building a new DB, current time: 05/06/2024 16:22:12
    New DB name:   /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/New_host_genomes
    New DB title:  /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/New_host_genomes.fna
    Sequence type: Nucleotide
    Deleted existing Nucleotide BLAST database named /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/New_host_genomes
    Keep MBits: T
    Maximum file size: 1000000000B
    Adding sequences from FASTA; added 67197 sequences in 2.08247 seconds.
    Created nucleotide BLAST (alias) database /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Host_Genomes/Host_Genomes with 23214153 sequences
    [4] Get CRISPR arrays from new MAGs and add to database...
    We will use the existing /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/Tmp_CRISPRs/All_additional_spacers.nr.fasta
    ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db_infos/All_CRISPR_array_size.tsv': File exists
    ln: failed to create symbolic link '/automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db_infos/All_CRISPR_spacers_nr_clean.metrics.csv': File exists
    [5] Add new genomes to WIsH database...
    Skipping generation of wish models because /automounts/workspace/workspace/rweed/TR_phage_redo/06_HOSTS/May_2024_w_TR_hosts/db/rewish_models_extra/Batch_extra.pkl is already here - please remove the file if you want to re-run this part
    [6] Add new genomes to VHM database...

    I don’t have any additional files in the MAG directory, but I wonder if this may be hanging because of the “File exists” warnings? I’m trying to add additional MAGs to a database that I already added MAGs to (I was hoping to be able to add MAGs iteratively). Is this supported?

    Thanks for any advice you might have!

    -Roo

  6. Simon Roux repo owner

    Hi!

    Iteratively adding MAGs is not supported at this point. You will need to gather all your MAGs in a single folder, run GTDB-tk de novo once on all of them, and add them in a single step to the original iPHoP database.
    Best,

    Simon
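
For the "single folder" part of this advice, a minimal sketch is given below. It is not an iPHoP utility: `gather_mags` and the suffix list are hypothetical names chosen for illustration, and it only copies files. The GTDB-tk de novo run on the combined folder still has to be done separately.

```python
import shutil
from pathlib import Path

# Assumption: illustrative suffixes only; adjust to your own MAG file names.
FASTA_SUFFIXES = (".fna", ".fa", ".fasta")


def gather_mags(source_dirs, dest_dir, suffixes=FASTA_SUFFIXES):
    """Copy FASTA files from several folders into one flat folder.

    Sub-folders and non-FASTA files are skipped, so the destination
    contains MAG files only. A name clash between two source folders
    raises FileExistsError instead of silently overwriting a MAG.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source_dirs:
        for path in sorted(Path(src).iterdir()):
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue  # skip sub-folders and non-FASTA files
            target = dest / path.name
            if target.exists():
                raise FileExistsError(f"duplicate MAG file name: {path.name}")
            shutil.copy2(path, target)
            copied.append(path.name)
    return copied
```

The combined folder can then be passed as `--fna_dir`, together with the GTDB-tk output for that same set of MAGs as `--gtdb_dir`, starting again from the original database rather than from a previously extended one.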
