Error in adding MAGs to standard DB

Issue #22 closed
S.Y. Hsieh created an issue

Hi Simon,

Thanks for developing this useful tool. It works well when running standard prediction on our HPC cluster.

I also tried to add my own bacterial/archaeal MAGs to the DB to make a customised DB, so I ran gtdbtk de_novo_wf and got the output. However, I got the error below when starting to create the new DB (that is, the 2nd step):

Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Traceback (most recent call last):
  File "/opt/anaconda_iphop/bin/iphop", line 10, in <module>
    sys.exit(cli())
  File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/iphop.py", line 122, in cli
    args["func"](args)
  File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/modules/master_add_to_db.py", line 159, in main
    args['tree_a'] = glob.glob(os.path.join(args['gtdb_dir'],"gtdbtk.ar[0-9]*.decorated.tree"))[0]
IndexError: list index out of range

Our computing colleague thinks this looks like a software bug. Do you have any ideas about this issue?

I have also attached my script FYI. I have a total of 319 MAGs from 12 samples. Currently these are in different folders separated by sample ID under the same directory, and the gtdbtk outputs are also separated by sample ID. Should I keep them separate, or must I pool all MAGs (and the gtdbtk output too) in the same directory?

Many thanks.

Best

Ernie

Comments (9)

  1. Simon Roux repo owner

    Hi Ernie,

    The script looks OK to me. The problem seems to be that iPHoP does not find the archaeal tree. Can you check whether there is a file called “gtdbtk.ar[some numbers]*.decorated.tree” in your GTDB-tk output? If there is none, either GTDBtk had an issue, or in some cases it does not generate this file (which iPHoP did not expect). If it’s the latter, I think you could try creating an empty file with the above name and running it again; hopefully this would be enough.

    Best,

    Simon
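
For readers hitting the same traceback: `glob.glob` returns an empty list when no file matches, so indexing it with `[0]` raises the `IndexError` shown above. The following is a minimal sketch of a safer lookup, not iPHoP's actual code. The helper name `find_first` is hypothetical, the archaeal pattern is copied from the traceback, and the bacterial pattern is assumed by analogy.

```python
import glob
import os
import tempfile


def find_first(directory, pattern):
    """Return the first path matching pattern in directory, or None if nothing matches."""
    matches = sorted(glob.glob(os.path.join(directory, pattern)))
    return matches[0] if matches else None


# Demonstration on a throwaway directory that only holds a bacterial tree,
# mimicking a GTDB-tk run that produced no archaeal tree.
with tempfile.TemporaryDirectory() as gtdb_dir:
    open(os.path.join(gtdb_dir, "gtdbtk.bac120.decorated.tree"), "w").close()
    archaeal_tree = find_first(gtdb_dir, "gtdbtk.ar[0-9]*.decorated.tree")
    # Bacterial pattern is an assumption, modelled on the archaeal one.
    bacterial_tree = find_first(gtdb_dir, "gtdbtk.bac[0-9]*.decorated.tree")
    bacterial_name = os.path.basename(bacterial_tree)

print(archaeal_tree)   # None: no archaeal tree was found
print(bacterial_name)  # gtdbtk.bac120.decorated.tree
```

Running a check like this on the GTDB-tk output before calling `iphop add_to_db` shows directly which tree is missing.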

  2. Simon Roux repo owner

    Hi Ernie,

    Quick follow up: it seems indeed that if there are no archaeal MAGs in your set, gtdbtk does not generate the archaeal tree (anymore? I thought it used to, but I may be wrong). We will try to fix this in iPHoP as soon as possible, but in the meantime an “easy” way to get things going on your side will be to add the MAGs provided to test the tool (https://bitbucket.org/srouxjgi/iphop/downloads/Data_test_add_to_db.tar.gz - folder Wetland_MAGs/) to your MAG set. Because our test MAGs include both bacteria and archaea, gtdbtk should generate both a bacteria and an archaea tree, and you should not see the error in iPHoP add_to_db anymore.

    Best,

    Simon

  3. S.Y. Hsieh reporter

    Hi Simon,

    Following your suggestion, I added your test MAGs alongside my MAGs and re-ran ‘gtdb-tk de_novo_wf’ to regenerate both arc and bac trees, but it still failed to generate a decorated tree for archaeal hosts when I used p__Altarchaeota as my outgroup taxon. I could see ar53 files in the align and identify folders, but I didn’t see ar53.decorated.tree or its decorated.tree-taxonomy file in the infer folder (only the bac files).

    May I ask what ‘outgroup_taxon’ means? From the error log, it seems no outgroup could be set, so the tree couldn’t be rooted and generated. Oddly, I found another archaeal phylum (p__Undinarchaeota) in the input example of gtdb-tk’s de_novo_wf manual, so I tested this taxon and it finally generated an archaeal tree! So in my case, the issue seems to be the outgroup_taxon. If p__Altarchaeota does not generate an archaeal tree, I guess users could consider trying others (e.g., p__Undinarchaeota).

    PS. FYI, our gtdb-tk version is v2.0.0 and its DB version is r207, in case this affects the outgroup setting.

    Next, I used both ar53 and bac120 tree files to create a new DB, but it failed again at the 8th step (please see below):

    Starting
    [1] Get a list of genomes to import...
    [2] Import information from GTDBtk trees...
    Reading ./MAG-gtdbtk-results/infer/gtdbtk.ar53.decorated.tree
    Reading ./MAG-gtdbtk-results/infer/gtdbtk.bac120.decorated.tree
    [3] Load new host genomes in blast database...
    Created nucleotide BLAST (alias) database /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/Sept_2021_pub_new_MAGs_hosts_DB/db/Host_Genomes/Host_Genomes with 14601665 sequences
    [4] Get CRISPR arrays from new MAGs and add to database...
    python /opt/anaconda_iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/identify_crispr.folder.py -i ./final_binned_MAGs -o /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/Sept_2021_pub_new_MAGs_hosts_DB/db/Tmp_CRISPRs
    python /opt/anaconda_iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/get_crispr_database.py -d /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/Sept_2021_pub_new_MAGs_hosts_DB/db/Tmp_CRISPRs
    [5] Add new genomes to WIsH database...
    [6] Add new genomes to VHM database...
    [7] Add new genomes to PHP database...
    [8] Now build the new host genome metadata file...
    Traceback (most recent call last):
      File "/opt/anaconda_iphop/bin/iphop", line 10, in <module>
        sys.exit(cli())
      File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/iphop.py", line 122, in cli
        args["func"](args)
      File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/modules/master_add_to_db.py", line 222, in main
        add_to_genome_file(args,logger)
      File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/modules/master_add_to_db.py", line 55, in add_to_genome_file
        args['taxo_a'] = glob.glob(os.path.join(args['gtdb_dir'],"infer","gtdbtk.ar[0-9]*.decorated.tree-taxonomy"))[0]
    IndexError: list index out of range
    

    And my command is as follows:

    iphop add_to_db --fna_dir ./final_binned_MAGs --gtdb_dir ./MAG-gtdbtk-results/infer --out_dir ./Sept_2021_pub_new_MAGs_hosts_DB --db_dir /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/iphop_db/Sept_2021_pub/
    

    Do you have any ideas about this new issue? Many thanks!

    Best,

    Ernie

  4. Simon Roux repo owner

    Hi Ernie,

    The error seems to suggest that GTDB-tk did not finish correctly. Some potential explanations:

    • The “outgroup” option in GTDB-tk is used to specify the taxon on which the tree should be rooted. My recommendation (“p__Altarchaeota”) is for an older version of GTDB, and it looks like this taxon has now been renamed “p__Altiarchaeota”. You should be able to use the latter, although p__Undinarchaeota should also work for iPHoP.
    • The error from iPHoP is that it does not find a file with the tree-taxonomy, which should be in the output from GTDB-tk (in a folder called “infer”). Can you check what the list of files is in each folder of the GTDB-tk result directory?
    • And actually, looking at your command line, could you also try replacing “--gtdb_dir ./MAG-gtdbtk-results/infer” with “--gtdb_dir ./MAG-gtdbtk-results/”?

    Best,

    Simon
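
The second traceback shows iPHoP joining `gtdb_dir` with `infer` before looking for the tree-taxonomy file, which is why passing the `infer` folder itself leaves nothing to find. The sketch below checks a directory for that layout before `iphop add_to_db` is run. It is an illustration under assumptions, not part of iPHoP: `missing_gtdbtk_files` is a hypothetical helper, the archaeal patterns come from the tracebacks in this thread, the bacterial ones are assumed by analogy, and it only checks the `<gtdb_dir>/infer` location.

```python
import glob
import os
import tempfile

# Archaeal patterns are taken from the tracebacks in this thread;
# the bacterial ones are assumed by analogy.
EXPECTED_PATTERNS = [
    "gtdbtk.ar[0-9]*.decorated.tree",
    "gtdbtk.ar[0-9]*.decorated.tree-taxonomy",
    "gtdbtk.bac[0-9]*.decorated.tree",
    "gtdbtk.bac[0-9]*.decorated.tree-taxonomy",
]


def missing_gtdbtk_files(gtdb_dir):
    """List the expected patterns that match nothing in <gtdb_dir>/infer."""
    infer_dir = os.path.join(gtdb_dir, "infer")
    return [p for p in EXPECTED_PATTERNS
            if not glob.glob(os.path.join(infer_dir, p))]


# Build a mock GTDB-tk output directory containing all four files.
with tempfile.TemporaryDirectory() as root:
    infer_dir = os.path.join(root, "infer")
    os.makedirs(infer_dir)
    for name in ("gtdbtk.ar53.decorated.tree",
                 "gtdbtk.ar53.decorated.tree-taxonomy",
                 "gtdbtk.bac120.decorated.tree",
                 "gtdbtk.bac120.decorated.tree-taxonomy"):
        open(os.path.join(infer_dir, name), "w").close()

    missing_from_root = missing_gtdbtk_files(root)        # parent directory
    missing_from_infer = missing_gtdbtk_files(infer_dir)  # infer/ itself

print(missing_from_root)        # []
print(len(missing_from_infer))  # 4
```

Pointing the check at the parent directory finds everything, while pointing it at `infer/` finds nothing, mirroring the difference between the two `--gtdb_dir` values discussed above.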

  5. S.Y. Hsieh reporter

    Hi Simon,

    Thanks so much for your kind response.

    • Sure. I can re-try it using p__Altiarchaeota. Hopefully it can work!
    • In the original GTDB-tk output directory, I ran archaea and bacteria separately in different folders (that is, ‘./MAGs-gtdbtk-output/arc’ and ‘./MAGs-gtdbtk-output/bac’), so yes, in their respective ‘infer’ folders I can see gtdbtk.ar53.decorated.tree, gtdbtk.ar53.decorated.tree-taxonomy, gtdbtk.bac120.decorated.tree and gtdbtk.bac120.decorated.tree-taxonomy. I then copied these four files to ./MAG-gtdbtk-results/infer. Should I run it in their original directories so I don’t need to copy them to another new folder? (For the list of files please see below)
    • Actually, I have also tried --gtdb_dir ./MAG-gtdbtk-results/ but I got the same error.

    GTDB-tk directory (./MAGs-gtdbtk-output):

    “arc”:

      align/ :
        gtdbtk.ar53.filtered.tsv
        gtdbtk.ar53.msa.fasta.gz
        gtdbtk.ar53.user_msa.fasta.gz
        gtdbtk.bac120.filtered.tsv
        gtdbtk.bac120.msa.fasta.gz
        gtdbtk.bac120.user_msa.fasta.gz

      identify/ :
        gtdbtk.ar53.markers_summary.tsv
        gtdbtk.bac120.markers_summary.tsv
        gtdbtk.failed_genomes.tsv
        gtdbtk.translation_table_summary.tsv

      infer/ :
        intermediate_results/
        gtdbtk.ar53.decorated.tree (177KB)
        gtdbtk.ar53.decorated.tree-table
        gtdbtk.ar53.decorated.tree-taxonomy (487KB)

    “bac”:

      align/ : same as above

      identify/ : same as above

      infer/ :
        intermediate_results/
        gtdbtk.bac120.decorated.tree (3146KB)
        gtdbtk.bac120.decorated.tree-table
        gtdbtk.bac120.decorated.tree-taxonomy (9023KB)

    Cheers,

    Ernie

  6. Simon Roux repo owner

    Hi Ernie,

    Ok, that makes sense then. GTDB-tk can be run for both bacteria and archaea on the same output directory, and that’s what iPHoP expects (it is not able to understand that the files it’s looking for are in “bac” and “arc” folders). What iPHoP needs is a single folder with all files from both the bacterial and archaeal computations. You should be able to copy them over, or e.g. re-run GTDB-tk archaea in the “bac” folder and provide this bac folder to iPHoP.

    Best,

    Simon
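
The "copy them over" option can be sketched as below. This is a rough illustration, not a tested recipe for iPHoP: `merge_infer_outputs` is a hypothetical helper, it copies only the decorated tree files from each `infer` folder, and the thread does not confirm whether iPHoP also needs files from `align/` or `identify/`. Re-running GTDB-tk in a single output directory, the other option mentioned above, avoids that uncertainty.

```python
import glob
import os
import shutil
import tempfile


def merge_infer_outputs(source_dirs, merged_dir):
    """Copy decorated tree files from each <source>/infer into <merged_dir>/infer."""
    target = os.path.join(merged_dir, "infer")
    os.makedirs(target, exist_ok=True)
    copied = []
    for source in source_dirs:
        pattern = os.path.join(source, "infer", "gtdbtk.*.decorated.tree*")
        for path in sorted(glob.glob(pattern)):
            destination = os.path.join(target, os.path.basename(path))
            if os.path.exists(destination):
                # Refuse to overwrite: two runs produced the same file name.
                raise FileExistsError(destination)
            shutil.copy2(path, destination)
            copied.append(os.path.basename(path))
    return copied


# Mock "arc" and "bac" output folders laid out like the listing in this thread.
with tempfile.TemporaryDirectory() as root:
    for domain, prefix in (("arc", "gtdbtk.ar53"), ("bac", "gtdbtk.bac120")):
        infer_dir = os.path.join(root, domain, "infer")
        os.makedirs(infer_dir)
        for suffix in ("", "-table", "-taxonomy"):
            name = prefix + ".decorated.tree" + suffix
            open(os.path.join(infer_dir, name), "w").close()
    merged = merge_infer_outputs(
        [os.path.join(root, "arc"), os.path.join(root, "bac")],
        os.path.join(root, "merged"),
    )

print(len(merged))  # 6
```

The merged directory (not its `infer` subfolder) would then be the value to try for `--gtdb_dir`.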

  7. S.Y. Hsieh reporter

    Thanks Simon. I re-ran everything for gtdb-tk in one single directory and then re-created the DB. Now it works as expected.

    In fact, I had done the same thing once before, but as gtdb-tk failed to generate an archaeal tree, I still failed to create a new DB. So I think using a correct outgroup_taxon (“p__Altiarchaeota”) for archaeal hosts may be the key (at least in my case), as users may get an empty tree for archaea (for some reason…) when running gtdb-tk, which would cause iphop to stop the run.

    Many thanks for your kind help!

    Best

    Ernie
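
Since the root cause here was an outgroup name that no longer exists in the GTDB release in use, it may help to confirm the name before launching `de_novo_wf`. The sketch below assumes a taxonomy file with one `id<TAB>lineage` entry per line and semicolon-separated ranks. That format, the helper name `taxon_in_lineages`, and the placeholder genome IDs are all assumptions for illustration, and the location of such a file in a given GTDB-tk installation is not covered by this thread.

```python
def taxon_in_lineages(taxon, lines):
    """Return True if taxon is one of the ranks in any 'id<TAB>lineage' line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        ranks = [rank.strip() for rank in fields[1].split(";")]
        if taxon in ranks:
            return True
    return False


# Placeholder entries (not real genome accessions), using the phylum names
# mentioned in this thread.
sample = [
    "genome_A\td__Archaea;p__Altiarchaeota\n",
    "genome_B\td__Archaea;p__Undinarchaeota\n",
]

old_name_found = taxon_in_lineages("p__Altarchaeota", sample)
new_name_found = taxon_in_lineages("p__Altiarchaeota", sample)
print(old_name_found, new_name_found)  # False True
```

With a real taxonomy file, the `sample` list would be replaced by an open file handle, which is also iterable line by line.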
