Error in adding MAGs to standard DB
Hi Simon,
Thanks for developing this useful tool. It works well when running standard prediction on our HPC cluster.
I also tried to add my own bacterial/archaeal MAGs to the DB to make a customised DB, so I ran gtdbtk de_novo_wf and got the output. However, I got the following error message when starting to create the new DB (that is, the 2nd step):
Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Traceback (most recent call last):
File "/opt/anaconda_iphop/bin/iphop", line 10, in <module>
sys.exit(cli())
File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/iphop.py", line 122, in cli
args["func"](args)
File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/modules/master_add_to_db.py", line 159, in main
args['tree_a'] = glob.glob(os.path.join(args['gtdb_dir'],"gtdbtk.ar[0-9]*.decorated.tree"))[0]
IndexError: list index out of range
Our computing colleague thinks it looks like a software bug. Do you have any ideas about this issue?
I have also attached my script FYI. I have a total of 319 MAGs from 12 samples. Currently these are in different folders separated by sample ID but under the same directory, and the gtdbtk outputs are also separated by sample ID. Should I keep them separate, or must I pool all MAGs in the same directory (and the gtdbtk output too)?
Many thanks.
Best
Ernie
Comments (11)
-
repo owner -
repo owner Hi Ernie,
Quick follow-up: it does seem that if there are no archaeal MAGs in your set, gtdbtk does not generate the archaeal tree (anymore? I thought it used to, but I may be wrong). We will try to fix this in iPHoP as soon as possible, but in the meantime an “easy” way to get things going on your side is to add the MAGs provided to test the tool (https://bitbucket.org/srouxjgi/iphop/downloads/Data_test_add_to_db.tar.gz - folder Wetland_MAGs/) to your MAG set. Because our test MAGs include both bacteria and archaea, gtdbtk should generate both a bacterial and an archaeal tree, and you should not see the error in iPHoP add_to_db anymore.
Best,
Simon
-
reporter Hi Simon,
Thanks. I will give it a try and see if I can get it to work.
Cheers,
Ernie
-
reporter Hi Simon,
Following your suggestion, I added your test MAGs alongside my MAGs and re-ran ‘gtdb-tk de_novo_wf’ to re-generate both the arc and bac trees, but it still failed to generate a decorated tree for archaeal hosts when I used p__Altarchaeota as my outgroup taxon. I could see ar53 files in the align and identify folders, but I didn’t see ar53.decorated.tree and its decorated.tree.taxonomy file in the infer folder (only the bac files).
May I ask what ‘outgroup_taxon’ means? According to the error log, it seems no outgroup could be set, so the tree couldn’t be rooted and generated. It is odd, but I found another archaeal phylum (p__Undinarchaeota) in the input example of gtdb-tk’s de_novo_wf manual, so I tested this taxon and it finally generated an archaeal tree! So in my case, the issue seems to be the outgroup_taxon. If p__Altarchaeota does not generate an archaeal tree, I guess users could consider trying others (e.g., p__Undinarchaeota).
PS. FYI, our gtdb-tk version is v2.0.0 and its DB version is r207. Could this affect the outgroup setting?
Next, I used both ar53 and bac120 tree files to create a new DB, but it failed again at the 8th step (please see below):
Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Reading ./MAG-gtdbtk-results/infer/gtdbtk.ar53.decorated.tree
Reading ./MAG-gtdbtk-results/infer/gtdbtk.bac120.decorated.tree
[3] Load new host genomes in blast database...
Created nucleotide BLAST (alias) database /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/Sept_2021_pub_new_MAGs_hosts_DB/db/Host_Genomes/Host_Genomes with 14601665 sequences
[4] Get CRISPR arrays from new MAGs and add to database...
python /opt/anaconda_iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/identify_crispr.folder.py -i ./final_binned_MAGs -o /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/Sept_2021_pub_new_MAGs_hosts_DB/db/Tmp_CRISPRs
python /opt/anaconda_iphop/lib/python3.8/site-packages/iphop/utils/CRISPR/get_crispr_database.py -d /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/Sept_2021_pub_new_MAGs_hosts_DB/db/Tmp_CRISPRs
[5] Add new genomes to WIsH database...
[6] Add new genomes to VHM database...
[7] Add new genomes to PHP database...
[8] Now build the new host genome metadata file...
Traceback (most recent call last):
File "/opt/anaconda_iphop/bin/iphop", line 10, in <module>
sys.exit(cli())
File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/iphop.py", line 122, in cli
args["func"](args)
File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/modules/master_add_to_db.py", line 222, in main
add_to_genome_file(args,logger)
File "/opt/anaconda_iphop/lib/python3.8/site-packages/iphop/modules/master_add_to_db.py", line 55, in add_to_genome_file
args['taxo_a'] = glob.glob(os.path.join(args['gtdb_dir'],"infer","gtdbtk.ar[0-9]*.decorated.tree-taxonomy"))[0]
IndexError: list index out of range
And my command is as follows:
iphop add_to_db --fna_dir ./final_binned_MAGs --gtdb_dir ./MAG-gtdbtk-results/infer --out_dir ./Sept_2021_pub_new_MAGs_hosts_DB --db_dir /qib/research-groups/Simon-Carding/Ernie/ME_new_analysis/wms/wms-virus-host-iphop/iphop_db/Sept_2021_pub/
Do you have any ideas about this new issue? Many thanks!
Best,
Ernie
-
repo owner Hi Ernie,
The error seems to suggest that GTDB-tk did not finish correctly. Some potential explanations:
- The “outgroup” option in GTDB-tk is used to specify the taxon on which the tree should be rooted. My recommendation (“p__Altarchaeota”) is for an older version of GTDB, and it looks like this taxon has now been renamed “p__Altiarchaeota”. You should be able to use the latter, although p__Undinarchaeota should also work for iPHoP.
- The error from iPHoP is that it does not find a file with the tree-taxonomy, which should be in the output from GTDB-tk (in a folder called “infer”). Can you check which files are in each folder of the GTDB-tk result directory?
- And actually, looking at your command line, could you also try replacing “--gtdb_dir ./MAG-gtdbtk-results/infer” with “--gtdb_dir ./MAG-gtdbtk-results/”?
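To illustrate the last point: the traceback shows the taxonomy file being looked up with os.path.join(args['gtdb_dir'],"infer",...), so a --gtdb_dir that already ends in “infer” would be searched as “infer/infer”. Below is a minimal sketch of that path logic only. The helper name taxonomy_pattern is made up for illustration and simply mirrors the join shown in the traceback; it is not iPHoP code.

```python
import os


def taxonomy_pattern(gtdb_dir):
    """Mirror the path join shown in the traceback (hypothetical helper)."""
    return os.path.join(
        gtdb_dir, "infer", "gtdbtk.ar[0-9]*.decorated.tree-taxonomy"
    )


# Passing the "infer" folder itself: the pattern points inside infer/infer
print(taxonomy_pattern("./MAG-gtdbtk-results/infer"))

# Passing the parent folder: the pattern points inside infer/
print(taxonomy_pattern("./MAG-gtdbtk-results/"))
```

If the first printed pattern is the one being searched, the glob returns an empty list and indexing it with [0] raises the IndexError seen above.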
Best,
Simon
-
reporter Hi Simon,
Thanks so much for your kind response.
- Sure. I can re-try it using p__Altiarchaeota. Hopefully it can work!
- In the original GTDB-tk output directory, I ran archaea and bacteria separately in different folders (that is, ‘./MAGs-gtdbtk-output/arc' and './MAGs-gtdbtk-output/bac'), so in each of their 'infer' folders, yes, I can see gtdbtk.ar53.decorated.tree, gtdbtk.ar53.decorated.tree-taxonomy, gtdbtk.bac120.decorated.tree and gtdbtk.bac120.decorated.tree-taxonomy. Then I copied these four files to ./MAG-gtdbtk-results/infer. Should I run it in their original directories so I don’t need to copy them to a new folder? (For the list of files please see below)
- Actually, I have also tried --gtdb_dir ./MAG-gtdbtk-results/ but I got the same error.
GTDB-tk directory (./MAGs-gtdbtk-output):
“arc”:
  align/ :
    gtdbtk.ar53.filtered.tsv
    gtdbtk.ar53.msa.fasta.gz
    gtdbtk.ar53.user_msa.fasta.gz
    gtdbtk.bac120.filtered.tsv
    gtdbtk.bac120.msa.fasta.gz
    gtdbtk.bac120.user_msa.fasta.gz
  identify/ :
    gtdbtk.ar53.markers_summary.tsv
    gtdbtk.bac120.markers_summary.tsv
    gtdbtk.failed_genomes.tsv
    gtdbtk.translation_table_summary.tsv
  infer/ :
    intermediate_results/
    gtdbtk.ar53.decorated.tree (177KB)
    gtdbtk.ar53.decorated.tree-table
    gtdbtk.ar53.decorated.tree-taxonomy (487KB)
“bac”:
  align/ : same as above
  identify/ : same as above
  infer/ :
    intermediate_results/
    gtdbtk.bac120.decorated.tree (3146KB)
    gtdbtk.bac120.decorated.tree-table
    gtdbtk.bac120.decorated.tree-taxonomy (9023KB)
Cheers,
Ernie
-
repo owner Hi Ernie,
Ok, that makes sense then. GTDB-tk can be run for both bacteria and archaea in the same output directory, and that’s what iPHoP expects (it is not able to understand that the files it’s looking for are in “bac” and “arc” folders). What iPHoP needs is a single folder with all files from both the bacterial and archaeal computations. You should be able to copy them over, or e.g. re-run GTDB-tk archaea in the “bac” folder and provide this bac folder to iPHoP.
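If you go the copying route, a minimal sketch could look like the following. It is only an illustration under some assumptions: the function name merge_gtdbtk_runs and the “merged” destination folder are made up, the source folders are the “arc” and “bac” folders from your listing, and files that already exist in the destination (e.g. ones present in both runs) are left untouched and reported rather than overwritten, so you can check them by hand.

```python
import os
import shutil


def merge_gtdbtk_runs(run_dirs, merged_dir):
    """Copy every file from each run folder into merged_dir, keeping subfolders.

    Files already present in merged_dir are NOT overwritten; their relative
    paths are returned so they can be inspected manually.
    """
    skipped = []
    for run_dir in run_dirs:
        for root, _dirs, files in os.walk(run_dir):
            rel_root = os.path.relpath(root, run_dir)
            target_root = os.path.normpath(os.path.join(merged_dir, rel_root))
            os.makedirs(target_root, exist_ok=True)
            for name in files:
                target = os.path.join(target_root, name)
                if os.path.exists(target):
                    # Same relative path seen in an earlier run: keep the first copy
                    skipped.append(os.path.normpath(os.path.join(rel_root, name)))
                    continue
                shutil.copy2(os.path.join(root, name), target)
    return skipped


# Example call (paths taken from the listing above; "merged" is hypothetical):
# skipped = merge_gtdbtk_runs(
#     ["./MAGs-gtdbtk-output/arc", "./MAGs-gtdbtk-output/bac"],
#     "./MAGs-gtdbtk-output/merged",
# )
# print("Not overwritten:", skipped)
```

The merged folder would then be the one to pass as --gtdb_dir. Re-running GTDB-tk for both domains in one output directory avoids the question of clashing files altogether.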
Best,
Simon
-
reporter Thanks Simon. I re-ran everything in one single directory for gtdb-tk and then re-created the DB. Now it works as expected.
In fact, I had done the same thing once before, but because gtdb-tk failed to generate an archaeal tree, I still could not create a new DB. So I think using a correct outgroup_taxon (“p__Altiarchaeota”) for archaeal hosts may be the key (at least in my case), as users may get an empty tree for archaea (for some reason…) when running gtdb-tk, which would cause iphop to stop the run.
Many thanks for your kind help!
Best
Ernie
-
repo owner - changed status to closed
Solved
-
If we only have archaeal genomes to add, will it raise an error?
During the second step, “iphop predict”, this error appeared: “ValueError: Unexpected result of predict_function (Empty batch_outputs). Please use Model.compile(..., run_eagerly=True), or tf.config.run_functions_eagerly(True) for more information of where went wrong, or file a issue/bug to tf.keras.”
-
repo owner It’s possible, but easy to fix by adding the MAGs provided for the test case on top of your own MAGs.
Hi Ernie,
The script looks ok to me. The problem seems to be that iPHoP does not find the archaeal tree. Can you check whether there is a file called “gtdbtk.ar[some numbers]*.decorated.tree” in your GTDB-tk output? If there is none, either GTDB-tk had an issue, or in some cases it does not generate this file (which iPHoP did not expect). If it’s the latter, I think you could try to create an empty file with the above name and run it again; hopefully this would be enough.
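A quick way to run that check is sketched below. This is not part of iPHoP: the helper name find_decorated_trees is made up, the archaeal pattern is copied from the traceback, the bacterial pattern is assumed by analogy, and the search covers both the folder itself and an “infer” subfolder since the exact layout may differ.

```python
import glob
import os


def find_decorated_trees(gtdb_dir):
    """Return {domain: [matching tree files]} for a GTDB-tk output folder."""
    patterns = {
        "archaea": "gtdbtk.ar[0-9]*.decorated.tree",    # from the traceback
        "bacteria": "gtdbtk.bac[0-9]*.decorated.tree",  # assumed by analogy
    }
    found = {}
    for domain, pattern in patterns.items():
        hits = glob.glob(os.path.join(gtdb_dir, pattern))
        hits += glob.glob(os.path.join(gtdb_dir, "infer", pattern))
        found[domain] = sorted(hits)
    return found


if __name__ == "__main__":
    gtdb_dir = "."  # replace with your GTDB-tk output folder
    for domain, files in find_decorated_trees(gtdb_dir).items():
        print(domain, "->", files if files else "no decorated tree found")
```

An empty list for “archaea” would correspond to the situation in the traceback, where the glob returns nothing and taking element [0] fails.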
Best,
Simon