decorated file not detected in step 8

Issue #100 on hold
James Riddell created an issue

Hi Simon,

I attempted to add 2,304 Stordalen MAGs to the iphop db using iphop v1.3.2. I first ran the wetland MAGs test and was able to get the correct output, but when adding my custom MAG database it was unable to recognize my decorated tree files even though they existed. Here are scripts for how I built my GTDB database, and also my runscript for building the iphop database. Any idea what’s going on, or if my results are actually OK?

I compared the custom database with the original one and noticed that the decorated tree files are larger in the custom database. Is this just a warning, with my MAGs actually added to the tree, or will I run into issues when predicting hosts with this database? Thanks in advance!

Verifying custom db is larger:

# Custom database
ls -lahS 
total 158M
-rw-r--r-- 1 riddell26 PAS1573 129M Apr 29 15:59 All_CRISPR_spacers_nr_clean.metrics.csv
-rw-r--r-- 1 riddell26 PAS1573  18M Apr 29 16:49 Host_Genomes.tsv
-rw-r--r-- 1 riddell26 PAS1573 3.3M Apr 29 15:59 All_CRISPR_array_size.tsv
-rw-r----- 1 riddell26 PAS1573 3.3M Apr 29 16:42 Wish_negFits.csv
-rw-r--r-- 1 riddell26 PAS1573 3.2M Apr 29 14:44 gtdbtk.bac120.decorated.tree
-rw-r----- 1 riddell26 PAS1573 1.3M Apr 29 14:46 List_contigs_removed_blast.tsv
-rw-r--r-- 1 riddell26 PAS1573 181K Apr 29 14:44 gtdbtk.ar122.decorated.tree
-rw-r--r-- 1 riddell26 PAS1573 151K Apr 29 16:42 Wish_extra_negFits.csv
drwxr-xr-x 2 riddell26 PAS1573 4.0K Apr 29 16:49 .
drwxr-xr-x 4 riddell26 PAS1573 4.0K Apr 29 14:44 ..
lrwxrwxrwx 1 riddell26 PAS1573  105 Apr 29 14:44 Translate_genus_to_full_taxo.tsv -> /fs/project/PAS1117/modules/sequence_dbs/iPHoP/Sept_2021_pub_rw/db_infos/Translate_genus_to_full_taxo.tsv
# Original database
ls -lahS /fs/project/PAS1117/modules/sequence_dbs/iPHoP/Sept_2021_pub_rw/db_infos
total 161M
-rw-r----- 1 osu9664 PAS1117 128M Mar 31  2023 All_CRISPR_spacers_nr_clean.metrics.csv
-rw-r----- 1 osu9664 PAS1117  22M Mar 31  2023 Host_Genomes.tsv
-rw-r----- 1 osu9664 PAS1117 3.3M Mar 31  2023 Wish_negFits.csv
-rw-r----- 1 osu9664 PAS1117 3.2M Mar 31  2023 All_CRISPR_array_size.tsv
-rw-r----- 1 osu9664 PAS1117 2.7M Mar 31  2023 gtdbtk.bac120.decorated.tree
-rw-r----- 1 osu9664 PAS1117 1.3M Mar 31  2023 List_contigs_removed_blast.tsv
-rw-r----- 1 osu9664 PAS1117 393K Mar 31  2023 Translate_genus_to_full_taxo.tsv
-rw-r----- 1 osu9664 PAS1117 161K Mar 31  2023 gtdbtk.ar122.decorated.tree
drwxr-x--- 2 osu9664 PAS1117 4.0K Mar 31  2023 .
drwxr-x--- 4 osu9664 PAS1117 4.0K Apr  2  2023 ..
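
File sizes only show that the files differ, not that the new MAGs ended up in them. A more direct check, sketched below, counts how many MAG names occur in a given text file such as Host_Genomes.tsv or a decorated tree. The function name `count_mags_listed` is hypothetical, and the sketch assumes one MAG per `.fa` file, with the file basename (minus `.fa`) used as the genome identifier downstream.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many MAGs from a directory are mentioned in a text file.
# Assumption: one MAG per *.fa file, and the basename without ".fa" is the genome ID.
count_mags_listed() {
    local mag_dir="$1" target="$2"
    local total=0 found=0 fasta name
    for fasta in "$mag_dir"/*.fa; do
        [ -e "$fasta" ] || continue      # skip the unexpanded glob when no .fa files exist
        name="$(basename "$fasta" .fa)"
        total=$((total + 1))
        # -F: fixed string, -q: quiet. This is a substring match, so "bin.1"
        # also matches "bin.12"; treat the result as an upper bound.
        if grep -Fq -- "$name" "$target"; then
            found=$((found + 1))
        fi
    done
    echo "$found/$total"
}

# Example call (paths are placeholders):
# count_mags_listed /path/to/MAGs /path/to/db_infos/Host_Genomes.tsv
```

A result of `0/N` against Host_Genomes.tsv would suggest that none of the MAGs were registered, whatever the file sizes say.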

GTDB-infer script

module use /fs/project/PAS1117/modulefiles
module load GTDB-Tk

dataDir="/fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/MAGs"

# bacteria
gtdbtk de_novo_wf --genome_dir ${dataDir}/ --bacteria --outgroup_taxon p__Patescibacteria --out_dir /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/ --cpus 40 --force --extension fa

# archaea
gtdbtk de_novo_wf --genome_dir ${dataDir}/ --archaea --outgroup_taxon p__Undinarchaeota --out_dir /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/ --cpus 40 --force --extension fa

GTDB-infer logfile

[2023-12-06 10:58:07] INFO: GTDB-Tk v2.1.1
[2023-12-06 10:58:07] INFO: gtdbtk de_novo_wf --genome_dir /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/MAGs/ --archaea --outgroup_taxon p__Undinarchaeota --out_dir /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/ --cpus 40 --force --extension fa
[2023-12-06 10:58:07] INFO: Using GTDB-Tk reference data version r207: /fs/project/PAS1117/modules/GTDB-Tk/2.1.1/share/gtdbtk-2.1.1/db
[2023-12-06 10:58:09] INFO: Identifying markers in 2,304 genomes with 40 threads.
[2023-12-06 10:58:10] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-12-06 11:18:22] INFO: Completed 2,304 genomes in 20.21 minutes (114.02 genomes/minute).
[2023-12-06 11:18:25] TASK: Identifying TIGRFAM protein families.
[2023-12-06 11:31:47] INFO: Completed 2,304 genomes in 13.36 minutes (172.50 genomes/minute).
[2023-12-06 11:31:47] TASK: Identifying Pfam protein families.
[2023-12-06 11:32:12] INFO: Completed 2,304 genomes in 25.30 seconds (91.05 genomes/second).
[2023-12-06 11:32:12] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2023-12-06 11:32:12] TASK: Summarising identified marker genes.
[2023-12-06 11:33:58] INFO: Completed 2,304 genomes in 1.76 minutes (1,308.39 genomes/minute).
[2023-12-06 11:33:58] INFO: Done.
[2023-12-06 11:34:02] INFO: Aligning markers in 2,304 genomes with 40 CPUs.
[2023-12-06 11:34:02] INFO: Processing 2,218 genomes identified as bacterial.
[2023-12-06 11:34:09] INFO: Read concatenated alignment for 62,291 GTDB genomes.
[2023-12-06 11:34:09] TASK: Generating concatenated alignment for each marker.
[2023-12-06 11:34:14] INFO: Completed 2,218 genomes in 1.99 seconds (1,115.86 genomes/second).
[2023-12-06 11:34:15] TASK: Aligning 120 identified markers using hmmalign 3.1b2 (February 2015).
[2023-12-06 11:35:36] INFO: Completed 120 markers in 1.28 minutes (93.51 markers/minute).
[2023-12-06 11:35:37] TASK: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2023-12-06 11:37:26] INFO: Completed 64,509 sequences in 1.81 minutes (35,655.60 sequences/minute).
[2023-12-06 11:37:26] INFO: Masked bacterial alignment from 41,084 to 5,036 AAs.
[2023-12-06 11:37:26] INFO: 1 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-12-06 11:37:26] INFO: Creating concatenated alignment for 64,508 bacterial GTDB and user genomes.
[2023-12-06 11:37:44] INFO: Creating concatenated alignment for 2,217 bacterial user genomes.
[2023-12-06 11:37:45] INFO: Processing 86 genomes identified as archaeal.
[2023-12-06 11:37:45] INFO: Read concatenated alignment for 3,412 GTDB genomes.
[2023-12-06 11:37:46] TASK: Generating concatenated alignment for each marker.
[2023-12-06 11:37:48] INFO: Completed 86 genomes in 0.12 seconds (698.03 genomes/second).
[2023-12-06 11:37:49] TASK: Aligning 53 identified markers using hmmalign 3.1b2 (February 2015).
[2023-12-06 11:37:53] INFO: Completed 53 markers in 2.14 seconds (24.78 markers/second).
[2023-12-06 11:37:54] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2023-12-06 11:37:57] INFO: Completed 3,498 sequences in 3.68 seconds (951.10 sequences/second).
[2023-12-06 11:37:57] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs.
[2023-12-06 11:37:57] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2023-12-06 11:37:57] INFO: Creating concatenated alignment for 3,498 archaeal GTDB and user genomes.
[2023-12-06 11:37:59] INFO: Creating concatenated alignment for 86 archaeal user genomes.
[2023-12-06 11:37:59] INFO: Done.
[2023-12-06 11:37:59] INFO: Inferring FastTree (WAG, SH support values) using a maximum of 40 CPUs.
[2023-12-06 12:28:03] INFO: FastTree version: precision
[2023-12-06 12:28:03] INFO: Done.
[2023-12-06 12:28:03] INFO: Reading GTDB taxonomy for representative genomes.
[2023-12-06 12:28:03] INFO: Read taxonomy for 65,703 genomes.
[2023-12-06 12:28:03] INFO: Identifying genomes from the specified outgroup: p__Undinarchaeota
[2023-12-06 12:28:03] INFO: Identified 5 outgroup taxa in the tree.
[2023-12-06 12:28:03] INFO: Identified 3,493 ingroup taxa in the tree.
[2023-12-06 12:28:03] INFO: Outgroup is monophyletic.
[2023-12-06 12:28:03] INFO: Rerooting tree.
[2023-12-06 12:28:03] INFO: Rerooted tree written to: /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/infer/intermediate_results/gtdbtk.ar53.rooted.tree
[2023-12-06 12:28:03] INFO: Done.
[2023-12-06 12:28:03] INFO: Reading GTDB taxonomy for representative genomes.
[2023-12-06 12:28:04] INFO: Read taxonomy for 65,703 genomes.
[2023-12-06 12:28:04] INFO: Reading tree.
[2023-12-06 12:28:04] INFO: Removing any previous internal node labels.
[2023-12-06 12:28:04] INFO: Calculating F-measure statistic for each taxa.
[2023-12-06 12:28:04] INFO: Calculating taxa within each lineage.
[2023-12-06 12:28:04] INFO: Processing 1 taxa at Domain rank.
[2023-12-06 12:28:04] INFO: Processing 20 taxa at Phylum rank.
[2023-12-06 12:28:05] INFO: Processing 53 taxa at Class rank.
[2023-12-06 12:28:05] INFO: Processing 133 taxa at Order rank.
[2023-12-06 12:28:05] INFO: Processing 457 taxa at Family rank.
[2023-12-06 12:28:05] INFO: Processing 1,344 taxa at Genus rank.
[2023-12-06 12:28:06] INFO: Processing 3,412 taxa at Species rank.
[2023-12-06 12:28:06] WARNING: There are 41 taxa with multiple placements of equal quality.
[2023-12-06 12:28:06] WARNING: These were resolved by placing the label at the most terminal position.
[2023-12-06 12:28:06] WARNING: Ideally, taxonomic assignment of all genomes should be established before tree decoration.
[2023-12-06 12:28:06] INFO: Placing labels on tree.
[2023-12-06 12:28:06] INFO: Writing out statistics for taxa.
[2023-12-06 12:28:06] INFO: Writing out inferred taxonomy for each genome.
[2023-12-06 12:28:06] INFO: Writing out decorated tree.
[2023-12-06 12:28:06] INFO: Done.
[2023-12-06 12:28:06] INFO: Removing intermediate files.
[2023-12-06 12:28:34] INFO: Intermediate files removed.
[2023-12-06 12:28:34] INFO: Done.

Checking trees are there:

ls -lahS /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/
total 19K
drwxr-xr-x 6 riddell26 PAS1573  16K Apr 26 11:38 ..
drwxr-xr-x 3 riddell26 PAS1573 4.0K Dec  6 12:28 .
drwxr-xr-x 2 riddell26 PAS1573 4.0K Dec  6 12:28 infer
lrwxrwxrwx 1 riddell26 PAS1573   40 Dec  6 05:53 gtdbtk.bac120.decorated.tree-table -> infer/gtdbtk.bac120.decorated.tree-table
lrwxrwxrwx 1 riddell26 PAS1573   38 Dec  6 12:28 gtdbtk.ar53.decorated.tree-table -> infer/gtdbtk.ar53.decorated.tree-table
lrwxrwxrwx 1 riddell26 PAS1573   34 Dec  6 05:53 gtdbtk.bac120.decorated.tree -> infer/gtdbtk.bac120.decorated.tree
lrwxrwxrwx 1 riddell26 PAS1573   32 Dec  6 12:28 gtdbtk.ar53.decorated.tree -> infer/gtdbtk.ar53.decorated.tree
ls -lahS /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/infer
total 7.9M
-rw-r--r-- 1 riddell26 PAS1573 4.2M Dec  6 05:53 gtdbtk.bac120.decorated.tree-table
-rw-r--r-- 1 riddell26 PAS1573 3.2M Dec  6 05:53 gtdbtk.bac120.decorated.tree
-rw-r--r-- 1 riddell26 PAS1573 266K Dec  6 12:28 gtdbtk.ar53.decorated.tree-table
-rw-r--r-- 1 riddell26 PAS1573 181K Dec  6 12:28 gtdbtk.ar53.decorated.tree
drwxr-xr-x 2 riddell26 PAS1573 4.0K Dec  6 12:28 .
drwxr-xr-x 3 riddell26 PAS1573 4.0K Dec  6 12:28 ..

iphop build custom database runscript

module use /fs/project/PAS1117/modulefiles
module load iPHoP/1.3.2

iphop add_to_db --fna_dir /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/MAGs/ \
--gtdb_dir /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/ \
--out_dir iphop_2021_pub_grantham_rw \
--db_dir /fs/project/PAS1117/modules/sequence_dbs/iPHoP/Sept_2021_pub_rw/ \
--num_threads 40

iphop build custom database head and tail of logfile

head

Starting
[1] Get a list of genomes to import...
[2] Import information from GTDBtk trees...
Reading /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/gtdbtk.ar53.decorated.tree
Reading /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/gtdbtk.bac120.decorated.tree
[3] Load new host genomes in blast database...
STM_0716_E_M_E026_E030_E034_D_multi_bin.12 doesn't have a representative in the trees, so we don't include in the database


Building a new DB, current time: 04/29/2024 14:45:54
New DB name:   /fs/ess/PAS1117/riddell26/Grantham_Bioreactor/03-predict-hosts/scripts/iphop_2021_pub_grantham_rw/db/Host_Genomes/New_host_genomes
New DB title:  /fs/ess/PAS1117/riddell26/Grantham_Bioreactor/03-predict-hosts/scripts/iphop_2021_pub_grantham_rw/db/Host_Genomes/New_host_genomes.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 713676 sequences in 56.588 seconds.


Created nucleotide BLAST (alias) database /fs/ess/PAS1117/riddell26/Grantham_Bioreactor/03-predict-hosts/scripts/iphop_2021_pub_grantham_rw/db/Host_Genomes/Host_Genomes with 15242149 sequences
[4] Get CRISPR arrays from new MAGs and add to database...

tail

Created nucleotide BLAST (alias) database /fs/ess/PAS1117/riddell26/Grantham_Bioreactor/03-predict-hosts/scripts/iphop_2021_pub_grantham_rw/db/All_CRISPR_spacers_nr_clean with 1408782 sequences
[5] Add new genomes to WIsH database...
/users/PAS1573/riddell26/.local/lib/python3.8/site-packages/iphop/modules/wish.py:181: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

    >>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

    >>> .groupby(..., group_keys=True)
  final_output = final_output.sort_values(by='LL',ascending=False).groupby('Virus').apply(lambda x: x.nlargest(n=n_hostbyphage,columns='LL',keep='all')).reset_index(drop=True)

[6] Add new genomes to VHM database...
[7] Add new genomes to PHP database...
counting kmer ...
Preparing output file
done.
[8] Now build the new host genome metadata file...
Reading /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/gtdbtk.ar53.decorated.tree
Reading /fs/scratch/Sullivan_Lab/JamesR/Grantham_Bioreactor/Grantham_MAGs_GTDB-tk_results/gtdbtk.bac120.decorated.tree

We added 0 additional bacteria genomes and 0 additional archaea genomes
[9] All done

!#!#!#!#!#! WARNING --- SOME UNEXPECTED EVENTS HAPPENED -- WE LIST THEM BELOW, IT COULD BE NOTHING, BUT YOU SHOULD STILL DOUBLE-CHECK #!#!#!#!#!#!#

Note - we did not find a decorated file for the archaeal tree, so we did not use any data from a new archaeal genome
Note - we did not find a decorated file for the bacterial tree, so we did not use any data from a new bacterial genome

!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!!#!#!#!#!#!#!
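
The two lines that matter in this log are the "We added ..." summary from step 8 and the "Note - ..." warnings. A minimal sketch for pulling them out of a saved log follows. The function name `summarise_add_to_db_log` is hypothetical, and the patterns rely only on the wording seen in the log above, which other iPHoP versions may phrase differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise an "iphop add_to_db" log that was saved to a file.
# Assumption: the messages are worded exactly as in the log shown above.
summarise_add_to_db_log() {
    local log="$1"
    # Warnings are printed as lines starting with "Note - "
    grep -E '^Note - ' "$log" || true
    # Step 8 reports how many genomes were added
    grep -E '^We added [0-9]+ additional bacteria genomes' "$log" || true
    # Return non-zero when the log says nothing was added for either domain
    if grep -Eq '^We added 0 additional bacteria genomes and 0 additional archaea genomes' "$log"; then
        return 1
    fi
    return 0
}

# Example call (the log file name is a placeholder):
# summarise_add_to_db_log add_to_db.log || echo "No new genomes were added"
```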

Comments (9)

  1. Simon Roux repo owner

    That is weird. These warnings are usually correct, but then I’m not sure why the tree files would be different. Can you check in Host_Genomes.tsv whether your bins are listed? Otherwise, the next step would be to re-run “add_to_db” with the “--debug” option and see if the expanded log provides more information.

  2. Simon Roux repo owner

    Oh wait a minute, I think I just understood what happened: the files that are missing are the ones ending in “.decorated.tree-taxonomy”, which I don’t see in your GTDB-tk output. Can you check which version of GTDB-tk you are using? I wonder if something changed in GTDB-tk so that it generates input files that are not compatible with iPHoP.

  3. James Riddell reporter

    Thanks for catching that! I am using gtdbtk: version 2.1.1. I’ll check your readme again and see which version I need to use to be compatible with iphop 1.3.2.

  4. James Riddell reporter

    I’m also noticing now that iphop v1.3.3 has been released and is using the updated GTDB database. If I want to use this instead, what version of GTDB would be required?

  5. Simon Roux repo owner

    That is weird. For reference, I used version 2.3.2 recently and it worked all right, but I believe even in 2.1.1 you should see these “taxonomy” files in the infer folder. Not sure what’s happening, but hopefully updating the version of gtdbtk will fix it?

  6. Simon Roux repo owner

    Sorry our messages crossed :-) I typically recommend GTDB-tk 2.3.2, using GTDB database version 214.1 (I have not worked with the very latest GTDB database version yet).
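
The check discussed in the comments above (does each decorated tree have a matching “.decorated.tree-taxonomy” file?) can be sketched as a small shell function. The name `check_decorated_taxonomy` is hypothetical. The sketch assumes the taxonomy file sits next to the tree with the same prefix, and it does not confirm where iPHoP itself looks for these files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every *.decorated.tree in a directory, report whether
# a sibling *.decorated.tree-taxonomy file exists.
# Assumption: the taxonomy file shares the tree's full name plus "-taxonomy".
check_decorated_taxonomy() {
    local gtdb_dir="$1" found=0 missing=0 tree
    for tree in "$gtdb_dir"/*.decorated.tree; do
        [ -e "$tree" ] || continue       # skips an unexpanded glob or a broken symlink
        found=$((found + 1))
        if [ -e "${tree}-taxonomy" ]; then
            echo "OK       ${tree}-taxonomy"
        else
            echo "MISSING  ${tree}-taxonomy"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if at least one tree was found and none lacks its taxonomy file
    [ "$found" -gt 0 ] && [ "$missing" -eq 0 ]
}

# Example call (path is a placeholder for the --gtdb_dir passed to add_to_db):
# check_decorated_taxonomy /path/to/GTDB-tk_results || echo "taxonomy files missing"
```

Run against the directory listing posted in the issue, this would report both trees as MISSING, since only the “.decorated.tree” and “.decorated.tree-table” files are present there.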
