Warning message: I could not understand the host so I will not use the corresponding line

Issue #80 closed
Karen Jin created an issue

I divided my sequences into 10 batches (~3000 seqs per batch) and ran iphop prediction; everything seemed to run smoothly but I got this warning message at the end of the log file:

[10/1] Preparing the detailed output...
[10/2] Preparing the iPHoP-only result file, linking viruses to individual genomes (/storage/jufengLab/jinlingrong/Project/river_virome/9_iphop/host_all_vOTUs_fl/fl_votu_sequences.03/Host_prediction_to_genome_m90.csv) ...
[10/3] Preparing the combined iPHoP / RaFAH output summarized at the genus rank (/storage/jufengLab/jinlingrong/Project/river_virome/9_iphop/host_all_vOTUs_fl/fl_votu_sequences.03/Host_prediction_to_genus_m90.csv) ... 

!#!#!#!#!#! WARNING --- SOME UNEXPECTED EVENTS HAPPENED -- WE LIST THEM BELOW, IT COULD BE NOTHING, BUT YOU SHOULD STILL DOUBLE-CHECK #!#!#!#!#!#!#

I could not understand the host of 3300012136_9|3300012136.a:Ga0153985_1000961, so I will not use the corresponding line
I could not understand the host of 3300027777_28|3300027777.a:Ga0209829_10011870, so I will not use the corresponding line
I could not understand the host of GCF_003469815.1|NZ_QSKE01000034.1, so I will not use the corresponding line
I could not understand the host of GCA_900555455.1|USLF01000010.1, so I will not use the corresponding line
I could not understand the host of 3300029842_40|3300029842.a:Ga0245273_107660, so I will not use the corresponding line
I could not understand the host of 3300029841_29|3300029841.a:Ga0245272_101460, so I will not use the corresponding line
I could not understand the host of 3300029768_34|3300029768.a:Ga0245196_102685, so I will not use the corresponding line
I could not understand the host of 3300029758_26|3300029758.a:Ga0242784_101726, so I will not use the corresponding line

The warning messages went on for about 2000 lines, and this happened for every batch run. I checked the “host_prediction_to_genus.csv” and “host_prediction_to_genome.csv” files and they seemed to be fine. Should I be concerned about the warning message?

Comments (4)

  1. Karen Jin reporter

    I forgot to mention that I added my own MAGs to the database according to the instructions (it ran successfully without warning messages in a previous experimental run). I’m not sure whether this is related to the warning message, or whether it is related to a recent database update (I used iphop=1.3.2 with the default “Sept_2021_pub_rw” database).

  2. Simon Roux repo owner

    Hi Karen,

    This is exactly the question I was going to ask (“is this the default database or a custom one”) :-) If this is a custom database, this is entirely normal. Briefly, building a custom database relies on generating a new tree de novo using GTDB-Tk. This new tree will include user-provided MAGs but will not include genomes that we added to the default database and that are not currently in GTDB (e.g. all the GEM MAGs, such as 3300012136_9, 3300027777_28, etc). Because these genomes are no longer in the tree, they have no taxonomic information in the custom database, but they are still present in the BLAST and CRISPR databases (because removing them causes more issues than leaving them in and ignoring them at the parsing step later). These are the warnings you see here: you got some hits to genomes that are in the default database but were removed from the tree in your custom database.
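
    To see which host genomes are behind the skipped lines, the warnings can be tallied from the log. The following is a minimal sketch, not part of iPHoP: the regular expression is copied from the warning lines quoted in this issue, the genome ID is assumed to be the part before the `|`, and the function name `count_skipped_genomes` is made up for illustration.

```python
import re
from collections import Counter

# Pattern copied from the warning lines quoted in this issue.
WARNING_RE = re.compile(
    r"I could not understand the host of (\S+), "
    r"so I will not use the corresponding line"
)


def count_skipped_genomes(log_lines):
    """Count warning lines per host genome ID.

    Assumption: the genome ID is the part of the hit name before '|',
    e.g. '3300012136_9' in '3300012136_9|3300012136.a:Ga0153985_1000961'.
    """
    counts = Counter()
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            genome_id = match.group(1).split("|", 1)[0]
            counts[genome_id] += 1
    return counts
```

    Passing an open log file to `count_skipped_genomes` and calling `.most_common(10)` on the result would list the genomes that account for most of the warnings.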

    The bottom line is: the results you got are ok. However, given that you seem to have a lot of hits (you mentioned ~ 2000 lines) to host genomes that are filtered out in the custom database, I would recommend running the same virus sequences against the default database as well, and merging the results (in our experience, you can simply take the prediction with the best score across the default and custom databases).
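
    The merge described above could be sketched as follows. This is one possible reading of "take the prediction with the best score", keeping a single top-scoring row per virus. The column names `Virus` and `Confidence score` are assumptions about the genus-level CSV header and should be checked against your own files; the helper names and the added `Database` column are hypothetical.

```python
import csv

# Assumed column names; verify them against the header of your own files.
VIRUS_COL = "Virus"
SCORE_COL = "Confidence score"


def read_predictions(path, source):
    """Read one prediction CSV and tag each row with the database it came from."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        row["Database"] = source  # hypothetical extra column, for traceability
    return rows


def best_prediction_per_virus(*row_sets):
    """Keep, for each virus, the single row with the highest confidence score."""
    best = {}
    for rows in row_sets:
        for row in rows:
            virus = row[VIRUS_COL]
            score = float(row[SCORE_COL])
            if virus not in best or score > float(best[virus][SCORE_COL]):
                best[virus] = row
    return best
```

    For example, `best_prediction_per_virus(read_predictions(default_csv, "default"), read_predictions(custom_csv, "custom"))`, where `default_csv` and `custom_csv` are placeholders for the two genus-level output files. On a tie, the row seen first is kept.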

    I will adjust the documentation to mention this, and then think about a way to avoid the 2,000 lines and make the warning message clearer. Thanks!

  3. Karen Jin reporter

    Hi Simon, thanks for your quick response. That sounds reasonable - I will try running predictions with the default database and merge the results.
