Can't calculate distances using iphop add_to_db

Issue #46 closed
xichuan Zhai created an issue

Hi Simon,

Thanks for providing this powerful tool! I have a similar issue to #20, but I didn’t find any solutions.

Looks like everything is now set up, we will first clean up the input file, and then we will start the host prediction steps themselves
[1/1/Run] Running blastn against genomes...
[1/3/Run] Get relevant blast matches...
[2/1/Run] Running blastn against CRISPR...
[2/2/Run] Get relevant crispr matches...
[3/1/Run] Running (recoded)WIsH...

Welcome to iPHoP

[3/1/Run] Running WIsH extra database...
[3/2/Run] Get relevant WIsH hits...
[4/1/Run] Running VHM s2 similarities...
[4/2/Run] Get relevant VHM hits...
[5/1/Run] Running PHP...
[5/2/Run] Get relevant PHP hits...
[6/1/Run] Running RaFAH...
[6/2/Run] Get relevant RaFAH scores...
[6.5/1/Run] Running Diamond comparison to RaFAH references...
[6.5/2/Run] Get AAI distance to RaFAH refs...
[7] Aggregating all results and formatting for TensorFlow...
[7/1] Loading all parsed data...
[7/2] Loading corresponding host taxonomy...
[7/3] Link matching genomes to representatives and filter out redundant / useless matches...
Filtering blast data
Filtering crispr data
Filtering wish data
Filtering vhm data
Filtering PHP data
[7/4] Write the matrices for TensorFlow...
Starting to built the matrices for TensorFlow
Loading trees
Processing data for virus vOTU_1
Can't find GB_GCA_001516055.1 and/or MAG162 in the trees, so can't calculate distances

GB_GCA_001516055.1 was found in one of the decorated trees, but MAG162 was not. MAG162 does, however, appear in Wish_extra_negFits.csv.

I am using the latest version (iPHoP v1.3.2). Do you have any suggestions for avoiding this kind of problem?

Best regards,

Xichuan

Comments (12)

  1. Simon Roux repo owner

    Hi Xichuan,

    Is MAG162 found in one of the trees? This is the first thing we need to check: in theory, “add_to_db” should only add to the database MAGs that are in one of the two trees (bacteria or archaea).
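
    For reference, a minimal sketch of this check using Biopython, assuming the decorated Newick trees produced by GTDB-Tk’s de_novo_wf (the tree file names and paths below are assumptions, adjust them to your own output):

    # Minimal sketch: check whether a genome name appears as a leaf in the
    # GTDB-Tk decorated trees. File names and paths are assumptions.
    from Bio import Phylo

    trees = {
        "bacteria": "gtdbtk_out/gtdbtk.bac120.decorated.tree",  # assumed path
        "archaea": "gtdbtk_out/gtdbtk.ar53.decorated.tree",     # assumed path
    }

    query = "MAG162"
    for domain, path in trees.items():
        tree = Phylo.read(path, "newick")
        leaf_names = {leaf.name for leaf in tree.get_terminals()}
        print(f"{query} in {domain} tree: {query in leaf_names}")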

    Best,

    Simon

  2. xichuan Zhai reporter

    Hi Simon,

    MAG162 was not found in either tree, but it is among the MAG bins I added to the new database. This may be because the de_novo_wf workflow of GTDB-Tk missed something that I cannot trace back, even though no error occurred during tree generation.

    Best regards,

    Xichuan

  3. Simon Roux repo owner

    That is weird: if it’s not found in the trees, it should not have been added to the WIsH database, so I will have to look into this. In the meantime, you should be able to do the following:

    • In the new host database folder, look into “db_infos/Host_Genomes.tsv” and remove all the lines that correspond to MAGs that are not in the trees (you can make a copy of the file beforehand, just in case). The key point is: if a genome is not listed in this file, iPHoP should not try to include it later on
    • Once this is done, go into the output folder of iPHoP, then into “Wdir”, and remove all the files ending with “…parsed.csv”. Then run iPHoP again (a rough sketch of both steps is shown below)
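
    A rough sketch of both steps in Python (the folder paths, the list of MAGs to drop, and the assumption that the genome name sits in the first column of Host_Genomes.tsv are all placeholders to adapt):

    # Minimal sketch of the two cleanup steps above. Paths and the list of
    # MAGs to remove are assumptions; check the Host_Genomes.tsv header to
    # confirm which column holds the genome name.
    import glob
    import os
    import shutil

    db_dir = "my_custom_iphop_db"     # assumed custom database folder
    out_dir = "iphop_output"          # assumed iPHoP output folder
    missing = {"MAG162"}              # MAGs that are not in the trees

    tsv = os.path.join(db_dir, "db_infos", "Host_Genomes.tsv")
    shutil.copy(tsv, tsv + ".bak")    # keep a copy, just in case
    with open(tsv + ".bak") as fin, open(tsv, "w") as fout:
        for line in fin:
            genome = line.split("\t")[0]   # assuming genome name is column 1
            if genome not in missing:
                fout.write(line)

    # Remove the intermediate "...parsed.csv" files so iPHoP re-parses everything
    for f in glob.glob(os.path.join(out_dir, "Wdir", "*parsed.csv")):
        os.remove(f)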

    Let me know how this goes, as this is not an “expected” use case, but I think it should solve your issue.

  4. xichuan Zhai reporter

    Yes, it can now continue through the following steps, but I got an error at step 8.

    Welcome to iPHoP

    Looks like everything is now set up, we will first clean up the input file, and then we will start the host prediction steps themselves
    [1/1/Skip] Skipping computation of blastn against microbial genomes...
    [1/3/Run] Get relevant blast matches...
    [2/1/Skip] Skipping computation of blastn against CRISPR...
    [2/2/Run] Get relevant crispr matches...
    [3/1/Skip] Skipping computation of WIsH scores...
    [3/2/Run] Get relevant WIsH hits...
    [4/1/Skip] Skipping computation of VHM s2 similarities...
    [4/2/Run] Get relevant VHM hits...
    [5/1/Skip] Skipping computation of PHP scores...
    [5/2/Run] Get relevant PHP hits...
    [6/1/Skip] Skipping RaFAH...
    [6/2/Run] Get relevant RaFAH scores...
    [6.5/1/Skip] Skipping diamond search against RaFAH refs...
    [6.5/2/Run] Get AAI distance to RaFAH refs...
    [7/Skip] We already found all the expected files, we skip...
    [7.5] Aggregating all results and formatting for RF...
    [8] Running the convolution networks...
    [8/1] Loading data as tensors..
    [8/1.1] Getting blast-based scores..
    [8/1.2] Run blast classifier Model_blast_Conv-87 (by batch)..
    Predicting confidence score for all batches of input data [====================================] 100%
    [8/1.2] Run blast classifier Model_blast_RF-39 (by batch)..
    TF Parameter Server distributed training not available (this is expected for the pre-build release).
    [INFO kernel.cc:1153] Loading model from path
    [INFO decision_forest.cc:617] Model loaded with 1000 root(s), 637446 node(s), and 15 input feature(s).
    [INFO abstract_model.cc:1063] Engine "RandomForestOptPred" built
    [INFO kernel.cc:1001] Use fast generic engine
    Traceback (most recent call last):
    File "/home/server/mambaforge/envs/iphop_env/bin/iphop", line 10, in <module>
    sys.exit(cli())
    File "/home/server/mambaforge/envs/iphop_env/lib/python3.8/site-packages/iphop/iphop.py", line 128, in cli
    args"func"
    File "/home/server/mambaforge/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/master_predict.py", line 106, in main
    runmodels.run_individual_models(args)
    File "/home/server/mambaforge/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/runmodels.py", line 55, in run_individual_models
    full_predicted = run_single_classifier_rf(classifier,args["matrix_blast_rf"],args,full_matrix_labels)
    File "/home/server/mambaforge/envs/iphop_env/lib/python3.8/site-packages/iphop/modules/runmodels.py", line 265, in run_single_classifier_rf
    batch_predicted = best_model.predict(tfdf.keras.pd_dataframe_to_tf_dataset(feature_matrix))
    File "/home/server/mambaforge/envs/iphop_env/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
    File "/home/server/mambaforge/envs/iphop_env/lib/python3.8/site-packages/keras/engine/training.py", line 1804, in predict
    raise ValueError('Unexpected result of predict_function '
    ValueError: Unexpected result of predict_function (Empty batch_outputs). Please use Model.compile(..., run_eagerly=True), or tf.config.run_functions_eagerly(True) for more information of where went wrong, or file a issue/bug to tf.keras.

  5. xichuan Zhai reporter

    PS: I ran the same virome fasta file using the original database without any error, but the error occurred when I used the database to which I added the MAGs. The error is the same as in Issue #10, but I still have no clue how to solve it.

  6. Simon Roux repo owner

    I don’t think it’s the same issue as #10, but if it is, the good news is that it is fairly easy to solve (you just need to add the sequences from https://bitbucket.org/srouxjgi/iphop/raw/d27b6bbdcd39a6a1cb8407c44ccbcc800d2b4f78/test/test_input_phages.fna to your input file). What I would try is to run iPHoP with your custom database (the one where you added your MAGs and removed the lines that should not be there from “Host_genomes.tsv”) in a completely new output folder, and with these additional sequences included in the input. If this does not work, then I would probably need to look at the output directory, because I would then suspect you stumbled upon an unexpected edge case (sorry!)
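
    In case it helps, a minimal sketch of adding these test sequences to an existing input file (the input and output file names are assumptions):

    # Minimal sketch: download the test phage sequences and append them to a
    # copy of an existing input fasta. Input/output file names are assumptions.
    import urllib.request

    url = ("https://bitbucket.org/srouxjgi/iphop/raw/"
           "d27b6bbdcd39a6a1cb8407c44ccbcc800d2b4f78/test/test_input_phages.fna")
    test_phages = urllib.request.urlopen(url).read().decode()

    with open("my_viruses_plus_test_phages.fna", "w") as fout:
        with open("my_viruses.fna") as fin:   # assumed original input file
            fout.write(fin.read())
        fout.write(test_phages)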

  7. xichuan Zhai reporter

    To be honest, I don’t think that will work, because I ran the same virome fasta file using the original database without any error.

  8. Simon Roux repo owner

    Technically, the original database contains some genomes that are no longer present in your custom database after you add your own MAGs (that is why we recommend running iPHoP with both the original and the custom databases). But it really depends on your input data and the number of hits you got. There is also the possibility that trying to run in an existing output folder causes unforeseen issues, so running against the custom database in a “clean” output folder may help (or would at least be worth trying).
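
    For what it’s worth, a minimal sketch of running both predictions into fresh output folders (the “iphop predict” flags follow the README; the database and input paths are assumptions):

    # Minimal sketch: run iPHoP against both the original and the custom
    # database, each into a fresh ("clean") output folder. Flag names follow
    # the iPHoP README; all paths are assumptions.
    import subprocess

    runs = {
        "iphop_output_original_db": "iPHoP_db",          # assumed original db
        "iphop_output_custom_db": "my_custom_iphop_db",  # assumed custom db
    }

    for out_dir, db_dir in runs.items():
        subprocess.run(
            ["iphop", "predict",
             "--fa_file", "my_viruses_plus_test_phages.fna",
             "--db_dir", db_dir,
             "--out_dir", out_dir],
            check=True,
        )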

  9. xichuan Zhai reporter

    Hi Simon,

    I have finished all the steps by doing the following things:

    1. adding the sequences you suggested
    2. removing from “db_infos/Host_Genomes.tsv” all the lines that correspond to MAGs that are not in the trees (there is actually no taxonomy for these MAGs)
    3. running against the custom database in a “clean” output folder.

    Thanks a lot for your help.

    Best regards,

    Xichuan

  10. Simon Roux repo owner

    Great, thanks for the update. Sorry this was so complicated, but glad that it worked out in the end!
