Implementation of stopped pipeline: "iphop predict"

Issue #95 closed
Hyeongwon Lee created an issue

Hi Simon. Thank you for your great work; I was amazed by your approach.

I have questions about the dataset, add_new_genome, and the code implementation, plus one mention.

You updated the database in iPHoP v1.3.3 and described the genome sources as follows:
"Host genomes extracted from GTDB r214, IMG published genomes as of Aug. 2023, MGnify MAG collections, and GEMv1. This will be the default database starting with the release of iPHoP version 1.3.3."

Question: dataset

I am confused: are all IMG and MGnify MAG collections integrated into the iPHoP database, or were only some of them used?

In your original paper, you built the database from GTDB species representative genomes + all GEM genomes + IMG.
MGnify currently holds about 400k MAGs; did you integrate all of them? It would be helpful if you could clarify this.

Question: add_new_genome
When I want to add some genomes of interest, do you recommend adding all of the genomes, or only species representative genomes or dereplicated genomes?

Question: code implementation
Due to an error on my computer, the 'iphop predict' step was terminated abruptly.
To save computing resources, I looked into the script "$CONDA/envs/iphop_new/lib/python3.8/site-packages/iphop/modules/master_predict.py".

I noticed that line 193 of this script holds what I guess are controllable run-state flags:

    ## List of potential tools (get changed to 1 if computed, 2 if parsed)
    args['list_tools'] = {}
    args['list_tools']['blast'] = 0
    args['list_tools']['crispr'] = 0
    args['list_tools']['wish'] = 0
    args['list_tools']['vhm'] = 0
    args['list_tools']['php'] = 0
    args['list_tools']['rafah'] = 0
    args['list_tools_rf'] = {}
    args['list_tools_rf']['blast'] = 0
    args['list_tools_rf']['crispr'] = 0

If I would like to resume from the terminated step, do I just need to set the finished programs to 1?
Or could you tell me another way to rerun the pipeline from the interrupted step?
I am asking because my number of contigs is quite large.
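For context, the flag scheme quoted above looks like a standard status-flag checkpoint pattern. The following is a minimal sketch of how such a resume loop could work, not the actual iPHoP code; the runner/parser callables are hypothetical stand-ins:

```python
# Minimal sketch of a status-flag checkpoint scheme like the one quoted above:
# 0 = not started, 1 = computed, 2 = parsed. Tool names mirror the dict keys
# in master_predict.py; the runner/parser functions are hypothetical stand-ins.

def resume(list_tools, runners, parsers):
    """Run only the steps whose status flag is still below the target state."""
    for tool, status in list_tools.items():
        if status < 1:
            runners[tool]()          # compute raw results for this tool
            list_tools[tool] = 1
        if list_tools[tool] < 2:
            parsers[tool]()          # parse raw results into tables
            list_tools[tool] = 2
    return list_tools

# Example: blast already computed and parsed, crispr not started
tools = {"blast": 2, "crispr": 0}
log = []
done = resume(
    tools,
    runners={"blast": lambda: log.append("run blast"),
             "crispr": lambda: log.append("run crispr")},
    parsers={"blast": lambda: log.append("parse blast"),
             "crispr": lambda: log.append("parse crispr")},
)
# log == ["run crispr", "parse crispr"]; done == {"blast": 2, "crispr": 2}
```

Under this pattern, flipping a finished tool's flag to 1 (or 2) would indeed make the loop skip its compute step, but whether iPHoP persists these flags between runs is a separate question answered below.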

Mention: potential error point in add_to_db

One more thing I would like to mention concerns the GTDB archaeal file.
If I remember correctly, while using iPHoP v1.3.3 (both the conda and GitHub versions), master_add_to_db.py raised an error complaining something like "there is no gtdbtk.ar122.decorated.tree". After I manually changed ar122 to ar53 in the script, the add_to_db command ran successfully. (I think it would be helpful if you could look into it.)
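A defensive fix for this kind of renamed-file issue could look like the sketch below. This is only an illustration, not the actual iPHoP code; the database layout and the helper name are assumptions:

```python
from pathlib import Path

def find_decorated_tree(db_dir):
    """Return the archaeal decorated tree, whatever GTDB-Tk naming it uses.

    GTDB renamed its archaeal marker set from ar122 to ar53, so we try
    both file names instead of hard-coding one (hypothetical helper).
    """
    for name in ("gtdbtk.ar53.decorated.tree", "gtdbtk.ar122.decorated.tree"):
        candidate = Path(db_dir) / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"No archaeal decorated tree found in {db_dir}")
```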

I’ll look forward to your answer. Thank you very much in advance.

Comments (5)

  1. Simon Roux repo owner

    Hi, and thanks for your interest in iPHoP. I'll start with the very last mention: you are correct, this is a bug in the code that should be fixed in the next release. Now for some answers to your other questions:
    - Question: dataset
    The database is built from GTDB representative genomes (sorry for the confusion), so it does not include the 400k MAGs.
    - Question: add_new_genome
    For efficiency (and time) purposes, I would recommend only including dereplicated MAGs. We saw very little benefit from including multiple near-identical genomes.
    - Question: code implementation
    You should not have to modify the code in any way: you should be able to simply relaunch iphop with the same command line and the same output directory, and iPHoP should pick up where it left off. There is always the possibility that it stopped at the wrong time (e.g. mid-way through writing a file), so if you get an error when relaunching, you can go into the "Wdir" folder of the output and remove the last file(s) generated before relaunching. So, to be safe, what I would do is: (i) make a copy of the current (partial) output folder, (ii) try to simply relaunch iPHoP on the original output folder and see if it completes, and (iii) if it does not, restore the original output folder, remove the files linked to the last step it was trying to perform, and relaunch.
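    The (i)-(iii) recipe above can be scripted along these lines. This is only a sketch under assumptions: `safe_relaunch` is a hypothetical helper, and the actual iphop command line and output paths are up to the user:

    ```python
    # Sketch of the "backup, relaunch, restore on failure" recipe described
    # above. Paths and the iphop command line are examples, not real defaults.
    import shutil
    import subprocess
    from pathlib import Path

    def safe_relaunch(out_dir, iphop_cmd):
        out_dir = Path(out_dir)
        backup = out_dir.with_name(out_dir.name + "_backup")

        # (i) keep a copy of the partial output before touching anything
        shutil.copytree(out_dir, backup, dirs_exist_ok=True)

        # (ii) try to simply relaunch with the same command and output dir
        result = subprocess.run(iphop_cmd)
        if result.returncode == 0:
            return True

        # (iii) on failure: restore the backup; then manually remove the
        # last file(s) written in out_dir/Wdir before relaunching again
        shutil.rmtree(out_dir)
        shutil.copytree(backup, out_dir)
        return False
    ```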

    Let me know if you have any issues, especially with the code restart!

    Best,

    Simon

  2. Hyeongwon Lee reporter

    Thank you very much for your detailed and prompt answer. I’ll consider the information for our analysis.
    I have one more question about this tool. iPHoP aims to provide accurate phage host predictions at the genus level.
    In the paper, species-level prediction has a higher error rate, as I understand it.

    Is it simply unreliable to predict host taxonomy at the species level, or do you have any recommendations for species-level host detection?
    If you have any opinion about this, please share it.

    For the program rerun I asked about: following your recommendation, the program automatically found the previously calculated data and restarted without any problem.

    For the dataset question I asked, I found the detailed information on all genomes in the db_infos folder:
    GTDB rep genomes + IMG genomes.
    Thank you.

    Best regards,

  3. Simon Roux repo owner

    Glad that the restart function of iPHoP worked as expected! And also happy to hear that you found the list of genomes you needed in the db_infos folder; let me know if you have any questions about this.

    As for predicting hosts at the species level, I do believe this is still very error-prone at this time, which is why iPHoP only predicts hosts at the genus level. You may look into the specific signals used to try to get species-level predictions: for instance, if you find a near-identical (~99-100%) integrated prophage via blast (the "blast" method in iPHoP), this could likely be used to link the corresponding virus to that host species. But this will need to be done manually, as iPHoP was not designed for this purpose.
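    The manual filtering described above could start from something like the sketch below. The column names (`virus`, `host`, `identity`) are assumptions for illustration; adapt them to the actual layout of iPHoP's blast output:

    ```python
    import csv

    def near_identical_hits(blast_csv, min_identity=99.0):
        """Keep virus-host blast hits at ~99-100% identity, the range
        suggested above as potentially informative at the species level.

        Assumes a CSV with 'virus', 'host', and 'identity' columns --
        hypothetical names, to be adapted to the real iPHoP output.
        """
        hits = []
        with open(blast_csv, newline="") as fh:
            for row in csv.DictReader(fh):
                if float(row["identity"]) >= min_identity:
                    hits.append((row["virus"], row["host"],
                                 float(row["identity"])))
        return hits
    ```

    Any hit that survives this filter would still need manual inspection (e.g. checking that the aligned region is an integrated prophage, not a shared mobile element) before calling a species-level link.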

  4. Hyeongwon Lee reporter

    Thank you for your kind answer.
    Considering your comment and the research papers, it seems hard at present to correctly link a virus to its host at the species level based on bulk metagenome data alone.

    I will manually analyze the iPHoP blast results, keeping the error rate in mind.
    My issues are resolved now. Thank you!

    Best regards,

  5. Simon Roux repo owner

    The issue should be fixed now, but please re-open this issue (or open a new one) if you have any problems in the future.
