KeyError: 'GB_GCA_002163135.1_vs_RS_GCF_014196925.1'

Issue #86 closed
psampara created an issue

Hi Simon,

I am facing an issue with a debug run on v1.3.3 with a custom database prepared by adding MAGs (>70% complete and <10% redundant) to the database “Aug_2023_pub_rw”. When I ran the test data after adding the same MAGs, the script ran to completion. However, with the full dataset, the step after “6.5/2” fails with the message “KeyError: 'GB_GCA_002163135.1_vs_RS_GCF_014196925.1'”. These references are in the GTDB tree, so I am unsure how to resolve the key error. I looked at similar issues posted previously and am still unable to resolve it.

Could you also share your thoughts on interpreting the output of the script? In particular, is it a good idea to consider only the MAGs that meet a p-value cutoff as hosts, and to ignore reference genomes? If I understand correctly, using the full database allows for better false detection rate adjustments and a broader diversity. But since the confidence level is corrected for multiple testing when the full database is used, would it be reasonable to consider only those MAGs that pass a p-value threshold as hosts? If so, is there a reference p-value that could be used?

Please find attached the scripts I ran and the outputs for both the database creation and the “predict” steps.

Thanks in advance!

Pranav

Comments (7)

  1. Simon Roux repo owner

    Hi Pranav,

    You’re right, this doesn’t look like the typical error (which usually occurs because some custom MAGs are not included in the tree). What I suspect is happening is that you are using the new database with an existing output directory, which I believe may have been generated with the original database? I could see how this would lead to this error, so if that is the case, the recommendation would be to run iPHoP with the custom MAG database and a new output folder.
    For the second question, I would not recommend ignoring references and focusing on MAGs. This is because iPHoP is not meant to link individual viruses to individual strains/genomes, but is designed to predict the most likely taxon of the real host(s) (ideally at the genus level, or at the family level for lower scores). Adding MAGs definitely helps with this, but essentially we still need the references :-)

    Let me know if the fix worked for the custom db!

    Best,

    Simon
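
The "new output folder" recommendation above can be guarded in a wrapper script. The following is a minimal sketch, not part of iPHoP: `fresh_output_dir` is a hypothetical helper that refuses to reuse a directory that already contains files, so results computed with an earlier database cannot be picked up by accident.

```python
from pathlib import Path


def fresh_output_dir(path):
    """Create `path` and return it, refusing to reuse a non-empty directory.

    Hypothetical helper for illustration only; iPHoP itself does not provide it.
    """
    out = Path(path)
    # An existing directory with any content may hold results from an older database.
    if out.exists() and any(out.iterdir()):
        raise FileExistsError(f"{out} is not empty; choose a new output folder")
    out.mkdir(parents=True, exist_ok=True)
    return out
```

The returned path can then be passed to the prediction run as its output directory.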

  2. psampara reporter

    Hi Simon,

    Thanks for the advice on using genus-level predictions and including reference genomes. Since you suggested here, and the paper also mentions, that iPHoP was designed for host prediction at the genus rank, I wanted to confirm whether a genome can be directly inferred as the host, or whether the associations should be made at the genus level (or family level at low score).

    My intention is to infer hosts if possible. Specifically, I wanted to confirm if I can utilize the results in the file “Host_prediction_to_genome_m90.csv” to pick the best “genome” with the highest confidence score, rather than the “genus” level associations from the file “Host_prediction_to_genus_m90.csv”.

    Thanks for the help!

    Pranav

  3. Simon Roux repo owner

    Hi Pranav,

    My recommendation would be to make the associations at the genus level (or family level at low score). This is because the model was trained to predict the correct genus, i.e. the highest score should correspond to the correct genus, but not necessarily to the correct genome. In other words, we cannot guarantee that the correct host species or genomes have the highest score.
    Best,

    Simon
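
As an illustration of the genus-level recommendation above, here is a minimal sketch that keeps the highest-scoring genus for each virus. It is not part of iPHoP. The column names `Virus`, `Host genus` and `Confidence score` are assumptions about the layout of “Host_prediction_to_genus_m90.csv” and should be checked against the actual file; `min_score` is a hypothetical parameter, and the example rows are invented.

```python
import csv
import io


def best_genus_per_virus(rows, min_score=0.0):
    """Return {virus: (genus, score)} keeping the top-scoring genus per virus.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    Column names are assumed, not confirmed by the thread.
    """
    best = {}
    for row in rows:
        score = float(row["Confidence score"])
        if score < min_score:
            continue  # skip predictions below the chosen cutoff
        virus = row["Virus"]
        if virus not in best or score > best[virus][1]:
            best[virus] = (row["Host genus"], score)
    return best


# Invented example data standing in for the genus-level output file.
example = io.StringIO(
    "Virus,Host genus,Confidence score\n"
    "virus_X,g__Genus_A,95.2\n"
    "virus_X,g__Genus_B,91.0\n"
    "virus_Y,g__Genus_C,93.4\n"
)
print(best_genus_per_virus(csv.DictReader(example)))
```

For a real run, the `io.StringIO` object would be replaced by an open handle on the genus-level CSV.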

  4. psampara reporter

    Thank you for the clarification, Simon! I am curious what the best use case for the file "Host_prediction_to_genome_m90.csv" would be, given that I should base any association on the "Host_prediction_to_genus_m90.csv" file.

    Thanks again!

    Pranav

  5. Simon Roux repo owner

    So the reason why the file “Host_prediction_to_genome_m90.csv” is provided is for users to be able to see which genomes are at the origin of the prediction. For instance, if virus X is predicted to be associated with genus Y, then you may be curious to know which genomes specifically yielded the corresponding prediction, whether this was a single genome or multiple genomes in this genus, whether this genome is an isolate or a MAG, etc. So this is really more to provide a way to trace back where the prediction originated from than to provide a reliable species/genome-level prediction.
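
The trace-back use described above could be sketched as follows. This is not part of iPHoP: `genomes_behind_prediction` is a hypothetical helper, and the column names `Virus`, `Host genome`, `Host taxonomy` and `Confidence score`, as well as the semicolon-separated GTDB-style taxonomy string, are assumptions about “Host_prediction_to_genome_m90.csv” that should be checked against the actual file.

```python
def genomes_behind_prediction(genome_rows, virus, genus):
    """List (genome, score) pairs for `virus` whose lineage includes `genus`.

    `genome_rows` is any iterable of dicts, e.g. a csv.DictReader over the
    genome-level file. Column names and taxonomy format are assumed.
    """
    hits = []
    for row in genome_rows:
        if row["Virus"] != virus:
            continue
        # Assumed format: "d__...;p__...;...;g__Genus;s__Species"
        lineage = row["Host taxonomy"].split(";")
        if genus in lineage:
            hits.append((row["Host genome"], float(row["Confidence score"])))
    # Highest score first, only to make the listing easier to read.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The result shows whether one genome or several genomes in the genus contributed to the prediction; in line with the advice in this thread, the ordering should not be read as a genome-level host call.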
