Detailed_output_by_tool.csv

Issue #94 closed
Apoorva Prabhu created an issue

Hi,

I’m still looking into CRISPR spacers in archaeaviruses, and have a quick question about the Detailed_output_by_tool.csv. I looked at this file, and I do have some hits, i.e. up to 4 mismatches are observed, along with a score1/2 and a rank. My questions are -

a) Why do these hits not show up in the Host prediction to genome/genus files, even though I have x viruses in the Detailed output by tool file?

b) Does a lower rank mean a better prediction, i.e. is 1 better than 4?

c) If a CRISPR match was found, how is the score calculated? By my understanding, the score in the detailed output CSV is given as number of mismatches / length of CRISPR spacer, but from the example outputs provided in the tool description, it appears to be calculated as >= 90.

Thanks, and apologies if these questions are trivial!

Apoorva

Comments (4)

  1. Simon Roux repo owner

    Hi !
    These are good questions :-) The easiest to answer are b) and c):
    b) Yes, predictions are roughly ranked based on “confidence”, although there can be ties (i.e. 1 is either better than or equivalent to 4).
    c) In “Detailed output”, you get the “raw” scores for each tool, which in the case of CRISPR is the number of mismatches and the length of the spacer. iPHoP has internal models that consider all CRISPR hits for a given virus and assign a score. This score is based partially on this number of mismatches and length of the spacer, but there is not a single formula that allows you to calculate one from the other.

    For a), what likely happens is that iPHoP considered these CRISPR hits to be not informative / not reliable for host prediction. CRISPR hits with multiple mismatches (more than 1) are often not reliable by themselves, and the only reason iPHoP considers them is in relation to all other signals (blast hit, VirHostMatcher score, WIsH score, RaFAH, etc). Said otherwise: if VirHostMatcher, WIsH, and RaFAH point to the same host, and the virus also has several CRISPR hits with 4 mismatches all pointing to the same host as well, then iPHoP will consider these hits to increase its overall confidence in the host prediction. If on the other hand, iPHoP sees a few CRISPR hits with 4 mismatches pointing to different hosts, and other tools (WIsH, etc) pointing to other hosts entirely, then it will not report these as host predictions in the “Host prediction to genome/genus” files.

    Hope this helps a little, let me know if you have any other questions !
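
The reasoning above (multi-mismatch CRISPR hits only count when they agree with each other and with the other tools) can be illustrated with a small Python sketch. This is not iPHoP's internal model, which is not described in this thread; the function name `crispr_support`, the input layout, and the host names are all hypothetical and only mimic the two scenarios described.

```python
from collections import Counter


def crispr_support(crispr_hits, other_tool_hosts):
    """Summarize whether CRISPR hits agree with the hosts suggested by other tools.

    crispr_hits: list of (host, n_mismatches) tuples (hypothetical layout).
    other_tool_hosts: list of hosts suggested by other tools (hypothetical layout).
    Returns None if either list is empty.
    """
    crispr_counts = Counter(host for host, _ in crispr_hits)
    other_counts = Counter(other_tool_hosts)
    if not crispr_counts or not other_counts:
        return None
    # Most frequent host on each side; ties resolve to the first one seen.
    top_crispr, n_crispr = crispr_counts.most_common(1)[0]
    top_other, _ = other_counts.most_common(1)[0]
    return {
        "crispr_host": top_crispr,
        "other_host": top_other,
        # Fraction of CRISPR hits that point to the most frequent CRISPR host.
        "crispr_fraction": n_crispr / len(crispr_hits),
        "concordant": top_crispr == top_other,
    }


# Scenario 1: several 4-mismatch hits and the other tools all point to one host.
consistent = crispr_support(
    [("host_A", 4), ("host_A", 4), ("host_A", 4)],
    ["host_A", "host_A", "host_A"],
)

# Scenario 2: 4-mismatch hits scattered across hosts, other tools point elsewhere.
scattered = crispr_support(
    [("host_A", 4), ("host_B", 4), ("host_C", 4)],
    ["host_D", "host_E"],
)
```

In the first scenario the summary is concordant with all CRISPR hits on one host; in the second it is not, which matches the case where the hits would not be reported as a host prediction.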

  2. Apoorva Prabhu reporter

    Hi Simon,

    Great, thank you for explaining them so well.

    Just one more question - CRISPR scores are derived only from blastn hits of viruses against the CRISPR spacer database, would that be correct? The result links potential spacers in viral hosts to protospacers (regions) on viral contigs - perhaps I could get this info from the blastcrispr.tsv file?

    Not trying to fish in the dark here, just keen to use the output files from iphop predict. Archaeaviruses are so challenging to work with, and it sounds like I could use the CRISPR output perhaps backed by tRNA matching if there are no other signals coming from other tools.

    Thank you!

    Apoorva

  3. Simon Roux repo owner

    That is correct, CRISPR scores are only derived from the blast of input virus sequences against a database of spacers. For your purpose, I would recommend working from the file “crisprparsed.csv”, as it has the list of blast hits to the CRISPR database already nicely filtered (no more than 8 mismatches) and with some metadata about each spacer such as the original genome it was assembled in and the length.
    And yes, I think what you describe is totally something you can do to try to identify archaeal viruses. My main recommendation would be to be careful with (i.e. not trust) hits that have multiple mismatches, unless you have other indications that this virus-host pair is correct. I would also recommend looking into recent papers like https://www.nature.com/articles/s41564-023-01347-5 for how to interpret CRISPR data (in some ecosystems, it seems like even “genuine” CRISPR hits are not always a good predictor of true virus-host interaction).
    Good luck in your search for archaeal viruses !
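
As a rough illustration of the advice above, here is a minimal Python sketch that keeps only CRISPR hits with at most one mismatch. The column names (`virus`, `spacer_id`, `host_genome`, `spacer_length`, `n_mismatches`) and the example rows are made up for this sketch and are not taken from iPHoP; check the header of your own parsed CRISPR file and adjust `mismatch_col` accordingly.

```python
import csv
import io


def filter_hits(handle, max_mismatches=1, mismatch_col="n_mismatches"):
    """Return the rows whose mismatch count is <= max_mismatches.

    handle: any open text handle to a CSV file with a header line.
    mismatch_col: name of the mismatch column (assumed, not iPHoP's actual name).
    """
    reader = csv.DictReader(handle)
    return [row for row in reader if int(row[mismatch_col]) <= max_mismatches]


# Made-up example data standing in for a parsed CRISPR hits file.
example = io.StringIO(
    "virus,spacer_id,host_genome,spacer_length,n_mismatches\n"
    "virus_1,spacer_a,genome_X,36,0\n"
    "virus_1,spacer_b,genome_Y,34,4\n"
    "virus_2,spacer_c,genome_X,30,1\n"
)

strict_hits = filter_hits(example)
for hit in strict_hits:
    print(hit["virus"], hit["host_genome"], hit["n_mismatches"])
```

To run this on a real file, replace the `io.StringIO` example with `open(path, newline="")`. Hits with more mismatches are not necessarily wrong, but per the comment above they would need support from other signals before being trusted.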
