Output file for CRISPR

Issue #69 closed
Apoorva Prabhu created an issue

Hi Simon,

Thanks for the great tool. I have now successfully run it with an added custom database of MAGs recovered in my study.

I wanted to look at the individual files to check a) whether any CRISPR spacers were found in the archaeal genomes and b) the blast results between the identified viruses and the CRISPR spacers found in those genomes. I had a look at blastcrispr.tsv and blastgenomes.tsv, but I could not fully understand each of the columns because there is no header information.

Would you have this information somewhere in the readme or anywhere else?

Thanks,

Apoorva

Comments (5)

  1. Simon Roux repo owner

    Hi Apoorva,
    Right, for simplicity the blast result does not include headers. You can check the format we use in the file “blastcrispr.tsv.cmd”, which includes the command line used to run this blast. Specifically, we use -outfmt '6 qseqid sseqid slen sstart send evalue qseq sseq qlen qstart qend bitscore score sstrand nident positive', which means the columns are:

    • Query id
    • Subject id
    • Subject length
    • Subject start coord
    • Subject end coord
    • evalue
    • Query sequence aligned
    • Subject sequence aligned
    • Query length
    • Query start coord
    • Query end coord
    • Bitscore
    • Score
    • Strand
    • Number of identical matches in the alignment
    • Number of positive matches in the alignment

    You can find more information on each column at https://www.metagenomics.wiki/tools/blast/blastn-output-format-6

    To know whether CRISPR spacers were found in the archaeal genomes, you should be able to look at the genome ID in column #2 (the subject id).
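
The column list above can be attached to the headerless file when reading it. The following is a minimal sketch, not part of the tool itself: the function names `read_blast_table` and `hits_per_subject` are hypothetical, and it assumes blastcrispr.tsv is tab-separated with exactly the 16 columns from the -outfmt string quoted above. The column layout of any other file is not given in this thread, so it should not be assumed to match.

```python
import csv
from collections import Counter

# Column names copied from the -outfmt string quoted in this thread.
BLAST_COLUMNS = (
    "qseqid sseqid slen sstart send evalue qseq sseq "
    "qlen qstart qend bitscore score sstrand nident positive"
).split()


def read_blast_table(path):
    """Yield one dict per hit from a headerless, tab-separated BLAST table."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
        for row in reader:
            yield row


def hits_per_subject(path):
    """Count hits per subject id (column #2)."""
    return Counter(row["sseqid"] for row in read_blast_table(path))
```

For example, `hits_per_subject("blastcrispr.tsv").most_common(10)` would list the ten subject ids with the most hits; all values are read as strings, so numeric columns such as `evalue` need converting before filtering on them.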

    Hope this helps, let me know if you have any other questions,

    Best,

    Simon

  2. Apoorva Prabhu reporter

    Hi Simon,

    Great, thanks. I should have realised it is a blast output.

    Yes, all good. I’m not too sure why, but I see very few CRISPR spacer hits in my dataset. I understand from the paper that a large proportion of bacterial genomes do not have CRISPR arrays, but I was expecting to find them in archaeal genomes (I have a great many viruses predicted with blast and with iPHoP-RF). I am just curious why there are so few CRISPR spacer hits: is a higher cutoff being used?

    Thanks,

    Apoorva

  3. Simon Roux repo owner

    Good question. The number of CRISPR spacer hits depends on several factors: does the host encode one or several CRISPR arrays, were these CRISPR arrays correctly predicted, are they used to defend against the phage/virus you are working with, and has the phage/virus already changed enough to escape this CRISPR array defense? Then you have the additional complexity of proviruses. You mention a lot of predictions with blast and iPHoP-RF, which suggests that you are working with a number of temperate viruses that can integrate into their host genome. In this case, CRISPR targeting is counter-selected, because targeting an integrated element with a CRISPR-Cas system is self-destructive for the cell (see https://www.frontiersin.org/articles/10.3389/fmicb.2019.03078/full; it gets even more complicated if there are some anti-CRISPRs in the mix).

    All that to say: it’s probably not the blast cutoff, but some complicated biology at play.
