Filter SNPs in consensus sequence

Issue #7 resolved
Jakob Nissen created an issue

I’ve mapped some reads from an influenza-infected cell against an influenza reference gene which I know is closely related to the one present in my sample.

Part of the .aln output looks like this:

template:       ATTGTACATTTGGGGGGTTCACCACCCGGGTACGGACAAAGACCAAATCTTCCTGTATGC
                |||||||||||||||_|||||||||||||||_|||||||_|||||||||_||||||||||
query:          ATTGTACATTTGGGGTGTTCACCACCCGGGTGCGGACAAGGACCAAATc-tccTGTATGC

The deletion in the query does not make biologically sense, since it would result in a frameshift mutation, ruining the gene. Furthermore, the deletion is reported in an area where the query sequence is written in lower-case. Does lowercase not mean that there is no coverage at that area, or?.. If so, it seems unreasonable that KMA should report a deletion there.

Comments (4)

  1. ptlcc

    Lower case bases means that the basecall is of low quality, which is usually due to low depth or mixed variants. Unfortunately, there is no standard on how to report an insignificant gap in fasta format.
    You can check both by investigating the position in the assembly matrix, provided with the “-matrix” flag.

  2. Jakob Nissen reporter

    Aha, thank you, I see the issue. There is indeed 4 bases of low coverage (one single read).

    I think that most users who are interested in a consensus sequence would also want to be able to discard any called SNPs not supported by enough reads. For example, if less than 5 reads report a SNP (or deletion), it’s probably better in most cases to default to the reference in the poorly-covered area.

    Of course it might not be the case for all applications. I can parse the matrix file, look at each base, see if it has sufficient coverage, and then recreate a reliable consensus seq. But it might be nice to have this functionality in KMA itself

  3. Log in to comment