Strange bad alignments - poor default scoring scheme?

Dear Philip

Again thank you for developing the KMA aligner, which I now use routinely for generating consensus sequences at my work.

Today, I discovered a strange failure of the aligner which led to a gap. My reference is a single viral segment of around 1500 bp - a few somewhat self-similar patches of DNA, but no actual repeats. My data was short-read (around 25 bp) Illumina data. The base quality was fine.

Looking through the .frag.gz file, I find the following alignments. The number on the left is the starting position of the read mapping position (0-based indexing), as reported by the .frag.gz file. For each position in my list below there are hundred of reads that nearly all agree on the sequence and position (except for the ones aligning at position 1186, where they disagree).

For this mapping I used k=12, since I deemed the default k=16 for such short reads would not work. I find this bad mapping perplexing.

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1184    TCAGAGGGACAAGAGTGGTCCCAAGA

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1185     CAGAGGGACAAGAGTGGTCCCAAGAG

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1186      AGAGGGACAAGAGTGGTCCCAAGAGG <- minority
1186      AGGGACAAGAGTGGTCCCAAGAGG   <- majority (misaligned)

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1187       GAGGGACAAGAGTGGTCCCAAGAGGA

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1188        GGGACAAGAGTGGTCCCAAGAGGACA <- misaligned

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1189         GGACAAGAGTGGTCCCAAGAGGAC <- misaligned
1190          GACAAGAGTGGTCCCAAGAGGACAA <- misaligned

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1192            ACAAGAGTGGTCCCAAGAGGACAAC

I will probably use k=8 from now on, however, this seems to cause the sparse mapping I use to find the best template (ref issue #10) to be less accurate.

I’m not 100% sure this is a bug, but it sure looks wrong to me. Let me know if you need any additional information to reproduce this.

Would you recommend using a larger K (say, k=16) for sparse mapping, and a smaller K (say, k=8) for alignment? If so, how can I use both k’s with the same index?
Is this actually a bug?

‌

Edit: This might be an issue with the default scoring scheme. With gap_open = -3, mismatch = -2 and match = 1, these two alignments

temp: CGACT
quer   CACT

temp: CGACT
quer  C-ACT

will score equally high, namely +1. However, I don’t find it biologically credible to assume an indel is just as likely as an snv. Since you’ve apparently experimented with the parameters for your CGE work and set gap_open to -5, perhaps this would be a better default?

Comments (5)