Strange bad alignments - poor default scoring scheme?

Issue #22 resolved
Jakob Nissen created an issue

Dear Philip

Again thank you for developing the KMA aligner, which I now use routinely for generating consensus sequences at my work.

Today, I discovered a strange failure of the aligner which led to a gap. My reference is a single viral segment of around 1500 bp - a few somewhat self-similar patches of DNA, but no actual repeats. My data was short-read (around 25 bp) Illumina data. The base quality was fine.

Looking through the .frag.gz file, I find the following alignments. The number on the left is the starting position of the read mapping position (0-based indexing), as reported by the .frag.gz file. For each position in my list below there are hundred of reads that nearly all agree on the sequence and position (except for the ones aligning at position 1186, where they disagree).

For this mapping I used k=12, since I deemed the default k=16 for such short reads would not work. I find this bad mapping perplexing.

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1184    TCAGAGGGACAAGAGTGGTCCCAAGA

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1185     CAGAGGGACAAGAGTGGTCCCAAGAG

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1186      AGAGGGACAAGAGTGGTCCCAAGAGG <- minority
1186      AGGGACAAGAGTGGTCCCAAGAGG   <- majority (misaligned)

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1187       GAGGGACAAGAGTGGTCCCAAGAGGA

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1188        GGGACAAGAGTGGTCCCAAGAGGACA <- misaligned

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1189         GGACAAGAGTGGTCCCAAGAGGAC <- misaligned
1190          GACAAGAGTGGTCCCAAGAGGACAA <- misaligned

ref TTCATCAGGGGAACAAGAGTGGTCCCAAGAGGACAACT
1192            ACAAGAGTGGTCCCAAGAGGACAAC

I will probably use k=8 from now on, however, this seems to cause the sparse mapping I use to find the best template (ref issue #10) to be less accurate.

I’m not 100% sure this is a bug, but it sure looks wrong to me. Let me know if you need any additional information to reproduce this.

  • Would you recommend using a larger K (say, k=16) for sparse mapping, and a smaller K (say, k=8) for alignment? If so, how can I use both k’s with the same index?
  • Is this actually a bug?

Edit: This might be an issue with the default scoring scheme. With gap_open = -3, mismatch = -2 and match = 1, these two alignments

temp: CGACT
quer   CACT

temp: CGACT
quer  C-ACT

will score equally high, namely +1. However, I don’t find it biologically credible to assume an indel is just as likely as an snv. Since you’ve apparently experimented with the parameters for your CGE work and set gap_open to -5, perhaps this would be a better default?

Comments (5)

  1. ptlcc

    Hi Jakob

    The default scoring parameters is set relatively loose to better align ONT reads against whole genomes. So I would not be happy to change the default.

    Going for a shorter k when aligning sounds like a good idea, especially for such short sequences.
    If you use the approach from issue #10, with a sparse mapping followed by a -Mt1 alignment, you can get away with one database. You need to index with the k-mer size best for the sparse mapping, and set “-k” to the wanted k-mer size when aligning (e.g. -k 8).

    For super short reads k-mer based methods does however have a disadvantage, and the original BWA-backtrack algorithm (bwa aln) might be better suited.

    Best,
    Philip

  2. Jakob Nissen reporter

    Alright, thanks. I think (hope) k=8 will be sufficient for my use cases, even with 25 bp reads - because my references are highly similar, so it will almost certainly still match a kmer within the read.

    Perhaps you’re right - I assume you’ve evaluated and benchmarked the parameters when developing kma. Still, the default score enables the exchange of one gap opening in exchange to convert even a single mismatch to a match. I would assume that, even when aligning ONT reads with lots of indels, inserting gaps would still cause more than one match, and so a more strict gap open penalty could still be used. I’m just guessing, though.

    Closing this issue as solved.

  3. Log in to comment