KMA could handle indels better

ptlcc

Hi Jakob

It should be possible to get it down to 1-3 iteration, its just a question about how much that needs to be changed. Which might not be possible with the current version.

Does the reads with major indels appear to align, or are they slowly included during the next iterations.
If they do not align in the first round this might be solved by inferring two-piece affine gap costs, like Minimap2 and Minigraph has. This will be an update I have considered for a long time, and can be implemented relatively fast. Ie. as the next project.

If they do align in the first round it would indicate that the consensus generation needs an update, which will take somewhat longer, and require more advanced algorithms to incorporate mismatches between reads in indel regions better.

Best,
Philip

2021-11-09T14:19:34+00:00

Jakob Nissen reporter

Unfortunately, it looks like they map just fine across. In my case the reads contain a 72 bp insertion - I don’t know if that makes it easier than if the reads contain a deletion.

From a cursory skim, it looks like what happens is that, while the majority of the reads agree that there indeed is an insertion, they disagree on:

What base is inserted at any particular location, and
Where exactly the insertion is.

For example, a part of the mat.gz file where the deletion is located may look like this, and is being called as deletions, because the reads disagree so "gap“ wins out in the majority vote.

T       18      566     2       1163    0       37
-       24      787     92      112     0       771
-       752     96      14      10      0       914
-       708     106     22      22      0       928
-       685     98      14      39      0       950
-       44      26      643     120     0       953
-       98      691     6       35      0       956

All the reference bases following the insertion looks like this in the mat.gz file:

-       0       0       2       1       0       1782
-       0       1       0       2       0       1782
-       0       2       0       0       0       1783
A       1658    19      9       58      0       44
-       70      19      2       61      0       1635
-       42      43      1       52      0       1649
-       9       64      8       15      0       1691
-       38      9       1       4       0       1735
-       19      3       1       27      0       1737
-       12      1       1       7       0       1766
-       2       7       2       1       0       1775

That is, the reads disagree where the indel is placed. Some reads vote it is placed after the reference A in the second box above. The scattering of the insertion across the reference also contribute to making "gap“ win out in the majority vote.

The first round of the assembly ends up only adding 5 single-basepair insertions and 1 2-bp insertion, where the real insertion size is approximately 72 bp. The second round is able to add 22 more bp at approximately the same location. The matrix file looks similar to above, except at more positions the non-gap called base just slightly wins out over gap. The majority of the rest of the bases are added at round 3, with the final parts being added at round 4.

This data is from virus isolated in wild birds, so I might be able to send the reference and the reads to you if you have any interest in a test case.

2021-11-11T10:08:21+00:00

Jakob Nissen reporter

I know some tools use multiple sequence alignment (MSA) to align the reads around areas where they significantly differ from the reference. It might be prohibitively expensive with a depth of tens of thousands, but just an idea.

2021-11-11T10:15:31+00:00

Jakob Nissen reporter

Out of curiosity, I have a question about the score model. Affine gap score models have always struck me as a little simplified. Real-life indels of thousands of bases do really occur, and they are not as rare as one might think. If I had a read that had 50 good matching basepairs across a huge gap, I would want that alignment. However, an affine gap score model with a nonzero gap_extendpretty much precludes sequences to align with large gaps.

Naively, it seems easy to just cap the gap penalty at some maximum, or alternatively to include yet another, still lower extend score used above a length of say, 20. Is there any reason most score models of aligners don’t do this?

2021-11-11T10:21:55+00:00

ptlcc

The problem with MSA is as you suggest, it is too expensive. I will try to come up with a better solution.

The score model you mention is what is solved by the 2-pice affine gap cost models. Which will allow you set a “normal” gap cost for small gaps, and infer a large gap-openning cost with no extend costs for the longer gaps. It is a little slower to run and requires somewhat more code. The reason for it not being implemented more often is probably because we only “recently” got long reads that could span these longer indels, and thus called for a solution. But in reality it has existed for a while as you describe, with Illumina 250bp and IonTorrent reads.

2021-11-11T12:54:54+00:00

Comments (5)