KMA could handle indels better

Issue #47 new
Jakob Nissen created an issue

Okay so I’m not sure it’s even feasible to make this better. But I’ve observed that KMA doesn’t handle indels very well.

If the reference has an indel of say, 50 bp compared to the reads, then the full 50 bp will not be filled in. Instead, 10-ish basepairs are filled in. In order to fill in larger indels, one must iteratively re-assemble again and again, often using 5-10 rounds.

Are there any settings I could set to improve this? I’m currently running with: -mp 20 -bc 0.7 -mrs 0.3 -bcNano for nanopore and -mrs 0.3 -gapopen -5 for Illumina.

My reads are long, 2x250 bp for Illumina and 500 bp for Nanopore, so it should be possible to align over the gap / align in the non-inserted part.

Is there any way either KMA or my workflow could be improved to allow this? I quite often see indels compared to my references.

Comments (5)

  1. ptlcc

    Hi Jakob

    It should be possible to get it down to 1-3 iteration, its just a question about how much that needs to be changed. Which might not be possible with the current version.

    Does the reads with major indels appear to align, or are they slowly included during the next iterations.
    If they do not align in the first round this might be solved by inferring two-piece affine gap costs, like Minimap2 and Minigraph has. This will be an update I have considered for a long time, and can be implemented relatively fast. Ie. as the next project.

    If they do align in the first round it would indicate that the consensus generation needs an update, which will take somewhat longer, and require more advanced algorithms to incorporate mismatches between reads in indel regions better.

    Best,
    Philip

  2. Jakob Nissen reporter

    Unfortunately, it looks like they map just fine across. In my case the reads contain a 72 bp insertion - I don’t know if that makes it easier than if the reads contain a deletion.

    From a cursory skim, it looks like what happens is that, while the majority of the reads agree that there indeed is an insertion, they disagree on:

    • What base is inserted at any particular location, and
    • Where exactly the insertion is.

    For example, a part of the mat.gz file where the deletion is located may look like this, and is being called as deletions, because the reads disagree so "gap“ wins out in the majority vote.

    T       18      566     2       1163    0       37
    -       24      787     92      112     0       771
    -       752     96      14      10      0       914
    -       708     106     22      22      0       928
    -       685     98      14      39      0       950
    -       44      26      643     120     0       953
    -       98      691     6       35      0       956
    

    All the reference bases following the insertion looks like this in the mat.gz file:

    -       0       0       2       1       0       1782
    -       0       1       0       2       0       1782
    -       0       2       0       0       0       1783
    A       1658    19      9       58      0       44
    -       70      19      2       61      0       1635
    -       42      43      1       52      0       1649
    -       9       64      8       15      0       1691
    -       38      9       1       4       0       1735
    -       19      3       1       27      0       1737
    -       12      1       1       7       0       1766
    -       2       7       2       1       0       1775
    

    That is, the reads disagree where the indel is placed. Some reads vote it is placed after the reference A in the second box above. The scattering of the insertion across the reference also contribute to making "gap“ win out in the majority vote.

    The first round of the assembly ends up only adding 5 single-basepair insertions and 1 2-bp insertion, where the real insertion size is approximately 72 bp. The second round is able to add 22 more bp at approximately the same location. The matrix file looks similar to above, except at more positions the non-gap called base just slightly wins out over gap. The majority of the rest of the bases are added at round 3, with the final parts being added at round 4.

    This data is from virus isolated in wild birds, so I might be able to send the reference and the reads to you if you have any interest in a test case.

  3. Jakob Nissen reporter

    I know some tools use multiple sequence alignment (MSA) to align the reads around areas where they significantly differ from the reference. It might be prohibitively expensive with a depth of tens of thousands, but just an idea.

  4. Jakob Nissen reporter

    Out of curiosity, I have a question about the score model. Affine gap score models have always struck me as a little simplified. Real-life indels of thousands of bases do really occur, and they are not as rare as one might think. If I had a read that had 50 good matching basepairs across a huge gap, I would want that alignment. However, an affine gap score model with a nonzero gap_extendpretty much precludes sequences to align with large gaps.

    Naively, it seems easy to just cap the gap penalty at some maximum, or alternatively to include yet another, still lower extend score used above a length of say, 20. Is there any reason most score models of aligners don’t do this?

  5. ptlcc

    The problem with MSA is as you suggest, it is too expensive. I will try to come up with a better solution.

    The score model you mention is what is solved by the 2-pice affine gap cost models. Which will allow you set a “normal” gap cost for small gaps, and infer a large gap-openning cost with no extend costs for the longer gaps. It is a little slower to run and requires somewhat more code. The reason for it not being implemented more often is probably because we only “recently” got long reads that could span these longer indels, and thus called for a solution. But in reality it has existed for a while as you describe, with Illumina 250bp and IonTorrent reads.

  6. Log in to comment