Bug in estimated depth and base calling

Issue #11 resolved
Jakob Nissen created an issue

I’m noticing some of my alignments where the number of bases called at some positions plateaus at 65535, e.g. part of my mat.gz file that looks like this:

A       65535   3       55      2       0       6
G       13      0       65535   8       0       7
T       26      34      14      65535   0       3
-       0       0       26      0       0       37
C       10      65535   1       15      0       5
-       0       0       0       2       0       64
G       54      3       65535   6       0       5
G       28      0       65535   6       0       5
A       65535   4       43      4       0       7
A       65535   27      81      47      0       7

65535 is the largest number in a 16-bit unsigned int. That can’t be a conincidence. For my use case this does not matter much, because it’s still easily able to call the right base. But this might cause issues for other datasets, if the correct depth is needed.

Is it possible to change to a 32-bit integer? Depending on the platform, this might actually cause a speed increase, but I don’t know if there will be some memory issues or something.

Edit: BWA reports a depth of around 7400 at these positions. So this is likely a bug in KMA that produces the erroneous value of 65535, it’s not that it maxes out.

Comments (7)

  1. Jakob Nissen reporter

    Actually, I should also mention this anomaly:

    A       65535   4       43      4       0       7
    A       65535   27      81      47      0       7
    -       0       1       0       0       0       154
    C       11      65535   2       9       0       4
    -       0       0       0       26      0       21
    -       0       0       0       26      0       21
    -       0       0       26      0       0       21
    C       7       65535   2       25      0       4
    -       1       0       1       0       0       340
    -       0       0       0       1       0       390
    C       16      65535   2       47      0       6
    A       65504   6       58      7       0       3
    

    Notice the three rows with a depth of around 50 in the middle of rows with depths of 65k+. This cannot be true - alignment with BWA shows that all the rows shown here have a coverage of several thousands reads, which are uniformly distributed (i.e. it’s not like they all stop aligning at row 4 and then start again at row 8 ). The result is that my assembled consensus sequence is called with the three extra bases “ttg”. While I can’t biologically rule out that there are an insertion there, the depth pattern makes me suspicious.

    This must be a mistake in the depth calculation which affects base calling.

  2. Jakob Nissen reporter

    Just to clarify, I’m seeing two distinct errors here, that may or may not be caused by the same bug.

    • Reported depth of near 65535 where the true depth is around 7000, possibly due to underflow of unsigned short of the assembly struct in assembly.h
    • Reported depth of near zero where the true depth is around 7000 at positions that are marked as a deletion in the reference sequence.

  3. ptlcc

    The nucleotide counts is capped at 65535, as depths exceeding this are usually avoided due to sequencing costs. Changing the short unsigned to int would result in twice the memory consumption for this part of the algorithm, which for viruses is not the largest concern but might be for metagenomic and plant analysis. I will keep it in mind to make a solution that does require more memory in the future.

    The other scenario with close to zero depth for insertions in areas of extremely high depth, was caused by an overflow where I had forgotten to cap the counts. It is fixed with the “Issue #11“ commit.

    Best,

    Philip

  4. Jakob Nissen reporter

    I see. You are right that 65535 is more than enough for most applications (including mine!), and that most users would prefer KMA to be more memory efficient. Perhaps in the not-so-distant future, depths significantly above 65k may be obtained in some applications, but it is not too relevant now, and I’m not sure it’s worth adding to KMA.

    I’ve discovered that the true depth indeed is closer to 65535, not 7300 - the 7300 was obtained by samtools mpileup, which caps depth at 8000 (but somehow actually prints lower depths).

    The only remaining issue is the overflow, which you say has already been fixed. I’ll grab the latest commit and rerun my analysis using that.

    Thanks for fixing this!

  5. Log in to comment