Template N's reported as A's

Issue #43 new
Jakob Nissen created an issue

KMA seems to report matches between A’s in the read and N’s in the template, so the N’s may not just be reported as A’s but actually stored as A’s internally.

Minimal example:

/tmp/kma $ kma -v
KMA-1.3.22

/tmp/kma $ cat read.fq
@read
AGTCTGATGTAGCTGAAAATAG
+
FFFFFFFFFFFFFFFFFFFFFF

/tmp/kma $ cat ref.fna
>one
AGTCTGATGTAGCTGANNNNNNNNNNNNNNNNNNNNNTGATCGTA

/tmp/kma $ kma -i read.fq -t_db ref.fna -o testout > /dev/null 2> /dev/null

/tmp/kma $ cat testout.aln
# one
template:   AGTCTGATGTAGCTGAAAAAAAAAAAAAAAAAAAAAATGATCGTA
            |||||||||||||||||||_|________________________
query:      agtctgatgtagctgaaaatag-----------------------
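
The effect in the output above can be sketched outside KMA with a plain position-by-position comparison. This is not KMA's alignment code. `store_n_as_a` and `match_line` are hypothetical helpers written for illustration, and the comparison is ungapped and anchored at position 0, which happens to fit this example.

```python
# Reference and read copied from the minimal example above.
TEMPLATE = "AGTCTGATGTAGCTGANNNNNNNNNNNNNNNNNNNNNTGATCGTA"
READ = "AGTCTGATGTAGCTGAAAATAG"


def store_n_as_a(template):
    """Hypothetical stand-in for the N -> A conversion discussed in this issue."""
    return template.replace("N", "A").replace("n", "a")


def match_line(template, read):
    """Mark each read position '|' if it equals the template base, else '_'."""
    return "".join(
        "|" if t.upper() == r.upper() else "_" for t, r in zip(template, read)
    )


stored = store_n_as_a(TEMPLATE)
line = match_line(stored, READ)

# 0-based positions where the original template had N but a match is reported.
false_matches = [
    i for i, (orig, mark) in enumerate(zip(TEMPLATE, line))
    if orig == "N" and mark == "|"
]

print(line)
print(false_matches)
```

Under these assumptions the match line covers the 22 read positions and agrees with the first 22 columns of `testout.aln`; the positions in `false_matches` are A's in the read that are counted as matches against what were N's in `ref.fna`.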

Comments (3)

  1. ptlcc

    Hi Jakob

    It is stored as A’s internally. We also considered randomising the ambiguous bases at indexing, but the long stretches of A’s had the advantage of avoiding most random matches.
    Ambiguous bases in the query sequence are still kept, and treated as such during alignment.

    Best,
    Philip

  2. Jakob Nissen reporter

    That’s not ideal. I mean, it may be reasonable to only support ACGTU internally for memory issues or whatnot. But silently converting the user’s input to a different sequence is a dangerous trap - what if the read contains a poly-A-tail and it suddenly maps against a stretch of Ns?

    Would it be possible to error in kma index if any input sequence contains ambiguous nucleotides? If these are not supported by KMA, I think most users would much prefer an error instead of KMA operating on different sequences.
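
Until such a check exists in KMA itself, a user-side pre-check on the template FASTA is one possible workaround. The sketch below is not part of KMA; `find_ambiguous` is a hypothetical helper, and it assumes a plain-text FASTA file where anything outside ACGTU (either case) counts as ambiguous.

```python
ALLOWED = set("ACGTUacgtu")


def find_ambiguous(fasta_lines):
    """Yield (record_name, 1-based position, base) for each base outside ACGTU."""
    name = None
    pos = 0
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: take the first word of the header as its name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            pos = 0
            continue
        for base in line:
            pos += 1
            if base not in ALLOWED:
                yield name, pos, base


def check_fasta(path):
    """Return all ambiguous bases found in the FASTA file at `path`."""
    with open(path) as handle:
        return list(find_ambiguous(handle))


# Small in-memory example with three N's in the template.
example = [">one", "AGTCTGATGTAGCTGANNNTGATCGTA"]
hits = list(find_ambiguous(example))
print(hits[0], len(hits))
```

Running `check_fasta("ref.fna")` before indexing and aborting when the returned list is non-empty gives the "error instead of silent conversion" behaviour asked for here, at the cost of one extra pass over the reference.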

  3. ptlcc

    I see your point. I have been looking at different options for including the template N’s during alignment too, but it will take some time to incorporate. In the meantime I can add an error message to the index, or add a description to the README and Manual.

    Best,
    Philip
