Bad sparse mapping

Issue #26 resolved
Jakob Nissen created an issue

I’m doing a workflow where I need to get reliable assemblies based on mapping reads to a collection of highly similar reference sequences. In another issue, I was recommended first doing a -Sparse mapping against the database to pick out the best templates, and then mapping against solely that reference. So far, this has been working well.

However, I discovered one gene which suddenly stopped being able to be assembled. Looking into it, I noticed that the -Sparse mapping suddenly pointed to another reference as being best. But, assembly failed. Assembling to the old reference still works. When counting all kmers present in the reads and in different references, the old reference comes out on top, and the new result from the -Sparse mapping does not fit with the amount of kmers present.

You might need a minimal example - I’ll send you a mail with one. I’m using KMA 1.3.10.

Comments (10)

  1. ptlcc

    Hi Jakob

    If you want to include all k-mers with the sparse mapping, you will need to index with “-Sparse -“. This indexes the reverse complement as well, which is needed for the sparse mapping.

    We have also recently been seeing some spurious results from the sparse mapping. It seems the problem is solved at our end by using the following options:

    • sasm -ef -1t1 -mem_mode

    Instead of the “-Sparse” option. This will give you an output file called *.mapstat with the number of reads matching each template. Choosing the template with the highest number of assigned reads here should give a more robust result, and take similar time to compute.

    But for single genes the prefix has to be shortened, for the latter option it is optional to include a prefix.

    Best,
    Philip

  2. Jakob Nissen reporter

    It appears it still doesn’t work correctly. Even after indexing with `kma index -k 12 -Sparse - -i NA.fna -o kmadb`, the top hit is a poor template (and has much fewer k-mers in common with the reads than other templates by manual comparison, regardless of which complementarity of the kmers you look at).

    I think I have discovered the problem. I had been counting the number of unique kmers that are shared between the reads and a template, not the total number of kmers. It appears that the similarity of some templates are driven by a large number of sequencing artifact kmers, for example TTTTTTTTTTTT , which happens to be present in some templates, see attached plot. I have trimmed the reads, so not sure what to do about that.

    I honestly don’t know what to do about that, but it seems like something worth fixing. One solution could be to count the number of reference positions that a kmer maps to, by using a bit vector per template. Just a suggestion.

    For now, I think I’ll use a regular mapping to get the best template.

  3. Jakob Nissen reporter

    By the way: In the .mapstat file, some templates with numerous hits report a `refCoveredPositions` of zero. I assume that’s a bug. If that was fixed, I could perhaps use that to get the “unique” kmers?

  4. ptlcc

    Hi Jakob

    That option is already implemented for sparse mapping.
    If you set the sparse sorting option to coverage, “-ss c”, it will output the best template as the one with the highest coverage of unique k-mers in each template.

    Default is query coverage, so how many of the k-mers in the input that are matched. The last option are to sort on depth, which will probably give the same result.

    Best,
    Philip

  5. ptlcc

    The `refCoveredPositions` positions will be zero when using the option “-sasm”, as “-sasm” skips alignment and assembly and thus where the templates are covered.

  6. Jakob Nissen reporter

    Well, regarding the -ss c with Sparse mapping, all my rows contain inf in the Template_Coverage column. Here’s the first few rows

    #Template       Num     Score   Expected        Template_length Query_Coverage  Template_Coverage       Depth   tot_query_Coverage      tot_template_Coverage   tot_depth       q_value p_value
    CY085376        54      461903  0       0           0.16             inf             inf            0.16             inf             inf        461903.00       1.0e-26
    MH347185        191     330641  0       0           0.12             inf             inf            0.12             inf             inf        330641.00       1.0e-26
    KC859216        80      212280  0       0           0.07             inf             inf            0.10             inf             inf        212280.00       1.0e-26
    

    About the refCoveredPositions, only a few of the rows are zero when using -sasm, most are nonzero. Here’s an excerpt:

    # refSequence   readCount       fragmentCount   mapScoreSum     refCoveredPositions     refConsensusSum bpTotal depthVariance   nucHighDepthVariance    depthMax        snpSum  in
    sertSum       deletionSum     readCountAln    fragmentCountAln
    CY009630        4       2       0       193     0       208     0.000000        0       0       0       0       0       0       0
    AB600860        2       1       0       72      0       74      0.000000        0       0       0       0       0       0       0
    AB600862        2       1       0       70      0       72      0.000000        0       0       0       0       0       0       0
    AB600863        1       1       0       225     0       246     0.000000        0       0       0       0       0       0       0
    AB517990        14      7       0       644     0       850     0.000000        0       0       0       0       0       0       0
    CY116332        2       1       0       70      0       72      0.000000        0       0       0       0       0       0       0
    CY160207        4       2       0       177     0       190     0.000000        0       0       0       0       0       0       0
    CY158827        28      14      0       996     0       1721    0.000000        0       0       0       0       0       0       0
    KF840462        2       1       0       75      0       78      0.000000        0       0       0       0       0       0       0
    LC430954        1       1       0       42      0       43      0.000000        0       0       0       0       0       0       0
    LC430962        12      6       0       485     0       590     0.000000        0       0       0       0       0       0       0
    LC430970        2       1       0       94      0       98      0.000000        0       0       0       0       0       0       0
    DQ150427        4       2       0       149     0       158     0.000000        0       0       0       0       0       0       0
    AY790307        2       1       0       89      0       92      0.000000        0       0       0       0       0       0       0
    DQ139321        82      45      0       0       0       13062   0.000000        0       0       0       0       0       0       0
    EU004442        44      23      0       1343    0       3677    0.000000        0       0       0       0       0       0       0
    

  7. ptlcc

    I have added an error message to the failed sparse mapping in case the database was not sparse indexed.

    I can see the “refCoveredPositions“ currently is estimated from the depth when the option “sam” is used, I have set this to zero in the new version.

    Best,
    Philip

  8. Jakob Nissen reporter

    Perfect, thank you Philip. To recap, the problem was that I inadvertently didn’t use -Sparse when indexing the database, and I didn’t sort the sparse mapping by template coverage to get the best template.

    Just to be sure - I can use the same index both for sparse mapping and for regular mapping if I add -Sparse to the index operation, without degradation of performance, right?

    Closing this issue as resolved.

  9. Log in to comment