Wrong template length in spa file

Issue #28 resolved
Jakob Nissen created an issue

Dear Philip

I’ve made a sparse mapping (.spa) file with the following content:

#Template  Num  Score     Expected  Template_length  Query_Coverage  Template_Coverage  Depth    tot_query_Coverage  tot_template_Coverage  tot_depth  q_value      p_value
JF275940   44   16017346  352120    4662             22.90           76.03              3435.72  22.90               76.03                  3435.72    14991283.14  1.0e-26
MK339074   100  2596739   966565    4660             3.71            22.09              557.24   10.03               32.63                  1505.64    745786.52    1.0e-26
FJ200425   37   1400499   1021806   4660             2.00            17.75              300.54   10.87               35.02                  1632.21    59202.98     1.0e-26
AB600245   4    1073524   1007506   4538             1.53            14.98              236.56   6.93                27.28                  1067.58    2094.33      1.0e-26

The correct length of my templates is around 2300.

Now, I don’t know the format of the .length.b file, but when reading it in as an array of 32-bit integers, I note that it contains first 8 bytes of header (or something), then 109 integers around 2300 (I have 109 sequences in my template file), then a zero, and then 219 integers around 4600. This must be the source of the error. Deleting and re-creating the index gives the same result.

Comments (2)

  1. ptlcc

    That is because the template length is in k-mers for the sparse mapping, i.e. total number of indexed k-mers in the template.

    The format of *.length.b file is:
    First 4 bytes contain the number of templates in DB (DB_size)
    The next DB_size * 4 bytes are the template lengths in bp.
    The next DB_size * 4 bytes are the template lengths in indexed k-mers.
    The next DB_size * 4 bytes are the template lengths in indexed unique k-mers.

    Template #0 is a special-case template which has lengths=0 under normal circumstances.
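
The layout described above can be sketched as a small parser. This is only an illustration, not code from the tool itself: the function name `parse_length_b` is hypothetical, and the little-endian byte order, the unsigned 32-bit integer type, and the inclusion of template #0 in DB_size are assumptions that the description does not confirm. The demo bytes are synthetic, not taken from a real index.

```python
import struct


def parse_length_b(data: bytes, byteorder: str = "<") -> dict:
    """Split the raw bytes of a *.length.b file into its three length arrays.

    Assumptions (not confirmed by the format description):
    little-endian, unsigned 32-bit integers, DB_size counts template #0.
    """
    # First 4 bytes: number of templates in the DB (DB_size).
    (db_size,) = struct.unpack_from(byteorder + "I", data, 0)

    expected = 4 + 3 * 4 * db_size
    if len(data) < expected:
        raise ValueError(f"need {expected} bytes, got {len(data)}")

    # Three consecutive blocks of DB_size integers each.
    fmt = f"{byteorder}{db_size}I"
    arrays = []
    offset = 4
    for _ in range(3):
        arrays.append(list(struct.unpack_from(fmt, data, offset)))
        offset += 4 * db_size

    return {
        "db_size": db_size,
        "bp": arrays[0],            # template lengths in bp
        "kmers": arrays[1],         # lengths in indexed k-mers
        "unique_kmers": arrays[2],  # lengths in indexed unique k-mers
    }


# Synthetic example: 3 templates, where template #0 has all lengths 0.
demo = struct.pack("<10I", 3, 0, 2300, 2310, 0, 4660, 4538, 0, 4600, 4500)
parsed = parse_length_b(demo)
print(parsed["bp"], parsed["kmers"], parsed["unique_kmers"])
```

For a real index, the bytes could be read with `open(path, "rb").read()` and passed to the same function; if the values look implausible, the byte-order or signedness assumption is the first thing to check.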

    Best,
    Philip

  2. Jakob Nissen reporter

    Ah, figured it out. The length given must be the total number of k-mers in the template, which makes sense since sparse mapping considers k-mers independently. The remaining fields in the .length.b files must be the number of sequences, the value of k, and the number of distinct k-mers per template. In that case it all works out and is correct.

    Closing this issue
