jgi_summarize_bam_contig_depths coverage

Issue #48 resolved
Former user created an issue

Hi,

I'm in the middle of writing a program CoverM which calculates coverage from a BAM file - like jgi_summarize_bam_contig_depths except targeted more broadly (e.g. can calculate per-MAG coverage as well as per-contig). I was hoping that one output of CoverM could be a replicate of the depth output file of jgi_summarize_bam_contig_depths but I am having some trouble, probably because of my misunderstanding what jgi_summarize_bam_contig_depths does exactly.

My current understanding is that it calculates the mean (and variance) of pileup depths. However, I tried it on a BAM file where 2 of 7 contigs are covered, and I expected the mean for both to be 1.2: there are 12 100 bp reads mapping exactly to each of these 1000 bp reference sequences, so the naive mean depth is 12 * 100 / 1000 = 1.2. The BAM file is here: https://github.com/wwood/CoverM/blob/master/tests/data/7seqs.reads_for_seq1_and_seq2.bam

Running it through jgi_summarize_bam_contig_depths from MetaBAT 2.12.1, I don't get 1.2 for each:

contigName  contigLen   totalAvgDepth   7seqs.reads_for_seq1_and_seq2.bam   7seqs.reads_for_seq1_and_seq2.bam-var
genome1~random_sequence_length_11000    11000   0   0   0
genome1~random_sequence_length_11010    11010   0   0   0
genome2~seq1    1000    1.41176 1.41176 1.30493
genome3~random_sequence_length_11001    11001   0   0   0
genome4~random_sequence_length_11002    11002   0   0   0
genome5~seq2    1000    1.24353 1.24353 0.686205
genome6~random_sequence_length_11003    11003   0   0   0

So, what am I not understanding correctly? Or is this a bug? Thanks, ben

Comments (10)

  1. Rob Egan

    Excluding the possibility of a bug, the depth output is adjusted from the raw count by a number of factors to give a more realistic metric, given the nature of the reads, the errors, and the strain variation present in the data and samples.

    1) The edges of a contig are generally excluded from the coverage counts, up to a default of 75 bases or the average read length (--includeEdgeBases, --maxEdgeBases). This is because mappers generally have a difficult time aligning a partial read to a contig when the read would extend off the edge, so the coverage ramps up from 0 to the true coverage in this region.

    2) Reads that map imperfectly are excluded when the %ID of the mapping drops below a threshold (--percentIdentity=97). MetaBAT is designed to resolve strain variation, and a read mapping with low %ID indicates that the read actually came from a different strain/species.

    3) Clips/insertions/deletions/mismatches are excluded from the coverage count; only the read bases that exactly match the reference are counted as coverage. This generally has a small effect, except in the case of long reads from PacBio and Nanopore.
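
    To make this concrete, below is a minimal sketch of how those three adjustments could combine into the per-sample mean and variance for one contig. It is an illustration rather than the actual implementation: the Alignment struct, the toy input, the sample-variance formula, and the hard-coded defaults are all assumptions.

    #include <cstdio>
    #include <vector>

    struct Alignment {
        int start;                 // 0-based reference start of the aligned portion
        std::vector<bool> matches; // per reference position: true only for exact-match bases
        double pctId;              // exactmatches / (exactmatches + subs + ins + del)
    };

    int main() {
        const int contigLen = 1000;
        const int edgeBases = 75;      // cf. --maxEdgeBases / average read length
        const double minPctId = 0.97;  // cf. --percentIdentity=97

        // Toy input: one perfect 100 bp read starting at position 200.
        std::vector<Alignment> alignments = { {200, std::vector<bool>(100, true), 1.0} };

        std::vector<int> depth(contigLen, 0);
        for (const Alignment &aln : alignments) {
            if (aln.pctId < minPctId) continue;             // 2) drop low-identity mappings
            for (size_t i = 0; i < aln.matches.size(); ++i)
                if (aln.matches[i]) depth[aln.start + i]++; // 3) count exact-match bases only
        }

        // 1) ignore the first and last edgeBases positions of the contig.
        double sum = 0, sumSq = 0;
        int n = 0;
        for (int pos = edgeBases; pos < contigLen - edgeBases; ++pos) {
            sum += depth[pos];
            sumSq += (double)depth[pos] * depth[pos];
            ++n;
        }
        double mean = n > 0 ? sum / n : 0.0;
        double var = n > 1 ? (sumSq - sum * mean) / (n - 1) : 0.0;
        std::printf("mean=%g variance=%g\n", mean, var);
        return 0;
    }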

  2. Ben Woodcroft

    Thanks for the quick response, that clears things up - I get the same answers processing the BAM file manually (except for the 6th decimal place of the seq2 variance, and I guess that is of effectively no consequence).

    I accidentally didn't login before opening this issue so cannot close it - please feel free to. ben

  3. Ben Woodcroft

    Oh, one more question actually. Is there some way that you'd like to refer to this algorithm e.g. can we refer to this as "metabat coverage"? Thanks.

  4. Rob Egan

    Thanks Ben. Just call it "MetaBAT adjusted coverage", and I'll add a description of the adjustments to the README on bitbucket.

  5. Ben Woodcroft

    Hi again,

    I've been trying further to replicate the algorithm you helpfully detailed, and now I'm able to get everything to within 2% of the reported mean. I noticed empirically that the 97% threshold means that reads with exactly 97% identity are discarded, which seems to contradict the "drops below a threshold" wording in the README. There's also a typo in "Substituions" there.

    One thing I suspect might be unintentional is this strange result I get, running the program over a bam file with just one mapping:

    @HD VN:1.6  SO:coordinate
    @SQ SN:k141_1289381 LN:244
    NS500333:6:H1124BGXX:1:22209:17475:4754 129 k141_1289381    1   60  11S139M *   1444    0   GCCGTCGGCGTCCGCCACAAATTCGGTGACCACGCTCAGGCAGTCCTGATGTTCCAGCAGGGAAGTCTTGCCCGTGGAAAGCTCTTCCGGCAGCATGGGGAAGTTTCGGATTCCGGTGTAGACGGTCGTCGTTTCCCTGGCGGCATGCTG  AAAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFAFFAFFFFFFFFFFFFFFFFFAAF<FFFFF  NM:i:0  MD:Z:139    MC:Z:146M   AS:i:139    XS:i:62
    

    The output I get is

    contigName  contigLen   totalAvgDepth   bam bam-var
    k141_1289381    244 0.670213    0.680851    0.219629
    

    Notice the totalAvgDepth is different from the bam column; I assumed that they should be the same when there is only one sample. I calculate the bam column as being the correct value: (139 - 75) / (244 - 75 - 75) = 0.6808511
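
    To spell that arithmetic out as a standalone check (the assumption being that the counted bases are simply the overlap of the read's 139 matched bases with the non-edge window):

    #include <cstdio>
    #include <algorithm>

    int main() {
        const int contigLen = 244, edge = 75;
        const int readStart = 1, matchedLen = 139;   // the 11S139M alignment starting at position 1

        // Count only the matched bases inside the non-edge window [edge + 1, contigLen - edge].
        int lo = std::max(readStart, edge + 1);
        int hi = std::min(readStart + matchedLen - 1, contigLen - edge);
        int counted = std::max(0, hi - lo + 1);      // 139 - 75 = 64
        int window  = contigLen - 2 * edge;          // 244 - 150 = 94

        std::printf("%d / %d = %f\n", counted, window, (double)counted / window); // 0.680851
        return 0;
    }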

    Any ideas? In the BAM file I'm using to test this, 1.7% of contigs with non-zero coverage have differing entries, so it is rather rare. Happy to send you that BAM file if it would be of use.

    Thanks, ben

  6. Rob Egan

    Hi Ben,

    Sorry for the late reply... I haven't touched this code in years, but I think this may be an off-by-one error where the code is under-counting the contig-edge bases. I'll take a deeper look later this week, but if I am correct then it would only be reads that map close to the edge whose coverage is under-reported, so a rare and small effect.

    And regarding the %ID filter, I verified that a read fails if getPctId() < failPctId (i.e. 97 / 100.0). Maybe floating point precision is triggering the "exactly 97% is discarded" behavior you observe? i.e. 97% of a 150 base read is 145.5, so it would be scored as either 145 bases == 96.66% or 146 bases == 97.333%. In the %ID calculation, soft- or hard-clipped bases in the CIGAR string are ignored, and the value is calculated as ((double) exactmatches) / (exactmatches + substitutions + insertions + deletions).
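
    As a rough illustration only (not the actual code), the calculation amounts to something like this; deriving the counts from a 100M CIGAR and NM:i:3 with no indels is an assumption that holds for a read like your earlier example:

    #include <cstdio>

    // %ID over the aligned (non-clipped) bases only; the function name is illustrative.
    double getPctIdSketch(int exactMatches, int substitutions, int insertions, int deletions) {
        return ((double) exactMatches) / (exactMatches + substitutions + insertions + deletions);
    }

    int main() {
        // 50S100M with NM:i:3 and no indels -> 97 exact matches, 3 substitutions, clips ignored.
        double pctId = getPctIdSketch(97, 3, 0, 0);
        std::printf("pctId = %.17f, fails (pctId < 0.97): %d\n", pctId, pctId < 0.97); // fails: 0
        return 0;
    }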

    Cheers, Rob

  7. Ben Woodcroft

    Hi Rob,

    My apologies for asking you to look at old code.

    Re the 97% thing, I was observing it with reads that have a 100M alignment but NM:i:3, i.e. exactly 97% identity. Here's a concrete example:

    $ cat tests/data/k141_2005182.offending_read.sam
    @HD VN:1.6  SO:coordinate
    @SQ SN:k141_2005182 LN:225
    NS500333:6:H1124BGXX:2:21307:6303:13731 147 k141_2005182    54  55  50S100M =   74  -80 GCAGAGTTATGTTTCGAGTTCCTCGATGTTCTCCACGAGTTCGTTGATCTCCTCTCTGTGTTCCTAGAGTGCCTCTGCTTGACCGCATCCGCCGGCCACGGTCGCGCAGAGCGTCTGTCCGAGTTCGAAGAGCACCTCTCGGTGTTCCTC  ))))))).)))7F.FF)F<FA.F)F).<.7.A))F7<77FFFF)))7)AF)F)F<..<FF.F..A)FFF.F7.7FA7F.FFF<FFFF.FAFA7FF.FFFF77<<FFFFFFFAFFFAFF)FFFFF<FFFFFFFFFAFFFAF<F<FFAAA<<  NM:i:3  MD:Z:6G8C3C80   MC:Z:5S110M34S  AS:i:85 XS:i:73 XA:Z:k141_2005182,-138,50S88M12S,3;k141_2005182,-1,81S69M,0;
    
    $ cat tests/data/k141_2005182.offending_read.sam |jgi_summarize_bam_contig_depths --outputDepth /dev/stdout /dev/stdin
    Output depth matrix to /dev/stdout
    Output matrix to /dev/stdout
    Opening 1 bams
    Consolidating headers
    Processing bam files
    Thread 0 finished: stdin with 0 reads and 0 readsWellMapped
    Creating depth matrix file: /dev/stdout
    contigName  contigLen   totalAvgDepth   stdin   stdin-var
    k141_2005182    225 0   0   0
    Closing most bam files
    Closing last bam file
    Finished
    

    Thanks, ben

  8. Rob Egan

    Well I learned something from that... 97 / 100.0 is 0.9700000000000000 when computed as a double, but 0.9700000286102295 when that value is stored in a float and upcast back to a double.

    fprintf(stdout, "          int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/100.0), (double) (97/100.0), (float) (97/100.0) == (double) (97/100.0));
    fprintf(stdout, "floats    int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/(float) 100.0), (double) (97/(float) 100.0), (float) (97/(float) 100.0) == (double) (97/(float) 100.0));
    fprintf(stdout, "doubles   int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/(double) 100.0), (double) (97/(double) 100.0), (float) (97/(double) 100.0 ) == (double) (97/(double) 100.0));
    
              int/float=0.9700000286102295, int/double=0.9700000000000000 isEqual=0
    floats    int/float=0.9700000286102295, int/double=0.9700000286102295 isEqual=1
    doubles   int/float=0.9700000286102295, int/double=0.9700000000000000 isEqual=0
    

    To save space in some of the containers I use floats but use doubles for the calculations, and the pctId was being calculated as a double but compared against a float.
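
    A minimal sketch of the mismatch and one possible fix (comparing in a single precision); the variable names are illustrative, apart from failPctId:

    #include <cstdio>

    int main() {
        float failPctId = 97 / 100.0;   // threshold stored as a float: widens back to 0.9700000286...
        double pctId = 97.0 / 100.0;    // per-read identity computed as a double: an exactly-97% read gives 0.97

        bool failsMixed = pctId < failPctId;          // double vs upcast float: 0.97 < 0.97000003 -> fails
        bool failsFloat = (float) pctId < failPctId;  // compare as floats instead -> passes

        std::printf("mixed: %d  float-only: %d\n", failsMixed, failsFloat); // mixed: 1  float-only: 0
        return 0;
    }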

  9. Rob Egan

    And the other bug, where the totalDepth != bamDepth for 1 bam, was an integer rounding error where I was casting a float * int to a long and it was getting truncated, not rounded.
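
    For illustration, roughly what was happening with the numbers from your single-read example, i.e. your 64 counted bases over 94 positions (a standalone reconstruction, not the actual code path):

    #include <cstdio>
    #include <cmath>

    int main() {
        int adjustedLen = 244 - 75 - 75;            // 94 non-edge bases
        float avg = (float) (64.0 / 94.0);          // the per-bam mean depth, stored as a float

        // float * int: with IEEE 754 single precision the product lands just below 64,
        // so a truncating cast loses a base while rounding keeps it.
        long truncated = (long) (avg * adjustedLen);      // 63
        long rounded   = std::lround(avg * adjustedLen);  // 64

        std::printf("truncated: %ld -> %f\n", truncated, (double) truncated / adjustedLen); // 0.670213
        std::printf("rounded:   %ld -> %f\n", rounded,   (double) rounded / adjustedLen);   // 0.680851
        return 0;
    }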

    Thanks for your test edge cases! I'll push these changes to this repo soon.
