jgi_summarize_bam_contig_depths coverage
Hi,
I'm in the middle of writing a program, CoverM, which calculates coverage from a BAM file - like jgi_summarize_bam_contig_depths, except targeted more broadly (e.g. it can calculate per-MAG coverage as well as per-contig coverage). I was hoping that one output of CoverM could replicate the depth output file of jgi_summarize_bam_contig_depths, but I am having some trouble, probably because I misunderstand what jgi_summarize_bam_contig_depths does exactly.
My current understanding is that it calculates the mean (and variance) of pileup depths. However, I tried it on a BAM file where 2 of 7 contigs are covered, expecting the mean of both to be 1.2: there are 12 100bp reads mapping exactly to each of these 1000bp reference sequences. The BAM file is here: https://github.com/wwood/CoverM/blob/master/tests/data/7seqs.reads_for_seq1_and_seq2.bam
Run through jgi_summarize_bam_contig_depths from MetaBAT 2.12.1, I don't get 1.2 for either:
contigName contigLen totalAvgDepth 7seqs.reads_for_seq1_and_seq2.bam 7seqs.reads_for_seq1_and_seq2.bam-var
genome1~random_sequence_length_11000 11000 0 0 0
genome1~random_sequence_length_11010 11010 0 0 0
genome2~seq1 1000 1.41176 1.41176 1.30493
genome3~random_sequence_length_11001 11001 0 0 0
genome4~random_sequence_length_11002 11002 0 0 0
genome5~seq2 1000 1.24353 1.24353 0.686205
genome6~random_sequence_length_11003 11003 0 0 0
So, what am I not understanding correctly? Or is this a bug? Thanks, ben
Comments (10)
-
-
Thanks for the quick response, that clears things up - I get the same answers processing the BAM file manually (except for the 6th decimal place of the seq2 variance, and I guess that is of effectively no consequence).
I accidentally didn't log in before opening this issue so I cannot close it - please feel free to. ben
-
Oh, one more question actually. Is there some way that you'd like to refer to this algorithm e.g. can we refer to this as "metabat coverage"? Thanks.
-
Thanks Ben. Just call it "MetaBAT adjusted coverage", and I'll add a description of the adjustments to the README on bitbucket.
-
- changed status to resolved
Detailed the MetaBAT Adjusted coverage algorithm in the README
-
Hi again,
I've been trying further to replicate the algorithm you helpfully detailed, and now I'm able to get everything within 2% of the mean. I noticed empirically that the 97% threshold means reads with exactly 97% identity are discarded, which seems to contradict the "drops below a threshold" wording in the README. I notice there's also a typo ("Substituions") there too.
One thing I suspect might be unintentional is this strange result I get, running the program over a bam file with just one mapping:
@HD VN:1.6 SO:coordinate
@SQ SN:k141_1289381 LN:244
NS500333:6:H1124BGXX:1:22209:17475:4754 129 k141_1289381 1 60 11S139M * 1444 0 GCCGTCGGCGTCCGCCACAAATTCGGTGACCACGCTCAGGCAGTCCTGATGTTCCAGCAGGGAAGTCTTGCCCGTGGAAAGCTCTTCCGGCAGCATGGGGAAGTTTCGGATTCCGGTGTAGACGGTCGTCGTTTCCCTGGCGGCATGCTG AAAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFAFFAFFFFFFFFFFFFFFFFFAAF<FFFFF NM:i:0 MD:Z:139 MC:Z:146M AS:i:139 XS:i:62
The output I get is
contigName contigLen totalAvgDepth bam bam-var
k141_1289381 244 0.670213 0.680851 0.219629
Notice the totalAvgDepth is different to the bam column - I assumed that they should be the same when there is only one sample. I calculate the bam column as being correct: (139-75)/(244-75-75) = 0.6808511
Any ideas? In the bam file I'm using to test this 1.7% of contigs with non-zero coverage have different entries, so it is rather rare. Happy to send you that bam file if that is of use.
Thanks, ben
-
Hi Ben,
Sorry for the late reply... I haven't touched this code in years, but I think this may be an off-by-one error where the code is under-counting the contig-edge bases. I'll take a deeper look later this week, but if I am correct then only reads that map close to the edge would have their coverage under-reported, so it is a rare and small effect.
And regarding the %ID filter, I verified that a read fails if getPctId() < failPctId (i.e. 97 / 100.0). Maybe floating point precision is triggering the "exactly 97% are discarded" behavior you observe? ... i.e. 97% of a 150-base read is 145.5, so it would be scored as either 145 bases == 96.66% or 146 bases == 97.33%. In the %ID calculation, soft- or hard-clipped bases in the CIGAR string are ignored, and it is calculated as ((double) exactmatches) / (exactmatches + substitutions + insertions + deletions)
Cheers, Rob
-
Hi Rob,
My apologies for asking you to look at old code.
Re the 97% thing, I was observing it with 100M but NM:i:3. Here's a concrete example:
$ cat tests/data/k141_2005182.offending_read.sam
@HD VN:1.6 SO:coordinate
@SQ SN:k141_2005182 LN:225
NS500333:6:H1124BGXX:2:21307:6303:13731 147 k141_2005182 54 55 50S100M = 74 -80 GCAGAGTTATGTTTCGAGTTCCTCGATGTTCTCCACGAGTTCGTTGATCTCCTCTCTGTGTTCCTAGAGTGCCTCTGCTTGACCGCATCCGCCGGCCACGGTCGCGCAGAGCGTCTGTCCGAGTTCGAAGAGCACCTCTCGGTGTTCCTC ))))))).)))7F.FF)F<FA.F)F).<.7.A))F7<77FFFF)))7)AF)F)F<..<FF.F..A)FFF.F7.7FA7F.FFF<FFFF.FAFA7FF.FFFF77<<FFFFFFFAFFFAFF)FFFFF<FFFFFFFFFAFFFAF<F<FFAAA<< NM:i:3 MD:Z:6G8C3C80 MC:Z:5S110M34S AS:i:85 XS:i:73 XA:Z:k141_2005182,-138,50S88M12S,3;k141_2005182,-1,81S69M,0;
$ cat tests/data/k141_2005182.offending_read.sam | jgi_summarize_bam_contig_depths --outputDepth /dev/stdout /dev/stdin
Output depth matrix to /dev/stdout
Output matrix to /dev/stdout
Opening 1 bams
Consolidating headers
Processing bam files
Thread 0 finished: stdin with 0 reads and 0 readsWellMapped
Creating depth matrix file: /dev/stdout
contigName contigLen totalAvgDepth stdin stdin-var
k141_2005182 225 0 0 0
Closing most bam files
Closing last bam file
Finished
Thanks, ben
-
Well, I learned something from that... 97 / 100.0 == 0.9700000 if it is a float, but 0.9700000286102295 if it is a double upcast from a float.
fprintf(stdout, "        int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/100.0), (double) (97/100.0), (float) (97/100.0) == (double) (97/100.0));
fprintf(stdout, "floats  int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/(float) 100.0), (double) (97/(float) 100.0), (float) (97/(float) 100.0) == (double) (97/(float) 100.0));
fprintf(stdout, "doubles int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/(double) 100.0), (double) (97/(double) 100.0), (float) (97/(double) 100.0 ) == (double) (97/(double) 100.0));

        int/float=0.9700000286102295, int/double=0.9700000000000000 isEqual=0
floats  int/float=0.9700000286102295, int/double=0.9700000286102295 isEqual=1
doubles int/float=0.9700000286102295, int/double=0.9700000000000000 isEqual=0
To save space in some of the containers, I use floats, but use doubles for the calculations, and the pctId was being calculated as a double but compared to a float.
-
And the other bug, where totalDepth != bamDepth for a single bam, was an integer rounding error: I was casting a long from a float * int and it was getting truncated, not rounded.
Thanks for your test edge cases! I'll push these changes to this repo soon.
Excluding the possibility of a bug, the output of depth coverages is corrected from the raw count by a number of factors to give a more realistic metric considering the nature of reads, errors, and strain variations present in the data and samples.
1) The edges of a contig are generally excluded from the coverage counts, up to a default of 75 bases or the average read length (--includeEdgeBases, --maxEdgeBases). This is because mappers generally have a difficult time aligning a partial read to a contig when it would extend off the edge, so the coverage ramps up from 0 to the true coverage in this region.
2) Reads that map imperfectly are excluded when the %ID of the mapping drops below a threshold (--percentIdentity=97). MetaBAT is designed to resolve strain variation, and reads mapping with low %ID indicate that the read actually came from a different strain/species.
3) Clips/insertions/deletions/mismatches are excluded from the coverage count -- only the read bases that exactly match the reference are counted as coverage. This generally has a small effect, except in the case of long reads from PacBio and Nanopore.