jgi_summarize_bam_contig_depths coverage

Issue #48 resolved
Former user created an issue

Hi,

I'm in the middle of writing a program CoverM which calculates coverage from a BAM file - like jgi_summarize_bam_contig_depths except targeted more broadly (e.g. can calculate per-MAG coverage as well as per-contig). I was hoping that one output of CoverM could be a replicate of the depth output file of jgi_summarize_bam_contig_depths but I am having some trouble, probably because of my misunderstanding what jgi_summarize_bam_contig_depths does exactly.

My current understanding is that it calculates the mean (and variance) of pileup depths. However, I tried it on a BAM file where 2 of 7 contigs are covered, and I expected the mean for both to be 1.2: there are 12 100 bp reads mapping exactly to each of these 1000 bp reference sequences, so the naive mean depth is 12 * 100 / 1000 = 1.2. The BAM file is here: https://github.com/wwood/CoverM/blob/master/tests/data/7seqs.reads_for_seq1_and_seq2.bam

Running it through jgi_summarize_bam_contig_depths from MetaBAT 2.12.1, I don't get 1.2 for each:

contigName  contigLen   totalAvgDepth   7seqs.reads_for_seq1_and_seq2.bam   7seqs.reads_for_seq1_and_seq2.bam-var
genome1~random_sequence_length_11000    11000   0   0   0
genome1~random_sequence_length_11010    11010   0   0   0
genome2~seq1    1000    1.41176 1.41176 1.30493
genome3~random_sequence_length_11001    11001   0   0   0
genome4~random_sequence_length_11002    11002   0   0   0
genome5~seq2    1000    1.24353 1.24353 0.686205
genome6~random_sequence_length_11003    11003   0   0   0

So, what am I not understanding correctly? Or is this a bug? Thanks, ben

Comments (10)

  1. Rob Egan

    Excluding the possibility of a bug, the depth output is adjusted from the raw count by a number of factors to give a more realistic metric, given the nature of the reads, the errors, and the strain variation present in the data and samples.

    1) The edges of a contig are generally excluded from the coverage counts, up to a default of 75 bases or the average read length (--includeEdgeBases, --maxEdgeBases). This is because mappers generally have a difficult time aligning a partial read to a contig when the read would extend off the edge, so the coverage ramps up from 0 to the true coverage in this region.

    2) Reads that map imperfectly are excluded when the %ID of the mapping drops below a threshold (--percentIdentity=97). MetaBAT is designed to resolve strain variation, and a read mapping with low %ID indicates that the read actually came from a different strain/species.

    3) Clips/insertions/deletions/mismatches are excluded from the coverage count; only the read bases that exactly match the reference are counted as coverage. This generally has a small effect, except in the case of long reads from PacBio and Nanopore.
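
    To make this concrete, below is a minimal sketch of how those three adjustments could combine into the per-sample mean and variance for one contig. It is an illustration rather than the actual implementation: the Alignment struct, the toy input, the sample-variance formula, and the hard-coded defaults are all assumptions.

    #include <cstdio>
    #include <vector>

    struct Alignment {
        int start;                 // 0-based reference start of the aligned portion
        std::vector<bool> matches; // per reference position: true only for exact-match bases
        double pctId;              // exactmatches / (exactmatches + subs + ins + del)
    };

    int main() {
        const int contigLen = 1000;
        const int edgeBases = 75;      // cf. --maxEdgeBases / average read length
        const double minPctId = 0.97;  // cf. --percentIdentity=97

        // Toy input: one perfect 100 bp read starting at position 200.
        std::vector<Alignment> alignments = { {200, std::vector<bool>(100, true), 1.0} };

        std::vector<int> depth(contigLen, 0);
        for (const Alignment &aln : alignments) {
            if (aln.pctId < minPctId) continue;             // 2) drop low-identity mappings
            for (size_t i = 0; i < aln.matches.size(); ++i)
                if (aln.matches[i]) depth[aln.start + i]++; // 3) count exact-match bases only
        }

        // 1) ignore the first and last edgeBases positions of the contig.
        double sum = 0, sumSq = 0;
        int n = 0;
        for (int pos = edgeBases; pos < contigLen - edgeBases; ++pos) {
            sum += depth[pos];
            sumSq += (double)depth[pos] * depth[pos];
            ++n;
        }
        double mean = n > 0 ? sum / n : 0.0;
        double var = n > 1 ? (sumSq - sum * mean) / (n - 1) : 0.0;
        std::printf("mean=%g variance=%g\n", mean, var);
        return 0;
    }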

  2. Ben Woodcroft

    Thanks for the quick response, that clears things up - I get the same answers processing the BAM file manually (except for the 6th decimal place of the seq2 variance, and I guess that is of effectively no consequence).

    I accidentally didn't login before opening this issue so cannot close it - please feel free to. ben

  3. Ben Woodcroft

    Oh, one more question actually. Is there some way that you'd like to refer to this algorithm e.g. can we refer to this as "metabat coverage"? Thanks.

  4. Rob Egan

    Thanks Ben. Just call it "MetaBAT adjusted coverage", and I'll add a description of the adjustments to the README on bitbucket.

  5. Ben Woodcroft

    Hi again,

    I've been trying further to replicate the algorithm you helpfully detailed, and now I'm able to get everything to within 2% of the reported mean. I noticed empirically that the 97% threshold means that reads with exactly 97% identity are discarded, which seems to contradict the "drops below a threshold" wording in the README. There's also a typo in "Substituions" there.

    One thing I suspect might be unintentional is this strange result I get, running the program over a bam file with just one mapping:

    @HD VN:1.6  SO:coordinate
    @SQ SN:k141_1289381 LN:244
    NS500333:6:H1124BGXX:1:22209:17475:4754 129 k141_1289381    1   60  11S139M *   1444    0   GCCGTCGGCGTCCGCCACAAATTCGGTGACCACGCTCAGGCAGTCCTGATGTTCCAGCAGGGAAGTCTTGCCCGTGGAAAGCTCTTCCGGCAGCATGGGGAAGTTTCGGATTCCGGTGTAGACGGTCGTCGTTTCCCTGGCGGCATGCTG  AAAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFAFFAFFFFFFFFFFFFFFFFFAAF<FFFFF  NM:i:0  MD:Z:139    MC:Z:146M   AS:i:139    XS:i:62
    

    The output I get is

    contigName  contigLen   totalAvgDepth   bam bam-var
    k141_1289381    244 0.670213    0.680851    0.219629
    

    Notice the totalAvgDepth is different from the bam column; I assumed that they should be the same when there is only one sample. I calculate the bam column as being the correct value: (139 - 75) / (244 - 75 - 75) = 0.6808511
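
    To spell that arithmetic out as a standalone check (the assumption being that the counted bases are simply the overlap of the read's 139 matched bases with the non-edge window):

    #include <cstdio>
    #include <algorithm>

    int main() {
        const int contigLen = 244, edge = 75;
        const int readStart = 1, matchedLen = 139;   // the 11S139M alignment starting at position 1

        // Count only the matched bases inside the non-edge window [edge + 1, contigLen - edge].
        int lo = std::max(readStart, edge + 1);
        int hi = std::min(readStart + matchedLen - 1, contigLen - edge);
        int counted = std::max(0, hi - lo + 1);      // 139 - 75 = 64
        int window  = contigLen - 2 * edge;          // 244 - 150 = 94

        std::printf("%d / %d = %f\n", counted, window, (double)counted / window); // 0.680851
        return 0;
    }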

    Any ideas? In the BAM file I'm using to test this, 1.7% of contigs with non-zero coverage have differing entries, so it is rather rare. Happy to send you that BAM file if it would be of use.

    Thanks, ben

  6. Rob Egan

    Hi Ben,

    Sorry for the late reply... I haven't touched this code in years, but I think this may be an off-by-one error where the code is under-counting the contig-edge bases. I'll take a deeper look later this week, but if I am correct then it would only be reads that map close to the edge whose coverage is under-reported, so a rare and small effect.

    And regarding the %ID filter, I verified that a read fails if getPctId() < failPctId (i.e. 97 / 100.0). Maybe floating point precision is triggering the "exactly 97% is discarded" behavior you observe? i.e. 97% of a 150 base read is 145.5, so it would be scored as either 145 bases == 96.66% or 146 bases == 97.333%. In the %ID calculation, soft- or hard-clipped bases in the CIGAR string are ignored, and the value is calculated as ((double) exactmatches) / (exactmatches + substitutions + insertions + deletions).
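
    As a rough illustration only (not the actual code), the calculation amounts to something like this; deriving the counts from a 100M CIGAR and NM:i:3 with no indels is an assumption that holds for a read like your earlier example:

    #include <cstdio>

    // %ID over the aligned (non-clipped) bases only; the function name is illustrative.
    double getPctIdSketch(int exactMatches, int substitutions, int insertions, int deletions) {
        return ((double) exactMatches) / (exactMatches + substitutions + insertions + deletions);
    }

    int main() {
        // 50S100M with NM:i:3 and no indels -> 97 exact matches, 3 substitutions, clips ignored.
        double pctId = getPctIdSketch(97, 3, 0, 0);
        std::printf("pctId = %.17f, fails (pctId < 0.97): %d\n", pctId, pctId < 0.97); // fails: 0
        return 0;
    }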

    Cheers, Rob

  7. Ben Woodcroft

    Hi Rob,

    My apologies for asking you to look at old code.

    Re the 97% thing, I was observing it with reads that have a 100M alignment but NM:i:3, i.e. exactly 97% identity. Here's a concrete example:

    $ cat tests/data/k141_2005182.offending_read.sam
    @HD VN:1.6  SO:coordinate
    @SQ SN:k141_2005182 LN:225
    NS500333:6:H1124BGXX:2:21307:6303:13731 147 k141_2005182    54  55  50S100M =   74  -80 GCAGAGTTATGTTTCGAGTTCCTCGATGTTCTCCACGAGTTCGTTGATCTCCTCTCTGTGTTCCTAGAGTGCCTCTGCTTGACCGCATCCGCCGGCCACGGTCGCGCAGAGCGTCTGTCCGAGTTCGAAGAGCACCTCTCGGTGTTCCTC  ))))))).)))7F.FF)F<FA.F)F).<.7.A))F7<77FFFF)))7)AF)F)F<..<FF.F..A)FFF.F7.7FA7F.FFF<FFFF.FAFA7FF.FFFF77<<FFFFFFFAFFFAFF)FFFFF<FFFFFFFFFAFFFAF<F<FFAAA<<  NM:i:3  MD:Z:6G8C3C80   MC:Z:5S110M34S  AS:i:85 XS:i:73 XA:Z:k141_2005182,-138,50S88M12S,3;k141_2005182,-1,81S69M,0;
    
    $ cat tests/data/k141_2005182.offending_read.sam |jgi_summarize_bam_contig_depths --outputDepth /dev/stdout /dev/stdin
    Output depth matrix to /dev/stdout
    Output matrix to /dev/stdout
    Opening 1 bams
    Consolidating headers
    Processing bam files
    Thread 0 finished: stdin with 0 reads and 0 readsWellMapped
    Creating depth matrix file: /dev/stdout
    contigName  contigLen   totalAvgDepth   stdin   stdin-var
    k141_2005182    225 0   0   0
    Closing most bam files
    Closing last bam file
    Finished
    

    Thanks, ben

  8. Rob Egan

    Well I learned something from that... 97 / 100.0 is 0.9700000000000000 when computed as a double, but 0.9700000286102295 when that value is stored in a float and upcast back to a double.

    fprintf(stdout, "          int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/100.0), (double) (97/100.0), (float) (97/100.0) == (double) (97/100.0));
    fprintf(stdout, "floats    int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/(float) 100.0), (double) (97/(float) 100.0), (float) (97/(float) 100.0) == (double) (97/(float) 100.0));
    fprintf(stdout, "doubles   int/float=%0.16f, int/double=%0.16lf isEqual=%d\n", (float) (97/(double) 100.0), (double) (97/(double) 100.0), (float) (97/(double) 100.0 ) == (double) (97/(double) 100.0));
    
              int/float=0.9700000286102295, int/double=0.9700000000000000 isEqual=0
    floats    int/float=0.9700000286102295, int/double=0.9700000286102295 isEqual=1
    doubles   int/float=0.9700000286102295, int/double=0.9700000000000000 isEqual=0
    

    To save space in some of the containers I use floats but use doubles for the calculations, and the pctId was being calculated as a double but compared against a float.
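
    A minimal sketch of the mismatch and one possible fix (comparing in a single precision); the variable names are illustrative, apart from failPctId:

    #include <cstdio>

    int main() {
        float failPctId = 97 / 100.0;   // threshold stored as a float: widens back to 0.9700000286...
        double pctId = 97.0 / 100.0;    // per-read identity computed as a double: an exactly-97% read gives 0.97

        bool failsMixed = pctId < failPctId;          // double vs upcast float: 0.97 < 0.97000003 -> fails
        bool failsFloat = (float) pctId < failPctId;  // compare as floats instead -> passes

        std::printf("mixed: %d  float-only: %d\n", failsMixed, failsFloat); // mixed: 1  float-only: 0
        return 0;
    }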

  9. Rob Egan

    And the other bug, where the totalDepth != bamDepth for 1 bam, was an integer rounding error where I was casting a float * int to a long and it was getting truncated, not rounded.
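
    For illustration, roughly what was happening with the numbers from your single-read example, i.e. your 64 counted bases over 94 positions (a standalone reconstruction, not the actual code path):

    #include <cstdio>
    #include <cmath>

    int main() {
        int adjustedLen = 244 - 75 - 75;            // 94 non-edge bases
        float avg = (float) (64.0 / 94.0);          // the per-bam mean depth, stored as a float

        // float * int: with IEEE 754 single precision the product lands just below 64,
        // so a truncating cast loses a base while rounding keeps it.
        long truncated = (long) (avg * adjustedLen);      // 63
        long rounded   = std::lround(avg * adjustedLen);  // 64

        std::printf("truncated: %ld -> %f\n", truncated, (double) truncated / adjustedLen); // 0.670213
        std::printf("rounded:   %ld -> %f\n", rounded,   (double) rounded / adjustedLen);   // 0.680851
        return 0;
    }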

    Thanks for your test edge cases! I'll push these changes to this repo soon.
