How the depth of contig was calculated, why it's different from samtool result.

Issue #36 resolved
Yun created an issue

Hi, I am sorry that I have to reopen the same issure of contig depth calculate since the old post was labeled as solved.

I am some confused about what's the formula used to calculate the contig depth. In samtools, I can use command:

samtools depth k119_4_sorted.bam >d1.txt

to generate the depth of each site.

If I use metabat, I used the following command to generate the contig depth:

jgi_summarize_bam_contig_depths --outputDepth d2.txt k119_4_sorted.bam

Here is the result from metabat d2.png

Initially I thought the average depth of each site(from samtool) will be the contig depth.

awk '{sum+=$3}END{print sum/329}' d1.txt

d1.png However, it's not true. Can you help me to figure out where it goes wrong?

Thanks,

Yun

Comments (5)

  1. Rob Egan

    As described in issue #35, the contig depth calculation does not include all reads blindly (as samtools depth does). The minimum %ID to the reference is set to a high threshold (97% by default to ensure that different strains/species are not counted) and the edges of the reference contigs are ignored (because read mappers generally have trouble aligning to the edge of a contig and one can observed a nearly linear drop off of coverage within 1 read length of the edge).

  2. Yun reporter

    Thanks Rob, Is there any way I can replicate the result from jgi_summarize_bam_contig_depths? Or can you let know where I can look at for the detailed description of the formula?

    Thanks, Yun

  3. Rob Egan

    Hi Yun,

    There is no simple formula that fully replicates what jgi_summarize_bam_contig_depths, but rather it is an algorithm. Feel free to inspect the source code https://bitbucket.org/berkeleylab/metabat/src/acf3ae629c4ef829c42406f3041680f093bc067e/src/jgi_summarize_bam_contig_depths.cpp?at=master&fileviewer=file-view-default

    We chose, specifically not to use samtools depth because it does not generate a great estimate of the relative abundance that metabat requires for good separation of samples. When a contig base has 0 coverage, then samtools depth neglects to report that, throwing off some calcualtions and when a read poorly maps samtools depth does count it, again foiling some of the species and strain variation nuances.

    You can get a result closer to samtools depth if you run jgi_summarize_bam_contig_depths --percentIdentity 0 --includeEdgeBases you will get similar numbers to samtools depth, so long as you divide by the contig length and not the number of entries reported by samtools depth.

    There is one more thing going on where mismatches are removed from the coverage depth calculations, so our tool will always report a slightly lower coverage number than samtools depth reports, unless all reads are perfect matches. If required we could provide an option to disable that discounting.

  4. Log in to comment