How the depth of contig was calculated, why it's different from samtool result.
Hi, I am sorry that I have to reopen the same issure of contig depth calculate since the old post was labeled as solved.
I am some confused about what's the formula used to calculate the contig depth. In samtools, I can use command:
samtools depth k119_4_sorted.bam >d1.txt
to generate the depth of each site.
If I use metabat, I used the following command to generate the contig depth:
jgi_summarize_bam_contig_depths --outputDepth d2.txt k119_4_sorted.bam
Here is the result from metabat
Initially I thought the average depth of each site(from samtool) will be the contig depth.
awk '{sum+=$3}END{print sum/329}' d1.txt
However, it's not true. Can you help me to figure out where it goes wrong?
Thanks,
Yun
Comments (5)
-
-
reporter Thanks Rob, Is there any way I can replicate the result from jgi_summarize_bam_contig_depths? Or can you let know where I can look at for the detailed description of the formula?
Thanks, Yun
-
Hi Yun,
There is no simple formula that fully replicates what jgi_summarize_bam_contig_depths, but rather it is an algorithm. Feel free to inspect the source code https://bitbucket.org/berkeleylab/metabat/src/acf3ae629c4ef829c42406f3041680f093bc067e/src/jgi_summarize_bam_contig_depths.cpp?at=master&fileviewer=file-view-default
We chose, specifically not to use samtools depth because it does not generate a great estimate of the relative abundance that metabat requires for good separation of samples. When a contig base has 0 coverage, then samtools depth neglects to report that, throwing off some calcualtions and when a read poorly maps samtools depth does count it, again foiling some of the species and strain variation nuances.
You can get a result closer to samtools depth if you run jgi_summarize_bam_contig_depths --percentIdentity 0 --includeEdgeBases you will get similar numbers to samtools depth, so long as you divide by the contig length and not the number of entries reported by samtools depth.
There is one more thing going on where mismatches are removed from the coverage depth calculations, so our tool will always report a slightly lower coverage number than samtools depth reports, unless all reads are perfect matches. If required we could provide an option to disable that discounting.
-
reporter Thanks, Rob.
-
- changed status to resolved
- Log in to comment
As described in issue
#35, the contig depth calculation does not include all reads blindly (as samtools depth does). The minimum %ID to the reference is set to a high threshold (97% by default to ensure that different strains/species are not counted) and the edges of the reference contigs are ignored (because read mappers generally have trouble aligning to the edge of a contig and one can observed a nearly linear drop off of coverage within 1 read length of the edge).