Interpreting coverage histogram pre- and post-purging

Issue #74 resolved
Tom Jenkins created an issue

Hi, thank you for creating a program that is very easy to install and use!

I have a question about interpreting the coverage histogram before and after executing purge_haplotigs. As shown below, the post-purging coverage histogram is exactly the same curve as the pre-purging histogram, just with much fewer contigs, because a few thousand were identified as junk in previous steps. I have tried multiple cutoffs including -l 5 -m 25 -h 120 and -l 5 -m 45 -h 500 and both sets of parameters did not change the shape of the histogram post-purging. The curated.fasta had the same number of BUSCOs as the original assembly but three fewer duplicated BUSCOs. Any advice would be great thanks!

Pre-purging:

Post-purging:

Commands used:
minimap2 -ax map-pb -H --secondary=no assembly.fasta pacbio_reads.fq -o assembly.sam -t 16

samtools view --threads 16 -b assembly.sam > assembly.bam

samtools sort --threads 16 assembly.bam -o assembly.sorted.bam

purge_haplotigs hist -b assembly.sorted.bam -g assembly.fasta -t 16

purge_haplotigs cov -i assembly.sorted.bam.gencov -l 5 -m 25 -h 120

purge_haplotigs purge -g assembly.fasta -c coverage_stats.csv -t 16

Comments (4)

  1. Michael Roach repo owner

    It looks like the coverage peaks are overlapping a bit; you want to capture all the 1/2 coverage contigs. try: -l 5 -m 60 -h 190 for the cov step, then -a 60 for the purge step. What sort of BUSCO duplication do you have?

  2. Tom Jenkins reporter

    Cool, I will try that. The number of duplicated BUSCOs was 12 (1.2%) pre-purging and 9 post-purging, which isn’t bad but the non-uniform curve post-purging concerns me because it is very different to your examples and many other posts on this issues forum.

  3. Michael Roach repo owner

    I see what you mean, 1.2% is not a lot of duplication, but the coverage is a bit odd. If you wanted to dig deeper you could visualize the coverage and see where the culprit contigs/regions are. An alignment viewer like IGV is fine for small genomes but you might want to make a plot for larger genomes, e.g. https://bioinf.cc/misc/2020/07/12/circos-histograms.html

  4. Log in to comment