Choosing cutoffs for a mostly haploid assembly

Issue #13 closed
Anurag Priyam created an issue

Hi Michael,

I just tried purge_haplotigs to remove allelic contigs from a mostly haploid Canu assembly and it seems to have done a pretty good job. I want to play around with the parameters a bit more. Can I get your feedback to make sure I am thinking about this correctly?

I mapped raw PacBio reads to the assembly using minimap2 and retained only the primary alignments (samtools view -F 256). Since my assembly is mostly haploid, the diploid peak is kind of absent (figure below). However, below 9x coverage the histogram deviates from a normal distribution. I interpret this to mean that any contig below 9x coverage is either junk or suspect. Accordingly, I set the low, mid, and high cutoffs to 2, 9, and 200 (i.e., below 2x is junk, 2-8x may be allelic contigs, 9-199x is good, and 200x and above is repetitive). Do you think this reasoning is fair?

[Figure: mapping_primary.bam.histogram.png]

Comments (3)

  1. Michael Roach repo owner

    Yes, this looks OK. The genecov file has everything at a depth of 200 and above collapsed to 200, so if you want to remove 200+ as repetitive you should set -h 199. I should probably state this somewhere.

    low cov = cov < -l
    hap cov = -l <= cov <= -m
    dip cov = -m < cov <= -h
    high cov = cov > -h
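
    The binning can be sketched in a few lines of Python. This is only an illustration, not purge_haplotigs code: classify_depth, capped, and DEPTH_CAP are hypothetical names, and the sketch assumes haploid means -l <= cov <= -m and diploid means -m < cov <= -h, with depths of 200 and above stored as 200.

```python
DEPTH_CAP = 200  # assumption: depths of 200 and above are stored as 200


def capped(depth: int) -> int:
    """Collapse a raw depth the way the coverage file is described as doing."""
    return min(depth, DEPTH_CAP)


def classify_depth(cov: int, low: int, mid: int, high: int) -> str:
    """Bin a depth with the -l / -m / -h cutoffs (hypothetical helper)."""
    if cov < low:
        return "low"
    if cov <= mid:
        return "haploid"
    if cov <= high:
        return "diploid"
    return "high"


# With -h 200, a collapsed depth of 200 still falls in the diploid bin.
print(classify_depth(capped(350), low=2, mid=9, high=200))
# With -h 199, the same depth falls in the high bin.
print(classify_depth(capped(350), low=2, mid=9, high=199))
```

    Under these assumptions a depth exactly equal to -m lands in the haploid bin, so -m 9 treats 9x as haploid rather than diploid.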
    