Choosing cutoffs for a mostly haploid assembly
Hi Michael,
I just tried purge_haplotigs to remove allelic contigs from a mostly haploid Canu assembly and it seems to have done a pretty good job. I want to play around with the parameters a bit more. Can I get your feedback to make sure I am thinking about this correctly?
I mapped raw PacBio reads to the assembly using minimap2 and retained only the primary alignments (samtools view -F 256). Since my assembly is mostly haploid, the diploid peak is largely absent (figure below). However, below 9x coverage the histogram deviates from a normal distribution. I interpret this to mean that any contig below 9x coverage is either junk or suspect. Accordingly, I set the low, mid and high cutoffs to 2, 9, and 200 (i.e., below 2x is junk, 2-8x may be allelic contigs, 9-199x is good, 200x and above is repetitive). Do you think this reasoning is fair?
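For concreteness, the binning I have in mind can be sketched in a few lines of Python. This only illustrates my reading of the cutoffs: it treats each contig as having a single depth value, which is a simplification, and the function name and labels are made up for the example rather than taken from purge_haplotigs.

```python
def classify_contig(depth, low=2, mid=9, high=200):
    """Label a contig by read depth, following the bins described in this post.

    Hypothetical helper for illustration; not part of purge_haplotigs.
    """
    if depth < low:
        return "junk"          # below 2x
    if depth < mid:
        return "suspect"       # 2-8x: possibly an allelic contig
    if depth < high:
        return "good"          # 9-199x
    return "repetitive"        # 200x and above


for depth in (1, 5, 9, 199, 200):
    print(depth, classify_contig(depth))
```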
Comments (3)
- repo owner - Yes this looks ok. The genecov file has everything at a depth of 200 and above collapsed to 200, so if you want to remove 200+ as repetitive you should set -h 199. I should probably state this somewhere.
- reporter - Thanks!
- reporter - changed status to closed
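The repo owner's point about the 200x cap can be illustrated with a small Python sketch. It assumes the high cutoff is applied with a strictly-greater-than comparison, which is only an inference from the advice to use -h 199 and has not been checked against the purge_haplotigs source. The helper names are hypothetical.

```python
DEPTH_CAP = 200  # depths of 200 and above are recorded as 200


def recorded_depth(true_depth, cap=DEPTH_CAP):
    """Depth as it would appear in a coverage file that caps values."""
    return min(true_depth, cap)


def exceeds_high(true_depth, high):
    """Assumed rule: high coverage means recorded depth > high."""
    return recorded_depth(true_depth) > high


for high in (200, 199):
    flagged = [d for d in (150, 199, 200, 450) if exceeds_high(d, high)]
    print(f"-h {high}: flagged depths {flagged}")
```

Under that assumption, a cutoff of 200 can never be exceeded because no recorded depth is larger than 200, while a cutoff of 199 catches everything that was collapsed to 200.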