Could performing purging multiple times improve the assembly?

Issue #131 resolved
Lvinx24 created an issue

Hi, I’m working with a plant genome, which I have assembled from PacBio reads.

From this assembly, I obtained the following histograms from the alignment with minimap2:

I then applied purging with these cutoff thresholds: -l 0 -h 110 -m 35.

The results obtained are the following:

My questions are: is it normal that I still have the first, low-coverage peak? Should I have used other (perhaps less stringent) cutoff values? Could a second run of purging further improve the assembly?

Thanks in advance!

Comments (7)

  1. Michael Roach

    Hi,

    You do still have a lot of duplication after purging. There’s no need to rerun the whole pipeline. I would just try rerunning the final purge step with -a 60 or even -a 50.

  2. Lvinx24 reporter

    Hi Michael,

    I reran the purge step with -a 50. The assembly size was reduced a little more, but the BUSCO results remain exactly the same, so the duplication has not improved.

    I also remapped the PacBio reads to the curated assembly to obtain a new histogram, and the peak at 15x is still present. It is lower, but still there.

    I wonder whether lowering the -align_cov parameter further makes sense.

    Might the -max_match parameter have some effect if we decrease it from 250? About this parameter: if it is a score, why is the cutoff described as a percentage?

    Or maybe the initial cutoff values (-l 0 -h 110 -m 35) were not quite appropriate?

  3. Michael Roach

    The initial cutoff was fine. It’s not uncommon to see a smaller half-coverage peak remaining after purging. The assembly has been improved; that initial peak is smaller and the genome size is reduced. The number of duplicated BUSCOs is unusually high for this sort of histogram, but I suspect the assembly isn’t as bad as the BUSCO report suggests. I’d be interested to know how many duplicated BUSCOs are on contigs that fall within the larger peak versus the smaller peak. This would give you an indication of where the baseline is for purging.

  4. Lvinx24 reporter

    Hi Michael, I tried what you suggested. I took the contigs containing the duplicated BUSCOs and looked at their average coverage in the BAM file. They all have an average coverage between 52.0 and 67.0, so they fall under the second peak in the histogram. This is a good way to tell whether another run of purging could reduce the duplication rate further; thank you for the advice. In the end, I think the genome obtained is good enough to continue with other analyses.

    Thank you very much for your help!!
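
The check described in comments 3 and 4 can be sketched roughly as follows. This is only an illustration, not the script used in the thread. It assumes a BUSCO full table whose first three tab-separated columns are BUSCO id, status and contig name, and per-base depth in the three-column format (contig, position, depth) written by `samtools depth -a`. The function names and the `mid_cutoff` parameter are hypothetical; the default of 35 is simply the -m value from the original post.

```python
from collections import defaultdict


def duplicated_busco_contigs(full_table_text):
    """Map each contig to the set of duplicated BUSCO ids found on it.

    Assumes tab-separated columns: BUSCO id, status, contig name, ...
    Comment lines (starting with '#') and rows without a contig are skipped.
    """
    contigs = defaultdict(set)
    for line in full_table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # e.g. "Missing" rows carry no contig column
        busco_id, status, contig = fields[0], fields[1], fields[2]
        if status == "Duplicated":
            contigs[contig].add(busco_id)
    return dict(contigs)


def mean_coverage(depth_text):
    """Mean depth per contig from 'contig<TAB>position<TAB>depth' lines."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_text.splitlines():
        if not line.strip():
            continue
        contig, _pos, depth = line.split("\t")[:3]
        totals[contig] += int(depth)
        counts[contig] += 1
    return {contig: totals[contig] / counts[contig] for contig in totals}


def assign_peaks(dup_contigs, coverage, mid_cutoff=35):
    """Sort contigs carrying duplicated BUSCOs by which side of the
    mid cutoff their mean coverage falls on (assumed peak boundary)."""
    result = {"haploid_peak": [], "diploid_peak": [], "no_coverage": []}
    for contig in sorted(dup_contigs):
        cov = coverage.get(contig)
        if cov is None:
            result["no_coverage"].append(contig)
        elif cov < mid_cutoff:
            result["haploid_peak"].append(contig)
        else:
            result["diploid_peak"].append(contig)
    return result
```

If most contigs land in `diploid_peak`, as reported in comment 4, the duplicated BUSCOs sit on contigs at full coverage, and further purging is unlikely to remove them.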
