reference genome: masked?

Issue #19 resolved
Amina Echchiki created an issue

Hi @mroachawri

I had a sudden doubt. Was wondering if in PurgeHaplotigs I should use as reference:

  1. the diploid assembly that I get from the assembler directly (eg. Canu) or
  2. the repeat-masked diploid assembly (e.g. Canu + Winmasker)

And if you please can help me understand which one is better and why

Best Amina

Comments (5)

  1. Michael Roach repo owner

    Hi Amina,

    I haven't tried it with a repeat-masked assembly yet; I just use the assembly and supply repeat annotations to the pipeline. If you do try using a repeat-masked FASTA I'd be interested to know how it goes.

    I've also had mixed results with raw windowmasker annotations as the annotations tend to be more fragmented compared to repeatmodeller+repeatmasker and I've had to apply 'smoothing' over windows to address the issue. I can show you the commands if you're interested?

    Cheers, Mike

    EDIT: I think it's generally better to supply repeat annotations if you have them. Otherwise you can have repeat-rich contigs being reassigned as haplotigs.

  2. Amina Echchiki reporter

    ok, thanks Mike!

    in fact, I had a "too high" haploid genome size than expected, this is why I asked about whether you thought the masked genome would be better. And yes, I'll be totally interested in the commands, thanks.

    Amina

  3. Michael Roach repo owner

    Generate the counts like so:

    windowmasker -in genome.fasta -infmt fasta -mk_counts -parse_seqids \
        -out genome.counts -sformat obinary
    

    You can the use windowmasker to output the masked coordinates, I use a perl 1-liner to convert to bed format.

    windowmasker -in genome.fasta -infmt fasta -ustat genome.counts -outfmt interval | sed 's/ - /\t/' \
        | perl -e 'my$s;while(<>){if(m/^>/){$s=$_;$s=~s/>|\n//g;next;}else{my@l=split/\s+/;print"$s\t$l[0]\t$l[1]\n";}}' \
        | sort -k1,1 -k2,2n -k3,3n | bedtools merge > genome.winmask.bed
    

    If you open that in a genome viewer you'll notice the annotations are fragmented, and also that there tends to be dense clusters. To apply 'smoothing' you make some genome windows and use bedtools coverage, and awk is used to keep only the windows above a certain density (here it's 0.5 or 50% coverage, but you'll want to try a few different cutoffs and visualise against the winmask annotations).

    bedtools makewindows -g genome.fasta.fai -w 500 -s 100 > genome.windows
    
    bedtools coverage -a genome.windows -b genome.winmask.bed | awk '$7>=0.5{print}' \
        | bedtools merge > genome.repeats.bed
    

    I think it would be better not to mask the FASTA (as in don't change the bases to 'N's or 'X's), particularly for mapping as masking large repeat regions might confound the read-depth analysis; but I haven't tried this and I don't know if it actually has an impact. I'll update here if I do.

  4. Log in to comment