mroachawri / purge_haplotigs / issues / #19 - reference genome: masked? — Bitbucket

Issue #19 resolved

Amina Echchiki created an issue 2018-11-19

Hi @mroachawri

I had a sudden doubt. Was wondering if in PurgeHaplotigs I should use as reference:

the diploid assembly that I get from the assembler directly (eg. Canu) or
the repeat-masked diploid assembly (e.g. Canu + Winmasker)

And if you please can help me understand which one is better and why

Best Amina

Comments (5)

Amina Echchiki reporter
- edited description
- 2018-11-19T13:30:02+00:00
Michael Roach repo owner
Hi Amina,

I haven't tried it with a repeat-masked assembly yet; I just use the assembly and supply repeat annotations to the pipeline. If you do try using a repeat-masked FASTA I'd be interested to know how it goes.

I've also had mixed results with raw windowmasker annotations as the annotations tend to be more fragmented compared to repeatmodeller+repeatmasker and I've had to apply 'smoothing' over windows to address the issue. I can show you the commands if you're interested?

Cheers, Mike

EDIT: I think it's generally better to supply repeat annotations if you have them. Otherwise you can have repeat-rich contigs being reassigned as haplotigs.
- 2018-11-19T23:07:39+00:00
Amina Echchiki reporter
ok, thanks Mike!

in fact, I had a "too high" haploid genome size than expected, this is why I asked about whether you thought the masked genome would be better. And yes, I'll be totally interested in the commands, thanks.

Amina
- 2018-11-20T11:21:50+00:00
Michael Roach repo owner
Generate the counts like so:
```
windowmasker -in genome.fasta -infmt fasta -mk_counts -parse_seqids \
    -out genome.counts -sformat obinary
```
You can the use windowmasker to output the masked coordinates, I use a perl 1-liner to convert to bed format.
```
windowmasker -in genome.fasta -infmt fasta -ustat genome.counts -outfmt interval | sed 's/ - /\t/' \
    | perl -e 'my$s;while(<>){if(m/^>/){$s=$_;$s=~s/>|\n//g;next;}else{my@l=split/\s+/;print"$s\t$l[0]\t$l[1]\n";}}' \
    | sort -k1,1 -k2,2n -k3,3n | bedtools merge > genome.winmask.bed
```
If you open that in a genome viewer you'll notice the annotations are fragmented, and also that there tends to be dense clusters. To apply 'smoothing' you make some genome windows and use bedtools coverage, and awk is used to keep only the windows above a certain density (here it's 0.5 or 50% coverage, but you'll want to try a few different cutoffs and visualise against the winmask annotations).
```
bedtools makewindows -g genome.fasta.fai -w 500 -s 100 > genome.windows

bedtools coverage -a genome.windows -b genome.winmask.bed | awk '$7>=0.5{print}' \
    | bedtools merge > genome.repeats.bed
```
I think it would be better not to mask the FASTA (as in don't change the bases to 'N's or 'X's), particularly for mapping as masking large repeat regions might confound the read-depth analysis; but I haven't tried this and I don't know if it actually has an impact. I'll update here if I do.
- 2018-11-20T23:01:20+00:00
Michael Roach repo owner
- changed status to resolved
- 2019-06-12T10:37:32+00:00
Log in to comment

Assignee: –

Type: task

Priority: major

Status: resolved

Votes: 0

Watchers: 1