- edited description
reference genome: masked?
Hi @mroachawri
I had a sudden doubt. Was wondering if in PurgeHaplotigs I should use as reference:
- the diploid assembly that I get from the assembler directly (eg. Canu) or
- the repeat-masked diploid assembly (e.g. Canu + Winmasker)
And if you please can help me understand which one is better and why
Best Amina
Comments (5)
-
reporter -
repo owner Hi Amina,
I haven't tried it with a repeat-masked assembly yet; I just use the assembly and supply repeat annotations to the pipeline. If you do try using a repeat-masked FASTA I'd be interested to know how it goes.
I've also had mixed results with raw windowmasker annotations as the annotations tend to be more fragmented compared to repeatmodeller+repeatmasker and I've had to apply 'smoothing' over windows to address the issue. I can show you the commands if you're interested?
Cheers, Mike
EDIT: I think it's generally better to supply repeat annotations if you have them. Otherwise you can have repeat-rich contigs being reassigned as haplotigs.
-
reporter ok, thanks Mike!
in fact, I had a "too high" haploid genome size than expected, this is why I asked about whether you thought the masked genome would be better. And yes, I'll be totally interested in the commands, thanks.
Amina
-
repo owner Generate the counts like so:
windowmasker -in genome.fasta -infmt fasta -mk_counts -parse_seqids \ -out genome.counts -sformat obinary
You can the use windowmasker to output the masked coordinates, I use a perl 1-liner to convert to bed format.
windowmasker -in genome.fasta -infmt fasta -ustat genome.counts -outfmt interval | sed 's/ - /\t/' \ | perl -e 'my$s;while(<>){if(m/^>/){$s=$_;$s=~s/>|\n//g;next;}else{my@l=split/\s+/;print"$s\t$l[0]\t$l[1]\n";}}' \ | sort -k1,1 -k2,2n -k3,3n | bedtools merge > genome.winmask.bed
If you open that in a genome viewer you'll notice the annotations are fragmented, and also that there tends to be dense clusters. To apply 'smoothing' you make some genome windows and use bedtools coverage, and awk is used to keep only the windows above a certain density (here it's 0.5 or 50% coverage, but you'll want to try a few different cutoffs and visualise against the winmask annotations).
bedtools makewindows -g genome.fasta.fai -w 500 -s 100 > genome.windows bedtools coverage -a genome.windows -b genome.winmask.bed | awk '$7>=0.5{print}' \ | bedtools merge > genome.repeats.bed
I think it would be better not to mask the FASTA (as in don't change the bases to 'N's or 'X's), particularly for mapping as masking large repeat regions might confound the read-depth analysis; but I haven't tried this and I don't know if it actually has an impact. I'll update here if I do.
-
repo owner - changed status to resolved
- Log in to comment