Pipeline to help with curating heterozygous diploid genome assemblies (for instance when assembling using FALCON or FALCON-unzip).
- Resource Usage
- Running Purge Haplotigs
- Further Curation
Some parts of a genome may have a very high degree of heterozygosity. This causes contigs for both haplotypes of that part of the genome to be assembled as separate primary contigs, rather than as a contig and an associated haplotig. This can be an issue for downstream analysis whether you're working on the haploid or phased-diploid assembly.
Identify pairs of contigs that are syntenic and move one of them to the haplotig 'pool'. The pipeline uses mapped read coverage and Minimap2 alignments to determine which contigs to keep for the haploid assembly. Dotplots are optionally produced for all flagged contig matches, juxtaposed with read-coverage, to help the user determine the proper assignment of any remaining ambiguous contigs. The pipeline will run on either a haploid assembly (i.e. Canu, FALCON or FALCON-Unzip primary contigs) or on a phased-diploid assembly (i.e. FALCON-Unzip primary contigs + haplotigs). Here are two examples of how Purge Haplotigs can improve a haploid and diploid assembly.
- BEDTools (tested with v2.25.0)
- SAMTools (tested with v1.3.1)
- Minimap2 (tested with v2.11/v2.12, https://github.com/lh3/minimap2)
- MUMmer (for ncbiplace script, version 4.0 or higher, https://github.com/mummer4/mummer)
- Perl (with core modules: FindBin, Getopt::Long, Time::Piece, threads, Thread::Semaphore, Thread::Queue, List::Util)
- Rscript (with ggplot2)
The current pipeline (version 1.0.2+) has had some recent improvements to reduce peak RAM and disk IO. Benchmarks were run against four assemblies using a 16-core Intel(R) Xeon(R) E5-2670 based workstation with 64 GB of RAM, running Ubuntu 16.04 LTS. For all four assemblies, 16 threads were supplied for STEP 1 and all 32 threads were supplied for STEP 3. Runtimes and peak RAM usage are tabled below.
|Assembly||Diploid size (Mbp)||Runtime (hh:mm:ss)||Peak RAM (Gb)|
|Arabidopsis thaliana (Cvi-0 x Col-0)||245||00:03:23||9|
|Vitis vinifera L. Cv. Cabernet Sauvignon||959||00:29:05||36|
|Taeniopygia guttata||1 983||00:23:09||35|
Currently only tested on Ubuntu, there is a Detailed manual installation example for Ubuntu 16.04 LTS in the wiki.
Easy Installation using bioconda
- Add the bioconda and conda-forge channels if you haven't already
conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge
- create (if needed) and activate a conda environment
conda create -n purge_haplotigs_env conda activate purge_haplotigs_env
conda install purge_haplotigs
- Test Purge Haplotigs
- Install dependencies, make sure they're in the system PATH
- Pull/clone this git
- Either softlink
purge_haplotigsto a directory in your system PATH
ln -s /path/to/purge_haplotigs/bin/purge_haplotigs ~/bin/purge_haplotigs
- Or add the Purge Haplotigs bin directory to your system PATH
export PATH=$PATH:/path/to/purge_haplotigs/bin printf "\nexport PATH=\$PATH:/path/to/purge_haplotigs/bin\n" >> ~/.bashrc
Running Purge Haplotigs
There is a tutorial in the wiki that you can follow which goes into a bit more detail for each step.
Map your PacBio subreads, or some decent long reads (or even short reads) to your haploid or diploid genome assembly.
You'll want to map a library that produces an even coverage and use a 'randombest' alignment for multimappers.
Sort and index the bam with
Index your genome.fasta file with
Example below uses minimap2 to map some pacbio FASTA format subreads.
minimap2 -ax map-pb genome.fa subreads.fasta.gz \ | samtools view -hF 256 - \ | samtools sort -@ 8 -m 1G -o aligned.bam -T tmp.ali
Generate a coverage histogram by running the first script. This script will produce a histogram png image file for you to look at and a BEDTools 'genomecov' output file that you'll need for STEP 2.
USAGE: purge_haplotigs readhist -b aligned.bam -g genome.fasta [ -t threads ] REQUIRED: -b / -bam BAM file of aligned and sorted reads/subreads to the reference -g / -genome Reference FASTA for the BAM file. OPTIONAL: -t / -threads Number of worker threads to use, DEFAULT = 4, MINIMUM = 2
You should have a bimodal histogram--one peak for haploid level of coverage, one peak for diploid level of coverage. NOTE: If you're using the phased assembly the diploid peak may be very small. Choose cutoffs for low coverage, low point between the two peaks, and high coverage. Example histograms for choosing cutoffs:
Run the second script using the cutoffs from the previous step to analyse the coverage on a contig by contig basis. This script produces a contig coverage stats csv file with suspect contigs flagged for further analysis or removal.
USAGE: purge_haplotigs contigcov -i aligned.bam.genecov -l 30 -m 80 -h 145 [-o coverage_stats.csv -j 80 -s 80 ] REQUIRED: -i / -in The bedtools genomecov output that was produced from 'purge_haplotigs readhist' -l / -low The read depth low cutoff (use the histogram to eyeball these cutoffs) -h / -high The read depth high cutoff -m / -mid The low point between the haploid and diploid peaks OPTIONAL: -o / -out Choose an output file name (CSV format, DEFAULT = coverage_stats.csv) -j / -junk Auto-assign contig as "j" (junk) if this percentage or greater of the contig is low/high coverage (DEFAULT = 80, > 100 = don't junk anything) -s / -suspect Auto-assign contig as "s" (suspected haplotig) if this percentage or less of the contig is diploid level of coverage (DEFAULT = 80)
Run the purging pipeline.
This script will automatically run a BEDTools windowed coverage analysis (if generating dotplots), and minimap2 alignments to assess which contigs to reassign and which to keep.
The pipeline will make several iterations of purging.
Optionally, parse repeats
-r in BED format for improved handling of repetitive regions
USAGE: purge_haplotigs purge -g genome.fasta -c coverage_stats.csv REQUIRED: -g / -genome Genome assembly in fasta format. Needs to be indexed with samtools faidx. -c / -coverage Contig by contig coverage stats csv file from the previous step. OPTIONAL: -t / -threads Number of worker threads to use. DEFAULT = 4 -o / -outprefix Prefix for the curated assembly. DEFAULT = "curated" -r / -repeats BED-format file of repeats to ignore during analysis. -d / -dotplots Generate dotplots for manual inspection. -b / -bam Samtools-indexed bam file of aligned and sorted reads/subreads to the reference, required for generating dotplots. ADVANCED: -a / -align_cov Percent cutoff for identifying a contig as a haplotig. DEFAULT = 70 -m / -max_match Percent cutoff for identifying repetitive contigs. Ignored when using repeat annotations (-repeats). DEFAULT = 250 -I Minimap2 indexing, drop minimisers every N bases, DEFAULT = 4G -v / -verbose Print EVERYTHING. -limit_io Limit for I/O intensive jobs. DEFAULT = -threads -wind_min Min window size for BED coverage plots (for dotplots). DEFAULT = 5000 -wind_nmax Max windows per contig for BED coverage plots (for dotplots). DEFAULT = 200
You will have five files
- <prefix>.fasta: These are the curated primary contigs
- <prefix>.haplotigs.fasta: These are all the haplotigs identified in the initial input assembly. The seq IDs will remain the same but the description field for the contigs will now show the contig "associations" and purging order, e.g.
- <prefix>.artefacts.fasta: These are the very low/high coverage contigs (identified in STEP 2). NOTE: you'll probably have mitochondrial/chloroplast/etc. contigs in here with the assembly junk.
- <prefix>.reassignments.tsv: These are all the reassignments that were made, as well as the suspect contigs that weren't reassigned.
- <prefix>.contig_associations.log: This shows the contig "associations"/purging order, e.g
000000F,PRIMARY -> 000000F_005,HAPLOTIG -> 000000F_009,HAPLOTIG -> 000000F_003,HAPLOTIG -> 000000F_010,HAPLOTIG
If generating dotplots you will get two directories:
- dotplots_unassigned_contigs: These are the dotplots for the suspect contigs that remain unassigned.
- dotplots_reassigned_contigs: These are the dotplots for the contigs that were reassigned.
You can go through the dotplots and check the assignments. You might wish to move reassigned contigs back into the primary contig pool or purge contigs that were too ambiguous for automatic reassignment. Below are some examples of what might occur.
A haplotig - Hopefully most of your reassigned haplotigs will look like this.
Contig is mostly haplotig - This example has part of the contig with a diploid level of coverage and part at haploid level with poor alignment to either reference (possibly hemizygous region).
Contig circling through string graph 'knots' - while this may be a valid string graph path it will likely still confound short read mapping to the haploid assembly.
The pipeline is now published at BMC Bioinformatics:
Read the latest updates HERE