HTTPS SSH

install with bioconda

Purge Haplotigs

Pipeline to help with curating heterozygous diploid genome assemblies (for instance when assembling using FALCON or FALCON-unzip).

Contents

Introduction

The problem

Some parts of a genome may have a very high degree of heterozygosity. This causes contigs for both haplotypes of that part of the genome to be assembled as separate primary contigs, rather than as a contig and an associated haplotig. This can be an issue for downstream analysis whether you're working on the haploid or phased-diploid assembly.

The solution

Identify pairs of contigs that are syntenic and move one of them to the haplotig 'pool'. The pipeline uses mapped read coverage and Minimap2 alignments to determine which contigs to keep for the haploid assembly. Dotplots are optionally produced for all flagged contig matches, juxtaposed with read-coverage, to help the user determine the proper assignment of any remaining ambiguous contigs. The pipeline will run on either a haploid assembly (i.e. Canu, FALCON or FALCON-Unzip primary contigs) or on a phased-diploid assembly (i.e. FALCON-Unzip primary contigs + haplotigs). Here are two examples of how Purge Haplotigs can improve a haploid and diploid assembly.

Dependencies

  • Bash
  • BEDTools (tested with v2.25.0)
  • SAMTools (tested with v1.3.1)
  • Minimap2 (tested with v2.11/v2.12, https://github.com/lh3/minimap2)
  • MUMmer (for ncbiplace script, version 4.0 or higher, https://github.com/mummer4/mummer)
  • Perl (with core modules: FindBin, Getopt::Long, Time::Piece, threads, Thread::Semaphore, Thread::Queue, List::Util)
  • Rscript (with ggplot2)

Resource Usage

The current pipeline (version 1.0.2+) has had some recent improvements to reduce peak RAM and disk IO. Benchmarks were run against four assemblies using a 16-core Intel(R) Xeon(R) E5-2670 based workstation with 64 GB of RAM, running Ubuntu 16.04 LTS. For all four assemblies, 16 threads were supplied for STEP 1 and all 32 threads were supplied for STEP 3. Runtimes and peak RAM usage are tabled below.

Assembly Diploid size (Mbp) Runtime (hh:mm:ss) Peak RAM (Gb)
Clavicorona pyxidata 65.4 00:00:47 2
Arabidopsis thaliana (Cvi-0 x Col-0) 245 00:03:23 9
Vitis vinifera L. Cv. Cabernet Sauvignon 959 00:29:05 36
Taeniopygia guttata 1 983 00:23:09 35

Installation

Currently only tested on Ubuntu, there is a Detailed manual installation example for Ubuntu 16.04 LTS in the wiki.

Easy Installation using bioconda

  • Add the bioconda and conda-forge channels if you haven't already
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
  • create (if needed) and activate a conda environment
conda create -n purge_haplotigs_env
conda activate purge_haplotigs_env
  • Install

conda install purge_haplotigs

  • Test Purge Haplotigs

purge_haplotigs test

Manual Installation

  • Install dependencies, make sure they're in the system PATH
  • Pull/clone this git
  • Either softlink purge_haplotigs to a directory in your system PATH

ln -s /path/to/purge_haplotigs/bin/purge_haplotigs ~/bin/purge_haplotigs

  • Or add the Purge Haplotigs bin directory to your system PATH
export PATH=$PATH:/path/to/purge_haplotigs/bin
printf "\nexport PATH=\$PATH:/path/to/purge_haplotigs/bin\n" >> ~/.bashrc
  • test

purge_haplotigs test

Running Purge Haplotigs

There is a tutorial in the wiki that you can follow which goes into a bit more detail for each step.

PREPARATION

Map your PacBio subreads, or some decent long reads (or even short reads) to your haploid or diploid genome assembly. You'll want to map a library that produces an even coverage and use a 'randombest' alignment for multimappers. Sort and index the bam with samtools index. Index your genome.fasta file with samtools faidx. Example below uses minimap2 to map some pacbio FASTA format subreads.

minimap2 -ax map-pb genome.fa subreads.fasta.gz \
    | samtools view -hF 256 - \
    | samtools sort -@ 8 -m 1G -o aligned.bam -T tmp.ali

STEP 1

Generate a coverage histogram by running the first script. This script will produce a histogram png image file for you to look at and a BEDTools 'genomecov' output file that you'll need for STEP 2.

USAGE:
purge_haplotigs  readhist  -b aligned.bam  -g genome.fasta  [ -t threads ]

REQUIRED:
-b / -bam       BAM file of aligned and sorted reads/subreads to the reference
-g / -genome    Reference FASTA for the BAM file.

OPTIONAL:
-t / -threads   Number of worker threads to use, DEFAULT = 4, MINIMUM = 2

MANUAL STEP

You should have a bimodal histogram--one peak for haploid level of coverage, one peak for diploid level of coverage. NOTE: If you're using the phased assembly the diploid peak may be very small. Choose cutoffs for low coverage, low point between the two peaks, and high coverage. Example histograms for choosing cutoffs:

PacBio subreads on a Diploid-phased assembly (Primary + Haplotigs)

Illumina PE reads on a Haploid assembly (Primary contigs)

STEP 2

Run the second script using the cutoffs from the previous step to analyse the coverage on a contig by contig basis. This script produces a contig coverage stats csv file with suspect contigs flagged for further analysis or removal.

USAGE:
purge_haplotigs  contigcov  -i aligned.bam.genecov  -l 30  -m 80  -h 145  [-o coverage_stats.csv -j 80  -s 80 ]

REQUIRED:
-i / -in        The bedtools genomecov output that was produced from 'purge_haplotigs readhist'
-l / -low       The read depth low cutoff (use the histogram to eyeball these cutoffs)
-h / -high      The read depth high cutoff
-m / -mid       The low point between the haploid and diploid peaks

OPTIONAL:
-o / -out       Choose an output file name (CSV format, DEFAULT = coverage_stats.csv)
-j / -junk      Auto-assign contig as "j" (junk) if this percentage or greater of the contig is 
                low/high coverage (DEFAULT = 80, > 100 = don't junk anything)
-s / -suspect   Auto-assign contig as "s" (suspected haplotig) if this percentage or less of the
                contig is diploid level of coverage (DEFAULT = 80)

STEP 3

Run the purging pipeline. This script will automatically run a BEDTools windowed coverage analysis (if generating dotplots), and minimap2 alignments to assess which contigs to reassign and which to keep. The pipeline will make several iterations of purging. Optionally, parse repeats -r in BED format for improved handling of repetitive regions

USAGE:
purge_haplotigs  purge  -g genome.fasta  -c coverage_stats.csv

REQUIRED:
-g / -genome        Genome assembly in fasta format. Needs to be indexed with samtools faidx.
-c / -coverage      Contig by contig coverage stats csv file from the previous step.

OPTIONAL:
-t / -threads       Number of worker threads to use. DEFAULT = 4
-o / -outprefix     Prefix for the curated assembly. DEFAULT = "curated"
-r / -repeats       BED-format file of repeats to ignore during analysis.
-d / -dotplots      Generate dotplots for manual inspection.
-b / -bam           Samtools-indexed bam file of aligned and sorted reads/subreads to the
                    reference, required for generating dotplots.

ADVANCED:
-a / -align_cov     Percent cutoff for identifying a contig as a haplotig. DEFAULT = 70
-m / -max_match     Percent cutoff for identifying repetitive contigs. Ignored when 
                    using repeat annotations (-repeats). DEFAULT = 250
-I                  Minimap2 indexing, drop minimisers every N bases, DEFAULT = 4G
-v / -verbose       Print EVERYTHING.
-limit_io           Limit for I/O intensive jobs. DEFAULT = -threads
-wind_min           Min window size for BED coverage plots (for dotplots). DEFAULT = 5000
-wind_nmax          Max windows per contig for BED coverage plots (for dotplots). DEFAULT = 200

All done!

You will have five files

  • <prefix>.fasta: These are the curated primary contigs
  • <prefix>.haplotigs.fasta: These are all the haplotigs identified in the initial input assembly. The seq IDs will remain the same but the description field for the contigs will now show the contig "associations" and purging order, e.g.
>000000F_003 HAPLOTIG<--000000F_009_HAPLOTIG<--000000F_PRIMARY
  • <prefix>.artefacts.fasta: These are the very low/high coverage contigs (identified in STEP 2). NOTE: you'll probably have mitochondrial/chloroplast/etc. contigs in here with the assembly junk.
  • <prefix>.reassignments.tsv: These are all the reassignments that were made, as well as the suspect contigs that weren't reassigned.
  • <prefix>.contig_associations.log: This shows the contig "associations"/purging order, e.g
000000F,PRIMARY -> 000000F_005,HAPLOTIG
                -> 000000F_009,HAPLOTIG -> 000000F_003,HAPLOTIG
                -> 000000F_010,HAPLOTIG

If generating dotplots you will get two directories:

  • dotplots_unassigned_contigs: These are the dotplots for the suspect contigs that remain unassigned.
  • dotplots_reassigned_contigs: These are the dotplots for the contigs that were reassigned.

Further Curation

You can go through the dotplots and check the assignments. You might wish to move reassigned contigs back into the primary contig pool or purge contigs that were too ambiguous for automatic reassignment. Below are some examples of what might occur.

A haplotig - Hopefully most of your reassigned haplotigs will look like this.

Contig is mostly haplotig - This example has part of the contig with a diploid level of coverage and part at haploid level with poor alignment to either reference (possibly hemizygous region).

Haplotig with many tandem repeats

Haplotig is palindrome

Contig circling through string graph 'knots' - while this may be a valid string graph path it will likely still confound short read mapping to the haploid assembly.

Citation

The pipeline is now published at BMC Bioinformatics:

Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies

Updates

Read the latest updates HERE