Improved sequencing technology is rapidly driving down sequencing costs, but the bioinformatics challenges of processing data and inferring genotypes have grown. To address this, we have developed a general, graph-based computational framework called the Practical Haplotype Graph (PHG), which can be used with a variety of skim sequencing methods to infer high-density genotypes directly from low-coverage sequence. The idea behind the PHG is that, in a given breeding program, all parental genotypes can be sequenced at high coverage and loaded as parental haplotypes in a relational database. Progeny can then be sequenced at low coverage, and those reads used to infer which parental haplotypes/genotypes from the database are most likely present in a given progeny individual.
The PHG is a trellis graph-based representation of genic and intergenic regions (called reference ranges) which represent diversity across and between taxa. It can be used to create custom genomes for alignment, call rare alleles, impute genotypes, and efficiently store genomic data from many lines (i.e. reference, assemblies, and other lines). Skim sequences generated for a given taxon are aligned to the graph to identify the haplotype node at a given anchor. All the anchors for a given taxon are processed through a Hidden Markov Model (HMM) to identify the most likely path through the graph. Path information is then used to identify variants (SNPs). Low-cost sequencing technologies, coupled with the PHG, facilitate the genotyping of large numbers of samples to increase the size of training populations for genomic selection models. This can in turn increase predictive accuracy and selection intensity in a breeding program.
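The path-finding idea can be illustrated with a small Viterbi sketch. This is a textbook HMM under simplifying assumptions, not the PHG's own implementation: the function name `viterbi_path`, the parameters `switch_prob` and `error_rate`, and the read-count input layout are all hypothetical and chosen only for illustration. Each hidden state is one parental haplotype, each step is one reference range, and the observation is the number of skim reads that hit each haplotype in that range.

```python
import math


def viterbi_path(read_counts, switch_prob=0.01, error_rate=0.05):
    """Return the most likely haplotype index for each reference range.

    read_counts[i][h] is the number of reads in reference range i that
    matched haplotype h (hypothetical input layout). At least two
    haplotypes are required.
    """
    n_hap = len(read_counts[0])
    # Transition model: stay on the same haplotype, or switch to any other
    # haplotype with equal probability (a stand-in for recombination).
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (n_hap - 1))

    def log_emission(counts, h):
        # Reads hitting h are treated as correct; reads hitting any other
        # haplotype are treated as alignment or sequencing errors.
        hits = counts[h]
        misses = sum(counts) - hits
        return (hits * math.log(1.0 - error_rate)
                + misses * math.log(error_rate / (n_hap - 1)))

    # Uniform prior over haplotypes at the first reference range.
    score = [log_emission(read_counts[0], h) - math.log(n_hap)
             for h in range(n_hap)]
    back = []  # back[i][h] = best predecessor of h at range i + 1

    for counts in read_counts[1:]:
        new_score, pointers = [], []
        for h in range(n_hap):
            best_prev = max(
                range(n_hap),
                key=lambda p: score[p] + (log_stay if p == h else log_switch),
            )
            trans = log_stay if best_prev == h else log_switch
            new_score.append(score[best_prev] + trans + log_emission(counts, h))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Three parental haplotypes, six reference ranges of skim-read hits.
counts = [[3, 0, 0], [2, 1, 0], [1, 0, 0], [0, 4, 0], [0, 3, 0], [0, 2, 1]]
print(viterbi_path(counts))
```

In this toy input the reads support haplotype 0 for the first three ranges and haplotype 1 afterwards, so the path switches once. The small `switch_prob` is what lets a single stray read be absorbed as an error rather than forcing a change of haplotype.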
A more detailed introduction can be seen in Peter Bradbury's PHG presentation slides from the 2018 Plant and Animal Genome conference (PAG).
Previous version of the pipeline
This documentation describes the current version of the PHG (maizegenetics/phg:0.0.21 and later). The older version of the pipeline is documented here. The older version consists of a series of scripts that should still work but are more complicated to use.
How to use the PHG
A Practical Haplotype Graph can be used for many types of data processing and analysis. You can build a PHG from assembly-quality genomes or whole-genome resequencing (WGS) data and use it to impute either variants or haplotypes, both for taxa within the database and for new taxa that are not included in it. A non-exhaustive list of possible use cases includes:
- I want to populate a db using WGS
- I want to populate a db using Assemblies
- I want to create consensus haplotypes
- I have skim sequencing and I want variant information
- I have sparse SNPs and I want to impute more SNPs
- I have a bunch of old/varied sequencing data and I want to merge them to get one set of variant calls on the same set of loci
- I want to phase my heterozygous material by finding paths through the db
- I want to phase the assemblies I am putting into the db
- I want haplotype information so I can choose populations to fine map traits
- I want haplotype information for association analysis and/or GS
- I want to understand recombination/haplotype formation in my population/species
- I want to identify ancestral haplotypes & study haplotype quantity/diversity
- I want to study rare haplotypes
- I have some material and want to figure out what its parents are
- I have some material and want to figure out what it is similar/related to
- I want to do chromosome painting
- I want to check the quality of some assemblies
The data needed for any of these use cases can be produced through one of two pipelines. Users can either create and populate a new PHG database, or can download and use an existing PHG database to impute variant or haplotype information. Wrapper scripts that run the whole pipeline are available, or you can follow the decision flow charts to move through each pipeline on your own. More information about each step can be found by clicking the link associated with that step.
Download and install the PHG Docker
The PHG code can be downloaded and run with either Docker, Singularity, or Conda. Click on the Step 0 image to learn more.
Create a PHG database and populate it with haplotypes
The first step is building an empty PHG database and optionally splitting reference range intervals into different groups of interest. The second step is adding haplotypes (from assemblies, WGS, or both) to that database, with an optional consensus building step. These steps can be run simultaneously. Click on the image to learn more.
Use an existing PHG database to impute variants or haplotypes
Once you have a PHG database with haplotypes, the next step is to use that database to impute variants or haplotypes for new taxa. If you have not already downloaded the PHG docker, follow Step 0 above. You may also need to update the PHG database (Step 2.5) before moving on to Step 3. The pipeline plugins (MakeInitialPHGDBPipelinePlugin, PopulatePHGDBPipelinePlugin, and ImputePipelinePlugin) always run this database update check as their first step, so if you use those plugins you do not need to run Step 2.5. You can also skip Step 2.5 if you have just completed Step 2. If you are running one of the component plugins outside the pipeline and want to make sure that your database is in sync with the current version of the PHG software, run Step 2.5.
Step 2.5 (optional, see above): Update PHG database schema
PHG config files
The current version of the pipeline expects almost all parameters to be set in a config file. While it is possible to set parameters from the command line for individual plugins, the number of parameters makes doing so impractical. In addition, the pipeline plugins call other plugins in turn, so when you use the pipeline plugins some parameters must be specified in a config file. Because of the number of parameters, we provide two separate sample config files: one for Steps 1 and 2 (creating a database and populating it with haplotypes) and one for Step 3 (imputing variants). See those pages for details.
PHG key files
Different portions of the PHG code require one of three types of key file. They are as follows:
- Assemblies key file - associates chromosome fastas for assembly genomes with the name you want that taxon to have within the PHG database. The file can include information for a single assembly or for multiple assemblies.
- Haplotype key file - associates specific WGS files with the name you want that taxon to have within the PHG database.
- Pathfinding key file - associates paired-end read files for the pathfinding step.
A small example database for testing the PHG can be found on the Example Database page.
PHG Application Programming Interface (API)
PHG Database Schema
Please use Biostars to ask for help. Instructions for using Biostars are here
The PHG is under active development. If you find a bug, please submit a pull request so that we can address it.
Other ways to use the PHG
The rPHG package allows users to explore PHG databases in R.
The rTASSEL package allows users to run TASSEL through R.
Wheat PHG Hackathon (February 24-28, 2020): Cornell University - Ithaca
Wheat CAP PHG workshop (July 8-12, 2019): Cornell University - Ithaca
PHG Workshop (August 17-18, 2018): IRRI - Philippines
PHG Workshop (June 4-8, 2018): Cornell University - Ithaca
PHG @ PAG (January 13-17, 2018): Plant and Animal Genome XXVI Conference - San Diego