Purity

Description
Purity is a genetic algorithm approach for identifying a set of informative genetic markers. Given a target number of markers, it tries to find the marker set that uniquely identifies the largest number of samples in the dataset. It uses the Jenetics and TASSEL5 Java libraries.

Requirements
To use purity, you must have Java 8 or newer. Check whether Java is already installed (for example, by running java -version in a terminal); if it is not, install it first.

Installation
Download purity_171218.zip, then extract it.

Testing in Windows
You can quickly test purity by double-clicking run.bat. It should produce two output files, a marker metafile and a subset HapMap file.

Running
Usage: java -jar <purity jar file> <No. of target markers> <Solution size> <Distance of duplicates> <Input HapMap file> <Output text file>

Example: java -jar purity_171218.jar 10 1000 0.05 test.hmp.txt test.out.txt

The output score is the number of samples with unique genotypes.
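To make the score concrete, here is a minimal Python sketch that counts the samples whose genotype profile over a chosen marker set is shared with no other sample. It is only an illustration, not purity's actual code: the function name, the dict-of-lists data layout, and the reading of "unique" as "shared with no other sample" are all assumptions.

```python
from collections import Counter

def count_unique_samples(genotypes, marker_idx):
    """Count samples whose calls at marker_idx are shared with no other sample.

    genotypes: dict of sample name -> list of genotype calls (hypothetical layout).
    marker_idx: indices of the selected markers.
    """
    # Reduce each sample to its profile over the selected markers only.
    profiles = {s: tuple(calls[i] for i in marker_idx)
                for s, calls in genotypes.items()}
    # A profile that occurs exactly once identifies its sample uniquely.
    seen = Counter(profiles.values())
    return sum(1 for p in profiles.values() if seen[p] == 1)

# Toy data: three samples, three markers.
toy = {
    "S1": ["A", "C", "G"],
    "S2": ["A", "C", "T"],
    "S3": ["A", "T", "T"],
}
print(count_unique_samples(toy, [0]))     # prints 0
print(count_unique_samples(toy, [1, 2]))  # prints 3
```

In this toy case marker 0 is monomorphic and separates nothing, while markers 1 and 2 together tell all three samples apart.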

PARAMETERS

  • <No. of target markers> - Integer
    * Total number of markers to select from the given dataset.

  • <Solution size> - Integer
    * Number of solutions to consider.
    * A larger size yields better results but takes longer.

  • <Distance of duplicates> - Decimal (0.00-1.00)
    * Minimum distance between two samples before they are considered duplicates.
    * Set to 0.00 for exact match.

  • <Input HapMap file> - HapMap file (.hmp.txt, .hmp.txt.gz)
    * Dataset to choose markers from.
    * Supports either IUPAC or diploid format.

  • <Output text file> - Text file
    * File to which the output score and information on the current best set of markers are written.
    * The score is the number of uniquely identified samples.
    * Also outputs a subset HapMap file.
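The <Distance of duplicates> parameter can be pictured with a small Python sketch. It assumes that the distance between two samples is the proportion of selected markers at which their calls differ, and that two samples count as duplicates when that distance is at or below the threshold. Both points are assumptions made for illustration; purity's own distance measure may differ, and the function names are hypothetical.

```python
def distance(a, b):
    """Proportion of markers at which two equal-length call lists differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def count_unique(genotypes, threshold):
    """Count samples with no other sample within `threshold` distance."""
    names = list(genotypes)
    unique = 0
    for name in names:
        others = (genotypes[m] for m in names if m != name)
        # A sample is unique only if every other sample is farther than the threshold.
        if all(distance(genotypes[name], o) > threshold for o in others):
            unique += 1
    return unique

toy = {
    "S1": list("ACGTACGTAC"),
    "S2": list("ACGTACGTAA"),  # differs from S1 at one marker out of ten
    "S3": list("TTTTTTTTTT"),
}
print(count_unique(toy, 0.00))  # prints 3
print(count_unique(toy, 0.20))  # prints 1
```

With a threshold of 0.00 only exact matches are duplicates, so all three toy samples are unique. Raising it to 0.20 merges S1 and S2, leaving only S3.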

Other Features

MaximumTaxaDiversity

MaximumTaxaDiversity tries to identify a subset of samples that maximizes the heterozygosity of a marker set. It is useful for selecting samples for marker validation, to increase the representation of heterozygous genotype calls.
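One possible way to picture this objective is sketched below in Python. It assumes IUPAC single-letter calls, scores a sample subset by the number of markers with at least one heterozygous call in the subset, and finds the best subset by exhaustive search rather than a genetic algorithm. The scoring rule and function names are assumptions for illustration, not the tool's documented behavior.

```python
from itertools import combinations

# IUPAC codes that denote a two-allele (heterozygous) SNP call.
HET_CODES = set("RYSWKM")

def het_coverage(genotypes, samples):
    """Number of markers with at least one heterozygous call among `samples`."""
    n_markers = len(next(iter(genotypes.values())))
    return sum(
        any(genotypes[s][i] in HET_CODES for s in samples)
        for i in range(n_markers)
    )

def best_subset(genotypes, k):
    """Exhaustive search over all k-sample subsets (only feasible for toy data)."""
    return max(combinations(sorted(genotypes), k),
               key=lambda subset: het_coverage(genotypes, subset))

toy = {
    "S1": ["A", "R", "C"],
    "S2": ["M", "C", "C"],
    "S3": ["A", "R", "Y"],
    "S4": ["A", "C", "C"],
}
print(best_subset(toy, 2))  # prints ('S2', 'S3')
```

Exhaustive search grows combinatorially with the number of samples, which is why a heuristic such as a genetic algorithm is the practical choice on real datasets.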

NPolymorphic

NPolymorphic works exactly like purity but uses a specified minimum number of polymorphic markers to discriminate between samples instead of genetic distance.

SubpopSpecificMarkers

SubpopSpecificMarkers tries to identify a subset of markers that maximizes differences among groups of samples and minimizes differences within groups. It is useful for selecting markers to discriminate subpopulations.
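As a rough sketch of what "maximize differences among groups, minimize differences within groups" could mean for a single marker, the Python function below subtracts the within-group mismatch rate from the between-group mismatch rate. The scoring formula, names, and data layout are assumptions for illustration only.

```python
from itertools import combinations

def marker_separation(calls, groups):
    """Between-group mismatch rate minus within-group mismatch rate for one marker.

    calls:  dict of sample name -> genotype call at this marker (hypothetical layout).
    groups: dict of sample name -> group label.
    """
    between = within = n_between = n_within = 0
    for a, b in combinations(sorted(calls), 2):
        differ = calls[a] != calls[b]
        if groups[a] == groups[b]:
            n_within += 1
            within += differ
        else:
            n_between += 1
            between += differ
    between_rate = between / n_between if n_between else 0.0
    within_rate = within / n_within if n_within else 0.0
    return between_rate - within_rate

groups = {"S1": "g1", "S2": "g1", "S3": "g2", "S4": "g2"}
good = {"S1": "A", "S2": "A", "S3": "G", "S4": "G"}  # fixed difference between groups
poor = {"S1": "A", "S2": "G", "S3": "A", "S4": "G"}  # varies within groups instead
print(marker_separation(good, groups))  # prints 1.0
print(marker_separation(poor, groups))  # prints -0.5
```

A marker scoring near 1.0 separates the groups cleanly, while a score at or below zero means it varies at least as much within groups as between them.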

SearchBackupMarkers

SearchBackupMarkers selects additional informative, highly polymorphic markers to add to an existing set of markers. It is useful for replacing poor-quality markers in a marker set.

Contact

For any questions, you may contact me at ignacio.8@buckeyemail.osu.edu.

References
Jenetics. 2017, Franz Wilhelmstötter.
TASSEL-5-SOURCE. 2007, Bradbury et al.

Citation

J.C.I. Ignacio, Purity: a genetic algorithm approach to identify informative genetic markers, (2019), Bitbucket repository, https://bitbucket.org/jcignacio/purity/wiki/Home

Bibtex:

@misc{Ignacio2019,
  author = {Ignacio, J.C.I.},
  title = {Purity: a genetic algorithm approach to identify informative genetic markers},
  year = {2019},
  publisher = {Bitbucket},
  journal = {Bitbucket repository},
  howpublished = {\url{https://bitbucket.org/jcignacio/purity/wiki/Home}},
  commit = {60de1e7704609d0e4b9167e4fc8300697caba940}
}

Old notes from when we developed a QC panel for rice

Filtering the input dataset:
I suggest keeping only homozygous genotypes, then filtering by MAF (0.10-0.25) and then by LD (R-squared of 0.9). I usually do this in PLINK and then convert the result to HapMap format using TASSEL.
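To illustrate the MAF step only, here is a small Python sketch that computes the minor allele frequency of each marker from homozygous calls and keeps markers in the 0.10-0.25 range. It is a stand-in for what PLINK does, not my actual workflow; it assumes single-letter IUPAC calls and biallelic markers, treats every non-homozygous call as missing, and does not cover LD filtering. The function names and data layout are hypothetical.

```python
from collections import Counter

HOMOZYGOUS = set("ACGT")  # single-letter calls that denote homozygous genotypes

def minor_allele_frequency(calls):
    """MAF of one marker from homozygous calls only; other calls count as missing."""
    counts = Counter(c for c in calls if c in HOMOZYGOUS)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no usable calls, or monomorphic
    # The second most common allele is taken as the minor allele.
    return counts.most_common(2)[1][1] / total

def filter_by_maf(markers, low=0.10, high=0.25):
    """Names of markers whose MAF lies within [low, high]."""
    return [name for name, calls in markers.items()
            if low <= minor_allele_frequency(calls) <= high]

toy = {
    "m1": ["A"] * 8 + ["G"] * 2,    # MAF 0.20, kept
    "m2": ["A"] * 5 + ["C"] * 5,    # MAF 0.50, too common
    "m3": ["A"] * 10,               # monomorphic
    "m4": ["A"] * 8 + ["G", "R"],   # heterozygous R ignored, MAF 1/9, kept
}
print(filter_by_maf(toy))  # prints ['m1', 'm4']
```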

-- Further notes on requirements --

  • A genotyping dataset that is representative of the panel you want to perform QC on. Examples are:
    • Skim sequencing of parents, to perform QC on breeding populations; or
    • GBS data of a diverse panel from a genebank, for a global QC panel
  • Efficient filtering of the genotypes such as:
    • Flanking sequences of candidate SNPs should be specific (I have a script for this)
    • If heterozygous calls are reliable, keep them, otherwise set them to missing
    • Using PLINK filtering:
      • Relatively high MAF, > 0.10
      • High call rate, > 0.90
      • Linkage Disequilibrium of at most 0.80
    • Optimizing the number of target SNPs to select by running purity (the marker selection program) repeatedly, increasing the target number on each run until the score plateaus
  • Running purity with the optimized number of target SNPs while considering a very high number of solutions (up to 100k) at a time; this is computationally intensive and took me 3-4 days
  • Getting the flanking sequences of the resulting SNPs for submission to LGC
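The "increase the target number until the score plateaus" step above can be sketched in Python as follows. The command builder simply follows the usage line from the Running section. The plateau search takes a caller-supplied score_for function, because the layout of the output text file is not described here and I do not want to guess at how to parse it. The names find_plateau, score_for, and patience are hypothetical.

```python
def purity_command(jar, n_markers, solution_size, dup_distance, hapmap_in, out_txt):
    """Argument list in the order given by the usage line."""
    return ["java", "-jar", jar, str(n_markers), str(solution_size),
            str(dup_distance), hapmap_in, out_txt]

def find_plateau(score_for, start, step, max_markers, patience=2):
    """Raise the target marker count until the score stops improving.

    score_for: callable taking a target count and returning the score
               (e.g. by running purity and reading its output; supplied by the caller).
    patience:  number of consecutive non-improving runs tolerated before stopping.
    Returns (best_target_count, best_score).
    """
    best_n, best_score, stale = None, float("-inf"), 0
    n = start
    while n <= max_markers and stale < patience:
        score = score_for(n)
        if score > best_score:
            best_n, best_score, stale = n, score, 0
        else:
            stale += 1
        n += step
    return best_n, best_score

# Fake score function standing in for real purity runs: saturates at 95 samples.
fake_score = lambda n: min(n * 10, 95)
print(find_plateau(fake_score, start=5, step=5, max_markers=50))  # prints (10, 95)
print(purity_command("purity_171218.jar", 10, 1000, 0.05,
                     "test.hmp.txt", "test.out.txt"))
```

In a real run, score_for would launch the command with subprocess.run and read the score from the output file. A small solution size is probably enough for this scan, leaving the very large one for the final run.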
