Purity

Description
Purity uses a genetic algorithm to identify a set of informative genetic markers. Given a target number of markers, it searches for the set of markers that distinguishes the largest number of unique samples in the dataset. It uses the Jenetics and TASSEL5 Java libraries.
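
To make the idea concrete, here is a toy sketch of a genetic algorithm for marker selection. It is not purity's implementation and does not use Jenetics: the function names, the selection, crossover and mutation operators, and the default parameters are all assumptions chosen for illustration. Fitness here is simply the number of distinct genotype profiles over the chosen markers.

```python
import random

def count_distinct(genotypes, markers):
    """Number of distinct genotype profiles when only `markers` are observed."""
    return len({tuple(sample[m] for m in markers) for sample in genotypes})

def select_markers(genotypes, n_target, pop_size=30, generations=40, seed=0):
    """Toy GA (hypothetical, not purity's code): evolve sets of marker indices."""
    rng = random.Random(seed)
    n_markers = len(genotypes[0])

    def fitness(individual):
        return count_distinct(genotypes, individual)

    # Each individual is a sorted tuple of n_target distinct marker indices.
    population = [tuple(sorted(rng.sample(range(n_markers), n_target)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:max(2, pop_size // 2)]  # truncation selection
        children = [population[0]]                    # elitism: keep the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))            # crossover: union of parents
            child = set(rng.sample(pool, n_target))
            if rng.random() < 0.2:                    # mutation: swap one marker
                unused = [m for m in range(n_markers) if m not in child]
                if unused:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(unused))
            children.append(tuple(sorted(child)))
        population = children
    best = max(population, key=fitness)
    return list(best), fitness(best)
```

Calling `select_markers(genotypes, 2)` on a list of per-sample genotype lists returns the chosen marker indices and their score. The "Solution size" parameter of purity plays a role loosely similar to `pop_size` here, but that mapping is a guess.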

Requirements
To use purity, you must have Java 8 or newer. Check whether Java is already installed; if it is not, install it first.

Installation
Download purity_171218.zip, then extract it.

Testing in Windows
You can quickly test purity by double-clicking run.bat. It should produce two output files, a marker metafile and a subset HapMap file.

Running
Usage: java -jar <purity jar file> <No. of target markers> <Solution size> <Distance of duplicates> <Input HapMap file> <Output text file>

Example: java -jar purity_171218.jar 10 1000 0.05 test.hmp.txt test.out.txt

The output score is the number of samples with unique genotypes.
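
If you want to script runs, the command line above can be assembled from Python. This is only a convenience sketch: `build_purity_command` and `run_purity` are hypothetical helper names, not part of purity, and the argument order simply follows the usage line.

```python
import subprocess

def build_purity_command(jar, n_markers, solution_size, dup_distance,
                         hapmap_in, out_txt):
    """Return the argument list for one purity run (hypothetical helper)."""
    if not 0.0 <= dup_distance <= 1.0:
        raise ValueError("Distance of duplicates must be between 0.00 and 1.00")
    return ["java", "-jar", jar,
            str(int(n_markers)), str(int(solution_size)), str(dup_distance),
            hapmap_in, out_txt]

def run_purity(*args):
    """Run purity and raise if Java exits with a non-zero status."""
    return subprocess.run(build_purity_command(*args), check=True)
```

For the example above, `build_purity_command("purity_171218.jar", 10, 1000, 0.05, "test.hmp.txt", "test.out.txt")` gives the same arguments as the example command.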

PARAMETERS

<No. of target markers> - Integer
* Total number of markers to select from the given dataset.
<Solution size> - Integer
* Number of solutions to consider.
* A larger size yields better results but takes longer.
<Distance of duplicates> - Decimal (0.00-1.00)
* Distance threshold between two samples for treating them as duplicates.
* Set to 0.00 to treat only exact matches as duplicates.
<Input HapMap file> - HapMap file (.hmp.txt, .hmp.txt.gz)
* Dataset to choose markers from.
* Supports either IUPAC or diploid format.
<Output text file> - Text file
* File to write the output score and information on the current best set of markers.
* The score is the number of uniquely identified samples.
* Also outputs a subset HapMap file.
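
A rough sketch of how the duplicate distance and the score could interact. Everything here is an assumption made for illustration: the distance is taken as the fraction of mismatching calls, missing calls are coded as `N` and skipped, and a sample counts toward the score only if no other sample lies within the threshold. Purity may compute these differently.

```python
def mismatch_distance(a, b):
    """Fraction of compared markers where two samples differ (assumed metric)."""
    compared = [(x, y) for x, y in zip(a, b) if x != "N" and y != "N"]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)

def unique_sample_score(genotypes, threshold=0.0):
    """Count samples with no other sample within `threshold` distance."""
    score = 0
    for i, a in enumerate(genotypes):
        has_duplicate = any(
            mismatch_distance(a, b) <= threshold
            for j, b in enumerate(genotypes) if j != i
        )
        if not has_duplicate:
            score += 1
    return score
```

With a threshold of 0.00 only identical samples are treated as duplicates; raising it makes near-identical samples count as duplicates too, which lowers the score.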

Contact

For any questions, you may contact me at ignacio.8@buckeyemail.osu.edu.

References
Jenetics. 2017, Franz Wilhelmstötter.
TASSEL-5-SOURCE. 2007, Bradbury et al.

Citation

J.C.I. Ignacio, Purity: a genetic algorithm approach to identify informative genetic markers, (2019), Bitbucket repository, https://bitbucket.org/jcignacio/purity/wiki/Home

Bibtex:

@misc{Ignacio2019,
  author = {Ignacio, J.C.I.},
  title = {Purity: a genetic algorithm approach to identify informative genetic markers},
  year = {2019},
  publisher = {Bitbucket},
  journal = {Bitbucket repository},
  howpublished = {\url{https://bitbucket.org/jcignacio/purity/wiki/Home}},
  commit = {60de1e7704609d0e4b9167e4fc8300697caba940}
}

Old notes from when we developed a QC panel for rice

Filtering the input dataset:
I suggest keeping only homozygous genotypes, then filtering by MAF (0.10-0.25) and then by LD (R-square of 0.9). I usually do this in PLINK, then convert to HapMap using TASSEL.
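
For illustration only, the homozygous-only MAF step might look like this in Python. It is not a replacement for PLINK and leaves out the LD step. The function names are hypothetical, calls are assumed to be IUPAC single letters, and 0.10 is used as the minimum MAF (reading 0.10-0.25 as the range of cutoffs to choose from, which is an assumption).

```python
from collections import Counter

HOMOZYGOUS = {"A", "C", "G", "T"}  # IUPAC codes for homozygous calls

def minor_allele_frequency(calls):
    """MAF from homozygous calls only; heterozygous and missing codes are ignored."""
    counts = Counter(c for c in calls if c in HOMOZYGOUS)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # monomorphic or no usable calls
    # The minor allele is taken as the second most common allele.
    return counts.most_common(2)[1][1] / total

def passes_maf(calls, min_maf=0.10):
    """Keep a marker only if its MAF reaches the chosen minimum."""
    return minor_allele_frequency(calls) >= min_maf
```

A marker with eight `A` calls and two `G` calls has a MAF of 0.2 and would be kept; a monomorphic marker would be dropped.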

-- Further notes on requirements --

A genotyping dataset that is representative of the panel you want to perform QC on. Examples are:
- Skim sequencing of parents, to perform QC on breeding populations; or
- GBS data of a diverse panel from a genebank, for a global QC panel
Efficient filtering of the genotypes, such as:
- Flanking sequences of candidate SNPs should be specific (I have a script for this)
- If heterozygous calls are reliable, keep them; otherwise set them to missing
- Using PLINK filtering:
  - Relatively high MAF, > 0.10
  - High call rate, > 0.90
  - Linkage disequilibrium of at most 0.80
Optimizing the number of target SNPs to select by running purity (the marker selection program), increasing the target number on each run until the score plateaus
Running purity with the optimized number of target SNPs, considering a very large number of solutions (up to 100k) at a time; this is computationally intensive and took me 3-4 days
Getting the flanking sequences of the resulting SNPs for submission to LGC
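
The "increase the target number until the score plateaus" step can be sketched as a small loop. This is a hedged illustration: `find_plateau` is a hypothetical helper, `score_for` stands for whatever you use to run purity for a given target number and read back its score, and the step size and patience values are arbitrary.

```python
def find_plateau(score_for, start=5, step=5, max_markers=100, patience=2):
    """Increase the target number until the score stops improving.

    `score_for(n)` is a user-supplied callable (assumption) that returns the
    score obtained with n target markers.
    """
    best_n, best_score = start, score_for(start)
    stalled = 0
    n = start + step
    while n <= max_markers and stalled < patience:
        score = score_for(n)
        if score > best_score:
            best_n, best_score = n, score
            stalled = 0
        else:
            stalled += 1  # no improvement at this target number
        n += step
    return best_n, best_score
```

The function returns the smallest target number that reached the best score seen, which is a reasonable candidate for the final long run.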
