SubpopSpecificMarkers

SubpopSpecificMarkers identifies a subset of markers that aims to maximize differences among groups of samples and minimize differences within groups. It is useful for selecting markers that discriminate subpopulations.

Requirements

Requires Java 13 or newer. Check whether Java is already installed (for example, by running java -version); otherwise, you can install it from here.

Installation

Through git clone

git clone https://jcignacio@bitbucket.org/jcignacio/purity.git

Then go to the directory

purity/builds/SubpopSpecificMarkers_201201

OR

Download SubpopSpecificMarkers_201201.zip, then extract it.

Usage

java -jar ssm.jar <No. of target markers> <Population size> <Distance between duplicate samples> <Input HapMap file> <Output file> <Input grouping file>

Example: java -jar ssm.jar 15 10000 0.0 rice_sp.hmp.txt out.txt groups.txt
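When the tool is run from a script, the positional arguments are easy to put in the wrong order. The following is a minimal Python sketch that assembles the command shown above. The helper name build_ssm_command is hypothetical and not part of the tool, and the sketch assumes ssm.jar is in the working directory and java is on the PATH.

```python
def build_ssm_command(n_markers, pop_size, dup_distance, hapmap, output, groups,
                      jar="ssm.jar"):
    """Assemble the argument list in the positional order given under Usage.

    Hypothetical helper for illustration; it only builds the command.
    """
    if not 0.0 <= dup_distance <= 1.0:
        raise ValueError("distance between duplicate samples must be in [0, 1]")
    return [
        "java", "-jar", jar,
        str(int(n_markers)),       # number of target markers, N
        str(int(pop_size)),        # population size
        str(float(dup_distance)),  # distance between duplicate samples
        hapmap,                    # input HapMap file
        output,                    # output file
        groups,                    # input grouping file
    ]


cmd = build_ssm_command(15, 10000, 0.0, "rice_sp.hmp.txt", "out.txt", "groups.txt")
print(" ".join(cmd))
# To actually run it (needs Java and the jar):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting problems when file names contain spaces.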

| Parameter | Description |
|---|---|
| Number of target markers, N | (integer) Number of markers to select. |
| Population size | (integer) Number of solutions to consider at a time. Higher values give better results but use more RAM. |
| Distance between duplicate samples | (decimal, 0 to 1) Genetic distance threshold for treating samples as duplicates. Set to 0 to require an exact match; set it higher to get more polymorphic markers between samples. |
| Input HapMap file | (file) HapMap file (e.g. hmp.txt) from which markers are picked. |
| Output file | (file) Output file, e.g. out.txt. |
| Input grouping file | (file) Grouping of samples, formatted as a tab-delimited TASSEL trait file. Groups must be numeric; strings are not supported yet. |

Example of a grouping file with 6 samples and 3 groups.

<Phenotype> 
taxa    factor
taxa    group
sample1 1
sample2 1
sample3 2
sample4 3
sample5 1
sample6 3
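A grouping file like the one above can be generated from a script. The sketch below is a minimal Python example and not part of the tool: the function name format_grouping_file is hypothetical, the three header lines are copied from the example, columns are assumed to be separated by a single tab, and "numerical" is assumed to mean integer group labels.

```python
def format_grouping_file(groups):
    """Return grouping-file text for a {sample name: integer group} mapping.

    Hypothetical helper; header lines mirror the example grouping file.
    """
    lines = ["<Phenotype>", "taxa\tfactor", "taxa\tgroup"]
    for sample, group in groups.items():
        # Reject string labels (and booleans) early: groups must be numeric.
        if isinstance(group, bool) or not isinstance(group, int):
            raise ValueError(f"group for {sample!r} must be an integer, got {group!r}")
        lines.append(f"{sample}\t{group}")
    return "\n".join(lines) + "\n"


example_groups = {
    "sample1": 1, "sample2": 1, "sample3": 2,
    "sample4": 3, "sample5": 1, "sample6": 3,
}
grouping_text = format_grouping_file(example_groups)

# Write the text to disk when needed:
#   with open("groups.txt", "w") as handle:
#       handle.write(grouping_text)
```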

| Results and outputs | Description |
|---|---|
| Score ([m1,m2,..mN --> x]) | x = s1 + s2 * 2. A lower score means better discrimination of groups. |
| s1 | "Same genotype with different group": the sum, over the unique genotypes i of duplicated samples, of (# of samples with genotype i) * (# of groups that the samples with genotype i belong to - 1). |
| s2 | "Same group with different genotype": the sum, over the groups i, of (# of unique genotypes in group i) - 1. |
| Text file | Contains some information on the selected markers. |
| Distance matrix (csv) | Comma-separated distance matrix generated from the selected markers. |
| HapMap file | Subset of the input HapMap file containing only the selected markers. |
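To make the score definition concrete, here is a minimal Python sketch that computes s1, s2 and x from the descriptions above. It is an illustration, not the tool's implementation: the function name ssm_score and the genotype strings are hypothetical, and samples are treated as duplicates only when their genotypes match exactly (the distance threshold set to 0).

```python
from collections import defaultdict


def ssm_score(genotypes, groups):
    """Return (s1, s2, x) for {sample: genotype} and {sample: group} mappings.

    Assumes exact-match duplicates; the distance threshold is not modeled.
    """
    count_by_genotype = defaultdict(int)
    groups_by_genotype = defaultdict(set)
    genotypes_by_group = defaultdict(set)
    for sample, genotype in genotypes.items():
        group = groups[sample]
        count_by_genotype[genotype] += 1
        groups_by_genotype[genotype].add(group)
        genotypes_by_group[group].add(genotype)

    # s1: same genotype, different group. A genotype carried by a single
    # sample spans one group and contributes 0, so summing over all unique
    # genotypes equals summing over those shared by duplicated samples.
    s1 = sum(count * (len(groups_by_genotype[g]) - 1)
             for g, count in count_by_genotype.items())
    # s2: same group, different genotype.
    s2 = sum(len(gs) - 1 for gs in genotypes_by_group.values())
    return s1, s2, s1 + 2 * s2


# Hypothetical genotypes over the selected markers for six samples.
genotypes = {"sample1": "AA", "sample2": "AA", "sample3": "CC",
             "sample4": "GG", "sample5": "AC", "sample6": "CC"}
groups = {"sample1": 1, "sample2": 1, "sample3": 2,
          "sample4": 3, "sample5": 1, "sample6": 3}
print(ssm_score(genotypes, groups))  # (2, 2, 6)
```

In this example, genotype CC is shared by two samples from groups 2 and 3, giving s1 = 2 * (2 - 1) = 2. Groups 1 and 3 each contain two unique genotypes, giving s2 = 1 + 0 + 1 = 2, so x = 2 + 2 * 2 = 6.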
