Frequency spectrum format
SNP data format
Frequency spectra are created from a data dictionary using either the method
Spectrum.from_data_dict_corrected. For user convenience, the method
Misc.make_data_dict reads a specific SNP file format to generate an appropriate data dictionary. That SNP file format is described here. An example can be found in the
examples/fs_from_data/data.txt file included with the ∂a∂i source distribution.
The data file begins with any number of comment lines that being with
#. The first parsed line is a column header line. No spaces are allowed within any entry in the table. Individual rows make be commented out using
The first column should contain the ingroup reference sequence at that SNP, including the flanking bases. If the flanking bases are unknown, they can be denoted by
-. The header label is arbitrary.
The second column should contain the aligned outgroup reference sequence at that SNP, including the flanking bases. Unknown entries can be denoted by
-. The header label is arbitrary.
The third column gives the first segregating allele. The column header must be exactly
Then follows an arbitrary number of columns, one for each population, each giving the number of times Allele1 was observed in that population. The header for each column should be the population identifier.
The next column gives the second segregating allele. The column header must be exactly
Then follows one column for each population, each giving the number of times Allele2 was observed in that population. The header for each column should be the population identifier, and the columns should be in the same order as for the Allele1 entries.
Then follows an arbitrary number of columns which will be concatenated with
_ to assign a label for each SNP.
The following table gives an example for data from two populations, YRI and CEU.
Allele2 headers must be exactly those values because the number of columns between those two is used to infer the number of populations in the file.
Spectrum.from_data_dict_corrected polarizes the SNPs using outgroup information and applies a statistical correction for multiple mutations described in Hernandez et al. (2007). Any SNPs without full trinucleotide ingroup and outgroup sequences will be ignored, as well as SNPs in which the flanking bases are not conserved between ingroup and outgroup, or in which the outgroup allele is not one of the segregating alleles. The correction uses the expected number of substitutions per site, the trinucleotide mutation rate matrix, and a stationary trinucleotide distribution. These are summarized in a table of misidentification probabilities that can be calculated using
Misc.make_fux_table. (It should also be possible to develop a correction using only the single-site transition matrix. If this would be helpful, please contact the developers of ∂a∂i.)
Spectrum.from_data_dict can work with or without outgroup information. If the method parameter
True, SNPs are ignored for which the outgroup allele is missing, not one of the two segregating alleles, or is
-. If the method parameter
False, any outgroup information is ignored, and the returned
Spectrum is folded.
Note that the total number of calls to
Allele2 in a given population may not be the same for each SNP. When constructing the
Spectrum each SNP will be projected down to the requested number of samples in each population. (Note that SNPs cannot be projected up, so SNPs without enough calls in any population will be ignored.)