Wiki

Clone wiki

dadi / DataFormats

Frequency spectrum format

See help(dadi.Spectrum.to_file).

SNP data format

Frequency spectra are created from a data dictionary using either the method Spectrum.from_data_dict or Spectrum.from_data_dict_corrected. For user convenience, the method Misc.make_data_dict reads a specific SNP file format to generate an appropriate data dictionary. That SNP file format is described here. An example can be found in the examples/fs_from_data/data.txt file included with the ∂a∂i source distribution.

Details

The data file begins with any number of comment lines that being with #. The first parsed line is a column header line. No spaces are allowed within any entry in the table. Individual rows make be commented out using #.

The first column should contain the ingroup reference sequence at that SNP, including the flanking bases. If the flanking bases are unknown, they can be denoted by -. The header label is arbitrary.

The second column should contain the aligned outgroup reference sequence at that SNP, including the flanking bases. Unknown entries can be denoted by -. The header label is arbitrary.

The third column gives the first segregating allele. The column header must be exactly Allele1.

Then follows an arbitrary number of columns, one for each population, each giving the number of times Allele1 was observed in that population. The header for each column should be the population identifier.

The next column gives the second segregating allele. The column header must be exactly Allele2.

Then follows one column for each population, each giving the number of times Allele2 was observed in that population. The header for each column should be the population identifier, and the columns should be in the same order as for the Allele1 entries.

Then follows an arbitrary number of columns which will be concatenated with _ to assign a label for each SNP.

Example

The following table gives an example for data from two populations, YRI and CEU.

Human Chimp Allele1 YRI CEU Allele2 YRI CEU Gene Position
ACG ATG C 29 24 T 1 0 abcb1 289
CCT CCT C 29 23 G 3 2 abcb1 345
... ... ... ... ... ... ... ... ... ...
ATT ATT C 14 10 T 15 13 xrcc5 101569

Notes

The Allele1 and Allele2 headers must be exactly those values because the number of columns between those two is used to infer the number of populations in the file.

The method Spectrum.from_data_dict_corrected polarizes the SNPs using outgroup information and applies a statistical correction for multiple mutations described in Hernandez et al. (2007). Any SNPs without full trinucleotide ingroup and outgroup sequences will be ignored, as well as SNPs in which the flanking bases are not conserved between ingroup and outgroup, or in which the outgroup allele is not one of the segregating alleles. The correction uses the expected number of substitutions per site, the trinucleotide mutation rate matrix, and a stationary trinucleotide distribution. These are summarized in a table of misidentification probabilities that can be calculated using Misc.make_fux_table. (It should also be possible to develop a correction using only the single-site transition matrix. If this would be helpful, please contact the developers of ∂a∂i.)

The method Spectrum.from_data_dict can work with or without outgroup information. If the method parameter polarized is True, SNPs are ignored for which the outgroup allele is missing, not one of the two segregating alleles, or is -. If the method parameter polarized is False, any outgroup information is ignored, and the returned Spectrum is folded.

Note that the total number of calls to Allele1 and Allele2 in a given population may not be the same for each SNP. When constructing the Spectrum each SNP will be projected down to the requested number of samples in each population. (Note that SNPs cannot be projected up, so SNPs without enough calls in any population will be ignored.)

Updated