FAST / InputFileFormats
File format for Input Genotype Data (mode=genotype)
Tip: To ensure proper running, please check that all input files are tab–delimited.
Input genotype data can be provided in either of two ways: (1) IMPUTE2 format, similar to the output of the imputation software IMPUTE2, or (2) a format specific to FAST.
IMPUTE2 format
genotype File (option --impute2-file): For 2 individuals at 5 SNPs whose genotypes are
SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG
The correct genotype file would be
SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
For example, at SNP3 the two alleles are C and T, so the three probabilities given for each individual correspond to the genotypes CC, CT and TT, respectively.
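The relationship between the three genotype probabilities and a single dosage value can be sketched as follows. This is only an illustration, not FAST's own parser: the function name `impute2_line_to_dosages` is hypothetical, and the code assumes the expected dosage of the second allele is P(het) + 2 * P(hom second allele), which is the usual definition. It splits on any whitespace so that it works on the example lines as printed here.

```python
def impute2_line_to_dosages(line):
    """Parse one IMPUTE2-style genotype line.

    Returns (snp_id, rsid, position, allele1, allele2, dosages), where each
    dosage is the expected count of allele2 for one individual.
    """
    fields = line.split()
    snp_id, rsid, pos, allele1, allele2 = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per individual")

    dosages = []
    for i in range(0, len(probs), 3):
        p_hom1, p_het, p_hom2 = probs[i:i + 3]
        # Expected number of copies of allele2 (p_hom1 contributes zero copies).
        dosages.append(p_het + 2.0 * p_hom2)
    return snp_id, rsid, int(pos), allele1, allele2, dosages


# SNP3 from the example: individual 1 is CC, individual 2 is CT.
record = impute2_line_to_dosages("SNP3 rs3 3000 C T 1 0 0 0 1 0")
```

For the SNP3 line, the two individuals get dosages 0.0 and 1.0 for allele T.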
imputation information File (option --impute2-info-file): Name of SNP-wise information file with one line per SNP and a single mandatory header line at the beginning. This file always contains the following columns:
1. SNP identifier
2. rsID
3. base pair position
4. expected frequency of allele coded '1' in genotype file
5. measure of the observed statistical information associated with the allele frequency estimate
6. average certainty of best-guess genotypes
7. internal "type" assigned to SNP (not used by FAST, set to 0)
8. info_typeX (not used by FAST, set to 0)
9. concord_typeX (not used by FAST, set to 0)
10. r2_typeX (not used by FAST, set to 0)
FAST format
tped File (option --tped-file): This file must be tab delimited and contains the genotype dosage data. Each row represents a SNP, and each column an individual. Each genotype is a real value between 0 and 2, as output by a genotype imputation algorithm such as Impute or Mach. Genotype data represented with two alleles (‘a/a’, ‘A/a’, ‘A/A’) can also be converted to the dosage of either allele, e.g. the count of allele ‘A’, so that ‘a/a’ becomes 0, ‘A/a’ becomes 1 and ‘A/A’ becomes 2.
Note
1. Missing genotype values are indicated with a negative value (default = -1), see option “--missing-val”.
2. No header line is allowed in this file. The file corresponds to a single chromosome.
0.2 0.5 1.2 1.9 1.1 1.4 0.0 0.0 2.0 2.0 ...
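As a rough sketch of the conversion from hard genotype calls to dosages, the snippet below counts copies of a chosen allele and writes one tab-delimited row. The function names and the missing-call codes (`0`, `N`, empty string) are assumptions made for this illustration; only the default missing value of -1 comes from the note above.

```python
MISSING = -1.0  # default missing value described for --missing-val

def genotype_to_dosage(genotype, counted_allele):
    """Count copies of counted_allele in a call such as 'A/a'.

    Calls that are malformed or contain an assumed missing code map to MISSING.
    """
    alleles = genotype.split("/")
    if len(alleles) != 2 or any(a in ("", "0", "N") for a in alleles):
        return MISSING
    return float(sum(1 for a in alleles if a == counted_allele))

def tped_row(genotypes, counted_allele):
    """Build one tab-delimited dosage row (one SNP, one column per individual)."""
    return "\t".join(f"{genotype_to_dosage(g, counted_allele):g}" for g in genotypes)


row = tped_row(["a/a", "A/a", "A/A", "0/0"], counted_allele="A")
```

Whichever allele is counted, the choice should be the same for every individual at a given SNP.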
mlinfo File (option --mlinfo-file): This file must be tab delimited and it should have exactly 6 columns:
1. rs no. or SNP identifier,
2. allele1,
3. allele2,
4. frequency of allele1,
5. minor allele frequency (MAF), and
6. imputation quality for each SNP (Qual).
For imputed data, ‘Qual’ can represent the ‘Rsq’ metric output by Mach or the ‘Info’ metric output by Impute. For genotype data that are converted to dosage data for analysis, the ‘Qual’ column can be all 1.0, representing perfect quality. The format of this file is a mandatory header line followed by one row for each SNP.
Note The header line must start with a ‘#’.
The file corresponds to a single chromosome. Example:
#SNP Allele1 Allele2 Freq Maf Qual
rs1 A G 0.3 0.3 0.9
rs2 T C 0.8 0.2 0.5
...
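A quick pre-flight check of an mlinfo file might look like the sketch below. It is not part of FAST; `check_mlinfo` is a hypothetical helper that verifies the `#` header, the six tab-delimited columns, and that the Maf column equals min(Freq, 1 - Freq), which is the standard definition of minor allele frequency and agrees with the example rows.

```python
def check_mlinfo(lines, tol=1e-6):
    """Return a list of problems found in mlinfo-style lines (empty if none)."""
    problems = []
    if not lines or not lines[0].startswith("#"):
        problems.append("header line must start with '#'")

    # Data rows start on line 2; the first line is always treated as the header.
    for n, line in enumerate(lines[1:], start=2):
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 6:
            problems.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        freq, maf = float(cols[3]), float(cols[4])
        if abs(maf - min(freq, 1.0 - freq)) > tol:
            problems.append(f"line {n}: Maf does not equal min(Freq, 1 - Freq)")
    return problems


example = [
    "#SNP\tAllele1\tAllele2\tFreq\tMaf\tQual",
    "rs1\tA\tG\t0.3\t0.3\t0.9",
    "rs2\tT\tC\t0.8\t0.2\t0.5",
]
issues = check_mlinfo(example)  # empty list for the example above
```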
snp info File (option --snpinfo-file): This file must be tab delimited. Each line of the file describes a single marker and must contain exactly 4 columns:
1. rs no. or SNP identifier,
2. chromosome,
3. genetic distance (morgans, not used by FAST, can set it to 0), and
4. base-pair position (bp units).
Note The header line must start with a ‘#’. The file corresponds to a single chromosome.
Example:
#SNP Chr GD BP
rs1 1 0 10000
rs2 1 0 10004
...
Other input files
Individual ID File (option --indiv-file): This file contains the unique individual IDs corresponding to each column of the genotype or tped file. The file contains a single column, with one ID per row. The number of individual IDs in this file must match the number of columns in the tped file, and the order of individuals must match the order in the tped file and the genotype file.
Note No header line is allowed in this file.
Example:
indiv_1
indiv_2
indiv_3
indiv_4
indiv_5
...
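The count requirement between the ID file and the tped file can be checked before a run with something like the following. This is a minimal sketch with a hypothetical function name; it works on lists of lines so it can be tried without any files, and it cannot verify that the order of individuals is correct, only that the counts agree.

```python
def check_individual_ids(id_lines, tped_lines):
    """Check that IDs are unique and match the tped column count; return the count."""
    ids = [line.strip() for line in id_lines if line.strip()]
    if len(set(ids)) != len(ids):
        raise ValueError("individual IDs must be unique")

    for n, row in enumerate(tped_lines, start=1):
        n_cols = len(row.rstrip("\n").split("\t"))
        if n_cols != len(ids):
            raise ValueError(f"tped line {n}: {n_cols} columns but {len(ids)} IDs")
    return len(ids)


n_individuals = check_individual_ids(
    ["indiv_1", "indiv_2", "indiv_3"],
    ["0.2\t0.5\t1.2", "0.0\t2.0\t-1"],
)
```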
Phenotype + Covariate File (option --trait-file): This file must be tab delimited. It describes the phenotypes and covariates for each individual, following PLINK format. The first six columns are mandatory:
1. Family ID
2. Individual ID
3. Paternal ID
4. Maternal ID
5. Sex (1=male; 2=female; other=unknown)
6. Phenotype
The Phenotype column can optionally be followed by one or more covariate columns (when --num-covariates > 0).
Note
1. The first line must be a header line starting with a ‘#’.
2. Only a single phenotype column is permitted, column 6.
3. All covariates specified will be used for analysis (column 7 onwards).
4. Missing phenotype/covariate values must be specified with NA.
Example: (Note, the columns Cov1, Cov2 are optional)
#Fam_ID Ind_ID Dad_ID Mom_ID Sex Phenotype Cov1 Cov2
fam_id1 ind_1 ind_3 ind_5 1 0.3833 10.344 10
fam_id2 ind_2 ind_4 ind_6 2 -0.2231 21.322 20
...
The phenotype can be either a quantitative trait or a binary affection status; FAST automatically detects which type it is. Quantitative traits with decimal points must be coded with a period (full stop) and not a comma, i.e. 5.123 not 5,123. A dichotomous trait must be coded with two integer values (any pair, e.g. 0/1 or 1/2).
If Sex/Gender needs to be used as a covariate, it must also be specified (i.e., repeated) in one of the covariate columns, e.g.
#Fam_ID Ind_ID Dad_ID Mom_ID Sex Phenotype Cov1 Sex
fam_id1 ind_1 ind_3 ind_5 1 0.3833 10.344 1
fam_id2 ind_2 ind_4 ind_6 2 -0.2231 21.322 2
...
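The trait file layout described above can be read with a few lines of code, sketched here under stated assumptions. The helper names are hypothetical, and `looks_binary` is only a guess at what automatic type detection could look like (exactly two distinct integer values among the non-missing phenotypes); the rule FAST actually applies is not documented on this page.

```python
def read_trait_lines(lines, num_covariates=0):
    """Return (phenotypes, covariates); NA entries become None."""
    if not lines or not lines[0].startswith("#"):
        raise ValueError("first line must be a header starting with '#'")

    def to_value(text):
        return None if text == "NA" else float(text)

    phenotypes, covariates = [], []
    for n, line in enumerate(lines[1:], start=2):
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 6 + num_covariates:
            raise ValueError(f"line {n}: expected at least {6 + num_covariates} columns")
        phenotypes.append(to_value(cols[5]))                      # column 6
        covariates.append([to_value(c) for c in cols[6:6 + num_covariates]])
    return phenotypes, covariates

def looks_binary(phenotypes):
    """Heuristic (an assumption): two distinct integer-valued phenotypes."""
    observed = {p for p in phenotypes if p is not None}
    return len(observed) == 2 and all(v.is_integer() for v in observed)


trait_lines = [
    "#Fam_ID\tInd_ID\tDad_ID\tMom_ID\tSex\tPhenotype\tCov1\tSex",
    "fam_id1\tind_1\tind_3\tind_5\t1\t0.3833\t10.344\t1",
    "fam_id2\tind_2\tind_4\tind_6\t2\t-0.2231\tNA\t2",
]
phenotypes, covariates = read_trait_lines(trait_lines, num_covariates=2)
```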
Note for FAST.2.4: For both single SNP and gene-based Cox methods, the phenotype file requires 7 mandatory columns:
1. Family ID
2. Individual ID
3. Paternal ID
4. Maternal ID
5. Sex (1=male; 2=female; other=unknown)
6. Status
7. Time to Event
#Fam_ID Ind_ID Dad_ID Mom_ID Sex Status tTEvent Sex Cov
fam_id1 ind_1 ind_3 ind_5 1 1 1.03 1 5.67
fam_id2 ind_2 ind_4 ind_6 2 0 2.06 2 3.46
...
For SAPPHO, the phenotype file requires 6+j mandatory columns, where j is the number of phenotypes being tested, set with the input option "--num-phenotypes j". SAPPHO currently allows a maximum of 20 phenotypes. An example SAPPHO phenotype file is as follows:
#Fam_ID Ind_ID Dad_ID Mom_ID Sex Phenotype1 Phenotype2 Sex
fam_id1 ind_1 ind_3 ind_5 1 0.3833 10.344 1
fam_id2 ind_2 ind_4 ind_6 2 -0.2231 21.322 2
...
File format for Input Summary Data (mode=summary)
Tip: To ensure proper running, please check that all input files are tab–delimited.
summary data file (option --summary-file): This file contains the meta-analysis information for each SNP. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each subsequent row provides the information for one SNP and must have the following 10 columns:
1. Chromosome
2. rs no. or SNP identifier
3. Allele 1
4. Allele 2
5. minor allele frequency
6. number of samples without missing data
7. SNP base pair position
8. Single SNP regression coefficient (beta)
9. Single SNP regression standard error (se)
10. Single SNP regression pvalue
Example:
#chr snp Allele1 Allele2 Maf Nsample bp beta se pvalue
10 rs1 A T 0.3 2000 123456 0.34 0.12 0.108
10 rs2 G C 0.2 1998 123478 1.4 0.2 0.045
...
Note
1. The first line must be a header line starting with a ‘#’.
2. The file corresponds to a single chromosome.
3. The SNP alleles can be coded as A/G/T/C or 1/2/3/4.
#chr snp Allele1 Allele2 Maf bp nsample1 beta1 se1 pvalue1 nsample2 beta2 se2 pvalue2
10 rs1 A T 0.3 123456 8000 0.34 0.12 0.108 7000 0.45 0.32 0.234
10 rs2 G C 0.2 125678 6000 1.4 0.2 0.045 5000 1.8 2.7 0.304
...
summary data file simple format (option --summary-file): This file contains the meta-analysis information for each SNP in a simpler format. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each subsequent row provides the information for one SNP and must have the following 4 columns:
1. Chromosome
2. rs no. or SNP identifier
3. SNP base pair position
4. Single SNP regression pvalue
Example:
#chr snp bp pvalue
10 rs1 123456 0.108
10 rs2 123478 0.045
...
Note
1. The first line must be a header line starting with a ‘#’.
2. The file corresponds to a single chromosome.
3. The SNP alleles can be coded as A/G/T/C or 1/2/3/4.
4. SAPPHO does not support summary simple format.
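Because the simple format is a subset of the 10-column format (chromosome, SNP identifier, base pair position and pvalue), one can be derived from the other by selecting columns 1, 2, 7 and 10. The sketch below illustrates this; `full_to_simple` is a hypothetical helper, not a FAST utility, and the header text it writes simply mirrors the example above.

```python
def full_to_simple(lines):
    """Convert 10-column summary lines to the 4-column simple format."""
    if not lines or not lines[0].startswith("#"):
        raise ValueError("first line must be a header starting with '#'")

    out = ["#chr\tsnp\tbp\tpvalue"]
    for n, line in enumerate(lines[1:], start=2):
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            raise ValueError(f"line {n}: expected 10 columns, found {len(cols)}")
        # 0-based indices: 0 = chromosome, 1 = SNP, 6 = position, 9 = pvalue.
        out.append("\t".join([cols[0], cols[1], cols[6], cols[9]]))
    return out


simple = full_to_simple([
    "#chr\tsnp\tAllele1\tAllele2\tMaf\tNsample\tbp\tbeta\tse\tpvalue",
    "10\trs1\tA\tT\t0.3\t2000\t123456\t0.34\t0.12\t0.108",
    "10\trs2\tG\tC\t0.2\t1998\t123478\t1.4\t0.2\t0.045",
])
```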
LD file (option --ld-file): This file specifies the pair-wise LD information between SNPs. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each line contains 7 mandatory columns:
1. Chromosome of SNP 1
2. base pair position of SNP 1
3. rs no. or SNP identifier for SNP 1
4. Chromosome of SNP 2
5. base pair position of SNP 2
6. rs no. or SNP identifier for SNP 2
7. Correlation between SNP 1 and SNP 2 (a value between -1 and +1).
#CHR1 BP1 SNP1 CHR2 BP2 SNP2 LD
1 12345 rs1 1 12346 rs2 0.342
1 12345 rs1 1 12347 rs3 -0.59
...
Note
1. The file corresponds to a single chromosome.
2. The file must be sorted first in ascending order of the base pair position of SNP 1, and then in ascending order of the base pair position of SNP 2.
1. The file may contain SNPs from different chromosomes.
2. The file must be sorted first in ascending order of the chromosome number of SNP 1, then in ascending order of the base pair position of SNP 1, then in ascending order of the chromosome number of SNP 2, and then in ascending order of the base pair position of SNP 2.
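The sort-order requirements on the LD file can be verified with a short check such as the one below. It is a sketch with hypothetical function names, and it assumes chromosomes are coded as integers. The key (chromosome 1, position 1, chromosome 2, position 2) covers the multi-chromosome ordering, and for a single-chromosome file it reduces to ordering by the two positions.

```python
def ld_sort_key(line):
    """Sort key for one LD line: (CHR1, BP1, CHR2, BP2) as integers."""
    chr1, bp1, _snp1, chr2, bp2, _snp2, _ld = line.rstrip("\n").split("\t")
    return (int(chr1), int(bp1), int(chr2), int(bp2))

def is_ld_sorted(lines):
    """True if all non-header lines are in ascending key order."""
    keys = [ld_sort_key(line) for line in lines if not line.startswith("#")]
    return all(a <= b for a, b in zip(keys, keys[1:]))


ld_lines = [
    "#CHR1\tBP1\tSNP1\tCHR2\tBP2\tSNP2\tLD",
    "1\t12345\trs1\t1\t12346\trs2\t0.342",
    "1\t12345\trs1\t1\t12347\trs3\t-0.59",
]
ok = is_ld_sorted(ld_lines)
# An unsorted file could be repaired with:
# [ld_lines[0]] + sorted(ld_lines[1:], key=ld_sort_key)
```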
Phenotype variance-covariance file (option --pheno-varcov-file): This file specifies the variance-covariance structure for all phenotypes. The file must be tab delimited. The first line is a mandatory header line. Each line contains 3 mandatory columns:
1. Name of phenotype a
2. Name of phenotype b
3. Covariance between phenotype a and b.
Pheno1 Pheno2 Cov
phenotype_a phenotype_a 1
phenotype_a phenotype_b 0.5
phenotype_a phenotype_c 0.2
phenotype_b phenotype_b 1
phenotype_b phenotype_c 0.8
phenotype_c phenotype_c 1
...
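To see how the pair-wise rows map onto a full variance-covariance matrix, the following sketch expands them into a symmetric matrix. The function name is hypothetical, and the code assumes, as in the example, that each unordered pair is listed once and that every pair is present; a missing pair raises a `KeyError`.

```python
def read_pheno_varcov(lines):
    """Return (names, matrix) from header + 'a<TAB>b<TAB>cov' lines."""
    names, cov = [], {}
    for line in lines[1:]:  # skip the mandatory header line
        a, b, value = line.rstrip("\n").split("\t")
        for name in (a, b):
            if name not in names:
                names.append(name)
        # Covariance is symmetric, so store both orderings of the pair.
        cov[(a, b)] = cov[(b, a)] = float(value)

    matrix = [[cov[(row, col)] for col in names] for row in names]
    return names, matrix


names, matrix = read_pheno_varcov([
    "Pheno1\tPheno2\tCov",
    "phenotype_a\tphenotype_a\t1",
    "phenotype_a\tphenotype_b\t0.5",
    "phenotype_a\tphenotype_c\t0.2",
    "phenotype_b\tphenotype_b\t1",
    "phenotype_b\tphenotype_c\t0.8",
    "phenotype_c\tphenotype_c\t1",
])
```

For the example rows this gives a 3 x 3 matrix with ones on the diagonal and 0.5, 0.2 and 0.8 off the diagonal.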
allele info file (option --allele-file): This file specifies the reference and alternate alleles used in computing the LD in the LD file. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each line contains 3 mandatory columns:
1. rs no. or SNP identifier
2. Allele 1
3. Allele 2
Example:
#snp Allele1 Allele2
rs1 A T
rs2 G C
Note
1. The file corresponds to a single chromosome; SNPs must appear in the same order as in the summary file.
2. Must be tab delimited.
3. For SAPPHO, SNPs may be from different chromosomes; SNPs must appear in the same order as in the summary file.
Haplotype file (option --hap-file): This file specifies the reference haplotypes for computing LD on the fly. The file must be tab delimited. Each line contains a single SNP with the columns:
1. chromosome
2. rs # or SNP identifier
3. base pair position
4. Frequency of Allele2
5. Allele1
6. Allele2
7. String of 0 and 1, where 0 represents Allele1 and 1 represents Allele2
Example:
22 rs1 12345 0.8 A T 1 0 1 0 1 ...
22 rs2 18345 0.3 T G 0 0 1 0 0 ...
...
Note
1. No header line present. Each column (column 7 onwards) is a haplotype.
2. The file corresponds to a single chromosome.
3. Must be tab delimited.
4. For SAPPHO, SNPs may be from different chromosomes; SNPs must appear in the same order as in the summary file.
Haplotype index file (option --pos-file): Each line describes a single SNP with the columns:
1. chromosome
2. rs no. or SNP identifier
3. start of haplotype position in bytes in the haplotype file
4. Allele1
5. Allele2
6. Frequency of Allele2
Example:
22 rs1 0 A T 0.8
22 rs2 4096 T G 0.3
...
Note
1. No header line present.
2. The file corresponds to a single chromosome.
3. Must be tab delimited.
4. Not needed for SAPPHO.
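The byte-position column suggests that this file acts as an index into the haplotype file. One way such an index could be produced is sketched below. It rests on an assumption: the byte position is taken to be the offset of the start of each SNP's line in the haplotype file, which is consistent with the first example row having position 0 but is not confirmed on this page. The function name is hypothetical.

```python
def build_position_index(hap_bytes):
    """Build index lines (chr, snp, byte offset, Allele1, Allele2, freq of Allele2).

    hap_bytes is the raw content of a haplotype file, read in binary mode so
    that offsets are counted in bytes rather than characters.
    """
    index_lines = []
    offset = 0
    for raw in hap_bytes.splitlines(keepends=True):
        cols = raw.decode("ascii").rstrip("\r\n").split("\t")
        if len(cols) >= 7:
            chrom, snp, _bp, freq2, allele1, allele2 = cols[:6]
            index_lines.append(
                "\t".join([chrom, snp, str(offset), allele1, allele2, freq2])
            )
        offset += len(raw)  # advance by the full line, including its newline
    return index_lines


hap = (
    b"22\trs1\t12345\t0.8\tA\tT\t1\t0\t1\n"
    b"22\trs2\t18345\t0.3\tT\tG\t0\t0\t1\n"
)
index = build_position_index(hap)
```

With an index of this kind, a reader can `seek()` straight to a SNP's line in the haplotype file instead of scanning it from the top.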
Multipos File: This file maintains a pre-computed list of problematic SNPs that are mapped to multiple loci and will be skipped during analysis, one line per snp name or rs no.
Pre-computed haplotype files from 1000 Genomes reference panels
Pre-computed haplotype files and their corresponding index files (for input with options --hap-file, --pos-file) from the 1000 Genomes release of May 2012 are available for download from 1000G haplotypes for use with mode=summary. The following reference populations are currently available:
ASW CEU
File format for input files specifying gene list
Tip: To ensure proper running, please check that all input files are tab–delimited.
Gene-set File: This file specifies the gene boundary information for each gene to be used in the analysis. The first line is a mandatory header line and must start with a ‘#’. Each line contains 5 mandatory columns:
1. Gene id
2. Gene name
3. Chromosome
4. Gene start position in base-pairs
5. Gene end position in base-pairs
Example:
#GeneID GeneName Chr Start End
GeneID:347688 TUBB8 10 82997 85178
GeneID:439945 LOC439945 10 116561 122386
...
Build 37.3: A list of genes for build 37.3 is available at Genes.37.3. Lists of genes for build 37.3 with the effective number of tests for Hapmap and 1000G are available at Genes.37.3.Hapmap and Genes.37.3.1000G.
Note
1. The file must be sorted in ascending order of chromosome. Within each chromosome, it should be sorted by ascending order of gene start positions.
2. Must be tab delimited.
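A gene-set file that is not yet in the required order can be sorted with a few lines of code. The sketch below uses a hypothetical helper and assumes chromosomes are coded as integers; how non-numeric chromosome labels should be ordered is not specified here, so they are not handled.

```python
def sort_gene_set(lines):
    """Return the header followed by rows sorted by (chromosome, start)."""
    if not lines or not lines[0].startswith("#"):
        raise ValueError("first line must be a header starting with '#'")
    rows = [line for line in lines[1:] if line.strip()]

    def key(line):
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 5:
            raise ValueError(f"expected 5 columns, found {len(cols)}: {line!r}")
        return (int(cols[2]), int(cols[3]))  # chromosome, gene start

    return [lines[0]] + sorted(rows, key=key)


genes = sort_gene_set([
    "#GeneID\tGeneName\tChr\tStart\tEnd",
    "GeneID:439945\tLOC439945\t10\t116561\t122386",
    "GeneID:347688\tTUBB8\t10\t82997\t85178",
])
```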