FAST / InputFileFormats

File format for Input Genotype Data (mode=genotype)

Tip: To ensure proper running, please check that all input files are tab–delimited.

Input genotype data can be provided in either of two ways: (1) IMPUTE2 format, matching the output of the imputation software IMPUTE2, or (2) a format specific to FAST.

IMPUTE2 format

genotype File (option --impute2-file): For 2 individuals at 5 SNPs whose genotypes are

SNP 1 : AA    AA
SNP 2 : GG    GT
SNP 3 : CC    CT
SNP 4 : CT    CT
SNP 5 : AG    GG

The correct genotype file would be

SNP1    rs1    1000    A    C    1    0    0    1    0    0
SNP2    rs2    2000    G    T    1    0    0    0    1    0
SNP3    rs3    3000    C    T    1    0    0    0    1    0
SNP4    rs4    4000    C    T    0    1    0    0    1    0
SNP5    rs5    5000    A    G    0    1    0    0    0    1

At SNP3, for example, the two alleles are C and T, so the three probabilities listed for each individual correspond to the genotypes CC, CT and TT, in that order.
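As an illustration of how the probability triplets line up with genotypes, here is a minimal Python sketch that reads one line in this layout and reports the most probable genotype per individual. The function name `best_guess_genotypes` is hypothetical and not part of FAST.

```python
def best_guess_genotypes(line):
    """Return the most probable genotype for each individual on one IMPUTE2 line."""
    fields = line.split()
    allele_a, allele_b = fields[3], fields[4]
    probs = [float(x) for x in fields[5:]]
    # Genotype labels in the order the three probabilities are listed.
    labels = [allele_a * 2, allele_a + allele_b, allele_b * 2]
    calls = []
    for i in range(0, len(probs), 3):
        triple = probs[i:i + 3]
        calls.append(labels[triple.index(max(triple))])
    return calls


# SNP3 from the example above: alleles C and T, two individuals.
print(best_guess_genotypes("SNP3\trs3\t3000\tC\tT\t1\t0\t0\t0\t1\t0"))
# prints ['CC', 'CT']
```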

imputation information File (option --impute2-info-file): Name of SNP-wise information file with one line per SNP and a single mandatory header line at the beginning. This file always contains the following columns:

1. SNP identifier
2. rsID 
3. base pair position
4. expected frequency of allele coded '1' in genotype file
5. measure of the observed statistical information associated with the allele frequency estimate
6. average certainty of best-guess genotypes
7. internal "type" assigned to SNP (not used by FAST, set to 0)
8. info_typeX  (not used by FAST, set to 0)
9. concord_typeX (not used by FAST, set to 0)
10. r2_typeX (not used by FAST, set to 0)

FAST format

tped File (option --tped-file): This file must be tab delimited and contains the genotype dosage data. Each row represents a SNP and each column an individual. Each genotype is a real value between 0 and 2, as output by a genotype imputation algorithm such as Impute or Mach. Genotype data represented with two alleles (‘a/a’, ‘A/a’, ‘A/A’) can also be converted to the dosage of either allele, i.e. the count of that allele. Counting allele ‘A’, for example, ‘a/a’ becomes 0, ‘A/a’ becomes 1 and ‘A/A’ becomes 2.

Note

1. Missing genotype values are indicated with a negative value (default = -1), see option “--missing-val”.
2. No header line is allowed in this file. The file corresponds to a single chromosome.

For example, here are five individuals typed for 2 SNPs (one row = one SNP, one column = one individual):
0.2              0.5            1.2             1.9               1.1
1.4              0.0            0.0             2.0               2.0
                                      ...                                           
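The conversion from hard-call genotypes to dosages could be sketched as follows. This is only an illustration: the helper names are hypothetical, genotypes are assumed to be written as two alleles separated by `/`, and `None` is used here to stand for a missing genotype, which is written out as the default missing value of -1.

```python
MISSING = -1  # default missing value; FAST lets you change it with --missing-val


def genotype_to_dosage(genotype, counted_allele):
    """Count copies of counted_allele in a genotype such as 'A/a'."""
    if genotype is None:
        return MISSING
    return genotype.split("/").count(counted_allele)


def tped_row(genotypes, counted_allele):
    """Format one SNP (one row, one column per individual) as a tab-delimited line."""
    return "\t".join(str(genotype_to_dosage(g, counted_allele)) for g in genotypes)


# Four individuals at one SNP, counting allele 'A'; the last genotype is missing.
print(tped_row(["a/a", "A/a", "A/A", None], "A"))
```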

mlinfo File (option --mlinfo-file): This file must be tab delimited and it should have exactly 6 columns:

1. rs no. or SNP identifier, 
2. allele1, 
3. allele2, 
4. frequency of allele1, 
5. minor allele frequency(MAF), and 
6. imputation quality for each SNP(Qual).

For imputed data, ‘Qual’ can represent the ‘Rsq’ metric output by Mach or the ‘Info’ metric output by Impute. For genotype data converted to dosage data for analysis, the ‘Qual’ column can be set to 1.0 throughout, representing perfect quality. The format of this file is a mandatory header line followed by one row for each SNP.

Note The header line must start with a ‘#’.

The file corresponds to a single chromosome. Example:

#SNP    Allele1   Allele2   Freq    Maf   Qual
rs1     A         G         0.3     0.3    0.9
rs2     T         C         0.8     0.2    0.5
            ...     
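A short sketch of writing rows in this layout is given below. It assumes that the MAF column is the smaller of the allele1 frequency and its complement, which is consistent with the two example rows but is not stated explicitly here. The function name `mlinfo_line` is hypothetical.

```python
MLINFO_HEADER = "\t".join(["#SNP", "Allele1", "Allele2", "Freq", "Maf", "Qual"])


def mlinfo_line(snp, allele1, allele2, freq1, qual=1.0):
    """Build one tab-delimited mlinfo row with exactly 6 columns."""
    # Assumption: MAF = min(freq of allele1, 1 - freq of allele1).
    maf = min(freq1, 1.0 - freq1)
    return "\t".join([snp, allele1, allele2, f"{freq1:g}", f"{maf:g}", f"{qual:g}"])


print(MLINFO_HEADER)
print(mlinfo_line("rs1", "A", "G", 0.3, 0.9))
print(mlinfo_line("rs2", "T", "C", 0.8, 0.5))
```

The default `qual=1.0` mirrors the suggestion above for hard-call genotypes converted to dosages.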

snp info File (option --snpinfo-file): This file must be tab delimited. Each line of the file describes a single marker and must contain exactly 4 columns:

1. rs no. or SNP identifier,
2. chromosome,   
3. genetic distance (morgans, not used by FAST, can set it to 0), and  
4. base-pair position (bp units). 
The format of this file is a mandatory header line followed by one row for each SNP.

Note The header line must start with a ‘#’. The file corresponds to a single chromosome.

Example:

#SNP   Chr  GD   BP
rs1    1    0    10000
rs2    1    0    10004
         ...        

Other input files

Individual ID File (option --indiv-file): This file contains the unique individual IDs corresponding to each column of the genotype or tped file. The file contains a single column, with one ID per row. The number of individual IDs in this file must match the number of columns of the tped file. The order of individuals in this file must match the order in the tped file and the genotype file.

Note No header line is allowed in this file.

Example:

indiv_1
indiv_2
indiv_3
indiv_4
indiv_5
...
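Because the ID count must equal the tped column count, a quick pre-flight check can catch mismatches before a run. The sketch below is one way to do it; `check_indiv_against_tped` is a hypothetical helper that takes the lines of each file as iterables of strings.

```python
def check_indiv_against_tped(indiv_lines, tped_lines):
    """Return the number of individuals, or raise ValueError on a mismatch."""
    ids = [line.strip() for line in indiv_lines if line.strip()]
    if len(set(ids)) != len(ids):
        raise ValueError("individual IDs must be unique")
    for row_number, row in enumerate(tped_lines, start=1):
        n_cols = len(row.rstrip("\n").split("\t"))
        if n_cols != len(ids):
            raise ValueError(
                f"tped row {row_number} has {n_cols} columns, expected {len(ids)}"
            )
    return len(ids)


ids = ["indiv_1\n", "indiv_2\n", "indiv_3\n", "indiv_4\n", "indiv_5\n"]
tped = ["0.2\t0.5\t1.2\t1.9\t1.1\n", "1.4\t0.0\t0.0\t2.0\t2.0\n"]
print(check_indiv_against_tped(ids, tped))  # prints 5
```

Note that this only checks counts; it cannot verify that the order of individuals matches.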

Phenotype + Covariate File (option --trait-file): This file must be tab delimited. It describes the phenotypes and covariates for each individual, following PLINK format. The first six columns are mandatory:

1. Family ID  
2. Individual ID 
3. Paternal ID  
4. Maternal ID  
5. Sex (1=male; 2=female; other=unknown)  
6. Phenotype

The Phenotype column can optionally be followed by one or more covariate columns (when --num-covariates > 0).

Note

   1. The first line must be a header line starting with a ‘#’. 
   2.  Only a single phenotype column is permitted, column 6. 
   3.  All covariates specified will be used for analysis (column 7 onwards). 
   4.  Missing phenotype/covariate values must be specified with NA.

Example: (Note, the columns Cov1, Cov2 are optional)

#Fam_ID   Ind_ID   Dad_ID   Mom_ID   Sex   Phenotype   Cov1     Cov2
fam_id1   ind_1    ind_3    ind_5    1     0.3833      10.344   10
fam_id2   ind_2    ind_4    ind_6    2     -0.2231     21.322   20
            ...         

The phenotype can be either a quantitative trait or a binary affection status column: FAST will automatically detect which type. Quantitative traits with decimal points must be coded with a period/full-stop character and not a comma, i.e. 5.123 not 5,123. A dichotomous trait must be coded with two integer values (e.g. 0/1 or 1/2).
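FAST's own detection rule is not described here, so the following is only a sketch of a check you could run on the phenotype column yourself before analysis. It assumes a column counts as binary when its non-missing values are exactly two distinct integers, and it rejects comma decimal marks. The name `classify_trait` is hypothetical.

```python
def classify_trait(values):
    """Classify a phenotype column given as strings ('NA' marks missing values)."""
    observed = set()
    for v in values:
        if v == "NA":
            continue
        if "," in v:
            raise ValueError(f"use a period as the decimal mark: {v!r}")
        observed.add(float(v))
    # Assumption: exactly two distinct integer codes means a dichotomous trait.
    if len(observed) == 2 and all(x.is_integer() for x in observed):
        return "binary"
    return "quantitative"


print(classify_trait(["1", "2", "2", "NA"]))     # prints binary
print(classify_trait(["0.3833", "-0.2231"]))     # prints quantitative
```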

If Sex/Gender is to be used as a covariate, it must also be repeated in one of the covariate columns, e.g.

#Fam_ID   Ind_ID   Dad_ID   Mom_ID   Sex   Phenotype   Cov1     Sex
fam_id1   ind_1    ind_3    ind_5    1     0.3833      10.344   1
fam_id2   ind_2    ind_4    ind_6    2     -0.2231     21.322   2
                            ...

Note for FAST.2.4: For both single SNP and gene-based Cox methods, the phenotype file requires 7 mandatory columns:

1. Family ID  
2. Individual ID 
3. Paternal ID  
4. Maternal ID  
5. Sex (1=male; 2=female; other=unknown)  
6. Status 
7. Time to Event
These columns can be followed by m stratification columns and then by n covariate columns, set with the input options "--cox-strata m" and "--cox-cov n". Please note that stratification columns come first, followed by covariate columns. Also note that the number of covariates for the Cox model is set with the "--cox-cov" option, not the "--num-covariates" option. An example Cox phenotype file is as follows:

#Fam_ID   Ind_ID   Dad_ID   Mom_ID   Sex   Status tTEvent   Sex     Cov
fam_id1   ind_1    ind_3    ind_5    1     1      1.03      1       5.67
fam_id2   ind_2    ind_4    ind_6    2     0      2.06      2       3.46
                            ...
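To make the column ordering concrete, the sketch below computes which columns hold strata and which hold covariates for given values of m and n. The function name and the 0-based indexing are choices made for this illustration, not part of FAST.

```python
def cox_columns(n_strata, n_cov):
    """Return 0-based column positions for a Cox phenotype file."""
    fixed = ["Fam_ID", "Ind_ID", "Dad_ID", "Mom_ID", "Sex", "Status", "Time"]
    start = len(fixed)  # first column after the 7 mandatory ones
    return {
        "total": start + n_strata + n_cov,
        # Stratification columns come first...
        "strata": list(range(start, start + n_strata)),
        # ...followed by the covariate columns.
        "covariates": list(range(start + n_strata, start + n_strata + n_cov)),
    }


# With "--cox-strata 1 --cox-cov 1" a 9-column file is expected:
print(cox_columns(1, 1))
# prints {'total': 9, 'strata': [7], 'covariates': [8]}
```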

For SAPPHO, the phenotype file requires 6+j mandatory columns, where j is the number of phenotypes being tested, set with the input option "--num-phenotypes j". SAPPHO currently allows a maximum of 20 phenotypes. An example SAPPHO phenotype file is as follows:

#Fam_ID   Ind_ID   Dad_ID   Mom_ID   Sex   Phenotype1   Phenotype2     Sex
fam_id1   ind_1    ind_3    ind_5    1     0.3833       10.344         1
fam_id2   ind_2    ind_4    ind_6    2     -0.2231      21.322         2
                            ...
As with regular phenotype files, the phenotype columns can optionally be followed by one or more covariate columns (when --num-covariates > 0). In the above example, setting "--num-covariates 1" as an input option would treat Sex as a covariate.

File format for Input Summary Data (mode=summary)

Tip: To ensure proper running, please check that all input files are tab–delimited.

summary data file (option --summary-file) : This file contains the meta-analysis information for each SNP. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each subsequent row provides the information for each SNP and must have the following 10 columns:

1. Chromosome
2. rs no. or SNP identifier,  
3. Allele 1
4. Allele 2
5. minor allele frequency
6. number of samples without missing data
7. SNP base pair position
8. Single SNP regression coefficient (beta)
9. Single SNP regression standard error (se)
10. Single SNP regression pvalue 

Example:

#chr    snp     Allele1  Allele2  Maf     Nsample      bp      beta      se      pvalue
10      rs1     A        T        0.3     2000       123456    0.34      0.12    0.108
10      rs2     G        C        0.2     1998       123478    1.4       0.2     0.045
                  ...

Note

    1. The first line must be a header line starting with a ‘#’. 
    2.  The file corresponds to a single chromosome.
    3.  The SNP alleles can be coded as A/G/T/C or 1/2/3/4.
Note for FAST.2.4: CoxPh methods do not run in summary mode in the current version of FAST. For SAPPHO methods, the summary file has 6+4j columns, where j is the total number of phenotypes to be analyzed, set with the "--num-phenotypes j" option. SNPs may come from multiple chromosomes, but must be ordered first by chromosome number and then by base pair position, both ascending. The first 6 columns of the summary file should be chromosome number, SNP ID, allele1 (the non-coding allele), allele2 (the coding allele), minor allele frequency and base pair position (note that the 6th column differs from the single-phenotype format). These are followed by 4 columns for each phenotype in turn: sample number, regression coefficient, standard error and pvalue.

#chr    snp     Allele1  Allele2  Maf     bp      nsample1  beta1    se1    pvalue1     nsample2  beta2    se2    pvalue2
10      rs1     A        T        0.3     123456   8000     0.34     0.12    0.108       7000     0.45     0.32    0.234
10      rs2     G        C        0.2     125678   6000      1.4     0.2     0.045       5000      1.8      2.7    0.304
                  ...
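The 6+4j layout can be generated programmatically. The sketch below builds a header matching the example above for any j. The column names are taken from that example, the function name is hypothetical, and applying the 20-phenotype SAPPHO limit here is an assumption.

```python
def sappho_summary_header(num_phenotypes):
    """Return the list of 6 + 4j column names for a SAPPHO summary file."""
    if not 1 <= num_phenotypes <= 20:
        # Assumption: the stated SAPPHO maximum of 20 phenotypes applies here too.
        raise ValueError("num_phenotypes must be between 1 and 20")
    cols = ["#chr", "snp", "Allele1", "Allele2", "Maf", "bp"]
    for j in range(1, num_phenotypes + 1):
        # Four columns per phenotype, grouped phenotype by phenotype.
        cols += [f"nsample{j}", f"beta{j}", f"se{j}", f"pvalue{j}"]
    return cols


header = sappho_summary_header(2)
print(len(header))          # prints 14
print("\t".join(header))
```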

summary data file simple format (option --summary-file) : This file contains the meta-analysis information for each SNP in a simpler format. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each subsequent row provides the information for each SNP and must have the following 4 columns:

1. Chromosome
2. rs no. or SNP identifier,  
3. SNP base pair position
4. Single SNP regression pvalue 

Example:

#chr    snp     bp       pvalue
10      rs1     123456    0.108
10      rs2     123478    0.045
                  ...

Note

    1. The first line must be a header line starting with a ‘#’. 
    2. The file corresponds to a single chromosome.
    3. The SNP alleles can be coded as A/G/T/C or 1/2/3/4.
    4. SAPPHO does not support summary simple format.

LD file (option --ld-file): This file specifies the pair-wise LD information between SNPs. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each line contains 7 mandatory columns:

1. Chromosome of SNP 1
2. base pair position of SNP 1
3. rs no. or SNP identifier for SNP 1
4. Chromosome of SNP 2
5. base pair position of SNP 2
6. rs no. or SNP identifier for SNP 2
7. Correlation between SNP 1 and SNP 2 (a value between -1 and +1).
Example:
#CHR1     BP1     SNP1    CHR2    BP2     SNP2    LD
1      12345    rs1      1      12346   rs2     0.342
1      12345    rs1      1      12347   rs3     -0.59
                ...

Note

    1. The file corresponds to a single chromosome.
    2. The file must be sorted first in ascending order of the base pair position of SNP 1,
       and then in ascending order of the base pair position of SNP 2.
Note for FAST.2.4
    1. The file may contain SNPs from different chromosomes.
    2. The file must be sorted first in ascending order of the chromosome number of SNP1, then in ascending order of base pair position of SNP 1, then in ascending order of the chromosome number of SNP 2, and then in ascending order of the base pair position of SNP 2.
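
The FAST.2.4 sort order corresponds to a four-part sort key. A minimal sketch, assuming the rows have already been split into their 7 fields and that chromosomes are labelled with integers (labels such as X would need their own mapping); `sort_ld_rows` is a hypothetical helper:

```python
def sort_ld_rows(rows):
    """Sort LD rows by (chromosome 1, position 1, chromosome 2, position 2)."""
    def key(row):
        chr1, bp1, _snp1, chr2, bp2, _snp2, _ld = row
        return (int(chr1), int(bp1), int(chr2), int(bp2))
    return sorted(rows, key=key)


rows = [
    ["2", "500", "rs9", "2", "900", "rs10", "0.1"],
    ["1", "12345", "rs1", "1", "12347", "rs3", "-0.59"],
    ["1", "12345", "rs1", "1", "12346", "rs2", "0.342"],
]
for row in sort_ld_rows(rows):
    print("\t".join(row))
```

Converting the fields to integers matters: sorting the raw strings would place position 100000 before 12345.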

Phenotype variance-covariance file (option --pheno-varcov-file): This file specifies the variance-covariance structure for all phenotypes. The file must be tab delimited. The first line is a mandatory header line. Each line contains 3 mandatory columns:

1. Name of phenotype a
2. Name of phenotype b
3. Covariance between phenotype a and b.
Order of phenotypes within this file is IMPORTANT. First, the order of phenotypes has to be consistent with the order of phenotypes in the summary file. Then, the variance/covariance for each pair of phenotypes is listed as one row in the file. Starting from the first phenotype, for each phenotype in order, the rows include first the variance of that phenotype, then the covariance between that phenotype and every phenotype that comes after it, in order. Example for three phenotypes (phenotype_a, phenotype_b, phenotype_c):
Pheno1        Pheno2         Cov
phenotype_a   phenotype_a    1
phenotype_a   phenotype_b    0.5
phenotype_a   phenotype_c    0.2
phenotype_b   phenotype_b    1
phenotype_b   phenotype_c    0.8
phenotype_c   phenotype_c    1
        ...
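
The required row order is the upper triangle of the covariance matrix, read row by row. The sketch below generates the rows from a list of phenotype names (in summary-file order) and a square matrix; `varcov_rows` is a hypothetical helper and the matrix values are those of the example above.

```python
def varcov_rows(names, cov):
    """List (phenotype a, phenotype b, covariance) in the order the file expects."""
    rows = []
    for i, name_i in enumerate(names):
        # Variance of phenotype i first, then its covariance with each later phenotype.
        for j in range(i, len(names)):
            rows.append((name_i, names[j], cov[i][j]))
    return rows


names = ["phenotype_a", "phenotype_b", "phenotype_c"]
cov = [
    [1, 0.5, 0.2],
    [0.5, 1, 0.8],
    [0.2, 0.8, 1],
]
print("Pheno1\tPheno2\tCov")
for a, b, value in varcov_rows(names, cov):
    print(f"{a}\t{b}\t{value}")
```

For k phenotypes this yields k(k+1)/2 rows, i.e. 6 rows for the three phenotypes here.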

allele info file (option --allele-file): This file specifies the reference and alternate alleles used in computing the LD in the LD file. The file must be tab delimited. The first line is a mandatory header line and must start with a ‘#’. Each line contains 3 mandatory columns:

1. rs no. or SNP identifier
2. Allele 1
3. Allele 2

Example:

#snp    Allele1 Allele2
rs1 A   T
rs2 G   C

Note

 1. The file corresponds to a single chromosome; SNPs must appear in the same order as in the summary file.
 2. Must be tab delimited.
 3. For SAPPHO, SNPs may come from different chromosomes; SNPs must appear in the same order as in the summary file.

Haplotype file (option --hap-file): This file specifies the reference haplotypes for computing LD on the fly. The file must be tab delimited. Each line contains a single SNP with the columns:

1. chromosome
2. rs # or SNP identifier
3. base pair position
4. Frequency of Allele2
5. Allele1 
6. Allele2
7. String of 0 and 1, where 0 represents Allele1 and 1 represents Allele2

Example:

22    rs1    12345    0.8  A    T    1    0    1    0    1    ... 
22    rs2    18345    0.3  T    G    0    0    1    0    0    ...
                           ...                       
Note
 1. No header line present. Each column (column 7 onwards) is a haplotype.
 2. The file corresponds to a single chromosome.
 3. Must be tab delimited.
 4. For SAPPHO, SNPs may come from different chromosomes; SNPs must appear in the same order as in the summary file.

Haplotype index file (option --pos-file): This file specifies the start byte position of each SNP's reference haplotypes, for computing LD on the fly. The file must be tab delimited. Each line contains a single SNP with the columns:
1. chromosome
2. rs no. or SNP identifier
3. start position of the haplotype, in bytes, in the haplotype file
4. Allele1 
5. Allele2
6. Frequency of Allele2

Example:

22    rs1    0       A    T    0.8
22    rs2    4096    T    G    0.3     
             ...

Note

  1. No header line present.
  2. The file corresponds to a single chromosome.
  3. Must be tab delimited.
  4. Not needed for SAPPHO
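
If you prepare your own reference haplotypes, an index in this layout could be derived from the haplotype file roughly as sketched below. This assumes that the byte position is the offset at which each SNP's line begins (the example's first offset of 0 suggests this, but it is not stated explicitly) and that the haplotype file is plain ASCII with newline-terminated lines. `build_pos_lines` is a hypothetical helper, not a FAST tool.

```python
import io


def build_pos_lines(hap_stream):
    """Build pos-file lines from a haplotype file opened in binary mode."""
    lines = []
    offset = 0  # byte offset of the current line within the haplotype file
    for raw in hap_stream:
        fields = raw.decode("ascii").rstrip("\n").split("\t")
        chrom, snp, _bp, freq2, allele1, allele2 = fields[:6]
        lines.append("\t".join([chrom, snp, str(offset), allele1, allele2, freq2]))
        offset += len(raw)
    return lines


hap = io.BytesIO(
    b"22\trs1\t12345\t0.8\tA\tT\t1\t0\t1\t0\n"
    b"22\trs2\t18345\t0.3\tT\tG\t0\t0\t1\t0\n"
)
for line in build_pos_lines(hap):
    print(line)
```

Reading the file in binary mode keeps the offsets in bytes rather than characters.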

Multipos File: This file maintains a pre-computed list of problematic SNPs that are mapped to multiple loci and will be skipped during analysis, one line per snp name or rs no.

Pre-computed haplotype files from 1000 Genomes reference panels

Pre-computed haplotype files and their corresponding index files (for input with options --hap-file, --pos-file) from the 1000 Genomes release of May 2012 are available for download from 1000G haplotypes for use with mode=summary. The following reference populations are currently available:

ASW  CEU
Please use the correct reference panel with same ethnicity as the study sample for gene-based analysis.

File format for input files specifying gene list

Tip: To ensure proper running, please check that all input files are tab–delimited.

Gene-set File: This file specifies the gene boundary information for each gene to be used in the analysis. The first line is a mandatory header line and must start with a ‘#’. Each line contains 5 mandatory columns:

1. Gene id
2. Gene name
3. Chromosome
4. Gene start position in base-pairs
5. Gene end position in base-pairs

Example:

#GeneID         GeneName    Chr    Start    End
GeneID:347688   TUBB8       10     82997    85178
GeneID:439945   LOC439945   10     116561   122386
        ...

Build 37.3: A list of genes for build 37.3 is available at Genes.37.3. Lists of genes for build 37.3 with the effective number of tests for Hapmap and 1000G are available at Genes.37.3.Hapmap and Genes.37.3.1000G.

Note

   1. The file must be sorted in ascending order of chromosome. Within each chromosome, it should be sorted by 
      ascending order of gene start positions.  
   2. Must be tab delimited.
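
A gene-set file can be checked against the sort requirement before use. The sketch below returns the ID of the first gene that is out of order, or `None` if the rows are sorted. It assumes integer chromosome labels and rows already split into their 5 fields; `first_unsorted_gene` is a hypothetical helper.

```python
def first_unsorted_gene(rows):
    """Return the Gene id of the first out-of-order row, or None if sorted."""
    previous = None
    for row in rows:
        # Column 3 is the chromosome, column 4 the gene start position.
        key = (int(row[2]), int(row[3]))
        if previous is not None and key < previous:
            return row[0]
        previous = key
    return None


genes = [
    ["GeneID:347688", "TUBB8", "10", "82997", "85178"],
    ["GeneID:439945", "LOC439945", "10", "116561", "122386"],
]
print(first_unsorted_gene(genes))  # prints None
```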
