ATLAS / VCF Tools: VCFToLFMM

Overview

Converts a VCF file to an LFMM file. Various filters (MAF, depth, variant quality, missingness, specific samples, genomic regions, chromosomes, etc.) can be applied.

Input

  • VCF file: to be converted
  • geno: which LFMM format to use. Either calledGeno, to store called genotypes (input for LFMM1), or postGeno, to store the mean posterior genotypes (input for LFMM2). Please note: LFMM2 does not accept missing genotypes. Impute your VCF before using postGeno, and do not set filters that lead to missing data.
  • txt file (optional): e.g. samplesPopulations.txt

This is a user-created text file listing the samples to be used, one sample name per line.

Example:

sample1
sample2
sample5
sample8
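
A samples file like the one above can be written by hand. As a convenience, the following is a minimal Python sketch (not part of ATLAS) that reads the sample names from the #CHROM header line of a VCF file and writes a chosen subset, one name per line. The function names read_vcf_samples and write_samples_file are hypothetical helpers introduced here for illustration.

```python
import gzip


def read_vcf_samples(vcf_path):
    """Return the sample names from the #CHROM header line of a VCF file."""
    # Plain-text and gzip/bgzip-compressed VCF files are both supported.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields;
                # sample names start at the tenth column.
                return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found in " + vcf_path)


def write_samples_file(samples, out_path):
    """Write one sample name per line (hypothetical helper)."""
    with open(out_path, "w") as handle:
        for name in samples:
            handle.write(name + "\n")
```

For example, `write_samples_file(read_vcf_samples("example.vcf.gz")[:4], "samplesPopulations.txt")` would keep the first four samples of the VCF file.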

Output

  • LFMM file with suffix ".lfmm". Contains the genotypes (geno=calledGeno) or the mean posterior genotypes (geno=postGeno) in LFMM format.
  • text file with suffix ".lfmm.kept_loci". Contains the names (chr:pos) of loci that passed all filters and are present in the LFMM file.
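
The two output files can be cross-checked against each other. The Python sketch below is not part of ATLAS and rests on two assumptions that this page does not state: that the ".lfmm" file follows the usual LFMM layout (one row per individual, one whitespace-separated column per locus) and that the ".lfmm.kept_loci" file holds one chr:pos name per line without a header. All function names are hypothetical.

```python
def read_lfmm(path):
    """Read an LFMM matrix as a list of rows of floats (assumed layout)."""
    with open(path) as handle:
        # split() without arguments accepts both spaces and tabs.
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def read_kept_loci(path):
    """Read locus names (chr:pos), assuming one per line and no header."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def check_lfmm(lfmm_path, loci_path):
    """Return (number of individuals, number of loci) if the files agree."""
    matrix = read_lfmm(lfmm_path)
    loci = read_kept_loci(loci_path)
    widths = {len(row) for row in matrix}
    if len(widths) != 1:
        raise ValueError("expected a non-empty matrix with rows of equal length")
    if widths.pop() != len(loci):
        raise ValueError("number of LFMM columns differs from number of kept loci")
    return len(matrix), len(loci)
```

If the assumptions hold, `check_lfmm("example.lfmm", "example.lfmm.kept_loci")` returns the matrix dimensions; the file names here are placeholders for whatever names ATLAS actually writes.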

Usage Example

./atlas task=VCFToLFMM geno=calledGeno vcf=example.vcf.gz samples=samplesPopulations.txt
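
When the conversion is run from a script, the key=value command line shown above can be assembled programmatically. The following Python sketch is one possible wrapper, not an ATLAS interface: build_atlas_command is a hypothetical helper, the path ./atlas is taken from the usage example, and only argument names documented on this page should be passed as options.

```python
import subprocess


def build_atlas_command(vcf, geno, atlas="./atlas", **options):
    """Build the argument list for a VCFToLFMM run (hypothetical helper)."""
    if geno not in ("calledGeno", "postGeno"):
        raise ValueError("geno must be 'calledGeno' or 'postGeno'")
    args = [atlas, "task=VCFToLFMM", "geno=" + geno, "vcf=" + vcf]
    # Remaining options are passed through as key=value pairs,
    # e.g. samples="samplesPopulations.txt" or minDepth=2.
    for key, value in options.items():
        args.append(f"{key}={value}")
    return args


def run_atlas(args):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(args, check=True)
```

For instance, `build_atlas_command("example.vcf.gz", "calledGeno", samples="samplesPopulations.txt")` reproduces the command above as a list that can be handed to `run_atlas`.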

Specific Arguments

  • samples: file listing the samples to be used
  • limitLines: number of lines to be read from the VCF file
  • minDepth: only store sites with at least this depth. Default = 1
  • minSamplesWithData: only store sites for which at least this many samples have data. Default = 1
  • minMAF: only store sites where the initial estimate of the allele frequency is greater than or equal to minMAF. Default = 0.0
  • minVariantQuality: only store sites with at least this variant quality. Default = 0
  • keepChromosomes: only loci on these chromosomes are kept. The argument can be a filename (which must end with .txt) or a comma-separated list of chromosome names
  • window: a BED file with three columns that correspond to chromosome, start (0-based) and end position of the windows that should be kept. If both keepChromosomes and window are defined, only the overlap of the two is kept
  • reportFreq: number of lines after which the reading progress is printed to the terminal. Default = 10000
  • epsF: epsilon for EM algorithm to estimate allele frequencies. Default = 0.0001
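
How keepChromosomes and window combine can be illustrated with a short Python sketch. It is not ATLAS code: the function names are hypothetical, and it assumes the standard BED convention in which the start is 0-based and the end is exclusive, so a 1-based VCF position pos lies in a window when start < pos <= end. How ATLAS treats the end coordinate is not stated on this page.

```python
def in_windows(chrom, pos, windows):
    """True if the 1-based position falls inside any (chrom, start, end) window.

    Assumes half-open BED intervals: start is 0-based, end is exclusive.
    """
    return any(c == chrom and start < pos <= end for c, start, end in windows)


def keep_locus(chrom, pos, keep_chromosomes=None, windows=None):
    """Keep a locus only if it passes every region filter that is defined."""
    if keep_chromosomes is not None and chrom not in keep_chromosomes:
        return False
    if windows is not None and not in_windows(chrom, pos, windows):
        return False
    return True
```

With both filters set, a locus inside a window on a chromosome that is not listed in keepChromosomes is dropped, which matches the overlap rule described for the window argument.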
