
bgen / BGEN_in_the_UK_Biobank

BGEN has been used for release of imputed genotype probability data in the UK Biobank. This page contains technical details of the formats used.

Note: Questions about the UK Biobank genomics data releases should be directed to the UKB-GENETICS mailing list (https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS).

UK Biobank genotype and imputed data full release

The UK Biobank has released both imputed genotype and phased haplotype data for the full biobank cohort (487,409 individuals after QC, including the individuals from the interim release). This section contains details of what is found in these data. For full information on data processing, see Bycroft, Freeman, Petkova et al, "The UK Biobank resource with deep phenotyping and genomic data", Nature (2018).

Phased haplotype data

Phased haplotypes have been released in bgen files with names of the form ukb_hap_chr<chr>_v2.bgen, with corresponding .bgen.bgi index files. Here are details of what is found in these files:

  • Genotypes for individuals and SNPs passing QC were statistically phased using SHAPEIT3, as described here.
  • The output data is encoded in BGEN format with 8 bits per probability, and using zlib compression.
  • The phasing process outputs hard-called phased genotypes. This implies that each stored probability in the file is either 0 or 1, and these can be interpreted as hard calls.
  • The phasing process also 'fills in' missing genotype calls with hard-called phased genotypes. Note, however, that genotypes marked as missing may still appear in the file, for example for individuals whose data have been masked out.
  • Phasing was carried out in chunks, but chunks were then 'ligated' so that the phase in the files can be treated as consistent across each chromosome. Estimates of error rates are provided in the paper above.
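To make the '8 bits per probability' and 'hard call' points above concrete, here is a minimal Python sketch. It assumes the BGEN v1.2 convention that a stored unsigned integer x with B bits represents the probability x / (2^B - 1). The function names are hypothetical and are not part of any bgen tool.

```python
def decode_probability(stored: int, bits: int = 8) -> float:
    """Map a stored unsigned integer to a probability in [0, 1].

    Assumes the value is scaled by 2**bits - 1 (255 for 8 bits).
    """
    max_value = (1 << bits) - 1
    if not 0 <= stored <= max_value:
        raise ValueError(f"stored value {stored} does not fit in {bits} bits")
    return stored / max_value


def is_hard_call(stored: int, bits: int = 8) -> bool:
    """Return True if the stored value decodes to exactly 0 or exactly 1."""
    return stored in (0, (1 << bits) - 1)
```

Under this assumption, every stored value in the phased haplotype files would be either 0 or 255, so is_hard_call would return True for each of them.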

A simple way to see the contents of these files is to convert them to VCF format using bgenix, e.g. using the command:

bgenix -g ukb_hap_chr10_v2.bgen -vcf

This output reflects the fact that data is conceptually stored as four probabilities per individual per variant (i.e. the probability of each of the two alleles on each of the two haplotypes), and is directly convertible to a phased genotype call. See the BGEN format specification for full details of data storage.
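The conversion from four probabilities to a phased call can be sketched as follows. This is an illustration only: the function name phased_call is hypothetical, and the ordering of the four values (haplotype 1 first allele, haplotype 1 second allele, haplotype 2 first allele, haplotype 2 second allele) is an assumption made for the example.

```python
from typing import Optional, Sequence


def phased_call(probs: Sequence[float], tol: float = 1e-6) -> Optional[str]:
    """Convert four per-haplotype allele probabilities to a VCF-style call.

    Returns a string such as "0|1", where 0 is the first allele and 1 the
    second, or None if either haplotype is not a hard call.
    """
    if len(probs) != 4:
        raise ValueError("expected four probabilities (two per haplotype)")
    alleles = []
    for first, second in (probs[0:2], probs[2:4]):
        if abs(first - 1.0) <= tol and abs(second) <= tol:
            alleles.append("0")  # haplotype carries the first allele
        elif abs(first) <= tol and abs(second - 1.0) <= tol:
            alleles.append("1")  # haplotype carries the second allele
        else:
            return None  # not a hard call; treat as missing
    return "|".join(alleles)
```

For example, phased_call((1.0, 0.0, 0.0, 1.0)) gives "0|1". Returning None for anything that is not a hard call is a design choice of this sketch, so that missing or masked genotypes are not silently turned into calls.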

A note on chromosome information: a processing issue means that these files have been encoded with blank chromosome information (instead, the chromosome is encoded in the variant ID field of the file). This has consequences for analysis using the bgen tools. Please see Using the UK Biobank full release index files for more information on this and a workaround.

Imputed genotype data

Imputed data files have been released in BGEN format files, with filenames of the form ukb_imp_chr<chr>_<version>.bgen, and corresponding index files. Here the version is either 'v2' (for the initial release of these data) or 'v3' (for the later release, which fixed a number of bugs in the initial release). Here are details of what is found in these files:

  • Imputation has been performed into both the Haplotype Reference Consortium and the UK10K reference panels. These results have been merged into a single release dataset forming directly typed or imputed genotypes at 92,693,895 variants across the autosomal chromosomes.
  • Imputed (unphased) genotypes for this release are being supplied in the BGEN v1.2 format using 8 bits per probability, and using zlib compression. Index files for use with bgenix will also be provided. The expected total file size is around 2.1 TB.
  • Data will initially be released for autosomal chromosomes. All samples will be diploid. Some genotypes may be missing, and this should be taken into account when processing these data.
  • Coordinates in these files are with respect to the GRCh37/NCBI build 37 reference assembly, and are '1-based', i.e. they treat the first position on each chromosome as position 1.
  • Alleles in these files are expressed with respect to the forward (+) strand. The first allele is the reference allele, and the second allele the alternative allele.
  • Where multi-allelic variants exist in these data, they have been split into a series of bi-allelic variants. This implies that several variants may share the same genomic position but with different alternative alleles.
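Because multi-allelic variants are split, downstream code cannot assume that a position identifies a single variant. The following Python sketch shows one way to find positions shared by several bi-allelic records. The Variant record, the function name and the example data are all hypothetical; in practice the records would come from whichever BGEN reader you use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    chromosome: str
    position: int  # 1-based coordinate on GRCh37
    ref: str       # first allele (reference)
    alt: str       # second allele (alternative)


def shared_positions(
    variants: Iterable[Variant],
) -> Dict[Tuple[str, int], List[str]]:
    """Return alternative alleles for each position seen more than once."""
    by_site: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for variant in variants:
        by_site[(variant.chromosome, variant.position)].append(variant.alt)
    # Keep only positions that carry more than one bi-allelic record.
    return {site: alts for site, alts in by_site.items() if len(alts) > 1}
```

With the made-up records Variant("10", 1000, "A", "C"), Variant("10", 1000, "A", "T") and Variant("10", 2000, "G", "A"), only the position ("10", 1000) is reported, with alternative alleles ["C", "T"].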

Important: For most purposes you should use the final 'v3' version of these files. Please read Using the UK Biobank full release index files if you are using the index files that the UK Biobank supplied with the initial ('v2') version of these data.

UK Biobank genotype and imputed data interim release

In May 2015 the UK Biobank released imputed genotype data for 152,249 individuals, typed / imputed at 72,355,667 variants genome-wide. These data were released in BGEN v1.1 format. See the UK Biobank Data Showcase page for more information on these data.
