bgen / Using_the_UK_Biobank_full_release_index_files

Note on the UK Biobank full release phased haplotype files (11/10/2017).

The information in this section pertains to the 'v2' haplotype files, which are the most current at the time of writing.

Phased haplotype data (named in the form ukb_hap_chr<chr>_v2.bgen) have been encoded in BGEN format. However, the chromosome information in these files, and also in the corresponding index files, is set to blank (i.e. to the empty string) for every variant.

This impacts workflows using these data. For example, to extract the first 100kb of chromosome 10 using bgenix, one would normally write:

bgenix -g ukb_hap_chr10_v2.bgen -incl-range "10:0-100000"

but instead must write

bgenix -g ukb_hap_chr10_v2.bgen -incl-range ":0-100000"

because the chromosome is blank in the files.

Workaround: If this isn't satisfactory, you can manually update the index files using sqlite3. For example, the command

sqlite3 ukb_hap_chr10_v2.bgen.bgi "UPDATE Variant SET chromosome = '10'"

will update the chromosome identifier in the chromosome 10 index file to the appropriate value. (You would need to run this once per chromosome, substituting the file name and chromosome each time.) After this, the first query above will work. Note, however, that this command updates only the index, not the data file itself, so the output will still have blank chromosome information.
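Running the update once per chromosome can be scripted. The following is a minimal sketch using Python's standard sqlite3 module rather than the sqlite3 command line; it assumes the index files follow the ukb_hap_chr<chr>_v2.bgen.bgi naming and that only autosomes 1-22 are wanted. The function names set_index_chromosome and update_all_indexes are hypothetical and are not part of bgenix.

```python
import sqlite3
from pathlib import Path


def set_index_chromosome(index_path, chromosome):
    """Set the chromosome field of every variant in one bgenix index file.

    Only the index is modified; the .bgen data file is left untouched.
    Returns the number of rows updated.
    """
    path = Path(index_path)
    if not path.is_file():
        # sqlite3.connect() would otherwise silently create an empty database.
        raise FileNotFoundError(path)
    conn = sqlite3.connect(str(path))
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                "UPDATE Variant SET chromosome = ?", (chromosome,)
            )
            return cursor.rowcount
    finally:
        conn.close()


def update_all_indexes(directory=".", chromosomes=range(1, 23)):
    """Apply set_index_chromosome to each per-chromosome haplotype index.

    Assumption: files are named ukb_hap_chr<chr>_v2.bgen.bgi in `directory`.
    Returns a dict mapping file name to the number of rows updated.
    """
    results = {}
    for n in chromosomes:
        index_file = Path(directory) / f"ukb_hap_chr{n}_v2.bgen.bgi"
        results[index_file.name] = set_index_chromosome(index_file, str(n))
    return results
```

Calling update_all_indexes() in the directory holding the index files would update all 22 of them. Whatever string is written here is the one that later -incl-range queries must use, so choose a convention (for example '1' versus '01') and keep it consistent.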

Note on the UK Biobank full release 'v2' imputed genotype data index files (13/07/2017).

The information in this section applies only to the initial 'v2' release of UK Biobank imputed genotype data, not to the later 'v3' version. (The version identifier appears in the file names.) Our understanding is that all users should now be using version 'v3', but we've left this information here for legacy purposes.

The index files provided with the UK Biobank imputed data full release 'v2' have been named in a way that bgenix does not recognise by default (i.e. in the form ukb_imp_chr<N>_v2.bgi instead of the expected ukb_imp_chr<N>_v2.bgen.bgi). To work around this, one of the following options can be used:

  1. Rename or copy each index file to the expected name, e.g. rename ukb_bgi_chr<N>_v2.bgi to ukb_imp_chr<N>_v2.bgen.bgi.
  2. Use bgenix to recreate the index files (e.g. run bgenix -g ukb_imp_chr<N>_v2.bgen -index for each chromosome). This typically takes a few minutes per file. We recommend this option because the regenerated index also includes extra metadata which bgenix can use for sanity checking.
  3. Specify a non-default index filename to bgenix (via the -i option). While this option will work, it is not generally recommended because it makes it harder to get the command line right and may not solve the problem for tools other than bgenix.

This does not apply to the 'v3' version of these data, where the index files have been appropriately named.
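For option 1, the copy can be scripted across chromosomes. The sketch below is illustrative only: the function name copy_indexes_to_expected_names is hypothetical, and because this page shows the supplied index name with both the ukb_imp_ and ukb_bgi_ prefixes, the source naming is left as a parameter to be checked against the files you actually have. It copies rather than renames, so the original files are kept.

```python
import shutil
from pathlib import Path


def copy_indexes_to_expected_names(directory=".", chromosomes=range(1, 23),
                                   source_pattern="ukb_imp_chr{n}_v2.bgi"):
    """Copy each v2 index file to <bgen filename>.bgi, the name bgenix expects.

    Assumption: `source_pattern` matches the supplied index file names;
    pass e.g. "ukb_bgi_chr{n}_v2.bgi" if your files use that prefix.
    Missing sources are skipped and existing targets are never overwritten.
    Returns the list of target paths that were created.
    """
    created = []
    for n in chromosomes:
        source = Path(directory) / source_pattern.format(n=n)
        target = Path(directory) / f"ukb_imp_chr{n}_v2.bgen.bgi"
        if source.is_file() and not target.exists():
            shutil.copy2(source, target)  # copy, so the original stays in place
            created.append(target)
    return created
```

Copying doubles the disk space used by the index files; if that matters, a symbolic link to the original would serve the same purpose on most systems.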

Other information

See BGEN in the UK Biobank for technical information on the imputed data provided by the UK Biobank.
