
bgen / Missing data in BGEN

BGEN handles missing data in two different ways, one for each of the two available format versions (v1.1 and v1.2).

Missing data in BGEN v1.1

In v1.1, genotype data is stored as three probability values per sample (in the same way as for GEN files). A completely missing genotype is therefore encoded as three zero probabilities. It is also possible to encode a partly missing genotype by storing probability values that sum to less than one; values like this are sometimes output by clustering-based genotype calling algorithms, and the shortfall is interpreted as a nonzero probability of a NULL genotype call.

Note that in this implementation, BGEN v1.1 does not treat missing values specially; values stored as zeroes are simply returned to the calling code like any other probability value.
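Because zeroes come back as ordinary values, calling code that cares about missingness in v1.1 data has to check for it itself. The following is a minimal Python sketch of such a check, not part of this implementation: the function names and the tolerance `tol` are hypothetical, and the tolerance is an arbitrary choice intended to absorb rounding in the stored values.

```python
def null_call_probability(probs, tol=1e-3):
    """Return the probability mass not assigned to any of the three genotypes.

    `probs` is a sequence of three probabilities as read from a v1.1 record.
    `tol` is an assumed allowance for rounding error, not a value from the spec.
    """
    if len(probs) != 3:
        raise ValueError("expected three genotype probabilities")
    total = sum(probs)
    if total > 1.0 + tol:
        raise ValueError("probabilities sum to more than one")
    # Clamp at zero so that rounding slightly above one is not reported as negative.
    return max(0.0, 1.0 - total)


def classify(probs, tol=1e-3):
    """Label a v1.1 probability triple as missing, partially missing or complete."""
    null_p = null_call_probability(probs, tol)
    if all(p == 0.0 for p in probs):
        return "missing"            # three zeroes: completely missing genotype
    if null_p > tol:
        return "partially missing"  # some mass left over for a NULL call
    return "complete"
```

For example, `classify([0.0, 0.0, 0.0])` reports a missing genotype, while `classify([0.7, 0.2, 0.0])` reports a partially missing one with a NULL call probability of about 0.1.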

Missing data in BGEN v1.2

In v1.2, only two probabilities are stored per sample (for a diploid sample at a biallelic SNP); the third is computed as one minus the sum of the other two. The three probabilities therefore always sum to one, so BGEN v1.2 cannot be used to store partially missing genotypes. This means that BGEN v1.2 can be used:

  • For genotype data that consists of hard genotype calls (possibly with missingness).

  • For most phased datasets.

  • For most imputed datasets.

However, it cannot be used for genotype data from a clustering algorithm that outputs a nonzero probability of a NULL genotype call.
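The restriction follows directly from how the third probability is derived. As an illustration only, the Python sketch below reconstructs the third value for a diploid sample at a biallelic SNP and tests whether a probability triple could be represented in v1.2. The function names and the tolerance are hypothetical and do not correspond to anything in this implementation.

```python
def expand_v12_diploid_biallelic(p_aa, p_ab):
    """Recover all three genotype probabilities from the two stored ones."""
    # The third probability is one minus the sum of the others;
    # clamp at zero to guard against floating-point rounding.
    p_bb = max(0.0, 1.0 - (p_aa + p_ab))
    return (p_aa, p_ab, p_bb)


def representable_in_v12(probs, tol=1e-3):
    """Return True if a probability triple can be stored in v1.2.

    Completely missing genotypes (all zeroes) can be stored, because v1.2
    treats missing values specially. Otherwise the values must sum to one,
    within an assumed tolerance `tol`.
    """
    if all(p == 0.0 for p in probs):
        return True
    return abs(sum(probs) - 1.0) <= tol
```

Under these assumptions, `representable_in_v12([0.7, 0.2, 0.0])` is false: storing 0.7 and 0.2 and reading them back would yield a third probability of about 0.1 rather than 0, so the NULL call probability would be silently reassigned to a genotype.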

Missing values are treated specially in BGEN v1.2. In this implementation, probability values for samples with missing data are returned as NA (that is, as an instance of the genfile::MissingValue class).
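Code that consumes v1.2 probabilities therefore needs to handle NA explicitly rather than treating every value as a number. The Python sketch below shows one way to propagate missingness through a simple calculation. It uses `None` as a stand-in for genfile::MissingValue, and the function `expected_dosage` is a hypothetical example, not part of this implementation.

```python
def expected_dosage(probs):
    """Expected count of the second allele for a diploid, biallelic genotype.

    `probs` holds the three genotype probabilities (AA, AB, BB). A value of
    None stands in for NA here; this is an assumption made for illustration.
    """
    if any(p is None for p in probs):
        # Propagate missingness instead of computing with a placeholder value.
        return None
    return probs[1] + 2.0 * probs[2]
```

With this approach a missing sample yields `None` rather than a dosage of zero, which is what the same calculation would produce from the three zero probabilities used for missing genotypes in v1.1.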