bgenix is a tool to create an index of variants in a bgen file and to use that index for efficient retrieval of data for specific variants or regions.
Here's a quick list of common bgenix command lines and what they do:
|Command line or option||What it does|
|bgenix -help||Print help on the various options bgenix supports.|
|bgenix -g file.bgen -index||Don't output data; instead create an index file for the given bgen file. It will be named file.bgen.bgi.|
|bgenix -g file.bgen||Output genotype data in the same format as file.bgen but ordered as in the index.|
|-incl-rsids||Restrict the output to variants with the specified rsid.|
|-incl-range||Restrict to variants in the given genomic range.|
|-vcf||Transcode data to VCF format.|
|-v11||Transcode data to BGEN v1.1 format.|
|-list||Don't output genotype data, just list the variants in the index.|
Query options and output format options can of course be combined, e.g. the command
bgenix -g file.bgen -list -incl-range 11:3500000-6500000
will list all variants in the given range, while
bgenix -g file.bgen -vcf -incl-range 11:3500000-6500000
will output a VCF file for that region.
Note: bgenix writes its output to stdout. A full command line will therefore often redirect the output to a file, as in:
bgenix -g file.bgen -incl-range 11:3500000-6500000 > output.bgen
or pipe it to another command, as in:
bgenix -g file.bgen -incl-range 11:3500000-6500000 | qctool -g - -filetype bgen -snp-stats -osnp stats.txt
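The same redirection can be done from a script. The following is a minimal Python sketch, not part of bgenix: the helper names `bgenix_command` and `run_to_file` are made up for illustration, and it assumes a `bgenix` executable is on the PATH. It only uses options that appear in this document.

```python
import subprocess


def bgenix_command(bgen_path, *options):
    """Build an argument list such as: bgenix -g file.bgen -incl-range 11:0-100."""
    return ["bgenix", "-g", str(bgen_path), *options]


def run_to_file(command, output_path):
    """Run a command and write its stdout to output_path (like `command > output_path`)."""
    # Binary mode: the output may be BGEN data rather than text.
    with open(output_path, "wb") as handle:
        # check=True raises CalledProcessError if the command exits with an error.
        subprocess.run(command, stdout=handle, check=True)


# Example usage (assumes bgenix is installed and file.bgen has been indexed):
# run_to_file(
#     bgenix_command("file.bgen", "-incl-range", "11:3500000-6500000"),
#     "output.bgen",
# )
```

Passing the arguments as a list avoids shell quoting issues; a pipeline into another tool would need `subprocess.Popen` instead.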
bgenix can be used to construct an index file like this:
bgenix -g file.bgen -index
This creates an index file called file.bgen.bgi. Subsequently, data can be retrieved by further calls, e.g.
bgenix -g file.bgen -incl-range 11:0-1000000 > region.bgen
As the command suggests, this outputs a bgen file containing only data in the specified region.
Run bgenix -help for a full list of supported options. See The bgenix index file format for a fuller description of the index file format.
If the -list option is given, bgenix will list variants instead of outputting a bgen file. For example, using the file complex.bgen included in the example/ folder in the bgen repository, the command:
bgenix -g example/complex.bgen -incl-range 01:0- -list
produces this output:
# bgenix: started 2016-07-06 09:01:15
alternate_ids rsid chromosome position number_of_alleles first_allele
. V1 01 1 2 A G
V2.1 V2 01 2 2 A G
. V3 01 3 2 A G
. M4 01 4 3 A G,T
. M5 01 5 2 A G
. M6 01 7 4 A G,GT,GTT
. M7 01 7 6 A G,GT,GTT,GTTT,GTTTT
. M8 01 8 7 A G,GT,GTT,GTTT,GTTTT,GTTTTT
. M9 01 9 8 A G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
. M10 01 10 2 A G
# bgenix: success, total 10 variants.
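If the listing needs to be consumed by a script, it can be parsed line by line. The sketch below is illustrative only: `ListedVariant` and `parse_list_output` are hypothetical names, and it assumes the layout shown in the example above, namely whitespace-separated rows of seven fields in which the last field holds the comma-separated non-first alleles, plus status lines starting with `#`.

```python
from collections import namedtuple

# Field names follow the header of the listing; `alleles` is this sketch's own
# name for the first allele followed by the remaining alleles.
ListedVariant = namedtuple(
    "ListedVariant",
    "alternate_ids rsid chromosome position number_of_alleles alleles",
)


def parse_list_output(text):
    """Parse `bgenix -list` output (layout assumed as above) into records."""
    variants = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, '#' status lines and the column header.
        if not fields or fields[0].startswith("#") or fields[0] == "alternate_ids":
            continue
        if len(fields) != 7:
            raise ValueError(f"unexpected line: {line!r}")
        alt_ids, rsid, chromosome, position, n_alleles, first, others = fields
        variants.append(
            ListedVariant(
                alternate_ids=alt_ids,  # kept verbatim, "." included
                rsid=rsid,
                chromosome=chromosome,  # kept as text so "01" is preserved
                position=int(position),
                number_of_alleles=int(n_alleles),
                alleles=[first] + others.split(","),
            )
        )
    return variants
```

Chromosomes are kept as strings because identifiers such as `01` would lose their leading zero if converted to integers.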
As described below, another way to list variants is to query the index directly with sqlite3:
sqlite3 -header -csv example/complex.bgen.bgi "SELECT * FROM Variant"
Note: you need sqlite3 version 3.8.2 or above for this to work out of the box - see below for more information.
bgenix can pull out data based on chromosome and position, or by variant identifier.
In general, a variant will be output if it satisfies at least one inclusion (-incl-*) condition and does not satisfy any exclusion (-excl-*) condition. The relevant options are:
|Include specific identifiers||-incl-rsids||For each argument, bgenix looks to see if the argument is the name of a readable file. If so it opens the file and reads whitespace-separated identifiers from it. Otherwise it assumes the argument itself is an identifier.|
|Exclude specific identifiers||-excl-rsids||Arguments are interpreted as for -incl-rsids.|
|Include variants in ranges||-incl-range||Ranges are written as chromosome:start-end, e.g. 11:3500000-6500000. One or both positions can be missing, in which case a half-open interval is assumed. As for -incl-rsids, bgenix checks to see if the argument is the name of a readable file and if so reads ranges from it; otherwise it treats the argument itself as a range.|
|Exclude variants in ranges||-excl-range||Arguments are interpreted as for -incl-range.|
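The selection rule can be modelled in a few lines. This is a sketch of the logic as described here, not bgenix's actual implementation; the function names are hypothetical. It makes three labelled assumptions: range endpoints are treated as inclusive, chromosomes are compared as text (so `01` and `1` differ), and when no inclusion condition is given every variant is a candidate.

```python
import os
import re

_RANGE = re.compile(r"^([^:]+):(\d*)-(\d*)$")


def expand_argument(argument):
    """Return tokens from a readable file, else the argument itself."""
    if os.path.isfile(argument) and os.access(argument, os.R_OK):
        with open(argument) as handle:
            return handle.read().split()  # whitespace-separated entries
    return [argument]


def parse_range(text):
    """Parse 'chromosome:start-end'; a missing position becomes None."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"not a range: {text!r}")
    chromosome, start, end = match.groups()
    return chromosome, int(start) if start else None, int(end) if end else None


def in_range(chromosome, position, rng):
    """Assumption of this sketch: both endpoints are inclusive."""
    rng_chromosome, start, end = rng
    return (
        chromosome == rng_chromosome
        and (start is None or position >= start)
        and (end is None or position <= end)
    )


def is_selected(rsid, chromosome, position,
                incl_rsids=(), excl_rsids=(), incl_ranges=(), excl_ranges=()):
    """At least one inclusion condition holds and no exclusion condition holds."""
    if incl_rsids or incl_ranges:
        included = rsid in incl_rsids or any(
            in_range(chromosome, position, r) for r in incl_ranges
        )
    else:
        # Assumption: with no inclusion options, nothing is filtered out here.
        included = True
    excluded = rsid in excl_rsids or any(
        in_range(chromosome, position, r) for r in excl_ranges
    )
    return included and not excluded
```

For example, `is_selected("V1", "01", 1, incl_ranges=[parse_range("01:0-")], excl_rsids={"V1"})` is false because the exclusion takes precedence.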
Transcoding to other formats
By default bgenix outputs data in the same format in which it is stored. This is the most efficient way of using bgenix since it does not involve any decompression or recompression. Optionally, however, you can ask bgenix to transcode data to other formats instead. Two formats are currently supported:
- the VCF format (specified using the -vcf option)
- BGEN v1.1 format (specified using the -v11 option). Currently this is only supported when the input data is in a specific format, namely BGEN with 'layout=2' blocks, 8-bit probability encoding, and all samples diploid.
Details of the index file format
bgenix stores its index in a plain sqlite3 file. This has a number of advantages.
For example, you can inspect the index file using the sqlite3 command-line program. E.g. to get a list of variants:
sqlite3 file.bgen.bgi "SELECT * FROM Variant"
You can use sqlite3 to do whatever you want with the file - for example, you could create a new table containing only a specific subset of variants for use in your project (the -table option can be used to tell bgenix to use this new table).
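As an illustration of building such a subset table, here is a minimal Python sketch using the standard sqlite3 module. It is an assumption-laden example rather than a documented recipe: `create_subset_table` is a hypothetical helper, it assumes the Variant table has an `rsid` column, and it assumes bgenix accepts a table created with `CREATE TABLE ... AS SELECT` (which does not copy the original table's keys or WITHOUT ROWID property). Check The bgenix index file format before relying on it.

```python
import sqlite3


def create_subset_table(connection, table_name, rsids):
    """Copy the rows of Variant whose rsid is in `rsids` into a new table."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    rsids = list(rsids)
    placeholders = ",".join("?" for _ in rsids)
    with connection:  # commits on success, rolls back on error
        connection.execute(
            f"CREATE TABLE {table_name} AS "
            f"SELECT * FROM Variant WHERE rsid IN ({placeholders})",
            rsids,
        )


# Example usage (file and table names are placeholders):
# connection = sqlite3.connect("file.bgen.bgi")
# create_subset_table(connection, "MyVariants", ["V1", "M4"])
# connection.close()
# The new table could then be passed to bgenix via the -table option.
```

Very long identifier lists may exceed SQLite's limit on bound parameters; in that case loading the identifiers into a temporary table and joining against it is a common alternative.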
Note: For performance reasons bgenix uses "WITHOUT ROWID" tables to implement the index. This means you need sqlite3 version 3.8.2 or greater to inspect the file; otherwise you'll get a message like "Error: malformed database schema".
Alternatively, you can use the -with-rowid option when building the index:
bgenix -g myfile.bgen -index -with-rowid
This builds the index in a table with a rowid column, usable with earlier versions of sqlite.
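When the index is read from a script rather than the sqlite3 command-line program, the relevant version is that of the SQLite library the script links against. A small Python sketch of the version check follows; the function name `supports_without_rowid` is made up for this example, and the 3.8.2 threshold is the one stated above.

```python
import sqlite3


def supports_without_rowid(version_info=sqlite3.sqlite_version_info):
    """True if this SQLite version can open WITHOUT ROWID tables (3.8.2 or later)."""
    return tuple(version_info) >= (3, 8, 2)


# Example usage:
# if not supports_without_rowid():
#     print("Rebuild the index with -with-rowid, or upgrade SQLite.")
```

Note that `sqlite3.sqlite_version_info` reports the library bundled with Python, which may differ from the version of the sqlite3 command-line program installed on the same machine.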
See The bgenix index file format for full details.
bgenix is motivated by and in some respects designed to mimic tabix, the htslib tool for indexing tab-delimited files. The key functionality of bgenix is all implemented using the sqlite3 library. Thank you, sqlite authors!