Clone wiki

bx-python / IO / SeekingInBzip2Files

Seeking to a specific position in the uncompressed data of a compressed file is generally difficult. Most common compression formats (such as gzip) are stream based -- to decompress any portion you have to decompress every preceding byte. For gzip some variations exist that work around this (dictzip, rsyncable gzip). Paul Sladen has posted some useful information on this subject here and here.

bzip2 is (fortunately) somewhat different, in that the file is compressed in chunks which are completely independent, thus you can decompress the 7th chunk without needing to look at chunks 1 through 6, as long as you know where in the file to find chunk 7. Thus, pseudo-random access can be implemented by determining what compressed block the desired data is in.

Paul Sladen (in the above pages) states:

''Bzip2 is block-based, not stream based as gzip is. A fixed amount of input is taken (normally 900kB) and the block in considered on its own in a completely independant fashion. First a BWT transform is performed, placing likely similar letters closer together. Next is a Move-to-Front transform resulting in values clustered around zero. Both BWT and MTF processing steps are reversible and do not alter the length of the text. Final compression is performed using Huffman, much the same as for gzip.''

Unfortunately this is not strictly true. Bzip2 blocks ''do not'' correspond to a fixed chunk of the input, the input is first run length encodded and then 900kb chunks are taken and compressed. Thus a single chunk can contain a large amount of uncompressed data if the data is highly repetitive. The implication of this for seeking is that it is impossible to just use a lookup table (indexed by uncompressed_index mod block_size) for seeking, instead we need to search a list of both the compressed and corresponding uncompressed position of each block.

To support this, we have implemented the `bzip-table` program that scans through a bzip2 compressed stream and prints the position (in bits since bzip2 chunks are not byte aligned) where each compressed block starts, and the number of bytes of uncompressed data that block contains. For example:

henduck% ls -al chrY.maf.bz2
-rw-r--r--   1 james  james  12646982 Mar 22 23:03 chrY.maf.bz2

henduck% ~/work/seek-bzip2/bzip-table < chrY.maf.bz2 > chrY.maf.bz2t

henduck% head chrY.maf.bz2t 
32         904472
1397269    904854
2773995    906025
4317352    900401
5938996    894743
7678512    895929
9421291    897561
11189100   895831
12914746   895979
14640089   895640

This information can then be used to extract a single block of compressed data from a bzip2 file. Unfortunately, standard bzip2 implementations (e.g. bzlib) do not support uncompressing a single block, thus we have also provided functions to do so. The `seek-bunzip` program provides a trivial example of seeking to a particular block and uncompressing it. It can be run from the command line by passing a block position in bits:

henduck% ~/work/seek-bzip2/seek-bunzip 2773995 < chrY.maf.bz2 > t
henduck% head t | less

`seek-bunzip.c` defines two functions, `seek_bits` for moving to the start of a particular blocks, and `uncompressblock` for uncompressing the block at a particular position to standard out. These are intended as a starting point for implementing a higher level interface (WORK IN PROGRESS). A pure C implementation (sbz_seek, sbz_fread, sbz_tell, etc) is currently left as an excercise, however a Python/C extension that provides file-like access to bzip2 files with tables has been implemented to support indexed access to MAF files, see IO/Bzip2SeekableFile.

These programs are based on a version of micro-bunzip that has been modified to support reading a single block of compressed data in a simple and efficient way. All of the code is available here.