James Taylor / bx-python / Issues
Issue #38 (new)

Reading a bigwig file is slow

fidel_ramirez
created an issue

Hi,

I noticed a large execution-time difference between a Perl script reading a bigwig file and a Python script using bx's BigWigFile.

The following code shows the problem:

{{{
#!python
""" usage: %prog bigwig_file.bw < bed_file.bed """
from bx.intervals.io import GenomicIntervalReader
from bx.bbi.bigwig_file import BigWigFile
import numpy as np
import time
import sys

bw = BigWigFile( open( sys.argv[1] ) )
ll = []
for interval in GenomicIntervalReader( sys.stdin ):
    start = time.time()
    bw.query( interval.chrom, interval.start, interval.end, 20 )
    total = time.time() - start
    ll.append( total )

print np.mean( ll )
}}}

In my case the mean values are around 0.02 seconds, while the Perl counterpart takes less than 0.001 seconds. In practical terms this means waiting several minutes with Python compared to a couple of seconds with Perl.

I profiled the above code and most of the running time is spent in the method read_and_unpack() of bx/misc/binary_file.py. For each call to the BigWigFile query() method there are 10,000 calls to read_and_unpack(), which seems excessive to me; however, I could not explain why so many calls are needed by browsing the source code.
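To make the cost of those many small reads concrete, here is a stand-alone sketch that reproduces the call counting with the standard cProfile module. It uses a toy `read_and_unpack` stand-in, not bx-python's actual code; the names and the 10,000-call figure are taken from the profiling described above.

```python
import cProfile
import io
import pstats
import struct

def read_and_unpack(buf, offset):
    # Stand-in for the per-value read in bx/misc/binary_file.py:
    # unpack a single little-endian float at the given offset.
    return struct.unpack_from("<f", buf, offset)[0]

def query(buf):
    # Simulate one query() that triggers 10,000 tiny unpack calls.
    return [read_and_unpack(buf, i * 4) for i in range(10000)]

buf = bytes(40000)  # 10,000 x 4-byte values

profiler = cProfile.Profile()
profiler.enable()
query(buf)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream) \
      .sort_stats("calls") \
      .print_stats("read_and_unpack")
print(stream.getvalue())
```

Running this shows 10,000 calls attributed to `read_and_unpack`, which is the same pattern cProfile reports for the bigwig script: the total time is dominated by call overhead rather than by the amount of data read.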

Otherwise, thanks for the great code,

Fidel

Comments (5)

  1. Anonymous

    Hi Brent,

    Yes, the Perl code I am comparing to also provides summary information. Here is an example copied from the Perl library:

    use Bio::DB::BigWig 'binMean';

    my $wig = Bio::DB::BigWig->new(-bigwig => 'ExampleData/dpy-27-variable.bw',
                                   -fasta  => '/tmp/elegans.fa');

    # Statistical summaries using "bin" feature type
    # Fetch 10 intervals across region 5M=>6M on chromosome I
    my @bins = $wig->features(-seq_id => 'I',
                              -start  => 5_000_000,
                              -end    => 6_000_000,
                              -type   => 'bin:10');
    for my $b (@bins) {
       my $start = $b->start;
       my $end   = $b->end;
       print "$start..$end, mean = ", $b->mean, "\n";
    }

    I haven't compiled or created Cython code, so I don't know whether I am using a Cythonized version or not. However, I have the impression that the current algorithm could be optimized.

  2. James Taylor repo owner

    Yes, I agree that reading a block there and coercing would be faster.

    I think the other (related) performance hit is BinaryFileReader. It uses the struct module heavily, and is not Cythonized. It is likely an easy target for optimization. A subclass that assumes it has a PyFile rather than any file-like object could be even faster.
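    As a rough illustration of the block-read-and-coerce idea (a generic sketch with the stdlib `struct` module, not BinaryFileReader's actual code), unpacking a whole block in one call is much cheaper than issuing one unpack call per value:

```python
import struct
import timeit

N = 10000  # same order of magnitude as the calls seen in the profile
buf = struct.pack("<%dI" % N, *range(N))  # a block of N little-endian uint32s

def one_at_a_time():
    # Mirrors a per-value reader: one struct call per value,
    # so the fixed call overhead is paid N times.
    return [struct.unpack_from("<I", buf, i * 4)[0] for i in range(N)]

def whole_block():
    # Read the block once and coerce all N values in a single call.
    return list(struct.unpack("<%dI" % N, buf))

# Both strategies decode identical values.
assert one_at_a_time() == whole_block()

t_single = timeit.timeit(one_at_a_time, number=20)
t_block = timeit.timeit(whole_block, number=20)
print("per-value: %.4f s   whole block: %.4f s" % (t_single, t_block))
```

    Precompiling a `struct.Struct` for the block format, or handing the buffer to `numpy.frombuffer`, pushes the per-value cost down further; Cythonizing the reader would then remove most of the remaining Python-level overhead.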
