binning could be faster in certain cases

Issue #37 new
Thomas Gilgenast created an issue

Non-overlapping, equal-sized bins are easy to compute, but bins with arbitrary overlap seem to be more complicated. However, many practical bin parameter choices have the property that they can be computed by doing non-overlapping binning on a finer grid, then performing a convolution over that finer grid.

For example, 4/12 binning (4 kb wide bins, smoothing window side length 12 kb) can be computed by first binning on a 4 kb grid, then convolving with np.ones((3, 3)). To compute a mean, you can compute two quantities on the initial grid: running sum (or sum of logged counts if doing gmean) and total number of FFLJs. After passing both of these quantities through the convolution, the mean is just the ratio of the two quantities.
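A minimal sketch of the idea, using NumPy and SciPy with made-up fine-grid arrays (the variable names and data here are illustrative, not from the pipeline):

```python
import numpy as np
from scipy.signal import convolve2d

# hypothetical fine-grid quantities on a 4 kb grid: the per-bin running sum
# of counts and the per-bin number of contributing FFLJs
rng = np.random.default_rng(0)
counts_sum = rng.random((10, 10))
fflj_count = rng.integers(1, 5, size=(10, 10)).astype(float)

# a 12 kb smoothing window is 3x3 fine bins; pass both quantities through
# the same all-ones convolution
kernel = np.ones((3, 3))
smoothed_sum = convolve2d(counts_sum, kernel, mode='same')
smoothed_count = convolve2d(fflj_count, kernel, mode='same')

# the mean is just the ratio of the two smoothed quantities
mean = smoothed_sum / smoothed_count
```

For a gmean, `counts_sum` would instead hold the sum of logged counts, and the ratio would be exponentiated at the end.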

We can continue thinking about whether a similar approach works when the smoothing window width is not an odd integer multiple of the bin width, but one workaround is to pick the smoothing window width first and adapt the bin width as needed (since the bin width isn't as important, as long as it's small enough). For example, 4/16 binning could be rounded to 3.2/16.
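The rounding rule described above could be sketched as a small helper (the function name and the choice to round the bin count up to the next odd integer are assumptions, not from the issue):

```python
def adapt_bin_width(window_kb, target_bin_kb):
    """Shrink the bin width so the window is an odd multiple of it.

    Picks the number of fine bins per window nearest to the requested
    ratio, bumped to the next odd integer so the window stays centered
    on a bin, then returns the implied (possibly fractional) bin width.
    """
    n = round(window_kb / target_bin_kb)
    if n % 2 == 0:
        n += 1  # bump to odd; smaller bins are fine per the note above
    return window_kb / n

# the 4/16 case from the issue rounds to 3.2/16
adapt_bin_width(16, 4)
```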

Comments (3)

  1. Thomas Gilgenast reporter

    This is getting a priority bump because binning is probably the slowest pipeline step (thanks to recent improvements in the donut expected) and is likely to see heavy usage in the future for all 5C analyses.
