Fixseq
Tatsunori Hashimoto

Overview


FIXSEQ is an overdispersion-correction technique that serves as a smarter way to de-duplicate counts.

Consider using FIXSEQ if:

  • You're going to de-duplicate (reduce per-base counts to one).
  • Downstream analysis does not account for overdispersion.
  • Downstream analysis is linear (e.g., SVM or logistic regression).
  • Downstream analysis uses binned counts.
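For reference, the standard de-duplication in the first bullet simply clips every per-base count to one, discarding all count information. A minimal illustrative sketch (Python, not part of the Fixseq code):

```python
# Illustrative only: standard de-duplication clips each per-base
# read-start count to at most one, throwing away count magnitude.
def deduplicate(counts):
    return [min(c, 1) for c in counts]

per_base_counts = [0, 3, 1, 7, 0, 2]
print(deduplicate(per_base_counts))  # [0, 1, 1, 1, 0, 1]
```

Fixseq instead attenuates counts toward weights that preserve more of the signal.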

FIXSEQ is not designed for the following use cases:

  • Correcting exon-level counts (Fixseq will be far too aggressive at the exon level; apply Fixseq at the per-base level or use a specialized RNA-seq pipeline).
  • Correcting nearly single-base spike data such as ChIP-exo (the core assumption of Fixseq is that per-base information can be attenuated safely).

FIXSEQ is unlikely to hurt even if your method accounts for overdispersion, but if you already use a specialized analysis pipeline that does so, there's no need for FIXSEQ.

How to run

methods.r contains the Fixseq core code needed to perform count corrections. Included is a set of fitting algorithms for the Poisson+Gamma, Poisson+Lognormal, and Poisson+Logconcave models.

Contact

For technical issues and bugs contact thashim@mit.edu with your input.csv file.

Changelog

Updated to fix issues with plot code and documentation.

The continuous mapping outputs three types of weighted output: the paper's scheme of inverting the density and using the median, minimizing expected log-loss to set eta, and minimizing log-loss directly. For typical ChIP-seq and DNase-seq data, all three output similar weights (up to scaling), but for large counts direct log-loss minimization seems more robust.

In the paper we suggested mapping continuous counts to integer counts by rounding; in the current code we minimize log-loss after rounding.

Dependencies

R version 2.15+ and the packages 'statmod', 'cobs', and 'logcondens' are used in the lognormal and logconcave fitting methods. Install them via install.packages in your local R environment if they are not already available. Several secondary dependencies require the BLAS and LAPACK libraries to compile; on Ubuntu, these can be installed by running "sudo apt-get install libblas-dev liblapack-dev".

Inputs

The code can be run without inputs (it generates its own example input.csv file), but if correcting your own count data, input.csv should be encoded so that column 1 is the frequency of seeing K counts per base and column 2 is K. This is the marginal per-base count histogram of read start positions. For complete clarity: an experiment where a single read starts at every base (and there are 1000000 bases total) would have the single line 1000000,1 in its input.csv file.
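The histogram above can be built from a vector of per-base read-start counts. A minimal sketch (illustrative Python, not part of the repo; whether zero-count bases should be included is not specified here, so this sketch simply emits whatever count values appear):

```python
from collections import Counter

# Illustrative only: write input.csv from per-base read-start counts.
# Column 1 is the frequency of seeing K counts at a base; column 2 is K.
def write_input_csv(per_base_counts, path="input.csv"):
    hist = Counter(per_base_counts)
    with open(path, "w") as f:
        for k in sorted(hist):
            f.write(f"{hist[k]},{k}\n")

# A toy experiment where exactly one read starts at each of 5 bases
# yields the single line "5,1", matching the example above.
write_input_csv([1, 1, 1, 1, 1], "input.csv")
print(open("input.csv").read())  # 5,1
```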

Outputs

Given counts as input, the code outputs a count-correction file, output.csv, mapping original counts (first column) to weights (second column), rounded weights (third column), and their truncated versions (fourth column).
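Downstream, the mapping can be used as a lookup table to replace raw per-base counts with corrected weights. A minimal sketch (illustrative Python, assuming output.csv has no header row and using the rounded weights in column 3):

```python
import csv

# Illustrative only: load the count -> rounded-weight mapping from
# output.csv (columns: count, weight, rounded weight, truncated weight).
def load_correction(path="output.csv"):
    with open(path) as f:
        return {int(row[0]): float(row[2]) for row in csv.reader(f)}

def correct(counts, mapping):
    # Counts absent from the mapping are left unchanged here; how the
    # Fixseq code itself treats unseen counts is not specified above.
    return [mapping.get(c, c) for c in counts]
```

For example, with a mapping that sends both 1 and 2 to a rounded weight of 1, `correct([0, 1, 2, 2], mapping)` collapses the repeated counts the same way de-duplication would, while the unrounded weights (column 2) retain finer gradations.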

The code also outputs llhfit.pdf, comparing the distributional fit of the negative binomial and Poisson-lognormal distributions.