HTTPS SSH

diffmer: Differentially Methylated Regions by a LASSO-like Optimization Approach

This software discovers differentially methylated CpGs (DMCs) and regions of such CpGs (DMRs) in whole genome bisulfite sequencing data. It is for research use only, not for diagnostic purposes.

The software is written in Python 3.4 and requires the following additional modules: pip, numba, numpy, pandas, hdf5, h5py, matplotlib, seaborn.

Installation

We highly recommend to use the conda package manager for Python 3.4 to install and configure the necessary Python environment. This will allow you to keep your scientific projects independent from your system Python and install required packages in your home directory, so no root access is required.

After installing conda, create a Python environment for diffmer and activate it as follows:

conda create --name diffmer  python=3.4  pip numpy numba pandas hdf5 h5py seaborn
source activate diffmer

(On windows, the last command is just activate diffmer.)

You can now install additional packages in this environment with conda or pip according to your needs, e.g. Snakemake for workflow management.

The next step is to install this package (diffmer), either regularly, or in development mode (so you can edit the code to play with it).

Clone this git repository:

git clone git@bitbucket.org:genomeinformatics/diffmer.git

Make sure that the diffmer environment is still activated, change to the diffmer directory and run setup:

python setup.py build install  # regular install
python setup.py develop        # developer install

You should now be able to run the software; try printing the basic help information:

diffmer --help

To deactivate the Python environment, run

source deactivate

(Again, on Windows, this is just deactivate.)

To completely remove the environment, use

conda remove --name diffmer --all

Basic Usage

This section describes how to prepare your input data, how to run diffmer and how to postprocess the output.

BED Input Format, Sample Names and Classes

Diffmer assumes that raw reads have already been mapped with a bisulfite-aware readmapper, such as BSMAP. It also assumes that the methylation status of each CpG in each read has been called by examining C/T substitutions in resulting BAM files.

Diffmer thus expects input in BED format (tab-separated columns), one line per CpG, three positional columns, and two columns per sample, as follows:

chr1 12001 12002  12 15   10 11   9 9
chr1 12004 12005  10 12   8  9    0 20

The first three columns indicate the CpG position (chromosome, start position, end position). The end position must differ from the start position by at most +1. For diffmer, it is irrelevant what these positions represent, but the distance of start positions is taken into account. In this example, we have three samples (two case samples, one control sample), so there are six columns with read counts (columns 4--9). Columns 4 and 5 provide read counts for methylated reads and total interpretable reads (methylated+unmethylated) for the first case sample.

The counts describe

  • number of reads with methylated C (first number in column pair)
  • number of reads with methylated or unmethylated C (C or T; second number)

The BED file has no header line; so we must describe the sample names and class memberships separately. This can be done in two ways:

  1. Use option --classfile classes.txt to provide a file with sample names and classes. The file must contain lines of the form sample=class, where sample is a sample name for a column in the BED file and class must be 0 (for the control group) or 1 (for the case group). The order of names in the file must correspond to the order of column pairs in the BED file. In the example above, there should be three rows: case_1=1, case_2=1, control_1=0.

  2. Use option --classes case_1=1 case_2=1 control_1=0 to specify class assignment on the command line.

HDF5 Input Format

As an alternative to BED format, a specific HDF5 format can be used in the future. Its specification is not yet stable and for testing purposes only.

Class Methylation Level Estimation

Running the estimation procedure is simple: specify the input BED file and the sample names with class memberships (0 for control, 1 for case). Example:

diffmer estimate  input.bed  --classes case_1=1 case_2=1 control_1=0 > results.classlevels

This will print the estimated class methylation rates to standard output and redirect them to results.classlevels.

Use option --times to print progress and timing information to standard error.

Calling differentially methylated regions

The output from diffmer estimate contains for every CpG site the computed class methylation levels for case and control. The class methylation difference is simply the (absolute) difference of both values.

DMR detection depends on several parameters described in the paper:

  • --delta D: minimum methylation difference required to call a CpG a DMC (differentially methylated CpG)
  • --size S: minimum size (number of contained CpGs) of a DMR
  • --meandelta M: minimum average methylation difference of each size-S window of a DMR
  • --skip P: maximum number of coniguous non-DMCs in a DMR

The output is written to stdout in BED format

Example:

diffmer dmrs  results.classlevels  --delta 0.3 --size 4 --meandelta 0.4 --skip 0

will print output formatted like

chr1    747692  747830   4       0.225   1       0.2     0.2     0.3     0.2
chr1    2425117 2425150  4       0.2     1       0.2     0.2     0.2     0.2
chr1    4363478 4363510  8       0.2     1       0.2     0.2     0.2     0.2     0.2     0.2     0.2     0.2
chr1    4692177 4692198  6       0.2     1       0.2     0.2     0.2     0.2     0.2     0.2
chr1    4692333 4692400  7       0.2714  1       0.3     0.3     0.3     0.2     0.3     0.3     0.2
chr1    5619516 5619655  4       0.325   0       -0.4    -0.4    -0.3    0.2

The first three columns are required by the BED format and specify the genomic interval of the DMR (position of C in first and last contained CpG). The fourth column specifies the size (number of CpGs) of the DMR. The fifth column shows the average absolute methylation difference. The sixth column displays the kind of DMR (1 for hypermethylated case samples in comparison to control, -1 for hypometylated ones, 0 for mixed). Output can be constrained to specific kinds using the --kind option. The remaining columns (their number corresponds to the size of the DMR) list the (signed) methylation differences of each CpG.

Visualizing Regions

DMRs can be visualized with diffmer visualize. The data for all CpGs in the region plus two further positions on both sides of the region are plotted using a colorcode for the methylation levels (red: high methylation, blue: low methylation). Optionally, the computed methylation levels can be plotted, too.

Example:

diffmer visualize input.bed --classfile classfile.txt --dmrs dmrs.bed --classlevels results.classlevels

Contact

Please contact Nina Hesse with questions about the software, or post bugs in the bug tracker of this repository.