diffmer: Differentially Methylated Regions by a LASSO-like Optimization Approach
This software discovers differentially methylated CpGs (DMCs) and regions of such CpGs (DMRs) in whole genome bisulfite sequencing data. It is for research use only, not for diagnostic purposes.
The software is written in Python 3.4 and requires the following additional modules: pip, numba, numpy, pandas, hdf5, h5py, matplotlib, seaborn.
We highly recommend to use the conda package manager for Python 3.4 to install and configure the necessary Python environment. This will allow you to keep your scientific projects independent from your system Python and install required packages in your home directory, so no root access is required.
conda, create a Python environment for
diffmer and activate it as follows:
conda create --name diffmer python=3.4 pip numpy numba pandas hdf5 h5py seaborn source activate diffmer
(On windows, the last command is just
You can now install additional packages in this environment with
pip according to your needs, e.g.
Snakemake for workflow management.
The next step is to install this package (
diffmer), either regularly, or in development mode (so you can edit the code to play with it).
Clone this git repository:
git clone email@example.com:genomeinformatics/diffmer.git
Make sure that the diffmer environment is still activated, change to the diffmer directory and run setup:
python setup.py build install # regular install python setup.py develop # developer install
You should now be able to run the software; try printing the basic help information:
To deactivate the Python environment, run
(Again, on Windows, this is just
To completely remove the environment, use
conda remove --name diffmer --all
This section describes how to prepare your input data, how to run diffmer and how to postprocess the output.
BED Input Format, Sample Names and Classes
Diffmer assumes that raw reads have already been mapped with a bisulfite-aware readmapper, such as BSMAP. It also assumes that the methylation status of each CpG in each read has been called by examining C/T substitutions in resulting BAM files.
Diffmer thus expects input in BED format (tab-separated columns), one line per CpG, three positional columns, and two columns per sample, as follows:
chr1 12001 12002 12 15 10 11 9 9 chr1 12004 12005 10 12 8 9 0 20
The first three columns indicate the CpG position (chromosome, start position, end position). The end position must differ from the start position by at most +1. For diffmer, it is irrelevant what these positions represent, but the distance of start positions is taken into account. In this example, we have three samples (two case samples, one control sample), so there are six columns with read counts (columns 4--9). Columns 4 and 5 provide read counts for methylated reads and total interpretable reads (methylated+unmethylated) for the first case sample.
The counts describe
- number of reads with methylated C (first number in column pair)
- number of reads with methylated or unmethylated C (C or T; second number)
The BED file has no header line; so we must describe the sample names and class memberships separately. This can be done in two ways:
--classfile classes.txtto provide a file with sample names and classes. The file must contain lines of the form
sampleis a sample name for a column in the BED file and
classmust be 0 (for the control group) or 1 (for the case group). The order of names in the file must correspond to the order of column pairs in the BED file. In the example above, there should be three rows:
--classes case_1=1 case_2=1 control_1=0to specify class assignment on the command line.
HDF5 Input Format
As an alternative to BED format, a specific HDF5 format can be used in the future. Its specification is not yet stable and for testing purposes only.
Class Methylation Level Estimation
Running the estimation procedure is simple: specify the input BED file and the sample names with class memberships (0 for control, 1 for case). Example:
diffmer estimate input.bed --classes case_1=1 case_2=1 control_1=0 > results.classlevels
This will print the estimated class methylation rates to standard output and redirect them to
--times to print progress and timing information to standard error.
Calling differentially methylated regions
The output from
diffmer estimate contains for every CpG site the computed class methylation levels for case and control.
The class methylation difference is simply the (absolute) difference of both values.
DMR detection depends on several parameters described in the paper:
--delta D: minimum methylation difference required to call a CpG a DMC (differentially methylated CpG)
--size S: minimum size (number of contained CpGs) of a DMR
--meandelta M: minimum average methylation difference of each size-S window of a DMR
--skip P: maximum number of coniguous non-DMCs in a DMR
The output is written to stdout in BED format
diffmer dmrs results.classlevels --delta 0.3 --size 4 --meandelta 0.4 --skip 0
will print output formatted like
chr1 747692 747830 4 0.225 1 0.2 0.2 0.3 0.2 chr1 2425117 2425150 4 0.2 1 0.2 0.2 0.2 0.2 chr1 4363478 4363510 8 0.2 1 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 chr1 4692177 4692198 6 0.2 1 0.2 0.2 0.2 0.2 0.2 0.2 chr1 4692333 4692400 7 0.2714 1 0.3 0.3 0.3 0.2 0.3 0.3 0.2 chr1 5619516 5619655 4 0.325 0 -0.4 -0.4 -0.3 0.2
The first three columns are required by the BED format and specify the genomic interval of the DMR (position of C in first and last contained CpG).
The fourth column specifies the size (number of CpGs) of the DMR.
The fifth column shows the average absolute methylation difference.
The sixth column displays the kind of DMR (1 for hypermethylated case samples in comparison to control, -1 for hypometylated ones, 0 for mixed).
Output can be constrained to specific kinds using the
The remaining columns (their number corresponds to the size of the DMR) list the (signed) methylation differences of each CpG.
DMRs can be visualized with
diffmer visualize. The data for all CpGs in the region plus two further positions on both sides of the region are plotted using a colorcode for the methylation levels (red: high methylation, blue: low methylation). Optionally, the computed methylation levels can be plotted, too.
diffmer visualize input.bed --classfile classfile.txt --dmrs dmrs.bed --classlevels results.classlevels
Please contact Nina Hesse with questions about the software, or post bugs in the bug tracker of this repository.