ChIP-eat: from raw reads to high-quality TFBS predictions
A complete pipeline designed to process ChIP-seq data sets from raw reads to accurate transcription factor (TF) binding site (TFBS) predictions. It is split into three main parts: aligning and filtering the raw data, calling ChIP-seq peaks, and predicting TFBSs. Users can use any or all of the three components of the pipeline. For the alignment, ChIP-eat is able to deal with both single-end and paired-end data. Starting with FASTQ files, the reads are aligned to a reference genome and several filtering steps (e.g., removing non-uniquely aligned reads and PCR duplicates) are applied to prepare the data for peak calling. Four different peak callers are supported: MACS2, HOMER, BCP, and GEM. Subsequent to the peak calling, each set of peaks serves as input to one or all of the supported TFBS prediction models: PWMs, DiMO-optimized PWMs, binding energy models, transcription factor flexible models (TFFMs), and DNAshapedTFBS models. Finally, a non-parametric, data-driven, entropy-based thresholding algorithm is applied to automatically define an enrichment zone within which the predicted TFBSs have the highest confidence.
Citation: A map of direct TF-DNA interactions in the human genome, https://doi.org/10.1093/nar/gky1210
To use ChIP-eat, download the project to your machine and follow the steps below to install the dependencies and set up the required paths.
A UNIX-based operating system (or Bash shell support on a Microsoft Windows machine) is required to run the ChIP-eat pipeline.
Note that the shell should support the sed command. If this is not the case, install it using the software package manager of your system.
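A minimal check for this, runnable in any POSIX shell:

```shell
# Quick sanity check that the shell has sed available before running the pipeline:
SED_PATH=$(command -v sed || true)
if [ -n "$SED_PATH" ]; then
  echo "sed found at: $SED_PATH"
else
  echo "sed is missing -- install it via your system's package manager" >&2
fi
```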
You can either git clone or download the compressed pipeline from this page. Make sure you have write access to the folder that will host the pipeline.
Raw data alignment tools
To perform all the raw data processing, filtering, and transformation steps, several dedicated tools are required:
Install BOWTIE 2 from here. Note that if running on a system with Intel processors, you might need to install the Intel Threading Building Blocks (TBB) libraries before installing BOWTIE 2; this can be done either via the software package manager or from here.
Make sure you also create an index file with BOWTIE 2 for the genome version of interest. This can be done by following this tutorial.
Important: the same genome version should be used throughout the ChIP-eat pipeline!
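Building such an index can be sketched as follows; the FASTA path and index prefix below are hypothetical placeholders for your own files:

```shell
# Sketch: build a BOWTIE 2 index for the reference genome.
# Both paths are hypothetical -- point them at your own files.
GENOME_FA=/data/genomes/GRCh38.fa
INDEX_PREFIX=/data/genomes/GRCh38_index
BUILD_CMD="bowtie2-build $GENOME_FA $INDEX_PREFIX"
echo "$BUILD_CMD"
# Only run the command when the tool and the FASTA are actually present:
if command -v bowtie2-build >/dev/null 2>&1 && [ -f "$GENOME_FA" ]; then
  $BUILD_CMD
fi
```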
Install SAMTOOLS from here
Besides the genome index file created by BOWTIE 2, you will need a file containing the chromosome sizes and a whole-genome FASTA file for the reference genome. As an example, for Homo sapiens (GRCh38 assembly), these files can be retrieved from here.
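One common way to produce a chromosome-sizes file is from a FASTA index created by samtools faidx; the sketch below illustrates the extraction step on a two-line toy .fai (the chr1/chr2 lengths match GRCh38, the other columns are made up):

```shell
# In practice the .fai index comes from: samtools faidx /path/to/genome.fa
# A toy .fai illustrates the format (columns: name, length, offset, linebases, linewidth):
printf 'chr1\t248956422\t112\t70\t71\nchr2\t242193529\t252513167\t70\t71\n' > demo.fai
# The first two columns are exactly <chromosome>\t<size>:
cut -f1,2 demo.fai > chrom.sizes
cat chrom.sizes
```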
Peak calling tools
To call peaks on the resulting aligned and filtered data, or on your own input files, install one of the four supported peak callers: MACS2, HOMER, BCP, or GEM. We recommend the first two, as we found them to be the most stable. We tested several peak callers because of the different statistical tests they implement for signal detection and the varying quality/resolution of ChIP-seq data sets. Note that this pipeline was developed and mainly tested using MACS v2.1.1. You can install the peak callers in a location of your choice or inside the src folder of the pipeline; just make sure that the paths to them are correctly set inside the bin/config.sh file. The BCP and HOMER peak callers are available in the src/zips folder of the pipeline. The other two were not included due to their size.
Important: please refer to the installation notes for each of the peak callers, in order to install their required dependencies.
Peak processing tools
This is the most demanding part of the pipeline regarding dependencies, due to the different programming languages used in the development of the prediction models. Please make sure you use the specified versions (where given) and install all dependent libraries.
Important: you should have BEDTOOLS 2 installed and the chromosome sizes and reference genome files available (see the raw data alignment part of this README). If the reference genome is not indexed, BEDTOOLS 2 will automatically generate the index file.
The Java Development Kit
The latest version of the Java Development Kit (JDK) can be installed from here, depending on your Unix distribution. Make sure that the JAVA_HOME path is set and that java is added to your PATH. If JAVA_HOME is not set and you do not know where java is installed, locate the java binary on your command line, export JAVA_HOME to point to that location, and add it to the PATH.
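The check and export steps above can be sketched as follows (the JDK location in the comments is a hypothetical example):

```shell
# Check the current Java setup:
JH_MSG="JAVA_HOME is: ${JAVA_HOME:-<not set>}"
echo "$JH_MSG"
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
fi
# If JAVA_HOME is unset, locate the binary and export the paths
# (the JDK directory below is a hypothetical example, adjust to your system):
#   readlink -f "$(command -v java)"
#   export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
#   export PATH="$JAVA_HOME/bin:$PATH"
```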
R statistical environment
You can install the latest version of the R statistical environment from here, or install r-base via the software package manager of your system. The minimum required R version is 2.5.0.
The rJava package is also required. It can be downloaded from here.
Important: if you install it as described below, make sure that the rJava package is installed in the src folder of the ChIP-eat pipeline.
For your convenience, version 0.9-9 of rJava is provided under src/zips. From within the src folder, install it via the R environment with the command:
install.packages("zips/rJava_0.9-9.tar.gz", lib = ".", repos = NULL, type = "source")
Note that the installation may fail if the JDK is not properly configured in R, or if you try to install the pipeline in a virtual machine. If so, try to reconfigure with sudo R CMD javareconf (or R CMD javareconf -e if you do not have sudo privileges) and try again to install rJava. If it still fails, try installing the r-cran-rjava package via the software package manager of your system.
The DiMO package contains one of the prediction models used in this pipeline. The latest version can be downloaded from here. For ChIP-eat, we have used DiMO v1.6; this version is provided under src/zips. From within the src folder, install it via the R environment with the command:
install.packages("zips/DiMO_v1.6.tar.gz", lib = ".", repos = NULL, type = "source")
Important: as for the rJava package, if you install it with the above command, make sure that DiMO is installed in the src folder of the pipeline.
The MASS package is also required, but it should already be available among the R base packages. To make sure it is installed, run
library("MASS")
in the R environment and see if the package loads successfully. If not, install it via:
install.packages("MASS")
and select a mirror/repository.
For three out of the five implemented prediction models, the Python programming language is required. Importantly, version 2.7 is used across the pipeline, because the dependent libraries were developed under this version; throughout the ChIP-eat scripts, python2.7 is invoked explicitly. You can install this version from here. After installation, there are several required libraries that have to be installed.
Python should be pre-installed on your Linux distribution; check it by typing:
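A small sketch for checking which interpreters are available on the PATH (the helper function is illustrative, not part of the pipeline):

```shell
# Print each requested interpreter found on the PATH
# (the pipeline invokes python2.7 explicitly):
list_interpreters() {
  for py in "$@"; do
    if command -v "$py" >/dev/null 2>&1; then
      echo "$py -> $(command -v "$py")"
    fi
  done
}
list_interpreters python2.7 python3 python
```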
Note that if you want to make use only of the simple PWM prediction model, you do not need to install the Python libraries below. To ease the library installation process, please make sure you have pip installed. If not, you can install it via your software package manager, following the tutorial here. All the libraries below should be installed and their location added to the PYTHONPATH.
Important: check whether the location of the installed libraries is added to the PYTHONPATH with echo $PYTHONPATH, and add them to it if necessary.
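For example (the extra library location below is hypothetical):

```shell
# Inspect the current Python library search path:
echo "PYTHONPATH=${PYTHONPATH:-<empty>}"
# Prepend an extra library location if needed (the path is hypothetical):
export PYTHONPATH="/opt/chip-eat-libs${PYTHONPATH:+:$PYTHONPATH}"
echo "PYTHONPATH=$PYTHONPATH"
```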
TFBS prediction models
It is encouraged to have the prediction models inside the src folder of the pipeline. The TFFM and DNAshapedTFBS models are already present in the src/zips folder of the ChIP-eat pipeline; you will just have to unzip them in src. If you want to have a different version of them, you can find TFFM here and DNAshapedTFBS here. There is no installation per se for these two models, just decompress the files. If they are not stored in the src folder, update the bin/config.sh file accordingly.
The DiMO prediction model should already be installed as an R package from the previous steps.
The implementations of the PWM and binding energy model (BEM) are already present in the bin/utils folder of the pipeline.
If you will make use of prediction models that involve machine learning (e.g., DiMO, TFFM, or DNAshapedTFBS), you should have the BiasAway tool available. This tool generates the background files for training the algorithms, matching the GC composition of the foreground files. The tool is available here, and it is also provided in the src/zips folder. It is recommended to unzip it in the src folder. If you change the location of BiasAway, please make sure that the entries referring to BiasAway in the bin/config.sh file are modified accordingly.
BiasAway makes use of the whole-genome file: the first time a machine learning-based prediction model is run, it generates a repository of short (101bp) sequences grouped by their GC composition. The whole-genome file is split into these chunks and output by default in the data/mappable_genome folder of the pipeline; the background repository is output in the data/BiasAway_repository folder. This process is done only once, but it will take several hours to complete, depending on your machine. It will speed up the subsequent generation of background training sets.
If you want to make use of the DNAshapedTFBS prediction module, then you will have to download the files that contain the DNA shape features used by the prediction model: helix twist, minor groove width, propeller twist, and roll. These features are available through the GBshape database, available here through their FTP server. You should download (at least) the files corresponding to the four features enumerated, for the first order and second order.
Important: make sure that the location of the GBshape data is set in the PATH_TO_GB_SHAPE parameter in the bin/config.sh file, as well as the names of the files corresponding to the DNA features in the associated parameters. The GBshape data is required to be from the same genome version (e.g., hg38) as the reference genome file.
The only file you need to modify according to your pipeline and dependency installation is the bin/config.sh file. For the rest, downloading the entire project and maintaining its folder structure should suffice. If this is the case, and you have also made use of the provided files under src/zips, the set of parameters under INTERNAL TO THE PIPELINE in the bin/config.sh file should not require any modification. Just make sure that the DEPENDENCIES LOCATION part of the configuration file reflects your system configuration.
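As a sanity check, you could source the configuration and verify that the configured locations exist. The helper below is a sketch; PATH_TO_GB_SHAPE is the only parameter name taken from this README, so check the other parameters defined in bin/config.sh the same way:

```shell
# Report a configured path that does not exist on disk:
check_path() {
  if [ -n "$1" ] && [ ! -e "$1" ]; then
    echo "Missing path: $1"
  fi
}
# Source the pipeline configuration if it is present:
if [ -f bin/config.sh ]; then
  . bin/config.sh
fi
# PATH_TO_GB_SHAPE is one parameter from bin/config.sh; repeat for the others.
check_path "${PATH_TO_GB_SHAPE:-}"
```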
Running the ChIP-eat pipeline
Alignment and filtering step
The first part of the processing pipeline is the alignment of the raw reads to the reference genome. This can be achieved as follows: download one or several ChIP-seq data sets from either the ENCODE project or from GEO. For reliable results, make sure that the data sets also contain control files besides the treatment files. It is assumed that the data sets consist of gzipped FASTQ files for both control and treatment.
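A hypothetical example of such an input layout (the file names are illustrative only, not a naming convention required by the pipeline):

```shell
# Sketch of a data set directory with gzipped FASTQ files
# for both treatment replicates and a control:
mkdir -p my_dataset
touch my_dataset/treatment_rep1.fastq.gz \
      my_dataset/treatment_rep2.fastq.gz \
      my_dataset/control_rep1.fastq.gz
ls my_dataset
```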
Once you have obtained the ChIP-seq data sets, run the alignment part of the pipeline:
bash alignment_pipeline.sh /path/to/my/data/set
The folder structure is maintained for the output. If you wish to launch the alignment on multiple datasets in parallel, you simply run:
bash launch_alignment.sh /path/to/my/data/sets 20
Here, 20 represents the number of processors to allocate for this task.
Note that you can always use launch_alignment.sh (recommended) with only 1 processor as the parameter!
Important: depending on the file size and your system configuration, the alignment and filtering steps can take several hours, even days. Log files are exported for each data set to check the percentage of aligned reads and other relevant information. Moreover, make sure you have write access to the output folder.
The peak calling step
You can call ChIP-seq peaks either on the "freshly" aligned files or on your own input files. Note that it depends on the peak caller whether it accepts the BAM format, the BED format, or both. It is also peak-caller dependent whether you can input only replicates or whether you need control files as well. For instance, MACS2 accepts both formats and can deal with data sets that have no control files. As an overall advice, it is always recommended to use the control files when available.
Important: if multiple replicates or control files are present, all will be concatenated and used in the peak calling.
To run the peak calling part of the pipeline:
bash peak_calling.sh /path/to/my/input/files /path/to/the/output/folder PEAK_CALLER
or, for parallelized processes:
bash launch_peak_calling.sh /path/to/my/input/data/sets /path/to/the/output/folder HOMER 20
Supported values for the PEAK_CALLER parameter: MACS, HOMER, BCP, GEM
Note that, as for the alignment, you can always use launch_peak_calling.sh (recommended) with only 1 processor as the parameter!
More information about the individual steps in this part of the pipeline and the parameter settings used can be found in the workflow diagram located in the doc folder of the project.
The peak processing step
The resulting ChIP-seq peaks can be used for high-confidence TFBS predictions using one of the five different models implemented in the ChIP-eat pipeline. These models range from simple to complex, and their level of improvement over the predictions can be influenced by the ChIP'ed protein. For instance, using the DNAshapedTFBS model can have a higher impact on the predictions for certain TF families than for others; refer to the relevant literature.
Important: for the prediction models that use machine learning approaches, background training sets will be automatically generated via the BiasAway tool for the current data set, in order to match the GC% of the foreground training sets. When running the ChIP-eat pipeline for the first time with such a prediction model (e.g., DiMO, TFFM, or DNA), BiasAway is invoked and checks whether a background repository was already generated. If not, it generates it (refer to the BiasAway section of this file) and subsequently generates the training files, storing them in a training_files folder inside the results folder of the data set. These files are generated only once and reused by the other prediction models if invoked.
The predictions are based on reference motifs for each TF. A TF mapping file can be provided (see the example file TF_mapping.tsv in bin/utils) specifying, for each data set that is to be processed, which TF was targeted and to which position frequency matrix (PFM) it maps. Alternatively, you can specify your own PFM as shown in the examples below.
Important: for convenience, folders containing all the JASPAR 2018 TF profiles as PFMs, as position weight matrices (PWMs), and as PFMs in MEME format are provided in the data folder of the pipeline. If you want to use your own profiles for different TFs and launch processes in parallel, make sure you change the entries in the bin/config.sh file accordingly, as well as in TF_mapping.tsv. This file will be automatically sorted (once) to save computation time when searching for the associated PFM across large numbers of data sets. The PFM and PWM files must contain only numeric characters.
Note that in the mapping file you should respect the order of the columns; the column names are not taken into consideration.
To process the ChIP-seq peaks, use either:
bash peak_processing.sh /path/to/my/input/folder /path/to/my/output/folder/ PEAK_CALLER MODEL [my_pfm] [my_tf_name] [window_size] [training_window_size]
bash launch_peak_processing.sh /my/input/folder /my/output/folder/ PEAK_CALLER MODEL NB_PROC [my_pfm] [my_tf_name] [window_size] [training_window_size]
PEAK_CALLER is the peak caller used on the files (one of MACS, HOMER, BCP, GEM), and MODEL is the prediction model to use: one of position weight matrices (PWM), binding energy models (BEM), optimized PWMs (DiMO), transcription factor flexible models (TFFM), and DNAshaped TFBS (DNA). You can also use ALL, in which case all supported prediction models will be run in the above order.
The arguments between brackets are optional. You can use a custom PFM [my_pfm] if you like, and a custom TF name [my_tf_name] that will be part of the output file names and appear inside the produced files. Importantly, the TF name is by default parsed from the name of the data set passed as input. For instance, if the data set MY_DATASET_IS_COOL is passed, the name is split by _ and the last part, COOL, is considered the TF. Similarly, MY_DATA_SET will result in TF name SET, and CHIP-SEQ_JUND will result in TF name JUND.
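The default parsing rule described above can be sketched in plain shell:

```shell
# Sketch of the default TF-name parsing: keep everything after the last '_'.
dataset="CHIP-SEQ_JUND"
tf_name="${dataset##*_}"   # shell suffix expansion: strip up to the last underscore
echo "$tf_name"            # JUND
```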
[window_size] is the number of base pairs added around the peak summit, upstream and downstream; this region is considered the prediction window. The default is 500bp, so the length of the sequences used for prediction and scoring is 1001bp (500bp upstream, 1bp peak summit, 500bp downstream). [training_window_size] is similar to [window_size] but applies to the training data sets; a default value is used if it is not provided.
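The resulting sequence length follows directly from the window size:

```shell
# Worked example: window_size bp on each side of the 1bp peak summit.
window_size=500
seq_length=$((2 * window_size + 1))
echo "$seq_length"   # 1001
```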
bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder MACS PWM 30
bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder MACS TFFM 10
bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder HOMER ALL 20
bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder MACS PWM 20 [MY_PFM] [TF_NAME] [PREDICTION_WINDOW] [TRAINING_WINDOW]
Note: in order to have the DNAshaped TFBS model applied with TFFM, you need to have run the TFFM model first, so that the .xml file with the model parameters is present.
Three sets of ChIP-seq peaks obtained with
MACS2 are provided as test data in the
test_data folder of the pipeline.
In the results folder passed as a parameter, a subfolder for each data set will be created. Within this folder, for every model invoked, a subfolder with the name of the model is created. Each model will generate the following files, with file names in this format:
<PEAK_CALLER>_<TF>_<PFM_ID>.score - the sequences along with their assigned scores
<PEAK_CALLER>_<TF>_<PFM_ID>.score.thr - the thresholds on motif score and distance defining the enrichment zone, as well as the centrality p-value
<PEAK_CALLER>_<TF>_<PFM_ID>.ez.png - a visual representation of the ChIP-seq peaks and the thresholding result
<PEAK_CALLER>_<TF>_<PFM_ID>.tfbs.bed - the TFBSs predicted inside the enrichment zone, in BED6 format
- Marius Gheorghe - Development and design - MariusGh
- Anthony Mathelier - Design and tutoring - amathelier
This project is licensed under the GNU General Public License
- Hat tip to anyone whose bits and pieces of code were used
If you make use of the ChIP-eat pipeline, please cite: A map of direct TF-DNA interactions in the human genome, https://doi.org/10.1093/nar/gky1210