
ChIP-eat: from raw reads to high quality TFBS predictions

A complete pipeline designed to process ChIP-seq data sets from raw reads to accurate transcription factor (TF) binding site (TFBS) predictions. It is split into three main parts: aligning and filtering the raw data, calling ChIP-seq peaks, and predicting TFBSs. Users can use any or all of the three components of the pipeline. For the alignment, ChIP-eat can handle both single-end and paired-end data. Starting with FASTQ files, the reads are aligned to a reference genome and several filtering steps (e.g., removing non-uniquely aligned reads and PCR duplicates) are applied to prepare the data for peak calling. Four different peak callers are supported: MACS2, HOMER, BCP, and GEM. Subsequent to the peak calling, each set of peaks serves as input to one or all of the supported TFBS prediction models: PWMs (DiMO-optimized), binding energy models, transcription factor flexible models, and DNAshaped TFBSs. Finally, a non-parametric, data-driven, entropy-based thresholding algorithm is applied to automatically define an enrichment zone within which the predicted TFBSs have the highest confidence.

Citation: A map of direct TF-DNA interactions in the human genome, doi:https://doi.org/10.1093/nar/gky1210

Getting Started

To be able to use ChIP-eat, you need to download the project on your machine and follow the steps below to install and set up the needed paths to the software dependencies.

Prerequisites

A UNIX-based operating system is required to run the ChIP-eat pipeline, or a Bash Unix shell on a Microsoft Windows machine. Note that the shell should support the git, gzip, tr, awk, and sed commands. If this is not the case, install them using the software package manager of your system.
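As a quick sanity check, the snippet below (a sketch, not part of the pipeline itself) verifies that these commands are available before you start:

```shell
# Report any of the shell commands required by ChIP-eat that are missing
for cmd in git gzip tr awk sed; do
    command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done
```

If nothing is printed, all required commands were found.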

Downloading ChIP-eat

You can either git clone the repository or download the compressed pipeline from this page. Make sure you have write access to the folder in which you are going to host the pipeline.

Raw data alignment tools

In order to perform all the raw data processing, filtering and transformation steps, several dedicated sets of tools are required:

Install BOWTIE 2 from here. Note that if you are running on a system with Intel processors, you might need to install the Intel Threading Building Blocks (TBB) libraries before installing BOWTIE 2. This can be done either via the software package manager or from here

Also make sure you create an index file with BOWTIE 2 for the genome version of interest. This can be done by following this tutorial.

Important: the same genome version should be used throughout the ChIP-eat pipeline!

Install SAMTOOLS from here

Install BEDTOOLS 2 by downloading it from here and following the installation process here

Besides the genome index file created by BOWTIE 2, you will need a file containing the chromosome sizes and a whole-genome FASTA file for the reference genome. As an example, for Homo sapiens GRCh38, these files can be retrieved from here
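If a ready-made chromosome-sizes file is not available for your assembly, one way to derive it (a sketch using awk; this is not one of the pipeline's own scripts) is to compute the sequence lengths directly from the reference FASTA:

```shell
# Derive a tab-separated "chromosome<TAB>size" file from a FASTA file.
# A toy two-chromosome FASTA is used here for illustration.
printf '>chr1\nACGTACG\n>chr2\nAC\n' > toy_genome.fa

awk '/^>/ { if (name) print name "\t" len; name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name) print name "\t" len }' toy_genome.fa > toy_genome.chrom.sizes

cat toy_genome.chrom.sizes
```

For a real genome, replace the toy FASTA with your reference genome file.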

Peak calling tools

To call peaks on the resulting aligned and filtered data, or on your own input files, install one of the four supported peak callers: MACS2, HOMER, BCP, or GEM. We recommend the first two, as we found them to be the most stable. We tested several peak callers because of the different statistical tests they implement for signal detection and the varying quality/resolution of ChIP-seq data sets. Note that this pipeline was developed and mainly tested using MACS v2.1.1. You can install the peak callers in a location of your choice or inside the src folder of the pipeline; just make sure that the paths to them are correctly set inside the bin/config.sh file. The BCP and HOMER peak callers are available in the src/zips folder of the pipeline. The other two were not included due to their size.

Important: please refer to the installation notes for each of the peak callers, in order to install their required dependencies.

Peak processing tools

This is the most demanding part of the pipeline in terms of dependencies, due to the different programming languages used to develop the prediction models. Please make sure you use the specified versions (where given) and that you install all dependent libraries.

Important: you should have BEDTOOLS 2 installed and the chromosome sizes and the reference genome files available (see the raw data alignment part of the README). If the reference genome is not indexed, BEDTOOLS 2 will automatically generate the index file.

The Java Development Kit

The latest version of the Java Development Kit (JDK) can be installed from here, depending on your Unix distribution. Make sure that the JAVA_HOME path is set and that java is added to your PATH. You can check it via:

echo $JAVA_HOME

If it is not set and you do not know where java is installed, you can do:

whereis java

on your command line and export the path via:

export JAVA_HOME="/path/to/my/java/installation"

and add it to the PATH via:

PATH=$PATH:$JAVA_HOME/bin

R statistical environment

You can install the latest version of the R statistical environment from here, or install r-base via the software package manager of your system. The minimum required R version is 2.5.0.

rJava package

The rJava package is also required. It can be downloaded from here.

Important: if you install it as described below, make sure that the rJava package is installed in the src folder of the ChIP-eat pipeline.

For your convenience, version 0.9-9 is provided under src/zips. From within the src folder, install it via the R environment with the command:

install.packages("zips/rJava_0.9-9.tar.gz", lib = ".", repos = NULL, type = "source")

Note that the installation may fail if the JDK is not properly configured in R or if you try to install the pipeline in a virtual machine. If so, try to reconfigure java in R via sudo R CMD javareconf or R CMD javareconf -e if you do not have sudo privileges, and try again to install rJava. If it still fails, then try installing the r-cran-rjava package via the software package manager of your system.

DiMO package

The DiMO package contains one of the prediction models used in this pipeline. The latest version of it can be downloaded from here.

For ChIP-eat, we used DiMO v1.6. This version is provided under src/zips. From within the src folder, install it via the R environment with the command:

install.packages("zips/DiMO_v1.6.tar.gz", lib = ".", repos = NULL, type = "source")

Important: as with the rJava package, if you install it with the above command, make sure that DiMO is installed in the src folder of the pipeline.

MASS package

The MASS package is also required, but it should already be available among the R base packages. To make sure it is installed, run library("MASS") in the R environment and check whether the package loads successfully. If not, install it via:

install.packages("MASS")

and select a mirror/repository.

Python

Three out of the five implemented prediction models require the Python programming language. Importantly, version 2.7 is used across the pipeline, because the dependent libraries were developed under this version. Throughout the ChIP-eat scripts, python2.7 is invoked explicitly. You can install this version from here. After installation, several required libraries have to be installed.

Note that Python may already be pre-installed on your Linux distribution. Check it by typing:

whereis python

Note that if you only want to use the simple PWM prediction model, you do not need to install the Python libraries.

Dependent libraries:

To ease the library installation process, please make sure you have pip installed. If not, you can install it via your software package manager, following the tutorial here. All the libraries below should be installed and their location added to the PYTHONPATH:

Biopython

ghmm

joblib

matplotlib

numpy

pandas

scikit-learn

seaborn

xgboost

Important: check whether the location of the installed libraries is added to the PYTHONPATH via echo $PYTHONPATH, and add it if necessary via PYTHONPATH=$PYTHONPATH:/path/to/my/library/or/module
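Assuming pip is configured for Python 2.7 and that the PyPI package names below are correct (an assumption; ghmm in particular is typically not available on PyPI and has to be built from source), the remaining libraries could be installed with:

```shell
# Sketch: install the Python 2.7 libraries required by the prediction models.
# You may need to pin older versions that still support Python 2.7.
python2.7 -m pip install --user biopython joblib matplotlib numpy pandas \
    scikit-learn seaborn xgboost
```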

TFBS prediction models

It is encouraged to keep the prediction models inside the src folder of the pipeline. The TFFM and DNAshapedTFBS models are already present in the src/zips folder of the ChIP-eat pipeline; you will just have to unzip them in src. If you want a different version of them, you can find TFFM here and DNAshapedTFBS here. There is no installation per se for these two models, just decompress the files. If they are not stored in the src folder, update the bin/config.sh file accordingly.

The DiMO prediction model should already be installed via the R package in the previous steps. The implementations of the PWM and binding energy model (BEM) are already present in the bin/utils folder of the pipeline.

BiasAway

If you will make use of prediction models that involve machine learning (e.g., DiMO, TFFM, or DNAshapedTFBS), you should have the BiasAway tool available. This tool generates the background files for training the algorithms, matching the GC composition of the foreground files. The tool is available here, and it is also provided in the src/zips folder. It is recommended to unzip it in the src folder. If you change the location of BiasAway, please make sure that the entries referring to BiasAway in the bin/config.sh file are modified accordingly.
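As an illustration of the quantity that is being matched between foreground and background (this snippet is ours, not part of BiasAway), the GC composition of a sequence can be computed as:

```shell
# Compute the GC fraction of a DNA sequence with awk:
# count G/C characters (gsub returns the number of substitutions)
# and divide by the original sequence length
printf 'ACGTGGCC\n' | awk '{ n = length($0); gc = gsub(/[GCgc]/, ""); printf "%.2f\n", gc / n }'
# prints 0.75 (6 G/C bases out of 8)
```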

Important: BiasAway makes use of the whole-genome file and, the first time a machine-learning-based prediction model is run, generates a repository of short (101bp) sequences grouped by their GC composition. The whole-genome file is split into these chunks and output by default in the data/mappable_genome folder of the pipeline. The background repository is output in the data/BiasAway_repository folder of the pipeline. This process is done only once, but it can take several hours to complete, depending on your machine. It will speed up the subsequent generation of background training sets.

GBshape

If you want to make use of the DNAshapedTFBS prediction module, you will have to download the files that contain the DNA shape features used by the prediction model: helix twist, minor groove width, propeller twist, and roll. These features are available through the GBshape database, accessible here through their FTP server. You should download (at least) the files corresponding to the four features listed, for both the first and second order.

Important: make sure that the location of the GBshape data is set in the PATH_TO_GB_SHAPE parameter in the bin/config.sh file, as well as the names of the files corresponding to the DNA shape features in the associated parameters (e.g., HELT=$PATH_TO_GB_SHAPE"/hg38.HelT.wig.bw" and HELT2=$PATH_TO_GB_SHAPE"/hg38.HelT.2nd.wig.bw"). The GBshape data must come from the same genome version (e.g., hg38) as the reference genome file.
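Assuming the file-name pattern shown above carries over to the other shape features (an assumption; verify against the actual names of the files you download and the parameter names in your bin/config.sh), the corresponding configuration entries might look like:

```shell
# Hypothetical excerpt of bin/config.sh for hg38 GBshape files.
# Only the HelT names follow the pattern confirmed above; the
# MGW/ProT/Roll file and parameter names are assumed.
PATH_TO_GB_SHAPE="/path/to/GBshape/data"
HELT=$PATH_TO_GB_SHAPE"/hg38.HelT.wig.bw"
HELT2=$PATH_TO_GB_SHAPE"/hg38.HelT.2nd.wig.bw"
MGW=$PATH_TO_GB_SHAPE"/hg38.MGW.wig.bw"
MGW2=$PATH_TO_GB_SHAPE"/hg38.MGW.2nd.wig.bw"
PROT=$PATH_TO_GB_SHAPE"/hg38.ProT.wig.bw"
PROT2=$PATH_TO_GB_SHAPE"/hg38.ProT.2nd.wig.bw"
ROLL=$PATH_TO_GB_SHAPE"/hg38.Roll.wig.bw"
ROLL2=$PATH_TO_GB_SHAPE"/hg38.Roll.2nd.wig.bw"
```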

Configuring ChIP-eat

The only file you need to modify, according to your pipeline and dependency installation, is the bin/config.sh file. Otherwise, downloading the entire project and maintaining its folder structure should suffice. If this is the case, and you have also made use of the provided files under src/zips, the set of parameters under INTERNAL TO THE PIPELINE in the bin/config.sh file should not require any modification.

Just make sure that the DEPENDENCIES LOCATION part of the configuration file reflects your system configuration.

Running the ChIP-eat pipeline

Alignment and filtering step

The first part of the processing pipeline is the alignment of the raw reads to the reference genome. This can be achieved as follows: download one or several ChIP-seq data sets from either the ENCODE project or from GEO. For reliable results, make sure that the data sets also contain control files besides the treatment or input files. It is assumed that the data sets consist of gzipped FASTQ files for both control and treatment. Example of an input file:

ENCFF000QQJ.fastq.gz

Once you have obtained the ChIP-seq data sets, run the alignment part of the pipeline:

cd /path/to/ChIP-eat/bin/alignment
bash alignment_pipeline.sh /path/to/my/data/set

The folder structure is maintained for the output. If you wish to launch the alignment on multiple data sets in parallel, simply run:

bash launch_alignment.sh /path/to/my/data/sets 20

where 20 represents the number of processors to allocate for this task.

Note that you can always use launch_alignment.sh (recommended) with only 1 processor as a parameter!

Important: depending on the file sizes and your system configuration, the alignment and filtering steps can take several hours, or even days. Log files are exported for each data set so you can check the percentage of aligned reads and other relevant information. Moreover, make sure you have write access to the output folder.

The peak calling step

You can call ChIP-seq peaks either on the freshly aligned files or on your own input BAM or BED files. Note that whether BAM, BED, or both formats are accepted depends on the peak caller. It is also peak-caller dependent whether you can input only replicates or need control files as well. For instance, MACS2 accepts both formats and can deal with data sets that have no control files. As general advice, it is always recommended to use the control files when available.

Important: if multiple replicates or control files are present, all will be concatenated and used in the peak calling.

To run the peak calling part of the pipeline:

cd /path/to/ChIP-eat/bin/peak_calling
bash peak_calling.sh /path/to/my/input/files /path/to/the/output/folder PEAK_CALLER

or, for parallelized processes:

bash launch_peak_calling.sh /path/to/my/input/data/sets /path/to/the/output/folder HOMER 20

Supported values for the PEAK_CALLER parameter: MACS, HOMER, BCP, GEM

Note that, as for the alignment, you can always use launch_peak_calling.sh (recommended) with only 1 processor as a parameter!

More information about the individual steps in this part of the pipeline and the parameter settings used can be found in the workflow diagram located in the doc folder of the project.

The peak processing step

The resulting ChIP-seq peaks can be used for high-confidence TFBS prediction using one of the five models implemented in the ChIP-eat pipeline. These models vary from simple to complex, and how much they improve the predictions can depend on the ChIP'ed protein. For instance, using the DNAshaped TFBS model can have a higher impact on the predictions for certain TF families than for others. Refer to the relevant literature.

Important: for the prediction models that use machine learning approaches, background training sets are automatically generated via the BiasAway tool for the current data set, in order to match the GC% of the foreground training sets. When running the ChIP-eat pipeline for the first time with such a prediction model (e.g., DiMO, TFFM, or DNA), BiasAway is invoked and checks whether a background repository was already generated. If not, it generates the repository (refer to the BiasAway section of this file) and subsequently generates the training files, storing them in a training_files folder inside the results folder of the data set. These files are generated only once and reused by the other prediction models if invoked.

The predictions are based on reference motifs for each TF. A TF mapping file can be provided (see the example file TF_mapping.tsv in bin/utils) that specifies, for each data set to be processed, which TF was targeted and which position frequency matrix (PFM) it maps to. Alternatively, you can specify your own PFM as shown in the examples below.

Important: for convenience, folders containing all the JASPAR 2018 TF profiles as PFMs, position weight matrices (PWMs), and PFMs in MEME format are provided in the data folder of the pipeline. If you want to use your own profiles for different TFs and launch processes in parallel, make sure you change the entries in the bin/config.sh file accordingly, as well as the TF_mapping.tsv file. This file is automatically sorted (once) to save computation time when searching for the associated PFM across large numbers of data sets. The PFM and PWM files contain only numeric characters. Note that in the mapping file you should respect the order of the columns; the column names are not taken into consideration.
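To illustrate how such a mapping can be resolved (a hypothetical sketch; the pipeline's own lookup may differ, and the three-column layout and the toy PFM ID below are assumptions for the example):

```shell
# Build a toy tab-separated mapping (dataset, TF, PFM ID) and look up
# the PFM associated with a given dataset name
printf 'MY_DATASET\tJUND\tMA0000.1\n' > toy_mapping.tsv
awk -F'\t' -v ds="MY_DATASET" '$1 == ds { print $3 }' toy_mapping.tsv
# prints MA0000.1
```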

To process the ChIP-seq peaks, use either:

cd /path/to/ChIP-eat/bin/peak_processing
bash peak_processing.sh /path/to/my/input/folder /path/to/my/output/folder/ PEAK_CALLER MODEL [my_pfm] [my_tf_name] [window_size] [training_window_size]

or

bash launch_peak_processing.sh /my/input/folder /my/output/folder/ PEAK_CALLER MODEL NB_PROC [my_pfm] [my_tf_name] [window_size] [training_window_size]

where PEAK_CALLER is the peak caller used on the files (one of MACS, HOMER, BCP, GEM) and MODEL is the prediction model to use: one of position weight matrices (PWM), binding energy models (BEM), optimized PWMs (DiMO), transcription factor flexible models (TFFM), or DNAshaped TFBS (DNA). You can also use ALL, in which case all supported prediction models are run in the above order.

The arguments between brackets are optional. You can use a custom PFM [my_pfm] if you like, and a custom TF name [my_tf_name] that will be part of the output file names and appear inside the produced files. Importantly, the TF name is by default parsed from the name of the data set passed as input: the data set name is split by _ and the last part is considered the TF (e.g., MY_DATA_SET will result in TF name SET, and CHIP-SEQ_JUND will result in TF name JUND). The [window_size] is the number of base pairs to extend around the peak summit, upstream and downstream. This region is considered the prediction window. The default is 500bp, so the length of the sequences used for prediction and scoring is 1001bp (500bp upstream, 1bp peak summit, 500bp downstream). The [training_window_size] is similar to [window_size], but for the training data sets. The default is 50bp.
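The default TF-name parsing and the prediction-window arithmetic described above can be sketched as:

```shell
# Default TF name: split the dataset name on "_" and keep the last field
dataset="CHIP-SEQ_JUND"
tf_name="${dataset##*_}"
echo "$tf_name"                   # JUND

# Prediction sequence length for a given window size:
# window upstream + 1bp peak summit + window downstream
window_size=500
echo $(( 2 * window_size + 1 ))   # 1001
```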

Examples

cd /path/to/ChIP-eat/bin/peak_processing
bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder MACS PWM 30

or

bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder MACS TFFM 10

or

bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder HOMER ALL 20

or

bash launch_peak_processing.sh /path/to/my/chip-seq/peak/datasets /path/to/my/output/folder MACS PWM 20 [MY_PFM] [TF_NAME] [PREDICTION_WINDOW] [TRAINING_WINDOW]

Note: in order to have the DNAshaped TFBS model applied with TFFM, you need to have run the TFFM model first so that the .xml file with the model parameters is present.

Test data

Three sets of ChIP-seq peaks obtained with MACS2 are provided as test data in the test_data folder of the pipeline.

Output

In the results folder passed as a parameter, a subfolder for each data set will be created. Within this folder, for every model invoked, a subfolder with the name of the model is created. Each model will generate the following files, named in this format:

  • <PEAK_CALLER>_<TF>_<PFM_ID>.score - the sequences along with their assigned score
  • <PEAK_CALLER>_<TF>_<PFM_ID>.score.thr - the thresholds on motif score and distance defining the enrichment zone, as well as the centrality p-value
  • <PEAK_CALLER>_<TF>_<PFM_ID>.ez.png - a visual representation of the ChIP-seq peaks and the thresholding result
  • <PEAK_CALLER>_<TF>_<PFM_ID>.tfbs.bed - the TFBSs predicted inside the enrichment zone, in BED6 format
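For example, with hypothetical values for the three name components, the naming scheme yields:

```shell
# Compose an output base name from the peak caller, TF, and PFM ID
peak_caller="MACS"
tf="JUND"
pfm_id="MA0000.1"   # hypothetical PFM ID
base="${peak_caller}_${tf}_${pfm_id}"
echo "${base}.tfbs.bed"   # MACS_JUND_MA0000.1.tfbs.bed
```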

Authors

  • Marius Gheorghe - Development and design - MariusGh
  • Anthony Mathelier - Design and tutoring - amathelier

License

This project is licensed under the GNU General Public License

Acknowledgments

  • Hat tip to anyone whose bits and pieces of code were used

Citation

If you make use of the ChIP-eat pipeline, please cite A map of direct TF-DNA interactions in the human genome, doi:https://doi.org/10.1093/nar/gky1210