epiPALEOMIX

A Fast, Accurate, and Automatic pipeline for generating nucleosome and methylation maps from high-throughput sequencing data underlying ancient samples.

Wiki Content

Overview

epiPALEOMIX is a fast, open-source pipeline tailored to identify epigenetic signatures in archaeological material from high-throughput DNA sequencing data. It leverages the natural degradation processes that affect DNA after death and thus does not require prior treatment of ancient DNA extracts with gold-standard epigenetic methods, such as bisulfite sequencing or ChIP-seq. It can reveal genome-wide patterns of CpG methylation and can generate nucleosome maps and phasogram analyses, as shown in Pedersen et al. 2014 and Gokhman et al. 2014. Finally, epiPALEOMIX can accommodate any type of molecular tool used to prepare ancient DNA (aDNA), including USER treatment of DNA extracts and amplification of DNA libraries with uracil-intolerant DNA polymerases, such as the Phusion DNA polymerase.

Input files

Required input files to epiPALEOMIX (BAM files, Reference Genomes, Bedfiles)

BAM alignment files (the binary form of SAM files), BED coordinates for the genomic regions of interest, and a reference genome in FASTA format are required as input to epiPALEOMIX. Tabulated mappability files can optionally be supplied to restrict the analyses to uniquely mappable regions of the genome. Mappability maps and BED files for human (hg19), horse (EquCab2.0) and Bos taurus (bosTau6) used in Hanghøj et al. (2016) are available here.

Because epiPALEOMIX does not perform alignment and mapping of sequencing data, we recommend using the user-friendly PALEOMIX pipeline to generate the BAM file of aligned sequencing data from FASTQ files. Reference genomes and BED files can be fetched from several sources (e.g., UCSC), depending on the organism analyzed and the reference assembly required.

Installation

The pipeline is written in Python 2.7.10 and is compatible with Python 2.7.3+ but not Python 3. It builds on the node-graph structure created for the PALEOMIX pipeline, with makefiles in YAML format (http://yaml.org/). It has been tested thoroughly on OS X (Apple), several Linux-based servers and a cluster.

Requirements

Python 2.7.3+, with pysam v0.8+

    $ [sudo] easy_install pip  # if pip is not installed already
    $ pip install pysam --user

R v2.15+

Samtools is not required by epiPALEOMIX but recommended for general manipulation and indexing of SAM/BAM files and reference genomes.

Installation

Install all required dependencies listed above.

Create an installation folder (in this example, ~/install) and clone epiPALEOMIX as shown below:

    $ mkdir -p ~/install
    $ cd ~/install
    $ git clone https://khanghoj@bitbucket.org/khanghoj/epiPALEOMIX.git

To avoid writing the full path to epiPALEOMIX, add a symbolic link to ~/install/epiPALEOMIX/run.py in one of the directories on your executable path. For example, if ~/bin is on your executable path (run echo $PATH to check):

    $ cd ~/bin
    $ ln -s ~/install/epiPALEOMIX/run.py epiPALEOMIX
    $ chmod +x epiPALEOMIX

You might need to restart the Bash session to enable the symbolic link.

NOTE: If no symbolic link is created, the entire path to the epiPALEOMIX executable (~/install/epiPALEOMIX/run.py) must be typed instead of just typing epiPALEOMIX. A symbolic link is not required for installing the pipeline but only recommended for simplicity.
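
Whether a directory such as ~/bin is already on the executable path can also be checked with a small helper instead of reading the output of echo $PATH by eye. The following is a minimal sketch; the function name dir_in_path is hypothetical and not part of epiPALEOMIX.

```shell
# dir_in_path DIR: succeed if DIR is one of the colon-separated entries of $PATH.
# Illustrative helper only; not shipped with epiPALEOMIX.
dir_in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: print a reminder when ~/bin is not on the executable path.
dir_in_path "$HOME/bin" || echo "~/bin is not in PATH; add it before creating the link"
```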

Setting default optional arguments

It is recommended to write a config file with --write-config-file before running epiPALEOMIX to set default parameters such as the temporary folder, the default number of threads, and warning levels. The configuration file is written to ~/.pypeline/epiPALEOMIX.ini and can be edited further with any text editor (e.g., Nano, Pico, Vim or TextWrangler). For instance, to set the maximum number of parallel threads to 20, type the following command:
```
$ epiPALEOMIX --write-config-file --max-threads 20
```
If a default argument needs to be changed, simply override it on the command line, as shown below, or edit ~/.pypeline/epiPALEOMIX.ini with a text editor.
```
$ epiPALEOMIX run makefile.yaml --max-threads 4
```

If a config file is created, it will be parsed automatically every time epiPALEOMIX is executed.

How to run epiPALEOMIX

Below is a brief overview of the steps required to run epiPALEOMIX once it has been installed as described above:

$ epiPALEOMIX -h   # Prints extensive optional arguments
or
$ epiPALEOMIX help # Prints simple help

NOTE: epiPALEOMIX requires that bedfiles, reference genomes, mappability files, and BAM files share the same chromosome prefix.

For example, if the sequences in the reference genome are named chr1, chr2 and so on, the BED files, mappability files and BAM headers must use the same names, not 1, 2 and so on.
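
The chromosome-name requirement in the NOTE above can be checked before running the pipeline. The sketch below compares the names in column 1 of a BED file with the sequence names of a FASTA reference, using only standard tools. The function names (fasta_chroms, bed_chroms, missing_chroms) are hypothetical helpers and not part of epiPALEOMIX.

```shell
# NOTE: illustrative helpers, not part of epiPALEOMIX.

# fasta_chroms FILE: sorted, unique sequence names of a FASTA file
# (the first word after ">" on each header line).
fasta_chroms() {
    awk '/^>/ { name = $1; sub(/^>/, "", name); print name }' "$1" | sort -u
}

# bed_chroms FILE: sorted, unique names from column 1 of a BED file,
# skipping empty lines and "track", "browser" and "#" lines.
bed_chroms() {
    awk 'NF && !/^(track|browser|#)/ { print $1 }' "$1" | sort -u
}

# missing_chroms FASTA BED: print the BED chromosome names that are absent
# from the FASTA file; returns 1 if any are missing, 0 otherwise.
missing_chroms() {
    ref_list=$(mktemp)
    bed_list=$(mktemp)
    fasta_chroms "$1" > "$ref_list"
    bed_chroms "$2" > "$bed_list"
    # comm -13 keeps only the lines unique to the second (BED) list
    missing=$(comm -13 "$ref_list" "$bed_list")
    rm -f "$ref_list" "$bed_list"
    [ -z "$missing" ] && return 0
    printf '%s\n' "$missing"
    return 1
}
```

Calling missing_chroms reference.fasta regions.bed (placeholder file names) prints nothing when every BED name exists in the reference. For BAM files, the sequence names listed in the @SQ header lines printed by samtools view -H can be compared in the same way.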

Preparing a makefile

A makefile, provided by the user in YAML format, should include the input paths and the analyses to be performed by epiPALEOMIX. The makefile can easily be filled in using a text editor. See Makefile Documentation for an extensive description of the options, format and structure of epiPALEOMIX YAML-format makefiles. These closely resemble those used in PALEOMIX. It is also recommended to go through the Tutorial to understand the structure of the pipeline, makefile, and outputs of epiPALEOMIX.

epiPALEOMIX requires at least one makefile in YAML format in order to run the dryrun/run command. To generate a generic makefile, run the following command:

$ epiPALEOMIX makefile > makefile.yaml

Then use a text editor to fill in the input paths to the reference genome, BED file(s), and BAM file(s). By default, all analyses are disabled. You can enable the analyses of interest by changing their enabled flag from False to True.

To generate a non-verbose makefile with all analysis parameters set to their defaults, use $ epiPALEOMIX makefile simple > makefile.yaml. The default parameters can be found in Makefile Documentation.

Running epiPALEOMIX

Before starting the analyses, it is recommended to check that all executables and input/output files of the node graph are available using the command $ epiPALEOMIX dryrun makefile.yaml. With --list-output-files added to this command, all output files, including temporary files, are written to standard output. Assuming you have created and filled in the makefile, epiPALEOMIX can be executed using the following command:

$ epiPALEOMIX run makefile.yaml

Results will be located in a sub-directory named "OUT_makefilename_" in the current working directory unless a --destination path has been given.
Temporary files will be saved in a sub-directory named "TEMPORARYFILES_makefilename_" in the current working directory unless a --destination path has been given.
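
The dry run and the actual run can be chained so that the analyses only start when the dry run succeeds. The wrapper below is a minimal sketch: run_checked and the EPIPALEOMIX_CMD variable are hypothetical names introduced here, and the sketch assumes that the dryrun command exits with a non-zero status when something is missing, which should be verified on your installation.

```shell
# run_checked MAKEFILE: run the dry run first and start the analyses only if
# it succeeds. EPIPALEOMIX_CMD is a hypothetical variable naming the
# executable (default: epiPALEOMIX); it is not an option of the pipeline.
# Assumes "dryrun" returns a non-zero exit status on failure.
run_checked() {
    cmd=${EPIPALEOMIX_CMD:-epiPALEOMIX}
    "$cmd" dryrun "$1" && "$cmd" run "$1"
}
```

With the symbolic link from the installation section in place, run_checked makefile.yaml is then equivalent to typing the dryrun and run commands one after the other.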

epiPALEOMIX Output

Two folders are generated when running the epiPALEOMIX pipeline:

The 'OUT_makefileName' folder contains a subfolder for each BAM file analyzed, holding the final output files of each analysis in the form "BAMName_AnalysisName_BedName.txt.gz".
The 'TEMPORARYFILES_makefileName' folder contains all temporary files generated by epiPALEOMIX. No final results are located in this folder.
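
Because the final tables are gzip-compressed, a quick way to inspect them is to print the first few lines of each file. The helper below is a sketch that assumes the layout described above (one subfolder per BAM file inside the output folder, with files ending in .txt.gz); the function name preview_outputs is hypothetical.

```shell
# preview_outputs OUT_DIR [N]: print the first N lines (default 3) of every
# gzip-compressed table found in the per-BAM subfolders of OUT_DIR.
# Illustrative helper only; not shipped with epiPALEOMIX.
preview_outputs() {
    out_dir=$1
    n_lines=${2:-3}
    for table in "$out_dir"/*/*.txt.gz; do
        [ -e "$table" ] || continue   # the glob matched nothing
        printf '== %s\n' "$table"
        gzip -dc "$table" | head -n "$n_lines"
    done
}
```

The first argument is the path to the output folder created by the run, and the optional second argument changes the number of lines shown per file.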

The epiPALEOMIX pipeline produces four types of flat tabulated output files in gzip format for each BED file provided.

NucleoMap output files contain the genomic coordinates for each predicted nucleosome, peak read-depth, and nucleosome calling score.
MethylMap output files contain genomic position, counts of deaminated reads and coverage.
Phasogram output files contain the distribution of distances between successive read starts.
WriteDepth output files contain genomic position, read depth, and running nucleosome score.
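
To illustrate what a distribution of distances between successive read starts looks like, the sketch below computes one from a plain two-column list of chromosome names and read start positions. It is not the code used by epiPALEOMIX, and the tab-separated input format and the function name phasogram_sketch are assumptions made for this example.

```shell
# phasogram_sketch: read "chrom<TAB>start" lines on stdin and print
# "distance<TAB>count" for the distances between successive read starts on the
# same chromosome, sorted by distance. Illustrative only; not the pipeline's
# implementation.
phasogram_sketch() {
    sort -k1,1 -k2,2n |
    awk -F'\t' '
        # Same chromosome as the previous read: count the gap between starts
        $1 == prev_chrom { counts[$2 - prev_start]++ }
        { prev_chrom = $1; prev_start = $2 }
        END { for (d in counts) print d "\t" counts[d] }
    ' |
    sort -k1,1n
}
```

Distances are never computed across chromosome boundaries, since the comparison with the previous read is made only when both reads share the same chromosome name.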

epiPALEOMIX can also implement a read-depth correction procedure for variable %GC content. If the GCcorrect option is enabled in the makefile, the correction model is created, used and provided as an additional output. The suffix "GCcorr" will then be added to the file name of each analysis.

Tutorial

Documentation makefile