Wiki

Clone wiki

epiPALEOMIX / Tutorial

Tutorial

This tutorial will guide you through a basic utilization of epiPALEOMIX, aiming at generating methylation levels and nucleosome occupancy profiles in genomic regions surrounding CTCF binding sites. We will use a subset of the Illumina sequence data underlying the Paleo-Eskimo Saqqaq genome published by Rasmussen and colleagues (2010), and assume that the epiPALEOMIX pipeline has been properly installed, following the epiPALEOMIX installation wiki.

CTCF binding sites.

Fu et al (2008) have shown that occupied CTCF (CCCTC-binding factor) binding sites, which play an essential role in transcription regulation by binding the CTCF insulator, are flanked by arrays of well-positioned nucleosomes, separated by ~190-bp. Moreover, Kelly et al (2012) explored the methylation profile of such regions, revealing an out-of-phase profile between methylation levels and nucleosome occupancy upstream and downstream of the CTCF binding sites. These patterns have been described for the first time in ancient individuals by Pedersen et al. (2014), thereby validating the computational proxies used for predicting methylation levels and nucleosome occupancy. In this tutorial, we show how these findings can be identified in high-throughput sequencing data from an ancient human sample. The approach described here can be applied to any other genomic region by simply changing the genomic coordinates provided in the bed file.

No additional R-Packages are required to run epiPALEOMIX itself, however, two R-packages are required to plot the output in this tutorial.

Both packages can be installed by typing install.packages(c("ggplot2", "gridExtra")) in R.

Downloading Tutorial data

Downloading and unpacking the tutorial data (1.3 GB):

    $ wget https://www.dropbox.com/s/5kxsqhq1aktm1i9/tutorial_epipal.tar.gz?dl=0 -O tutorial_epipal.tar.gz
    $ tar xvzf tutorial_epipal.tar.gz
    $ cd tutorial_epipal/

The uncompressed directory contains:

  • A prefix folder, providing the Hg19 human reference genome (hs.build37.1.fa.gz) and its corresponding indeces (hs.build37.1.fa.{fai,gzi}) and mappability files (GENOME_51_50.40000-20000.mappability) of assembly Hg19;
  • A data folder, providing the BAM read alignment file (TUTORIAL_CTCF.bam) together with its index file TUTORIAL_CTCF.bam.bai;
  • A bedfile folder, providing the genomic coordinates of the first kilobase flanking OCCUPIED CTCF binding sites (CTCF.bed) Fu et al., 2008);
  • A yaml-format makefile (example.yaml), representing the main input file required by epiPALEOMIX and containing path to BAM and BED files as well as indicating which analyses should be conducted;
  • A plottingtools folder, providing additional R-scripts to visualize the outputs of epiPALEOMIX;
  • A executable bash script (createplots.sh), commanding all R-scripts provided in the plottingtools folder.

epiPALEOMIX makefiles, a walk-through

A complete example.yaml makefile is already provided for this tutorial. Here, we will explain its structure in different subsections. For an extensive documentation of makefiles, see Makefile Documentation.

The first line of example.yaml specifies the file type:

#!yaml

# -*- mode: Yaml; -*-

YAML is a human-readable format well-suited for configuration files. Subcategories are indented with one or more spaces relative to the parent category. Always use spaces, and never use tabs for indentation, see preview of the yaml-documentation for a quick overview of the YAML-format.

The second line is a time stamp automatically created by epiPALEOMIX while generating makefiles using the following commands $ epiPALEOMIX makefile or $ epiPALEOMIX makefile simple. Timestamps are unique and useful to keep track of when a particular analysis has been run.

#!yaml

# Timestamp: 2015-11-10T11:31:09.819521

Prefix section

The prefix section contains the reference genome path --FastaPath and the Mappability path --MappabilityPath. The former is mandatory, while the second is optional (and should be provided anytime MappabilityFilter and/or GCcorrect options are set to True).

#!yaml

Prefixes:
  --FastaPath: ./prefix/hs.build37.1.fa.gz                             # Human Reference genome build 37/hg19
  --MappabilityPath: ./prefix/GENOME_51_50.40000-20000.mappability  # mappability file to filter repetitive regions

Bedfile section

Here, we apply mappability filtering (MappabilityFilter: True) of the bedfile regions prior to running any analysis. As we remove regions of the bedfile outside of the mappable fraction of the genome, MappaOnly is added to the bedname automatically, i.e. CTCF -> CTCFMappaOnly.

#!yaml

BedFiles:  # AT LEAST ONE BEDFILE IS REQUIRED
    MappabilityFilter: True                       # Only Mappable regions are analyzed
    CTCF: ./bedfile/CTCF.bed                      # Output name of Bedfile (CTCF) and Path to Bedfile (./bedfile/CTCF.bed)

Bamfile section

In this tutorial, we apply default options to all analyses shown below.

#!yaml

BamInputs:                                        # Do not change
    EXAMPLEBAM:                                   # The output name of the bamfile (EXAMPLEBAM)
        BamInfo:                                  # Generic info about BAM file
            BamPath: ./data/TUTORIAL_CTCF.bam     # path to BAM-file
        GCcorrect:                                # Options for GCcorrection
            Enabled: True                         # GCcorrection is enabled
        NucleoMap:                                # Options for nucleosome calling
            Enabled: True                         # Nucleosome calling is enabled
            Apply_GC_Correction: True             # GCcorrection model applied to Nucleosome calling
        MethylMap:                                # Options for Methylation map
            Enabled: True                         # Methylation Map is enabled
        Phasogram:                                # Options for Phasogram
            Enabled: True                         # Phasogram is enabled
            Apply_GC_Correction: False            # GCcorrection model not applied to Phasogram
            --SubsetPileup: 3                     # Require minimum of three reads sharing same 5' position
        WriteDepth:                               # Options for WriteDepth 
            Enabled: True                         # WriteDepth is enabled
            Apply_GC_Correction: True             # GCcorrection model applied to WriteDepth

Analyzing methylation levels and nucleosome occypancy of CTCF binding sites ±1 KB using epiPALEOMIX.

To check that requirements and paths are available dryrun the makefile first:

$ epiPALEOMIX dryrun example.yaml
and
$ epiPALEOMIX dryrun example.yaml --list-executables

Next, we can run epiPALEOMIX to generate the nucleosome positioning, methylation marks, and Phasogram as specified in the makefile. It should take about 7-8 minutes on a MacBook Air (2012) using 4 threads.

$ epiPALEOMIX run example.yaml

After running epiPALEOMIX, six output files are generated in ./OUT_example/EXAMPLEBAM:

  1. EXAMPLEBAM_GC_Model.txt contains the GC-correct models used to correct for variable %GC content
  2. EXAMPLEBAM_GC_Model.pdf contains plots of the different GC-correction models used
  3. EXAMPLEBAM_Phasogram_CTCFMappaOnly.txt.gz contains two columns, the distance between read starts and number of occurrences:

    #Length Count
    1 67333
    2 69328
  4. EXAMPLEBAM_MethylMap_CTCFMappaOnly.txt.gz contains 5 columns, the genomic position, count of deaminated hits, total coverage, and provided bed coordinate:

    #chrom genomicpos deaminated total bedcoord
    10 100068549 0 5 10_100068499_100070499_+
    10 100068590 0 2 10_100068499_100070499_+
  5. EXAMPLEBAM_NucleoMapGCcorr_CTCFMappaOnly.txt.gz contains 6 columns, the genomic start and end position of the nucleosome dyad, GC-corrected read depth at the dyad, nucleosome score, and provided bed coordinate:

    #chrom start end depth score bedcoord
    10 100068658 100068659 60.4026394197 39.6249460575 10_100068499_100070499_+
    10 100068849 100068850 44.9184583098 26.9009008988 10_100068499_100070499_+
  6. EXAMPLEBAM_WriteDepthGCcorr_CTCFMappaOnly.txt.gz contains 5 columns, the genomic position, GC-corrected read depth, nucleosome running score, and provided bed coordinate:

    #chrom genomicpos depth score bedcoord
    10 100068499 27.545933293 -12.7661298164 10_100068499_100070499_+
    10 100068500 28.6795701848 -11.8480418039 10_100068499_100070499_+

Finally, we run createplots.sh to plot Methylation level, Nucleosome occupancy, and Phasogram of the CTCF binding site regions:

$ ./createplots.sh

Two PDF files will be created in the current working directory:

  • A Phasogram plot, providing short range and long range phasogram analyses showing remarkable ~10-bp and ~190-bp periodicities in the distance between read starts
  • A profile of methylation levels (blue) and nucleosome occupancy (Red) patterns, where boht show string ~190-bp periodicities and appear out-of-phase, as expected

If the run was successful the final PDF files should be identical to the two figures shown below.

A Phasogram plot

Phasogram plot

A profile of methylation levels (blue) and nucleosome occupancy (Red) patterns

NucleoMethyl plot

References

  • Rasmussen, M. et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463, 757–762 (2010).
  • Fu, Y., Sinha, M., Peterson, C. L. & Weng, Z. The insulator binding protein CTCF positions 20 nucleosomes around its binding sites across the human genome. PLoS Genet. 4, e1000138 (2008).
  • Kelly, T. K. et al. Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA molecules. Genome Res. 22, 2497–2506 (2012).
  • Pedersen, J. S. et al. Genome-wide nucleosome map and cytosine methylation levels of an ancient human genome. Genome Res. 24, 454–466 (2014).

Home


Documentation makefile

Updated