HTTPS SSH

ProfileSeq

ProfileSeq is a computational method for the quantitative assessment of biological profiles to provide an exact, nonparametric probability that specific regions of the test profile have higher or lower signal densities than a control set. The method is applicable to high-throughput sequencing data (ChIP-Seq, GRO-Seq, CLIP-Seq, etc.) and to genome-based datasets (motifs, etc.).

Figure1_bitbucket.jpg

This documentation describes how to generate profiles with ProfileSeq starting with a set of test and control regions, and a set of signals to be profiled in BED format from sequencing reads, motifs, etc.

Input

The first step is to create a ".ref" file for your test and control regions with the format:

<id>.<chromosome>.<position>.<strand>

For instance:

ref1.chr8.104952.+

We have provided sample test and control files: Test.ref and Control.ref These sample files are of paired intragenic regions with or without CTCF peaks present from ENCODE data for K562. Each line is a single reference position entry.

The other input, a file of coordiantes to be plotted on the profile, can be in closed BED format (i.e. the start and end positions are included). But since ProfileSeq is optimized to handle a single necleotide position per line for quantification, we recommend you use the special .pos format:

<chromosone>    <coordinate>    <strand>

For instance

chr7   1004584     +

Read lengths longer than a single nucleotide can be compressed to a single position (e.g. the read center) without losing any information. Tabs should delimit the columns in .pos files, as in BED files. We have provided a sample .pos file of Pol2 ChIP-Seq reads from the same cell line as Test.ref and Control.ref (K562 from ENCODE): Pol2.pos

The file consists of 5 million randomly selected lines from 2 replicates, which were pooled after removing duplicate reads from each rep separately. It is not necessary to remove duplicate reads, but it is important to known how many times each nucleotide coordinate can be repeated in the .pos file in order to attain reliable P-value results.

Pol2.pos is unstranded (reads from opposite strands are indistinguishable), so each replicate can contribute 2 lines to the .pos file (one for each strand), for a maximum of 4 in total for the pooled file.

Sample Inputs:

Test.ref
Control.ref
Pol2.pos

The next step is to count occurrences in the .bed or .pos file around the .ref regions:

The binary file count_occurences which has been included is compiled for Linux64. If it is compatible with your system, you may proceed to the description of count_occurences below. If not, you will need to unzip the file count_occurences_v01.zip and compile count_occurences.c. In Linux64 systems, this is done with the command

gcc -O4 -Wformat=0 count_occurences.c -o count_occurences -lm

The command should be similar in most systems, but some slight changes may be necessary. If you have questions or issues with getting count_occurences to run on your system, please email Nicolas Bellora: nbellora@gmail.com

count_occurences

Count the number of lines in the .bed/.pos file at bins around the .ref coordinates.

Usage:

count_occurencess from to bins -r reference_file -b BED_file [ -s strand ] > occurrences_file
  • from, to relative positions to the reference to compute the occurrences. negative = upstream and positive = downstream.
  • bins number of bins, or columns, of the output matrix. it MUST to be an integer factor of to-from.
  • -r reference's file with single nucleotide positions and strand. rows of the output matrix. format: unique_id.sequence_id.strand.position, e.g. my_id_1.chrX.+.19681201
  • -b BED file: chromosome, start, end, name, score and strand
  • -p .pos file, a file with just a single position: chromosome, position and strand
  • -s (+/-) only maps bed regions in the same strand as reference position (+) or just from the opposite strand (-). Default: it maps bed regions from both strands.
  • -h --help display this help and exit

Note: This program assumes that bed files are in closed format, i.e. column 3 is included in the interval.

Example: To count occurrences in the region from 1 kb upstream to 100 nt downstream of the positions in Test.ref and Control.ref, using a total of 110 bins (10 nt/bin), type in the command line

./count_occurences -1000 100 110 -r Test.ref -p Pol2.pos > Test.occ
./count_occurences -1000 100 110 -r Control.ref -p Pol2.pos > Control.occ

The first line of test.occ should look like this:

chr10:101605381:101605539:+_295_.chr10.+.101605756  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Now we are ready to GENERATE PROFILES!!!

ProfileSeq_basic

This script can be used for a quick and easy view of profiles. It will generate a standard multi-profile plot without P-values.

Usage:

R --vanilla --args name_1 occFile_1 lty_1 color_1 . . . name_1 occFile_N lty_N color_N outName from to bins RepNum RefLabel yLabel plotMin plotMax outDir resolution < ProfileSeq_basic.R

The options are the following

  • name_i Legend name for the profiles centered at the ith reference set.
  • occFile_i Name of the ith occ file
  • lty_i Line type of the ith profile. It is the lty paramater in the R function par(). E.g. 1 is for a solid line, and 2 is for a dashed line.
  • color_i Color for the line of the ith profile. It is the col paramater from the R function par().
  • outName The prefix used for output filenames.
  • from, to relative positions to the reference to compute the occurrences. negative = upstream and positive = downstream. Must be the same values as used for count_occurences.
  • bins number of bins, or columns, of the output matrix. it MUST to be an integer factor of to-from.
  • RepNum The maximum number of times reads with a specific set of coordinate (chr pos strand) can be repeated in the .bed/.pos file used to make the .occ file. Enter half the value just stated if occurences of the .bed/.pos file were only counted if they were on the same strand as positions in the .ref file. It MUST be an integer or half-integer value. It is important that this number is set correctly in order to get accurate P-values. For example, if each coordinate can occur at most 1 time in the .bed/.pos file, and strand was considered in making the .occ file, RepNum=.5, wherease if strand was not considered (e.g. ChIP-Seq data), RepNum=1.
  • RefLabel The label for the reference regions, used for the x-axis label. The x-axis will be "nt from <RefLabel>".
  • yLabel The full label to print for the y-axis.
  • plotMin The minimum value for the y-axis. To let R set it automatically, enter "NA".
  • plotMax The maximum value for the y-axis. To let R set it automatically, enter "NA".
  • outDir The name of the directory to put the output file. It must end with a "/".
  • resolution Enter "HI" to produce a high-resolution image which should be suitable for publication (300 ppi). Any other value prompts R to produce plots at the default resolution (for smaller output filesize).

Note: i runs from 1 to N, where N can be any integer greater or equal to 1. 5 tick marks are plotted and labelled on the x-axis at evenly spaced intervals starting at "from". Making to-from divisible by 4 will result in the nicest x-axis labelling, e.g. to-from=1000; 1000/4=250. A vertical line is plotted at the reference point, 0 nt on the x-axis.

Output:

  • outName.png .png file of the profiles.

Example:

R --vanilla --args "CTCF" Test.occ 1 black "No CTCF" Control.occ 2 black Pol2Plot_basic -1000 100 110 2 "peak center" "% reads mapped" NA NA $(pwd)/ low < ProfileSeq_basic.R

ProfileSeq_ss

Generates profiles with P-values, comparing test vs. control occureences at each bin. Exactly 2 profiles are plotted and compared. Ideal for profiles around splice sites (DNA or pre-mRNA).

Usage:

R --vanilla --args name_1 occFile_1 lty_1 color_1 name_1 occFile_2 lty_2 color_2 [mapFile_1 mapFile_2] outName dataLab from to bins RepNum RefLabel yLabel plotMin plotMax outDir legend maxPval Pvals resolution < ProfileSeq_ss.R
  • name_i Legend name for the profiles centered at the ith reference set.
  • occFile_i Name of the ith occ file
  • lty_i Line type of the ith profile. It is the lty paramater in the R function par(). E.g. 1 is for a solid line, and 2 is for a dashed line.
  • color_i Color for the line of the ith profile. It is the col paramater from the R function par().
  • mapFile_i Optional. .occ files containing counts of mappable reads at the test reference regions (i=1) and control reference regions (i=2), to increase the profile's accuracy. If normalization by input signal is desired rather than by mappable reads, you must use ProfileSeq_peaks_InputNorm.
  • outName The prefix used for output filenames.
  • dataLab Label for the data contained in the .bed/.pos file.
  • from, to relative positions to the reference to compute the occurrences. negative = upstream and positive = downstream. Can be -1000, 100 or -300, 100 for 3' ss and -100, 300 or -100, 1000 for 5' ss. If other values are desired, you should use ProfileSeq_basic or ProfileSeq_peaks.
  • bins number of bins, or columns, of the output matrix. it MUST to be an integer factor of to-from.
  • RepNum The maximum number of times reads with a specific set of coordinate (chr pos strand) can be repeated in the .bed/.pos file used to make the .occ file. Enter half the value just stated if occurences of the .bed/.pos file were only counted if they were on the same strand as positions in the .ref file. It MUST be an integer or half-integer value. It is important that this number is set correctly in order to get accurate P-values. For example, if each coordinate can occur at most 1 time in the .bed/.pos file, and strand was considered in making the .occ file, RepNum=.5, wherease if strand was not considered (e.g. ChIP-Seq data), RepNum=1.
  • RefLabel The label for the reference regions, used for the x-axis label. The x-axis will be nt from <RefLabel>.
  • yLabel The full label to print for the y-axis.
  • plotMin The minimum value for the y-axis. To let R set it automatically, enter "NA".
  • plotMax The maximum value for the y-axis. To let R set it automatically, enter "NA".
  • outDir The name of the directory to put the output file. It must end with a "/".
  • legend Either TRUE or FALSE, to indicate whether or not to display the legend on the plot. TRUE displays the legend. The y-axis label will also be suppressed if legend=FALSE, since it is assumed that in it will be paired with another profile with the legend and y-axis, e.g. the 3' ss profile with legend and y-axis, and the 5' ss profile without.
  • maxPval Can be used to indicate for which P-value cutoff FDR < .05 (this must be deterined beforehand). Only bins whose P-val satisfies FDR<.05 will be displayed. Enter 1 to display all bins with P < .01 (no FDR estimated).
  • Pvals Either TRUE or FALSE. TRUE outputs a .png file with a grayscale for the P-value cutoffs which are printed at the bottom of the profile.
  • resolution Enter "HI" to produce a high-resolution image which should be suitable for publication (300 ppi). Any other value prompts R to produce plots at the default resolution (for smaller output filesize).

Notes: Exactly 2 or 4 .occ files must be input. The 3rd and 4th .occ files are for occurences of mappable positions and are optional.

Output:

  • outName.png .png file of the profiles.
  • Pvalues_outName Text file with the P-value for each bin in the profile. P-values will be appended to an existing file of P-values, which is useful for estimating the FDR.
  • Table_SignificantRegions_outName Lists the regions where there is a significant difference between the test and control profiles.
  • grayscalePvals.png Image showing the P-value cutoffs corresponding to the shaded gray rectangles in the plot. Only output if Pvals=True.

Example:

R --vanilla --args "CTCF" Test.occ 1 black "No CTCF" Control.occ 2 black Pol2Plot_ss Pol2 -1000 100 110 2 "3' ss" "% reads mapped" NA NA $(pwd)/ TRUE 1 TRUE HI < ProfileSeq_ss.R

Note that in the example above, we use .occ files that were from counts centered at peaks instead of splice sites, but we are running this as if they were centered at the 3' splice site. This was just to be able to illustrate its usage without the need for additional .ref files. Test.ref and Control.ref will typically be centered at either the 3' or 5' splice sites for ProfileSeq_ss, although other types of regions may be applicable.

ProfileSeq_peaks

Generates profiles with P-values, as in ProfileSeq_ss. However, additional P-values are calculated and displayed comparing the central to flanking regions of each individual profile. Exactly 2 profiles are plotted and compared. Ideal to test for accumulation of a signal at peaks or other regions of interest.

Usage:

R --vanilla --args name_1 occFile_1 lty_1 color_1 name_1 occFile_2 lty_2 color_2 [mapFile_1 mapFile_2] outName dataLab from to bins RepNum RefLabel yLabel plotMin plotMax outDir legend maxPval Pvals resolution Pvals resolution < ProfileSeq_peaks.R
  • name_i Legend name for the profiles centered at the ith reference set.
  • occFile_i Name of the ith occ file
  • lty_i Line type of the ith profile. It is the lty paramater in the R function par(). E.g. 1 is for a solid line, and 2 is for a dashed line.
  • color_i Color for the line of the ith profile. It is the col paramater from the R function par().
  • mapFile_i Optional. .occ files containing counts of mappable reads at the test reference regions (i=1) and control reference regions (i=2), to increase the profile's accuracy. If normalization by input signal is desired rather than by mappable reads, you must use ProfileSeq_peaks_InputNorm.
  • outName The prefix used for output filenames.
  • dataLab Label for the data contained in the .bed/.pos file.
  • from, to relative positions to the reference to compute the occurrences. negative = upstream and positive = downstream.
  • bins number of bins, or columns, of the output matrix. it MUST to be an integer factor of to-from. MUST be divisible by 8, e.g. 200/8=25.
  • RepNum The maximum number of times reads with a specific set of coordinate (chr pos strand) can be repeated in the .bed/.pos file used to make the .occ file. Enter half the value just stated if occurences of the .bed/.pos file were only counted if they were on the same strand as positions in the .ref file. It MUST be an integer or half-integer value. It is important that this number is set correctly in order to get accurate P-values. For example, if each coordinate can occur at most 1 time in the .bed/.pos file, and strand was considered in making the .occ file, RepNum=.5, wherease if strand was not considered (e.g. ChIP-Seq data), RepNum=1.
  • RefLabel The label for the reference regions, used for the x-axis label. The x-axis will be nt from <RefLabel>.
  • yLabel The full label to print for the y-axis.
  • plotMin The minimum value for the y-axis. To let R set it automatically, enter "NA".
  • plotMax The maximum value for the y-axis. To let R set it automatically, enter "NA".
  • outDir The name of the directory to put the output file. It must end with a "/".
  • Pvals Either TRUE or FALSE. TRUE outputs a .png file with a grayscale for the P-value cutoffs which are printed at the bottom of the profile.
  • resolution Enter "HI" to produce a high-resolution image which should be suitable for publication (300 ppi). Any other value prompts R to produce plots at the default resolution (for smaller output filesize).

Notes: Exactly 2 or 4 .occ files must be input. The 3rd and 4th .occ files are for occurences of mappable positions and are optional. The single profile comparisons compare the central region, which consists of 1/4 of all bins, centered at the middle of the plot, to the two immediately flanking regions, which together consist of 1/4 of all bins as well. So it is important that a whole number of integer bins fit in these regions. The easiest way to do this is to set from=-to, and make bins divisible by 8.

Output:

  • outName.png .png file of the profiles.
  • Pvalues_outName Text file with the P-value for each bin in the profile. P-values will be appended to an existing file of P-values, which is useful for estimating the FDR.
  • Table_SignificantRegions_outName Lists the regions where there is a significant difference between the test and control profiles.
  • grayscalePvals.png Image showing the P-value cutoffs corresponding to the shaded gray rectangles in the plot. Only output if Pvals=True.

Example:

#Make symmetric .occ files
./count_occurences -1000 1000 200 -r Test.ref -p Pol2.pos > Test2.occ
./count_occurences -1000 1000 200 -r Control.ref -p Pol2.pos > Control2.occ
#Plot profile
R --vanilla --args  "CTCF" Test2.occ 1 black "No CTCF" Control2.occ 2 black Pol2Plot_peaks Pol2 -1000 1000 200 2 "peak center" "% reads mapped" NA NA $(pwd)/ FALSE low < ProfileSeq_peaks.R

ProfileSeq_peaks_InputNorm

The same as ProfileSeq_peaks, except an input signal must be input, which is used for profile normalization. In ProfileSeq_peaks (and _ss and _basic), either mappable coordinates or all genomic coordinates are used as the pool of possible places where a read can occur. Mappable reads (or the full set of coorinates) is guaranteed to be at least as big as the number of reads in the sample. This is not necessarily the case for input reads, howerver. Since the input sample is IP'ed and PCR amplified to attain the final sample, it could be that there are more sample reads than input reads at some coordinates. So the calculation of P-values must be done in a slightly different way, hence the need for a separate code for input normalization.

See ProfileSeq_peaks for a description of parameters and usage. RepNum here is the ratio of the number of sampe reps divided by the number of input reps. RepNum=1 when the same number of replicates are pooled for sample and input files.

Example: As we have not provided a file of input reads, we will pretend that Pol2.pos is both the final ChIP-Seq sample and inputt.

R --vanilla --args  "CTCF" Test2.occ 1 black "No CTCF" Control2.occ 2 black Test2.occ Control2.occ Pol2Plot_peaks_InputNorm Pol2 -1000 1000 200 1 "peak center" "% reads/input read" NA NA $(pwd)/ FALSE low < ProfileSeq_peaks_InputNorm.R