Wiki

Clone wiki

CAFE / vignette

CAFE Package HowTo

Introduction

So, you have downloaded the CAFE package, and are wondering "What on Earth can I do with this?". In answering this question, we can first of all refer to the package name: I’m sure you’d all have thought about a nice warm cup of coffee, but in reality CAFE stands for Chromosomal Aberrations Finder in Expression data. Now, that might not tell you much, so here is a slightly better - albeit longer - summary of what it does: CAFE analyzes microarray expression data, and tries to find out whether your samples have any gross chromosomal gains or losses. On top of that, it provides some plotting functions - using the ggplot2 package - to give you a nice visual tool to determine where those aberrations are located. How it exactly does this is something you can find later in this document. CAFE does not invent any new algorithms for the detection of chromosomal aberrations. Instead, it takes the approach of Ben-David et al [1], and molds it into an easy-to-use R package.

Prerequisites

As all software packages, CAFE has its requirements. You should preferably have the latest version of R, but at the very least version 2.10. Other dependencies can be found in the DESCRIPTION file. Some packages imported by CAFE require you to have libcurl installed in your system, and by extension the RCurl and Rtracklayer packages. If these are not yet installed in your system, this could take a while to install.

Preparing your file system

CAFE analyzes microarray expression data. That means that without anything to analyze, it won’t do anything at all. So, how do we fix this obvious problem? As a starter, CAFE only analyzes .CEL files. This means that CAFE will only work with data from Affymetrix microarrays. Illumina or other platforms unfortunately will not work. To prepare your filesystem correctly, follow the following steps.

Put all CEL files in a folder and then start R in that folder. Alternatively, you can set the working directory to this folder by setwd("/some/folder/path/"). And we’re done. It’s that simple. See figure 1 for a screenshot of how this looks in a file manager.

See figure 1 for a screenshot of how this is layed out

The file system layout

Analysis

Now we’re ready to start analyzing our dataset(s).

CAFE provides the ProcessCels() function. This function takes four arguments: threshold.over, threshold.under remove_method and local_file. The threshold will determine which probes are going to be considered as overexpressed. The default setting is threshold.over=1.5, meaning that probes with a relative expression over 1.5 times median will be considered overexpressed. Likewise threshold.under determines which probesets will be considered underexpressed. As for the remove_method argument, this determines in what way ProcessCels will remove duplicate probes. See section 3.1 for an overview of this option. The ProcessCels() function will basically suck up all your CEL files in your working directory - but don’t worry, they’ll still be there in your file system - and perform some number magic on them; it will normalize, calculate probes that are overexpressed, log2 transform and remove duplicates. It returns you a list object of your entire dataset, which the rest of CAFE requires to function.

So lets try that. First of all, we of course need to load the package:

library(CAFE)
## Loading required package: biovizBase
## Loading required package: GenomicRanges
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist
## 
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: ggbio
## Loading required package: ggplot2
## Need specific help about ggbio? try mailing 
##  the maintainer or visit http://tengfei.github.com/ggbio/
## 
## Attaching package: 'ggbio'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     geom_bar, geom_rect, geom_segment, ggsave, stat_bin,
##     stat_identity, xlim
## 
## 
## Attaching package: 'CAFE'
## 
## The following objects are masked _by_ '.GlobalEnv':
## 
##     discontSmooth, slidSmooth

Then, we should set our working directory to the dataset folder if we’re not already there.

setwd("/some/path")

Now we’re there we can process the CEL files.

datalist <- ProcessCels()

A moment of reproducibillity

The stuff we’ve done above here is unfortunately not really reproducible. It requires you to have .CEL files in a folder called /some/path in your home directory, and that’s of course not very pretty. So to keep the rest of this document reproducible, the CAFE package comes together with a data object. This is the list object returned by ProcessCels() when processing both GSE10809 and GSE6561.

data("CAFE_data")
# will put object named CAFE_data in your global environment

Statistics

So now we come to the core of CAFE: finding which chromosomes are significantly over- or underexpressed. CAFE uses thresholding to determine which probes are overexpressed or underexpressed. We use the probes we deemed overexpressed by our threshold to create a contingency table which can then be used in an Exact Fisher or Chi-Square test. Basically we assume that everything over our defined threshold.over is really overexpressed, and everything uder our defined threshold.under is underexpressed.

To calculate p-values for chromosomes, we will use the chromosomeStats() function. We can then also select which test (fisher or chi square) we want to use. When performing multiple tests (i.e. if we are testing multiple chromosomes), we need to correct our p-values for type I errors. We can therefore set the bonferroni argument to true when testing multiple chromosomes. The enrichment argument controls which side (over- or underexpressed) we are testing for.

# we first have to decide which samples we want to use.
names(CAFE_data[[2]])  #to see which samples we got
##  [1] "GSM151738.CEL" "GSM151739.CEL" "GSM151740.CEL" "GSM151741.CEL"
##  [5] "GSM272914.CEL" "GSM272915.CEL" "GSM272916.CEL" "GSM272917.CEL"
##  [9] "GSM272918.CEL" "GSM272919.CEL" "GSM272920.CEL" "GSM272921.CEL"
sam <- c(1, 3)  #so we use sample numbers 1, and 3 to compare against the rest
chromosomeStats(CAFE_data, chromNum = 17, samples = sam, test = "fisher", bonferroni = FALSE, 
    enrichment = "greater")  # we are only testing 1 chromosome
##     Chr17 
## 3.989e-48

This uses an Exact Fisher test. Technically speaking, this is better than a chi square test, but can be slower for very large sample sizes. We can also do a chi square test, which is slightly faster.

chromosomeStats(CAFE_data, chromNum = 17, samples = sam, test = "chisqr", bonferroni = FALSE, 
    enrichment = "greater")
##     Chr17 
## 1.188e-05

As you will have seen, the output of this is different from test="fisher". The reason for this is that the Fisher test gives an exact p-value, whereas a chi square test is just an approximation.

But if we now want to test multiple chromosomes, we have to correct our p-values

chromosomeStats(CAFE_data, chromNum = "ALL", samples = sam, test = "fisher", 
    bonferroni = TRUE, enrichment = "greater")
##      Chr1      Chr2      Chr3      Chr4      Chr5      Chr6      Chr7 
## 1.000e+00 1.000e+00 1.000e+00 1.021e-01 3.637e-01 1.000e+00 1.000e+00 
##      Chr8      Chr9     Chr10     Chr11     Chr12     Chr13     Chr14 
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00 3.809e-42 3.868e-01 1.000e+00 
##     Chr15     Chr16     Chr17     Chr18     Chr19     Chr20     Chr21 
## 1.000e+00 1.000e+00 8.777e-47 1.000e+00 1.000e+00 1.000e+00 1.000e+00 
##     Chr22 
## 1.000e+00

The results which we got might be all nice and well - there seems be something amiss with chromosome 17 - but preferably we would like to delve a bit deeper. We would like to know which chromosome bands are duplicated or lost. To do this we can use the same syntax as for chromosomes, except that we are using a different function:

bandStats(CAFE_data, chromNum = 17, samples = sam, test = "fisher", bonferroni = TRUE, 
    enrichment = "greater")  #multiple bands per chromosome, so need bonferroni!
##     17cen-q12       17p11.1       17p11.2   17p11.2-p12         17p12 
##     1.000e+00     1.000e+00     8.009e-17     1.000e+00     1.000e+00 
##       17p12.3   17p12-p11.2         17p13       17p13.1   17p13.1-p12 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##       17p13.2       17p13.3     17p13-p12    17pter-p11           17q 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##         17q11       17q11.1 17q11.1-q11.2       17q11.2   17q11.2-q12 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##     17q11-q12     17q11-q21   17q11-q21.3    17q11-qter         17q12 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##     17q12-q21   17q12-q21.1         17q21       17q21.1 17q21.1-q21.3 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##       17q21.2       17q21.3      17q21.31      17q21.32      17q21.33 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##   17q21.3-q22     17q21-q22     17q21-q23     17q21-q24    17q21-qter 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##         17q22       17q22.2     17q22-q23   17q22-q23.2         17q23 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##       17q23.1       17q23.2 17q23.2-q25.3       17q23.3     17q23-q24 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##         17q24       17q24.1       17q24.2       17q24.3     17q24-q25 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##         17q25       17q25.1       17q25.2 17q25.2-q25.3       17q25.3 
##     1.000e+00     1.000e+00     1.000e+00     1.000e+00     1.000e+00 
##        17qter 
##     1.000e+00

So this way we see that there is most likely a duplication around band 17p11.2

Plotting

So now we know that chromosome 17 for these two samples is aberrant, but we would like to plot that. A picture says more than a thousand words - as the saying goes. CAFE provides four functions for plotting samples, rawPlot(), slidPlot(), discontPlot() and facetPlot().

Raw plots

The rawPlot() function plots each individual probe along the chromosome with its ’raw’, unaltered, log2 relative expression value. This can give a very rough overview of what is happening. As a visual tool, an ideogram of the chromosome can be plotted over the plot. See figure 2 for an example.

rawPlot(CAFE_data, samples = c(1, 3, 10), chromNum = 17)
## Writing plot to Chr17Mul

plot of chunk unnamed-chunk-9

Sliding plots

The plots given by rawPlot() are often not very informative since the within-sample variation in expression levels is quite high. The slidPlot() function solves this problem by applying a sliding average to the entire sample. As such, patterns become visible that would otherwise have went unnoticed. The function has two extra arguments as rawPlot(): if combine=TRUE a raw plot will be plotted in the background. The size of the sliding window can be determined by argument k. See figure 3 for an example.

slidPlot(CAFE_data, samples = c(1, 3, 10), chromNum = 17, k = 100)
## Writing plot to Chr17slidMul

plot of chunk unnamed-chunk-10

Discontinuous plots

In reality, there can only be an integer number of copy numbers for a given region (be it an entire chromosome or a band). One cannot 2.5 times duplicate a region; no, it is 0, 1, 2, 3 times etc. Also, there should be a defined boundary where the duplication or loss begins and ends. Yet, functions like slidPlot() will give smooth transitions and variable regions, which is a relatively poor reflection of what is actually happening. As such, we need a discontinuous smoother rather than a sliding average smoother. A discontinuous smoother is a smoother which produces distinct "jumps" rather than smooth transitions. The discontPlot() function implements such a discontinuous smoother - called a Potts filter. The smoothness (i.e. amount of jumps) can be determined by setting parameter gamma. A higher gamma will result in a smoother graph, with less jumps. See figure 4 for an example.

discontPlot(CAFE_data, samples = c(1, 3, 10), chromNum = 17, gamma = 100)
## Writing plot to Chr17discMul

plot of chunk unnamed-chunk-11

Facet plots

With the facetPlot() function you can plot all chromosomes stitched together horizontally in one single splot. It includes options to use either a sliding average smoother. See figure 4 for an example..

facetPlot(CAFE_data, samples = c(1, 3, 10), slid = TRUE, combine = TRUE, k = 100)
## Writing plot to FacetSlidMul

plot of chunk unnamed-chunk-12

In a nutshell

As follows from the above, CAFE analysis basically boils down to just three steps

  1. data <- ProcessCels()

  2. xxxStats(data, ...)

  3. xxxPlot(data, ...)

For instance:

data <- ProcessCels()
chromosomeStats(data, samples = c(1, 3), chromNum = "ALL")
discontPlot(data, samples = c(1, 3), chromNum = "ALL", gamma = 100)

Behind the scenes

Normalization

Affymetrix CEL files are read in and normalized by the justRMA() function in the affy package. Absent probes are then found (by the mas5calls function), and are subsequently removed if absent in more than 20% of samples. After normalization is complete, relative expression values (to median) are calculated. The remove_method argument controls which duplicate probes are removed from the dataset. When remove_method=0, no duplicate probes will be removed. When remove_method=1, duplicate probes linking to the same gene will be removed such that each gene only links to one single probe, according to the following priorities:

  1. Probes with ..._at are preferably retained

  2. Probes with ...a_at

  3. Probes with ...n_at

  4. Probes with ...s_at

  5. And finally probes with ...x_at

When this still results in multiple probes per gene, the probe with the earliest chromosomal location is retained. When remove_method=2, duplicate probes are only removed when they link to the exact same location, using the same priority list as specified above.

Category testing

When using chromosomeStats() and bandStats() the dataset is eventually split into four categories:

  1. Whole dataset & On Chromosome (or band)

  2. Whole dataset & Overexpressed & On Chromosome (or band)

  3. Sample(s) & On Chromosome (or band)

  4. Sample(s) & Overexpressed & On Chromosome (or band)

The number of probes for each category are counted, and saved in a so-called contingency table. Then we test the likelihood that Sample(s) occurs within Whole via either the Exact Fisher test or the Chi Square test. The Exact Fisher test supplies exact p-values, whereas the Chi Square test supplies merely an approximation. However, the Exact Fisher test can be slower than a Chi Square test, and the Chi Square test increases in accuracy with increasing sample size.

"Overexpressed" is defined as those probes which were over the set threshold - the threshold.over argument in ProcessCels(). Likewise, "underexpressed" is defined as those probes which were under the set threshold - argument threshold.under. Please note that these thresholds are a defined as a fraction of median expression value for each probe.

Smoothers

Sliding average

With the moving average we attempt to the smooth the raw data so that patterns emerge. We have a set of points $(x_i,y_i), i=1 \dots  n$, which is ordered such that $x_1 \leq \dots \leq x_n$. Since we plotting chromosomal location versus log2 relative expression, $x_i$ refers to chromosomal locations and $y_i$ refers to log2 relative expressions. Since chromosomal locations are integers, $x_i \in \mathbb{N}:x_i \geq0$, and since log2 relative expressions can be any real number $y_i \in \mathbb{R}$. We have a sliding average windows, denoted $k$, which determines the smoothness. We are reconstructiong $y_i$ (the log2 relative expressions) to come to a smoother $\hat{y}_i$. For $y_i$ at least $k$-elements from the boundary, the smoother is calculated as in equation 1. For the remaining $\hat{y}_i$, where $i \in \mathbb{N}: 1 \leq i \leq k$ and $i \in \mathbb{N}: (n-k+1) \leq i \leq n$, we first define two variables $a$ and $b$, where $a = max(1,(i-k))$ and $b = min(n,(i+k))$. The remaining $\hat{y}_i$ are then calculated using equation 2.

Equation 1: $$\hat{y}_i =  \frac{1}{k} \cdot \sum\limits_{i=j}^{i+k} y_j , i \in \mathbb{N}: (k+1) \leq i \leq (n-k)$$

Equation 2: $$\hat{y}_i = \frac{\sum\limits_{i=a}^b y_i}{(a-b+1)}$$

Discontinuous smoother

The discontinuous smoother used in discontPlot(), is essentially a minimization problem. We again have a set of points $(x_i,y_i), i=1 \dots  n$. This set is ordered in such manner that $x_1 \leq \dots \leq x_n$. For our particular problem $x_i$ refers to chromosomal coordinates, such that $x_i \in \mathbb{N}:x_i \geq 0$. Conversely, $y_i$ refers to log2 relative expressions, such that $y_i \in \mathbb{R}$. We want to reconstruct $y_i$ in such a way that the reconstruction - called $\hat{y_i}$ - closely resembles $y_i$ but has as little jumps as possible. A jump is defined as $\hat{y_i} \neq \hat{y}_{i+1}$. The reconstruction $\hat{y}_i$ can be constructed by minimizing the following function:

$$H = \sum\limits_{i=1}^n ({y_i} - {\hat{y}_i})^2 + \gamma \cdot |(\hat{y}_i \neq \hat{y}_{i+1})|$$

In the above formula, || denotes cardinality. The first term in the formula denotes a least squares goodness-of-fit term. The second term determines how strict the formula is. A higher $\gamma$ will result in a flatter result, with fewer jumps, and this $\gamma$ can be any number over 0. The entire formula is called a Potts filter. The mathematics behind this formula is described in detail by Winkler et al [2].

The implementation in discontPlot() uses the algorithm descibed by Friedrich et al [3].

Non-Affymetrix data

Even though CAFE is designed to work only with Affymetrix .CEl files, it is possible to analyse non-affymetrix data via a workaround. For CAFE to work with non-affymetrix data, it is necessary you provide the correct data structure which we use to analyse our data. The major structure the software works with is a list of lists structure. The list contains three lists - named $whole, $over and $under - which both contain data frames for each and every sample using the following format:

  1. $ID, some identifier (probe IDs in affymetrix) (character vector)

  2. $Sym, gene symbol (character vector)

  3. $Value, log2 transformed expression value (numerical vector)

  4. $LogRel, log2 transformed relative expressions (as a function of mean) (numerical vector)

  5. $Loc, chromosomal locations in bp (integer vector)

  6. $Chr, chromosome number (character vector)

  7. $Band, cytoband (character vector)

For instance

str(CAFE_data$whole[[1]])
## 'data.frame':    10442 obs. of  8 variables:
##  $ ID    : chr  "1053_at" "121_at" "1316_at" "1487_at" ...
##  $ Sym   : chr  "RFC2" "PAX8" "THRA" "ESRRA" ...
##  $ Value : num  6.48 6.72 4.17 6.39 7.41 ...
##  $ LogRel: num  -2.077 -0.336 -0.435 -0.814 -1.672 ...
##  $ Loc   : int  73646002 113974938 38219184 64073043 43562239 26722694 90039000 196439228 27941782 34175215 ...
##  $ Chr   : chr  "7" "2" "17" "11" ...
##  $ Band  : chr  "7q11.23" "2q13" "17q11.2" "11q13" ...
##  $ Arm   : chr  "7q" "2q" "17q" "11q" ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:95] 44 53 54 55 56 57 187 194 206 220 ...
##   .. ..- attr(*, "names")= chr [1:95] "44" "53" "54" "55" ...

The $whole list should contain data for all probes, the $over list should only contain those probes which are deemed overexpressed (not required if one uses thresholding=FALSE). The order of probes and locations in $whole should be identical across all samples. Each list element (i.e. the dataframes) is named according to its sample name. For instance

print(names(CAFE_data$whole))
##  [1] "GSM151738.CEL" "GSM151739.CEL" "GSM151740.CEL" "GSM151741.CEL"
##  [5] "GSM272914.CEL" "GSM272915.CEL" "GSM272916.CEL" "GSM272917.CEL"
##  [9] "GSM272918.CEL" "GSM272919.CEL" "GSM272920.CEL" "GSM272921.CEL"

References

[1]: Uri Ben-David, Yoav Mayshar, and Nissim Benvenisty. Virtual karyotyping of pluripotent stem cells on the basis of their global gene expression profiles. Nature protocols, 8(5):989–997, 2013. http://dx.doi.org/10.1038/nprot.2013.051

[2]: G Winkler, Volkmar Liebscher, and V Aurich. Smoothers for Discontinuous Signals. Journal of Nonparametric Statistics, 14:203–222, 2002. http://dx.doi.org/10.1080/10485250211388

[3]: F Friedrich, a Kempe, V Liebscher, and G Winkler. Complexity Penalized M- Estimation. Journal of Computational and Graphical Statistics, 17(1):201– 224, March 2008. http://dx.doi.org/10.1198/106186008X285591

Updated