

\author{Sean Davis}
\title{Working with Affymetrix Expression Array Data}



Our tasks in this tutorial are to:

\begin{itemize}
  \item{Load Affymetrix data from raw .CEL files}
  \item{Use exploratory data analysis to examine the raw data}
  \item{Process the array data to produce normalized data that can be used for further analysis (for example, differential expression)}
  \item{Examine the effects of different normalization approaches}
\end{itemize}
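As a rough preview of how these four tasks map onto R code, the following sketch strings them together using functions from the \texttt{affy} package.  It assumes that both \texttt{affy} and the \texttt{CheungSubsetCelData} package are already installed; the choice of \texttt{rma} and \texttt{mas5} as the two normalization approaches is an assumption made for illustration only.

\begin{verbatim}
library(affy)

# Task 1: load the raw .CEL files shipped in the data package
celfilepath = system.file('extdata', package='CheungSubsetCelData')
abatch = ReadAffy(celfile.path=celfilepath)

# Task 2: exploratory plots of the raw probe-level intensities
boxplot(abatch)
hist(abatch)

# Task 3: background-correct, normalize, and summarize with RMA
eset.rma = rma(abatch)

# Task 4: compare with a second approach (MAS5 values are not logged)
eset.mas5 = mas5(abatch)
plot(exprs(eset.rma)[,1], log2(exprs(eset.mas5)[,1]),
     xlab='RMA (log2)', ylab='MAS5 (log2)')
\end{verbatim}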

We will load Affymetrix .CEL files from a 2004 paper, described by the following abstract:

\begin{quote}
  Natural variation in gene expression is extensive in humans and other organisms, and variation in the baseline expression level of many genes has a heritable component. To localize the genetic determinants of these quantitative traits (expression phenotypes) in humans, we used microarrays to measure gene expression levels and performed genome-wide linkage analysis for expression levels of 3,554 genes in 14 large families. For approximately 1,000 expression phenotypes, there was significant evidence of linkage to specific chromosomal regions. Both cis- and trans-acting loci regulate variation in the expression levels of genes, although most act in trans. Many gene expression phenotypes are influenced by several genetic determinants. Furthermore, we found hotspots of transcriptional regulation where significant evidence of linkage for several expression phenotypes (up to 31) coincides, and expression levels of many genes that share the same regulatory region are significantly correlated. The combination of microarray techniques for phenotyping and linkage analysis for quantitative traits allows the genetic mapping of determinants that contribute to variation in human gene expression. \cite{morley}
\end{quote}

\subsection{Installation of the CheungSubsetCelData Package}

This tutorial package can be installed into R using the following command (paste it into R):
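One way to install a data package such as this one, assuming you have downloaded it as a source tarball, is sketched below.  The file name is a hypothetical placeholder and should be replaced with the actual location of your copy.

\begin{verbatim}
# Hypothetical path: replace with the location of your copy of the package
pkgfile = 'CheungSubsetCelData.tar.gz'

# repos=NULL tells R to install from the local file, not a repository
install.packages(pkgfile, repos=NULL, type='source')
\end{verbatim}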


\section{Getting Started}

We will use one of several packages for working with Affymetrix data, the \texttt{affy} package.  This package supports the 3'-biased arrays that we will use here.  Other packages dealing with Affymetrix arrays can be found at this url: \url{}.

\subsection{Install the affy package}

Affymetrix data are stored on disk as one file per sample, in a format called .CEL.  This format can be either binary or text.  Thankfully, there is a Bioconductor package, the \texttt{affy} package \cite{gautier}, that knows how to read .CEL files.  We will install the \texttt{affy} package as a first step.
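A minimal sketch of the installation, assuming a current Bioconductor release in which packages are installed through the \texttt{BiocManager} package (older releases used a different mechanism):

\begin{verbatim}
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace('BiocManager', quietly=TRUE))
    install.packages('BiocManager')

# Install the affy package from Bioconductor
BiocManager::install('affy')
\end{verbatim}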


Remember that if you have not installed Bioconductor base packages first, the above command may fail.  If that happens, head back to the Bioconductor website and start there.


To get an overview of the affy package, use:
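For example, either of the following standard R commands should work once the package is installed.  The first lists the documented functions and data sets; the second opens the package vignettes.

\begin{verbatim}
library(affy)

# Index of help pages for the package
help(package='affy')

# Longer-form documentation (vignettes), opened in a browser
browseVignettes('affy')
\end{verbatim}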


\subsection{Finding the .CEL files}

The .CEL files for this tutorial are stored in the ``CheungSubsetCelData'' package.  Packages like this one, which contain mainly data rather than functionality, are called ``data packages'' and are a convenient way to manage and distribute data in R.

Once an R package is installed, R can find files in the package using the \texttt{system.file()} function.  The R help system describes \texttt{system.file} in more detail; here, it returns the directory that stores the .CEL files:

\begin{verbatim}
celfilepath = system.file('extdata', package='CheungSubsetCelData')
\end{verbatim}

The \texttt{celfilepath} is the directory on the disk where the .CEL files are located.  Use the \texttt{list.celfiles} function to list the .CEL files in that directory.
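A short sketch of the listing step, assuming the \texttt{affy} and \texttt{CheungSubsetCelData} packages are installed.  Because \texttt{list.celfiles} passes its arguments on to \texttt{list.files}, the \texttt{full.names} argument can be used to get complete paths.

\begin{verbatim}
library(affy)
celfilepath = system.file('extdata', package='CheungSubsetCelData')

# File names only (relative to celfilepath)
list.celfiles(celfilepath)

# Full paths, which can be passed directly to file-reading functions
list.celfiles(celfilepath, full.names=TRUE)
\end{verbatim}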


Read the help page for the \texttt{ReadAffy} function and apply it to the .CEL files in the \texttt{celfilepath} location.

\begin{verbatim}
abatch = ReadAffy(filenames=list.celfiles(celfilepath),
                  celfile.path=celfilepath)
\end{verbatim}

To interpret the array features, the affy package needs access to the ``chip description file'' or CDF.  Bioconductor has data packages available for many array types that encapsulate this design information.  The first time the affy package needs information from the CDF data package, it will attempt to download and install that package automatically.


The \texttt{annotation} accessor function is used to determine the array type after loading affy data into R.  In this case, the array type is the ``hgfocus'' array.
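As an illustration, the sketch below loads the data and then queries the array type.  It assumes the \texttt{affy} and \texttt{CheungSubsetCelData} packages are installed and that the matching CDF package can be installed on first use.

\begin{verbatim}
library(affy)
celfilepath = system.file('extdata', package='CheungSubsetCelData')
abatch = ReadAffy(celfile.path=celfilepath)

# Print a short description of the AffyBatch object
abatch

# Array type recorded in the object
annotation(abatch)

# Name of the CDF used to interpret the probes
cdfName(abatch)
\end{verbatim}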

\section{Exploratory Data Analysis}

The first step after loading microarray data is to do some data exploration.  

\textbf{Exercise 1}:  Read the help for ``AffyBatch'' to find an accessor to give you the ``pm'' (perfect match) intensities.  Make a text summary of the resulting matrix.
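One possible approach, offered as a sketch rather than the definitive answer: the \texttt{pm} accessor returns a matrix with one row per perfect-match probe and one column per array, and \texttt{summary} reports quantiles for each array.  The sketch assumes the \texttt{affy} and \texttt{CheungSubsetCelData} packages are installed.

\begin{verbatim}
library(affy)
celfilepath = system.file('extdata', package='CheungSubsetCelData')
abatch = ReadAffy(celfile.path=celfilepath)

# Matrix of perfect-match intensities: probes in rows, arrays in columns
pmmat = pm(abatch)
dim(pmmat)

# Per-array text summary (min, quartiles, mean, max)
summary(pmmat)
\end{verbatim}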



\textbf{Exercise 2}:  Create a matrix of the \texttt{pm} intensities.  Then, make a histogram of the intensities for the first array.  Is this plot useful to you?  How could you improve it?  (Hint: you might want to transform the data before plotting).

\begin{verbatim}
pmmat = pm(abatch)
\end{verbatim}
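One way to approach this exercise is sketched below.  Raw intensities are strongly right-skewed, so a histogram on the raw scale is dominated by a few bins near zero; a log2 transform spreads the values out.  So that the sketch also runs without the .CEL files, it falls back to simulated intensities when \texttt{pmmat} does not exist; that stand-in matrix is an assumption for illustration and is not the tutorial data.

\begin{verbatim}
# Stand-in data (simulated, right-skewed) used only if pmmat is missing
if (!exists('pmmat')) {
    set.seed(1)
    pmmat = matrix(2^rnorm(4000, mean=8, sd=1.5), ncol=4)
}

# Raw scale: most probes pile up at the low end
hist(pmmat[,1], breaks=100, main='Raw PM intensities',
     xlab='Intensity')

# log2 scale: the shape of the distribution is much easier to read
hist(log2(pmmat[,1]), breaks=100, main='log2 PM intensities',
     xlab='log2(Intensity)')
\end{verbatim}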

\textbf{Exercise 3}:  Instead of a histogram, make a density plot.  You may want to apply the same trick you applied above.

\begin{verbatim}
pmmat = pm(abatch)
\end{verbatim}
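A possible solution sketch, using the base R \texttt{density} function on log2-transformed intensities.  Density curves have the advantage that several arrays can be overlaid on one plot.  As a fallback for running the code without the .CEL files, simulated intensities are used when \texttt{pmmat} does not exist; that stand-in matrix is an assumption for illustration only.

\begin{verbatim}
# Stand-in data (simulated, right-skewed) used only if pmmat is missing
if (!exists('pmmat')) {
    set.seed(1)
    pmmat = matrix(2^rnorm(4000, mean=8, sd=1.5), ncol=4)
}

# Density of the log2 intensities for the first array
d = density(log2(pmmat[,1]))
plot(d, main='log2 PM intensities', xlab='log2(Intensity)')

# Overlay the remaining arrays to compare distributions across samples
for (i in seq_len(ncol(pmmat))[-1]) {
    lines(density(log2(pmmat[,i])), col=i)
}
\end{verbatim}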

\textbf{Exercise 4}:  


\begin{thebibliography}{9}

\bibitem{morley} Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG.
Genetic analysis of genome-wide variation in human gene expression. Nature. 2004
Aug 12;430(7001):743-7. PubMed PMID: 15269782.

\bibitem{gautier} Gautier L, Cope L, Bolstad BM, Irizarry RA.
affy---analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004
Feb;20(3):307-15.

\end{thebibliography}