\documentclass{article}

\usepackage{natbib}
\usepackage{amsmath}


\begin{document}

\title{Statistics in Biology}

\author{Casey Dunn \\
\small Department of Ecology and Evolutionary Biology\\[-0.8ex]
\small Brown University, Providence, RI USA\\
\small casey\_dunn@brown.edu}


\maketitle

%\maketitleinst


% Teaching statistics for biologists can be approached in a variety of ways. One approach is to build intuition based on example applications, without necessarily starting with first principles. Another approach is to start with first principles, and from these develop a formal understanding of the methods that can lead to intuition of how they behave with actual data sets.
% Many biologists now routinely interpret and apply extremely sophisticated statistical analyses that are far outside the scope of most biostatistics courses and textbooks. Without understanding how these sophisticated tools relate to the basic statistics they already know, it can be extremely difficult to build intuition, relate different findings to each other, or have any sense of the problems that may be encountered. 
% This document serves as a quick-reference that pulls together basic information on the statistics that biologists will now regularly encounter in the literature and in the software tools relevant to their analyses, from basic first-year applied stats that have been used in biology since their inception to more complicated approaches that have only recently become widespread. No attempt is made to be comprehensive or argue from first principles, this document is at its core a field guide. But by collecting diverse information in one place it can serve as a course roadmap for making the types of connections you will need between these tools, first principles, and other methods not covered here.

\section{Descriptive statistics}

Parameters describe the actual population we want to know about. In most cases, we don't know the real parameter values, and instead make estimates based on observations of samples drawn from the full population. These estimates are an approximation of the true population parameters, and much of statistics is concerned with figuring out how good parameter estimates are. This allows us to test hypotheses about the parameters, as well as compare estimates to each other.

\subsection{Descriptive statistics for a single variable}

Many times we just want to summarize the distribution of a single variable, which we will here call $X$.

The arithmetic mean $\mu$, a parameter, is an average of all values in a population:

\begin{equation}
\mu= \frac{\displaystyle\sum\limits_{i=1}^n X_i}{n}
\end{equation}

Where the total population has size $n$ and the value of each item $i$ in the population is $X_i$. The population mean is the expected value of an observation drawn at random from the population.

The sample arithmetic mean $\bar{X}$, an estimate, is the average of all values in a sample drawn from the population:

\begin{equation}
\bar{X}= \frac{\displaystyle\sum\limits_{i=1}^n X_i}{n}
\end{equation}

Where the sample size is $n$ and the value of each item $i$ in the sample is $X_i$. As the sample size approaches the population size, $\bar{X}$ approaches $\mu$.

The variance $\sigma^2$, a parameter, gives an indication of how far values in the population deviate from the mean:

\begin{equation}
\sigma^2= \frac{\displaystyle\sum\limits_{i=1}^n (X_i-\mu)^2}{n}
\end{equation}


The estimate of variance $s^2$ is calculated from a sample of the population as:

\begin{equation}
s^2= \frac{\displaystyle\sum\limits_{i=1}^n (X_i-\bar{X})^2}{n-1}
\end{equation}

The $n-1$ in the denominator of the estimate accounts for the fact that the deviations are measured from the sample mean $\bar{X}$ rather than the true population mean $\mu$, which leads to an underestimate of the squared deviations. This underestimation decreases as the sample size grows. See Appendix 2 of \cite{Grafen:2002vr} for an excellent explanation of why $n-1$ results in an unbiased estimate of the variance.


The standard deviation, which is in the same units as the original measurements, is simply the square root of the variance:

\begin{equation}
\sigma= \sqrt{\sigma^2}
\end{equation}


\begin{equation}
s= \sqrt{s^2}
\end{equation}
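
The following sketch computes these estimates by hand and checks them against NumPy's built-in functions. Python and NumPy are used here purely for illustration (the data values are invented); NumPy's ddof argument switches between the $n$ and $n-1$ denominators.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented data) of the sample mean,
# variance, and standard deviation defined above.
import numpy as np

X = np.array([4.1, 5.3, 3.8, 6.0, 5.2, 4.7])    # a hypothetical sample

n = len(X)
x_bar = X.sum() / n                        # sample mean
s2 = ((X - x_bar) ** 2).sum() / (n - 1)    # sample variance, n - 1 denominator
s = np.sqrt(s2)                            # sample standard deviation

# NumPy's ddof argument switches between the n (population) and
# n - 1 (sample) denominators.
assert np.isclose(s2, X.var(ddof=1))
assert np.isclose(s, X.std(ddof=1))
\end{verbatim}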


\subsection{Descriptive statistics for two variables}

In the above examples, we were summarizing a single variable across a population or a sample drawn from the population. We often want to measure multiple variables, and describe potential associations between them.

If the variables $X$ and $Y$ are both measured for all individuals in a population, their covariance is:


\begin{equation}
cov(X, Y)= \frac{\displaystyle\sum\limits_{i=1}^n (X_i-\mu_X)(Y_i-\mu_Y)}{n}
\end{equation}


If the variables $X$ and $Y$ are both measured for individuals in a sample, their covariance is:


\begin{equation}
cov(X, Y)= \frac{\displaystyle\sum\limits_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})}{n-1}
\end{equation}

The covariance gives an indication of the degree to which the values of two variables are associated. Variance is a special case of covariance: it is the covariance of a variable with itself. Covariance can be positive (an increase in the value of $X$ is associated with an increase in the value of $Y$), 0 (there is no association between $X$ and $Y$), or negative (an increase in the value of $X$ is associated with a decrease in the value of $Y$). There is no limit to its magnitude: covariances can range from $-\infty$ to $\infty$.

Because the different variables may have different magnitudes and units, a covariance alone can be difficult to interpret. One way around this is to normalize the covariance by the standard deviation of each variable. This results in a unit-less correlation coefficient. For a population, the correlation coefficient $\rho$ is:

\begin{equation}
\rho = \frac{cov(X, Y)}{\sigma_X\sigma_Y}
\end{equation}


Note that the population size $n$ also cancels out in the correlation coefficient, since the covariance has an $n$ in the denominator and the standard deviations each have $\sqrt{n}$ in their denominator.

For a sample, the estimate $r$ of the correlation coefficient is:


\begin{equation}
r = \frac{cov(X, Y)}{s_X s_Y}=\frac{\displaystyle\sum\limits_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\displaystyle\sum\limits_{i=1}^n (X_i-\bar{X})^2} \sqrt{\displaystyle\sum\limits_{i=1}^n (Y_i-\bar{Y})^2}}=\frac{1}{n-1}\displaystyle\sum\limits_{i=1}^n (\frac{X_i-\bar{X}}{s_X})(\frac{Y_i-\bar{Y}}{s_Y})
\end{equation}

The correlation coefficient has a few nice properties that make it easier than the covariance to interpret when you don't know much else about the sample. For one, it varies from -1 to 1, where 1 is perfect positive association between the variables and -1 is perfect negative association between the variables. Independent variables will be uncorrelated. Dependent variables, however, may also be uncorrelated, since correlation only measures linear dependence.
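
As a concrete illustration, the sketch below (Python/NumPy, invented data) computes the covariance, normalizes it by the two standard deviations, and confirms that the result matches NumPy's correlation function.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented data) relating covariance
# and correlation.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=50)
Y = 2.0 * X + rng.normal(size=50)     # Y is linearly associated with X

C = np.cov(X, Y)                      # 2 x 2 covariance matrix, n - 1 denominator
cov_xy = C[0, 1]
r = cov_xy / (X.std(ddof=1) * Y.std(ddof=1))   # normalize by standard deviations

# np.corrcoef applies the same normalization internally.
assert np.isclose(r, np.corrcoef(X, Y)[0, 1])
\end{verbatim}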

Another quantity that is closely related to the covariance and correlation coefficient is the least-squares linear regression slope, $\beta$. Whereas covariance and correlation describe association between variables that are not treated differently from each other in any way, linear regression describes the expected change in one variable relative to a change in the other. The expected change in $Y$ given $X$ is described by $\beta_{Y,X}$, while the expected change in $X$ given $Y$ is described by $\beta_{X,Y}$.

If $Y$ values are plotted against their corresponding $X$ values, $\beta_{Y,X}$ is the slope of the line drawn through the points such that the squared $Y$ distances from the points to the line are minimized. This is often how the linear regression is calculated. But it can also be related to the covariance and variance in the following way:

\begin{equation}
\beta_{Y,X} = \frac{cov(X, Y)}{\sigma_X^2}
\end{equation}


\begin{equation}
\beta_{X,Y} = \frac{cov(X, Y)}{\sigma_Y^2}
\end{equation}

Note that the linear regressions are obtained in a very similar way to the correlation coefficient. The denominator of the linear regression contains the variance of the explanatory variable that is being used to predict the dependent variable, while the denominator of the correlation contains the product of the square roots of the variances (i.e., the standard deviations) of the two variables. From this relationship, it can also be seen that:

\begin{equation}
cov(X, Y)= \beta_{Y,X}{\sigma_X^2}=\beta_{X,Y}{\sigma_Y^2}
\end{equation}

$\beta$ isn't going to be very helpful for predicting the value of the dependent variable in terms of the explanatory variable if there is no linear relationship between them. This gets to another important relationship between correlation and linear regression. The square of the correlation coefficient, $r^2$, provides an indication of how linear the relationship is. If $r^2=0$, then there is no linear relationship and  $\beta$ won't explain one variable in terms of the other. If $r^2=1$, then the relationship is perfectly linear and all variation in each variable can be explained by the variation in the other variable. Equivalently, $r^2$ provides an indication of how close the data points are to the regression line, with higher values indicating that they are closer.

There is often considerable confusion when it comes to applying correlations versus linear regressions \citep{Twomey:2008en}. In general, remember that correlation coefficients provide an indication of the association between the variables, while linear regression determines the linear relationship (if there is one) between the variables. Like correlation, covariance provides an indication of the association between variables, but it is also impacted by the units and magnitude of the variables. If, for example, two variables are perfectly correlated ($r=1$), but the magnitude of $X$ varies much more than the magnitude of $Y$, then $X$ will have a greater impact on the covariance.
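
The sketch below (Python/NumPy, invented data) checks two of the identities above: the regression slope equals the covariance divided by the variance of the explanatory variable, and the product of the two regression slopes equals $r^2$.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented data) of the identities
# beta_{Y,X} = cov(X, Y) / var(X) and beta_{Y,X} * beta_{X,Y} = r^2.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=100)
Y = 0.5 * X + rng.normal(scale=0.3, size=100)

cov_xy = np.cov(X, Y)[0, 1]
beta_yx = cov_xy / X.var(ddof=1)          # slope of Y regressed on X
beta_xy = cov_xy / Y.var(ddof=1)          # slope of X regressed on Y

slope, intercept = np.polyfit(X, Y, 1)    # least-squares fit of Y on X
assert np.isclose(beta_yx, slope)

r = np.corrcoef(X, Y)[0, 1]
assert np.isclose(beta_yx * beta_xy, r ** 2)
\end{verbatim}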


\subsection{Multivariate descriptive statistics}

Covariances, correlations, and linear regressions can all be generalized to more than two variables. Let's begin with an examination of the covariance matrix. Take the column vector $\textbf{X}$ of random scalar variables $X_1 ... X_n$. Each $X_i$ has its own mean and variance. The covariance matrix $\Sigma$ has entries $i, j$ that describe the covariance between $X_i$ and $X_j$:

\begin{equation}
\Sigma_{i, j} = cov(X_i, X_j)
\end{equation}

The diagonal of the covariance matrix contains the variance of each element of $\textbf{X}$, as the variance of a variable is its covariance with itself.

Since $cov(X_i, X_j)=cov(X_j, X_i)$, the covariance matrix $\Sigma$ is symmetric.

The elements $i, j$ of the correlation matrix can be readily derived from the covariance matrix as $\Sigma_{i, j}/(\sigma_i \sigma_j)$. The elements of the diagonal of the correlation matrix are all 1, since they are the variance of element $i$ divided by the variance of element $i$. This makes sense since $X_i$ is perfectly correlated with itself as long as its variance is nonzero. % The R function cov2cor() converts a covariance matrix to a correlation matrix

The covariance matrix can also be used to compute the linear regression of each variable against a particular variable $X_k$. This is done by simply dividing the entire covariance matrix by the variance $\sigma^2_k$; the elements of column $k$ and row $k$ are then the regression slopes of each variable against $X_k$. Diagonal element $k$ will be 1, since it is $\sigma^2_k$ divided by $\sigma^2_k$. This makes sense, since the linear regression of $X_k$ against itself has slope 1. % Need to look into multiple regression a bit more, are the other rows and columns the residuals after eliminating the effect of variable k? See pages 59-61 of Grafen:2002vr
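
The sketch below (Python/NumPy, invented data) builds a covariance matrix, converts it to a correlation matrix (the same operation as R's cov2cor()), and recovers the regression slopes against one variable by dividing by its variance.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented data) of operations on the
# covariance matrix.
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(200, 3))
data[:, 1] += 0.8 * data[:, 0]            # induce some covariance

Sigma = np.cov(data, rowvar=False)        # 3 x 3 covariance matrix
sd = np.sqrt(np.diag(Sigma))              # standard deviation of each variable

R = Sigma / np.outer(sd, sd)              # correlation matrix; diagonal is 1
assert np.allclose(R, np.corrcoef(data, rowvar=False))

k = 0
betas = Sigma[:, k] / Sigma[k, k]         # slope of each variable regressed on X_k
assert np.isclose(betas[k], 1.0)
\end{verbatim}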

One of the most popular summaries of multivariate data is Principal Component Analysis (PCA), which is a particular manipulation of the covariance matrix.
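
One common way to carry out that manipulation is an eigendecomposition of the covariance matrix, as in the sketch below (Python/NumPy, invented data); the eigenvectors give the principal axes and the eigenvalues give the variance explained by each component.

\begin{verbatim}
# A minimal sketch of PCA as an eigendecomposition of the covariance
# matrix (Python/NumPy, invented data).
import numpy as np

rng = np.random.default_rng(4)
data = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

Sigma = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)   # eigh: symmetric matrices

# Order components by decreasing variance explained.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the centered data onto the principal axes (the "scores").
scores = (data - data.mean(axis=0)) @ eigenvectors

# The variance of each column of scores equals the corresponding eigenvalue.
assert np.allclose(scores.var(axis=0, ddof=1), eigenvalues)
\end{verbatim}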



\section{Probability distributions}

A probability density function indicates the relative likelihood of a variable taking on a particular value. The probability of the variable taking on a value in a particular range is given by the integral of the probability density function over that range. The total area under the probability density function is 1, i.e. it encompasses all potential outcomes.
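
The sketch below (Python/SciPy) illustrates both points by numerically integrating a density; the standard normal is used only as a convenient example.

\begin{verbatim}
# A minimal sketch (Python/SciPy): probabilities of ranges are integrals
# of the density, and the total area under the density is 1.
import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.norm(loc=0, scale=1)            # a standard normal, as an example

total, _ = quad(dist.pdf, -np.inf, np.inf)   # total area under the density
prob, _ = quad(dist.pdf, -1, 1)              # probability of a value in [-1, 1]

assert np.isclose(total, 1.0)
assert np.isclose(prob, dist.cdf(1) - dist.cdf(-1))
\end{verbatim}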

All of the probability density functions presented here, with the exception of the uniform distribution and the $t$-distribution, are exponential family distributions. This class of distributions is central to many aspects of statistics, and you will see them in many different settings.

These distributions are used in a variety of distinct ways, and it is always important to have a clear idea of how they are being applied. Distinct types of applications include:

\begin{itemize}
\item A description of the distribution of actual data (such as the use of a Poisson distribution to model the number of jellyfish collected in multiple net trawls).

\item Based on first principles, we expect particular statistics to fit the distribution (such as the use of the normal distribution to explain the distribution of means of repeated samples from a population).

\item They have a convenient shape (such as the use of a particular distribution as a Bayesian prior, even if there is no biological reason that the quantity being modeled should follow that particular distribution).

\end{itemize}




\subsection{Normal distribution}

This distribution is extremely important both because it fits the observed values of many kinds of biological data, and because it approximates the distribution of the means of samples drawn from \emph{any} distribution (see the central limit theorem below).


\begin{equation}
f(X; \mu, \sigma)= \frac{1}{\sqrt{2 \pi \sigma^2}}e^{\frac{-(X-\mu)^2}{2 \sigma^2}}
\end{equation}


To facilitate work with the normal distribution, values are often Z-standardized before analysis:

\begin{equation}
Z_i = \frac{X_i - \mu}{\sigma}
\end{equation}


This centers the distribution so that the mean is 0 and scales it so that the variance is 1. This standardized normal distribution is sometimes called the $z$ distribution.
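
The sketch below (Python/NumPy, invented data) standardizes a sample; in practice the sample estimates $\bar{X}$ and $s$ usually stand in for the population parameters $\mu$ and $\sigma$.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented data) of Z-standardization.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=2.5, size=1000)

Z = (X - X.mean()) / X.std(ddof=1)     # center and scale using sample estimates

assert np.isclose(Z.mean(), 0.0)       # standardized mean is 0
assert np.isclose(Z.std(ddof=1), 1.0)  # standardized variance is 1
\end{verbatim}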


\subsection{Exponential distribution}

Imagine an event that happens at rate $\lambda$, and let's define the sojourn time as the interval between adjacent events. The sojourn times will be distributed according to an exponential distribution:

\begin{equation}
f(t; \lambda) = \lambda e^{- \lambda t} 
\end{equation}


\subsection{Gamma distribution}

Imagine that we don't want to know the density of sojourn times, but instead the density of times until the event has occurred exactly $k$ times. This is given by the Gamma distribution:

\begin{equation}
f(t; \lambda, k) = \frac{\lambda ^ k}{\Gamma(k)} t^{k - 1} e^{- \lambda t} 
\end{equation}

Where $\Gamma(k)$ is the Gamma function, not to be confused with the Gamma distribution. For positive integers, $\Gamma(k) = (k - 1)!$.
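
The sketch below (Python/NumPy/SciPy, invented parameter values) simulates this directly: sums of $k$ exponential sojourn times have the mean and variance of a Gamma distribution with parameters $k$ and $\lambda$ (SciPy parameterizes the Gamma with a scale of $1/\lambda$).

\begin{verbatim}
# A minimal sketch (Python/NumPy/SciPy): waiting times until the k-th
# event are sums of k exponential sojourn times and follow a Gamma
# distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam, k = 2.0, 3                                   # event rate and event count

sojourns = rng.exponential(scale=1 / lam, size=(100_000, k))
waits = sojourns.sum(axis=1)                      # time until the k-th event

gamma = stats.gamma(a=k, scale=1 / lam)
print(waits.mean(), gamma.mean())                 # both approximately k / lambda
print(waits.var(ddof=1), gamma.var())             # both approximately k / lambda^2
\end{verbatim}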


\subsection{Poisson distribution}

Now imagine that you are going to make observations for a particular time $T$, and you want to know how many events $k$ to expect in that interval. This is given by the Poisson distribution, where $\lambda$ is now the expected number of events in the interval (the event rate multiplied by $T$):


\begin{equation}
f(k; \lambda) = \frac{\lambda ^ k e^{-\lambda}}{k!} 
\end{equation}

The Poisson distribution has the very useful property that $\mu=\sigma^2=\lambda$. If a sample has a variance greater than the mean ($s^2>\bar{X}$), then the values are more clumped than would be expected under a Poisson distribution. If a sample has a variance less than the mean ($s^2<\bar{X}$), then the values are more evenly dispersed than expected from a Poisson distribution.
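
The sketch below (Python/NumPy, invented parameter values) draws a Poisson sample and checks that the mean and variance are both close to $\lambda$; the variance-to-mean ratio is a quick screen for overdispersion.

\begin{verbatim}
# A minimal sketch (Python/NumPy): for a Poisson sample the mean and
# variance should both be close to lambda.
import numpy as np

rng = np.random.default_rng(7)
lam = 4.0
counts = rng.poisson(lam=lam, size=100_000)   # e.g. events per observation window

print(counts.mean(), counts.var(ddof=1))      # both approximately lambda

# A variance-to-mean ratio well above 1 suggests clumping (overdispersion)
# relative to the Poisson expectation.
print(counts.var(ddof=1) / counts.mean())
\end{verbatim}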

\subsection{Binomial distribution}

When a measurement can only have two outcomes (e.g. success or failure), the binomial distribution describes the probability of $k$ successes in $n$ independent trials, given the probability $p$ of success for any one trial:

\begin{equation}
f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}
\end{equation}

\subsection{Negative binomial distribution}

The negative binomial gives the distribution of the number $k$ of successes that occur before a specified number $r$ of failures has occurred, given the probability $p$ of a success in any one trial:

\begin{equation}
f(k; r, p) = \binom{k + r - 1}{k} p^k (1-p)^r
\end{equation}



The mean and variance are:

\begin{equation}
\mu = \frac{pr}{1-p}
\end{equation}

\begin{equation}
\sigma^2 = \frac{pr}{(1-p)^2}
\end{equation}

%This allows us, in turn, to reparameterize the negative binomial as follows:
%
%\begin{equation}
%\mu = \lambda = p\frac{r}{1-p} \Rightarrow (1-p)\lambda = \lambda - \lambda p = rp \Rightarrow \lambda = (r+\lambda)p \Rightarrow p = \frac{\lambda}{r + \lambda}
%\end{equation}
%
%If we plug this back into the original definition, we get:
%
%\begin{equation}
%f(k; r, \lambda) = \binom{k + r - 1}{k} \left(\frac{\lambda}{r + \lambda}\right)^k \left(1-\frac{\lambda}{r + \lambda}\right)^r
%\end{equation}



% Using a different set of parameter names than Cook to be consistent with other distributions here
%	Here	Cook	Desc
%	p	1-p		Probability of success
%	r	x		failures
%	k	r		successes


The negative binomial is an appropriate alternative to the Poisson distribution when the sample is overdispersed, i.e. $s^2 > \bar{X}$. In fact, the variance is greater than the mean for a negative binomial for all non-zero values of $p$. This can be seen by rewriting the variance in terms of the mean:

\begin{equation}
\sigma^2 = \frac{pr}{(1-p)^2} = \frac{\mu}{1-p}
\end{equation}

The negative binomial distribution converges on the Poisson distribution as $r \to \infty$ while the mean is held constant (which requires that $p \to 0$). For a given mean, the smaller the value of $r$, the greater the variance.

There are alternative parameterizations to the negative binomial. Rather than specify $r$ and $p$, as in the above parameterization, the mean $\mu$ and a dispersion parameter $\phi$ can be specified. The dispersion parameter $\phi=1/r$. 

\begin{equation}
f(k; \mu, \phi) = \frac{\Gamma(k+\phi^{-1})}{\Gamma(\phi^{-1})\Gamma(k+1)}\left(\frac{1}{1+\mu \phi}\right)^{\phi^{-1}} \left(  \frac{\mu}{\phi^{-1}+\mu}\right)^k
\end{equation}

In this case, the variance $\sigma^2$ is given by:

\begin{equation}
\sigma^2 = \mu + \phi \mu^2
\end{equation}
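
The sketch below (Python/NumPy, invented parameter values) draws from a negative binomial specified by $\mu$ and $\phi$ and checks the variance relationship. NumPy parameterizes the distribution with the roles of success and failure swapped relative to the text, so the mapping back to $(r, p)$ is included explicitly.

\begin{verbatim}
# A minimal sketch (Python/NumPy) of the mean/dispersion parameterization:
# draws with mean mu and dispersion phi should have variance close to
# mu + phi * mu^2.
import numpy as np

rng = np.random.default_rng(8)
mu, phi = 10.0, 0.4

# Map (mu, phi) back to the (r, p) parameterization used above:
r = 1 / phi
p = mu / (mu + r)

# NumPy counts "failures" before r "successes", with success probability
# 1 - p under the conventions used in the text.
k = rng.negative_binomial(n=r, p=1 - p, size=200_000)

print(k.mean(), mu)                        # approximately mu
print(k.var(ddof=1), mu + phi * mu ** 2)   # approximately mu + phi * mu^2
\end{verbatim}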

The negative binomial is used, among other things, to model gene counts in RNA-seq data \citep{Robinson:2009cw}: 


\begin{equation}
Y_{gi} \sim NB(M_{i}p_{gj}, \phi_g)
\end{equation}

where $Y_{gi}$ is the number of counts for gene $g$ in sample $i$, $M_i$ is the total number of counts for sample $i$, $p_{gj}$ is the fraction of counts for gene $g$ in treatment $j$ (to which sample $i$ belongs), and $\phi_g$ is the dispersion for gene $g$, so that $\mu_{gi}=M_i p_{gj}$. The dispersion can be interpreted as the square of the coefficient of biological variation for gene $g$ across samples. Since $r$ and $\phi$ are inversely related, as $\phi_g$ decreases $r$ approaches infinity and the distribution approaches a Poisson distribution. When there is no biological variation in gene counts between samples, $\phi_g$ is 0 and all technical variation is accommodated by the Poisson \citep{Robinson:2009cw}. In practice, $\phi_g$ is not calculated independently for each gene; instead, a common $\phi$ is calculated across all genes, or information is shared across genes when calculating each $\phi_g$.
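
As a purely hypothetical illustration of this model (not the estimation procedure used by any particular package), the sketch below (Python/NumPy, invented library sizes, gene fraction, and dispersion) draws counts for one gene across four samples.

\begin{verbatim}
# A hypothetical sketch (Python/NumPy, invented numbers) of drawing counts
# for one gene under Y_gi ~ NB(mean = M_i * p_gj, dispersion = phi_g).
import numpy as np

rng = np.random.default_rng(9)

M = np.array([1.0e6, 1.5e6, 0.8e6, 1.2e6])   # library size of each sample
p_g = 2.0e-5                                  # fraction of counts for this gene
phi_g = 0.1                                   # dispersion for this gene

mu = M * p_g                                  # expected count in each sample
r = 1 / phi_g
counts = rng.negative_binomial(n=r, p=r / (mu + r))

print(mu)       # expected counts
print(counts)   # simulated counts
\end{verbatim}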


\subsection{$\chi^2$ distribution}

The $\chi^2$ distribution is a probability density function for the sum of the squares of $k$ variables, each of which is independently drawn from a normal distribution with $\mu=0$ and $\sigma^2=1$.

\begin{equation}
f(X; k) = \frac{1}{2 ^ \frac{k}{2} \Gamma(\frac{k}{2})} X ^ {\frac{k}{2}-1} e ^ {-\frac{X}{2}}
\end{equation}
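
The sketch below (Python/NumPy/SciPy, invented parameter values) checks this by summing squared standard normal draws and comparing the moments to a $\chi^2$ distribution with $k$ degrees of freedom.

\begin{verbatim}
# A minimal sketch (Python/NumPy/SciPy): sums of k squared standard normal
# draws follow a chi-squared distribution with k degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
k = 5
X = (rng.normal(size=(100_000, k)) ** 2).sum(axis=1)

chi2 = stats.chi2(df=k)
print(X.mean(), chi2.mean())       # both approximately k
print(X.var(ddof=1), chi2.var())   # both approximately 2k
\end{verbatim}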


\subsection{$F$ distribution}

The $F$ distribution is the distribution of the ratio of two independent $\chi^2$ variables, each divided by its degrees of freedom:

\begin{equation}
f(X; m, n) = \frac{\chi^2_m / m}{\chi^2_n / n}
\end{equation}


\subsection{Uniform distribution}

This distribution is a flat line over a specified range of $x$ values, from $x_{min}$ to $x_{max}$. The value of $y$ is selected such that the area under the line over the specified range is 1, i.e. $y=1/(x_{max}-x_{min})$.


\subsection{$t$-distribution}

The $t$-distribution is, among other things, used to compare sample means when the variance is unknown. If the variance were known, then in many cases a normal distribution would suffice. As $k$ (the degrees of freedom) approaches infinity, the $t$-distribution approaches a normal distribution with $\mu=0$ and $\sigma^2=1$ (i.e., a $z$ distribution). The tails of the $t$-distribution have more area than do the tails of the normal distribution, accommodating the uncertainty due to the lack of an estimate for variance.

\begin{equation}
f(X; k) = \frac{\Gamma(\frac{k+1}{2})}{\sqrt{k\pi}\Gamma(\frac{k}{2})} \left(1 + \frac{X^2}{k}\right)^{-\frac{k+1}{2}}
\end{equation}

Although the $t$-distribution is not part of the exponential family of distributions, it can be rewritten in terms of two of the most basic exponential family distributions, the standard normal distribution and the $\chi^2$ distribution, as follows \citep{Grafen:2002vr}:

\begin{equation}
f(X; k) = \frac{N(0,1)}{\sqrt{\chi_{k}^2/k}}
\end{equation}

Where $N(0,1)$ is the standard normal distribution. Since the mean of the $\chi^2_k$ distribution is $k$, the denominator goes to 1 as $k$ goes to infinity.

Now that you have seen the relationship of the $t$-distribution to the normal distribution, it is also interesting to note its similarity to the $F$ distribution. The square of the standard normal distribution is the $\chi^2_1$ distribution. If we square the $t$-distribution, we therefore get a ratio of two $\chi^2$ distributions, each divided by their degrees of freedom. Sound familiar? This is an $F$ distribution with $m=1$ and $n=k$. So, the square of a $t$-distribution is a particular $F$ distribution. The implications of this relationship will be seen when we explore General Linear Models below.
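
The sketch below (Python/SciPy) checks this relationship numerically: the squared two-sided critical value of a $t$-distribution with $k$ degrees of freedom equals the corresponding critical value of an $F$ distribution with $1$ and $k$ degrees of freedom.

\begin{verbatim}
# A minimal sketch (Python/SciPy): the square of a t-distributed variable
# with k degrees of freedom follows an F distribution with (1, k) degrees
# of freedom.
import numpy as np
from scipy import stats

k = 12
t_crit = stats.t.ppf(0.975, df=k)           # two-sided 5% critical value
f_crit = stats.f.ppf(0.95, dfn=1, dfd=k)    # one-sided 5% critical value

assert np.isclose(t_crit ** 2, f_crit)
\end{verbatim}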


\section{Laws and Theorems}

\subsection{The Central Limit Theorem}

While we often think of the normal distribution being useful because it fits so many observed data, one of its most powerful applications results from the Central Limit Theorem (CLT). The CLT, in its most usual form, states that ``As sample size increases, the means of samples drawn from a population of any distribution will approach the normal distribution'' \cite{Sokal:1995uz}. The key here is that this happens \emph{even when the underlying distribution of the data is not normal}. The variance of the resulting normal distribution will decrease as the sample size increases, due to increased precision. This has a variety of implications:

\begin{itemize}

\item In large enough samples, the mean of the sample means will approximate the population mean, $\mu$.
\item The standard error of the sample means, i.e. the standard deviation of the normal distribution of sample means, will be $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation and $n$ is the sample size. Usually we don't know $\sigma$, though, so we estimate it from the data as $SE_{\bar{X}} = s/\sqrt{n}$, where $s$ is the standard deviation of the actual sample. This is often called the standard error of the mean. Note that \emph{the standard error $SE$ is a completely different quantity from the observed standard deviation $s$ of the sample}: $s$ gives an indication of how much variation is observed within a sample, while $SE$ gives an indication of how much variation is expected across sample means (see the sketch after this list).
\item A normal distribution with mean $np$ and variance $np(1-p)$ can be used to approximate a binomial distribution when the number of trials $n$ is very large.
\end{itemize}
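
The sketch below (Python/NumPy, invented parameter values) illustrates the theorem with a decidedly non-normal population: the means of repeated exponential samples pile up around the population mean, with a spread close to $\sigma/\sqrt{n}$.

\begin{verbatim}
# A minimal sketch (Python/NumPy): means of repeated samples from an
# exponential population are approximately normal around the population
# mean, with standard deviation close to sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(11)
pop_mean, n, reps = 1.0, 50, 20_000     # for this exponential, sigma = pop_mean

samples = rng.exponential(scale=pop_mean, size=(reps, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean(), pop_mean)                     # approximately mu
print(sample_means.std(ddof=1), pop_mean / np.sqrt(n))   # approximately sigma / sqrt(n)
\end{verbatim}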


\subsection{Law of large numbers}

As the sample size increases, the average of the sample will approach the average of the population (i.e., the expected value). 

\section{Hypothesis testing}

In hypothesis testing, the data are evaluated against a null hypothesis $H_0$ to see how unusual they are. If there is only a small chance of the data being realized under the null hypothesis, then the null hypothesis is rejected. The null hypothesis is paired with an alternative hypothesis, $H_A$, that is less specific than the null hypothesis and encapsulates all alternatives to the null hypothesis (sometimes there may be a set of alternative hypotheses that describe the alternatives, rather than just one). Usually there are a variety of assumptions that are made under the null hypothesis. Violation of any of these assumptions could lead to rejection of the null, even though the investigator is often interested in only one of them.

To test a null hypothesis, the distribution of data that would be expected under the null is generated, and we ask how frequently the observed data would be expected under the null model. Typically a P-value, the probability under the null hypothesis of observing data at least as extreme as the actual data, is reported. If the P-value is small, i.e. less than a particular $\alpha$, the null hypothesis is rejected. The comparison between the sample data and the null model is made with a test statistic that summarizes some aspect of the sample data. The test statistic could, for example, be the mean of the sample or the proportion of the observations in the sample that fall into a particular category.
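
The sketch below (Python/NumPy, invented numbers) makes this procedure concrete by simulating the null distribution of a test statistic and computing a P-value as the fraction of simulated values at least as extreme as the observed one; the example null hypothesis is a fair coin.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented numbers): simulate the null
# distribution of a test statistic (number of successes under a fair coin)
# and compute a two-sided P-value.
import numpy as np

rng = np.random.default_rng(12)
n, observed = 100, 62                     # hypothetical trials and successes

null_counts = rng.binomial(n=n, p=0.5, size=100_000)

# How often is the simulated count at least as far from the expectation
# (n * 0.5) as the observed count?
p_value = np.mean(np.abs(null_counts - n * 0.5) >= abs(observed - n * 0.5))
print(p_value)
\end{verbatim}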

\subsection{Relationship to confidence intervals}

In many cases the test of the null hypothesis is equivalent to examining the confidence interval of the null distribution \citep{Whitlock:2008vp}. If the test statistic derived from the observation falls outside the interval that contains a specified fraction of the expected values under the null hypothesis (e.g. 95\%), then the null is rejected, since that outcome would be expected by chance less than 5\% of the time.


\subsection{General Linear Models}
If you've taken an introductory statistics class, you probably learned a series of different tests for examining the significance of the relationship of one variable to another, including the $t$-test, ANOVA, MANOVA, $F$ tests for regression, and others. Well, it turns out that they aren't so different: they are all special cases of General Linear Models (GLMs). Statisticians have known this for some time, and many statistical software tools make no distinction between them (see the lm() function in R, for example), yet scientists still learn them as independent methods.
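
The sketch below (Python/NumPy/SciPy, invented data) illustrates one of these equivalences: a two-sample $t$-test and a one-way ANOVA on the same two groups are the same underlying linear model, so $F = t^2$ and the P-values agree.

\begin{verbatim}
# A minimal sketch (Python/NumPy/SciPy, invented data): a two-sample t-test
# and a one-way ANOVA on two groups give F = t^2 and identical P-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
group_a = rng.normal(loc=10.0, scale=2.0, size=20)
group_b = rng.normal(loc=11.5, scale=2.0, size=20)

t_stat, t_p = stats.ttest_ind(group_a, group_b)   # pooled-variance t-test
f_stat, f_p = stats.f_oneway(group_a, group_b)    # one-way ANOVA

assert np.isclose(t_stat ** 2, f_stat)
assert np.isclose(t_p, f_p)
\end{verbatim}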

For a clear and informative introduction to GLMs, and a discussion of how they relate to tests you may already be familiar with, see the first three chapters of \cite{Grafen:2002vr}.


%\subsection{The binomial test}
%The binomial test examines whether or not a particular proportion explains 

% \section{Degrees of freedom}

% \subsection{Multivariate statistics when the number of variables exceeds the number of treatments}

% @article{Bathke:2008bj,
%author = {Bathke, Arne C and Harrar, Solomon W and Madden, Laurence V},
%title = {{How to compare small multivariate samples using nonparametric tests}},
%journal = {Computational Statistics {\&} Data Analysis},
%year = {2008},
%volume = {52},
%number = {11},
%pages = {4951--4965},
%month = jul
%}

% Breaks covariance down into H (covariances and variances (mean squares) due to treatment) and G (the variances and covariances (mean squares) due to error, i.e. the residuals)
% a - the number of treatments or levels (all of which are relevant to one factor)


% Treatment effects may only impact joint distributions of variables, not their marginal distributions. If univariate methods are used, these joint effects would be missed.

% http://cran.r-project.org/web/packages/npmv/npmv.pdf

% Mentions that it is a problem when H and G become singular, for example when the number of response variables exceed the sample size. In that case, only the ANOVA is a valid test
% ANOVA tests can be planned or unplanned. If associations identified after collecting the data are being examined an unplanned test must be used, if the planned test was used the Type I error would become inflated. See \citep{Whitlock:2008vp} p. 404-405.



% http://biomet.oxfordjournals.org/content/92/4/951.abstract
% A test for independence in high dimensions...



\section{Multiple independent tests}

If multiple independent tests are made, the risk of at least one Type I error (falsely rejecting a true null hypothesis) grows rapidly.

The Bonferroni correction is a common mechanism for addressing this issue. A new, more stringent $\alpha$ is simply calculated as $\alpha^*=\alpha/n$, where $n$ is the number of independent tests. While this reduces the rate of Type I error, it does so at the cost of reducing the power of each test, elevating the rate of Type II error.

Alternatively, the False Discovery Rate (FDR), the expected proportion of rejected null hypotheses that are actually true, may be controlled instead, for example with the Benjamini-Hochberg procedure. This is less conservative than the Bonferroni correction.
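
The sketch below (Python/NumPy, invented P-values) applies both approaches: the Bonferroni threshold $\alpha/n$ and the Benjamini-Hochberg step-up procedure.

\begin{verbatim}
# A minimal sketch (Python/NumPy, invented P-values) of the Bonferroni
# correction and the Benjamini-Hochberg (FDR) procedure.
import numpy as np

pvals = np.array([0.001, 0.008, 0.020, 0.041, 0.180, 0.570])
alpha = 0.05
n = len(pvals)

# Bonferroni: compare each P-value to alpha / n.
bonferroni_reject = pvals < alpha / n

# Benjamini-Hochberg: sort the P-values and find the largest rank i with
# p_(i) <= (i / n) * alpha; reject that hypothesis and all with smaller
# P-values.
order = np.argsort(pvals)
thresholds = (np.arange(1, n + 1) / n) * alpha
passing = np.nonzero(pvals[order] <= thresholds)[0]
bh_reject = np.zeros(n, dtype=bool)
if passing.size > 0:
    bh_reject[order[:passing[-1] + 1]] = True

print(bonferroni_reject)
print(bh_reject)
\end{verbatim}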

% \section{Likelihood methods}

% \section{Bayesian statistics}

% \section{Generalized linear models}

% EdgeR uses generalized linear models, see the edgeR user guide for formulation


\bibliographystyle{amnat}

\bibliography{statistics}


\end{document}