Commits

Anonymous committed 3bb2f6b

added chapter5

Files changed (24)

StatSimulationBased.txt

 		* compute critical t-value (in R: qt(0.95, DF), where DF = n1+n2-2; n1, n2 are the sizes of the two groups)
 		* reject H0 {{./equation028.png?type=equation}}
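For readers working in Python rather than R, the critical value in the step above can probably be looked up with `scipy.stats.t.ppf`, which plays the role of R's `qt`. This is a minimal sketch; the group sizes `n1` and `n2` are made-up values used only for illustration.

```python
from scipy import stats

# Hypothetical group sizes, chosen only for illustration.
n1, n2 = 11, 11
df = n1 + n2 - 2

# Counterpart of R's qt(0.95, DF): the 95th percentile of the t-distribution.
t_crit = stats.t.ppf(0.95, df)
print("DF = %d, critical t = %.4f" % (df, t_crit))
```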
 
-=== 5 Analysis of Variance (ANOVA) ===
+===== 5 Analysis of Variance (ANOVA) =====
+	* source: [[./ex5anova.py]]
+	* aka ANALYSIS OF VARIANCE
+	* Gauss noticed that the observational error had a particular distribution: there were more observations close to the truth than not, and errors overshot and undershot with equal probability. The errors in fact have a normal distribution. Thus if we average the observations, the errors tend to cancel themselves out.
+	* Any sample mean can be thought of as 'containing' the true population mean plus an error term:
+		* {{./equation029.png?type=equation}}
+
+=== 5.2 Statistical models ===
 	* source: 
+	* Characterizing a sample mean as an error about a population mean is perhaps the simplest possible example of building a STATISTICAL MODEL:
+		* {{./equation030.png?type=equation}}
+		* allows us to compare statistical models and decide which one better characterizes the data. (powerful idea)
+	* If an effect α_j is present, the variation between groups increases because of the systematic differences between groups: //the between-group variation is due to error variation plus variation due to α_j. So the null hypothesis becomes://
+		* {{./equation031.png?type=equation}}   
+	* As the sample size goes up, the sample means will be tighter, and the variance will go down, but it will always be positive and skewed right, and thus the mean of this sampling distribution will always overestimate the true parameter. 
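Why the mean of this sampling distribution sits above the true parameter can be made explicit with a short worked equation. It is a sketch under stated assumptions: the I groups are independent samples of size n from a single population N(μ, σ²), and the variance of the sample means uses the I - 1 denominator. The values σ = 4 and n = 11 are the ones set in ex5-2variance.py.

```latex
\bar{x}_j \sim N\left(\mu, \frac{\sigma^2}{n}\right), \qquad
E\left[\frac{1}{I-1}\sum_{j=1}^{I}(\bar{x}_j - \bar{x})^2\right] = \frac{\sigma^2}{n} \\
\sigma = 4,~ n = 11: \quad \frac{\sigma^2}{n} = \frac{16}{11} \approx 1.45
```

So even when the population means are identical (a true between-population variance of 0), the expected variance of the sample means is σ²/n, which shrinks as n grows but never reaches 0.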
+
+=== 5.2.3 Analyzing the variance ===
+	* source: [[./ex5-3analyzevar.py]]
+	* (//recall that i ranges across participants within a group, and j ranges across groups//):
+	* {{./equation032.png?type=equation}}
+	* That is, the difference between any value and the grand mean is equal to the sum of (I) the difference between that value and its group mean and (II) the difference between its group mean and the grand mean.
+	* SUM OF SQUARES (SS-Total) is the sum of the SS-between and SS-within:
+		{{./equation033.png?type=equation}}
+	* To get to the variances within and between each group, we simply need to divide each SS by the appropriate degrees of freedom.
+	* The DF-total and DF-between are analogous to the case for the simple variance {{./equation034.png?type=equation}}
+	//The number of scores minus the number of parameters estimated gives you the degrees of freedom for each variance://
+	* {{./equation035.png?type=equation}}
+	* Another term for a variance is the MEAN SQUARE (MS), which is the term used in ANOVA.
+		* {{./equation036.png?type=equation}}
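The decomposition and the mean squares above can be checked numerically. The following is a minimal NumPy sketch, not the committed script: it reuses the toy matrix from ex5-3analyzevar.py and assumes, as that script does, that each column is one group. The variable names are illustrative.

```python
import numpy as np

# Toy data from ex5-3analyzevar.py; each column is treated as one group.
rvh = np.array([[9, 10, 1], [1, 2, 5], [2, 6, 0]], dtype=float)

n_per_group, n_groups = rvh.shape
n_total = rvh.size

grand_mean = rvh.mean()
group_means = rvh.mean(axis=0)  # one mean per column (group)

# Sums of squares: total, between groups, within groups.
ss_total = ((rvh - grand_mean) ** 2).sum()
ss_between = n_per_group * ((group_means - grand_mean) ** 2).sum()
ss_within = ((rvh - group_means) ** 2).sum()

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_total = ss_total / (n_total - 1)
ms_between = ss_between / (n_groups - 1)
ms_within = ss_within / (n_total - n_groups)

print("SS total/between/within:", ss_total, ss_between, ss_within)
print("MS total/between/within:", ms_total, ms_between, ms_within)
```

For this matrix SS-total is 108, which equals SS-between (24) plus SS-within (84), and the degrees of freedom add up the same way (8 = 2 + 6).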
+
+=== 5.3.3 Hypothesis testing ===
+	* source: [[./ex5-3msbetween.py]]
+	* The null hypothesis amounts to saying that there is no effect of α_j: that any between-group variance we see is completely attributable to within-group variance:
+		{{./equation037.png?type=equation}}
+	* The key idea of ANOVA: //when the groups' means are in fact identical, MS-between and MS-within both estimate the population variance, so the two sampling distributions are centered close to it.//
+	* F-STATISTIC - the ratio MS-between/MS-within; it is analogous to a t-statistic, and the accompanying sampling distribution - the F-distribution - can be used just like a t-curve to compute the p-value for a result.
+	* F-distribution is defined as F(DFa, DFb), where DFa is the degrees of freedom of the MS-between (numerator) and DFb is the degrees of freedom of the MS-within (denominator).
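As a rough illustration of how the F-statistic and its p-value fit together, here is a minimal sketch using SciPy. The helper `one_way_f` is a hypothetical name introduced for this example, and the three groups are the columns of the toy matrix from ex5-3analyzevar.py. The result is compared with `scipy.stats.f_oneway` as a sanity check.

```python
import numpy as np
from scipy import stats

# The three columns of the toy matrix in ex5-3analyzevar.py, one array per group.
groups = [np.array([9.0, 1.0, 2.0]),
          np.array([10.0, 2.0, 6.0]),
          np.array([1.0, 5.0, 0.0])]


def one_way_f(groups):
    """Return (F, DFa, DFb, p) for a one-way ANOVA (hypothetical helper)."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    df_a = len(groups) - 1                 # DF of MS-between (numerator)
    df_b = all_values.size - len(groups)   # DF of MS-within (denominator)
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / df_a) / (ss_within / df_b)
    # Upper-tail area of F(DFa, DFb) beyond the observed statistic.
    p_value = stats.f.sf(f_stat, df_a, df_b)
    return f_stat, df_a, df_b, p_value


f_stat, df_a, df_b, p_value = one_way_f(groups)
reference = stats.f_oneway(*groups)
print("F(%d, %d) = %.4f, p = %.4f" % (df_a, df_b, f_stat, p_value))
print("scipy.stats.f_oneway:", reference[0], reference[1])
```

With only three values per group the test has very little power, so the large p-value here says nothing about the method itself.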
+
+=== 5.3.4 MS-within, three non-identical populations ===
+	* source: [[./ex5-3mswithinnonequal.py]]
+	* The null hypothesis is now in fact false.
+	* **MS-within** computes the spread about the mean in each sample: the location of the mean in that sample is irrelevant. As long as the population variances remain identical, MS-within will always estimate this variance in an unbiased manner.
+	* If the null hypothesis is in fact false (if the population means differ), then it's highly likely that MS-between is greater than MS-within, and that the F-ratio is significantly greater than 1.
+	* When population means actually differ, for a given sample it is possible that MS-between is lower and that MS-within is higher than the population's variances.
+	* A common rule of thumb is that //the results of ANOVA will be approximately correct if the largest standard deviation is less than twice the smallest standard deviation.//
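The rule of thumb is easy to check before running an ANOVA. This is a minimal sketch using only the standard library; the function name `variances_roughly_equal` and the `max_ratio` parameter are assumptions made for illustration, and the data are the columns of the toy matrix from ex5-3analyzevar.py.

```python
import statistics


def variances_roughly_equal(groups, max_ratio=2.0):
    """Return (ok, sds); ok is True when the largest sample standard
    deviation is less than max_ratio times the smallest one."""
    sds = [statistics.stdev(g) for g in groups]
    return max(sds) < max_ratio * min(sds), sds


groups = [[9, 1, 2], [10, 2, 6], [1, 5, 0]]
ok, sds = variances_roughly_equal(groups)
print("rule of thumb satisfied:", ok)
print("standard deviations:", [round(s, 3) for s in sds])
```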
+
+=== 6. Bivariate statistics and linear models ===
+
+

StatSimulationBased/equation029.png

Added
New image

StatSimulationBased/equation029.tex

+\bar{x}_1 = \mu + \epsilon_1

StatSimulationBased/equation030.png

Added
New image

StatSimulationBased/equation030.tex

+\bar{x}_j = \mu + \epsilon_j

StatSimulationBased/equation031.png

Added
New image

StatSimulationBased/equation031.tex

+H_0: ~ x_{ij} ~=~ \mu + \epsilon_{ij} \\
+H_a: ~ x_{ij} ~=~ \mu + \alpha_j + \epsilon_{ij}

StatSimulationBased/equation032.png

Added
New image

StatSimulationBased/equation032.tex

+x_{ij} = x_{ij} \\
+x_{ij} - \bar{x} = x_{ij} - \bar{x} \\
+= x_{ij} + (-\bar{x}_j + \bar{x}_j) - \bar{x} \\
+= (x_{ij}- \bar{x}_j) + (\bar{x}_j - \bar{x}) \\
+= (\bar{x}_j - \bar{x}) + (x_{ij} - \bar{x}_j)

StatSimulationBased/equation033.png

Added
New image

StatSimulationBased/equation033.tex

+\sum_{j=1}^{I} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2
+= \sum_{j=1}^{I} \sum_{i=1}^{n_j} (\bar{x}_j - \bar{x})^2
++ \sum_{j=1}^{I} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \\
+SS_{total} = SS_{between} + SS_{within}

StatSimulationBased/equation034.png

Added
New image

StatSimulationBased/equation034.tex

+(N-1, I-1)

StatSimulationBased/equation035.png

Added
New image

StatSimulationBased/equation035.tex

+DF_{within} ~=~ DF_{total} - DF_{between} \\
+= (N-1) - (I-1) \\
+= N - I

StatSimulationBased/equation036.png

Added
New image

StatSimulationBased/equation036.tex

+MS_{total} = \frac{\sum_{j=1}^{I} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2}{N-1} \\
+MS_{between} = \frac{\sum_{j=1}^{I} \sum_{i=1}^{n_j} (\bar{x}_j - \bar{x})^2}{I-1} \\
+MS_{within} = \frac{\sum_{j=1}^{I} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{N-I}

StatSimulationBased/equation037.png

Added
New image

StatSimulationBased/equation037.tex

+H_0: x_{ij} = \mu + \epsilon_{ij}\\
+H_a: x_{ij} = \mu + \alpha_j + \epsilon_{ij}\\
+in~other~words:\\
+H_0: MS_{between} = MS_{within}\\
+H_0: \frac{MS_{between}}{MS_{within}} =  1

StatSimulationBased/ex5-2variance.py

+# -*- coding: utf-8 -*-
+"""
+Created on Fri Sep  9 21:09:56 2011
+
+@author: -
+"""
+
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+
+trials = 1000
+mu = 60           # population mean
+sigma = 4         # population standard deviation
+sample_size = 11
+
+
+norm_dist = stats.norm(loc=mu, scale=sigma)
+variances = np.zeros(trials)
+for i in range(trials):
+    # three samples drawn from the same population (one row per sample)
+    samples = norm_dist.rvs(size=(3, sample_size))
+    # variance of the three sample means (n - 1 denominator), the
+    # statistic discussed in the section 5.2 notes
+    variances[i] = np.var(samples.mean(axis=1), ddof=1)
+
+print("mean of variances of sample means:", np.mean(variances))
+plt.hist(variances)
+plt.show()

StatSimulationBased/ex5-3analyzevar.py

+# -*- coding: utf-8 -*-
+"""
+Analyze variances
+Created on Fri Sep  9 21:30:30 2011
+@author: 
+"""
+
+import numpy as np
+
+# each column of rvh is one group; each row is one participant within a group
+rvh = np.array([[9, 10, 1], [1, 2, 5], [2, 6, 0]])
+n_rows, n_groups = rvh.shape
+
+# First, the difference between each value and the grand mean ("total").
+grand_mean = np.mean(rvh)
+print("mean of rvh:", grand_mean)
+total = rvh - grand_mean
+print("matrix of total:\n", total)
+
+# Second, the "within" group differences, i.e. the difference between
+# each value and its own group (column) mean.
+group_means = np.mean(rvh, axis=0)
+within = rvh - group_means
+print("matrix of within:\n", within)
+
+# Finally, the "between" group differences, i.e. the difference between
+# each group mean and the grand mean, repeated for every row of the group.
+between = np.tile(group_means - grand_mean, (n_rows, 1))
+print("matrix of between:\n", between)
+
+print("total vs within+between:")
+print(total)
+print(within + between)
+
+# mean squares: each sum of squares divided by its degrees of freedom
+# (groups are columns, so the number of groups is shape[1], not shape[0])
+ms_total = np.sum(total**2) / (total.size - 1)
+ms_between = np.sum(between**2) / (n_groups - 1)
+ms_within = np.sum(within**2) / (total.size - n_groups)
+print("ms_total: %.5f, ms_between: %.5f, ms_within: %.5f" % (ms_total, ms_between, ms_within))

StatSimulationBased/ex5-3msbetween.py

+# -*- coding: utf-8 -*-
+"""
+Created on Sat Sep 10 19:42:10 2011
+
+@author: -
+"""
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+
+mu = 60
+sigma = 4
+trials = 1000
+sample_size = 11
+groups = 3
+
+ms_within = np.zeros(trials)
+ms_between = np.zeros(trials)
+fs = np.zeros(trials)
+
+norm_gen = stats.norm(loc=mu, scale=sigma)
+for n in range(trials):
+    # one column per group, all drawn from the same population (H0 is true)
+    m = norm_gen.rvs(size=(sample_size, groups))
+    group_means = m.mean(axis=0)
+
+    # deviation of each value from its group mean, and of each group mean
+    # from the grand mean (repeated for every row of the group)
+    within = m - group_means
+    between = np.tile(group_means - m.mean(), (sample_size, 1))
+
+    ms_within[n] = np.sum(within**2) / (m.size - groups)
+    ms_between[n] = np.sum(between**2) / (groups - 1)
+    fs[n] = ms_between[n] / ms_within[n]
+
+print("mean of f-statistic:", np.mean(fs))
+
+# finally plotting
+plt.figure(1)
+plt.subplot(311)
+plt.title("histogram of ms_within")
+plt.hist(ms_within)
+plt.subplot(312)
+plt.title("histogram of ms_between")
+plt.hist(ms_between)
+plt.subplot(313)
+plt.title("histogram of F-statistic")
+plt.hist(fs)
+plt.show()

StatSimulationBased/ex5-3mswithinnonequal.py

+# -*- coding: utf-8 -*-
+"""
+MS-within for non-identical populations
+Created on Sat Sep 10 19:42:10 2011
+
+@author: -
+"""
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+
+# three different population means: the null hypothesis is false
+mu1 = 58
+mu2 = 60
+mu3 = 62
+sigma = 4  # common population standard deviation
+trials = 1000
+sample_size = 11
+groups = 3
+
+ms_within = np.zeros(trials)
+ms_between = np.zeros(trials)
+fs = np.zeros(trials)
+
+norm_gen1 = stats.norm(loc=mu1, scale=sigma)
+norm_gen2 = stats.norm(loc=mu2, scale=sigma)
+norm_gen3 = stats.norm(loc=mu3, scale=sigma)
+
+for n in range(trials):
+    sample1 = norm_gen1.rvs(size=sample_size)
+    sample2 = norm_gen2.rvs(size=sample_size)
+    sample3 = norm_gen3.rvs(size=sample_size)
+
+    # one column per group
+    m = np.vstack((sample1, sample2, sample3)).transpose()
+    group_means = m.mean(axis=0)
+
+    # deviation of each value from its group mean, and of each group mean
+    # from the grand mean (repeated for every row of the group)
+    within = m - group_means
+    between = np.tile(group_means - m.mean(), (sample_size, 1))
+
+    ms_within[n] = np.sum(within**2) / (m.size - groups)
+    ms_between[n] = np.sum(between**2) / (groups - 1)
+    fs[n] = ms_between[n] / ms_within[n]
+
+print("mean of f-statistic:", np.mean(fs))
+
+# finally plotting
+plt.figure(1)
+plt.subplot(311)
+plt.title("histogram of ms_within")
+plt.hist(ms_within)
+plt.subplot(312)
+plt.title("histogram of ms_between")
+plt.hist(ms_between)
+plt.subplot(313)
+plt.title("histogram of F-statistic")
+plt.hist(fs)
+plt.show()

StatSimulationBased/ex5anova.py

+# -*- coding: utf-8 -*-
+"""
+Created on Fri Sep  9 20:40:36 2011
+
+@author: -
+"""
+
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+
+trials = 1000
+mu = 60  # mean of the population
+sigma = 1  # standard deviation of the population
+sample_size = 10
+
+errors = np.zeros(trials)
+sampler = stats.norm(loc=mu, scale=sigma)
+for i in range(trials):
+    sample = sampler.rvs(size=sample_size)
+    # error term: how far this sample mean falls from the true mean
+    errors[i] = np.mean(sample) - mu
+
+plt.title("Distribution of errors")
+plt.hist(errors)
+plt.show()