# Commits

committed 0bfc332

finished chapter.4 power analysis

• Participants
• Parent commits 3b61714
• Branches default

# File StatSimulationBased.txt

 	* A: In the pre-computing days, people used to look up a table that told you, for n-1 degrees of freedom, how many SEs you need to go on either side of the sample mean to get a 95% CI.

 === Some observations on Confidence Intervals ===
-
 	* source: [[./ex3-8.py]]
 	* One important point to notice is that the range defined by the confidence interval will vary with each sample even if the sample size is kept constant. The reason is that the sample mean will vary each time, and the standard deviation will vary too.
 	* The sample mean and standard deviation are likely to be close to the population mean and standard deviation, but they are ultimately just estimates of the true parameters.
 	* This means that as evidence for rejection of H0 we will count extreme values on both sides of µ. For this reason, the above test is called a **two-sided significance test.**
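To see this variation directly, here is a minimal simulation sketch (independent of ex3-8.py; the values of mu, sigma and n are made up for illustration): each sample yields a different 95% CI around its own mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n = 60, 4, 40  # illustrative population parameters

# draw several samples; each one yields a different 95% CI
for _ in range(5):
    sample = rng.normal(mu, sigma, n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, n - 1)  # the "table lookup" step
    print("CI: [%.2f, %.2f]" % (xbar - t_crit * se, xbar + t_crit * se))
```

Most of these intervals will cover the true mu, but their endpoints differ every time because both the sample mean and the sample SD differ.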

 === 3.16 Comparing 2 samples ===
-	* source:
+	* source: [[./ex3-16twosample.py]]
+	* We can state our null hypothesis as H0: μ1 = μ2, i.e. the null hypothesis is that the difference between the two means is zero.
+	* conclusions from experiment:
+		* the difference of the means of two samples follows a normal distribution and is **centered around the true difference between the two populations.**
+		* the standard deviation of this sampling distribution of the difference:
+			{{./equation023.png?type=equation}}
+		* two-sample **z-score** - replace σ with the standard deviation of the difference:
+		{{./equation024.png?type=equation}}

+		* two-sample **t-statistic**:
+		{{./equation025.png?type=equation}}
+		* translating this t-statistic to a p-value is problematic: the degrees of freedom needed for the correct t-distribution are not obvious. The t-distribution assumes that a single s replaces a single σ, but here we have two of each. If σ1 = σ2, we can take a weighted average of the two sample SDs s1 and s2 (the pooled SD).
+		* In this case the correct t-distribution has (n1-1) + (n2-1) degrees of freedom [proof: Rice, 1992, 422]
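The pooled two-sample t with (n1-1)+(n2-1) degrees of freedom can be cross-checked against scipy's built-in test (a sketch; the samples are simulated, and equal population SDs are assumed so that pooling applies):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(34, 30, 10)  # simulated group 1
x2 = rng.normal(7, 30, 20)   # simulated group 2
n1, n2 = len(x1), len(x2)

# pooled variance: weighted average of the two sample variances
s2_pooled = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (x1.mean() - x2.mean()) / np.sqrt(s2_pooled * (1.0 / n1 + 1.0 / n2))

# equal_var=True makes ttest_ind use the same pooled estimate and n1+n2-2 df
t_scipy, p = stats.ttest_ind(x1, x2, equal_var=True)
print(t_manual, t_scipy, p)
```

The manual statistic and scipy's agree to floating-point precision, confirming the pooled formula above.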
+
+=== 4.1 Hypothesis testing revisited ===
+	* source: [[./ex4-2poweranalysis.py]]
+	* **TYPE I ERROR:** the null hypothesis is **true**, but our sample leads us to **reject** it
+	* **TYPE II ERROR:** the null hypothesis is **false,** but our sample leads us to **fail to reject** it
+Some conventions:
+	* Let R = 'Reject the null hypothesis H0'
+	* Let -R = 'Fail to reject the null hypothesis H0.'
+	* The decision between R and -R is based on the sample
+NB! When we do an experiment we don't know whether the null hypothesis is true or not.
+	1. Attempt to minimize error → how to measure it? Let P(R|H0) = "//Probability of rejecting the null hypothesis conditional on the assumption that the null hypothesis is in fact true//."
+	{{./MatrixTypeErrors.png}}
+	* Thus, if we want to decrease the chance of a Type II error, we need to increase the power of the statistical test.
+	* The best situation is relatively high power (low Type II error) together with a low Type I error. By convention we keep α at 0.05, and we should not lower it further: lowering α reduces power.
+	* If we have a relatively narrow CI and a non-significant result (p > 0.05), we have relatively high power and a relatively low probability of making a Type II error, i.e. accepting the null hypothesis as true when it is in fact false.
+	* observed power provides no new information once the p-value is known: if the p-value is high, we already know the observed power is low, so nothing is gained by computing it.
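Power itself can be estimated by simulation: fix a specific alternative, draw many samples under it, and count how often the test rejects H0 (a sketch; the parameter values are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu0, mu_true, sigma, n = 60.0, 62.0, 4.0, 20  # illustrative values
alpha = 0.05
trials = 5000

rejections = 0
for _ in range(trials):
    sample = rng.normal(mu_true, sigma, n)
    # two-sided one-sample t-test of H0: mu = mu0
    t_stat, p = stats.ttest_1samp(sample, mu0)
    if p < alpha:
        rejections += 1

power = rejections / trials  # estimated P(reject H0 | H0 is false)
print("estimated power: %.3f" % power)
```

Increasing n or the true effect (mu_true - mu0) raises the estimated power; shrinking α lowers it, which is the trade-off described above.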
+
+=== 4.3.1 Equivalence testing examples ===
+	* source: [[./ex4-3example.py]]
+	* TOST method's algorithm:
+		* define an equivalence threshold Θ
+		* compute two one-sided t-tests:
+		{{./equation026.png?type=equation}}
+		{{./equation027.png?type=equation}}
+		* compute the critical t-value (in R: qt(0.95, DF), where DF = n1+n2-2; n1, n2 are the sizes of the two groups)
+		* conclude equivalence (reject the H0 of non-equivalence) if both one-sided tests reject: {{./equation028.png?type=equation}}
+
+=== 5 Analysis of Variance (ANOVA) ===
+	* source:

# File StatSimulationBased/equation023.tex

+\sigma_{\bar{x_1} - \bar{x_2}} =
+	\sqrt{\frac{\sigma_{1}^2}{n_1} + \frac{\sigma_{2}^2}{n_2}}

# File StatSimulationBased/equation024.tex

+z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} =
+\frac{sample~mean - pop.~mean}{SD~of~sampling~distribution}
+\Rightarrow z = \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sigma_{\bar{x_1} - \bar{x_2}}}

# File StatSimulationBased/equation025.tex

+t = \frac{(\bar{x_1}- \bar{x_2}) - (\mu_1 - \mu_2)}{
+\sigma_{\bar{x_1}- \bar{x_2}}
+} =
+\frac{(\bar{x_1}- \bar{x_2}) - (\mu_1 - \mu_2)}{
+\sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
+}

# File StatSimulationBased/equation026.tex

+t_L = \frac{d-\Theta}{ s_{pooled} \sqrt{1/n_1 + 1/n_2}}

# File StatSimulationBased/equation027.tex

+t_U = \frac{d + \Theta}{ s_{pooled} \sqrt{1/n_1 + 1/n_2}}

# File StatSimulationBased/equation028.tex

+equivalent~if~ t_L \le -t_{crit} ~and~ t_U \ge t_{crit}

# File StatSimulationBased/ex3-16twosample.py

+# -*- coding: utf-8 -*-
+"""
+Created on Fri Aug 26 18:59:22 2011
+3.16 Compare two samples and test the null hypothesis
+
+@author: -
+"""
+import numpy as np
+from scipy import stats
+import matplotlib.pyplot as plt
+
+# first population's parameters
+mu1 = 34
+sigma1 = 43
+sample_size1 = 10
+# second population's parameters
+mu2 = 7
+sigma2 = 25
+sample_size2 = 20
+
+trials = 1000
+
+d = np.zeros(trials)
+sample1_gen = stats.norm(loc=mu1, scale=sigma1)
+sample2_gen = stats.norm(loc=mu2, scale=sigma2)
+
+for i in range(trials):
+    sample1 = sample1_gen.rvs(size=sample_size1)
+    sample2 = sample2_gen.rvs(size=sample_size2)
+    d[i] = np.mean(sample1) - np.mean(sample2)
+
+# SD of the sampling distribution of the difference of means
+print("new_sigma:", np.sqrt(sigma1**2 / sample_size1 + sigma2**2 / sample_size2))
+
+plt.hist(d)
+plt.title("Distribution of differences of two sample means")
+plt.show()

# File StatSimulationBased/ex4-2poweranalysis.py

+# -*- coding: utf-8 -*-
+"""
+Created on Fri Aug 26 21:10:22 2011
+
+Chapter 4 - Power analysis
+@author: -
+"""
+import numpy as np
+import scipy.stats as stats
+import matplotlib.pyplot as plt
+from matplotlib.patches import Polygon
+
+# example means/sds for the two hypotheses (kept for reference)
+mu1 = 60
+sigma1 = 1
+mu2 = 62
+sigma2 = 1
+
+# main parameters
+xmin = -6
+xmax = 6
+step = 0.01
+alpha = 0.05
+
+x = np.arange(xmin, xmax, step)
+qnts = np.array([alpha / 2, 1 - alpha / 2])
+
+plt.figure()
+subplot = plt.subplot(211)
+
+
+def plot_type1_error(subplot, x, qnts, mean=0, sd=1):
+    """Shades the two rejection regions (Type I error, total area alpha)
+    under the null distribution."""
+    xmin = x.min()
+    xmax = x.max()
+    norm_gen = stats.norm(loc=mean, scale=sd)
+
+    ax = subplot
+    ax.plot(x, norm_gen.pdf(x))
+
+    left_limit = norm_gen.ppf(qnts[0])
+    right_limit = norm_gen.ppf(qnts[1])
+
+    # shade left tail
+    x1 = np.arange(xmin, left_limit, qnts[0] / 5)
+    verts = list(zip(x1, norm_gen.pdf(x1)))
+    ax.add_patch(Polygon([(xmin, 0)] + verts + [(left_limit, 0)]))
+
+    # shade right tail
+    x1 = np.arange(right_limit, xmax, qnts[0] / 5)
+    verts = list(zip(x1, norm_gen.pdf(x1)))
+    ax.add_patch(Polygon([(right_limit, 0)] + verts + [(xmax, 0)]))
+    # TODO: add legend/text
+
+
+def plot_type2_error(subplot, x, qnts, mean1, mean2=0, sd=1,
+                     shade1=0.7, shade2=0.9):
+    """Plots Type I and Type II error regions.
+    mean1 - mean of the true (alternative) distribution (Type II error)
+    mean2 - mean of the null distribution (Type I error)
+    """
+    ax = subplot
+    shade1 = str(shade1)  # grayscale color strings for matplotlib
+    shade2 = str(shade2)
+    xmin = x.min()
+    xmax = x.max()
+
+    step = qnts[0] / 5
+
+    norm_gen = stats.norm(loc=mean1, scale=sd)   # alternative
+    norm2_gen = stats.norm(loc=mean2, scale=sd)  # null
+
+    # plot both distributions
+    ax.plot(x, norm_gen.pdf(x))
+    ax.plot(x, norm2_gen.pdf(x))
+
+    # Type I error regions under the null distribution
+    # left tail
+    right_limit = norm2_gen.ppf(qnts[0])
+    x1 = np.arange(xmin, right_limit, step)
+    verts = [(xmin, 0)] + list(zip(x1, norm2_gen.pdf(x1))) + [(right_limit, 0)]
+    ax.add_patch(Polygon(verts, facecolor=shade1, edgecolor='r'))
+
+    # right tail
+    left_limit = norm2_gen.ppf(qnts[1])
+    x1 = np.arange(left_limit, xmax, step)
+    verts = [(left_limit, 0)] + list(zip(x1, norm2_gen.pdf(x1))) + [(xmax, 0)]
+    ax.add_patch(Polygon(verts, facecolor=shade1, edgecolor='r'))
+
+    # Type II error region under the alternative distribution:
+    # the alternative's probability mass inside the null's acceptance region
+    llimit = norm2_gen.ppf(qnts[0])
+    x1 = np.arange(llimit, xmax, step)
+    verts = [(llimit, 0)] + list(zip(x1, norm_gen.pdf(x1))) + [(xmax, 0)]
+    ax.add_patch(Polygon(verts, facecolor=shade2, edgecolor='b'))
+
+
+#plot_type1_error(subplot, x, qnts)
+plot_type2_error(subplot, x, qnts=[0.01, 0.99], mean1=-2, mean2=0)
+plt.show()
+

# File StatSimulationBased/ex4-3example.py

+# -*- coding: utf-8 -*-
+"""
+Created on Sat Aug 27 14:04:33 2011
+
+4.3 - Power analysis real example - do the extra expenses pay off?
+@author: -
+"""
+import numpy as np
+import scipy.stats as stats
+
+n1 = 64
+n2 = 70
+pop_size = 134
+
+mean1 = 1.5679
+mean2 = 1.6764
+pop_mean = 1.6246
+
+sd1 = 0.4285
+sd2 = 0.4748
+pop_sd = 0.4533  # pooled
+
+theta = 0.2 * mean1  # equivalence threshold
+
+# standard error of the difference of the two means
+SE = lambda sigma, n1, n2: sigma * np.sqrt(1.0/n1 + 1.0/n2)
+
+def TOST(mean1, mean2, n1, n2, sigma):
+    """Tests equivalence with the TOST (two one-sided tests) method."""
+    d = mean2 - mean1
+    t1 = (d - theta) / SE(sigma, n1, n2)  # tests H0: d >= theta
+    t2 = (d + theta) / SE(sigma, n1, n2)  # tests H0: d <= -theta
+
+    t_crit = stats.t.ppf(0.95, n1 + n2 - 2)
+
+    # equivalent only if both one-sided tests reject
+    print("TOST: t1 = %f, t2 = %f, t_crit = %f" % (t1, t2, t_crit))
+    if (t1 < -t_crit) and (t2 > t_crit):
+        print("result: equivalent")
+    else:
+        print("result: failed to show equivalence")
+
+def CIMethod(mean1, mean2, n1, n2, sigma):
+    """Tests equivalence with the confidence-interval method:
+    equivalent if the CI lies entirely inside (-theta, theta)."""
+    d = mean2 - mean1
+    t_crit = stats.t.ppf(0.95, n1 + n2 - 2)
+    ci_lower = d - t_crit * SE(sigma, n1, n2)
+    ci_upper = d + t_crit * SE(sigma, n1, n2)
+
+    print("CIMethod: -%f <= [%f, %f] <= %f" % (theta, ci_lower, ci_upper, theta))
+    if (ci_lower > -theta) and (ci_upper < theta):
+        print("result: equivalent")
+    else:
+        print("result: failed to show equivalence")
+
+
+TOST(mean1, mean2, n1, n2, pop_sd)
+CIMethod(mean1, mean2, n1, n2, pop_sd)

# File StatSimulationBased/ex_RtoPy.py

+# -*- coding: utf-8 -*-
+"""
+Created on Fri Aug 26 22:22:16 2011
+
+R-to-Python cheat sheet: R's distribution functions (d/p/q/r prefixes)
+mapped to their numpy/scipy equivalents.
+
+@author: -
+"""
+
+import numpy as np
+import scipy.stats as stat
+
+t     = np.transpose  # R's t(): transpose of a matrix
+mean  = np.mean
+var   = np.var        # NB: numpy defaults to ddof=0; R's var() uses ddof=1
+sd    = np.std        # NB: numpy defaults to ddof=0; R's sd() uses ddof=1
+
+dnorm = stat.norm.pdf
+pnorm = stat.norm.cdf
+qnorm = stat.norm.ppf
+rnorm = stat.norm.rvs
+
+dt = stat.t.pdf
+pt = stat.t.cdf
+qt = stat.t.ppf
+rt = stat.t.rvs
+
+df = stat.f.pdf
+pf = stat.f.cdf
+qf = stat.f.ppf
+rf = stat.f.rvs
+
+dchisq = stat.chi2.pdf
+pchisq = stat.chi2.cdf
+qchisq = stat.chi2.ppf
+rchisq = stat.chi2.rvs
+
+# added mar 29, 2009 8:40
+dunif = stat.uniform.pdf
+punif = stat.uniform.cdf
+qunif = stat.uniform.ppf
+runif = stat.uniform.rvs
+
+
+def Test():
+    X = np.arange(1, 10, 1)
+    print(X)
+    print(mean(X))
+    print(sd(X))
+    print(var(X))
+
+
+if __name__ == "__main__":
+    Test()