# Commits

committed 3b61714

New Examples to paragraph 3.16

• Participants
• Parent commits a7845e8

# File StatSimulationBased.txt

 	* While we do not know {{./equation012.png?type=equation}} , we can estimate it using SE_x and perform inference using a distribution that is almost normal, but reflects the increase in uncertainty arising from this estimation: **the t-distribution**.

 === 3.11 Significance Tests ===
+	* source: [[./ex3-11nullhypo.py]]
+	* recall the discussion of 95% confidence intervals:
+		* The sample gives us a mean {{./equation013.png?type=equation}} .
+		* We compute {{./equation014.png?type=equation}} (an estimate of {{./equation015.png?type=equation}}) using s (an estimate of σ) and the sample size n.
+		* Then we calculate the range {{./equation016.png?type=equation}} - that's the 95% CI.
+	* NULL HYPOTHESIS: Suppose we have a hypothesis that the population mean has a certain value. If we have a hypothesis about the population mean, then we also know what the corresponding sampling distribution would look like - we know the probability of any possible sample given that hypothesis. We then take an actual sample, measure the distance of our sample mean from the hypothesized population mean, and use the facts of the sampling distribution to determine the probability of obtaining such a sample //assuming the hypothesis is true.// Intuitively, if the probability of our sample (given the hypothesis) is high, this provides evidence that the hypothesis is true.
+	* A SIGNIFICANCE TEST yields a probability that indicates exactly how well or poorly the data and the hypothesis agree.
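+	* The confidence-interval recipe above can be sketched with numpy's current API (the population parameters 70 and 4 and the sample size are invented for illustration):

```python
import numpy as np

# hypothetical sample drawn from an assumed population (mean 70, SD 4)
rng = np.random.default_rng(42)
sample = rng.normal(loc=70, scale=4, size=100)

x_bar = sample.mean()              # sample mean
s = sample.std(ddof=1)             # unbiased estimate of sigma
se = s / np.sqrt(len(sample))      # standard error of the mean
ci = (x_bar - 2 * se, x_bar + 2 * se)  # approximate 95% CI
print("mean: %.2f, 95%% CI: (%.2f, %.2f)" % (x_bar, ci[0], ci[1]))
```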
+
+=== 3.12 The Null Hypothesis ===
+	* we are interested in evidence against the null hypothesis, since this is evidence for some real statistically significant result. This is what a formal significance test does: it determines if the result provides sufficient evidence against the null hypothesis for us to reject it.
+	* In order to achieve a high degree of skepticism about the interpretation of the data, we require the evidence against the null hypothesis to be very great.
+
+=== 3.13 Z-scores ===
+	* z is called the STANDARDIZED VALUE or the Z-score, and is also referred to as a TEST STATISTIC
+	* {{./equation017.png?type=equation}}
+	* z-scores are a quick and accepted way of expressing 'how far away' from a hypothesized value an observation falls, and of determining whether that observation is beyond some accepted threshold.
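+	* The z-score formula can be checked with the numbers used in the 3.11 example (hypothesized mean 70, population SD 4, n = 11, observed sample mean 60):

```python
import math

mu0 = 70          # hypothesized population mean
sigma = 4.0       # population SD (assumed known here)
n = 11
x_bar = 60        # observed sample mean

# z-score: distance of the sample mean from mu0 in standard-error units
z = (x_bar - mu0) / (sigma / math.sqrt(n))
print("z-score:", z)   # about -8.29
```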
+
+=== 3.14 P-values ===
+	* source: [[./ex3-14pvalues.py]]
+	* How much of the total probability lies beyond the observed value, out into the tail of the distribution.
+		* Discrete →  sum of probabilities
+		* Continuous → area under the curve
+	* The p-value of a statistical test is the probability, computed assuming H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed.
+	* CONDITIONAL PROBABILITY: it is the probability of observing a particular sample mean (or something more extreme) conditional on the assumption that the null hypothesis is true; we can write this as P(Data | H0)
+	* The p-value does not measure the probability of the null hypothesis given the data P(H0 | Data)
+	* if P(Data | H0) is low, we say the LIKELIHOOD of the hypothesis is low.
+	* To determine p-value: Simply integrate the area under the normal curve, going out from our observed value.
+	* rule of thumb: About 95% of the probability lies within 2 SD of the mean. The remainder is split into two tails, one at each end of the distribution, each representing a probability of about 0.025.
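+	* Instead of integrating the density by hand, the same tail areas come straight from the normal CDF; this also verifies the 2-SD rule of thumb:

```python
from scipy import stats

# tail probability beyond z = -2 under the standard normal
p_lower = stats.norm.cdf(-2)                          # about 0.023
# probability mass within 2 SD of the mean
within_2sd = stats.norm.cdf(2) - stats.norm.cdf(-2)   # about 0.954
print("lower tail:", p_lower)
print("within 2 SD:", within_2sd)
```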
+
+=== 3.15 Hypothesis testing: a more realistic scenario ===
 	* source:
+	* Just as in the case of computing real world confidence intervals:
+		* instead of σ we use the unbiased estimator s;
+		* instead of {{./equation018.png?type=equation}} we use the unbiased estimator {{./equation019.png?type=equation}};
+		* instead of the normal distribution we use the t-distribution
+	* //recall definition of statistics: a number that describes some aspect of the sample.//
+	* t-statistic - replace σ in the z-score with the estimate s:
+		{{./equation020.png?type=equation}}
+	* Note a rather subtle point: we can have samples with the same mean value but different t-scores, since the SD s of the samples may differ. A t-score could even be numerically identical to the z-score, but the probability associated with the score will differ slightly, since we use the t-distribution, not the normal distribution.
+	* Suppose our null hypothesis H0 is that the observed mean {{./equation021.png?type=equation}} is equal to the hypothesized mean µ0. Then rejecting the null hypothesis amounts to accepting the alternative hypothesis, i.e. that the observed value is less than the mean or the observed value is greater than the mean: {{./equation022.png?type=equation}} .
+	* This means that as evidence for rejection of H0 we will count extreme values on both sides of µ . For this reason, the above test is called a **two-sided significance test.**
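+	* A sketch of the two-sided one-sample t-test described above, using a simulated sample (the numbers mirror the ex3-15 setup - sample mean near 60 versus hypothesized mean 70 - but the sample itself is invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=60, scale=4, size=11)  # hypothetical sample
mu0 = 70                                       # hypothesized mean

x_bar = sample.mean()
s = sample.std(ddof=1)                         # unbiased estimator s
t_manual = (x_bar - mu0) / (s / np.sqrt(len(sample)))
# two-sided p-value from the t-distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(sample) - 1)

# scipy's built-in one-sample t-test gives the same answer
t_sp, p_sp = stats.ttest_1samp(sample, mu0)
print("t = %.3f, two-sided p = %.3g" % (t_manual, p_manual))
```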
+
+=== 3.16 Comparing 2 samples ===
+	* source:
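+	* The source file for this section is not listed yet; as a placeholder sketch, scipy's two-sample t-test can compare two samples directly (Welch's variant, which does not assume equal variances; all sample parameters here are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=70, scale=4, size=30)  # hypothetical sample A
b = rng.normal(loc=65, scale=5, size=25)  # hypothetical sample B

# Welch's t-test: H0 is that the two population means are equal
t, p = stats.ttest_ind(a, b, equal_var=False)
print("t = %.3f, p = %.4f" % (t, p))
```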
+

# File StatSimulationBased/equation013.tex

+\bar{x}

# File StatSimulationBased/equation014.tex

+SE_{\bar{x}}

# File StatSimulationBased/equation015.tex

+\sigma_{\bar{x}}

# File StatSimulationBased/equation016.tex

+\bar{x} \pm 2 * SE_{\bar{x}}

# File StatSimulationBased/equation017.tex

+z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} =
+\frac{\bar{x} - \mu_0}{\sigma / \sqrt{n} }

# File StatSimulationBased/equation018.tex

+\sigma_{\bar{x}}

# File StatSimulationBased/equation019.tex

+SE_{\bar{x}}

# File StatSimulationBased/equation020.tex

+t = \frac{\bar{x}- \mu_0}{ SE_{\bar{x}}}
+= \frac{\bar{x} - \mu_0}{s/\sqrt{n}}

# File StatSimulationBased/equation021.tex

+\bar{x}

# File StatSimulationBased/equation022.tex

+H_a: \bar{x} < \mu_0 ~ or ~ \mu_0 < \bar{x}

# File StatSimulationBased/ex3-11nullhypo.py

+# -*- coding: utf-8 -*-
+"""
+Created on Sun Aug 21 13:18:35 2011
+
+Null hypothesis
+@author: -
+"""
+import scipy
+import scipy.stats
+import matplotlib.pyplot as plt
+
+pop_mean = 70   # mean of population
+pop_std = 4     #stdeviation of population
+
+sample_size = 11
+sample_mean = 60
+sample_std = 1.2
+
+SD_distribution = pop_std / scipy.sqrt(sample_size)
+
+range1 = scipy.arange(55, 85, 0.01)
+
+sample = scipy.stats.norm.rvs(size = sample_size,
+                              loc = sample_mean,
+                              scale =  pop_std)
+real_sample_mean = scipy.mean(sample)
+
+plt.plot(range1,
+        scipy.stats.norm.pdf(range1, loc = pop_mean,
+                             scale = SD_distribution)
+         )
+plt.axvline(x = real_sample_mean)
+plt.title("The null hypothesis")
+
+plt.show()
+
+#example for 3.13 Z-scores
+z_score = (sample_mean - pop_mean)/(pop_std/scipy.sqrt(sample_size))
+
+print "Z-score: ", z_score

# File StatSimulationBased/ex3-14pvalues.py

+# -*- coding: utf-8 -*-
+"""
+Created on Sun Aug 21 14:32:39 2011
+
+Calculating p-values
+@author: -
+"""
+import scipy
+import scipy.stats
+import matplotlib.pyplot as plt
+
+
+mu0 = 70 #mean of population
+sigma = 4.0 # stdev of population
+sample_size = 11
+trials  = 1000
+pop_zscore = -8.291562
+
+sample_means = scipy.zeros(trials)
+zs = scipy.zeros(trials) #z-scores
+
+norm_gen = scipy.stats.norm(loc = mu0, scale = sigma)
+for i in scipy.r_[0:trials]:
+    sample = norm_gen.rvs(size = sample_size)
+    sample_means[i] = scipy.mean(sample)
+    zs[i] = (sample_means[i] - mu0) / (sigma/scipy.sqrt(sample_size))
+
+sd_dist = sigma/scipy.sqrt(sample_size)
+
+plt.figure(1)
+plt.subplot(321, title = "Histogram of sample means")
+plt.hist(sample_means)
+
+plt.subplot(322, title = "Density of sample means")
+#taken from:
+#http://stackoverflow.com/questions/4150171/how-to-create-a-density-plot-in-matplotlib
+density = scipy.stats.gaussian_kde(sample_means)
+density.covariance_factor = lambda : .25
+density._compute_covariance()
+plt.plot(scipy.sort(sample_means),
+         density(scipy.sort(sample_means)))
+
+
+plt.subplot(323, title =  "Histogram of z-score")
+plt.hist(zs)
+
+plt.subplot(324, title = "Density of z-scores")
+#taken from:
+#http://stackoverflow.com/questions/4150171/how-to-create-a-density-plot-in-matplotlib
+density = scipy.stats.gaussian_kde(zs)
+density.covariance_factor = lambda : .25
+density._compute_covariance()
+plt.plot(scipy.sort(zs),
+         density(scipy.sort(zs)))
+
+
+plt.subplot(325, title = "Density of population")
+range2 = scipy.arange(mu0 - (sigma * sd_dist),
+                      mu0 + (sigma * sd_dist), 0.1)
+plt.plot(range2,
+    scipy.stats.norm.pdf(range2, loc = mu0,
+            scale = sigma/scipy.sqrt(sample_size)))
+
+plt.subplot(326, title = "Density of normal distribution")
+range3 = scipy.arange(-sigma, sigma, 0.1)
+plt.plot(range3,
+         scipy.stats.norm.pdf(range3, loc = 0, scale = 1))
+
+plt.show()
+
+#calculate p-values
+import scipy.integrate
+
+func = lambda x: scipy.stats.norm.pdf(x, loc = 0, scale = 1)
+print "zscore p-value: ", scipy.integrate.quad(func, -scipy.Inf, pop_zscore)
+
+func = lambda x: scipy.stats.norm.pdf(x,loc = mu0, scale =sigma/scipy.sqrt(sample_size))
+print "P-value for outlier", scipy.integrate.quad(func, -scipy.Inf, 60) #is 60 correct?
+
+#now p-value for hypothetical sample with mean = 67.58
+sd = (67.58 - mu0)/(sigma/scipy.sqrt(sample_size))
+func = lambda x: scipy.stats.norm.pdf(x, loc = 0, scale = 1)
+val = scipy.integrate.quad(func, -scipy.Inf, sd)
+print "P-value for hypothetical samples sd:", val

# File StatSimulationBased/ex3-15realhypo.py

+# -*- coding: utf-8 -*-
+"""
+Created on Sun Aug 21 17:37:52 2011
+
+Hypothesis testing: A More realistic scenario
+@author: -
+"""
+
+import scipy
+import scipy.stats
+import matplotlib.pyplot as plt
+
+print "Let's simulate the sampling distribution of the t-statistic and\
+ compare it to the distribution of the z-statistic"
+
+trials = 10000
+sample_size = 4
+mu0 = 70 #mean of population
+sigma = 4 #stdev of population
+
+ts = scipy.zeros(trials)
+zs = scipy.zeros(trials)
+
+norm_gen = scipy.stats.norm(loc = mu0, scale = sigma)
+for i in scipy.r_[0:trials]:
+    sample = norm_gen.rvs(size = sample_size)
+    zs[i] = (scipy.mean(sample) - mu0)/(sigma/ scipy.sqrt(sample_size))
+    #ddof = 1 gives the unbiased estimator s, as the t-statistic requires
+    ts[i] = (scipy.mean(sample) - mu0)/(scipy.std(sample, ddof = 1)/scipy.sqrt(sample_size))
+
+#taken from:
+#http://stackoverflow.com/questions/4150171/how-to-create-a-density-plot-in-matplotlib
+density = scipy.stats.gaussian_kde(zs)
+density.covariance_factor = lambda : .25
+density._compute_covariance()
+
+plt.figure(1)
+plt.subplot(221, title = "Sampling distribution of z")
+plt.plot(scipy.sort(zs), density(scipy.sort(zs)))
+
+plt.subplot(222, title = "Sampling distribution of t")
+#fit a separate KDE to the t-scores (the KDE above was fitted to zs)
+density_t = scipy.stats.gaussian_kde(ts)
+density_t.covariance_factor = lambda : .25
+density_t._compute_covariance()
+plt.plot(scipy.sort(ts), density_t(scipy.sort(ts)))
+
+plt.subplot(223, title = "Limiting case: normal distribution")
+rng = scipy.arange(-sigma, sigma, 0.1)
+vals = scipy.stats.norm.pdf(rng, loc = 0, scale = 1)
+plt.plot(vals)
+
+plt.subplot(224, title = "Limiting case: t-distribution")
+vals = scipy.stats.t.pdf(rng, 3)
+plt.plot(vals)
+
+plt.show()
+
+print "example t-score:", (67.58 - 70)/(4/scipy.sqrt(11))
+
+#example of how to use rpy2, since scipy's t-test support was limited at the time
+import rpy2.robjects as robjects
+robjects.r('''sample.11<-rnorm(11,mean = 60, sd = 4)''')
+robjects.r('''testtext <- t.test(sample.11, alternative = "two.sided",
+       mu=70, conf.level =0.95)''')
+print robjects.globalenv['testtext']

 I tried to keep the code examples as authentic as possible to the author's originals, and in most cases that was possible.

 To view the short conspectus, you need Zim with the LaTeX plugin installed.
+
+Some requirements:
+
+1. scipy, scipy.stats
+2. matplotlib
+3. rpy2
+
+Feel free to add your own contributions!