statsimulations / StatSimulationBased.txt

Content-Type: text/x-zim-wiki
Wiki-Format: zim 0.4
Creation-Date: 2011-08-07T21:42:01+03:00

====== StatSimulationBased ======
Created Sunday 07 August 2011

=== About ===
	The Foundations of Statistics: A Simulation-based Approach,
	by Shravan Vasishth and Michael Broe
	{{./Screenshot - 08202011 - 03:31:26 PM.png}}

=== Content ===
	**1. Getting Started.**
	**2. Randomness and Probability.**
	**3. The Sampling Distribution of the Sample Mean.**
	4. Power.
	5. Analysis of Variance (ANOVA).
	6. Bivariate Statistics and Linear Models.
	7. An Introduction to Linear Mixed Models.

=== Some simple commands in NumPy ===
# creating an array
>>> import numpy as np
>>> scores = np.int_([99, 97, 72, 56, 88, 80, 74, 95, 66, 57, 89])

# minimum and maximum values of the given array
>>> scores.max()
99
>>> scores.min()
56

>>> scores.mean()
79.36363636363636

**Variance** tells you how far away the individual scores are from the mean score on average; it is computed from the squared deviations from the mean:

>>> scores.var()
4: 219.68595041322314

* **Standard Deviation**
	* **Why do we divide by n-1 and not n?**
		1. The sum of deviations from the mean is always zero, so if we know n-1 of the deviations, the last deviation is predictable.
	* The independent ('unrelated') numbers that go into the estimate are what is called the degrees of freedom.
>>> scores.std()
14.821806583990467
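A note on the NumPy calls above (my own aside, not from the book): scores.var() and scores.std() divide by n by default (ddof=0), which is why the variance printed above is 219.69 rather than the n-1 version discussed here. Passing ddof=1 gives the estimates that divide by n-1:
>>> scores.var(ddof=1)    # ~241.65, divides by n-1
>>> scores.std(ddof=1)    # ~15.55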

* **Median** - the midpoint of the distribution: the middle value of the scores sorted in increasing order
>>> np.median(scores)
80.0

* **Quartiles/Percentiles**
	* the quartiles Q1 and Q3 are measures of spread around the median - they are the medians of the observations below (Q1) and above (Q3) the 'grand' median
	* **Interquartile range (IQR)**: Q3 - Q1
	* 5-number summary: **MIN, Q1, MEDIAN, Q3, MAX**
>>> np.percentile(scores, 25)
69.0
>>> np.percentile(scores, [25, 75])
[69.0, 92.0]

=== Graphical summaries ===

* **Boxplot** - essentially shows the 5-number summary. The box in the middle has a line going through it: that's the median. The lower and upper ends of the box are Q1 and Q3 respectively, and the two 'whiskers' at either end of the box extend to the minimum and maximum values.
>>> import matplotlib.pyplot as plt
>>> plt.boxplot(scores)
* **Histogram** - shows the number of scores that occur within particular ranges (bins).
>>> plt.hist(scores)

=== Randomness and probability ===

Many random phenomena have the following property:
	while they are unpredictable in specific individual cases, they follow predictable laws in the aggregate.
* **The sum and product rules**
	* **Probability mass** - //the total 'weight' of an event over all the logically possible outcomes.//
	1. **Sum Rule: **The probability that any one of several mutually exclusive events occurs is the sum of the probabilities of the individual events.

	2. **Product Rule: **When two or more events are independent, the probability of all of them occurring is the product of their individual probabilities.
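	A quick worked example of both rules (my own, with a fair coin): the sum rule gives P(heads or tails) = 0.5 + 0.5 = 1 for a single toss, and the product rule gives P(two heads in two independent tosses) = 0.5 × 0.5 = 0.25.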

# generating 10 random binomial values
# one-stone example:
>>> from scipy import stats
>>> stats.binom.rvs(1, 0.5, size = 10)
>>> np.sum(stats.binom.rvs(1, 0.5, size = 10)) * 1.0 / 10
0.40000000000000002
>>> np.sum(stats.binom.rvs(1, 0.5, size = 1000)) * 1.0 / 1000
0.48199999999999998
# 40 stones
>>> stats.binom.rvs(40, 0.5, size = 10)
array([12, 15, 29, 20, 18, 21, 19, 20, 17, 19])
# plotting 1000 experiments
>>> results = stats.binom.rvs(40, 0.5, size = 1000)
>>> plt.hist(results, bins = 40)

**The Binomial Distribution**
The binomial theorem allows us to compute the probability of k Right-stone hits (successes) when we make n observations (trials), when the probability of a Right-stone hit (success) is p: P(X = k) = C(n, k) · p^k · (1 - p)^(n - k), where C(n, k) is the binomial coefficient.
The binomial theorem can be applied whenever there are only two possible primitive outcomes, the fixed n trials are mutually independent, and the probability p of a 'success' is the same for each trial.

# The number of ways we can arrange 3 R's in 4 positions - aka finding the **binomial coefficient**
# (in newer SciPy versions this function lives in scipy.special.comb)
>>> import scipy.misc
>>> scipy.misc.comb(4, 3)
array(4.000000000000001)
>>> scipy.misc.comb(4, [1, 4])
array([ 4.,  1.])
>>> outcomes = scipy.misc.comb(40, [x for x in xrange(0, 41)])    # all possible counts 0..40
>>> plt.plot(outcomes)
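To turn these counts into actual probabilities, here is a small sketch (my own code, not the book's) that plots the exact binomial probabilities for 40 stones with p = 0.5 via scipy.stats.binom.pmf; its shape should match the simulated histogram of 1000 experiments above:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n, p = 40, 0.5
k = np.arange(0, n + 1)           # every possible number of Right-stone hits
pmf = stats.binom.pmf(k, n, p)    # exact probability of each count

plt.bar(k, pmf)
plt.xlabel("number of Right-stone hits out of 40")
plt.ylabel("probability")
plt.show()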

=== 2.3 Practical example Balls in a box ===
//Suppose we have 12,000 balls in a big box, and we know that 9,000 (3/4) are red and the rest are white. We say we have a population of 12,000. Suppose we take a RANDOM SAMPLE of 100 balls from these 12,000. We'd expect to draw about 75 red balls. What's the probability of getting exactly 75? source: //[[./|Balls in box]]

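One way to answer the question above directly in scipy (my own sketch, not the book's code): treat the number of red balls in the sample as approximately binomial with n = 100 and p = 0.75; the hypergeometric call gives the exact without-replacement answer for comparison.

from scipy import stats

# binomial approximation (fine because the population is large relative to the sample)
p_binom = stats.binom.pmf(75, 100, 0.75)              # ~0.09

# exact without-replacement answer: 12,000 balls, 9,000 red, 100 drawn
p_hyper = stats.hypergeom.pmf(75, 12000, 9000, 100)

print(p_binom, p_hyper)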
A number that describes some aspect of a sample is called **a statistic**. The particular statistic we are computing here is the **sample count**, and if we plot the results we will be able to get an idea of the **sampling distribution** of this statistic.

**As the sample size goes up, the probability of the most likely sample count goes down. The spread, or standard deviation, decreases as we increase sample size.**
Demo in file: [[./]]
* In the binomial distribution - most of the probability is clustered around the mean
* Most important conceptual steps in statistical inference:
	//If the sample count is within 6 of the mean 95% of the time, then 95% of the time the mean is within 6 of the sample count.//
* The accuracy of the confidence interval increases with sample size.
* **Statistic** describes some aspect of a sample, a **Parameter** describes some aspect of a population.
* The spread, or **standard deviation**, decreases as we increase sample size.
* Mean minimizes variance.

=== The binomial versus the Normal Distribution ===
The normal distribution has density f(x) = 1/(σ√(2π)) · exp(-(x - µ)² / (2σ²)), with mean µ and standard deviation σ.

One important difference between the normal and binomial distributions is that the former refers to a continuous dependent variable, whereas the latter refers to a discrete binomial variable.

==== Chapter 3. The sampling distribution of the sample mean. ====
* Standard deviation of the distribution of means gets smaller as we increase sample size.
* As the sample size is increased, the mean of the sample means comes closer and closer to the population mean mu_x.
* There is a lawful relationship between the standard deviation σ of the population and the standard deviation of the distribution of means: σ_x̄ = σ / √n.
* **Central limit theorem:** Provided the sample size is large enough, the sampling distribution of the sample mean will be close to normal irrespective of what the population's distribution looks like.
* The sampling distributions of various statistics (the sampling distribution of the sample mean, or sample proportion, or sample count) are nearly normal. The normal distribution implies that a sample statistic that is close to the mean has a higher probability than one that is far away.
* The mean of the sampling distribution of the sample mean is the same as the population mean.
* It follows from the above two facts that the mean of a sample is more likely to be close to the population mean than not.
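A simulation sketch of the central limit theorem in the book's simulation-based spirit (my own code): draw many samples from a clearly non-normal (exponential) population and histogram the sample means; the result is roughly normal and centered near the population mean.

import numpy as np
import matplotlib.pyplot as plt

sample_size = 40
n_samples = 1000

# exponential population with mean 1.0 (heavily right-skewed)
samples = np.random.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)     # one mean per sample

plt.hist(sample_means, bins=30)         # approximately normal, centered near 1.0
plt.show()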

=== s is an unbiased estimator of σ ===
	* source: [[./]]
	* we'll see that any one sample's standard deviation s is more likely to be close to the population standard deviation σ than far from it.
	* If we use s as an estimator of σ, we're more likely than not to get close to the right value: we say s is an unbiased estimator of σ. This is true even if the population is not normally distributed.
	* Notice that the Standard Error will vary from sample to sample, since the estimate s of the population parameter σ will vary from sample to sample. And of course, as the sample size increases the estimate s becomes more accurate, as does the SE, suggesting that the uncertainty introduced by this extra layer of estimation will be more of an issue for smaller sample sizes.
	* If we were to derive some value v for the SE and simply plug it into the normal distribution for the sample statistic, this would be equivalent to claiming that v really was the population parameter σ. What we require is a distribution whose shape has greater uncertainty built into it than the normal distribution.

=== The t-distribution ===

	* source: [[./]]
	* In the limit, if the sample were the size of the entire population, the t-distribution would be the normal distribution, so the t-curve becomes more normal as sample size increases.
	* This distribution is formally defined by the degrees of freedom and has more of the total probability located in the tails of the distribution. It follows that the probability of a sample mean being close to the true mean is slightly lower when measured by this distribution, reflecting our greater uncertainty.
	* standard error: {{./equation005.png?type=equation}}
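A small sketch (not from the book's code) comparing the t-distribution's heavier tails with the normal curve; with few degrees of freedom the t-curve puts visibly more probability in the tails, and it approaches the normal as the degrees of freedom grow.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-4, 4, 200)
plt.plot(x, stats.norm.pdf(x), label="normal")
plt.plot(x, stats.t.pdf(x, 3), label="t, df=3")      # heavy tails
plt.plot(x, stats.t.pdf(x, 30), label="t, df=30")    # nearly normal
plt.legend()
plt.show()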

=== The one-sample t-test ===

	* source:  [[./]]
	* q: How many SE's do we need to go to the left and right of the sample mean, within the appropriate t-distribution, to be 95% sure that the population mean lies in that range?
	* A: In the pre-computing days, people used to look up a table that told you, for n-1 degrees of freedom, how many SE's you need to go around the sample mean to get a 95% CI.
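The table lookup can be done directly in Python (my own sketch, reusing the scores array from the NumPy section above): stats.t.ppf plays the role of R's qt() and gives the number of SEs to go out for a 95% CI at n-1 degrees of freedom.

import numpy as np
from scipy import stats

n = len(scores)                           # scores array defined earlier
se = scores.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_crit = stats.t.ppf(0.975, n - 1)        # how many SEs for a 95% CI
ci = (scores.mean() - t_crit * se, scores.mean() + t_crit * se)
print(ci)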

=== Some observations on Confidence Intervals ===
	* source: [[./]]
	* One important point to notice is that the range defined by the confidence interval will vary with each sample even if the sample size is kept constant. The reason is that the sample mean will vary each time, and the standard deviation will vary too.
	* The sample mean and standard deviation are likely to be close to the population mean and standard deviation, but they are ultimately just estimates of the true parameters.
	* A '95%' confidence interval means: it is a statement about the probability that the hypothetical confidence intervals (that would be computed from hypothetical repeated samples) will contain the population mean.
	* When we compute a 95% confidence interval for a particular sample, we have only one interval. Strictly speaking, that particular interval does not mean that the probability that the population mean lies within that interval is 0.95. For that statement to be true, the population mean would have to be a random variable.
	* The population mean is a single point value that cannot have a multitude of possible values and is therefore not a random variable. If we relax this assumption that the population mean is a point value, and assume instead that the population mean is in reality a range of possible values, then we could say that any one 95% confidence interval represents the range within which the population mean lies with probability 0.95.
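A simulation sketch of the '95%' interpretation (my own code, arbitrary population values): repeatedly sample from a known population, build a 95% CI from each sample, and count how often the interval contains the true mean; the proportion comes out near 0.95.

import numpy as np
from scipy import stats

true_mean, true_sd = 60.0, 4.0            # assumed population, chosen for illustration
n, n_experiments = 40, 1000
hits = 0

for _ in range(n_experiments):
    sample = np.random.normal(true_mean, true_sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, n - 1)
    lower, upper = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    if lower <= true_mean <= upper:
        hits += 1

print(hits / float(n_experiments))        # close to 0.95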

=== Sample SD, degrees of freedom, unbiased estimators ===

	* source: [[./]]
	* **Sample standard deviation** s is the square root of the sample variance: roughly, the average distance of the numbers in the list from the mean of the numbers. {{./equation006.png?type=equation}}

=== Summary of the sampling process ===
	* summary of the notation used
	**the sample statistic**			**an unbiased estimate of**
	sample mean {{./equation007.png?type=equation}}		population mean µ
	sample SD s					population SD σ
	standard error SE_x			the SD of the sampling distribution {{./equation008.png?type=equation}}
	* **statistical inference** involves a single sample value but assumes knowledge of the sampling distribution which provides probabilities for all possible sample values.
	* The **statistic** (e.g. the mean) in a random sample is more likely to be close to the **population parameter** (the population mean) than not. This follows from the normal distribution of the sample means.
	* In the limit, the **mean of the sampling distribution** is equal to the population parameter.
	* The further away a **sample statistic** is from the mean of the sampling distribution, the lower the probability that such a sample will occur.
	* The standard deviation of the sampling distribution {{./equation009.png?type=equation}} is partially determined by the inherent variability σ in the population, and partially determined by the sample size. It tells us how steeply the probability falls off from the center.
		* If {{./equation010.png?type=equation}} is small, then the fall-off in probability is steep: //random samples are more likely to be very close to the mean, samples are better indicators of the population parameters, and inference is more certain//.
		* If {{./equation011.png?type=equation}} is large, then the fall-off in probability from the center is gradual: //random samples far from the true mean are more likely, samples are not such good indicators of the population parameters, and inference is less certain.//
	* While we do not know {{./equation012.png?type=equation}} , we can estimate it using SE_x and perform inference using a distribution that is almost normal, but reflects the increase in uncertainty arising from this estimation: **the t-distribution**.

=== 3.11 Significance Tests ===
	* source: [[./]]
	* recall the discussion of 95% confidence intervals:
		* The sample gives us a mean {{./equation013.png?type=equation}} . 
		* We compute {{./equation014.png?type=equation}} (an estimate of {{./equation015.png?type=equation}}) using s (an estimate of σ) and the sample size n
		* Then we calculate the range {{./equation016.png?type=equation}} - that's the 95% CI
	* NULL HYPOTHESIS: Suppose we have a hypothesis that the population mean has a certain value. If we have a hypothesis about the population mean, then we also know what the corresponding sampling distribution would look like - we know the probability of any possible sample given that hypothesis. We then take an actual sample, measure the distance of our sample mean from the hypothesized population mean, and use the facts of the sampling distribution to determine the probability of obtaining such a sample //assuming the hypothesis is true.// Intuitively, if the probability of our sample (given the hypothesis) is high, this provides evidence that the hypothesis is true.
	* A SIGNIFICANCE TEST yields a probability that indicates exactly how well or poorly the data and the hypothesis agree.

=== 3.12 The Null Hypothesis ===
	* we are interested in evidence against the null hypothesis, since this is evidence for some real statistically significant result. This is what a formal significance test does: it determines if the result provides sufficient evidence against the null hypothesis for us to reject it.
	* In order to achieve a high degree of skepticism about the interpretation of the data, we require the evidence against the null hypothesis to be very great.

=== 3.13 Z-scores ===
	* z is called the STANDARDIZED VALUE or Z-score, and is also referred to as a TEST STATISTIC
	* {{./equation017.png?type=equation}}
	* z-scores are a quick and accepted way of expressing 'how far away' from a hypothesized value an observation falls, and of determining whether that observation is beyond some accepted threshold.

=== 3.14 P-values ===
	* source: [[./]]
	* How much of the total probability lies beyond the observed value, out into the tail of the distribution.
		* Discrete →  sum of probabilities
		* Continuous → area under the curve
	* The p-value of a statistical test is the probability, computed assuming H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed.
	* CONDITIONAL PROBABILITY: it is the probability of observing a particular sample mean (or something more extreme) conditional on the assumption that the null hypothesis is true; we can write this as P(Data | H0)
	* The p-value does not measure the probability of the null hypothesis given the data P(H0 | Data)
	* if P(Data | H0) is low, we say the LIKELIHOOD of the hypothesis is low.
	* To determine p-value: Simply integrate the area under the normal curve, going out from our observed value.
	* rule of thumb: About 95% of the probability is within 2 SDs of the mean. The remainder is split in two, one piece at each end of the distribution, each representing a probability of about 0.025.
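A quick sketch of turning a z-score into a p-value with scipy (my own code, hypothetical number): stats.norm.sf integrates the upper tail beyond the observed value; doubling it gives the two-sided p-value.

from scipy import stats

z = 2.1                                   # hypothetical observed z-score
p_one_sided = stats.norm.sf(z)            # tail area beyond z, ~0.018
p_two_sided = 2 * stats.norm.sf(abs(z))   # extremes on both sides count, ~0.036
print(p_one_sided, p_two_sided)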

=== 3.15 Hypothesis testing: a more realistic scenario ===
	* source:
	* Just as in the case of computing real world confidence intervals:
		* instead of σ we use the unbiased estimator s; 
		* instead of {{./equation018.png?type=equation}} we use the unbiased estimator {{./equation019.png?type=equation}};
		* instead of the normal distribution we use the t-distribution
	* //recall the definition of a statistic: a number that describes some aspect of the sample.//
	* t-statistic - replace σ in the z-score with the estimate s: t = (x̄ - µ0) / SE_x, where SE_x = s / √n
	* Note a rather subtle point: we can have samples with the same mean value but different t-scores, since the SD s of the samples may differ. A t-score may even be identical to the z-score, but the probability associated with the score will differ slightly, since we use the t-distribution, not the normal distribution.
	* If our null hypothesis H0 is that the observed mean {{./equation021.png?type=equation}} is equal to the hypothesized mean µ0, then rejecting the null hypothesis amounts to accepting the alternative hypothesis, i.e. that the observed value is less than the mean or the observed value is greater than the mean: {{./equation022.png?type=equation}} .
	* This means that as evidence for rejection of H0 we will count extreme values on both sides of µ. For this reason, the above test is called a **two-sided significance test.**
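A sketch of this two-sided test with scipy (my own code, using the scores array from above and a made-up null mean of 70): stats.ttest_1samp returns exactly this two-sided p-value.

import numpy as np
from scipy import stats

mu0 = 70.0                                      # hypothetical null-hypothesis mean
n = len(scores)
se = scores.std(ddof=1) / np.sqrt(n)
t_score = (scores.mean() - mu0) / se            # the t-statistic defined above

p_two_sided = 2 * stats.t.sf(abs(t_score), n - 1)
print(t_score, p_two_sided)

# the same test in one call:
print(stats.ttest_1samp(scores, mu0))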

=== 3.16 Comparing 2 samples ===
	* source: [[./]]
	* we can state our null hypothesis as H0: μ1 = μ2, i.e. the null hypothesis is that the difference between the two means is zero.
	* conclusions from experiment:
		* the difference between the means of two samples follows a normal distribution and is **centered around the true difference between the two populations.**
		* the precise relationship for the spread of that difference: σ_(x̄1-x̄2) = √(σ1²/n1 + σ2²/n2)
		* two-sample **z-score** - replace σ with the SD of the difference: z = ((x̄1 - x̄2) - (µ1 - µ2)) / √(σ1²/n1 + σ2²/n2)

		* two-sample **t-statistic**: t = ((x̄1 - x̄2) - (µ1 - µ2)) / √(s1²/n1 + s2²/n2)
		* translating this t-statistic to a p-value is problematic: the degrees of freedom needed for the correct t-distribution are not obvious. The t-distribution assumes that a single s replaces a single σ, but here we have two of them. If σ1 = σ2, we can just take a weighted average of the two sample SDs s1 and s2.
		* In that case the correct t-distribution has (n1 - 1) + (n2 - 1) degrees of freedom [proof: Rice, 1992, 422]
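A sketch of the two-sample comparison in scipy (my own code, made-up data): stats.ttest_ind does the pooled-variance test with n1 + n2 - 2 degrees of freedom, and equal_var=False gives Welch's version when the two population SDs cannot be assumed equal.

import numpy as np
from scipy import stats

group1 = np.random.normal(60, 4, 30)      # made-up samples for illustration
group2 = np.random.normal(62, 4, 30)

# pooled-variance t-test, df = n1 + n2 - 2
print(stats.ttest_ind(group1, group2))

# Welch's t-test, for when sigma1 = sigma2 cannot be assumed
print(stats.ttest_ind(group1, group2, equal_var=False))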

=== 4.1 Hypothesis testing revisited ===
	* source: [[./]]
	* **TYPE I ERROR:** the null hypothesis is **true**, but our sample leads us to **reject** it
	* **TYPE II ERROR:** the null hypothesis is **false,** but our sample leads us to **fail to reject** it.
Some conventions:
	* Let R = 'Reject the null hypothesis H0'
	* Let -R = 'Fail to reject the null hypothesis H0.'
	* The decision R or -R is based on the sample
NB! When we do an experiment we don't know whether the null hypothesis is true or not.
	1. Attempt to minimize error → how to measure it? Let P(R|H0) = "//Probability of rejecting the null hypothesis conditional on the assumption that the null hypothesis is in fact true//."
	* Thus, if we want to decrease the chance of a Type II error, we need to increase the power of the statistical test.
	* The best situation is when we have relatively high power (low Type II error) and a low Type I error. By convention we keep α at 0.05, and we should not change it → lowering α reduces power.
	* If we have a relatively narrow CI and a nonsignificant result (p > 0.05), then we have relatively high power and a relatively low probability of making a Type II error, i.e. of accepting the null hypothesis as true when it is in fact false.
	* Observed power provides no new information once the p-value is known: if the p-value is high, we already know the observed power is low; there is nothing gained by computing it.
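A simulation sketch of power (my own code, arbitrary numbers): assume a true effect of a given size, repeatedly draw samples, run the t-test each time, and count how often p < 0.05; that proportion estimates the power, i.e. 1 minus the Type II error rate.

import numpy as np
from scipy import stats

mu0, true_mean, sd = 60.0, 62.0, 4.0      # null mean, assumed true mean, assumed SD
n, n_sim = 20, 2000
rejections = 0

for _ in range(n_sim):
    sample = np.random.normal(true_mean, sd, n)
    t_stat, p_value = stats.ttest_1samp(sample, mu0)
    if p_value < 0.05:
        rejections += 1

print(rejections / float(n_sim))          # estimated power for this effect size and n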

=== 4.3.1 Equivalence testing examples ===
	* source:[[./]]
	* TOST method's algorithm:
		* define an equivalence threshold Θ
		* compute two one-sided t-tests:
		* compute the critical t-value (in R: qt(0.95, DF), where DF = n1 + n2 - 2; n1, n2 are the sizes of the two groups)
		* reject H0 if {{./equation028.png?type=equation}}

===== 5 Analysis of Variance (ANOVA) =====
	* source: [[./]]
	* Gauss noticed that the observational error had a particular distribution: there were more observations close to the truth than not, and errors overshot and undershot with equal probability. The errors in fact have a normal distribution. Thus if we average the observations, the errors tend to cancel themselves out.
	* Any sample mean can be thought of as 'containing' the true population mean plus an error term:
		* {{./equation029.png?type=equation}}

=== 5.2 Statistical models ===
	* source: 
	* Characterizing a sample mean as the population mean plus an error term is perhaps the simplest possible example of building a STATISTICAL MODEL:
		* {{./equation030.png?type=equation}}
		* This allows us to compare statistical models and decide which one better characterizes the data (a powerful idea).
	* If an effect α_j is present, the variation between groups increases because of the systematic differences between groups: //the between-group variation is due to error variation plus variation due to α_j. So the null hypothesis becomes://
		* {{./equation031.png?type=equation}}   
	* As the sample size goes up, the sample means will be tighter, and the variance will go down, but it will always be positive and skewed right, and thus the mean of this sampling distribution will always overestimate the true parameter. 

=== 5.2.3 Analyzing the variance ===
	* source: [[./]]
	* (//recall that i ranges across participants within a group, and j ranges across groups//):
	* {{./equation032.png?type=equation}}
	* That is, the difference between any value and the grand mean is equal to the sum of (I) the difference between that value and its group mean and (II) the difference between its group mean and the grand mean.
	* SUM OF SQUARES (SS-Total) is the sum of the SS-between and SS-within:
	* To get to the variances within and between each group, we simply need to divide each SS by the appropriate degrees of freedom.
	* The DF-total and DF-between are analogous to the case for the simple variance {{./equation034.png?type=equation}}
	//The number of scores minus the number of parameters estimated gives you the degrees of freedom for each variance://
	* {{./equation035.png?type=equation}}
	* Another term for a variance is the MEAN SQUARE (MS), which is the term used in ANOVA.
		* {{./equation036.png?type=equation}}

=== 5.3.3 Hypothesis testing ===
	* source: [[./]]
	* The null hypothesis amounts to saying that there is no effect of αj: that any between group variance we see is completely attributable to within group variance:
	* The key idea of ANOVA: //when the group means are in fact identical, the variances estimated by these two routes (between and within) are both very close to the population variance.//
	* The F-STATISTIC is precisely analogous to a t-statistic, and the accompanying sampling distribution - the F-distribution - can be used precisely like a t-curve to compute the p-value for a result.
	* F-distribution is defined as F(DFa, DFb), where DFa is the degrees of freedom of the MS-between (numerator) and DFb is the degrees of freedom of the MS-within (denominator).
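A sketch of the F computation with scipy (my own code, made-up groups): compute MS-between and MS-within by hand and check the F-ratio against stats.f_oneway, which returns the same F together with its p-value from the F(DFa, DFb) distribution.

import numpy as np
from scipy import stats

groups = [np.random.normal(60, 4, 10) for _ in range(3)]    # made-up data: 3 groups of 10
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1                # number of groups minus 1
df_within = len(all_scores) - len(groups)   # total scores minus number of groups

f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(f_ratio)
print(stats.f_oneway(*groups))              # same F, plus its p-value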

=== 5.3.4 MS-within, three non-identical populations ===
	* source: [[./]]
	* The null hypothesis is now in fact false.
	* **MS-within** computes the spread about the mean in each sample: the location of the mean in that sample is irrelevant. As long as the population variances remain identical, MS-within will always estimate this variance in an unbiased manner.
	* If the null hypothesis is in fact false (if the population means differ), then it's highly likely that MS-between is greater than MS-within, and that the F-ratio is significantly greater than 1.
	* When population means actually differ, for a given sample it is possible that MS-between is lower and that MS-within is higher than the population's variances.
	* A common rule of thumb is that //the results of ANOVA will be approximately correct if the largest standard deviation is less than twice the smallest standard deviation.//

=== 6. Bivariate statistics and linear models ===