---
title: Subtleties in controlling confounds in inferential statistics
subtitle: some surprising, some obvious in retrospect
author: Phillip M. Alday
institute: University of South Australia
date: 29 April 2016, Uni Adelaide
---
# When all you have is a hammer ...

Animate and inanimate words chosen as stimulus materials did not differ in word frequency ($p > 0.05$).

Controls and aphasics did not differ in age ($p > 0.05$).
# Control and ecological validity in conflict

- Experimental control is not possible in the brain & behavioural sciences in the same way as in the physical sciences:
    - truly artificial stimuli are problematic (novel as types & tokens, etc.)
    - natural stimuli are strongly confounded, with unclear causality and primacy
- Still, we try to match our stimulus, participant, etc. groups

(Sassenhagen & Alday, under review at B&L)
# What's the problem?

Animate and inanimate words chosen as stimulus materials did not differ in word frequency ($p > 0.05$).

Controls and aphasics did not differ in age ($p > 0.05$).
# Where do I start?

. . .

**Philosophy**

- You can't accept the null in NHST; you can only fail to reject it.

. . .

**Statistics**

- You're violating testing assumptions because, by design, you did not sample randomly.
- You're performing inferences on a population you don't care about.

. . .

**Pragmatics**

- You're failing to perform the inference you actually care about.
# Philosophy: Accepting the null

. . .

- Simply put, NHST has no notion of 'accepting' hypotheses, especially not the null.
- You can only reject a hypothesis as having a likelihood (a probability conditional on your data model) too low to be taken seriously.
# Statistics: Getting useless information

**Random sampling**

- You just aren't doing it, by any stretch of the imagination.
- You are actively trying to distort measures of both central location and spread.

**Populations vs. samples**

- Inferential statistics, including statistical tests, draw conclusions from the data present about the data absent.
- The absent data are things we don't care about:
    - the set of all animate vs. all inanimate nouns
    - the set of all possible patients vs. all possible controls
- Alternatively, we have a completely sampled population and there are no absent data.
- So just use descriptive statistics and make sure they match!
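As an illustrative sketch of "match, don't test" (the frequencies below are invented numbers, not real corpus data): your stimulus set is the whole "population", so there is nothing to infer; just compare the descriptives directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical (log) word frequencies for two stimulus groups;
# the numbers are invented for illustration, not real corpus data.
animate = rng.normal(loc=2.0, scale=0.5, size=40)
inanimate = rng.normal(loc=2.1, scale=0.5, size=40)

# The stimulus set is fully sampled: there is nothing to infer,
# so just compare descriptive statistics and match on them.
for name, grp in [("animate", animate), ("inanimate", inanimate)]:
    print(f"{name}: mean = {grp.mean():.2f}, sd = {grp.std(ddof=1):.2f}")
```

If the means and spreads are close on the scale you care about, the groups are matched; no $p$-value required.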
# Pragmatics: Testing what you care about

- Even if we could
    - accept the null and
    - pretend that we're sampling randomly
    - from a population we care about

we're still answering a boring question:

*Do these two populations differ systematically in the given feature?*

when we actually care about:

*Is the variance observed in my manipulation (better or at least partially) explained by differences in the given feature?*
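A minimal simulation of the point (all numbers invented for illustration): if the outcome is driven entirely by the confound (here, "frequency"), then putting the confound into the model reveals that the manipulation ("animacy") explains nothing; group matching plus a $t$-test cannot answer that question.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Invented scenario: animacy (0/1) is correlated with word frequency,
# but the outcome (say, an ERP amplitude) depends ONLY on frequency.
animacy = rng.integers(0, 2, size=n).astype(float)
freq = 0.8 * animacy + rng.normal(0.0, 1.0, size=n)
y = 2.0 * freq + rng.normal(0.0, 1.0, size=n)  # no true animacy effect

# Model the confound explicitly instead of matching-and-testing:
X = np.column_stack([np.ones(n), animacy, freq])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [intercept, animacy, frequency]: animacy estimate is near 0
```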
# What to do, what to do

- Stop using inferential tests for confound control.
- Try to match groups as closely as possible using purely descriptive statistics (reduce confounds and collinearity).
- If you can (and this 'can' is really a 'should'!), explicitly model these confounds as covariates.
    - Painful with ANOVA / ANCOVA / other 1970s statistics
    - Not a problem with modern (explicit) regression techniques like mixed-effects models
    - Which you really should be using anyway for many BBS designs [cf. @clark1973a; @judd.westfall.etal:2012pp; @westfallkennyjudd2014a]

. . .

- And thus you correctly use statistics to answer questions you care about.
# I scream, you scream ...

# Fresh off the presses

(DOI: 10.1371/journal.pone.0152719)
Arrows show true causal relationships
(All model diagrams from John Myles White)
Highlighting shows conditioning (in modelling)
# But what if I didn't measure the actual causal variable?

. . .

**Conditional probabilities and modelling** (so hard for frequentists)

- Conditioning effectively "blocks" a given path
- Unblocked paths allow for "spurious" correlations and false positives

. . .

Going against the arrow and not hitting a bright stop sign in time leads to confusion and trouble.
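The collider case can be simulated in a few lines (a sketch with arbitrary parameters): $X \rightarrow Z \leftarrow Y$ with no direct $X$-$Y$ link. Marginally, $X$ and $Y$ are independent; conditioning on the collider $Z$ opens the path and manufactures a spurious correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Collider structure: X -> Z <- Y, with no direct X-Y link.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)

# Marginally, X and Y are (nearly) uncorrelated ...
r_marginal = np.corrcoef(x, y)[0, 1]

# ... but conditioning on the collider Z opens the path: within a
# narrow band of Z, X and Y are strongly negatively correlated.
band = np.abs(z) < 0.5
r_band = np.corrcoef(x[band], y[band])[0, 1]
print(r_marginal, r_band)
```

Selecting stimuli or participants on a variable that is itself caused by your predictors does exactly this kind of conditioning, whether or not you meant it to.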
# Simulated for typical data
(DOI: 10.1371/journal.pone.0152719.g002)
# But all of this follows directly from the GLM

Standard GLM applications have a "vertical" error/variance term:

$$Y = \beta_0 + \beta_1 X_1 + \varepsilon$$
$$\varepsilon \sim N(0, \sigma)$$
. . .
In other words, we assume:
. . .
- (Measurement) error/variance occurs only in the dependent variable.
- We manipulate the independent variables directly and without error.
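A quick sketch of what goes wrong when that second assumption fails (arbitrary simulated numbers): measurement error in the predictor attenuates the estimated slope toward zero, even when the "vertical" error model is otherwise perfectly satisfied.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

# "Vertical" error only: y = 2*x + noise, exactly as the GLM assumes.
x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(scale=0.5, size=n)

# But suppose we can only *measure* x with error -- a situation the
# standard GLM assumes away:
x_obs = x_true + rng.normal(scale=1.0, size=n)

slope_true = np.polyfit(x_true, y, 1)[0]
slope_obs = np.polyfit(x_obs, y, 1)[0]
# With equal true and error variance, the slope is attenuated by ~1/2.
print(slope_true, slope_obs)
```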
# But doesn't subjective temperature influence summertime fun?

# Subjective handwaving is easy ...

# ... objectivity is hard

# Modelling the full structure helps

# The end is near ...
# So what do we do in practice?

- Mind your covariates and latent variables!

. . .

- @westfall.yarkoni:2016p recommend structural equation modelling (SEM).
    - Check out the online app: http://jakewestfall.org/ivy/.
- "Traditional" and "modern" regression can still work nicely when we (can) accommodate the correct structure and conditioning in our model.
- PCA, ICA, residualisation, etc. may not bring as much as you hope (unless you're just reducing dimensionality / collinearity) if/because they don't add any structure to the model.
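A small simulated example of the last point (illustrative parameters only): residualising one predictor on another leaves the model's fit unchanged and merely reassigns the shared variance to the other coefficient; no causal structure has been added.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000

# Two correlated predictors (think word length and frequency) and an
# outcome that depends on both; all parameters are invented.
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def fit(predictors, y):
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Residualise x2 on x1 and refit: the shared variance is simply
# reassigned to x1's coefficient; the fitted values are unchanged.
x2_resid = x2 - np.polyval(np.polyfit(x1, x2, 1), x1)
b_orig = fit([x1, x2], y)
b_resid = fit([x1, x2_resid], y)
print(b_orig)   # roughly [0, 1, 1]
print(b_resid)  # roughly [0, 1.7, 1] -- x1 now absorbs the shared part
```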
<!-- causal structure of word length and frequency? -->
# As always ...
You can find my stuff online: palday.bitbucket.org
# The end is here.

If you have no questions about this stuff ...

We can discuss:

- The ASA's statement on $p$-values
- Issues with optional stopping, both frequentist and Bayesian
- Brexit
- Catalonia's quest for independence
- The rise of the right in Germany and the rest of Europe

We will not discuss:

- Donald Trump
- Ted Cruz
I really don't understand it either.