title: Subtleties in controlling confounds in inferential statistics
subtitle: some surprising, some obvious in retrospect
author: Phillip M. Alday
institute: University of South Australia
date: 29 April 2016, Uni Adelaide

When all you have is a hammer ...

Animate and inanimate words chosen as stimulus materials did not differ in word frequency ($p$ > 0.05).

Controls and aphasics did not differ in age ($p$ > 0.05).

Control and ecological validity in conflict

  • Experimental control not possible in the brain & behavioural sciences in the same way as in the physical sciences:
    • truly artificial stimuli problematic (novel as types & tokens, etc.)
    • natural stimuli strongly confounded, with unclear causality and primacy
  • Still, we try to match our stimulus, participant, etc. groups

(Sassenhagen & Alday, under review at B&L)

What's the problem?

Animate and inanimate words chosen as stimulus materials did not differ in word frequency ($p$ > 0.05).

Controls and aphasics did not differ in age ($p$ > 0.05).

Where do I start?

. . .


  1. You can't accept the null in NHST, only fail to reject it.

. . .


  1. You're violating testing assumptions because by design you did not randomly sample.
  2. You're performing inferences on a population you don't care about.

. . .


  1. You're failing to perform the inference you actually care about.

Philosophy: Accepting the null.

. . .

  • Simply put, NHST doesn't have the notion of 'accepting' hypotheses, especially not the null.
  • You only reject a hypothesis as having a likelihood (probability conditional on your data model) that is too low to be taken seriously.
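A minimal simulation sketch of why "$p$ > 0.05" is weak evidence of matching. It assumes two groups of 10 items each, drawn from populations that really do differ by half a standard deviation; the group size, effect size and function names are illustrative choices, not taken from any real study. The only hard-coded constant is the two-sided 5% critical value of Student's $t$ with 18 degrees of freedom (2.101).

```python
import random
import statistics

T_CRIT_DF18 = 2.101  # two-sided 5% critical value of Student's t, 18 df


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal group sizes)."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    standard_error = (pooled_var * 2 / n) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def nonrejection_rate(true_diff, n=10, n_sims=2000, seed=1):
    """Share of simulated studies in which the t-test fails to reject."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(true_diff, 1.0) for _ in range(n)]
        if abs(pooled_t(a, b)) < T_CRIT_DF18:
            misses += 1
    return misses / n_sims


# The populations differ by 0.5 SD, yet most tests report "no difference".
rate = nonrejection_rate(true_diff=0.5)
print(f"p > 0.05 in {rate:.0%} of simulated studies despite a real difference")
```

Under these assumptions the test fails to reject in the large majority of simulated studies, so a non-significant result says little about whether the groups are actually matched.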

Statistics: Getting useless information.

Random sampling

  • You just aren't doing it by any stretch of the imagination.
  • You are actively trying to distort measures of both central location and spread.

Populations vs. samples

  • Inferential statistics, including statistical testing, draw conclusions from the data present about the data absent.
  • The absent data are things we don't care about:
    • The set of all animate vs. all inanimate nouns
    • The set of all possible patients vs. all possible controls
  • Alternatively, we have a completely sampled population and there are no absent data.
  • So just use descriptive statistics and make sure they match!
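One way to "make sure they match" descriptively is to tabulate each group and report a standardised mean difference rather than a $p$-value. The sketch below uses made-up log word frequencies for two hypothetical stimulus sets; the numbers, variable names and helper functions are all assumptions for illustration.

```python
import statistics


def describe(values):
    """Plain descriptive summary of one stimulus or participant group."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def standardized_mean_difference(a, b):
    """Difference in means in units of the averaged within-group SD."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Hypothetical log frequencies, invented for this example.
animate = [3.1, 2.8, 3.5, 2.9, 3.3, 3.0]
inanimate = [3.0, 2.7, 3.6, 3.1, 3.2, 2.9]

print("animate:  ", describe(animate))
print("inanimate:", describe(inanimate))
smd = standardized_mean_difference(animate, inanimate)
print(f"standardised mean difference: {smd:.2f}")
```

The summary compares location and spread directly, which is the thing matching is meant to control, and it involves no inference about data that are not there.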

Pragmatics: Testing what you care about.

  • Even if we could
    • accept the null and
    • pretend that we're sampling randomly
    • from a population we care about
  • we're still answering a boring question:

    do these two populations differ systematically in the given feature?

  • when we actually care about:

    is the variance observed in my manipulation (better or at least partially) explained by the differences in the given feature?

What to do, what to do

  • Stop inferential tests for confound control.
  • Try to match groups as closely as possible using purely descriptive statistics (reduce confounds and collinearity).
  • If you can (and this is a should could!), explicitly model these confounds as covariates
    • Painful with ANOVA / ANCOVA / other 1970s statistics
    • Not a problem with modern (explicit) regression techniques like mixed-effects models
    • Which you really should be using anyway for many BBS designs [cf. @clark1973a; @judd.westfall.etal:2012pp; @westfallkennyjudd2014a]

. . .

  • And thus you correctly use statistics to answer questions you care about.
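A rough sketch of what modelling the confound as a covariate buys you. It uses ordinary least squares via NumPy as a simplified stand-in for the mixed-effects models recommended above, and every number in it (effect sizes, noise levels, the deliberate frequency mismatch between conditions) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2016)
n = 500  # hypothetical number of items per condition

# Condition codes: 0 for one stimulus set, 1 for the other.
condition = np.repeat([0.0, 1.0], n)

# Deliberately mismatched confound: condition 1 is more frequent on average.
frequency = rng.normal(loc=3.0 + 0.5 * condition, scale=0.5)

# Simulated outcome with a known condition effect and a known frequency effect.
true_condition_effect = 20.0
true_frequency_effect = -40.0
rt = (600.0
      + true_condition_effect * condition
      + true_frequency_effect * frequency
      + rng.normal(0.0, 30.0, size=2 * n))


def fit_ols(y, *predictors):
    """Least-squares coefficients for y ~ 1 + predictors (intercept first)."""
    design = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


naive = fit_ols(rt, condition)                # ignores the confound
adjusted = fit_ols(rt, condition, frequency)  # confound as covariate

print(f"condition effect, confound ignored:  {naive[1]:.1f}")
print(f"condition effect, confound modelled: {adjusted[1]:.1f}")
print(f"estimated frequency effect:          {adjusted[2]:.1f}")
```

In this setup the unadjusted estimate absorbs the frequency difference between conditions, while the adjusted model recovers something close to the simulated condition effect and estimates the frequency effect alongside it.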

I scream, you scream ...

Fresh off the presses


(DOI: 10.1371/journal.pone.0152719)

Arrows show true causal relationships

(All model diagrams from John Myles White)

Highlighting shows conditioning (in modelling)

But what if I didn't measure the actual causal variable?

. . .

Conditional probabilities and modelling (is so hard for frequentists)

  • Conditioning effectively "blocks" a given path
  • Non-blocked paths allow for "spurious" correlations and false positives
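The blocking idea can be sketched with a small simulation. It assumes a single common cause (here called `temperature`) driving two otherwise unrelated outcomes, plus a noisy proxy (`subjective_temp`) standing in for the case where the actual causal variable was not measured. The variable names and unit-variance noise are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

temperature = rng.normal(size=n)                     # true common cause
ice_cream = temperature + rng.normal(size=n)         # caused by temperature
pool_deaths = temperature + rng.normal(size=n)       # caused by temperature only
subjective_temp = temperature + rng.normal(size=n)   # noisy proxy of the cause


def partial_corr(x, y, z):
    """Correlation between x and y conditional on z (first-order partial)."""
    r = np.corrcoef([x, y, z])
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


raw = np.corrcoef(ice_cream, pool_deaths)[0, 1]
blocked = partial_corr(ice_cream, pool_deaths, temperature)
leaky = partial_corr(ice_cream, pool_deaths, subjective_temp)

print(f"unconditioned correlation:        {raw:.2f}")
print(f"conditioned on the true cause:    {blocked:.2f}")
print(f"conditioned on the noisy proxy:   {leaky:.2f}")
```

Conditioning on the true cause removes the spurious association, whereas conditioning on the noisy proxy only shrinks it, which is how an apparently "controlled" analysis can still produce false positives.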

. . .

Going against the arrow and not hitting a bright stop sign in time leads to confusion and trouble.

Simulated for typical data

(DOI: 10.1371/journal.pone.0152719.g002)

But all of this follows directly from the GLM

Standard GLM applications have a "vertical" error/variance term.

$$ Y = \beta_0 + \beta_1 X_1 + \varepsilon $$
$$ \varepsilon \sim N(0,\sigma) $$

. . .

In other words, we assume:

. . .

  1. (Measurement) error/variance only occurs in the dependent variable.
  2. We manipulate the independent variables directly and without error.
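A short simulation sketch of what happens when the second assumption fails. It assumes a simple linear model with a known slope and adds measurement error to the predictor with the same variance as the predictor itself; all values and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
beta_0, beta_1 = 1.0, 2.0

x_true = rng.normal(0.0, 1.0, n)                      # error-free predictor
y = beta_0 + beta_1 * x_true + rng.normal(0.0, 1.0, n)  # "vertical" error only
x_observed = x_true + rng.normal(0.0, 1.0, n)         # predictor measured with error


def ols_slope(x, y):
    """Slope of the least-squares line for y ~ 1 + x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Share of the observed predictor's variance that is true signal.
reliability = 1.0 / (1.0 + 1.0)

slope_true = ols_slope(x_true, y)
slope_noisy = ols_slope(x_observed, y)

print(f"slope with error-free predictor: {slope_true:.2f}")
print(f"slope with noisy predictor:      {slope_noisy:.2f}")
print(f"beta_1 * reliability:            {beta_1 * reliability:.2f}")
```

With an error-free predictor the estimate sits near the simulated slope; with the noisy predictor it shrinks towards the slope multiplied by the reliability, because the model has no term for error in $X_1$.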

But doesn't subjective temperature influence summertime fun?

Subjective handwaving is easy ...

... objectivity is hard

Modelling the full structure helps

The end is near ...

So what do we do in practice?

  • Mind your covariates and latent variables!

. . .

  • @westfall.yarkoni:2016p recommend structural equation modelling (SEM).
  • Check out the online app:
  • "Traditional" and "modern" regression can still work nicely when we (can) accommodate the correct structure and conditioning in our model.
  • PCA, ICA, residualisation, etc. may not help as much as you hope (unless you're just reducing dimensionality / collinearity) if/because they don't add any structure to the model.

<!-- causal structure of word-length and frequency?-->

As always ...

You can find my stuff online:

The end is here.

If you have no questions about this stuff ....

We can discuss

We will not discuss

  • Donald Trump
  • Ted Cruz

I really don't understand it either.

References {.fragile .plain}