1.1 Statistical golems

The author draws a nice analogy between scientific models and golems.

“Scientists also make golems. Our golems rarely have physical form, but they too are often made of clay, living in silicon as computer code. These golems are scientific models. But these golems have real effects on the world, through the predictions they make and the intuitions they challenge or inspire.” (McElreath, 2020, p. 1)

Golems were robots of extraordinary power, built with good intentions (born from truth). However, when used clumsily, they caused innocent deaths. Similarly, models used clumsily can give incorrect results (which might even cause deaths).

An example set of golems used often in life sciences and biomedical research:

“Advanced courses in statistics do emphasize engineering principles, but most scientists never get that far. Teaching statistics this way is somewhat like teaching engineering backwards, starting with bridge building and ending with basic physics. So students and many scientists tend to use charts like Figure 1.1 without much thought to their underlying structure, without much awareness of the models that each procedure embodies, and without any framework to help them make the inevitable compromises required by real research. It’s not their fault.” (McElreath, 2020, p. 3)

“What researchers need is some unified theory of golem engineering, a set of principles for designing, building, and refining special-purpose statistical procedures. Every major branch of statistical philosophy possesses such a unified theory. But the theory is never taught in introductory—and often not even in advanced—courses. So there are benefits in rethinking statistical inference as a set of strategies, instead of a set of pre-made tools.” (McElreath, 2020, p. 4)

1.2 Statistical rethinking

The author makes a distinction between hypotheses, process models and statistical models.

Hypothesis: A general statement about how a system works (generally a mechanistic statement).

Process models: Mathematical framing of the hypothesis. Has a causal structure to it. Different inputs give rise to different outputs.

Statistical models: The output statistics of the data (or the process models). The final measurement, generally a distribution of data points.

The rationale is:
Hypothesis → Process models → Statistical models

The problem is that this is not a one-to-one map. Rather, a hypothesis can correspond to multiple process models and a statistical model can correspond to multiple process models. As a result, a statistical model can correspond to multiple hypotheses.

As a consequence, rejecting a null hypothesis is not clean: it could mean rejecting parts of multiple hypotheses. And failing to reject the null implies that multiple hypotheses could be true.

The first key argument is that using a statistical model alone is not enough, and we need process models too.

The second key argument is that falsification is not straightforward because there is observation error. An example the author gives is the scientific fable of the color of swans. The null hypothesis is that all swans are white. Until Europeans reached Australia, this was not falsified. In Australia, the presence of black swans falsified the hypothesis.

However, there are always false positives (mistaken confirmations) and false negatives (mistaken disconfirmations), which makes it hard to falsify a hypothesis based on just one observation. This manifests itself through measurement error. For instance, in this case there is an implicit assumption that there are only two colors, black and white, and that everybody observes them in the same way. In reality, not all white swans are pure white and not all black swans are pure black, so a swan can be mistaken for white or black. The author gives other examples, such as the sightings of a supposedly extinct woodpecker and the measurement of faster-than-light neutrinos, where the key dilemma is to figure out whether the falsification is real or spurious.

“Popper was aware of this limitation inherent in measurement, and it may be one reason that Popper himself saw science as being broader than falsification. But the probabilistic nature of evidence rarely appears when practicing scientists discuss the philosophy and practice of falsification. My reading of the history of science is that these sorts of measurement problems are the norm, not the exception.” (McElreath, 2020, p. 9)
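The observation-error point can be made concrete with a toy calculation. This is a minimal sketch, not something from the book: it applies Bayes' rule to a single reported sighting, and the prior, hit rate and false-alarm rate are all hypothetical numbers picked for illustration.

```python
def prob_real_given_report(prior, hit_rate, false_alarm_rate):
    """Bayes' rule: P(swan is truly black | observer reports a black swan)."""
    true_reports = hit_rate * prior
    false_reports = false_alarm_rate * (1 - prior)
    return true_reports / (true_reports + false_reports)


# Hypothetical inputs: black swans are rare, and observers are good but not perfect.
p = prob_real_given_report(prior=0.001, hit_rate=0.95, false_alarm_rate=0.01)
print(f"P(truly black | one report) = {p:.3f}")
```

Under these assumed numbers, a single report is far more likely to be a mistaken observation than a real falsification, which is why one sighting alone rarely settles the matter.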

1.3 Tools for golem engineering

The author makes a key distinction between “Bayesian data analysis” and the “frequentist” approach.

  • In the frequentist approach, the “measurements” are thought to arise from a sampling distribution with different parameters (t-distribution for the t-test, binomial/multinomial distribution for Fisher’s exact test, normal distribution for ANOVA and a bunch of other parametric tests, etc.). The idea is that if the measurements were resampled many times, they would give rise to the sampling distribution. Statistical tests then determine whether the observations arise from two or more different sampling distributions. One of the more surprising notions here (to me) is that all the uncertainty lies in the measurements and there are no probability distributions for the parameters themselves. That is, the distributions are fixed and everything we see is due to measurement error.
  • Bayesian data analysis is more general, with no fixed sampling distributions.[^1] The probability distributions can describe measurement error or the parameters of distributions (or probably something even more general).
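A minimal sketch of the two views, assuming a made-up dataset of 6 successes in 9 trials and a flat prior (both are my assumptions for illustration). The first function fixes the parameter and resamples the data to build a sampling distribution. The second fixes the data and computes a probability distribution over the parameter on a grid.

```python
import random


def sampling_distribution(p_true, n, reps, seed=0):
    """Frequentist view: the parameter is fixed, the data are resampled."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(reps):
        successes = sum(rng.random() < p_true for _ in range(n))
        counts[successes] += 1
    # Relative frequency of each possible number of successes.
    return [c / reps for c in counts]


def grid_posterior(successes, n, grid_size=101):
    """Bayesian view: the data are fixed, the parameter gets a distribution."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Flat prior times binomial likelihood (constant factor dropped).
    unnormalized = [p ** successes * (1 - p) ** (n - successes) for p in grid]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]


freq = sampling_distribution(p_true=0.5, n=9, reps=10_000)
grid, post = grid_posterior(successes=6, n=9)
best = grid[post.index(max(post))]
print("P(6 successes | p = 0.5), by resampling:", freq[6])
print("Most plausible p given the data:", best)
```

The first output is a statement about data under a fixed parameter. The second is a statement about the parameter given fixed data.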

A point about overfitting is made: predictions on future data are one good way of judging a model's accuracy. In contrast, just looking at the fit estimates is actually a bad way of judging a model, because models that fit the current data really well (overfit it) do not predict future data correctly. Information criteria, used frequently in the sciences now, are a rapidly evolving field. The author recommends caution while using them, as their power is often exaggerated.
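A minimal sketch of the overfitting point, assuming a hypothetical linear process with Gaussian noise. It compares a least-squares line with a deliberately overfit "memorizer" (a 1-nearest-neighbour rule, my choice for illustration and not a method from the book) on training data and on fresh data.

```python
import random


def make_data(n, rng, slope=2.0, noise_sd=1.0):
    """Hypothetical process: y = slope * x + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares line; returns a prediction function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared prediction error."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(1000, rng)

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(name, "train MSE:", round(results[name][0], 3),
          "test MSE:", round(results[name][1], 3))
```

The memorizer reproduces the training data perfectly, so judging by fit alone it looks like the better model. On new data it also reproduces the training noise, and the simpler line predicts better.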

Multilevel models are introduced.[^2] Multilevel models (hierarchical models) are models where the parameters themselves are described by models with parameters of their own (the recursion can go several levels deep). These are powerful, but we hit the same overfitting risk. Partial pooling is one way of dealing with overfitting, something introduced later in the book. The advantages of partial pooling are very important, especially for the life sciences, where the data is rife with repeat sampling, imbalanced sampling and variation, and where averaging is often used to circumvent these problems. This is something I want to learn and am looking forward to it.

“(1) To adjust estimates for repeat sampling. When more than one observation arises from the same individual, location, or time, then traditional, single-level models may mislead us. (2) To adjust estimates for imbalance in sampling. When some individuals, locations, or times are sampled more than others, we may also be misled by single-level models. (3) To study variation. If our research questions include variation among individuals or other groups within the data, then multilevel models are a big help, because they model variation explicitly. (4) To avoid averaging. Pre-averaging data to construct variables can be dangerous. Averaging removes variation, manufacturing false confidence. Multilevel models preserve the uncertainty in the original, pre-averaged values, while still using the average to make predictions.” (McElreath, 2020, p. 15)
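A rough sketch of what partial pooling does with imbalanced sampling. It assumes the within-group and between-group standard deviations (`sigma` and `tau`) are known, which is a simplification: a real multilevel model estimates them from the data. The group labels and measurements are hypothetical.

```python
def partial_pool(groups, sigma, tau):
    """Shrink each group mean toward the grand mean.

    Assumes (for illustration only) a known within-group sd `sigma` and a
    known between-group sd `tau`.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    estimates = {}
    for name, values in groups.items():
        n = len(values)
        group_mean = sum(values) / n
        # Precision-weighted average: the more data a group has,
        # the more its own mean is trusted.
        weight = (n / sigma ** 2) / (n / sigma ** 2 + 1 / tau ** 2)
        estimates[name] = weight * group_mean + (1 - weight) * grand_mean
    return grand_mean, estimates


# Hypothetical measurements with very unbalanced sampling.
groups = {
    "A": [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 9.7],
    "B": [12.0, 11.5],
    "C": [7.0],
}
grand_mean, pooled = partial_pool(groups, sigma=1.0, tau=1.0)
no_pool = {name: sum(values) / len(values) for name, values in groups.items()}
for name, values in groups.items():
    print(f"{name}: n={len(values)} raw mean={no_pool[name]:.2f} "
          f"partially pooled={pooled[name]:.2f}")
```

The group with a single observation is pulled strongly toward the grand mean, while the well-sampled group barely moves. No group is averaged away and no group is trusted blindly.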

Statistical models can be used to find associations between variables, but it is harder to infer causality. The author explains that causality is hard to infer because a strong relationship between cause and effect implies that the cause can often be predicted from the effect (if the branches move, there is wind, even though wind is the cause).

“A statistical model is an amazing association engine. It makes it possible to detect associations between causes and their effects. But a statistical model is never sufficient for inferring cause, because the statistical model makes no distinction between the wind causing the branches to sway and the branches causing the wind to blow. Facts outside the data are needed to decide which explanation is correct.” (McElreath, 2020, p. 16)
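The wind and branches example can be sketched in a small simulation. This is an illustration under assumed numbers, not the book's code: the data are generated so that wind causes sway, and then a least-squares line is fit in both directions.

```python
import random


def r_squared(xs, ys):
    """R^2 of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rng = random.Random(7)
# Simulated world where wind causes sway, never the reverse.
wind = [rng.gauss(0.0, 1.0) for _ in range(500)]
sway = [0.8 * w + rng.gauss(0.0, 0.5) for w in wind]

r2_sway_from_wind = r_squared(wind, sway)  # causal direction
r2_wind_from_sway = r_squared(sway, wind)  # reversed direction
print(r2_sway_from_wind, r2_wind_from_sway)
```

Both fits explain the same share of variance, so nothing in the fitted model reveals which variable is the cause. That information has to come from outside the data.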

The second paradox of prediction is brought up here: causally incorrect models often predict the data better than causally correct ones. It took me a while to wrap my head around this one. In a sense, it is the same as predicting the data from all available cause and effect relationships instead of just the causally correct one. The model that uses more relationships would perform better at prediction even if it weights a causally wrong relationship (like looking at the branches to predict wind).

“When I introduced them above, I described overfitting as the primary paradox in prediction. Now we turn to a secondary paradox in prediction: Models that are causally incorrect can make better predictions than those that are causally correct.” (McElreath, 2020, p. 16)
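One way to see this secondary paradox is a simulated causal chain. This is a sketch under assumptions of my own: the variable `pressure`, the chain pressure → wind → sway, and all noise levels are hypothetical. Wind is predicted once from its cause and once from its effect, and both models are scored on held-out data.

```python
import random


def fit_line(xs, ys):
    """Least-squares line; returns a prediction function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(predict, xs, ys):
    """Mean squared prediction error."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Hypothetical causal chain: pressure -> wind -> sway."""
    pressure = [rng.gauss(0.0, 1.0) for _ in range(n)]
    wind = [p + rng.gauss(0.0, 1.0) for p in pressure]
    sway = [w + rng.gauss(0.0, 0.3) for w in wind]
    return pressure, wind, sway


rng = random.Random(3)
p_train, w_train, s_train = simulate(500, rng)
p_test, w_test, s_test = simulate(500, rng)

from_cause = fit_line(p_train, w_train)   # causally correct direction
from_effect = fit_line(s_train, w_train)  # predicts wind from its own effect

mse_cause = mse(from_cause, p_test, w_test)
mse_effect = mse(from_effect, s_test, w_test)
print("wind from pressure (cause):", round(mse_cause, 3))
print("wind from sway (effect):  ", round(mse_effect, 3))
```

Because sway tracks wind closely in this setup, the backwards model predicts wind better than the causally correct one, even though shaking the branches would never produce wind.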

The author mentions that formal methods for causal inference will be introduced through graphical causal models, like the directed acyclic graph (DAG). These are heuristic models, but they allow us to deduce causality.


Footnotes

  1. A consequence of this is that Bayesian methods are more computationally expensive. I guess it is the same as training an ML model for face detection vs. using AdaBoost for face tracking. The latter has built-in assumptions which make it faster to implement, whereas the former needs a lot of training. So the bigger the data, the easier it is to use frequentist assumptions (especially given the large-sample assumption of frequentist statistics).

  2. The paired t-test is apparently a multilevel model in disguise. Surprising! I figured it was just comparing the location of the mean of one distribution with respect to the other, but it looks like my interpretation is wrong. Or maybe my interpretation of multilevel models based on the introduction is wrong. Or both my interpretations are wrong 😅 Anyway, I will find out in a few chapters.