The chapter starts with the textbook application of Bayes’ rule: working out the probability that a positive test is a true positive, or in this case that the person really is a vampire 😆, given the test’s error rates and the prior probability of being a vampire. The author shows that although the example uses Bayes’ rule, there is nothing especially “Bayesian” about it. Converting the probabilities into counts makes the answer obvious, with no need to remember Bayes’ rule at all.
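
As a quick sanity check, here is the calculation both ways, using the numbers from the book’s vampire example (Pr(positive|vampire) = 0.95, Pr(positive|mortal) = 0.01, Pr(vampire) = 0.001); the counts version assumes a hypothetical population of 100,000 people:

# Bayes' rule version
Pr_Positive_Vampire <- 0.95
Pr_Positive_Mortal  <- 0.01
Pr_Vampire          <- 0.001
Pr_Positive <- Pr_Positive_Vampire * Pr_Vampire +
               Pr_Positive_Mortal * ( 1 - Pr_Vampire )
Pr_Positive_Vampire * Pr_Vampire / Pr_Positive        # ~0.087

# counts version: out of 100,000 people, 100 are vampires
true_pos  <- 100 * 0.95          # vampires who test positive
false_pos <- 99900 * 0.01        # mortals who test positive
true_pos / ( true_pos + false_pos )                   # same ~0.087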

I felt the point of the example is the prior: given that vampires are very rare, a positive test does not actually tell us much. All it means is that we ran a lot of tests and one of them came back positive, because the test is error prone and errors accumulate with the number of tests (the same idea as multiple comparisons).

I really liked the “Rethinking: Why statistics can’t save bad science” example. Here, the prior probability that a scientific hypothesis is really true is quite small. If that is the case, even assuming that a statistical test is right 95% of the time (same as the vampire test), we are in the same conundrum: even when the test comes back significant, the probability that the hypothesis is true is still quite small. The way out is to improve the base rate Pr(true), which requires thinking about the hypothesis deeply.

“Suppose the probability of a positive finding, when an hypothesis is true, is Pr(sig|true) = 0.95. That’s the power of the test. Suppose that the probability of a positive finding, when an hypothesis is false, is Pr(sig|false) = 0.05. That’s the false-positive rate, like the 5% of conventional significance testing. Finally, we have to state the base rate at which hypotheses are true. Suppose for example that 1 in every 100 hypotheses turns out to be true. Then Pr(true) = 0.01. No one knows this value, but the history of science suggests it’s small. Plug in the appropriate values, and the answer is approximately Pr(true|pos) = 0.16. So a positive finding corresponds to a 16% chance that the hypothesis is true. This is the same low base-rate phenomenon that applies in medical (and vampire) testing. You can shrink the false-positive rate to 1% and get this posterior probability up to 0.5, only as good as a coin flip. The most important thing to do is to improve the base rate, Pr(true), and that requires thinking, not testing.” (McElreath, 2020, p. 51)
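
Plugging the quoted numbers into the same calculation:

Pr_sig_true  <- 0.95    # power
Pr_sig_false <- 0.05    # false-positive rate
Pr_true      <- 0.01    # base rate of true hypotheses
Pr_true * Pr_sig_true /
  ( Pr_true * Pr_sig_true + ( 1 - Pr_true ) * Pr_sig_false )   # ~0.16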

Summarizing a posterior distribution

Once the posterior distribution is generated, we are ready, right? The problem is that the form of the posterior differs depending on the approach we use. Grid approximation gives posterior probabilities over a grid of parameter values. Quadratic approximation just gives us the mean and variance of a Gaussian as the posterior. Summarization becomes harder, especially when multiple parameters are being fit.

So one way to solve this is to simply sample from the posterior. No matter how the posterior was obtained (grid approximation, quadratic approximation, MCMC), once we draw samples from it we can use the same sample-based summaries to describe the parameter values it makes plausible.

The author talks about how to summarize a parameter value once the posterior distribution is generated.

  • Intervals of defined boundaries: simply report the probability that the parameter lies within a given interval. Example - the probability that the parameter lies between 0.5 and 0.75 is 61%. Useful when we care about a particular range of parameter values. (The sketch after this list shows these summaries in code.)

  • Intervals of defined mass: summarize where a given share of the posterior probability lies - basically saying that 80% of the posterior is contained between parameter values x1 and x2. This is what is called a “Percentile Interval (PI)”. The problem is that 80% of the probability can be contained in many different intervals - it could be the 10th to 90th percentiles, the 0th to 80th, or the 20th to 100th. Often what we want is the narrowest interval that contains 80% of the probability, the “Highest Posterior Density Interval (HPDI)”, which can be better when the posterior is non-Gaussian. For cases where it is Gaussian, the HPDI is similar to the PI. The disadvantage of the HPDI is that it suffers from greater simulation variance, so it requires many more samples and is more computationally intensive. Still, I think the HPDI is a nicer way of finding a parameter range containing x% of the posterior probability than the standard PI.

    “So in terms of describing the shape of the posterior distribution—which is really all these intervals are asked to do—the percentile interval can be misleading.” (McElreath, 2020, p. 56)

    “Overall, if the choice of interval type makes a big difference, then you shouldn’t be using intervals to summarize the posterior. Remember, the entire posterior distribution is the Bayesian “estimate.” It summarizes the relative plausibilities of each possible value of the parameter. Intervals of the distribution are just helpful for summarizing it. If choice of interval leads to different inferences, then you’d be better off just plotting the entire posterior distribution.” (McElreath, 2020, p. 58)

“It is common to hear that a 95% “confidence” interval means that there is a probability 0.95 that the true parameter value lies within the interval. In strict non-Bayesian statistical inference, such a statement is never correct, because strict non-Bayesian inference forbids using probability to measure uncertainty about parameters. Instead, one should say that if we repeated the study and analysis a very large number of times, then 95% of the computed intervals would contain the true parameter value.” (McElreath, 2020, p. 58)

  • Point estimates: use the mean, median, or mode. For a Gaussian posterior they are all the same. For a non-Gaussian posterior, the mode is the maximum a posteriori (MAP) estimate - the parameter value with the highest posterior probability, or the most frequent value among the samples. The mean is the average of the samples and the median is the value that splits the samples in half. Overall, it is not a good idea to pick point estimates - they are mainly useful when we need to minimize a loss function.

    “The Bayesian parameter estimate is precisely the entire posterior distribution, which is not a single number, but instead a function that maps each unique parameter value onto a plausibility value. So really the most important thing to note is that you don’t have to choose a point estimate. It’s hardly ever necessary and often harmful. It discards information.” (McElreath, 2020, p. 58)
    “In order to decide upon a point estimate, a single-value summary of the posterior distribution, we need to pick a loss function. Different loss functions nominate different point estimates. The two most common examples are the absolute loss as above, which leads to the median as the point estimate, and the quadratic loss (d − p)², which leads to the posterior mean (mean(samples)) as the point estimate.” (McElreath, 2020, p. 60)
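
Here is a minimal sketch of these summaries in code, using the rethinking package and posterior samples from the globe-tossing grid approximation (the same setup as the code further below):

library(rethinking)

# grid-approximate posterior for 6 waters in 9 tosses, flat prior
p_grid <- seq( from=0 , to=1 , length.out=1e3 )
posterior <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- posterior / sum(posterior)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )

# interval of defined boundaries: Pr( 0.5 < p < 0.75 )
sum( samples > 0.5 & samples < 0.75 ) / length(samples)

# intervals of defined mass: percentile interval vs HPDI
PI( samples , prob=0.8 )      # middle 80% of the probability
HPDI( samples , prob=0.8 )    # narrowest interval containing 80%

# point estimates
p_grid[ which.max(posterior) ]    # MAP, straight from the grid
chainmode( samples , adj=0.01 )   # MAP approximated from the samples
mean( samples )                   # minimizes quadratic loss
median( samples )                 # minimizes absolute loss

# choosing the estimate that minimizes expected absolute loss explicitly
loss <- sapply( p_grid , function(d) sum( posterior * abs( d - p_grid ) ) )
p_grid[ which.min(loss) ]         # recovers the median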

Sampling to simulate prediction

The author talks about ways to ensure the model design is appropriate by:

  • generating dummy data based on our assumptions (a short sketch follows this list).
  • model checking using the posterior predictive distribution.
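
For the first point, a minimal sketch of dummy data for the globe-tossing model, assuming a true proportion of water of 0.7 (the illustrative value used in the book):

# simulate 100,000 dummy observations of 2 tosses each, assuming p = 0.7
dummy_w <- rbinom( 1e5 , size=2 , prob=0.7 )
table( dummy_w ) / 1e5       # should be close to dbinom( 0:2 , size=2 , prob=0.7 )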

The key idea behind posterior predictive distributions is that two uncertainties go into a model:

  • observational uncertainty - the samples are obtained from a distribution which has some variance.
  • parameter uncertainty - the parameter is unseen and inferred from the data. The posterior distribution describes the parameter uncertainty.

So to predict observations, one has to model both the parameter uncertainty and the observational uncertainty. One way to do this is to generate a distribution that uses the posterior distribution as weights to combine the sampling distributions for all possible parameter values. The resulting distribution is called the posterior predictive distribution, and it is generally broader than the predictions implied by any single parameter value, because it carries the parameter uncertainty in addition to the observational uncertainty.
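
A minimal sketch of that weighting for the globe-tossing example (6 waters in 9 tosses, flat prior): each possible count of water gets the average of its sampling probabilities across the grid, weighted by the posterior.

# posterior predictive by mixing sampling distributions, weighted by the posterior
p_grid <- seq( from=0 , to=1 , length.out=1e3 )
posterior <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- posterior / sum(posterior)               # flat prior
ppd <- sapply( 0:9 , function(w)
  sum( posterior * dbinom( w , size=9 , prob=p_grid ) ) )
round( ppd , 3 )                                      # probability of each count of water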

Cautionary note about sampling distributions:

“Sampling distributions are the foundation of common non-Bayesian statistical traditions. In those approaches, inference about parameters is made through the sampling distribution. In this book, inference about parameters is never done directly through a sampling distribution. The posterior distribution is not sampled, but deduced logically. Then samples can be drawn from the posterior, as earlier in this chapter, to aid in inference. In neither case is “sampling” a physical act. In both cases, it’s just a mathematical device and produces only small world (Chapter 2) numbers.” (McElreath, 2020, p. 63)

Posterior distribution based on data using grid approximation (1,000 grid points, adapted from R code 2.3 in the book):

library(rethinking)                                   # for dens() and simplehist()

p_grid <- seq( from=0 , to=1 , length.out=1e3 )       # grid of candidate values of p
prob_p <- rep( 1 , 1e3 )                              # flat prior
prob_data <- dbinom( 6 , size=9 , prob=p_grid )       # likelihood of 6 waters in 9 tosses
posterior <- prob_data * prob_p
posterior <- posterior / sum(posterior)               # normalize

samples <- sample( p_grid , prob=posterior , size=1e5 , replace=TRUE )
dens( samples , xlab="probability of water" )
mtext( paste("N =", 1e5) )

Using just one point near the peak of the distribution (p = 0.7; the MAP is 6/9 ≈ 0.67) would yield:

# sampling distribution of the count of waters for a single value of p
w <- rbinom( 1e4 , size=9 , prob=0.7 )
simplehist( w , xlab="count of W" , ylab="frequency" )

It does indeed predict that 6 or 7 waters are the most frequent counts. But this distribution only accounts for observational uncertainty, not parameter uncertainty. Adding the parameter uncertainty in gives the posterior predictive distribution:

# posterior predictive: one simulated count of waters per posterior sample of p
w <- rbinom( length(samples) , size=9 , prob=samples )
simplehist( w , xlab="count of W" , ylab="frequency" )

“If instead you were to use only a single parameter value to compute implied predictions, say the most probable value at the peak of posterior distribution, you’d produce an overconfident distribution of predictions, narrower than the posterior predictive distribution in Figure 3.6 and more like the sampling distribution shown for p = 0.6 in the middle row. The usual effect of this overconfidence will be to lead you to believe that the model is more consistent with the data than it really is—the predictions will cluster around the observations more tightly. This illusion arises from tossing away uncertainty about the parameters.” (McElreath, 2020, p. 66)

I find this idea quite useful, as I am often nervous when I show distributions of averages. That is basically the same as throwing away one dimension of uncertainty. With posterior predictive distributions, both uncertainties are combined. I guess the same method can be applied when I am comparing multiple experimental trials, each with its own noise distribution (observational uncertainty). Probably something that will be described in later chapters.

(Bonus round) Measurement error

The video lecture for Chapters 2 & 3 (this one) has a bonus section that describes how to incorporate measurement errors into models, which I found very useful. It also talks about the use of DAGs to represent models (not present in the 2nd edition of the book). Here is a DAG for the globe tossing example, with the measurement error.¹

flowchart LR
	p((p)) --> W
	N --> W
	W --> W*
	x((x)) --> W*

The idea is to incorporate the error as a probability of misrecording an observation and see how it changes the posterior.²

”When there is measurement error, better to model it than ignore it.”
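
As a sketch of one way this could work (my own assumption about how the error enters, not code from the lecture): if each toss is misrecorded with probability x, the probability of observing “water” becomes p(1 - x) + (1 - p)x, and the grid likelihood can use that in place of p:

# grid posterior with a known misclassification rate x (assumed value)
x <- 0.1                                          # assumed probability of misrecording a toss
p_grid <- seq( from=0 , to=1 , length.out=1e3 )
p_obs <- p_grid * ( 1 - x ) + ( 1 - p_grid ) * x  # Pr( observe "water" | p , x )
likelihood <- dbinom( 6 , size=9 , prob=p_obs )
posterior_err <- likelihood / sum(likelihood)     # flat prior

With x fixed this is just a modified likelihood; treating x as unknown would mean a joint posterior over p and x, which is what footnote 2 alludes to.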

Summary

“Working with samples transforms a problem of integral calculus into a problem of data summary. These samples can be used to produce intervals, point estimates, posterior predictive checks, as well as other kinds of simulations.” (McElreath, 2020, p. 68)

“Posterior predictive checks combine uncertainty about parameters, as described by the posterior distribution, with uncertainty about outcomes, as described by the assumed likelihood function. These checks are useful for verifying that your software worked correctly. They are also useful for prospecting for ways in which your models are inadequate.” (McElreath, 2020, p. 68)

Footnotes

  1. In my DAGs, I use circles for parameters (unobserved variables) and squares for observed variables.

  2. This would require the posterior to be dependent on p and x. Not sure how to do this yet - need to update this once I figure it out.