The author draws a really nice analogy for the scope of a model from Christopher Columbus’s assumptions for his westward voyage to India. Columbus assumed the earth was a sphere with a circumference of about 30,000 km at the equator (the actual figure is closer to 40,000 km), and that no land mass lay between Europe and Asia to the west. Under those assumptions, sailing west would bring him to Asia.

This was his model of the world (the small world), and it was consistent with the data he had: nothing in his model suggested another land mass between Europe and Asia. When the model was deployed in the large world (he went on the voyage), its predictions turned out to be wrong (which was lucky for him). This distinction between the model (small world), built on assumptions and known data, and deploying it in reality (large world) and hoping its predictions hold, is the main challenge of modeling, statistical or otherwise.

“Colombo’s small and large worlds provide a contrast between model and reality. All statistical modeling has these two frames: the small world of the model itself and the large world we hope to deploy the model in. Navigating between these two worlds remains a central challenge of statistical modeling.” (McElreath, 2020, p. 19)

2.1 Garden of forking data

The author gives a brilliant illustration of how to determine which conjectures are plausible by “forking” all possible events and counting the number of sequences that could give rise to the observations. I really like the visuals and the description - they are very beginner-friendly and help develop a strong intuition for Bayesian inference.

“In order to make good inference about what actually happened, it helps to consider everything that could have happened. A Bayesian analysis is a garden of forking data, in which alternative sequences of events are cultivated. As we learn about what did happen, some of these alternative sequences are pruned. In the end, what remains is only what is logically consistent with our knowledge.” (McElreath, 2020, p. 21)
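The counting logic can be sketched in a few lines. This is a minimal sketch of the book’s bag-of-marbles example (a bag of four marbles, each blue or white, observed draws blue–white–blue); the variable names are mine.

```python
# Garden of forking data: a bag holds four marbles, each blue or white.
# For each conjecture (0..4 blue marbles), count the draw sequences
# (with replacement) that could have produced the observed data.
data = ["blue", "white", "blue"]

ways_per_conjecture = []
for n_blue in range(5):
    counts = {"blue": n_blue, "white": 4 - n_blue}
    ways = 1
    for draw in data:
        ways *= counts[draw]          # forks consistent with this draw
    ways_per_conjecture.append(ways)

print(ways_per_conjecture)            # [0, 3, 8, 9, 0]
```

Conjectures with zero ways (all white, all blue) are pruned entirely; the three-blue conjecture is the most plausible because the most forking paths survive.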

2.2 Building a model

Once the intuition is in place, the author builds a simple model to show what a Bayesian statistical model looks like. Model building is broken down into three parts:

  1. Data story: A description of how the data were sampled, as well as the relevant aspects of the underlying reality. This step highlights the variables, the parameters, and the sampling process. The author cautions that many different data stories can correspond to the same model, so it is important to eventually discard the story; still, it is a good starting point, because writing it out ensures that specific questions are answered before modeling begins.

    “The data story has value, even if you quickly abandon it and never use it to build a model or simulate new observations. Indeed, it is important to eventually discard the story, because many different stories correspond to the same model. As a result, showing that a model does a good job does not in turn uniquely support our data story. Still, the story has value because in trying to outline the story, often one realizes that additional questions must be answered.” (McElreath, 2020, p. 29)

  2. Bayesian updating: The plausibility curve is updated with every new data point. The final curve does not depend on the order of the observations, only on their counts. It is also possible to recover the previous plausibility curve by dividing out an observation’s contribution, so the updating process can be run in reverse. Updating is thus an iterated learning process that combines previous knowledge with incoming observations.

    “The Bayesian model learns in a way that is demonstrably optimal, provided that it accurately describes the real, large world. This is to say that your Bayesian machine guarantees perfect inference within the small world. No other way of using the available information, beginning with the same state of information, could do better.” (McElreath, 2020, p. 31)

  3. Evaluate: Check whether the model makes sense. The model’s own certainty is not sufficient; it is important to critique the model’s results. This is hard and will be described in the next chapter. This quote captures why being critical matters: there are multiple ways of getting useful inferences even when the model is wrong.

    “Failure to conclude that a model is false must be a failure of our imagination, not a success of the model. Moreover, models do not need to be exactly true in order to produce highly precise and useful inferences. All manner of small world assumptions about error distributions and the like can be violated in the large world, but a model may still produce a perfectly useful estimate. This is because models are essentially information processing machines, and there are some surprising aspects of information that cannot be easily captured by framing the problem in terms of the truth of assumptions.” (McElreath, 2020, p. 32)

I found the use of “plausibilities” and “probabilities” confusing. They both seem to be the same thing, so I don’t see why a distinction was made. Maybe it becomes apparent later.

2.3 Components of the model

Variables and Parameters: Variables are the quantities in the model (rates, proportions, counts, etc.). Variables that are observed are data; unobserved variables, whose values must be inferred, are called parameters. This is good nomenclature - I have often used the terms interchangeably 😅

Likelihood - The relative number of ways that a value of p can produce the data. It is derived by enumerating all the possible data sequences that could have happened and then eliminating those inconsistent with the observed data.
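For the globe toss, this counting collapses into the binomial distribution. A sketch, assuming the book’s data of 6 “water” in 9 tosses:

```python
from math import comb

# Binomial likelihood for the globe toss: 6 "water" (W) in 9 tosses.
# comb(n, w) counts the orderings of the sequence; the powers of p and
# (1 - p) weight each individual sequence of W's and L's.
n, w = 9, 6

def likelihood(p):
    return comb(n, w) * p**w * (1 - p)**(n - w)

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  likelihood={likelihood(p):.4f}")
```

The value is largest for p near 6/9, the observed proportion of water.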

Prior plausibility/probability - The initial plausibility assignment for each of the parameter values. In the globe tossing example, the parameter p is the proportion of water covering the earth, and the prior is the range of plausible values together with their weights. The prior could be a uniform distribution (equal weight from 0 to 1), a step distribution, or any more complex function. The choice of prior matters, as it influences which data to focus on and which to ignore. The author argues for viewing the prior as an assumption and testing multiple priors.

“Beyond all of the above, there’s no law mandating we use only one prior. If you don’t have a strong argument for any particular prior, then try different ones. Because the prior is an assumption, it should be interrogated like other assumptions: by altering it and checking how sensitive inference is to the assumption. No one is required to swear an oath to the assumptions of a model, and no set of assumptions deserves our obedience.” (McElreath, 2020, p. 35)

Posterior probability - The updated probability of the parameter values based on the likelihood (data) and the prior. The posterior probability becomes the prior probability in the next data update.

2.4 Making the model go

The update of the parameter probabilities occurs via Bayes’ rule. The author simplifies Bayes’ theorem as:

Posterior = (Probability of the data × Prior) / Average probability of the data

I really like this as it highlights the key aspect of Bayes’ theorem - updating the prior based on the data (likelihood). The denominator (average likelihood) is simply the average probability of the data for all possible parameter values.

A posterior is obtained for each data update. If updating is done sequentially, the posterior from one step becomes the prior at the next. This way, the posterior is refined iteratively as data arrive.
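One way to make “posterior becomes the next prior” concrete is a conjugate Beta prior, a closed-form shortcut that is not the book’s method (the book works numerically), but which shows each observation folding into the running posterior:

```python
# Sequential updating with a conjugate Beta prior (a closed-form
# shortcut, not the book's approach). Start from a flat Beta(1, 1)
# prior; each toss increments one of the two shape parameters, and the
# resulting posterior serves as the prior for the next toss. Note that
# only the counts matter, not the order of the tosses.
a, b = 1.0, 1.0                   # Beta(1, 1) == uniform prior on p
for toss in "WLWWWLWLW":          # 6 water (W), 3 land (L)
    if toss == "W":
        a += 1
    else:
        b += 1

posterior_mean = a / (a + b)
print(a, b, posterior_mean)       # Beta(7, 4), mean = 7/11 ~ 0.636
```

Any permutation of the toss string ends at the same Beta(7, 4), which is the order-invariance mentioned above; decrementing a shape parameter “divides out” an observation and recovers the previous posterior.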

There are multiple ways of updating based on Bayes’ theorem. This chapter talks about three:

  1. Grid approximation
  2. Quadratic approximation
  3. Markov Chain Monte Carlo (MCMC)

The author gives a cautionary note about numerical techniques, as each of them has assumptions baked into it:

“In even moderately complex problems, however, the details of fitting the model to data force us to recognize that our numerical technique influences our inferences. This is because different mistakes and compromises arise under different techniques. The same model fit to the same data using different techniques may produce different answers. When something goes wrong, every piece of the machine may be suspect. And so our golems carry with them their updating engines, as much slaves to their engineering as they are to the priors and likelihoods we program into them.” (McElreath, 2020, p. 39)

Grid approximation

The simplest technique and a good starting point for simple problems. The idea is to break the parameter’s range into a finite grid of points and apply Bayes’ rule at each point.

The biggest problem is that this method reduces a continuous parameter (like the proportion of water on earth, which ranges from 0 to 1) to a fixed number of points. As the number of parameters increases, the number of points to be computed scales as (points)^parameters, which makes it incredibly inefficient.
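A sketch of the whole recipe for the globe toss (the book gives an equivalent version in R; this translation and its names are mine):

```python
from math import comb

# Grid approximation for the globe toss (6 water in 9 tosses):
#   1. lay a grid over the parameter's range [0, 1],
#   2. compute prior and likelihood at each grid point,
#   3. multiply and normalize to get the posterior.
n_points = 100
p_grid = [i / (n_points - 1) for i in range(n_points)]
prior = [1.0] * n_points                               # flat prior
likelihood = [comb(9, 6) * p**6 * (1 - p)**3 for p in p_grid]

unstd = [lik * pr for lik, pr in zip(likelihood, prior)]
total = sum(unstd)
posterior = [u / total for u in unstd]                 # sums to 1

peak = max(range(n_points), key=lambda i: posterior[i])
print(p_grid[peak])                                    # ~0.667 = 6/9
```

With one parameter, 100 points is cheap; with ten parameters, the same resolution would need 100^10 points, which is the scaling problem described above.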

Quadratic approximation

Quadratic approximation solves the (points)^parameters scaling problem by replacing the grid with a continuous function: a Gaussian. Because a Gaussian can be described using two numbers (mean and variance), this essentially compresses the grid to two values per parameter. It works really well for a lot of problems because the region near the posterior’s peak is often approximately Gaussian, so it gives nearly exact solutions. Even when it doesn’t, I am guessing that by adding more Gaussians, and making the output a sum of Gaussians, one could probably approximate an arbitrary parameter distribution.
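Since the book postpones the details, here is only a rough sketch of the idea as I understand it (not the book’s `quap` implementation): find the peak of the log-posterior, then read the Gaussian’s standard deviation off the curvature there.

```python
from math import log, sqrt

# Quadratic approximation for the globe toss (6 water in 9 tosses,
# flat prior). The log-posterior is known up to an additive constant.
def log_post(p):
    return 6 * log(p) + 3 * log(1 - p)

# Step 1: crude MAP search on a fine grid (a real implementation would
# use a proper optimizer).
grid = [i / 10000 for i in range(1, 10000)]
p_map = max(grid, key=log_post)

# Step 2: numerical second derivative at the peak; the Gaussian's
# standard deviation is sqrt(-1 / f''(p_map)).
h = 1e-4
second = (log_post(p_map + h) - 2 * log_post(p_map) + log_post(p_map - h)) / h**2
sd = sqrt(-1 / second)
print(p_map, sd)   # peak ~0.667, sd ~0.157
```

The whole posterior is then summarized by just those two numbers, regardless of how finely a grid would have had to sample it.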

The author does not go into the details of the method yet - hopefully it will be explained in the next chapters.

Markov Chain Monte Carlo (MCMC)

This is the main technique I want to learn! This chapter does not go into the details though as it is a complicated technique.

“The conceptual challenge with MCMC lies in its highly non-obvious strategy. Instead of attempting to compute or approximate the posterior distribution directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities. You can then build a picture of the posterior from the histogram of these samples.” (McElreath, 2020, p. 45)

From what I understand, the point is to generate parameter values that explain the data samples well. Do this a lot of times, and the posterior distribution emerges from the density/histogram of the parameter values. It is not very clear how this is done from the overthinking sample code. I guess I have to wait till Chapter 9.
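My reading of the overthinking snippet, translated into a sketch of a random-walk Metropolis sampler (the names and tuning values here are mine, not the book’s):

```python
import random
from math import comb

# A minimal Metropolis sampler for the globe toss (6 water in 9 tosses,
# flat prior). Each step proposes a nearby p and accepts it with
# probability min(1, likelihood ratio); the accumulated samples trace
# out the posterior.
random.seed(1)

def likelihood(p):
    if not 0 < p < 1:
        return 0.0                        # proposals outside [0,1] rejected
    return comb(9, 6) * p**6 * (1 - p)**3

samples = []
p = 0.5                                   # starting value
for _ in range(20000):
    proposal = p + random.gauss(0, 0.1)   # random-walk proposal
    if random.random() < likelihood(proposal) / likelihood(p):
        p = proposal                      # accept; otherwise keep current p
    samples.append(p)

print(sum(samples) / len(samples))        # posterior mean, roughly 0.64
```

The non-obvious part the quote describes is exactly this: no curve is ever computed, yet a histogram of `samples` approximates the same posterior that grid approximation builds point by point.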