Geocentric model vs Heliocentric model:

  • Both used the same device - epicycles - to describe the motion of planets across the sky. The difference is that one kept the Earth at the center and the other kept the Sun at the center.
  • Copernicus preferred the heliocentric model for reasons of parsimony: it needed fewer epicycles to describe planetary motion than the geocentric model did.
  • For any given sample, many causal models will fit. We use parsimony to find the simplest model that fits well.

Problems with prediction:

  • What function describes these points? - (fitting, compression)
  • What function explains these points? (causal inference between X, Y)
  • What would happen if we changed a point's mass? (intervention)
  • What is the next observation from the same process? (prediction)

Cross-validation

  • lppd - log pointwise predictive density
  • cross-validation
    • leave-one-out cross-validation - typically the average residual (sum of squares) on the held-out points is used as a measure of how well the line predicts new data.
    • works well to show that as models get more complex, the in-sample fit improves but the cross-validation score gets worse (the model predicts badly when a point is left out).
    • this is called overfitting.
    • the goal is a function that fits the data well and also predicts new data well.
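The overfitting pattern above can be shown numerically. A minimal sketch, assuming made-up quadratic data: in-sample error always shrinks as the polynomial degree grows, while leave-one-out error gets worse for the most flexible model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a noisy quadratic trend.
x = np.linspace(-2, 2, 20)
y = 0.5 * x**2 + rng.normal(0.0, 0.5, size=x.size)

def in_sample_sse(x, y, degree):
    """Sum of squared residuals when fitting and scoring on the same data."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

def loo_sse(x, y, degree):
    """Leave-one-out cross-validation: refit N times, each time
    scoring the squared residual on the single held-out point."""
    sse = 0.0
    for i in range(x.size):
        keep = np.arange(x.size) != i
        coeffs = np.polyfit(x[keep], y[keep], degree)
        sse += float((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return sse

for d in (1, 2, 9):
    print(d, round(in_sample_sse(x, y, d), 2), round(loo_sse(x, y, d), 2))
```

The degree-9 polynomial has the smallest in-sample error but a much larger leave-one-out error: it fits well but predicts badly.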
  • regularization
    • keeps the model from letting everything change in response to a single feature or data point.
    • designing the penalty (or priors) so that the model produces good cross-validation performance.
    • ridge regression adds a sum-of-squares penalty on the slopes, so they cannot be arbitrarily big.
    • in a Bayesian model, regularization can be implemented by making the priors tighter (not allowing the parameters to move too much).
    • this works very well, and plotting the prediction error in and out of sample shows how much regularization helps the fit. But be careful: priors that are too tight make it hard to fit new data (Bayesian inference - importance of weak priors). Good regularizing priors are important (no need to be perfect: not too flat, but not too narrow either).
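The ridge penalty can be sketched with its closed-form solution. This is an illustrative example on made-up data, not any particular textbook model; it shows that as the penalty weight grows, the slopes shrink toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 predictors, only the first one actually matters.
n, p = 30, 10
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] + rng.normal(0.0, 1.0, size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: minimizing squared error plus
    lam * sum(beta^2) gives beta = (X'X + lam*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    beta = ridge(X, y, lam)
    print(lam, round(float(np.sum(beta**2)), 3))
```

In the Bayesian reading, the same penalty corresponds to an independent Normal(0, sigma/sqrt(lam)) prior on each slope: a larger lam is a tighter, more skeptical prior.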
  • penalty prediction for cross-validation
    • cross-validation requires fitting N models for N points (with the leave-one-out approach)
    • how to obtain the penalty term without N fits (important especially for complex models with many data points)
      • PSIS
      • WAIC
    • neither addresses causal inference.
    • these only score fit and prediction of the data points (not the causality of linked components in the model)
    • causal inference is prediction in the presence of intervention.
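How a penalty term comes out of a single fit can be sketched with WAIC. A minimal example, assuming stand-in posterior draws for a Gaussian model rather than samples from a real fit: lppd averages the likelihood (not the log-likelihood) over posterior samples for each point, and the penalty is the per-point variance of the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_logpdf(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

# Hypothetical "posterior": 1000 draws of (mu, sigma) for a Gaussian model.
y = rng.normal(0.0, 1.0, size=50)
mu_s = rng.normal(0.0, 0.1, size=1000)
sigma_s = np.abs(rng.normal(1.0, 0.05, size=1000))

# Log-likelihood matrix: rows are posterior samples, columns are observations.
ll = normal_logpdf(y[None, :], mu_s[:, None], sigma_s[:, None])

# lppd: for each point, log of the average (not average of the log) likelihood.
lppd = float(np.sum(np.log(np.mean(np.exp(ll), axis=0))))

# WAIC penalty: variance of the log-likelihood across posterior samples.
p_waic = float(np.sum(np.var(ll, axis=0, ddof=1)))

waic = -2 * (lppd - p_waic)
print(round(waic, 2))
```

One fit, one log-likelihood matrix, and the penalty falls out of the posterior variance; no N refits needed. Note that this scores prediction only, per the bullets above.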
  • Model mis-selection
    • DO NOT use predictive criteria (WAIC, PSIS, CV) to choose a causal estimate!
    • it is easy to find wrong models that fit the data better. The reasons are often clear when you look at the data carefully, but this is often not done.
  • Outliers & Robust regression
    • Outliers - points that are more influential than others (there is more surprise in these points).
    • Dropping outliers is bad - it ignores the problem. It is the model that is wrong, not the data.
    • Quantify the influence of each point instead. This can be done with cross-validation: drop a point, refit, and see how much the posterior changes.
      • PSIS k statistic
      • WAIC penalty term
    • Use a better model - mixture model (robust regression)
      • mixing gaussians with different standard deviations (but the same mean) gives rise to a student-t distribution, which has thicker tails (outliers are more probable). this matches processes where several gaussians with the same mean but different variances underlie the data (unobserved heterogeneity).
      • so instead of a gaussian distribution for the outcome, we can use a student-t distribution, which can absorb the surprise from outliers.
      • the student-t distribution requires an extra parameter for how much variability is in the tails (degrees of freedom). estimating it well needs a lot of data (many outliers), which is generally not available, so the principled approach is to try several values and report all of them.
      • student-t regression is a good default, especially when there are outliers (and unobserved heterogeneity)
    • the tails of the distributions are an important thing to consider. gaussians are skeptical (thin-tailed) while other distributions are heavy-tailed, so think about what the prior/posterior should look like (and which distribution models it).
      • War, investment, etc. follow thick-tailed distributions
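The scale-mixture claim above can be checked by simulation. A minimal sketch: drawing each observation's variance from an inverse-gamma distribution and then sampling a gaussian with that variance yields exactly a student-t, and the mixture puts far more mass in the tails than a single gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
nu = 3.0  # degrees of freedom: smaller nu means thicker tails

# Scale mixture: sigma^2 ~ Inv-Gamma(nu/2, nu/2), then x ~ Normal(0, sigma).
# This mixture is a Student-t distribution with nu degrees of freedom.
sigma2 = (nu / 2) / rng.gamma(nu / 2, 1.0, size=n)  # inverse-gamma draws
mixture = rng.normal(0.0, np.sqrt(sigma2))
gaussian = rng.normal(0.0, 1.0, size=n)

# Tail mass beyond 4 standard units: much larger under the mixture,
# so "outliers" are ordinary events rather than shocking surprises.
print(float(np.mean(np.abs(mixture) > 4)), float(np.mean(np.abs(gaussian) > 4)))
```

This is why student-t regression absorbs outliers: the same points that are wildly surprising under a gaussian likelihood are unremarkable under the heavy-tailed mixture.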