Real research problems:

  • Many unknowns
  • Nested relationships
  • Bounded outcomes
  • Difficult calculations

Metropolis algorithm:

  • The idea: how do you visit a chain of islands so that, in the long run, the time spent on each island is proportional to its population?
  • Simple rule of thumb algorithm
    1. Toss a coin to decide which neighboring island to propose moving to (up or down). This is called the proposal.
    2. Compute the ratio of the proposed island's population to the current island's population. This ratio is the probability of accepting the proposal.
    3. Move to the proposed island with that probability (capped at 1). This is the Monte Carlo step.
    4. Repeat from step 1.
  • Oldest Markov chain Monte Carlo (MCMC) algorithm
    • Markov chain, because it is a sequence/chain of draws, each of which depends only on the current state (no further history dependence).
    • Monte Carlo because of the random simulation.
  • Usual use case: draw samples from the posterior distribution
    • islands: parameter values (dimensions)
    • population size: posterior probability
    • visit each parameter value in proportion to its posterior probability
  • Oldest algorithm, but very inefficient, so rarely used in practice today.
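The island-hopping steps above can be sketched in a few lines. This is a minimal illustration, not a library implementation; the function name and the populations are made up for the example:

```python
import random

# Hypothetical island populations (index = island, value = relative population).
populations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def metropolis_islands(populations, n_steps, seed=0):
    """Visit islands so that, in the long run, time spent on each island
    is proportional to its population (Metropolis algorithm)."""
    rng = random.Random(seed)
    n = len(populations)
    current = 0
    visits = [0] * n
    for _ in range(n_steps):
        visits[current] += 1
        # 1. Proposal: coin flip, move one island up or down (ring of islands).
        proposal = (current + rng.choice([-1, 1])) % n
        # 2. Acceptance probability: population ratio, capped at 1.
        accept_prob = min(1.0, populations[proposal] / populations[current])
        # 3. Monte Carlo step: move with that probability.
        if rng.random() < accept_prob:
            current = proposal
    return visits

visits = metropolis_islands(populations, 100_000)
# Visit counts should be roughly proportional to the populations.
```

After enough steps, the most populous island is visited about ten times as often as the least populous one, even though each move only looks at the two neighboring populations.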

MCMC:

  • Rosenbluth (Metropolis) algorithm

    • step size - dictates how far the next proposal lies from the current position.
      • large step sizes cause proposals to be rejected often, because they jump past many islands at once; but accepted moves cover the posterior faster.
      • small step sizes mean proposals are accepted often and exploration is systematic, but it takes a long time to sample the whole posterior.
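The step-size trade-off can be seen directly by running a random-walk Metropolis sampler on a continuous target and comparing acceptance rates. A minimal sketch, with a made-up helper name and a standard normal as the target:

```python
import math
import random

def metropolis_acceptance(step_size, n_steps, seed=0):
    """Random-walk Metropolis targeting a standard normal;
    returns the fraction of proposals that were accepted."""
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    log_p = lambda x: -0.5 * x * x  # log density of N(0, 1), up to a constant
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability p(proposal) / p(x), capped at 1.
        if math.log(rng.random()) < log_p(proposal) - log_p(x):
            x = proposal
            accepted += 1
    return accepted / n_steps

small = metropolis_acceptance(0.1, 20_000)   # high acceptance, slow exploration
large = metropolis_acceptance(10.0, 20_000)  # low acceptance, big jumps
```

Small steps are accepted nearly always but creep through the posterior; large steps are mostly rejected because they overshoot the high-probability region.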
  • Hamiltonian Monte Carlo algorithm

    • the log of a Gaussian distribution is a parabola (a quadratic bowl), so we can do convex optimization and follow gradients.
    • the idea is essentially to roll a frictionless marble across this bowl with a random velocity and stop it after a fixed time; where it stops on the n-dimensional surface is the next proposal.
    • Hamiltonian, because it simulates Hamiltonian (physical) dynamics.
    • We can use this on n-dimensional surfaces, so can work for a lot of parameters.
    • has several tuning parameters:
      • random velocity
      • step size
      • trajectory length
    • needs gradients
      • you can derive them by hand with standard calculus, but that is tedious and error-prone.
      • you can estimate gradients with finite differences - however this doesn't work well for Monte Carlo, because it is not precise enough and gives rise to numerical problems.
      • autodiff (autograd for ANNs) - automatic differentiation.
        • symbolic derivatives of your model code
        • used in machine learning (backpropagation)
        • stan math library does this.
        • Turing.jl (Julia) uses the same kind of algorithms.
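The marble-rolling idea above can be sketched as a minimal 1-D HMC sampler. This is an illustrative toy, not Stan's implementation: the function name is made up, the target is a standard normal whose gradient is supplied analytically, and real libraries would obtain that gradient by automatic differentiation:

```python
import math
import random

def hmc_sample(log_p_and_grad, x0, n_samples, step_size=0.1, n_leapfrog=20, seed=0):
    """Minimal 1-D Hamiltonian Monte Carlo sketch.
    log_p_and_grad(x) returns (log density, gradient of log density)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        v = rng.gauss(0.0, 1.0)                 # random initial velocity
        x_new = x
        lp, grad = log_p_and_grad(x_new)
        lp0 = lp
        # Leapfrog integration: follow the gradient for a fixed trajectory length.
        v_new = v + 0.5 * step_size * grad      # half step for velocity
        for i in range(n_leapfrog):
            x_new += step_size * v_new          # full step for position
            lp, grad = log_p_and_grad(x_new)
            # full velocity steps in between, a final half step at the end
            v_new += (1.0 if i < n_leapfrog - 1 else 0.5) * step_size * grad
        # Accept/reject based on the change in total energy (integration error).
        h0 = -lp0 + 0.5 * v * v
        h1 = -lp + 0.5 * v_new * v_new
        if math.log(rng.random()) < h0 - h1:
            x = x_new
        samples.append(x)
    return samples

# Standard normal target: log p(x) = -x^2 / 2 (up to a constant), gradient -x.
samples = hmc_sample(lambda x: (-0.5 * x * x, -x), 0.0, 5_000)
```

Note how the three tuning parameters from the notes appear explicitly: the random velocity `v`, the `step_size`, and the trajectory length (`n_leapfrog` steps).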
  • Implementing a Hamiltonian MC algorithm

    • ulam in the rethinking package; the rest is the same as before: define priors, write a set of equations, etc.
    • we need a lot of diagnostics, because the machinery we are using is complex.
      • trace plots - visualizations of the Markov chain.
        • a healthy chain moves rapidly around the high-probability space (it looks like a fuzzy caterpillar).
        • warmup (grey region) - the sampler tunes step size and other parameters to ensure the sampling is good.
        • samples (white region) - the actual samples used for inference.
        • chains overlaid - generally 4 or more chains are run to check that the posterior converges; they are overlaid to see whether the samples come from similar locations.
        • bad chains wander around and do not sample the same space; with multiple chains, each samples a different region.
        • with a well-configured chain, a few hundred samples give a good estimate of the posterior, so within 100-1000 samples the whole posterior should be covered.
      • trace rank plots
        • rank plots - rank every sample across all chains, then look at how each chain's ranks are distributed.
        • same criterion as above - every chain should consistently sample the whole space: it shouldn't be stuck in a narrow band of ranks, but have ranks spanning the whole range.
      • R-hat
        • chain convergence estimate
        • chains converge when:
          • the start and end of each chain explore the same region
          • independent chains explore the same region
        • R-hat is a ratio of variances:
          • total variance of all samples, pooled across chains
          • average variance within each chain
          • R-hat compares the total variance to the average within-chain variance
          • as the chains converge, the total variance shrinks to the within-chain variance, so the ratio approaches 1
          • if the ratio stays noticeably above 1 (say 1.1 or higher), there is a problem
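The variance-ratio idea can be made concrete with a rough sketch. This is a simplified version for illustration only (the function name and example chains are made up); real implementations such as Stan's use split chains and rank-normalization:

```python
import statistics

def r_hat(chains):
    """Rough R-hat sketch: total variance of all samples pooled across chains,
    divided by the average within-chain variance (equal-length chains assumed)."""
    within = statistics.mean(statistics.variance(c) for c in chains)
    pooled = [x for c in chains for x in c]
    total = statistics.variance(pooled)
    return (total / within) ** 0.5

# Converged-looking chains: both sampling the same region around 0.
good = [[0.1, -0.2, 0.3, -0.1, 0.2, 0.0, -0.3, 0.1],
        [0.0, 0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -0.1]]
# Bad chains: each stuck in a different region, so pooled variance is huge.
bad = [[5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9],
       [-5.0, -4.8, -5.1, -4.9, -5.2, -5.0, -4.9, -5.1]]
```

For the good chains the ratio is close to 1; for the stuck chains the pooled variance dwarfs the within-chain variance and the ratio explodes.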
      • n_eff (also called Effective Sample Size (ESS) or bulk_ESS)
        • the number of effectively independent samples
        • how long the Markov chain would be, if its samples were uncorrelated (as if drawn from a random number generator).
        • when samples are autocorrelated, there are fewer effective samples.
        • autocorrelation function (ACF) - MCMC samples typically show some autocorrelation.
        • sometimes the algorithm is so efficient that it does better than random (it deliberately moves into unsampled regions); in that case the effective sample size can exceed the number of samples actually drawn.
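The link between autocorrelation and effective sample size can be sketched with a crude estimator. This is an illustration with a made-up function name; real implementations (e.g. Stan's bulk_ESS) use rank-normalization and a more careful truncation rule:

```python
import random

def effective_sample_size(samples):
    """Crude ESS sketch: n / (1 + 2 * sum of autocorrelations),
    truncated once the autocorrelation dies out."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    var = sum(c * c for c in centered) / n
    tau = 1.0
    for lag in range(1, n // 2):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * var)
        if acf < 0.05:  # stop once autocorrelation is negligible
            break
        tau += 2 * acf
    return n / tau

rng = random.Random(1)
iid = [rng.gauss(0, 1) for _ in range(2000)]   # independent draws: ESS ~ n
corr = [0.0]
for _ in range(1999):                          # AR(1) chain: heavily autocorrelated
    corr.append(0.9 * corr[-1] + rng.gauss(0, 1))
```

The independent draws have an effective sample size near 2000, while the autocorrelated chain of the same length has far fewer effectively independent samples.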
  • Divergent transitions: a kind of rejected proposal

    • with the Metropolis algorithm, rejections happen often; Hamiltonian MC is very efficient, so they happen rarely.
    • so if there are many divergent transitions, exploration is poor and the estimates may be biased.
  • The folk theorem of statistical computing:

    When you have computational problems, often there’s a problem with your model