class: center, middle, title-slide

.title[
# Markov Chain Monte Carlo
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---
layout: true

<link rel="stylesheet" href="stat7400.css" type="text/css" />
<style type="text/css"> .remark-code { font-size: 75%; } </style>

<!-- $ -- get emacs sane again -->

## Simulation with Dependent Observations

---

Suppose we want to compute

`\begin{equation*}
\theta = E[h(X)] = \int h(x) f(x) dx
\end{equation*}`

--

Crude Monte Carlo: generate `\(X_1,\dots,X_N\)` that are

* independent;

* identically distributed from `\(f\)`;

and compute `\(\widehat{\theta} = \frac{1}{N} \sum h(X_i)\)`.

--

Then

* `\(\widehat{\theta} \to \theta\)` by the law of large numbers;

* `\(\widehat{\theta} \sim \text{AN}(\theta, \sigma^2/N)\)` by the central limit theorem;

* `\(\sigma^2\)` can be estimated by the sample variance.

$$
\renewcommand{\Real}{\mathbb{R}}
\renewcommand{\Var}{\text{Var}}
\renewcommand{\Cov}{\text{Cov}}
$$

---

Sometimes generating independently from `\(f\)` is not possible or too costly.

--

Importance sampling: generate independently from `\(g\)` and reweight.

--

An alternative: generate _dependent_ samples in a way that

* preserves the law of large numbers;

* has a central limit theorem if possible.

--

Variance estimation will be more complicated.

---
layout: true

## A Simple Example

---

Suppose `\(X,Y\)` are bivariate normal with mean zero, variance 1, and correlation `\(\rho\)`,

`\begin{equation*}
(X,Y) \sim \text{BVN}(0,1,0,1,\rho).
\end{equation*}`

--

Then

`\begin{align*}
Y|X=x &\sim \text{N}(\rho x, 1-\rho^2)\\
X|Y=y &\sim \text{N}(\rho y, 1-\rho^2).
\end{align*}`

--

Suppose we start with some initial values `\(X_0, Y_0\)` and generate

`\begin{align*}
X_1 &\sim \text{N}(\rho Y_0, 1-\rho^2)\\
Y_1 &\sim \text{N}(\rho X_1, 1-\rho^2)
\end{align*}`

(the value of `\(X_0\)` is not used),

--

and continue for `\(i = 1, \dots, N-1\)`

`\begin{align*}
X_{i+1} &\sim \text{N}(\rho Y_i, 1-\rho^2)\\
Y_{i+1} &\sim \text{N}(\rho X_{i+1}, 1-\rho^2)
\end{align*}`

---

.pull-left[
For `\(\rho = 0.75\)` and `\(Y_0 = 0\)`:

```r
r <- 0.75
x <- numeric(10000)
y <- numeric(10000)
x[1] <- rnorm(1, 0, sqrt(1 - r^2))
y[1] <- rnorm(1, r * x[1], sqrt(1 - r^2))
for (i in 1:(length(x) - 1)) {
    x[i + 1] <- rnorm(1, r * y[i], sqrt(1 - r^2))
    y[i + 1] <- rnorm(1, r * x[i + 1], sqrt(1 - r^2))
}
```
{{content}}
]

--

The joint distribution of `\(X\)` and `\(Y\)` and the marginal distribution of `\(X\)` look right; the marginal distribution of `\(Y\)` does as well:

--

.pull-right[
<img src="mcmc_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
{{content}}
]

--

The correlation is close to the target value:

```r
cor(x, y)
## [1] 0.743805
```

---

.pull-left[
Initial sample paths show the dependence:

<img src="mcmc_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

--

.pull-right[
The sample auto-correlation function for `\(X\)` also reflects the dependence:

<img src="mcmc_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

The sequence of pairs `\((X_i,Y_i)\)` forms a _continuous state space Markov chain_.

--

Suppose `\((X_0,Y_0) \sim \text{BVN}(0,1,0,1,\rho)\)`.

--

Then

* `\((X_1,Y_0) \sim \text{BVN}(0,1,0,1,\rho)\)`;

* `\((X_1,Y_1) \sim \text{BVN}(0,1,0,1,\rho)\)`;

* `\((X_2,Y_1) \sim \text{BVN}(0,1,0,1,\rho)\)`;

* `\(\dots\)`

--

So `\(\text{BVN}(0,1,0,1,\rho)\)` is a _stationary distribution_ or _invariant distribution_ of the Markov chain.
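---

The invariance claim is easy to check empirically. A minimal sketch (not part of the original deck): start from an exact `\(\text{BVN}(0,1,0,1,\rho)\)` draw and confirm that one full update leaves the distribution unchanged.

```r
## Sketch: start at the target and take one update step.
r <- 0.75
n <- 100000
y0 <- rnorm(n)                          ## Y0 ~ N(0, 1)
x1 <- rnorm(n, r * y0, sqrt(1 - r^2))   ## X1 | Y0
y1 <- rnorm(n, r * x1, sqrt(1 - r^2))   ## Y1 | X1
## (X1, Y1) should again look BVN(0, 1, 0, 1, r):
c(mean(x1), sd(x1), sd(y1), cor(x1, y1))
```

The means should be near zero, the standard deviations near one, and the correlation near `\(\rho = 0.75\)`.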
--

`\(\text{BVN}(0,1,0,1,\rho)\)` is also the _equilibrium distribution_ of the chain:

--

* For any starting distribution the joint distribution of `\((X_n,Y_n)\)` converges to the `\(\text{BVN}(0,1,0,1,\rho)\)` distribution.

---

For this example, `\(X_1, X_2, \dots\)` is an AR(1) process

`\begin{equation*}
X_i = \rho^2 X_{i-1} + \epsilon_i
\end{equation*}`

with the `\(\epsilon_i\)` independent and `\(N(0, 1-\rho^4)\)`.

--

Standard time series results show that the equilibrium distribution of this chain is `\(\text{N}(0,1)\)`.

--

If

`\begin{equation*}
\overline{X}_N = \frac{1}{N}\sum_{i=1}^N X_i
\end{equation*}`

--

then

$$
\Var({\overline{X}_N}) \approx \frac{1}{N} \frac{1-\rho^4}{(1-\rho^2)^2}
= \frac{1}{N}\frac{1+\rho^2}{1-\rho^2}.
$$

---
layout: true

## Markov Chain Monte Carlo

---

The objective is to compute

`\begin{equation*}
\theta = E[h(X)] = \int h(x) f(x) dx.
\end{equation*}`

--

Basic idea:

--

* Construct a Markov chain with invariant distribution `\(f\)`.

--

* Make sure the chain has `\(f\)` as its equilibrium distribution.

--

* Pick a starting value `\(X_0\)`.

--

* Generate `\(X_1, \dots, X_N\)` and compute
  `$$\widehat{\theta} = \frac{1}{N} \sum h(X_i).$$`

--

* Possibly repeat independently several times, maybe with different starting values.

---

Some issues:

--

* How to construct a Markov chain with a particular invariant distribution.

--

* How to estimate the variance of `\(\widehat{\theta}\)`.

--

* What value of `\(N\)` to use.

--

* Should an initial portion be discarded; if so, how much?

---
layout: true

## Some MCMC Examples

---

Markov chain Monte Carlo (MCMC) is used for a wide range of problems and applications:

--

* generating spatial processes;

--

* sampling from equilibrium distributions in physical chemistry;

--

* computing likelihoods in missing data problems;

--

* computing posterior distributions in Bayesian inference;

--

* optimization, e.g. simulated annealing;

--

* `\(\dots\)`

---

### Strauss Process

The Strauss process is a model for random point patterns with some regularity.

--

A set of `\(n\)` points is distributed on a region `\(D\)` with finite area or volume.

--

The process has two parameters, `\(c \in [0,1]\)` and `\(r > 0\)`.

--

The joint density of the points is

`\begin{equation*}
f(x_1,\dots,x_n) \propto c^{N(x, r)}
\end{equation*}`

where `\(N(x, r)\)` is the number of pairs within `\(r\)` of each other.

--

For `\(c = 0\)` the density is zero if any two points are within `\(r\)` of each other.

--

Simulating independent draws from `\(f\)` is possible in principle but very inefficient.

---

A Markov chain algorithm:

--

* Start with `\(n\)` points `\(X_1, \dots, X_n\)` with `\(f(X_1,\dots,X_n) > 0\)`.

--

* Choose an index `\(i\)` at random, remove point `\(X_i\)`, and replace it with a draw from
  `$$f(y|x_1,\dots,x_{i-1},x_{i+1},\dots,x_n) \propto c^{N(y, x_{-i}, r)}$$`
  with `\(N(y, x_{-i}, r)\)` the number of the remaining `\(n-1\)` points within `\(r\)` of `\(y\)`.

--

* This can be sampled reasonably efficiently by rejection sampling.

--

* Repeat for a reasonable number of steps.

--

`Strauss()` in package `spatial` implements this algorithm.

--

[**Caution:** it used to be easy to hang R because the C code would go into an infinite loop for some parameters. More recent versions may have modified C code that checks for interrupts.]

---

The algorithm is due to Ripley (1979); Ripley's algorithm applies to a general multivariate density.
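---

A minimal sketch of one update step, assuming `\(D\)` is the unit square (the helper `straussStep` is hypothetical, not the `spatial::Strauss()` implementation); since `\(c^k \le 1\)`, a uniform proposal can be accepted with probability `\(c^k\)`:

```r
## One single-point update for the Strauss process (a sketch).
## The parameter c matches the notation on the slides.
straussStep <- function(x, y, c, r) {
    i <- sample(length(x), 1)                    ## point to replace
    repeat {
        xnew <- runif(1)                         ## uniform proposal on D
        ynew <- runif(1)
        ## count the remaining n - 1 points within r of the proposal
        k <- sum((x[-i] - xnew)^2 + (y[-i] - ynew)^2 < r^2)
        if (runif(1) < c^k) break                ## accept w. p. c^k
    }
    x[i] <- xnew
    y[i] <- ynew
    list(x = x, y = y)
}
```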
--

A sample realization:

<!-- ## nolint start: semicolon -->
<img src="mcmc_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
<!-- ## nolint end -->

---

### Markov Random Fields

A Markov random field is a spatial process in which

--

* each index `\(i\)` has a set of neighbor indices `\(\mathcal{N}_i\)`;

--

* `\(X_i\)` and `\(\{X_k: k \neq i, k \not \in \mathcal{N}_i\}\)` are conditionally independent given `\(\{X_j: j \in \mathcal{N}_i\}\)`.

--

In a binary random field the values of `\(X_i\)` are binary.

--

Random fields are used as models in

--

* spatial statistics;

--

* image processing;

--

* `\(\dots\)`

---

A simple model for an `\(n \times n\)` image with `\(c\)` colors:

`\begin{equation*}
f(x) \propto \exp\{\beta\times (\text{number of adjacent pixel pairs with the same color})\}.
\end{equation*}`

--

This is called a Potts model. The special case for `\(c = 2\)` is an Ising model.

--

For a pixel `\(i\)`, the conditional PMF of the pixel color `\(X_i\)`, given the other pixel colors, is

`\begin{equation*}
f(x_i|\text{rest}) \propto \exp\{\beta \times w_i(x_i) \}
\end{equation*}`

where `\(w_i(x)\)` is the number of neighbors of pixel `\(i\)` with color `\(x\)`.

--

<img src="mcmc_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Simple Image Reconstruction

Suppose a binary image is contaminated with noise.

--

The noise process is assumed to act independently on the pixels.

--

Each pixel is left unchanged with probability `\(p\)` and flipped to the opposite color with probability `\(1-p\)`.

--

The likelihood for the observed image `\(y\)` is

`\begin{equation*}
f(y|x) \propto p^m (1-p)^{n^2-m}
\end{equation*}`

with `\(m\)` the number of pixels with `\(y_i = x_i\)`.

---

A Markov random field is used as the prior distribution of the true image.

--

The posterior distribution of the true image is

`\begin{equation*}
f(x|y) \propto p^m (1-p)^{n^2-m} e^{\beta w}
\end{equation*}`

with `\(w\)` the number of adjacent pixel pairs in the true image `\(x\)` with the same color.

--

The posterior distribution is also a Markov random field, and

`\begin{equation*}
f(x_i|y, x_j: j \neq i) \propto p^{m_i}(1-p)^{1-m_i} e^{\beta w_i}
\end{equation*}`

with `\(m_i = 1\)` if `\(x_i = y_i\)` and `\(m_i = 0\)` otherwise, and `\(w_i\)` the number of neighbors with color `\(x_i\)`.

--

Images drawn from the posterior distribution can be averaged to form a posterior mean.

--

The posterior mean can be rounded to a binary image.

--

Many other variations are possible.

---

.pull-left[
A simple example:

[Another example using `\(\beta = 0.5\)` and `\(\beta = 0.35\)` and `\(N=100\)` will be shown in class.]
]

.pull-right[
<!-- ## nolint start: semicolon -->
<img src="mcmc_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
<!-- ## nolint end -->
]

---

This is a simple version of the ideas in

> Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs
> distributions and the Bayesian restoration of images, _IEEE
> Transactions on Pattern Analysis and Machine Intelligence_
> **6**, 721--741.

--

The posterior distributions are related to Gibbs distributions in physics.

--

Geman and Geman call the Markov chain algorithm _Gibbs sampling_.

<!--
Example of recent research:

Barbu, A. and Zhu, S. (2005) Generalizing Swendsen-Wang to Sampling
Arbitrary Posterior Probabilities, _IEEE Transactions on Pattern
Analysis and Machine Intelligence_ **27**, 1239--1253.
--> --- .pull-left.width-40[ Code for the image sampler: ] .pull-right.width-60[ ```r simImg <- function(m, img, N, beta, p) { colors <- 1:2 inboff <- c(-1, -1, -1, 0, 0, 1, 1, 1) jnboff <- c(-1, 0, 1, -1, 1, -1, 0, 1) for (k in 1:N) { for (i in seq_len(nrow(m))) { for (j in seq_len(ncol(m))) { w <- double(length(colors)) inb <- i + inboff jnb <- j + jnboff omit <- inb == 0 | inb == nrow(m) + 1 | jnb == 0 | jnb == ncol(m) + 1 inb <- inb[!omit] jnb <- jnb[!omit] for (ii in seq_along(inb)) { kk <- m[inb[ii], jnb[ii]] w[kk] <- w[kk] + 1 } if (is.null(img)) lik <- 1 else lik <- ifelse(img[i, j] == colors, p, 1 - p) prob <- lik * exp(beta * w) m[i, j] <- sample(colors, 1, TRUE, prob = prob) } } } m } ``` ] --- ### Profiling to Improve Performance `Rprof()` can be used to turn on profiling. -- During profiling a _stack trace_ is written to a file, `Rprof.out` by default, 50 times per second. -- `summaryRprof()` produces a summary of the results. <!-- ## nolint start: semicolon --> ```r Rprof(); system.time(simImg(m, img, 100, .35, .7)); Rprof(NULL) ## user system elapsed ## 6.970 0.000 6.976 s <- summaryRprof() head(s$by.self) ## self.time self.pct total.time total.pct ## "simImg" 1.90 27.14 6.98 99.71 ## "ifelse" 1.34 19.14 2.32 33.14 ## "sample.int" 0.92 13.14 1.86 26.57 ## "which" 0.58 8.29 0.90 12.86 ## "stopifnot" 0.56 8.00 0.94 13.43 ## "sample" 0.44 6.29 2.30 32.86 ``` <!-- ## nolint end --> --- Replacing `ifelse` in ```r else lik <- ifelse(img[i, j] == colors, p, 1 - p) ``` with ```r else if (img[i, j] == 1) lik <- c(p, 1 - p) else lik <- c(1 - p, p) ``` speeds things up somewhat: <!-- ## nolint start: semicolon --> ```r Rprof();system.time(simImg1(m,img,100,.35,.7));Rprof(NULL) ## user system elapsed ## 4.991 0.000 5.001 s1 <- summaryRprof() head(s1$by.self) ## self.time self.pct total.time total.pct ## "simImg1" 2.38 47.41 5.00 99.60 ## "sample.int" 1.22 24.30 1.96 39.04 ## "stopifnot" 0.42 8.37 0.70 13.94 ## "sample" 0.22 4.38 2.22 44.22 ## "...elt" 0.16 3.19 0.18 3.59 ## "double" 0.12 2.39 0.12 2.39 ``` <!-- ## nolint end --> <!-- % $ -- make Emacs happy --> --- .pull-left.width-40[ Starting profiling with ```r Rprof(line.profiling = TRUE) ``` will enable source profiling if our function is defined in a file and sourced. Using the package `proftools` with ```r library(proftools) pd <- readProfileData() annotateSource(pd) ``` produces: ] .pull-right.width-60.tiny-code[ ``` : simImg1 <- : function (m, img, N, beta, p) : { : colors <- 1:2 : inboff <- c(-1, -1, -1, 0, 0, 1, 1, 1) : jnboff <- c(-1, 0, 1, -1, 1, -1, 0, 1) : for (k in 1:N) { : for (i in seq_len(nrow(m))) { : for (j in seq_len(ncol(m))) { 5.06% : w <- double(length(colors)) 0.42% : inb <- i + inboff : jnb <- j + jnboff 20.68% : omit <- inb == 0 | inb == nrow(m) + 1 | jnb == : 0 | jnb == ncol(m) + 1 2.11% : inb <- inb[!omit] 4.64% : jnb <- jnb[!omit] 2.11% : for (ii in seq_along(inb)) { 3.38% : kk <- m[inb[ii], jnb[ii]] 3.80% : w[kk] <- w[kk] + 1 : } 3.38% : if (is.null(img)) : lik <- 1 : else if (img[i, j] == 1) : lik <- c(p, 1 - p) : else lik <- c(1 - p, p) 3.38% : prob <- lik * exp(beta * w) 51.05% : m[i, j] <- sample(colors, 1, TRUE, prob = prob) : } : } : } : m : } ``` ] --- .pull-left.width-40[ This suggests one more useful change: Move the loop-invariant computations `nrow(m)+1` and `ncol(m)+1` out of the loop. 
] .pull-right.width-55.tiny-code[ ```r simImg2 <- function(m, img, N, beta, p) { colors <- 1:2 inboff <- c(-1, -1, -1, 0, 0, 1, 1, 1) jnboff <- c(-1, 0, 1, -1, 1, -1, 0, 1) nrp1 <- nrow(m) + 1 ncp1 <- ncol(m) + 1 for (k in 1:N) { for (i in seq_len(nrow(m))) { for (j in seq_len(ncol(m))) { w <- double(length(colors)) inb <- i + inboff jnb <- j + jnboff omit <- inb == 0 | inb == nrp1 | jnb == 0 | jnb == ncp1 inb <- inb[!omit] jnb <- jnb[!omit] for (ii in seq_along(inb)) { kk <- m[inb[ii], jnb[ii]] w[kk] <- w[kk] + 1 } if (is.null(img)) lik <- 1 else if (img[i, j] == 1) lik <- c(p, 1 - p) else lik <- c(1 - p, p) prob <- lik * exp(beta * w) m[i, j] <- sample(colors, 1, TRUE, prob = prob) } } } m } ``` ] --- This helps as well: ```r system.time(simImg2(m, img, 100, .35, .7)) ## user system elapsed ## 4.398 0.000 4.400 ``` --- .pull-left[ ### Exploiting Conditional Independence {{content}} ] -- We are trying to sample a joint distribution of a collection of random variables `\begin{equation*} X_i, i \in \mathcal{C} \end{equation*}` {{content}} -- Sometimes it is possible to divide the index set `\(\mathcal{C}\)` into `\(k\)` groups `\begin{equation*} \mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_k \end{equation*}` such that for each `\(j\)` the indices `\(X_i, i \in \mathcal{C}_j\)` are conditionally independent given the other values `\(\{X_i, i \not \in \mathcal{C}_j\}\)`. {{content}} -- For a 4-neighbor lattice we can use two groups: -- .pull-right[ <img src="mcmc_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> Here `\(\mathcal{C}_1 = \text{red}\)` and `\(\mathcal{C}_2 = \text{white}\)`. {{content}} ] -- For an 8-neighbor lattice four groups are needed. --- .pull-left[ Each group can be updated all together, either {{content}} ] -- * using vectorized arithmetic, or {{content}} -- * using parallel computation. {{content}} -- A vectorized implementation based on this approach: -- .pull-right.tiny-code[ ```r nn <- function(m, c) { nr <- nrow(m) nc <- ncol(m) nn <- matrix(0, nr, nc) nn[1:(nr) - 1, ] <- nn[1:(nr) - 1, ] + (m[2:nr, ] == c) nn[2:nr, ] <- nn[2:nr, ] + (m[1:(nr - 1), ] == c) nn[, 1:(nc) - 1] <- nn[, 1:(nc) - 1] + (m[, 2:nc] == c) nn[, 2:nc] <- nn[, 2:nc] + (m[, 1:(nc - 1)] == c) nn } simGroup <- function(m, l2, l1, beta, which) { pp2 <- l2 * exp(beta * nn(m, 2)) pp1 <- l1 * exp(beta * nn(m, 1)) pp <- pp2 / (pp2 + pp1) ifelse(runif(sum(which)) < pp[which], 2, 1) } simImgV <- function(m, img, N, beta, p) { white <- outer(seq_len(nrow(m)), seq_len(ncol(m)), FUN = `+`) %% 2 == 1 black <- ! white if (is.null(img)) { l2 <- 1 l1 <- 1 } else { l2 <- ifelse(img == 2, p, 1 - p) l1 <- ifelse(img == 1, p, 1 - p) } for (i in 1:N) { m[white] <- simGroup(m, l2, l1, beta, white) m[black] <- simGroup(m, l2, l1, beta, black) } m } ``` ] --- The results: ```r system.time(simImgV(m, img, 100, .35, .7)) ## user system elapsed ## 0.133 0.012 0.145 ``` --- layout: true ## More MCMC Examples --- ### Monte Carlo Maximum Likelihood Suppose we have an exponential family likelihood `\begin{equation*} h(x|\theta) = c(\theta) e^{\theta x - \nu(x)} = c(\theta) \tilde{h}(x|\theta) \end{equation*}` -- In many problems `\(c(\theta)\)` is not available in closed form: -- * Strauss process with `\(\theta = \log c\)`. -- * MRF image model with `\(\theta = \beta\)`. --- Geyer and Thompson (1992) write `\begin{equation*} \log\frac{h(x|\theta)}{h(x|\eta)} = \log\frac{\tilde{h}(x|\theta)}{\tilde{h}(x|\eta)} - \log E_\eta\left[\frac{\tilde{h}(X|\theta)}{\tilde{h}(X|\eta)}\right]. 
\end{equation*}`

--

This follows from the fact that

`\begin{equation*}
\frac{c(\theta)}{c(\eta)}
= \frac{1}{c(\eta) \int\tilde{h}(x|\theta) dx}
= \left(\int\frac{\tilde{h}(x|\theta)}{\tilde{h}(x|\eta)}
  c(\eta)\tilde{h}(x|\eta)dx\right)^{-1}
= \left(E_\eta\left[\frac{\tilde{h}(X|\theta)}{\tilde{h}(X|\eta)}
  \right]\right)^{-1}.
\end{equation*}`

--

Using a sample `\(X_1, \dots, X_N\)` from `\(h(x|\eta)\)` this can be approximated by

`\begin{equation*}
\log\frac{h(x|\theta)}{h(x|\eta)} \approx
\log\frac{\tilde{h}(x|\theta)}{\tilde{h}(x|\eta)} -
\log \frac{1}{N} \sum_{i=1}^N \frac{\tilde{h}(X_i|\theta)}{\tilde{h}(X_i|\eta)}
\end{equation*}`

--

The sample `\(X_1, \dots, X_N\)` from `\(h(x|\eta)\)` usually needs to be generated using MCMC methods.

---

### Data Augmentation

Suppose we have a problem where data `\(Y, Z\)` have joint density `\(f(y,z|\theta)\)` but we only observe `\(z\)`.

--

Suppose we have a prior density `\(f(\theta)\)`.

--

The joint density of `\(Y,Z,\theta\)` is then

`\begin{equation*}
f(y,z,\theta) = f(y, z|\theta) f(\theta)
\end{equation*}`

--

and the joint posterior density of `\(\theta, y\)` given `\(z\)` is

`\begin{equation*}
f(\theta, y|z) = \frac{f(y, z|\theta) f(\theta)}{f(z)} \propto f(y, z|\theta) f(\theta).
\end{equation*}`

---

Suppose it is easy to sample from the conditional distribution of

--

* the missing data `\(y\)`, given `\(\theta\)` and the observed data `\(z\)`;

--

* the parameter `\(\theta\)` given the complete data `\(y, z\)`.

--

Then we can start with `\(\theta^{(0)}\)`, and for each `\(i=1, 2, \dots\)`

--

* draw `\(y^{(i)}\)` from `\(f(y|\theta^{(i-1)}, z)\)`;

--

* draw `\(\theta^{(i)}\)` from `\(f(\theta|y^{(i)},z)\)`.

--

This is the _data augmentation algorithm_ of Tanner and Wong (1987).

--

The result is a Markov chain with stationary distribution `\(f(\theta,y|z)\)`.

--

If we discard the `\(y\)` values then we have a (dependent) sample from the marginal posterior density `\(f(\theta|z)\)`.

--

In this alternating setting, the marginal sequence `\(\theta^{(i)}\)` is a realization of a Markov chain with invariant distribution `\(f(\theta|z)\)`.

---

### Probit Model for Pesticide Effectiveness

Batches of 20 tobacco budworms were subjected to different doses of a pesticide and the number killed was recorded.

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <tbody>
  <tr>
   <td style="text-align:left;"> Dose </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 16 </td>
   <td style="text-align:right;"> 32 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Died </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 9 </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 18 </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
 </tbody>
</table>

--

A probit model assumes that binary responses `\(Z_i\)` depend on covariate values `\(x_i\)` through the relationship

`\begin{equation*}
Z_i \sim \text{Bernoulli}(\Phi(\alpha+\beta (x_i - \overline{x})))
\end{equation*}`

--

A direct likelihood or Bayesian analysis is possible.
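---

For comparison, a direct maximum likelihood fit is a one-liner with `glm()` (a sketch; the data and the `\(x = \log_2(\text{dose})\)` covariate are the ones introduced below):

```r
## Probit fit by maximum likelihood (a sketch for comparison).
dose <- c(1, 2, 4, 8, 16, 32)
died <- c(1, 4, 9, 13, 18, 20)
x <- log2(dose)
fit <- glm(cbind(died, 20 - died) ~ I(x - mean(x)),
           family = binomial(link = "probit"))
coef(fit)  ## MLEs of alpha and beta
```

The estimates should be close to the posterior means found by data augmentation below, since the prior used there is flat.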
---

An alternative is to assume that there are latent variables `\(Y_i\)` with

`\begin{align*}
Y_i &\sim \text{N}(\alpha + \beta (x_i - \overline{x}), 1)\\
Z_i &= \begin{cases}
  1 & \text{if}\,\, Y_i \ge 0\\
  0 & \text{if}\,\, Y_i < 0
\end{cases}
\end{align*}`

--

For this example assume a flat, improper, prior density.

--

The full data posterior distribution is

`\begin{equation*}
f(\alpha,\beta|y) \propto \exp\left\{- \frac{n}{2} (\alpha - \overline{y})^2
- \frac{\sum(x_i-\overline{x})^2}{2}(\beta - \widehat{\beta})^2\right\}
\end{equation*}`

with

`\begin{equation*}
\widehat{\beta} = \frac{\sum(x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2}
\end{equation*}`

---

So `\(\alpha\)`, `\(\beta\)` are independent given `\(y\)` and `\(x\)`, and

`\begin{align*}
\alpha|y,z &\sim \text{N}(\overline{y}, 1/n)\\
\beta|y,z &\sim \text{N}(\widehat{\beta}, 1/\sum(x_i-\overline{x})^2)
\end{align*}`

--

Given `\(z\)`, `\(\alpha\)`, and `\(\beta\)` the `\(Y_i\)` are conditionally independent, and

`\begin{equation*}
Y_i|z, \alpha, \beta \sim
\begin{cases}
\text{N$(\alpha+\beta (x_i-\overline{x}),1)$ conditioned to be positive} & \text{if}\,\, z_i = 1\\
\text{N$(\alpha+\beta (x_i-\overline{x}),1)$ conditioned to be negative} & \text{if}\,\, z_i = 0
\end{cases}
\end{equation*}`

--

The inverse CDFs are

`\begin{equation*}
F^{-1}(u|z_i,\mu_i) =
\begin{cases}
\mu_i + \Phi^{-1}(\Phi(-\mu_i) + u(1-\Phi(-\mu_i))) & \text{if}\,\, z_i = 1\\
\mu_i + \Phi^{-1}(u\,\Phi(-\mu_i)) & \text{if}\,\, z_i = 0
\end{cases}
\end{equation*}`

with `\(\mu_i = \alpha + \beta(x_i-\overline{x})\)`.

---

.pull-left[
The data:

```r
dose <- c(1, 2, 4, 8, 16, 32)
died <- c(1, 4, 9, 13, 18, 20)
```
{{content}}
]

--

A plot of the proportion killed against dose is curved, but a plot against the logarithm is straight in the middle.

<img src="mcmc_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

--

.pull-right[
So use `\(x = \log_2(\text{dose})\)`.

```r
x <- log2(dose)
```
{{content}}
]

--

We need to generate data for individual cases:

```r
xx <- rep(x - mean(x), each = 20)
z <- unlist(lapply(died, function(x) c(rep(1, x), rep(0, 20 - x))))
```

---

.pull-left[
We need functions to generate from the conditional distributions:

```r
genAlpha <- function(y)
    rnorm(1, mean(y), 1 / sqrt(length(y)))

genBeta <- function(y, xx, sxx2)
    rnorm(1, sum(xx * y) / sxx2, 1 / sqrt(sxx2))

genY <- function(z, mu) {
    p <- pnorm(-mu)
    u <- runif(length(z))
    mu + qnorm(ifelse(z == 1, p + u * (1 - p), u * p))
}
```
]

--

.pull-right[
A function to produce a sample of parameter values by data augmentation is then defined by

```r
da <- function(z, alpha, beta, xx, N) {
    val <- matrix(0, nrow = N, ncol = 2)
    colnames(val) <- c("alpha", "beta")
    sxx2 <- sum(xx^2)
    for (i in 1 : N) {
        y <- genY(z, alpha + beta * xx)
        alpha <- genAlpha(y)
        beta <- genBeta(y, xx, sxx2)
        val[i, 1] <- alpha
        val[i, 2] <- beta
    }
    val
}
```

<!-- **** keeping the ybar and beta-hat values would be useful
for variance reduction.
--> ] --- .pull-left[ Initial values are ```r alpha0 <- qnorm(mean(z)) beta0 <- 0 ``` {{content}} ] -- A run of 10000: ```r v <- da(z, alpha0, beta0, xx, 10000) apply(v, 2, mean) ## alpha beta ## 0.2015720 0.7540646 apply(v, 2, sd) ## alpha beta ## 0.1480918 0.1140951 ``` -- .pull-right[ <img src="mcmc_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> ] --- .pull-left.width-30[ Some diagnostics: ] .pull-right.width-70[ <img src="mcmc_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> ] --- Using a simple AR(1) model, `\begin{align*} \text{SD}(\overline{\alpha}) &\approx \frac{\text{SD}(\alpha|z)}{\sqrt{N}} \sqrt{\frac{1+\rho_\alpha}{1-\rho_\alpha}} = \frac{0.1480918}{100}\sqrt{\frac{1+0.65}{1-0.65}} = 0.0032154\\ \text{SD}(\overline{\beta}) &\approx \frac{\text{SD}(\beta|z)}{\sqrt{N}} \sqrt{\frac{1+\rho_\beta}{1-\rho_\beta}} = \frac{0.1140951}{100}\sqrt{\frac{1+0.8}{1-0.8}} = 0.0034229 \end{align*}` -- Suppose we could sample independently from the posterior distribution. -- The number of independent draws needed to achieve the standard error `\(\text{SD}(\overline{\alpha})\)` is the value of `\(N_E\)` that solves the equation `\begin{equation*} \text{SD}(\overline{\alpha}) = \frac{\text{SD}(\alpha|z)}{\sqrt{N_E}}. \end{equation*}` -- This is the _effective sample size_ `\(\text{ESS}(\overline{\alpha})\)`. --- The approximate effective sample sizes are `\begin{align*} \text{ESS}(\overline{\alpha}) &\approx N \left(\frac{1-\rho_\alpha}{1+\rho_\alpha}\right) = 10000 \left(\frac{1-0.65}{1+0.65}\right) = 2121.212\\ \text{ESS}(\overline{\beta}) &\approx N \left(\frac{1-\rho_\beta}{1+\rho_\beta}\right) = 10000 \left(\frac{1-0.8}{1+0.8}\right) = 1111.111\\ \end{align*}` -- Approximate efficiencies: `\begin{align*} \text{Efficiency}(\overline{\alpha}) = \frac{\text{ESS}(\overline{\alpha})}{N} &\approx \left(\frac{1-\rho_\alpha}{1+\rho_\alpha}\right) = \left(\frac{1-0.65}{1+0.65}\right) = 0.2121\\ \text{Efficiency}(\overline{\beta}) = \frac{\text{ESS}(\overline{\beta})}{N} &\approx \left(\frac{1-\rho_\beta}{1+\rho_\beta}\right) = \left(\frac{1-0.8}{1+0.8}\right) = 0.1111\\ \end{align*}` --- layout: true ## Practical Bayesian Inference --- In Bayesian inference we have -- * a prior density or PMF `\(f(\theta)\)`; -- * a data density or PMF, or likelihood, `\(f(x|\theta)\)`. -- We compute the posterior density or PMF as `\begin{equation*} f(\theta|x) = \frac{f(x|\theta)f(\theta)}{f(x)} \propto f(x|\theta)f(\theta). \end{equation*}` -- At this point, in principle, we are done. --- In practice, if `\(\theta = (\theta_1,\dots,\theta_p)\)` then we want to compute things like -- * the posterior means `\(E[\theta_i|x]\)` and variances `\(\Var(\theta_i|x)\)`; -- * the marginal posterior densities `\(f(\theta_i|x)\)`; -- * posterior probabilities, such as `\(P(\theta_1 > \theta_2|x)\)`. -- These are all integration problems. -- For a few limited likelihood/prior combinations we can compute these integrals analytically. -- For most reasonable likelihood/prior combinations analytic computation is impossible. --- ### Numerical Integration For `\(p = 1\)` -- * we can plot the posterior density; -- * we can compute means and probabilities by numerical integration. -- General one dimensional numerical integration methods like -- * the trapezoidal rule; -- * Simpson's rule; -- * adaptive methods (as in `integrate()` in R). -- often use `\(N \approx 100\)` function evaluations. 
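---

For example, a one dimensional posterior mean can be computed as a ratio of two adaptive quadratures (a sketch with a hypothetical example, not from the original deck):

```r
## Posterior mean by 1-D numerical integration: Binomial(10, p)
## likelihood with 7 successes and a flat prior on p, so the
## posterior is Beta(8, 4) with mean 8 / 12.
post <- function(p) dbinom(7, 10, p)     ## unnormalized posterior
m0 <- integrate(post, 0, 1)$value        ## normalizing constant
m1 <- integrate(function(p) p * post(p), 0, 1)$value
m1 / m0                                  ## approximately 0.6667
```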
---

If `\(p=2\)` we can

--

* plot the joint posterior density;

--

* compute marginal posterior densities by one dimensional numerical integrations and plot them;

--

* compute means and probabilities by iterated one dimensional numerical integration.

--

In general, iterated numerical integration requires `\(N^p\)` function evaluations.

--

If a one dimensional `\(f\)` looks like

`\begin{equation*}
f(x) \approx (\text{ low degree polynomial}) \times (\text{normal density})
\end{equation*}`

then _Gaussian quadrature_ (Monahan, p. 268--271; Givens and Hoeting, Section 5.3) may work with `\(N = 3\)` or `\(N = 4\)`.

--

* This approach is used in Naylor and Smith (1984).

--

* Getting the location and scale of the Gaussian right is critical.

--

* Even `\(3^p\)` gets very large very fast.

---

### Large Sample Approximations

If the sample size `\(n\)` is large, let

`\begin{align*}
\widehat{\theta}& = \text{mode of joint posterior density}\,\, f(\theta|x)\\
H &= - \nabla^2_\theta \; \log f(\widehat{\theta}|x)\\
&= \text{Hessian (matrix of second partial derivatives) of}\,\, -\log f \,\,\text{at}\,\, \widehat{\theta}.
\end{align*}`

--

Then under often reasonable conditions

`\begin{equation*}
f(\theta|x) \approx \text{MVN}_p(\widehat{\theta},H^{-1}).
\end{equation*}`

--

The relative error in the density approximation is generally of order `\(O(n^{-1/2})\)` near the mode.

---

More accurate second order approximations based on Laplace's method are also sometimes available.

--

To approximate the marginal posterior density of `\(\theta_1\)`, compute

`\begin{align*}
\widehat{\theta}_2(\theta_1) &= \text{argmax}_{\theta_2} f(\theta_1,\theta_2|x)\\
H(\theta_1) &= - \nabla^2_{\theta_2} \; \log f(\theta_1,\widehat{\theta}_2(\theta_1)|x).
\end{align*}`

--

Then

`\begin{equation*}
\hat{f}(\theta_1|x) \propto
\left(\det H(\theta_1)\right)^{-1/2} f(\theta_1,\widehat{\theta}_2(\theta_1)|x)
\end{equation*}`

approximates `\(f(\theta_1|x)\)` with a relative error near the mode of order `\(O(n^{-3/2})\)`.

--

The component `\(f(\theta_1,\widehat{\theta}_2(\theta_1)|x)\)` is analogous to the _profile likelihood_.

--

The term `\(\left(\det H(\theta_1)\right)^{-1/2}\)` adjusts for differences in spread in the parameter being maximized out.

---

### Monte Carlo Methods

Early Monte Carlo approaches used importance sampling.

--

* Usually some form of multivariate `\(t\)` is used to get heavy tails and bounded weights.

--

* Guaranteeing bounded weights in high dimensions is very hard.

--

* Even bounded weights may have too much variation to behave well.

--

Tanner and Wong (1987) introduced the idea of Markov chain sampling for data augmentation.

--

Gelfand and Smith (1990) showed that many joint posterior distributions have simple _full conditional distributions_

`\begin{equation*}
f(\theta_i|x, \theta_1,\dots,\theta_{i-1},\theta_{i+1},\dots,\theta_p).
\end{equation*}`

--

These can be sampled using a Markov chain, called a Gibbs sampler.
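---

A minimal sketch of the large sample normal approximation from a few slides back, using `optim()` to find the mode and Hessian numerically (the target here is a hypothetical example, not from the original deck):

```r
## Normal approximation: minimize -log f to get the posterior mode
## and H = Hessian of -log f. Hypothetical target: normal data,
## parameters (mu, log sigma), flat prior.
set.seed(1)
xdat <- rnorm(50, mean = 2, sd = 1.5)
nlogpost <- function(th)
    -sum(dnorm(xdat, th[1], exp(th[2]), log = TRUE))
opt <- optim(c(0, 0), nlogpost, hessian = TRUE)
opt$par             ## approximate posterior mode
solve(opt$hessian)  ## approximate posterior covariance H^{-1}
```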
--- The _systematic scan_ Gibbs sampler starts with some initial values `\(\theta_1^{(0)},\dots,\theta_p^{(0)}\)` and then for each `\(k\)` generates `\begin{align*} \theta_1^{(k+1)} &\sim f(\theta_1|x, \theta_2=\theta_2^{(k)},\dots,\theta_p = \theta_p^{(k)})\\ &\;\;\vdots\\ \theta_i^{(k+1)} &\sim f(\theta_i|x, \theta_1=\theta_1^{(k+1)},\dots, \theta_{i-1}=\theta_{i-1}^{(k+1)}, \theta_{i+1}=\theta_{i+1}^{(k)},\dots, \theta_p = \theta_p^{(k)})\\ &\;\;\vdots\\ \theta_p^{(k+1)} &\sim f(\theta_p|x, \theta_1=\theta_1^{(k+1)},\dots, \theta_{p-1} = \theta_{p-1}^{(k+1)})\\ \end{align*}` -- The _random scan_ Gibbs sampler picks an index `\(i=1,\dots,p\)` at random and updates that component from its full conditional distribution. -- The random scan Gibbs sampler is the approach proposed by Ripley (1979). -- Many other variations are possible. -- All generate a Markov chain `\(\theta^{(0)}, \theta^{(1)}, \theta^{(2)},\dots,\)` with invariant distribution `\(f(\theta|x)\)`. --- ### Example: Pump Failure Data Numbers of failures and times of observation for 10 pumps in a nuclear power plant: |Pump | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |---------|-----:|-----:|-----:|------:|-----:|-----:|----:|----:|-----:|----:| |Failures | 5 | 1 | 5 | 14 | 3 | 19 | 1 | 1 | 4 | 22 | |Time |94.32 |15.72 |62.88 |125.76 | 5.24 |31.44 |1.05 |1.05 |2.10 |10.48 | -- Times are in 1000's of hours. -- Suppose the failures follow Poisson processes with rates `\(\lambda_i\)` for pump `\(i\)`; so the number of failures on pump `\(i\)` is `\(X_i \sim \text{Poisson}(\lambda_i t_i)\)`. -- Suppose the rates `\(\lambda_i\)` are drawn from a `\(\text{Gamma}(\alpha,1/\beta)\)` distribution. -- Assume `\(\alpha = 1.8\)`. <!-- Method of moments, sort of: mean(r)^2*(9/10*sd(r)^2-mean(r)*mean(1/time)) actually gives around 1.67 I think Gelfand & Smith method I think gives 1.52 Gaver & O'M I think use 1.27 More primitive method of moments gives 0.96 --> -- Assume `\(\beta \sim \text{Gamma}(\gamma, 1/\delta)\)` with `\(\gamma = 0.01\)` and `\(\delta = 1\)`. --- The joint posterior distribution of `\(\lambda_1,\dots,\lambda_{10},\beta\)` is .smaller[ `\begin{align*} f(\lambda_1,\dots,\lambda_{10},\beta|t_1,\dots,t_{10},x_1,\dots,x_{10}) &\propto \left(\prod_{i=1}^{10}(\lambda_i t_i)^{x_i} e^{-\lambda_i t_i} \lambda_i^{\alpha-1}\beta^\alpha e^{-\beta\lambda_i}\right) \beta^{\gamma-1}e^{-\delta\beta}\\ &\propto \left(\prod_{i=1}^{10}\lambda_i^{x_i+\alpha-1} e^{-\lambda_i(t_i + \beta)}\right) \beta^{10\alpha + \gamma-1}e^{-\delta\beta}\\ &\propto \left(\prod_{i=1}^{10}\lambda_i^{x_i+\alpha-1} e^{-\lambda_i t_i}\right) \beta^{10\alpha + \gamma-1}e^{-(\delta + \sum_{i=1}^{10}\lambda_i)\beta} \end{align*}` ] -- Full conditionals for `\(\lambda_1, \dots, \lambda_{10}\)`: Conditional on the data and `\(\beta\)`, the `\(\lambda_i\)` are independent `\begin{equation*} \lambda_i | \beta, t_i, x_i \sim \text{Gamma}(x_i+\alpha,(t_i+\beta)^{-1}). \end{equation*}` -- Full conditional for `\(\beta\)`: conditional on the data and the `\(\lambda_i\)`, `\begin{equation*} \beta | \lambda_i, t_i, x_i \sim \text{Gamma}(10\alpha+\gamma,(\delta+\sum \lambda_i)^{-1}). \end{equation*}` --- It is also possible to integrate out the `\(\lambda_i\)` analytically to get `\begin{equation*} f(\beta|t_i,x_i) \propto \left(\prod_{i=1}^{10}(t_i+\beta)^{x_i+\alpha} \right)^{-1} \beta^{10\alpha + \gamma-1}e^{-\delta\beta} \end{equation*}` -- This can be simulated by rejection or RU sampling; the `\(\lambda_i\)` can then be sampled conditionally given `\(\beta\)`. 
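---

A sketch of one simple alternative, using the data and hyper-parameters from the preceding slides: evaluate the unnormalized marginal posterior of `\(\beta\)` on a fine grid and draw from the discretized density with `sample()`.

```r
## Grid-based draws from f(beta | t, x) (a sketch).
fail <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
time <- c(94.32, 15.72, 62.88, 125.76, 5.24,
          31.44, 1.05, 1.05, 2.10, 10.48)
alpha <- 1.8; gamma <- 0.01; delta <- 1
logf <- function(b)
    -colSums((fail + alpha) * log(outer(time, b, `+`))) +
        (10 * alpha + gamma - 1) * log(b) - delta * b
grid <- seq(0.01, 10, length.out = 2000)
lf <- logf(grid)
beta <- sample(grid, 10000, replace = TRUE, prob = exp(lf - max(lf)))
## each lambda_i can then be drawn conditionally given a sampled beta:
lambda <- rgamma(10, fail + alpha, rate = time + beta[1])
```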
---

Suppose `\(\alpha\)` is also unknown and given an exponential prior distribution with mean one.

--

The joint posterior distribution is then

`\begin{equation*}
f(\lambda_1,\dots,\lambda_{10},\beta, \alpha|t_i, x_i) \propto
\left(\prod_{i=1}^{10}\lambda_i^{x_i+\alpha-1} e^{-\lambda_i(t_i + \beta)}\right)
\beta^{10\alpha + \gamma-1}e^{-\delta\beta}
\frac{e^{-\alpha}}{\Gamma(\alpha)^{10}}
\end{equation*}`

--

The full conditional density for `\(\alpha\)` is

`\begin{equation*}
f(\alpha|\beta, \lambda_i, t_i, x_i) \propto
\frac{\left(\beta^{10}e^{-1}\prod_{i=1}^{10}\lambda_i\right)^\alpha}
{\Gamma(\alpha)^{10}}
\end{equation*}`

--

This is not a standard density.

--

This density is log-concave and can be sampled by adaptive rejection sampling.

--

Another option is to use the Metropolis-Hastings algorithm.

---
layout: true

## Metropolis-Hastings Algorithm

---

Introduced in

> N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and
> E. Teller (1953), "Equation of state calculations by fast
> computing machines," _Journal of Chemical Physics_.

--

Extended in

> Hastings (1970), "Monte Carlo sampling methods using Markov chains
> and their applications," _Biometrika_.

---

.smaller[
Suppose we want to sample from a density `\(f\)`.
{{content}}
]

--

We need a family of _proposal distributions_ `\(Q(x,dy)\)` with densities `\(q(x,y)\)`.
{{content}}

--

Suppose a Markov chain with stationary distribution `\(f\)` is currently at `\(X^{(i)} = x\)`.
{{content}}

--

Then
{{content}}

--

* Generate a _proposal_ `\(Y\)` for the new location by drawing from the density `\(q(x,y)\)`.
{{content}}

--

* Accept the proposal with probability
  `$$\alpha(x,y) = \min\left\{\frac{f(y)q(y,x)}{f(x)q(x,y)}, 1\right\}$$`
  and set `\(X^{(i+1)} = Y\)`.
{{content}}

--

* Otherwise, reject the proposal and remain at `\(x\)`, i.e. set `\(X^{(i+1)} = x\)`.
{{content}}

--

The resulting transition densities satisfy the _detailed balance equations_ with respect to `\(f\)` for `\(x \neq y\)`:

`\begin{equation*}
f(x) q(x,y) \alpha(x,y) = f(y) q(y,x) \alpha(y,x)
\end{equation*}`
{{content}}

--

The chain is therefore reversible and has invariant distribution `\(f\)`.

---

### Symmetric Proposals

.pull-left[
Suppose `\(q(x,y) = q(y,x)\)`.
{{content}}
]

--

Then

`\begin{equation*}
\alpha(x,y) = \min\left\{\frac{f(y)}{f(x)},1\right\}
\end{equation*}`
{{content}}

--

So:

* If `\(f(y) \ge f(x)\)` then the proposal is accepted.
{{content}}

--

* If `\(f(y) < f(x)\)` then the proposal is accepted with probability
  `$$\alpha(x,y) = \frac{f(y)}{f(x)} < 1$$`

--

.pull-right[
Symmetric proposals are often used in the _simulated annealing_ optimization method.
{{content}}
]

--

_Symmetric random walk proposals_ with `\(q(x, y) = g(y-x)\)` where `\(g\)` is a symmetric density are often used.
{{content}}

--

Metropolis et al. (1953) considered only the symmetric proposal case.
{{content}}

--

Hastings (1970) introduced the more general approach allowing for non-symmetric proposals.

---

### Independence Proposals

.pull-left[
Suppose `\(q(x,y) = g(y)\)`, independent of `\(x\)`.
{{content}}
]

--

Then

`\begin{equation*}
\alpha(x,y) = \min\left\{\frac{f(y)g(x)}{f(x)g(y)}, 1\right\}
= \min\left\{\frac{w(y)}{w(x)}, 1\right\}
\end{equation*}`

with `\(w(x) = f(x)/g(x)\)`.

--

.pull-right[
This is related to importance sampling:
{{content}}
]

--

* If a proposal `\(y\)` satisfies `\(w(y) \ge w(x)\)` then the proposal is accepted.
{{content}}

--

* If `\(w(y) < w(x)\)` then the proposal may be rejected.
{{content}}

--

* If the weight `\(w(x)\)` at the current location is very large, meaning `\(g(x)\)` is very small compared to `\(f(x)\)`, then the chain will remain at `\(x\)` for many steps to compensate.

---

### Metropolized Rejection Sampling

Suppose `\(h(x)\)` is a possible envelope for `\(f\)` with `\(\int h(x) dx < \infty\)`.

--

Suppose we sample

--

* `\(Y\)` from `\(h\)`
* `\(U\)` uniformly on `\([0,h(Y)]\)`

until `\(U < f(Y)\)`.

--

Then the resulting `\(Y\)` has density `\(g(y) \propto \min(h(y),f(y))\)`.

---

Using `\(g\)` as a proposal distribution, the Metropolis acceptance probability is

.small[
`\begin{align*}
\alpha(x,y)
&=\min\left\{\frac{f(y)\min(h(x),f(x))}{f(x)\min(h(y),f(y))}, 1\right\}\\
&= \min\left\{\frac{\min(h(x)/f(x), 1)}{\min(h(y)/f(y),1)},1\right\}\\
&= \begin{cases}
  1 & \text{if}\,\, f(x) \le h(x)\\
  h(x)/f(x) & \text{if}\,\, f(x) > h(x) \,\,\text{and}\,\, f(y) \le h(y)\\
  \min\left\{\frac{f(y)h(x)}{f(x)h(y)},1\right\} &\text{otherwise}
\end{cases}\\
&= \min\left\{\frac{h(x)}{f(x)}, 1\right\}
   \min\left\{\max\left\{\frac{f(y)}{h(y)},1\right\},
   \max\left\{\frac{f(x)}{h(x)},1\right\}\right\}\\
& \ge \min\left\{\frac{h(x)}{f(x)}, 1\right\}
\end{align*}`
]

--

If `\(h\)` is in fact an envelope for `\(f\)`, then `\(\alpha(x,y) \equiv 1\)` and the algorithm produces independent draws from `\(f\)`.

--

If `\(h\)` is not an envelope, then the algorithm occasionally rejects proposals when the chain is at points `\(x\)` where the envelope fails to hold.

--

The dependence can be very mild if the failure is mild; it can be very strong if the failure is significant.

---

### Metropolis-Within-Gibbs

Suppose `\(f(x) = f(x_1, \dots, x_p)\)`.

--

The Metropolis-Hastings algorithm can be used on the entire vector `\(x\)`.

--

The Metropolis-Hastings algorithm can also be applied to one component of a vector using the full conditional density

`\begin{equation*}
f(x_i|x_1,\dots,x_{i-1},x_{i+1},\dots,x_p)
\end{equation*}`

as the target density.

--

This approach is sometimes called _Metropolis-within-Gibbs_.

--

This is a misnomer: this is what Metropolis et al. (1953) did to sample from the equilibrium distribution of `\(n\)` gas molecules.

---

### Example: Pump Failure Data

.pull-left[
Suppose `\(\alpha\)` has an exponential prior distribution with mean 1.
{{content}}
]

--

The full conditional density for `\(\alpha\)` is

`\begin{equation*}
f(\alpha|\beta, \lambda_i, t_i, x_i) \propto
\frac{\left(\beta^{10}e^{-1}\prod_{i=1}^{10}\lambda_i\right)^\alpha}
{\Gamma(\alpha)^{10}}
\end{equation*}`

--

.pull-right[
To use a random walk proposal it is useful to make the support of the distribution be the whole real line.
{{content}}
]

--

The full conditional density of `\(\log \alpha\)` is

`\begin{equation*}
f(\log \alpha|\beta, \lambda_i, t_i, x_i) \propto
\frac{\left(\beta^{10}e^{-1}\prod_{i=1}^{10}\lambda_i\right)^\alpha\alpha}
{\Gamma(\alpha)^{10}}
\end{equation*}`
{{content}}

--

Using a normal random walk proposal requires choosing a standard deviation; 0.7 seems to work reasonably well.
{{content}}

--

We can use a single Metropolis step or several.
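---

As a minimal generic sketch of the algorithm (not tied to the pump example), a single random walk Metropolis step for a log target density `logf` might look like this; `mhStep` is a hypothetical helper:

```r
## One random walk Metropolis step on the log scale.
mhStep <- function(x, logf, d) {
    y <- rnorm(1, x, d)    ## symmetric normal proposal
    ## accept with probability min(f(y) / f(x), 1)
    if (log(runif(1)) < logf(y) - logf(x)) y else x
}

## e.g. sampling a standard normal target:
out <- numeric(1000)
x <- 0
for (i in seq_along(out))
    out[i] <- x <- mhStep(x, function(z) -z^2 / 2, d = 1)
```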
--- .pull-left[ ### Gibbs Sampler in R for Pump Data {{content}} ] -- The data: ```r fail <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22) time <- c(94.32, 15.72, 62.88, 125.76, 5.24, 31.44, 1.05, 1.05, 2.10, 10.48) ``` {{content}} -- Hyper-parameters for the prior distribution: ```r alpha <- 1.8 gamma <- 0.01 delta <- 1 ``` {{content}} -- Generator for the full conditional for `\(\lambda_i, i = 1, \dots, 10\)`: ```r genLambda <- function(alpha, beta) rgamma(10, fail + alpha, rate = time + beta) ``` -- .pull-right[ Generator for the full conditional for `\(\beta\)`: ```r genBeta <- function(alpha, lambda) rgamma(1, 10 * alpha + gamma, rate = delta + sum(lambda)) ``` {{content}} ] -- Metropolis-Hastings sampler for the full conditional for `\(\alpha\)` (for `\(K = 0\)` the value of `\(\alpha\)` is unchanged): ```r genAlpha <- function(alpha, beta, lambda, K, d) { b <- (10 * log(beta) + sum(log(lambda)) - 1) for (j in seq_len(K)) { newAlpha <- exp(rnorm(1, log(alpha), d)) logDensRatio <- ((newAlpha - alpha) * b + log(newAlpha) - log(alpha) + 10 * (lgamma(alpha) - lgamma(newAlpha))) if (is.finite(logDensRatio) && log(runif(1)) < logDensRatio) alpha <- newAlpha } alpha } ``` --- .pull-left[ Driver for running the sampler: ```r pump <- function(alpha, beta, N, d = 1, K = 1) { v <- matrix(0, ncol = 12, nrow = N) for (i in 1:N) { lambda <- genLambda(alpha, beta) beta <- genBeta(alpha, lambda) alpha <- genAlpha(alpha, beta, lambda, K, d) v[i, 1:10] <- lambda v[i, 11] <- beta v[i, 12] <- alpha } v } ``` {{content}} ] -- Run the sampler: ```r v0 <- pump(alpha, beta = 1, 10000, K = 0) v1 <- pump(alpha, beta = 1, 10000, K = 1) v10 <- pump(alpha, beta = 1, 10000, K = 10) ``` -- .pull-right[ Marginal densities for `\(\beta\)` and joint marginal densities for `\(\alpha\)` and `\(\beta\)`: <img src="mcmc_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" /> ] --- .pull-left.width-40[ Auto-correlation functions for `\(\beta\)` ] .pull-right.width-60[ <img src="mcmc_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" /> ] --- As an aside, it is possible in this case to integrate out the `\(\lambda_i\)` to obtain the joint marginal posterior density of `\(\alpha\)` and `\(\beta\)` as `\begin{equation*} f(\alpha, \beta | t_i, x_i) \propto \frac{\prod_{i=1}^{10} \Gamma(x_i + \alpha)} {\Gamma(\alpha)^{10}\prod_{i=1}^{10} (t_i + \beta)^{x_i + \alpha}} \beta^{10 \alpha + \gamma - 1} e^{-\delta\beta}e^{-\alpha} \end{equation*}` -- Plots of this joint distribution match the results of the simulation: <img src="mcmc_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" /> --- layout: true ## Markov Chain Theory: Discrete State Space --- A sequence of random variables `\(X_0, X_1, X_2, \dots\)` with values in a finite or countable set `\(E\)` is a Markov chain if for all `\(n\)` `\begin{equation*} P(X_{n+1} = j|X_n=i, X_{n-1}, X_{n-2},\dots, X_0) = P(X_{n+1}=j|X_n=i) \end{equation*}` -- i.e. given the present, the future and the past are independent. -- A Markov chain is _time homogeneous_ if `\begin{equation*} P(i,j) = P(X_{n+1} = j|X_n=i) \end{equation*}` does not depend on `\(n\)`. -- `\(P(i,j)\)` is the _transition matrix_ of the Markov chain. -- A transition matrix satisfies `\begin{equation*} \sum_{j \in E} P(i,j) = 1 \end{equation*}` for all `\(i \in E\)`. --- .pull-left[ The distribution of `\(X_0\)` is called the _initial distribution_ of a Markov chain. 
{{content}}
]

--

The `\(n\)` step transition matrix is

`\begin{equation*}
P^n(i,j) = P(X_n=j|X_0=i)
\end{equation*}`
{{content}}

--

The `\(n+m\)` step transition matrix satisfies

`\begin{equation*}
P^{n+m} = P^n P^m
\end{equation*}`
{{content}}

--

That is,

`\begin{equation*}
P^{n+m}(i,j) = \sum_{k \in E} P^n(i,k)P^m(k,j)
\end{equation*}`

for all `\(i, j \in E\)`.

--

.pull-right[
A distribution `\(\pi\)` is an _invariant distribution_ or a _stationary distribution_ for a Markov transition matrix `\(P\)` if

`\begin{equation*}
\pi(j) = \sum_{i \in E} \pi(i) P(i,j)
\end{equation*}`

for all `\(j\)`.
{{content}}
]

--

These equations are sometimes called the _flow balance equations_.

---

### Reversibility

.pull-left[
A transition matrix is _reversible_ with respect to a distribution `\(\pi\)` if

`\begin{equation*}
\pi(i) P(i,j) = \pi(j) P(j,i)
\end{equation*}`

for every pair `\(i,j\)`.
{{content}}
]

--

These equations are called the _detailed balance_ equations.
{{content}}

--

If `\(P\)` is reversible with respect to `\(\pi\)`, then `\(\pi\)` is a stationary distribution for `\(P\)`:

`\begin{align*}
\sum_{i \in E} \pi(i) P(i,j) &= \sum_{i \in E} \pi(j) P(j,i)\\
&= \pi(j) \sum_{i \in E} P(j,i) = \pi(j)
\end{align*}`

--

.pull-right[
If `\(P\)` is reversible with respect to `\(\pi\)` and `\(X_0\)` has initial distribution `\(\pi\)`, then the vectors

`\begin{equation*}
(X_0, X_1, X_2, \dots, X_{n-2}, X_{n-1}, X_n)
\end{equation*}`

and

`\begin{equation*}
(X_n, X_{n-1}, X_{n-2}, \dots, X_2, X_1, X_0)
\end{equation*}`

have the same joint distributions.
{{content}}
]

--

Reversible transition matrices have many nice properties, including real eigenvalues.

---

### Convergence

A Markov chain is irreducible if for each pair of states `\(i,j\)` there is an integer `\(n\)` such that

`\begin{equation*}
P^n(i,j) > 0
\end{equation*}`

--

i.e. if each state can be reached with positive probability from any other state.

--

The time to reach state `\(i\)` is

`\begin{equation*}
\tau_i = \min\{n \ge 1: X_n = i\}
\end{equation*}`

with `\(\tau_i = \infty\)` if `\(X_n \neq i\)` for all `\(n\)`.

---

An irreducible Markov chain falls into one of three categories:

--

* _Transient:_ `\(P(\tau_i = \infty|X_0=j) > 0\)` for all `\(i,j \in E\)`.

--

* _Null recurrent:_ `\(P(\tau_i = \infty|X_0=j) = 0\)` and `\(E[\tau_i|X_0=j] = \infty\)` for all `\(i,j \in E\)`.

--

* _Positive recurrent:_ `\(P(\tau_i = \infty|X_0=j) = 0\)` and `\(E[\tau_i|X_0=j] < \infty\)` for all `\(i,j \in E\)`.

--

If an irreducible Markov chain is transient or null recurrent then it does not have a proper invariant distribution.

---

If an irreducible Markov chain is positive recurrent, then

--

* it has a unique invariant distribution `\(\pi\)`;

--

* for any `\(i \in E\)`
  `$$\frac{1}{n} \sum_{k=1}^n P^k(i,j) \to \pi(j);$$`

--

* if `\(h\)` is a real-valued function on `\(E\)` such that `\(\sum_{i \in E} |h(i)| \pi(i) < \infty\)` then almost surely
  `$$\frac{1}{n}\sum_{k=1}^n h(X_k) \to \pi h = \sum_{i \in E} h(i)\pi(i);$$`

--

* if the chain is also aperiodic, then
  `$$P^n(i, j) \to \pi(j)$$`
  for all `\(i, j \in E\)`. In this case `\(\pi\)` is an _equilibrium distribution_ of the chain.
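---

A small numerical illustration of these ideas (a sketch, not from the original deck): for a two-state chain the flow balance equations can be solved directly, and powers of `\(P\)` converge to a matrix with both rows equal to `\(\pi\)`.

```r
## A two-state transition matrix and its invariant distribution.
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), 2, 2, byrow = TRUE)
p <- c(2, 1) / 3       ## solves pi = pi P (and detailed balance)
p %*% P                ## equals p, so p is invariant
Pn <- diag(2)
for (k in 1:100) Pn <- Pn %*% P
Pn                     ## both rows are now close to p
```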
---

### Different Points of View

--

.pull-left[
In applied probability problems we usually
{{content}}
]

--

* verify that a chain is irreducible;
{{content}}

--

* verify that a chain is positive recurrent;
{{content}}

--

* conclude that a unique stationary distribution exists;
{{content}}

--

* compute the unique stationary distribution `\(\pi\)`;
{{content}}

--

* verify that the chain is aperiodic;
{{content}}

--

* use `\(\pi\)` to approximate the distribution of `\(X_n\)`.

--

.pull-right[
In Markov chain Monte Carlo
{{content}}
]

--

* we know by construction that a proper stationary distribution `\(\pi\)` exists;
{{content}}

--

* we verify that the chain is irreducible;
{{content}}

--

* we conclude that the chain must be positive recurrent (since it cannot be transient or null recurrent) and therefore `\(\pi\)` is the unique stationary distribution;
{{content}}

--

* we approximate expectations under `\(\pi\)` by sample path averages;
{{content}}

--

* aperiodicity is usually not important.

---
layout: true

## Markov Chain Theory: General State Spaces

---

.pull-left[
Let `\(E\)` be an arbitrary set and `\(\mathcal{E}\)` a countably generated sigma-algebra on `\(E\)`.
{{content}}
]

--

A sequence of `\((E,\mathcal{E})\)`-valued random variables is a time homogeneous Markov chain with transition kernel `\(P(x,dy)\)` if

`\begin{equation*}
P(X_{n+1} \in A|X_n, X_{n-1},\dots,X_0) = P(X_n, A)
\end{equation*}`

for all `\(A \in \mathcal{E}\)`.
{{content}}

--

A Markov transition kernel is a function `\(P(\cdot,\cdot)\)` such that
{{content}}

--

* `\(P(x,\cdot)\)` is a probability on `\((E,\mathcal{E})\)` for each `\(x \in E\)`.
{{content}}

--

* `\(P(\cdot,A)\)` is an `\(\mathcal{E}\)`-measurable function for each `\(A \in \mathcal{E}\)`.

--

.pull-right[
The `\(n\)`-step transition kernel of a Markov chain is

`\begin{equation*}
P^n(x,A) = P(X_n \in A| X_0 = x)
\end{equation*}`
{{content}}
]

--

and satisfies

`\begin{equation*}
P^{n+m}(x,A) = \int P^n(x,dy)P^m(y,A)
\end{equation*}`

for all `\(x \in E\)` and all `\(A \in \mathcal{E}\)`.
{{content}}

--

A distribution `\(\pi\)` is invariant for a Markov transition kernel `\(P\)` if

`\begin{equation*}
\pi(A) = \int \pi(dy) P(y, A)
\end{equation*}`

for all `\(A \in \mathcal{E}\)`.

---

### Reversibility

A transition kernel `\(P\)` is reversible with respect to a distribution `\(\pi\)` if

`\begin{equation*}
\pi(dx)P(x,dy) = \pi(dy)P(y,dx)
\end{equation*}`

--

i.e. these two bivariate distributions must be identical.

--

If `\(P\)` is reversible with respect to `\(\pi\)` then `\(P\)` is invariant with respect to `\(\pi\)`:

.smaller[
`\begin{align*}
\int_{x \in E} \pi(dx) P(x,A)
&= \int_{x \in E}\int_{y \in A} \pi(dx) P(x,dy)\\
&= \int_{x \in E}\int_{y \in A} \pi(dy) P(y,dx)\\
&= \int_{y \in A} \pi(dy) \int_{x \in E}P(y,dx)\\
&= \int_{y \in A} \pi(dy)\\
&= \pi(A)
\end{align*}`
]

---

### Convergence

A Markov chain with transition kernel `\(P\)` is irreducible with respect to a sigma-finite measure `\(\nu\)` if for every `\(x \in E\)` and every `\(A \in \mathcal{E}\)` with `\(\nu(A) > 0\)` there exists an integer `\(n\)` such that `\(P^n(x, A) > 0\)`.

--

A Markov chain is irreducible if it is irreducible with respect to `\(\nu\)` for some sigma-finite `\(\nu\)`.

--

The standard definition of irreducibility for discrete state spaces corresponds to irreducibility with respect to counting measure for general state spaces.

--

An irreducible Markov chain is either transient, null recurrent, or positive recurrent.
-- An irreducible Markov chain is positive recurrent if and only if it has a proper stationary distribution `\(\pi\)`. -- Essentially all Markov chains used in MCMC that are recurrent are also _Harris recurrent_. --- .smaller[ If an irreducible Markov chain is positive recurrent, then: {{content}} ] -- * It has a unique stationary distribution `\(\pi\)`. {{content}} -- * If the chain is Harris recurrent, then `$$\sup_{A \in \mathcal{E}} \left|\frac{1}{n}\sum_{k=1}^nP^k(x,A)-\pi(A)\right| \to 0$$` for all `\(x\)`. {{content}} -- * If the chain is Harris recurrent, `\(h\)` is real-valued, `\(\mathcal{E}\)`-measurable, and `\(\pi|h| = \int |h(x)|\pi(dx) < \infty\)`, then for any initial distribution almost surely `$$\frac{1}{n} \sum_{k=1}^n h(X_k) \to \pi h = \int h(x) \pi(dx)$$`. {{content}} -- * If the chain is Harris recurrent and aperiodic, then `\(\sup_{A \in \mathcal{E}}|P^n(x,A)-\pi(A)| \to 0\)` for all `\(x\)`. --- ### Rates of Convergence A Markov chain is _geometrically ergodic_ if there exists a nonnegative function `\(M(x)\)` with `\(\pi M < \infty\)` and a constant `\(\lambda < 1\)` such that `\begin{equation*} \sup_{A \in \mathcal{E}} |P^n(x,A)-\pi(A)| \le M(x)\lambda^n \end{equation*}` for all `\(x \in E\)` and all integers `\(n \ge 1\)`. -- A Markov chain is _uniformly ergodic_ if there exists a finite constant `\(M\)` and a constant `\(\lambda < 1\)` such that `\begin{equation*} \sup_{A \in \mathcal{E}} |P^n(x,A)-\pi(A)| \le M\lambda^n \end{equation*}` for all `\(x \in E\)` and all integers `\(n \ge 1\)`. -- Many MCMC samplers are geometrically ergodic; relatively few are uniformly ergodic. -- Restricting parameters to a compact set can often make a chain uniformly ergodic. --- ### Central Limit Theorems Suppose `\(\int h(x)^2 \pi(dx) < \infty\)` and let `\begin{equation*} \overline{h}_n = \frac{1}{n}\sum_{k=1}^nh(X_k) \end{equation*}` -- Let `\(\tau_n = n \Var_\pi(\overline{h}_n)\)` and let `\begin{equation*} \tau = \lim_{n \to \infty} \tau_n = \Var_\pi(h(X_0)) + 2 \sum_{k=1}^\infty \Cov_\pi(h(X_k),h(X_0)) \end{equation*}` if the limit exists. --- Suppose the Markov chain is uniformly ergodic. Then the limit `\(\tau\)` exists, is finite, and `\(\sqrt{n}(\overline{h}-\pi h)\)` converges in distribution to a `\(\text{N}(0,\tau)\)` random variable. -- Suppose the Markov chain is Harris recurrent and geometrically ergodic, and that `\(\int |h(x)|^{2+\epsilon}\pi(dx) < \infty\)` for some `\(\epsilon > 0\)`. Then the limit `\(\tau\)` exists, is finite, and `\(\sqrt{n}(\overline{h}-\pi h)\)` converges in distribution to a `\(\text{N}(0,\tau)\)` random variable. -- If the Markov chain is reversible, Harris recurrent, and geometrically ergodic then `\(\pi(h^2) < \infty\)` is sufficient for a CLT. -- Suppose the Markov chain is reversible. Then the limit `\(\tau\)` exists but may be infinite. If the limit is finite, then `\(\sqrt{n}(\overline{h}-\pi h)\)` converges in distribution to a `\(\text{N}(0,\tau)\)` random variable. --- The asymptotic variance `\(\tau\)` can be written as `\begin{equation*} \tau = \Var_\pi(h(X_0))\left[1+2 \sum_{k=1}^\infty \rho_k(h)\right] \end{equation*}` with `\begin{equation*} \rho_k(h) = \text{Corr}_\pi(h(X_k),h(X_0)) \end{equation*}` -- To use a central limit theorem we need to -- * be confident that it is valid; -- * be able to estimate `\(\tau\)`. 
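---

For the AR(1) example at the beginning of these notes the asymptotic variance is available in closed form, which makes a quick sanity check possible (a sketch, not from the original deck):

```r
## For X_i = rho^2 X_{i-1} + eps_i with Var(eps) = 1 - rho^4 the
## marginal variance is 1 and tau = (1 + rho^2) / (1 - rho^2).
r <- 0.75
tau <- (1 + r^2) / (1 - r^2)      ## about 3.57
xbar <- replicate(2000, mean(arima.sim(list(ar = r^2), 1000,
                                       sd = sqrt(1 - r^4))))
c(1000 * var(xbar), tau)          ## should roughly agree
```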
---

### Summary of Markov Chain Theory

Suppose `\(X_1, X_2, \dots\)` is a Markov chain on a state space `\(E\)` with invariant distribution `\(\pi\)` and `\(h\)` is a function such that `\(\int h(x)^2 \pi(dx) < \infty\)`.

--

.pull-left[
_Law of large numbers:_ If the chain is irreducible, i.e. can get from any initial point to, or close to, any other point in `\(E\)`, then `\(\pi\)` is an equilibrium distribution and almost surely

`\begin{align*}
\overline{h}_n &= \frac{1}{n}\sum_{i=1}^n h(X_i)\\
& \to \pi h = \int h(x) \pi(dx).
\end{align*}`
]

--

.pull-right[
_Central limit theorem:_ Under reasonable conditions, `\(\sqrt{n}(\overline{h}_n - \pi h)\)` converges in distribution to a `\(\text{N}(0,\tau)\)` random variable with

`$$\tau = \Var_\pi(h(X_0))\left[1+2 \sum_{k=1}^\infty \rho_k(h)\right]$$`

with `\(\rho_k(h) = \text{Corr}_\pi(h(X_k),h(X_0))\)` and `\(X_0 \sim \pi\)`.
{{content}}
]

--

To use the central limit theorem we need to be able to estimate `\(\tau\)`.

---
layout: true

## Output Analysis

---

.pull-left[
Simulation output analysis deals mainly with
{{content}}
]

--

* estimation using simulation output;
{{content}}

--

* estimating variances of simulation estimators;
{{content}}

--

* assessing convergence, initialization bias, initial transients;
{{content}}

--

based on data produced by simulations.
{{content}}

--

General characteristics of such data:
{{content}}

--

* Simulation run lengths are often very long.
{{content}}

--

* There is usually dependence within runs.
{{content}}

--

* Usually runs settle down to some form of stationarity.

--

.pull-right[
Software:
{{content}}
]

--

* CODA (Best, Cowles, and Vines)
  * Developed for S-PLUS, mainly for BUGS output.
  * R port by Martyn Plummer; available on our Linux systems.

<!--
  * Documentation: http://www.mrc-bsu.cam.ac.uk/bugs/documentation/contents.shtml
-->
{{content}}

--

* BOA (B. Smith)
  * Major revision of CODA.
  * Provides more methods.
  * Available as an R package from CRAN and on our Linux systems.

---

### Simulation Estimators

.pull-left[
Suppose a simulation produces values `\(X_1, \dots, X_N\)` and

`\begin{equation*}
\overline{X}_N = \frac{1}{N}\sum_{i=1}^N X_i \to \theta
\end{equation*}`
{{content}}
]

--

Usually we will estimate `\(\theta\)` by `\(\overline{X}\)`.
{{content}}

--

If the process `\(X_1, X_2, \dots\)` is stationary then usually

`\begin{equation*}
\theta = E[X_i].
\end{equation*}`
{{content}}

--

Otherwise we usually have `\(E[\overline{X}_N] \to \theta\)` and often also `\(E[X_i] \to \theta\)`.

--

.pull-right[
In some cases we may be able to find a function `\(g\)` such that

`\begin{equation*}
\frac{1}{N} \sum_{i=1}^N g(X_i) \to \theta.
\end{equation*}`
{{content}}
]

--

Rao-Blackwellization is one approach that may produce such a function `\(g\)`.
{{content}}

--

Often a Rao-Blackwellized estimator will have reduced variance, but this is _not_ guaranteed with dependent `\(X_i\)`.
{{content}}

--

For the rest of this discussion, assume that `\(X_i\)` incorporates any such `\(g\)`.

---

### Variance Estimation

We often have, or hope to have,

`\begin{equation*}
\overline{X}_N \sim \text{AN}(\theta, \tau_N/N)
\end{equation*}`

--

where

`\begin{equation*}
\tau_N = N \Var(\overline{X}_N)
= \frac{1}{N} \sum_{i=1}^{N}\sum_{j=1}^{N} \text{Cov}(X_i,X_j)
\end{equation*}`

--

If the process `\(X_1, \dots, X_N\)` is stationary, which is usually the case in the limit, then

`\begin{align*}
\sigma^2 &= \text{Var}(X_i)\\
\rho_k &= \text{Corr}(X_i,X_{i+k})
\end{align*}`

do not depend on `\(i\)`.
The value `\(\rho_k\)` is the _lag `\(k\)` autocorrelation_ of `\(X_1, X_2, \dots\)`. --- .pull-left[ For a stationary process `\begin{equation*} \tau_N = \sigma^2 \left(1 + 2\sum_{k=1}^{N-1}\left(1-\frac{k}{N}\right)\rho_k\right) \end{equation*}` {{content}} ] -- Typically, `\begin{equation*} \tau_N \to \tau = \sigma^2\left(1 + 2\sum_{k=1}^\infty\rho_k\right) = \sigma^2 \sum_{k=-\infty}^\infty \rho_k \end{equation*}` -- .pull-right[ Several methods are available for estimating `\(\tau\)`, including {{content}} ] -- * modeling the time series and using the estimated autocorrelations; {{content}} -- * estimating the spectral density at zero; {{content}} -- * batching; {{content}} -- * combinations; {{content}} -- * regenerative simulation; {{content}} -- * replication. --- ### Time Series Models .pull-left[ We can fit an `\(\text{ARMA}(p,q)\)` model of the form `\begin{equation*} (X_i-\theta) = \sum_{j=1}^p\alpha_j(X_{i-j}-\theta) + \sum_{j=1}^q\beta_j\epsilon_{i-j} + \epsilon_i \end{equation*}` with the `\(\epsilon_i\)` independent `\(\text{N}(0,\sigma^2_{\epsilon})\)`. {{content}} ] -- Then `\begin{equation*} \tau = \sigma_\epsilon^2 \frac{\left(1+\sum_{j=1}^q\beta_j\right)^2} {\left(1-\sum_{j=1}^p\alpha_j\right)^2} \end{equation*}` {{content}} -- `\(\tau\)` can be estimated by plugging in estimates of `\(\alpha_j\)`, `\(\beta_j\)`, and `\(\sigma_\epsilon^2\)`. -- .pull-right[ For the `\(\text{AR}(1)\)` model `\(\sigma^2 = \sigma_\epsilon^2/(1-\alpha_1^2)\)` and `\(\rho_1 = \alpha_1\)`; so `\begin{align*} \tau &= \sigma_\epsilon^2 \frac{1}{(1 - \alpha_1)^2} = \sigma^2\frac{1-\alpha_1^2}{(1-\alpha_1)^2}\\ &= \sigma^2 \frac{1+\alpha_1}{1-\alpha_1} = \sigma^2 \frac{1+\rho_1}{1-\rho_1} \end{align*}` {{content}} ] -- An estimate is `\begin{equation*} \widehat{\tau} = S^2 \frac{1+r_1}{1-r_1} \end{equation*}` with `\(S^2\)` the sample variance and `\(r_1\)` the lag one sample autocorrelation. --- .pull-left[ ### Spectral Density at the Origin The autocorrelation function satisfies `\begin{equation*} \sigma^2 \rho_k = 2 \int_0^{\pi}\cos(k\omega) f(\omega) d\omega \end{equation*}` {{content}} ] -- where `\begin{equation*} f(\omega) = \frac{\sigma^2}{2\pi}\sum_{k=-\infty}^\infty \rho_k\cos(k\omega) \end{equation*}` is the _spectral density_. {{content}} -- The spectral density is sometimes defined as a function on `\([0,1/2)\)`. -- .pull-right[ The spectral density at zero is related to `\(\tau\)` as `\begin{equation*} \tau = 2 \pi f(0). \end{equation*}` {{content}} ] -- Spectral densities are usually estimated by smoothing the _periodogram_, the Fourier transform of the sample autocovariance function. {{content}} -- Smoothing flattens peaks, and there is typically a peak at zero. {{content}} -- A number of methods are available for dealing with this. {{content}} -- The CODA function `spectrum0` computes `\(\tau\)` by some of these methods. {{content}} -- It is usually a good idea to make `spectrum0` use batching by specifying a value for `max.length`. --- .pull-left[ ### Batching and Replication If we replicate sampler runs independently `\(R\)` times, then we have `\(R\)` independent sample averages and can use their sample standard deviation in computing a standard error for the overall average. {{content}} ] -- Because of concerns about convergence we usually run only relatively few long chains; this does not provide enough degrees of freedom by itself. 
{{content}} -- Batching is a form of within chain replication: -- .pull-right[ * Suppose `\(N = K M\)` and for `\(i = 1, \dots, K\)` let `$$\overline{X}_{M, i} = \frac{1}{M}\sum_{j=(i-1)M+1}^{iM} X_j.$$` Then `\(\overline{X}_{M,1}, \dots, \overline{X}_{M,K}\)` are means of successive batches of size `\(M\)`. {{content}} ] -- * The overall mean is the mean of the batch means, `$$\overline{X}_N = \frac{1}{N}\sum_{i=1}^N X_i = \frac{1}{K}\sum_{i=1}^K\overline{X}_{M,i}$$` {{content}} -- * If the batches are large enough, then the batch means will be approximately independent and normal, so `\(t\)` confidence intervals can be used. --- An estimate of `\(\tau\)` based on assuming independent batch means is `\begin{equation*} \widehat{\tau} = \frac{M}{K-1}\sum_{i=1}^{K}(\overline{X}_{M,i}-\overline{X})^2 \end{equation*}` -- An alternative: -- * Choose a batch size so that an AR(1) model fits. -- * Estimate `\(\tau\)` assuming the batch means follow an AR(1) model. -- Batching and replication can be combined. --- ### Effective Sample Size and Sampler Efficiency .pull-left[ If the sequence `\(X_1, \dots, X_N\)` were a random sample from a distribution `\(\pi\)`, then we would have $$ \Var(\overline{X}_N) = \frac{\sigma^2}{N}. $$ {{content}} ] -- With dependent sampling the variance is `\begin{equation*} \Var(\overline{X}_N) \approx \frac{\tau}{N} = \frac{\sigma^2}{N} \sum_{k=-\infty}^\infty \rho_k \end{equation*}` -- .pull-right[ So a sample of size `\(N\)` from a sampler with dependence is equivalent to a sample of `\begin{equation*} N_E = N \frac{\sigma^2}{\tau} = N \left(\sum_{k=-\infty}^\infty \rho_k\right)^{-1} \end{equation*}` independent observations. {{content}} ] -- `\(N_E\)` is sometimes called the _effective sample size_. --- .pull-left[ By analogy to estimation theory the value `\begin{equation*} \frac{N_E}{N} = \left(\sum_{k=-\infty}^\infty \rho_k\right)^{-1} \end{equation*}` is sometimes called the _asymptotic relative efficiency_, or just the efficiency, of the sampler. {{content}} ] -- Thinking about the equivalent number of independent observations is often useful. {{content}} -- Efficiencies need to be treated with caution: {{content}} -- If sampler A is half as efficient but ten times as fast as sampler B, then sampler A is clearly better. -- .pull-right[ In the physics literature the quantity `\begin{equation*} \mathcal{T}_\text{int} = \sum_{k=-\infty}^\infty \rho_k \end{equation*}` is called the _integrated autocorrelation time_. ] --- .pull-left[ ### Example: Pump Data Generate a run of 20000 with ```r v <- pump(1.8, 1, 20000) colnames(v) <- c(paste("lambda", 1:10, sep = ""), "beta", "alpha") ``` Using CODA we can get a summary: ] .pull-right.tiny-code[ <!-- Use this to truncate output instead of dropping quantiles: https://community.rstudio.com/t/showing-only-the-first-few-lines-of-the-results-of-a-code-chunk/6963 --> ```r library(coda) summary(mcmc(v), quantiles = NULL) ## ## Iterations = 1:20000 ## Thinning interval = 1 ## Number of chains = 1 ## Sample size per chain = 20000 ## ## 1.
Empirical mean and standard deviation for each variable, ## plus standard error of the mean: ## ## Mean SD Naive SE Time-series SE ## lambda1 0.05976 0.02522 0.0001784 0.0001819 ## lambda2 0.10128 0.07900 0.0005586 0.0006320 ## lambda3 0.08912 0.03730 0.0002637 0.0002637 ## lambda4 0.11640 0.03045 0.0002153 0.0002153 ## lambda5 0.60165 0.31592 0.0022339 0.0022773 ## lambda6 0.60921 0.13782 0.0009745 0.0009745 ## lambda7 0.90969 0.73397 0.0051900 0.0053375 ## lambda8 0.90864 0.74207 0.0052473 0.0056124 ## lambda9 1.59442 0.77784 0.0055001 0.0062135 ## lambda10 2.00107 0.42659 0.0030164 0.0032289 ## beta 0.90430 0.53192 0.0037612 0.0110664 ## alpha 0.69284 0.27124 0.0019179 0.0068604 ## ## 2. Quantiles for each variable: ## ## numeric(0) ``` ] --- .pull-left[ The function `bmse` computes a standard error for the sample path mean using batching and, optionally, a time series adjustment based on an AR(1) model: ```r bmse <- function(x, M = 1, ts = FALSE) { bm <- apply(matrix(x, nrow = M), 2, mean) se <- sd(bm) / sqrt(length(bm)) if (ts) { r <- acf(bm, plot = FALSE, lag = 1)$acf[2] se <- se * sqrt((1 + r) / (1 - r)) } se } ``` <!-- %% $ to make emacs sane --> ] -- .pull-right.tiny-code[ Results for `\(\beta\)`: ```r bmse(v[, 11]) ## [1] 0.003761225 bmse(v[, 11], ts = TRUE) ## [1] 0.007140101 sqrt(spectrum0(v[, 11], max.length = NULL)$spec / nrow(v)) ## [1] 0.005942574 sqrt(spectrum0(v[, 11])$spec / nrow(v)) # max.length = 200 ## [1] 0.00947038 bmse(v[, 11], M = 100) # 200 batches of size 100 ## [1] 0.01035152 bmse(v[, 11], M = 100, ts = TRUE) ## [1] 0.00986565 ``` {{content}} ] -- Results for `\(\alpha\)`: ```r bmse(v[, 12]) ## [1] 0.001917943 bmse(v[, 12], ts = TRUE) ## [1] 0.006860205 sqrt(spectrum0(v[, 12], max.length = NULL)$spec / nrow(v)) ## [1] 0.003479021 sqrt(spectrum0(v[, 12])$spec / nrow(v)) ## [1] 0.006688784 bmse(v[, 12], M = 100) ## [1] 0.006951214 bmse(v[, 12], M = 100, ts = TRUE) ## [1] 0.006922896 ``` --- ### Convergence and Mixing There are many, many "convergence diagnostics" available in the literature. -- More are still being developed. -- CODA and BOA provide a rich set of choices along with references. -- Monahan also has some references. -- The simulation literature is less preoccupied with this; a good example is C. Alexopoulos and D. Goldsman (2004) "To Batch or Not To Batch," _ACM Trans. on Modeling and Simulation_ **14**, 76--114. --- Convergence is not the issue, _mixing_ is: -- * Suppose you could, possibly at great cost, obtain one draw from the target distribution and use it to start a chain. -- * The chain would then be stationary. -- * On the other hand, the conditional chain, given the value of the draw, is not stationary. -- * Mixing conditions deal with how rapidly `$$\sup_{A,B} |P(X_n \in A, X_0 \in B)-P(X_n \in A)P(X_0 \in B)| \to 0$$` --- .pull-left[ If we knew the values of `\(\sigma^2 = \Var_\pi(X_i)\)` and of `\(\tau = \lim N \Var(\overline{X}_N)\)` then we would know how well the chain mixes and how to choose `\(N\)`. {{content}} ] -- For independent sampling, `\(N = 10000\)` is typically sufficient (computational uncertainty about the posterior mean is 1% of the uncertainty in the posterior distribution). {{content}} -- Unfortunately we do not know `\(\sigma^2\)` or `\(\tau\)`. {{content}} -- We can try to estimate them from one or more sample paths. {{content}} -- We cannot be certain, by looking at sample paths alone, that the estimates are in the right ballpark. 
-- .pull-right[ An example: $$ f(x,y) = \frac{1}{2}\varphi(x)\varphi(y) + \frac{1}{2}\varphi(x-\mu)\varphi(y-\mu) $$ {{content}} ] -- This can be sampled by a simple Gibbs sampler. {{content}} -- Each full conditional is a univariate mixture of two normals. --- .pull-left[ For `\(\mu\)` = 3: ] .pull-right[ <img src="mcmc_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ For `\(\mu\)` = 6: ] .pull-right[ <img src="mcmc_files/figure-html/unnamed-chunk-56-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ For `\(\mu\)` = 8: ] .pull-right[ <img src="mcmc_files/figure-html/unnamed-chunk-57-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ For `\(\mu\)` = 9: ] .pull-right[ <img src="mcmc_files/figure-html/unnamed-chunk-58-1.png" style="display: block; margin: auto;" /> ] --- ### Outline of Diagnostic Approaches .pull-left[ Plot the data. {{content}} ] -- Single chain approaches: {{content}} -- * ANOVA on batches; {{content}} -- * Comparing beginning batch to end batch; {{content}} -- * Reverse chain and look for "out of control"; {{content}} -- * Detect how much to discard as "burn in"; {{content}} -- * Start far from center to see how quickly effect dissipates. -- .pull-right[ Multiple chain approaches: {{content}} ] -- * Look for consistency within and between chains; {{content}} -- * Use "over-dispersed" starting points; {{content}} -- * Can be run in parallel. {{content}} -- Dropping the "burn in" is a _bias/variance trade-off_. --- ### Convergence and Mixing Again Mixing and convergence are two sides of the same issue: `\begin{equation*} E[h(X_0)g(X_n)] = E[h(X_0) E[g(X_n)|X_0]] \end{equation*}` -- So mixing behavior such as `\begin{equation*} E[h(X_0)g(X_n)] \to E[h(X_0)]E[g(X_0)] \end{equation*}` is essentially equivalent to `\begin{equation*} E[g(X_n)|X_0] \to E[g(X_0)] \end{equation*}` --- layout: true ## Combining MCMC Samplers --- Some samplers may mix well but be very costly. -- Other samplers may be good at one kind of jump and not so good at others. -- Suppose `\(P_1\)` and `\(P_2\)` are two transition kernels with common invariant distribution `\(\pi\)`. -- Then * `\(P_1 P_2\)` is a transition kernel with invariant distribution `\(\pi\)`. -- * `\(\alpha P_1 + (1-\alpha)P_2\)` is a transition kernel with invariant distribution `\(\pi\)` for any `\(\alpha \in [0,1]\)`. -- For a _mixture kernel_ `\(P = \alpha P_1 + (1-\alpha)P_2\)` with `\(0 < \alpha < 1\)` -- * if either `\(P_1\)` or `\(P_2\)` is irreducible, then `\(P\)` is irreducible; -- * if either `\(P_1\)` or `\(P_2\)` is uniformly ergodic, then `\(P\)` is uniformly ergodic. --- Metropolis-Hastings kernels such as ones that * make an independence proposal from an approximation to the posterior distribution; -- * propose a value reflected around an axis or through a point; -- * propose a rotated version of the current state -- can often be useful. -- Combinations can be used to improve theoretical properties, such as making a chain reversible. --- layout: true ## Improving Mixing and Convergence --- Some possible strategies: -- * transforming parameters; -- * blocking; -- * auxiliary variables; -- * heating, alternating, and reweighting. -- Many basic ideas from optimization also apply: -- * make sure the problem is reasonably scaled; -- * make sure the problem is well conditioned; -- * eliminate range constraints where possible. --- ### Transformations .pull-left[ For random walk MH samplers, eliminating range constraints is useful.
{{content}} ] -- For variable-at-a-time samplers, transforming to make variables nearly uncorrelated helps mixing. {{content}} -- For a hierarchical model `\begin{align*} \mu_i|\mu &\sim \text{independent N$(\mu, \sigma^2)$}\\ \mu &\sim \text{N}(0, 1) \end{align*}` with `\(i = 1, \dots, K\)`, we have `\begin{align*} \text{Corr}(\mu_i,\mu_j) &= 1/(1+\sigma^2)\\ \text{Corr}(\mu_i,\mu) &= 1/\sqrt{1+\sigma^2} \end{align*}` -- .pull-right[ These will be close to one if `\(\sigma^2\)` is small. {{content}} ] -- But for `\begin{equation*} \alpha_i = \mu_i - \mu \end{equation*}` the parameters `\(\alpha_1,\dots, \alpha_K, \mu\)` are independent. {{content}} -- Linear transformations that cause many variables to be updated at once often make the cost of a single update much higher. --- ### Blocking Sometimes several parameters can be updated together as a _block_. -- Exact sampling of a block is usually only possible if the joint distribution is related to the multivariate normal distribution. -- * Exact block sampling usually improves mixing. -- * The cost of sampling a block often increases with the square or cube of the block size. -- Block sampling with the Metropolis-Hastings algorithm is also possible. -- * The rejection probability usually increases with block size. -- * The cost of proposals often increases with the square or cube of the block size. -- At times choosing overlapping blocks may be useful. -- In some cases some level of blocking may be essential to ensure irreducibility. --- ### Auxiliary Variables Suppose we want to sample from `\(\pi(x)\)`. -- We can expand this to a joint distribution `\begin{equation*} \pi(x,y) = \pi(x) \pi(y|x) \end{equation*}` on `\(x, y\)`. -- This may be useful if we can choose `\(\pi(y|x)\)` so that -- * the joint distribution `\(\pi(x,y)\)` is simple in some way; -- * the conditional distribution `\(\pi(x|y)\)` is simple in some way. -- Data augmentation is one example of this. --- If `\(\pi(x) = h(x)g(x)\)` then it may be useful to take `\begin{equation*} Y|X=x \sim \text{U}[0, g(x)] \end{equation*}` -- Then `\begin{equation*} \pi(x,y) = h(x) 1_{[0,g(x)]}(y) \end{equation*}` -- In particular, if `\(h(x)\)` is constant, then `\(\pi(x,y)\)` is uniform on `\begin{equation*} \{(x,y): 0 \le y \le g(x)\} \end{equation*}` -- This is the idea used in rejection sampling. --- With `\(h\)` constant, the conditional distribution of `\(X|Y=y\)` is uniform on `\(\{x: g(x) \ge y\}\)`; taking `\(g\)` to be the (unnormalized) target `\(\pi\)`, this is `\(\{x: \pi(x) \ge y\}\)`. -- Alternately sampling `\(X|Y\)` and `\(Y|X\)` from these uniform conditionals is called _slice sampling_. -- Other methods of sampling from this uniform distribution are possible: -- * Random direction (hit-and-run) sampling; -- * Metropolis-Hastings sampling. -- Ratio of uniforms sampling can also be viewed as an auxiliary variable method. --- ### Example: Tobacco Budworms We previously used a latent variable approach to a probit regression model `\begin{align*} Z_i &= \begin{cases} 1 & \text{if}\,\, Y_i \ge 0\\ 0 & \text{if}\,\, Y_i < 0 \end{cases}\\ Y_i | \alpha, \beta, x &\sim \text{N}(\alpha + \beta (x_i - \overline{x}),1)\\ \alpha, \beta &\sim \text{flat non-informative prior distribution} \end{align*}` It is sometimes useful to introduce an additional non-identified parameter to improve mixing of the sampler.
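-- For reference, here is a minimal sketch of the basic two-block Gibbs sampler for this model (hypothetical code, not the linked `worms.Rmd` example); `z` is the 0/1 response and `x` the covariate:

```r
## minimal sketch: latent-variable Gibbs sampler for probit regression
## with a flat prior on (alpha, beta)
probitGibbs <- function(z, x, N) {
    xc <- x - mean(x)
    n <- length(z)
    v <- matrix(nrow = N, ncol = 2, dimnames = list(NULL, c("alpha", "beta")))
    alpha <- beta <- 0
    for (k in 1:N) {
        ## latent Y given Z: truncated normal draws by inverting the CDF
        m <- alpha + beta * xc
        p0 <- pnorm(0, m)       # P(Y < 0 | alpha, beta)
        u <- runif(n)
        y <- qnorm(ifelse(z == 1, p0 + u * (1 - p0), u * p0), m)
        ## (alpha, beta) given Y: the centered design makes these independent
        alpha <- rnorm(1, mean(y), 1 / sqrt(n))
        beta <- rnorm(1, sum(xc * y) / sum(xc^2), 1 / sqrt(sum(xc^2)))
        v[k, ] <- c(alpha, beta)
    }
    v
}
```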
--- .pull-left[ One possibility is to add a variance parameter: `\begin{align*} Z_i &= \begin{cases} 1 & \text{if}\,\, Y_i \ge 0\\ 0 & \text{if}\,\, Y_i < 0 \end{cases}\\ Y_i | \widetilde{\alpha}, \widetilde{\beta}, x &\sim \text{N}(\widetilde{\alpha} + \widetilde{\beta} (x_i - \overline{x}), \sigma^2)\\ \widetilde{\alpha}, \widetilde{\beta} &\sim \text{flat non-informative prior distribution}\\ \sigma^2 &\sim \text{TruncatedInverseGamma}(\nu_0, a_0, T)\\ \alpha &= \widetilde{\alpha} / \sigma\\ \beta &= \widetilde{\beta} / \sigma \end{align*}` ] .pull-right[ Several schemes are possible: {{content}} ] -- * Generate `\(Y\)`, `\(\sigma\)`, `\(\widetilde{\alpha}\)`, and `\(\widetilde{\beta}\)` from their full conditional distributions. {{content}} -- * Generate `\(\sigma\)` from its conditional distribution given `\(Y\)`, i.e. integrating out `\(\widetilde{\alpha}\)` and `\(\widetilde{\beta}\)`, and sample the others from their full conditional distributions. {{content}} -- * Generate `\(Y\)` from its conditional distribution given the identifiable `\(\alpha\)` and `\(\beta\)` by generating a `\(\sigma^*\)` value from the prior distribution; generate the others from their full conditional distributions. {{content}} -- Sample code is available [here](../examples/worms.Rmd). --- ### Swendsen-Wang Algorithm For the Potts model with `\(C\)` colors `\begin{equation*} \pi(x) \propto \exp\{\beta \sum_{(i,j) \in \mathcal{N}} 1_{x_i=x_j}\} = \prod _{(i,j) \in \mathcal{N}}\exp\{\beta 1_{x_i=x_j}\} \end{equation*}` with `\(\beta > 0\)`. `\(\mathcal{N}\)` is the set of neighboring pairs. -- Single-site updating is easy but may mix slowly if `\(\beta\)` is large. --- Taking `\(Y_{ij}|X=x\)` for `\((i,j) \in \mathcal{N}\)` to be independent and uniform on `\([0,\exp\{\beta 1_{x_i=x_j}\}]\)` makes the joint density `\begin{equation*} \pi(x,y) \propto \prod _{(i,j) \in \mathcal{N}} 1_{[0,\exp\{\beta 1_{x_i=x_j}\}]}(y_{ij}) \end{equation*}` -- * The conditional distribution of `\(X\)` given `\(Y=y\)` is uniform on the possible configurations. -- * If `\(y_{ij} > 1\)` then `\(x_i = x_j\)`; otherwise there are no further constraints. -- * The nodes can be divided into patches that are constrained to be the same color. -- The colors of the patches are independent and uniform on the available colors. -- The algorithm that alternates generating `\(Y|X\)` and `\(X|Y\)` was introduced by Swendsen and Wang (1987). --- <img src="mcmc_files/figure-html/unnamed-chunk-59-1.png" style="display: block; margin: auto;" /> -- For models without an external field this approach mixes much more rapidly than single-site updating. -- Nott and Green (2004) propose a Bayesian variable selection algorithm based on the Swendsen-Wang approach. --- ### Hamiltonian Monte Carlo Hamiltonian Monte Carlo (HMC) is an auxiliary variable method for producing a Markov chain with invariant distribution `\(f(\theta)\)`. -- HMC is also known as Hybrid Monte Carlo. -- HMC requires that `\(f(\theta)\)` be differentiable and that the gradient of `\(\log f(\theta)\)` be computable. --- A motivation for the method: -- * View `\(\theta\)` as the position of an object on a surface, with potential energy `\(-\log f(\theta)\)`. -- * Add a random momentum vector `\(r\)`, with kinetic energy `\(\frac{1}{2} r \cdot r\)`. -- * Compute where the object will be after time `\(T\)` by solving the differential equation of Hamiltonian dynamics.
-- * The numerical solution uses a discrete approximation with `\(L\)` steps of size `\(\epsilon\)`, with `\(T = \epsilon L\)`. -- The random momentum values are sampled as independent standard normals. --- The algorithm produces a Markov chain with invariant density $$ h(\theta, r) \propto f(\theta) \exp\left(-\frac{1}{2} r \cdot r\right). $$ -- If the differential equation is solved exactly then the `\(\theta,r\)` pair moves along contours of the energy surface `\(-\log h(\theta, r)\)`. -- With discretization this is not exactly true, and a Metropolis-Hastings step is used to correct for discretization errors. -- With a good choice of `\(T = \epsilon L\)` the algorithm can take very large steps and mix much better than a random walk Metropolis algorithm or simple Gibbs sampler. -- `\(\epsilon\)` has to be chosen to be large enough to move a reasonable distance but small enough to keep the acceptance probability from becoming too small. --- The basic algorithm: <img src="leap.png" width="60%" style="display: block; margin: auto;" /> --- The Leapfrog step produces a deterministic proposal `\((\tilde{\theta},\tilde{r}) = \text{Leapfrog}(\theta, r)\)`. -- It is reversible: `\((\theta,r) = \text{Leapfrog}(\tilde{\theta}, \tilde{r})\)`. -- It also satisfies `\(|\det \nabla_{\theta,r}\text{Leapfrog}(\theta, r)| = 1\)`. -- Without this property a Jacobian correction would be needed in the acceptance probability.
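-- In R the Leapfrog map takes only a few lines; this is a minimal sketch (assuming `grad` computes `\(\nabla_\theta \log f(\theta)\)`; it mirrors the algorithm shown above but is not the course's `hmc.R` example linked below):

```r
## minimal leapfrog sketch; grad is assumed to return the gradient of log f
leapfrog <- function(theta, r, eps, L, grad) {
    for (i in 1:L) {
        r <- r + (eps / 2) * grad(theta)  # half step for r
        theta <- theta + eps * r          # full step for theta
        r <- r + (eps / 2) * grad(theta)  # another half step for r
    }
    list(theta = theta, r = -r)           # negating r makes the map reversible
}
```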
-- Scaling of the distribution of `\(\theta\)` will affect the sampler's performance; it is useful to scale so the variation in the `\(r_i\)` is comparable to the variation in the `\(\theta_i\)`. -- Since `\(L\)` gradient evaluations are needed for each step the algorithm can be very expensive. -- Pilot runs are usually used to tune `\(\epsilon\)` and `\(L\)`. -- It is also possible to choose values of `\(\epsilon\)` and `\(L\)` at random, independently of `\(\theta\)` and `\(r\)`, before each Leapfrog step. -- The No-U-Turn Sampler (NUTS) provides an approach to automatically tuning `\(\epsilon\)` and `\(L\)`. -- NUTS is the basis of the [Stan](https://mc-stan.org/) framework for automated posterior sampling. --- References: > Duane S, Kennedy AD, Pendleton BJ, Roweth D (1987). "Hybrid > Monte Carlo." Physics Letters B, 195, 216--222. > Neal R (2011). "MCMC Using Hamiltonian Dynamics." In S Brooks, A > Gelman, G Jones, X-L Meng (eds.), Handbook of Markov Chain Monte > Carlo, p. 113-162. Chapman \& Hall, Boca Raton, FL. > Hoffman M, Gelman A (2012). "The No-U-Turn Sampler: Adaptively > Setting Path Lengths in Hamiltonian Monte Carlo." Journal of > Machine Learning Research, 1-30. > [Stan project home page](https://mc-stan.org/). A simple R implementation is [available online](../examples/hmc.R). --- ### Pseudo-Marginal Metropolis-Hastings MCMC The Metropolis-Hastings method using a proposal density `\(q(x, x')\)` for sampling from a target proportional to `\(f\)` uses the acceptance ratio `\begin{equation*} A(x, x') = \frac{f(x') q(x', x)}{f(x)q(x, x')}. \end{equation*}` -- Sometimes the target `\(f\)` is expensive or impossible to compute, but a non-negative unbiased estimate is available. -- Suppose, after generating a proposal `\(x'\)`, such an estimate `\(y'\)` of `\(f(x')\)` is produced and used in the acceptance ratio `\begin{equation*} \hat{A}(x, x') = \frac{y' q(x', x)}{y q(x, x')}. \end{equation*}` -- The previous estimate `\(y\)` for `\(f(x)\)` has to be retained and used. -- This produces a joint chain in `\(x, y\)`. --- The marginal invariant distribution of the `\(x\)` component has density proportional to `\(f(x)\)`. -- To see this, denote the density of the estimate `\(y\)` given `\(x\)` as `\(h(y|x)\)` and write `\begin{equation*} \hat{A}(x, x') = \frac{y' h(y'|x')}{y h(y|x)} \frac{q(x', x)h(y | x)}{q(x, x')h(y' | x')}. \end{equation*}` -- This is the acceptance ratio for a Metropolis-Hastings chain with target density proportional to `\(y h(y|x)\)`. -- Since `\(y\)` is unbiased, the marginal density of `\(x\)` is `\begin{equation*} \int y h(y|x) dy = f(x). \end{equation*}` -- This is known as the _pseudo-marginal_ method introduced by Andrieu and Roberts (2009) extending earlier work of Beaumont (2003). -- A number of extensions and generalizations are also available. --- #### Doubly-Intractable Posterior Distributions For some problems a likelihood for data `\(y\)` is of the form `\begin{equation*} p(y|\theta) = \frac{g(y, \theta)}{Z(\theta)} \end{equation*}` where `\(g(y,\theta)\)` is available but `\(Z(\theta)\)` is expensive or impossible to evaluate. -- The posterior distribution is then `\begin{equation*} p(\theta|y) \propto \frac{g(y, \theta) p(\theta)}{Z(\theta)}. \end{equation*}` -- But this is again not computable because of the likelihood normalizing constant `\(Z(\theta)\)`.
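-- The generic pseudo-marginal step itself is simple to code once a non-negative unbiased estimate is available; here is a minimal sketch with a symmetric random walk proposal, where `fhat` (hypothetical) returns the unbiased estimate of the target:

```r
## minimal sketch of one pseudo-marginal MH step; the symmetric proposal
## makes the q terms cancel, and the old estimate y is carried along
pmStep <- function(x, y, fhat, scale = 1) {
    xp <- x + rnorm(length(x), 0, scale)  # propose x'
    yp <- fhat(xp)                        # unbiased estimate of f(x')
    if (runif(1) < yp / y)                # accept with probability min(1, y'/y)
        list(x = xp, y = yp)              # keep x' and its estimate
    else
        list(x = x, y = y)                # retain x and the old estimate
}
```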
--- .pull-left[ For a fixed value `\(\hat{\theta}\)` of `\(\theta\)` it is useful to write the posterior density as `\begin{equation*} p(\theta|y) \propto g(y, \theta) p(\theta)\frac{Z(\hat{\theta})}{Z(\theta)}, \end{equation*}` {{content}} ] -- Suppose it is possible for a given `\(\theta\)` to simulate a draw `\(y^*\)` from `\(p(y|\theta)\)`. {{content}} -- Then `\begin{align*} E\left[\frac{g(y^*, \hat{\theta})}{g(y^*, \theta)}\right] &= \int \frac{g(y^*, \hat{\theta})}{g(y^*, \theta)} p(y^*|\theta) dy^*\\ &= \frac{1}{Z(\theta)} \int g(y^*, \hat{\theta}) dy^* = \frac{Z(\hat{\theta})}{Z(\theta)}. \end{align*}` -- .pull-right[ So an unbiased importance-sampling estimate of `\(p(\theta|y)\)` is `\begin{equation*} \hat{p}(\theta|y) = g(y, \theta) p(\theta)\frac{g(y^*, \hat{\theta})}{g(y^*, \theta)} \end{equation*}` {{content}} ] -- Generating multiple `\(y^*\)` samples is also possible. {{content}} -- Reducing the variance of the estimate generally reduces rejection rates and improves mixing of the sampler. --- ### Heating and Reweighting Let `\(f(x)\)` have finite integral and let `\(f_T(x) = f(x)^{1/T}\)`. -- If `\(f_T\)` has finite integral, then we can run a Markov Chain with invariant distribution `\(f_T\)`. -- Increasing `\(T\)` flattens the target density and may lead to a faster mixing chain---this is called _heating_. -- Decreasing `\(T\)` leads to a more peaked `\(f_T\)` concentrated more around the global maximum of `\(f\)`. -- Careful choice of a _cooling schedule_ `\(T_n \to 0\)` can produce an inhomogeneous chain that converges to the global maximum. This is called _simulated annealing_. -- Using a fixed `\(T > 1\)` can produce a faster mixing chain than `\(T=1\)`. -- More generally, using a similar but more dispersed, or more easily sampled, density `\(g\)` may produce a faster mixing chain. --- If `\(X_1, X_2, \dots\)` is a Markov chain with invariant density `\(g\)`, then, under reasonable conditions, `\begin{equation*} \frac{\sum W_i h(X_i)}{\sum W_i} \to \frac{\int h(x) f(x) dx}{\int f(x) dx} \end{equation*}` where `\(W_i = f(X_i)/g(X_i)\)`. -- This approach can also be used for sensitivity analysis: -- * Sample from the primary distribution of interest `\(g\)`. -- * Examine how results change for various perturbations `\(f\)` using the original sample from `\(g\)` and reweighting to `\(f\)`. -- * Reusing the sample is a form of common variate use. -- Instead of keeping weights one can resample with probabilities proportional to the weights. --- ### Switching and Parallel Chains Suppose `\(f_1,\dots,f_k\)` are unnormalized densities, `\(a_1,\dots,a_k\)` are positive numbers, and `\(p_{ij}\)` are transition probabilities on `\(\{1,\dots,k\}\)`. -- A sampler on `\((X,I)\)` can be run as: -- * when `\(I = i\)`, run a sampler with invariant distribution `\(f_i\)` for `\(K\)` steps. -- * Then choose an index `\(J \in \{1,\dots,k\}\)` with probabilities `\(p_{i1},\dots,p_{ik}\)`. -- * With probability `$$\min\left\{\frac{p_{Ji} a_J f_J(X)}{p_{iJ} a_i f_i(X)}, 1\right\}$$` accept the proposal and set `\(I = J\)`; otherwise keep `\(I = i\)` -- The resulting chain has an invariant distribution with `\(f(x|i) \propto f_i\)`. --- Usually one distribution, say `\(f_1\)`, is the primary target distribution and the others are successively "hotter" alternatives. -- The hottest distribution may allow independent sampling. -- This approach is called _simulated tempering_. 
-- Care is needed in choosing `\(a_i\)` and `\(p_{ij}\)` to ensure the chain does not get stuck. -- A variant runs `\(k\)` chains in parallel and periodically proposes a permutation of states, which is accepted with an appropriate probability. This is called _parallel tempering_. -- Parallel tempering does not require constants `\(a_i\)`; the joint distribution of the chains has density proportional to `\(f_1(x_1) \cdots f_k(x_k)\)`. --- Some references: > Geyer, C. (1991) "Markov chain Monte Carlo maximum likelihood," > _Computing Science and Statistics: The 23rd Symposium on the > Interface_, Interface Foundation, 156--163. > Geyer, C. and Thompson, E. (1995) "Annealing Markov chain Monte Carlo > with applications to ancestral inference," _JASA_, 909--920. > Marinari, E. and Parisi, G. (1992) "Simulated tempering: a new Monte > Carlo scheme," _Europhysics Letters_, 451--458. Sample code is available [here](../examples/mcmc2.Rmd) ([slides](mcmcEx2.html)). --- layout: true ## Regeneration and MCMC --- A process `\(X_1, X_2, \dots\)` is _regenerative_ if there exists a sequence of random variables `\(T_1 \le T_2 \le T_3 \dots\)` such that the `\(T_i\)` form a (possibly delayed) renewal process. -- The _tour lengths_ and _tours_ `\begin{equation*} (T_{i+1}-T_i, X_{T_i+1}, X_{T_i+2}, \dots, X_{T_{i+1}}) \end{equation*}` are independent and identically distributed. -- Suppose `\(X_n\)` is regenerative with stationary distribution `\(\pi\)`. Let `\(T_0 = 0\)`, `\begin{align*} N_i &= T_i - T_{i - 1}\\ Y_i &= \sum_{j=T_{i-1}+1}^{T_i} h(X_j) \end{align*}` -- If `\(E[|Y_i|] < \infty\)` and `\(E[N_i] < \infty\)` then `\begin{equation*} \widehat{\theta}_n = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n N_i} = \frac{\overline{Y}}{\overline{N}} \to \theta = E_\pi[h(X)] \end{equation*}` --- If `\(E[Y_i^2] < \infty\)` and `\(E[N_i^2] < \infty\)` then `\begin{equation*} \sqrt{n}(\widehat{\theta}_n-\theta) \to \text{N}(0,\tau) \end{equation*}` and `\(\tau\)` can be estimated by the variance estimation formula for a ratio estimator: `\begin{equation*} \widehat{\tau} = \frac{\frac{1}{n}\sum(Y_i-\widehat{\theta}_nN_i)^2}{\overline{N}^2} \end{equation*}` -- For a regenerative process we can simulate tours independently and in any order. -- An irreducible discrete state space Markov chain is regenerative with the `\(T_i\)` corresponding to the hitting times of any chosen state. -- Irreducible general state space chains are also regenerative. -- Finding regeneration times can be hard and may involve using auxiliary variables. --- If we have an approximate envelope `\(h\)` available, then -- * we can use Metropolized rejection sampling for a target distribution `\(f\)`; -- * every time we get `\(f(X_i) \le h(X_i)\)` the next proposal will be accepted; -- * so every step with `\(f(X_i) \le h(X_i)\)` is a regeneration time. -- Periodically using a Metropolized rejection step is the simplest way to introduce regeneration in MCMC. -- How well it works depends on the quality of the envelope and the other sampler it is used in conjunction with. -- Other methods are available for identifying regeneration points. -- Regenerative analysis does _not_ make a sampler better: poorly mixing samplers have tour length distributions with long tails. -- Sample code is available [here](../examples/mcmc2.Rmd) ([slides](mcmcEx2.html#regeneration)).
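-- The ratio estimate and its standard error are easy to compute from the tours; here is a minimal sketch (hypothetical code, not the linked example), where `y` holds the tour sums `\(Y_i\)` and `n` the tour lengths `\(N_i\)`:

```r
## minimal sketch of the regenerative estimates from tour sums and lengths
regen <- function(y, n) {
    theta <- sum(y) / sum(n)                    # ratio estimate
    tau <- mean((y - theta * n)^2) / mean(n)^2  # ratio estimator variance
    list(theta = theta, se = sqrt(tau / length(y)))
}
```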
--- layout: true ## Transdimensional MCMC --- A number of problems have parameter spaces that are unions of spaces of different dimensions: -- * model selection problems; -- * finite mixture models with unknown number of components; -- * model-based clustering; -- * partitioned regression models; -- * spline models with unknown number of knots. -- For each of these the parameter space can be viewed as taking the form `\begin{equation*} \Theta = \bigcup_{k \in \mathcal{K}} \left(\Theta_k \times \{k\}\right) \end{equation*}` --- A Bayesian formulation usually involves specifying -- * a prior on `\(k\)`; -- * a conditional prior, given `\(k\)`, on the parameters in `\(\Theta_k\)`. -- An MCMC approach needs a way of moving between models. -- Several approaches are available: -- * integrating out `\(\theta_k\)` (sometimes viable); -- * _reversible jump_ sampler; -- * _birth and death_ sampler; -- * other special purpose samplers. -- A useful review paper by Sisson appeared in JASA, September 2005. --- ### Reversible Jump MCMC .pull-left[ In Bayesian model selection problems we have {{content}} ] -- * a set of `\(M\)` models with parameter spaces `\(\Theta_1,\dots,\Theta_M\)`; {{content}} -- * a set of likelihoods `\(f_i(x|\theta_i)\)` with `\(\theta_i \in \Theta_i\)`; {{content}} -- * conditional prior distributions given the model `\(\pi(\theta_i|i)\)`; {{content}} -- * prior probabilities `\(\pi(i)\)` on the models. -- .pull-right[ The posterior probabilities of the models are proportional to `\begin{equation*} \pi(i) \int_{\Theta_i} f_i(x|\theta_i)\pi(\theta_i|i) d\theta_i \end{equation*}` {{content}} ] -- The odds of model `\(i\)` to model `\(j\)` can be written as `\begin{equation*} \frac{\int_{\Theta_i} f_i(x|\theta_i)\pi(\theta_i|i) d\theta_i} {\int_{\Theta_j} f_j(x|\theta_j)\pi(\theta_j|j) d\theta_j} \frac{\pi(i)}{\pi(j)} = B_{ij}(x)\frac{\pi(i)}{\pi(j)} \end{equation*}` {{content}} -- `\(B_{ij}(x)\)` is called the _Bayes factor_ for model `\(i\)` against model `\(j\)`. --- One computational option is to run separate samplers for each model and estimate the normalizing constants. -- Another option is to run a single sampler that moves both within and between models. -- To move between models we need a proposal distribution `\(Q_{ij}(u,dv)\)` for proposing a model `\(j\)` and a value `\(v\)` in model `\(j\)` when currently at `\(u\)` in model `\(i\)`. -- The proposal is accepted with probability `\begin{equation*} \alpha_{ij}(u,v) = \min\left\{ \frac{\pi_j(dv|x) Q_{ji}(v,du)}{\pi_i(du|x) Q_{ij}(u,dv)}, 1 \right\} = \min\left\{r_{ij}(u, v), 1\right\} \end{equation*}` where `\(\pi_k(d\psi|x) = \pi(k) f_k(x|\psi)\pi(\psi|k) d\psi\)`. -- With care the proposal for going from a larger model to a smaller one can be chosen to be deterministic. -- This is the _reversible jump_ sampler of Green (1995). --- ### A Simple Example: Normal Means Suppose `\(X_1, X_2\)` are independent. -- * Model 1: `\(X_1, X_2 \sim \text{N}(\mu, 1)\)`, `\(\mu \sim N(0,b^2)\)`. -- * Model 2: `\(X_1 \sim \text{N}(\mu_1,1)\)`, `\(X_2 \sim \text{N}(\mu_2,1)\)`, `\(\mu_i \sim \text{N}(0, b^2)\)` and independent. -- The two models are assumed equally likely a priori. -- The jump proposals: -- * To move from 1 to 2: Generate `\(\mu_1\sim \text{N}(\mu,1)\)` and set `\(\mu_2=2\mu-\mu_1\)`. -- * To move from 2 to 1 set `\(\mu = (\mu_1+\mu_2)/2\)`. --- Let `\(\varphi(z)\)` be the standard normal density. 
-- The `\(\pi_k(d\psi|x)\)` are `\begin{align*} \pi_1(d\mu|x) &= \frac{1}{2} \varphi(x_1 - \mu) \varphi(x_2 - \mu) b^{-1} \varphi(\mu / b) d\mu\\ \pi_2(d\mu_1, d\mu_2|x) &= \frac{1}{2} \varphi(x_1 - \mu_1) \varphi(x_2 - \mu_2) b^{-1} \varphi(\mu_1 / b) b^{-1} \varphi(\mu_2 / b) d\mu_1 d\mu_2 \end{align*}` -- The `\(Q_{ij}\)` are `\begin{align*} Q_{12}(\mu, d\mu_1, d\mu_2) &= \varphi(\mu_1 - \mu) d\mu_1 \delta_{2\mu - \mu_1}(d\mu_2)\\ Q_{21}(\mu_1, \mu_2, d\mu) &= \delta_{(\mu_1 + \mu_2) / 2}(d\mu) \end{align*}` --- The acceptance ratio `\(r_{12}\)` for a move from model 1 to model 2 is `\begin{align*} r_{12}(\mu,\mu_1,\mu_2) &= \frac{\pi_2(d\mu_1, d\mu_2|x) Q_{21}(\mu_1, \mu_2, d\mu)} {\pi_1(d\mu|x) Q_{12}(\mu, d\mu_1, d\mu_2)} \\ &= \frac{\frac{1}{2}\varphi(x_1-\mu_1)\varphi(x_2-\mu_2) b^{-1}\varphi(\mu_1/b) b^{-1}\varphi(\mu_2/b) d\mu_1 d\mu_2 \, \delta_{(\mu_1 + \mu_2) / 2}(d\mu)} {\frac{1}{2}\varphi(x_1-\mu)\varphi(x_2-\mu)b^{-1}\varphi(\mu/b) \varphi(\mu_1-\mu)d\mu d\mu_1 \, \delta_{2\mu - \mu_1}(d\mu_2)}\\ &= \frac{\varphi(x_1-\mu_1)\varphi(\mu_1/b)\varphi(x_2-\mu_2) \varphi(\mu_2/b)} {b\varphi(x_1-\mu)\varphi(x_2-\mu)\varphi(\mu/b)\varphi(\mu_1-\mu)} \frac{d\mu_2}{d\mu}\\ &= \frac{2}{b} \frac{\varphi(x_1-\mu_1)\varphi(\mu_1/b)\varphi(x_2-\mu_2) \varphi(\mu_2/b)} {\varphi(x_1-\mu)\varphi(x_2-\mu)\varphi(\mu/b)\varphi(\mu_1-\mu)} \end{align*}` -- The acceptance probabilities are `\begin{align*} \alpha_{12}(\mu,\mu_1,\mu_2) &= \min(r_{12}(\mu,\mu_1,\mu_2), 1)\\ \alpha_{21}(\mu_1, \mu_2, \mu) &= \min(1/r_{12}(\mu,\mu_1,\mu_2), 1) \end{align*}` --- R code to implement a within-model Gibbs step followed by a jump proposal: .tiny-code[ ```r rj <- function(m, N, x1 = 1, x2 = -1, b = 1) { lr12 <- function(m, m1, m2) log(2 / b) - 0.5 * ((x1 - m1)^2 + (m1 / b)^2 + (x2 - m2)^2 + (m2 / b)^2) + 0.5 * ((x1 - m)^2 + (x2 - m)^2 + (m / b)^2 + (m1 - m)^2) xbar <- (x1 + x2) / 2 v <- matrix(nrow = N, ncol = 3) I <- 1 m <- m1 <- m2 <- 0 for (i in 1:N) { if (I == 1) { m <- rnorm(1, xbar * b^2 / (1 / 2 + b^2), b / sqrt(1 + 2 * b^2)) m1 <- rnorm(1, m) m2 <- 2 * m - m1 if (log(runif(1)) < lr12(m, m1, m2)) I <- 2 } else { m1 <- rnorm(1, x1 * b^2 / (1 + b^2), b / sqrt(1 + b^2)) m2 <- rnorm(1, x2 * b^2 / (1 + b^2), b / sqrt(1 + b^2)) m <- (m1 + m2) / 2 if (log(runif(1)) < -lr12(m, m1, m2)) I <- 1 } if (I == 1) v[i, ] <- c(1, m, m) else v[i, ] <- c(2, m1, m2) } v } ``` ] --- .pull-left[ A sample run: ```r v <- rj(0, 10000, x1 = 2, x2 = -2, b = 1) ``` <img src="mcmc_files/figure-html/unnamed-chunk-63-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ Autocorrelation function for the `\(\mu_1\)` series: <img src="mcmc_files/figure-html/unnamed-chunk-64-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ Estimating the posterior probability of model 1 for several values of `\(b\)`: {{content}} ] -- ```r v <- rj(0, 10000, x1 = 2, x2 = -2, b = 1) mean(ifelse(v[, 1] == 1, 1, 0)) ## [1] 0.1312 ``` {{content}} -- ```r v <- rj(0, 10000, x1 = 2, x2 = -2, b = 2) mean(ifelse(v[, 1] == 1, 1, 0)) ## [1] 0.0583 ``` {{content}} -- ```r v <- rj(0, 10000, x1 = 2, x2 = -2, b = 20) mean(ifelse(v[, 1] == 1, 1, 0)) ## [1] 0.213 ``` {{content}} -- ```r v <- rj(0, 10000, x1 = 2, x2 = -2, b = 100) mean(ifelse(v[, 1] == 1, 1, 0)) ## [1] 0.573 ``` {{content}} -- ```r v <- rj(0, 10000, x1 = 2, x2 = -2, b = 200) mean(ifelse(v[, 1] == 1, 1, 0)) ## [1] 0.7301 ``` -- .pull-right[ Care is needed when using vague priors, especially when models of different dimensions are considered.
{{content}} ] -- The code is available [here](../examples/rj.R). --- ### Alternate Approach: Mixed Distributions We can view this as a single model with means `\(\mu_1,\mu_2\)` and a prior distribution that says: -- * with probability 1/2 the means are equal and the common value has a `\(\text{N}(0,b^2)\)` distribution; -- * with probability 1/2 the means are unequal and drawn independently from a `\(\text{N}(0,b^2)\)` distribution. --- The full conditional distribution of `\(\mu_2|X_1,X_2,\mu_1\)` is a mixed discrete-continuous distribution such that `\begin{align*} P(\mu_2 = \mu_1|x_1,x_2,\mu_1) &= \frac{\frac{1}{2}\varphi(x_2-\mu_1)}{\frac{1}{2}\varphi(x_2-\mu_1)+ \frac{1}{2}\frac{1}{\sqrt{1+b^2}}\varphi(x_2/\sqrt{1+b^2})}\\ &= \frac{\sqrt{1+b^2}\varphi(x_2-\mu_1)}{\sqrt{1+b^2}\varphi(x_2-\mu_1)+ \varphi(x_2/\sqrt{1+b^2})} \end{align*}` -- and `\begin{equation*} \mu_2|x_1,x_2,\mu_1,\mu_2\neq\mu_1 \sim \text{N}(x_2 b^2/(1+b^2),b^2/(1+b^2)) \end{equation*}` -- The full conditional distribution of `\(\mu_1|X_1,X_2,\mu_2\)` is analogous. --- The Gibbs sampler can therefore be used directly. -- Metropolis-Hastings methods can also be used if care is taken in defining densities. -- With more parameters a similar approach can be used to sample pairs of parameters where the distribution can consist of -- * a discrete component; -- * a one dimensional component; -- * a two dimensional component. --- Transformations can again help: if we use `\begin{align*} \theta_1 &= \mu_1+\mu_2\\ \theta_2 &= \mu_1-\mu_2 \end{align*}` then -- * `\(\theta_1\)`, `\(\theta_2\)` are independent under both prior and posterior distributions; -- * `\(\theta_1\)` has a continuous posterior distribution; -- * `\(\theta_2\)` has a mixed posterior distribution with `\(P(\theta_2 = 0|X) > 0\)`. --- R code to implement this approach: .pull-left[ ```r peq <- function(x, m, b) { d1 <- dnorm(x, m) d2 <- dnorm(x, 0, sqrt(1 + b^2)) d1 / (d1 + d2) } genNeq <- function(x, m, b) { v <- b^2 / (1 + b^2) rnorm(1, x * v, sqrt(v)) } genMu <- function(x, m, b) { if (runif(1) < peq(x, m, b)) m else genNeq(x, m, b) } ``` ] .pull-right[ ```r mx <- function(m, N, x1 = 2, x2 = -2, b = 1) { v <- matrix(nrow = N, ncol = 2) m1 <- m2 <- m for (i in 1:N) { m1 <- genMu(x1, m2, b) m2 <- genMu(x2, m1, b) v[i, ] <- c(m1, m2) } v } ``` ] --- .pull-left[ A sample run: ```r v <- mx(0, 10000, x1 = 2, x2 = -2, b = 1) mean(v[, 1] == v[, 2]) ## [1] 0.1405 ``` <img src="mcmc_files/figure-html/unnamed-chunk-73-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ Autocorrelation function for the `\(\mu_1\)` series: <img src="mcmc_files/figure-html/unnamed-chunk-74-1.png" style="display: block; margin: auto;" /> {{content}} ] -- The code is available [here](../examples/mx.R). --- .pull-left[ What happens with a very large value of `\(b\)`?
```r v <- mx(0, 10000, x1 = 2, x2 = -2, b = 10000) mean(v[, 1] == v[, 2]) ## [1] 0.9942 ``` ] -- .pull-right[ Autocorrelation function for the `\(\mu_1\)` series: <img src="mcmc_files/figure-html/unnamed-chunk-76-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ <img src="mcmc_files/figure-html/unnamed-chunk-77-1.png" style="display: block; margin: auto;" /> ] -- .pull-right[ There are very few unique values with `\(\mu_1 = \mu_2\)`: ```r length(unique(v[v[, 1] == v[, 2], 1])) ## [1] 78 ``` {{content}} ] -- The sampler can only move to a new common mean value through a pair with `\(\mu_1 \neq \mu_2\)`. --- .pull-left[ To fix this, add a move within the small model: {{content}} ] -- ```r mx2 <- function(m, N, x1 = 2, x2 = -2, b = 1) { * xbar <- (x1 + x2) / 2 v <- matrix(nrow = N, ncol = 2) m1 <- m2 <- m for (i in 1:N) { m1 <- genMu(x1, m2, b) m2 <- genMu(x2, m1, b) * if (m1 == m2) { * m <- rnorm(1, * xbar * b^2 / (1 / 2 + b^2), * b / sqrt(1 + 2 * b^2)) * m1 <- m * m2 <- m } v[i, ] <- c(m1, m2) } v } ``` -- .pull-right[ ```r v <- mx2(0, 10000, x1 = 2, x2 = -2, b = 10000) mean(v[, 1] == v[, 2]) ## [1] 0.9929 ``` {{content}} ] -- Autocorrelation function for the `\(\mu_1\)` series: <img src="mcmc_files/figure-html/unnamed-chunk-81-1.png" style="display: block; margin: auto;" /> --- ### Birth and Death MCMC A number of models have parameters that are sets of points: -- * support set for finite mixture models; -- * knot set for spline models. -- A distribution on a set of points is a _point process_. -- Point process models can be sampled by a continuous time Markov process called a _spatial birth and death process_. --- A basic birth and death process is a continuous time non-negative integer valued Markov process `\(X(t)\)`. -- `\(X(t)\)` represents the size of the population at time `\(t\)`. -- In a population of size `\(n\)`, births occur at rate `\(\beta_n\)` and deaths at rate `\(\delta_n\)`. -- If the population size is `\(n\)` then it remains at that level for an exponentially distributed _sojourn time_ with rate `\(\beta_n + \delta_n\)`. -- At the end of a sojourn time, with probability `\(\beta_n / (\beta_n + \delta_n)\)` a birth occurs and the population size increases to `\(n + 1\)`. -- Otherwise, a death occurs and the population size decreases to `\(n - 1\)`. -- A simple model with individuals giving birth independently at rate `\(\beta\)` and dying at rate `\(\delta\)` would have `\(\beta_n = n \beta\)` and `\(\delta_n = n \delta\)`. --- For a spatial birth and death process the state `\(X(t)\)` at time `\(t\)` is a set of points. -- A set of points `\(y = \{y_1,\dots,y_n\}\)` changes by -- * _births_ that add a point: `\(y \to y \cup \{\xi\}\)` -- * _deaths_ that remove a point: `\(y \to y \backslash \{y_i\}\)` -- For a current state `\(y\)` births at location `\(\xi\)` occur at a rate `\begin{equation*} b(y, \xi) = \beta(y) \tilde{b}(y,\xi). \end{equation*}` -- The total birth rate is `\(\beta(y) = \int b(y, \xi) d\xi\)`. -- The points in a set `\(y = \{y_1,\dots,y_n\}\)` die independently with rates `\(d(y\backslash\{y_i\}, y_i)\)`. -- The total death rate is `\(\delta(y) = \sum_{i=1}^n d(y\backslash\{y_i\}, y_i)\)`. --- Suppose we wish to simulate a point process with density `\(h(y)\)` with respect to an inhomogeneous Poisson process with rate `\(\lambda(\xi)\)`.
-- A spatial birth and death process will have this point process as invariant distribution if it satisfies the detailed balance equations `\begin{equation*} h(y) b(y, \xi) = h(y \cup \{\xi\}) \lambda(\xi) d(y, \xi) \end{equation*}` -- Usual approach: -- * pick a reasonable birth rate function; -- * solve for the required death rate. --- The algorithm: starting with a set of points `\(y\)`: -- 1. Wait for an amount of time exponentially distributed with rate `\(\beta(y)+\delta(y)\)`. 2. At that time, a birth occurs with probability `\(\beta(y)/(\beta(y)+\delta(y))\)`. 3. If a birth occurs, generate the location of the new point `\(\xi\)` from `\(\tilde{b}(y, \xi)\)`. 4. If a death occurs, choose the point to die with probabilities proportional to `\(d(y\backslash\{y_i\},y_i)\)`. -- The idea is due to Ripley (1977) and Preston (1977). -- Stephens (2001) introduced a variation for Bayesian inference for finite mixture models. --- Continuous time data for pure jump processes can be represented as the sequence of states and their waiting times. -- Sample path averages are time-weighted averages. -- An alternative is to sample at a discrete grid of time points. -- Some notes: -- * There are no rejections. Instead, some points die very quickly. -- * It may be useful to add move steps that pick one point to possibly move based on, say, a Metropolis-Hastings proposal. --- ### Example: Normal Mixture Models A normal mixture model assumes that `\(X_1,\dots,X_n\)` are independent draws from the density `\begin{equation*} f(x|K, \mu, \sigma, p) = \sum_{i=1}^K p_i \frac{1}{\sigma_i}\varphi\left(\frac{x - \mu_i}{\sigma_i}\right) \end{equation*}` with `\(\varphi\)` the standard normal density and `\(p_1 + \dots + p_K = 1\)`. -- A possible prior distribution for `\(K, p, \mu, \sigma\)` can be specified as `\begin{align*} K &\sim \text{Poisson}(\lambda), \text{conditioned on}\,\, 1 \le K \le K_{\max}\\ p | K &\sim \text{Dirichlet}(\alpha,\dots,\alpha)\\ \sigma_i^2|K,p &\overset{\text{ind}}{\sim} \text{IG}(a, b)\\ \mu_i|K,p,\sigma &\overset{\text{ind}}{\sim} \text{N}(m, \sigma_{\mu}^2) \end{align*}` -- A more elaborate formulation might put priors on some of the hyperparameters. --- If we add an auxiliary variable `\(v\)` with `\begin{equation*} v | K, p, \sigma, \mu \sim \text{Gamma}\left(K \alpha, 1\right) \end{equation*}` and set `\(w_i = v p_i\)`, then `\begin{equation*} w_i | K, \sigma, \mu \overset{\text{ind}}{\sim} \text{Gamma}(\alpha, 1) \end{equation*}` -- The prior distribution of `\(K, w, \sigma^2, \mu\)` is an inhomogeneous Poisson process on `\(\Real^3\)` with rate function `\begin{align*} \lambda(K, \mu, \sigma, w) = \lambda &\times \text{Gamma densities for}\,\, w_i\\ &\times \text{Inverse Gamma densities for}\,\, \sigma_i^2 \\ &\times \text{Normal densities for}\,\, \mu_i \end{align*}` and `\(p_i = w_i / \sum_{j=1}^K w_j\)`, conditioned on `\(1 \le K \le K_{\max}\)`. --- The posterior distribution has a density `\begin{equation*} h(K, \mu, \sigma, w) \propto 1_{[1, K_{\max}]}(K)\prod_{i=1}^n f(x_i|K, \mu, \sigma, p) \end{equation*}` with respect to the Poisson process. -- Code is available [here](../examples/bdnmix.R) ([slides](bdnmix.html)). --- layout: true ## Approximate Bayesian Computation (ABC) --- All approaches to posterior sampling so far have required computing the likelihood function `\(f(x|\theta)\)`. -- For some problems this is not possible, but it is possible to simulate from `\(f(\cdot|\theta)\)`. -- A simple approach: 1.
Draw `\(\theta^*\)` from the prior distribution. 2. Run the model to simulate `\(x^*\)` from `\(f(x|\theta^*)\)`. 3. If `\(x^*\)` is close to the observed `\(x\)` then keep `\(\theta^*\)`; otherwise, go back to step (1). -- An MCMC variant of this is also used and can lead to higher acceptance rates. -- Closeness might be measured as `\(d(x, x^*) \le \epsilon\)` for some distance `\(d\)` and tolerance `\(\epsilon\)`. -- * If the tolerance is small enough the distribution of the accepted `\(\theta^*\)` values should be close to the posterior distribution `\(f(\theta|x)\)`. -- * If the tolerance is too small the acceptance probability will be too low. -- * This problem increases very quickly with the dimension of `\(x\)`. --- If a low dimensional sufficient statistic is available then the distance can be based on the sufficient statistic. -- Generally, sufficient statistics are not available in problems where ABC is needed. -- If a modest number of statistics can be chosen that are nearly sufficient, then the conditional distribution given these statistics may not be too far from the full posterior distribution. -- Much recent literature has explored ways of selecting a suitable set of conditioning statistics. -- Another direction of work explores the use of sequential Monte Carlo, and adaptive sequential Monte Carlo methods, in the ABC context ([Sisson, Fan, and Tanaka, 2007](https://www.pnas.org/doi/full/10.1073/pnas.0607208104)). -- The [Wikipedia entry](https://en.wikipedia.org/wiki/Approximate_Bayesian_computation) provides a good introduction and references. --- ### A Simple Example <!-- Similar to https://theoreticalecology.wordpress.com/2014/06/02/explaining-the-abc-rejection-algorithm-in-r/ --> Suppose `\(X_1, \dots, X_n\)` are independent `\(\text{N}(\mu, \sigma^2)\)`. -- `\(\mu\)` and `\(\sigma\)` are independent, with `\(\mu \sim \text{U}[-10, 10]\)` and `\(\sigma \sim \text{U}[0, 10]\)`. -- Generating draws from the prior distribution: ```r genPrior <- function(k) data.frame(mu = runif(k, -10, 10), sigma = runif(k, 0, 10)) ``` -- Generating sufficient statistics for a set of parameters: ```r genSS <- function(theta, n) { mu <- theta$mu sigma <- theta$sigma k <- length(mu) data.frame(xbar = rnorm(k, mu, sigma / sqrt(n)), s = sigma * sqrt(rchisq(k, n - 1) / (n - 1))) } ``` --- A sample from a standard normal distribution and the observed sufficient statistics: ```r x <- rnorm(100) xbar <- mean(x) s <- sd(x) ``` -- Generate parameters from the prior distribution and then sufficient statistics for the parameters: ```r theta <- genPrior(1e7) ss <- genSS(theta, length(x)) ``` -- Find the simulated samples with sufficient statistics close to the ones for the observed data: ```r idx <- abs(ss$xbar - xbar) + abs(ss$s - s) < 0.05 sum(idx) ## [1] 242 ``` --- The ABC estimate of the posterior density of `\(\mu\)` and the normal approximation: ```r plot(density(theta[idx, ]$mu)) z <- seq(-1.5, 1.5, length.out = 101) lines(z, dnorm(z, xbar, s / sqrt(length(x))), col = "red") ``` <img src="mcmc_files/figure-html/unnamed-chunk-87-1.png" style="display: block; margin: auto;" /> --- layout: false ## Other MCMC and Related Approaches There are many other approaches and ideas. * Particle filters. * Umbrella sampling. * Dynamic reweighting. * Adaptive MCMC (Christophe Andrieu, "Annotated Bibliography: Adaptive Monte Carlo Methods," _The ISBA Bulletin_ 15(1), March 2008; <http://www.bayesian.org/bulletin/0803.pdf>). * Special issue on adaptive Monte Carlo, _Statistics and Computing_, December 2008.
* Sequential Importance Sampling * ... -- Several book length treatments are available: * Gamerman and Lopes (2006) * Robert and Casella (2004) * Chen, Shao, and Ibrahim (2000) * Liu (2001) * Brooks, Gelman, Jones, and Meng (2011) among a number of others. --- layout: true ## Perfect Sampling and Coupling From The Past --- Suppose `\(\pi\)` is a distribution on `\(E = \{1,\dots,M\}\)` and `\(P\)` is an irreducible, aperiodic transition matrix with invariant distribution `\(\pi\)`. -- Let `\(\phi(u,i)\)` be the inverse CDF of `\(P(i, \cdot)\)`, so if `\(U\sim\text{U}[0,1]\)` then `\(\phi(U,i)\)` has distribution `\(P(i, \cdot)\)`. -- Suppose `\(U_1,U_2,\dots\)` are independent `\(\text{U}[0,1]\)` and suppose `\begin{equation*} X_{i+1} = \phi(U_i, X_i) \end{equation*}` for `\(i = -1, -2, \dots\)`. -- For this chain started in the infinite past `\(X_0 \sim \pi\)`. --- Can we figure out what `\(X_0\)` is without going infinitely far into the past? -- <img src="img/CFTP.svg" width="50%" style="display: block; margin: auto;" /> --- For `\(T < 0\)` and `\(k \in E\)` define `\begin{align*} X_{T}^{(T,k)} &= k\\ X_{i+1}^{(T,k)} &= \phi(U_i, X_{i}^{(T,k)}) \end{align*}` for `\(i = T, T+1, \dots, -2, -1\)`. -- * If `\(X_0^{(T,k)}\)` is the same state for all initial states `\(k\)`, say `\(X_0^{(T,k)} = j \in E\)`, then `\(X_0 = j\)`. The chains are said to have coupled. -- * With probability one there exists a finite `\(T < 0\)` such that all chains starting at T will have coupled by time zero. -- The _coupling from the past (CFTP)_ algorithm: -- * Start with an initial `\(T\)` and determine whether all chains have coupled by time zero. If so, return the common value at time zero. -- * If not, double `\(T\)` and repeat. --- The CFTP algorithm was introduced by Propp and Wilson (1996). -- If `\(\phi(u, i) \le \phi(u, j)\)` for every `\(u\)` and every `\(i \le j\)` then it is sufficient to consider the minimal and maximal chains `\(X_i^{(T,1)}\)` and `\(X_i^{(T,M)}\)` since `\begin{equation*} X_i^{(T,1)} \le X_i^{(T,k)} \le X_i^{(T,M)} \end{equation*}` for all `\(k \in E = \{1,\dots,M\}\)`. If the minimal and maximal chains have coupled then all chains have coupled. -- This idea can be extended to partially ordered state spaces with a minimal and maximal value. -- Extensions to some continuous state space problems have been developed. -- CFTP samplers for a number of interesting distributions in physics applications have been found. -- Progress in statistics is still limited to somewhat artificial examples. -- One issue is bias: Truncating the backward search for `\(T\)` will change the distribution of `\(X_{0}\)`. Variations are available to address this. --- ### Example: Image reconstruction with Ising Prior .pull-left[ States are partially ordered by pixel with $$ \text{"black"} > \text{"white"}. $$ {{content}} ] -- All "white" is minimal, all "black" is maximal. 
{{content}} -- A CFTP version of the vectorized Ising model sampler: -- .pull-right.tiny-code[ Compute number of neighbors with color `c`: ```r nn <- function(m, c) { nr <- nrow(m) nc <- ncol(m) nn <- matrix(0, nr, nc) nn[1:(nr - 1), ] <- nn[1:(nr - 1), ] + (m[2:nr, ] == c) nn[2:nr, ] <- nn[2:nr, ] + (m[1:(nr - 1), ] == c) nn[, 1:(nc - 1)] <- nn[, 1:(nc - 1)] + (m[, 2:nc] == c) nn[, 2:nc] <- nn[, 2:nc] + (m[, 1:(nc - 1)] == c) nn } ``` {{content}} ] -- Generate new colors for a set of pixels: ```r simGroupU <- function(m, l2, l1, beta, which, u) { pp2 <- l2 * exp(beta * nn(m, 2)) pp1 <- l1 * exp(beta * nn(m, 1)) pp <- pp2 / (pp2 + pp1) ifelse(u[which] < pp[which], 2, 1) } ``` --- .pull-left.tiny-code[ Run one step in the Markov chain: ```r simImgU <- function(m, img, beta, p, u) { white <- outer(seq_len(nrow(m)), seq_len(ncol(m)), FUN = `+`) %% 2 == 1 black <- ! white if (is.null(img)) { l2 <- 1 l1 <- 1 } else { l2 <- ifelse(img == 2, p, 1 - p) l1 <- ifelse(img == 1, p, 1 - p) } m[white] <- simGroupU(m, l2, l1, beta, white, u) m[black] <- simGroupU(m, l2, l1, beta, black, u) m } ``` ] -- .pull-right.tiny-code[ Couple from the past: ```r isingCFTP <- function(img, N, d, beta, p) { u <- array(runif(d * d * N), c(d, d, N)) repeat { m1 <- matrix(1, d, d) m2 <- matrix(2, d, d) for (i in seq_len(dim(u)[3])) { m1 <- simImgU(m1, img, beta, p, u[, , i]) m2 <- simImgU(m2, img, beta, p, u[, , i]) } if (identical(m1, m2)) return(m1) u <- array(c(array(runif(d * d * N), c(d, d, N)), u), c(d, d, 2 * N)) N <- 2 * N } } ``` ] --- .pull-left[ It takes about five seconds for 10 images: ```r img4 <- array(0, c(64, 64, 10)) system.time( for (i in 1:10) img4[, , i] <- isingCFTP(img, 10, 64, 0.9, 0.7)) ## user system elapsed ## 4.958 0.032 4.998 ``` ] -- .pull-right[ The results for the average image seem reasonable: <img src="mcmc_files/figure-html/unnamed-chunk-97-1.png" style="display: block; margin: auto;" /> ] --- Performance deteriorates as -- * dimension increases; -- * `\(\beta\)` increases. -- For `\(\beta = 1.2\)` it took about 45 seconds to generate 10 `\(64 \times 64\)` images.