class: center, middle, title-slide .title[ # Optimization ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-19 ] --- layout: true <link rel="stylesheet" href="stat7400.css" type="text/css" /> `$$\renewcommand{\Real}{\mathbb{R}}$$` <!-- $ make emacs sane --> ## Preliminaries --- Many statistical problems involve minimizing (or maximizing) a function `\(f : \mathcal{X} \to \Real\)`, i.e. finding `\begin{equation*} x^* = \underset{x \in \mathcal{X}}{\text{argmin}} f(x) \end{equation*}` -- Maximizing `\(f(x)\)` is equivalent to minimizing `\(-f(x)\)`. -- The domain `\(\mathcal{X}\)` can be * a finite set --- combinatorial optimization; -- * a continuous, usually connected, subset of `\(\Real^n\)`. --- The function can be * continuous; -- * twice continuously differentiable; -- * unrestricted. -- The function can * have a single local minimum that is also a global minimum; -- * have one or more local minima but no global minimum; -- * have many local minima and a global minimum. --- Algorithms can be designed to * find a local minimum from some specified starting point; -- * find the global minimum. -- The objective can be * to find a global minimum; -- * to find a local minimum; -- * to improve on an initial guess. --- Very good software is available for many problems * general purpose optimizers will often do well; -- * occasionally it is useful to write your own, usually to exploit special structure; -- * understanding general strategies is useful even when using canned software. -- Optimization problems often require some tuning; proper scaling is often critical. --- layout: true ## One-Dimensional Optimization --- Given an initial point `\(x^{(k)}\)` and a direction `\(d\)` we can define `\begin{equation*} x(t) = x^{(k)} + t d \end{equation*}` -- and set `\begin{equation*} x^{(k+1)} = x^{(k)} + t^* d \end{equation*}` -- where `\begin{equation*} t^* = \text{argmin}_t f(x(t)) \end{equation*}` -- with `\(t \in \Real\)` restricted to satisfy `\(x(t) \in \mathcal{X}\)`. -- This is often called a _line search_. --- Many one-dimensional optimization methods are available. -- Many are based on finding a root of the derivative. -- Newton's method approximates `\(f(x(t))\)` by a quadratic. -- Once the function `\(f\)` has been minimized in the direction `\(d\)` a new direction can be tried. -- It is usually not necessary to find the minimizer `\(t^*\)` exactly: a few steps in an iterative algorithm are often sufficient. -- R provides -- * `optimize` for minimizing a function over an interval -- * `uniroot` for finding the root of a function in an interval --- layout: true ## Choosing Search Directions --- Several methods are available for choosing search directions: * _Cyclic descent, or coordinate descent_: -- If `\(x = (x_1,\dots,x_n)\)` then minimize first with respect to `\(x_1\)`, then `\(x_2\)`, and so on through `\(x_n\)`, and repeat until convergence. -- * This method is also called _non-linear Gauss-Seidel iteration_ or _backfitting_. It is similar in spirit to Gibbs sampling. -- * This may take more iterations than other methods, but often the individual steps are very fast. -- * A recent [paper](https://web.stanford.edu/~hastie/Papers/Sparsenet/jasa_MFH_final.pdf) and the associated `sparsenet` package take advantage of this for fitting sparse linear models. --- * _Steepest descent_, or _gradient descent_: -- Take `\(d = - \nabla f(x^{(k)})\)`. -- * Steepest descent can be slow to converge due to zig-zagging.
-- * It can work well for problems with nearly circular contours. -- * Preconditioning to improve the shape of the contours can help. -- * _Conjugate gradient_: -- Start with the steepest descent direction and then choose directions conjugate to a strictly positive definite matrix `\(A\)`, often a Hessian matrix. -- * This can reduce the zig-zagging. -- * It needs to be restarted at least every `\(n\)` steps. -- * Unless the objective function `\(f\)` is quadratic a new matrix `\(A\)` is typically computed at each restart. --- Most of these methods will be _linearly convergent_: -- Under suitable regularity conditions, and if `\(x^{(0)}\)` is close enough to a local minimizer `\(x^*\)`, then -- `\begin{equation*} \lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|} = a \end{equation*}` with `\(0 < a < 1\)`. -- `\(a\)` is the _rate of convergence_. -- A method for which `\(a = 0\)` is said to converge _super-linearly_. --- layout: true ## Newton's Method --- Newton's method approximates `\(f\)` by the quadratic `\begin{equation*} f^{(k)}(x) = f(x^{(k)}) + \nabla f(x^{(k)}) (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T \nabla^2f (x^{(k)}) (x - x^{(k)}) \end{equation*}` -- and computes `\(x^{(k+1)}\)` to minimize the quadratic approximation `\(f^{(k)}\)`: `\begin{equation*} x^{(k+1)} = x^{(k)} - \left(\nabla^2f (x^{(k)})\right)^{-1} \nabla f(x^{(k)}) \end{equation*}` --- Convergence can be very fast: -- If * `\(f\)` is twice continuously differentiable near a local minimizer `\(x^*\)` -- * `\(\nabla^2f (x^*)\)` is strictly positive definite -- * `\(x^{(0)}\)` is sufficiently close to `\(x^*\)` -- then `\begin{equation*} \lim_{k \to \infty} \frac{\|x^{(k+1)}-x^*\|}{\|x^{(k)}-x^*\|^2} = a \end{equation*}` with `\(0 < a < \infty\)`. -- This is called _quadratic convergence_. --- If these conditions fail, then Newton's method may not converge, [even for a convex unimodal function](../examples/newtonbad.Rmd). -- If the new value `\(x^{(k+1)}\)` does not improve `\(f\)` then one can use a line search in the direction of the Newton step. -- Newton's method requires first and second derivatives: -- Numerical derivatives can be used. -- Computing the derivatives can be very expensive if `\(n\)` is large. --- Many implementations -- * modify the second derivative matrix if it is not strictly positive definite; -- * switch to an alternate method, such as steepest descent, if the Newton direction does not produce sufficient improvement. -- Computation of the Newton step is usually done using a Cholesky factorization of the Hessian. -- A modified factorization is often used if the Hessian is not numerically strictly positive definite. -- If -- * `\(\widetilde{\theta}\)` is a consistent estimate of `\(\theta\)`, and -- * `\(\widehat{\theta}\)` is computed as a single Newton step from `\(\widetilde{\theta}\)`, -- then, under mild regularity conditions, `\(\widehat{\theta}\)` will have the same asymptotic normal distribution as the MLE. --- layout: true ## Quasi-Newton Methods --- Quasi-Newton methods compute the next step as `\begin{equation*} x^{(k+1)} = x^{(k)} - B_k^{-1} \nabla f(x^{(k)}) \end{equation*}` where `\(B_k\)` is an approximation to the Hessian matrix `\(\nabla^2 f(x^{(k)})\)`. -- `\(B_{k+1}\)` is usually chosen so that `\begin{equation*} \nabla f(x^{(k+1)}) = \nabla f(x^{(k)}) + B_{k+1} (x^{(k+1)} - x^{(k)}) \end{equation*}` -- For `\(n = 1\)` this leads to the _secant method_; for `\(n > 1\)` this is under-determined. 
-- For `\(n > 1\)` various strategies for low rank updating of `\(B\)` are available; often these are based on successive gradient values. -- The inverse can be updated using the _Sherman-Morrison-Woodbury formula_: If `\(A\)` is invertible then `\begin{equation*} (A + u v^T)^{-1} = A^{-1}-\frac{A^{-1} uv^T A^{-1}}{1 + v^T A^{-1} u} \end{equation*}` as long as the denominator is not zero. --- Commonly used methods include -- * Davidon-Fletcher-Powell (DFP); -- * Broyden-Fletcher-Goldfarb-Shanno (BFGS). -- Methods generally ensure that -- * `\(B_k\)` is symmetric -- * `\(B_k\)` is strictly positive definite -- Convergence of Quasi-Newton methods is generally super-linear but not quadratic. --- layout: true ## Fisher Scoring --- Maximum likelihood estimates can be computed by minimizing the negative log likelihood `\begin{equation*} f(\theta) = - \log L(\theta) \end{equation*}` -- The Hessian matrix `\(\nabla^2 f(\theta^{(k)})\)` is the _observed information matrix_ at `\(\theta^{(k)}\)`. -- Sometimes the _expected information matrix_ `\(I(\theta^{(k)})\)` is easy to compute. -- The Fisher scoring method uses a Newton step with the observed information matrix replaced by the expected information matrix: `\begin{equation*} \theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1} \nabla \log L(\theta^{(k)}) \end{equation*}` --- ### Example: Logistic Regression Suppose `\(y_i \sim \text{Binomial}(m_i, \pi_i)\)` with `\begin{equation*} \pi_i = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} \end{equation*}` -- The negative log likelihood, up to additive constants, is `\begin{equation*} f(\beta) = - \sum \left(y_i x_i^T\beta - m_i \log(1 + \exp(x_i^T \beta))\right). \end{equation*}` --- The gradient is `\begin{align*} \nabla f(\beta) &= - \sum \left(y_i x_i - m_i \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} x_i\right)\\ &= - \sum \left(y_i x_i - m_i \pi_i x_i\right)\\ &= - \sum (y_i - m_i \pi_i) x_i \end{align*}` -- The Hessian is `\begin{equation*} \nabla^2 f(\beta) = \sum m_i \pi_i (1 - \pi_i)x_i x_i^T \end{equation*}` -- The Hessian is non-stochastic, so Fisher scoring and Newton's method are identical and produce the update `\begin{equation*} \beta^{(k+1)} = \beta^{(k)} + \left(\sum m_i \pi_i^{(k)} (1 - \pi_i^{(k)})x_i x_i^T\right)^{-1} \left(\sum (y_i - m_i \pi_i^{(k)}) x_i\right) \end{equation*}` --- If `\(X\)` is the matrix with rows `\(x_i^T\)`, `\(W^{(k)}\)` is the diagonal matrix with diagonal elements `\begin{equation*} W_{ii}^{(k)} = m_i\pi_i^{(k)} (1 - \pi_i^{(k)}) \end{equation*}` -- and `\(V^{(k)}\)` is the vector with elements `\begin{equation*} V_i^{(k)} = \frac{y_i - m_i \pi_i^{(k)}}{W_{ii}^{(k)}} \end{equation*}` -- then this update can be written as `\begin{equation*} \beta^{(k+1)} = \beta^{(k)} + (X^T W^{(k)} X)^{-1} X^T W^{(k)} V^{(k)} \end{equation*}` --- The change in `\(\beta\)` is the result of fitting the linear model `\(V^{(k)} \sim X\)` by weighted least squares with weights `\(W^{(k)}\)`. -- * This type of algorithm is called an _iteratively reweighted least squares_ (IRWLS) algorithm. -- * Similar results hold for all generalized linear models. -- * In these models Fisher scoring and Newton's method are identical when the _canonical link_ function is chosen which makes the natural parameter linear in `\(\beta\)`.
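---

The IRWLS update above is easy to sketch directly in R. This is a minimal illustration (the inputs `X`, `y`, and `m` are hypothetical, and there is no step-halving or check for separation):

```r
## One IRWLS iteration repeated to convergence for binomial logistic
## regression; X is the n x p model matrix, y the success counts, and
## m the binomial totals.
irwlsLogit <- function(X, y, m, maxit = 25, tol = 1e-8) {
    beta <- rep(0, ncol(X))
    for (k in 1 : maxit) {
        pr <- 1 / (1 + exp(-drop(X %*% beta)))
        W <- m * pr * (1 - pr)                  ## weights W_ii
        V <- (y - m * pr) / W                   ## working responses V_i
        delta <- solve(crossprod(X, W * X), crossprod(X, W * V))
        beta <- beta + drop(delta)
        if (sum(abs(delta)) < tol) break
    }
    beta
}
```

When both converge, the result should agree with `glm(cbind(y, m - y) ~ X - 1, family = binomial)`.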
--- layout: true ## Gauss-Newton Method --- Suppose `\begin{equation*} Y_i \sim N(\eta(x_i, \theta), \sigma^2). \end{equation*}` -- For example, `\(\eta(x, \theta)\)` might be of the form `\begin{equation*} \eta(x, \theta) = \theta_0 + \theta_1 e^{-\theta_2 x}. \end{equation*}` -- Then the MLE minimizes `\begin{equation*} f(\theta) = \sum_{i=1}^n (y_i - \eta(x_i, \theta))^2. \end{equation*}` -- We can approximate `\(\eta\)` near a current guess `\(\theta^{(k)}\)` by the Taylor series expansion `\begin{equation*} \eta_k(x_i, \theta) \approx \eta(x_i, \theta^{(k)}) + \nabla_\theta \eta(x_i, \theta^{(k)}) (\theta - \theta^{(k)}) \end{equation*}` --- This leads to the approximate objective function `\begin{align*} f_k(\theta) &\approx \sum_{i=1}^n (y_i - \eta_k(x_i, \theta))^2\\ &= \sum_{i=1}^n (y_i - \eta(x_i, \theta^{(k)}) - \nabla_\theta \eta(x_i, \theta^{(k)}) (\theta - \theta^{(k)}))^2 \end{align*}` -- This is the sum of squared deviations for a linear regression model, so is minimized by `\begin{equation*} \theta^{(k+1)} = \theta^{(k)} + (J_k^T J_k)^{-1} J_k^T(y - \eta(x, \theta^{(k)})) \end{equation*}` where `\(J_k = \nabla_\theta \eta(x, \theta^{(k)})\)` is the Jacobian matrix of the mean function. --- The Gauss-Newton algorithm works well for problems with small residuals, where it is close to Newton's method. -- Like Newton's method it can suffer from taking too large a step. -- Backtracking or line search can be used to address this. -- An alternative is the Levenberg-Marquardt algorithm that uses `\begin{equation*} \theta^{(k+1)} = \theta^{(k)} + (J_k^T J_k + \lambda I)^{-1} J_k^T(y - \eta(x, \theta^{(k)})) \end{equation*}` for some _damping parameter_ `\(\lambda\)`. -- Various strategies for choosing `\(\lambda\)` are available. -- The R function `nls` uses these approaches. --- Nonlinear models are often _partially linear_. An example: `\begin{equation*} y_i = \alpha + \beta \exp\{- x_i / \theta\}. \end{equation*}` -- For a fixed `\(\theta\)` this is a linear model. So a possible approach is: -- * Fix `\(\theta\)` and estimate `\(\alpha\)` and `\(\beta\)` by linear regression. -- * Fix `\(\alpha\)` and `\(\beta\)` and estimate `\(\theta\)` by a non-linear method. -- * Iterate until convergence.
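---

As a small illustration, a model of the partially linear form above can be fit with `nls` (the data here are simulated and the parameter values are arbitrary):

```r
## Fit y = alpha + beta * exp(-x / theta) + noise with nls().
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + 3 * exp(-x / 2) + rnorm(100, sd = 0.1)
fit <- nls(y ~ alpha + beta * exp(-x / theta),
           start = list(alpha = 1, beta = 1, theta = 1))
coef(fit)  ## should be close to alpha = 2, beta = 3, theta = 2
```

As with the other methods in this section, reasonable starting values matter; a poor choice of `theta` can keep the Gauss-Newton iterations in `nls` from converging.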
--- layout: true ## Termination and Scaling --- Optimization algorithms generally use three criteria for terminating the search: -- * relative change in `\(x\)` -- * relative or absolute change in the function values and/or derivatives -- * number of iterations -- How well these work often depends on problem scaling -- * Implementations often allow specification of scaling factors for parameters. -- * Some also allow scaling of the objective function. --- Dennis and Schnabel (1983) recommend: -- * A relative gradient criterion `$$\max_{1 \le i \le n} \left|\frac{\nabla f(x)_i \max\{|x_i|, \text{typ}x_i\}} {\max\{|f(x)|,\text{typ}f\}}\right| \le \epsilon_{\text{grad}}$$` <!-- \begin{equation*} \max_{1 \le i \le n} \left|\frac{\nabla f(x)_i \max\{|x_i|, \text{typ}x_i\}} {\max\{|f(x)|,\text{typ}f\}}\right| \le \epsilon_{\text{grad}} \end{equation*} --> where `\(\text{typ}x_i\)` and `\(\text{typ}f\)` are typical magnitudes of `\(x_i\)` and `\(f\)`. -- * A relative change in `\(x\)` criterion `$$\max_{1 \le i \le n} \frac{|\Delta x_i|}{\max\{|x_i|, \text{typ}x_i\}} \le \epsilon_{\text{step}}$$` <!-- \begin{equation*} \max_{1 \le i \le n} \frac{|\Delta x_i|}{\max\{|x_i|, \text{typ}x_i\}} \le \epsilon_{\text{step}} \end{equation*} --> -- Practical use of optimization algorithms often involves -- * trying different starting strategies -- * adjusting scaling after preliminary runs -- * other reparameterizations to improve conditioning of the problem --- layout: true ## Nelder-Mead Simplex Method --- This is _not_ related to the simplex method for linear programming! -- * The method uses only function values and is fairly robust but can be quite slow and need repeated restarts. -- It can work reasonably even if the objective function is not differentiable. -- * The method uses function values at a set of `\(n+1\)` _affinely independent_ points in `\(n\)` dimensions, which form a simplex. -- * Affine independence means that the points `\(x_1, \dots, x_{n+1}\)` are such that `\(x_1 - x_{n+1}, \dots, x_n - x_{n+1}\)` are linearly independent. -- * The method goes through a series of reflection, expansion, contraction, and reduction steps to transform the simplex until the function values do not change much or the simplex becomes very small. --- One version of the algorithm, adapted from [Wikipedia](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method): -- .pull-left.small[ 1) Order the vertices according to their function values, `$$f(x_1) \le f(x_2) \le \dots \le f(x_{n+1})$$` {{content}} ] -- 2) Calculate `\(x_0 = \frac{1}{n}\sum_{i=1}^n x_i\)`, the center of the face opposite the worst point `\(x_{n+1}\)`. {{content}} -- 3) _Reflection_: compute the reflected point `$$x_r = x_0 + \alpha (x_0 - x_{n+1})$$` for some `\(\alpha \ge 1\)`, e.g. `\(\alpha = 1\)`. {{content}} -- If `\(f(x_1) \le f(x_r) < f(x_n)\)`, i.e. `\(x_r\)` is better than the second worst point `\(x_n\)` but not better than the best point `\(x_1\)`, then replace the worst point `\(x_{n+1}\)` by `\(x_r\)` and go to Step 1. {{content}} -- 4) _Expansion:_ If `\(f(x_r) < f(x_1)\)`, i.e. `\(x_r\)` is the best point so far, then compute the expanded point `$$x_e = x_0 + \gamma (x_0 - x_{n+1})$$` for some `\(\gamma > \alpha\)`, e.g. `\(\gamma = 2\)`. If `\(f(x_e) < f(x_r)\)` then replace `\(x_{n+1}\)` by `\(x_e\)` and go to Step 1. Otherwise replace `\(x_{n+1}\)` with `\(x_r\)` and go to Step 1. -- .pull-right.small[ 5) _Contraction_: If we reach this step then we know that `\(f(x_r) \ge f(x_n)\)`. Compute the contracted point `$$x_c = x_0 + \rho (x_0 - x_{n+1})$$` for some `\(\rho < 0\)`, e.g. `\(\rho = -1/2\)`. If `\(f(x_c) < f(x_{n+1})\)` replace `\(x_{n+1}\)` by `\(x_c\)` and go to Step 1. Otherwise go to Step 6. {{content}} ] -- 6) _Reduction_: For `\(i = 2, \dots, n+1\)` set `$$x_i = x_1 + \sigma (x_i - x_1)$$` for some `\(\sigma < 1\)`, e.g. `\(\sigma = 1/2\)`. {{content}} -- The [Wikipedia](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method) page shows some examples as animations. --- layout: true ## Simulated Annealing --- Simulated annealing is motivated by an analogy to slowly cooling metals in order to reach the minimum energy state. -- It is a _stochastic global optimization_ method. -- It can be very slow but can be effective for difficult problems with many local minima. -- For any value `\(T > 0\)` the function `\begin{equation*} f_T(x) = e^{f(x)/T} \end{equation*}` has minima at the same locations as `\(f(x)\)`. -- Increasing the _temperature parameter_ `\(T\)` flattens `\(f_T\)`; decreasing `\(T\)` sharpens the local minima. --- Suppose we have a current estimate of the minimizer, `\(x^{(k)}\)`, and a mechanism for randomly generating a _proposal_ `\(y\)` for the next estimate. -- A step in the simulated annealing algorithm given a current estimate `\(x^{(k)}\)`: -- .smaller[ * Generate a proposal `\(y\)` for the next estimate. {{content}} ] -- * If `\(f(y) \le f(x^{(k)})\)` then _accept_ `\(y\)` and set `\(x^{(k+1)} = y\)`. {{content}} -- * If `\(f(y) > f(x^{(k)})\)` then with probability `\(f_T(x^{(k)}) / f_T(y)\)` _accept_ `\(y\)` and set `\(x^{(k+1)} = y\)`. {{content}} -- * Otherwise, _reject_ `\(y\)` and set `\(x^{(k+1)} = x^{(k)}\)`. {{content}} -- * Adjust the temperature according to a _cooling schedule_ and repeat. --- Occasionally accepting steps that increase the objective function allows escape from local minima. -- A common way to generate `\(y\)` is as a multivariate normal vector centered at `\(x^{(k)}\)`. -- A common choice of cooling schedule is of the form `\(T = 1 / \log(k)\)`. -- The standard deviations of the normal proposals may be tied to the current temperature setting. -- Under certain conditions this approach can be shown to converge to a global minimizer with probability one. -- Using other forms of proposals this approach can also be used for discrete optimization problems.
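---

A bare-bones version of this scheme is easy to write down. The sketch below assumes normal proposals with standard deviation tied to the temperature and the `\(T = 1/\log(k)\)` cooling schedule; none of these choices are tuned for any particular problem.

```r
## Minimal simulated annealing sketch for minimizing f starting from x0.
simAnneal <- function(f, x0, iters = 10000, scale = 1) {
    x <- x0
    fx <- f(x)
    xbest <- x
    fbest <- fx
    for (k in 2 : iters) {
        Temp <- 1 / log(k)                          ## cooling schedule
        y <- x + rnorm(length(x), 0, scale * sqrt(Temp))
        fy <- f(y)
        ## accept a worse point with probability exp((f(x) - f(y)) / T)
        if (fy <= fx || runif(1) < exp((fx - fy) / Temp)) {
            x <- y
            fx <- fy
        }
        if (fx < fbest) {
            xbest <- x
            fbest <- fx
        }
    }
    list(par = xbest, value = fbest)
}
```

The `optim` function's `method = "SANN"` option provides a more carefully engineered version of the same idea.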
--- layout: true ## EM and MCEM Algorithms --- Suppose we have a problem where _complete data_ `\(X,Z\)` has likelihood `\begin{equation*} L^c(\theta|x,z) = f(x,z|\theta) \end{equation*}` -- but we only observe _incomplete data_ `\(X\)` with likelihood `\begin{equation*} L(\theta|x) = g(x|\theta) = \int f(x,z|\theta) dz. \end{equation*}` -- Often the complete data likelihood is much simpler than the incomplete data one. -- The conditional density of the unobserved data given the observed data is `\begin{equation*} k(z|x, \theta) = \frac{f(x,z|\theta)}{g(x|\theta)}. \end{equation*}` -- Define, taking expectations with respect to `\(k(z|x,\phi)\)`, `\begin{align*} Q(\theta|\phi,x) &= E_\phi[\log f(x,z|\theta)|x]\\ H(\theta|\phi,x) &= E_\phi[\log k(z|x, \theta)|x]. \end{align*}` --- The EM algorithm consists of starting with an initial estimate `\(\widehat{\theta}^{(0)}\)` and repeating two steps until convergence: -- * _E(xpectation)-Step:_ Construct `\(Q(\theta|\widehat{\theta}^{(i)})\)` -- * _M(aximization)-Step:_ Compute `\(\widehat{\theta}^{(i+1)}=\text{argmax}_\theta\; Q(\theta|\widehat{\theta}^{(i)})\)` -- This algorithm was introduced by Dempster, Laird and Rubin (1977). -- It unified many special case algorithms in the literature. --- The EM algorithm increases the log likelihood at every step; this helps to ensure a very stable algorithm. -- The EM algorithm is typically linearly convergent. -- Acceleration methods may be useful. Givens and Hoeting describe some; others are available in the literature. -- If `\(Q(\theta|\phi,x)\)` cannot be constructed in closed form, it can often be computed by Monte Carlo methods; this is the MCEM algorithm of Wei and Tanner (1990). -- In many cases MCMC methods will have to be used in the MCEM algorithm. --- ### Example: Normal Mixture Models Suppose `\(X_1, \dots, X_n\)` are independent and identically distributed with density `\begin{equation*} f(x|\theta) = \sum_{j = 1}^M p_j f(x|\mu_j, \sigma_j^2) \end{equation*}` with `\(p_1, \dots, p_M \ge 0\)`, `\(\sum p_j = 1\)`, and `\(f(x|\mu,\sigma^2)\)` is a `\(\text{Normal}(\mu, \sigma^2)\)` density.
-- `\(M\)` is assumed known. -- We can think of `\(X_i\)` as generated in two stages: -- * First a category `\(J\)` is selected from `\(\{1, \dots, M\}\)` with probabilities `\(p_1, \dots, p_M\)`. -- * Then, given `\(J = j\)`, a normal random variable is generated from `\(f(x|\mu_j,\sigma_j^2)\)`. --- We can represent the unobserved category using indicator variables `\begin{equation*} Z_{ij} = \begin{cases} 1 & J_i = j\\ 0 & \text{otherwise.} \end{cases} \end{equation*}` <!-- `\begin{equation*} Z_{ij} = \begin{cases} 1 & \text{if `\(J_i = j\)`}\\ 0 & \text{otherwise.} \end{cases} \end{equation*}` --> -- The complete data log likelihood is `\begin{equation*} \log L(\theta | x, z) = \sum_{i=1}^n\sum_{j=1}^M z_{ij}\left[\log p_j - \log \sigma_j - \frac{1}{2}\left(\frac{x_i - \mu_j}{\sigma_j}\right)^2\right]. \end{equation*}` --- The Expectation step produces `\begin{equation*} Q(\theta|\widehat{\theta}^{(k)}) = \sum_{i=1}^n\sum_{j=1}^M \widehat{p}_{ij}^{(k)}\left[\log p_j - \log \sigma_j - \frac{1}{2}\left(\frac{x_i - \mu_j}{\sigma_j}\right)^2\right] \end{equation*}` -- with `\begin{equation*} \widehat{p}_{ij}^{(k)} = E[Z_{ij}|X = x, \theta^{(k)}] = \frac{\widehat{p}_j^{(k)} f\left(x_i\left|\widehat{\mu}_j^{(k)}, \widehat{\sigma}_j^{(k)2}\right.\right)} {\sum_{\ell = 1}^M \widehat{p}_\ell^{(k)} f\left(x_i\left|\widehat{\mu}_\ell^{(k)}, \widehat{\sigma}_\ell^{(k)2}\right.\right)}. \end{equation*}` --- The Maximization step then produces `\begin{align*} \widehat{\mu}_j^{(k+1)} &= \frac{\sum_{i=1}^n \widehat{p}_{ij}^{(k)} x_i} {\sum_{i=1}^n \widehat{p}_{ij}^{(k)}}\\ \widehat{\sigma}_j^{(k+1)2} &= \frac{\sum_{i=1}^n \widehat{p}_{ij}^{(k)} (x_i - \widehat{\mu}_j^{(k+1)})^2} {\sum_{i=1}^n \widehat{p}_{ij}^{(k)}}\\ \widehat{p}_j^{(k+1)} &= \frac{1}{n} \sum_{i=1}^n \widehat{p}_{ij}^{(k)} \end{align*}` --- A simple R function to implement a single EM step might be written as .small-code[ ```r EMmix1 <- function(x, theta) { mu <- theta$mu sigma <- theta$sigma p <- theta$p M <- length(mu) ## E step Ez <- outer(x, 1 : M, function(x, i) p[i] * dnorm(x, mu[i], sigma[i])) Ez <- sweep(Ez, 1, rowSums(Ez), "/") colSums.Ez <- colSums(Ez) ## M step xp <- sweep(Ez, 1, x, "*") mu.new <- colSums(xp) / colSums.Ez sqRes <- outer(x, mu.new, function(x, m) (x - m)^2) sigma.new <- sqrt(colSums(Ez * sqRes) / colSums.Ez) p.new <- colSums.Ez / sum(colSums.Ez) ## pack up result list(mu = mu.new, sigma = sigma.new, p = p.new) } ``` ] <!-- $ make Emacs happy --> --- Some notes: -- * Reasonable starting values are important. -- * We want a local maximum near a good starting value, not a global maximum. -- * Code to examine the log likelihood for a simplified example is [available](../examples/mixll.R) -- Some papers: -- > Tobias Ryden (2008). EM versus Markov chain Monte Carlo for > estimation of hidden Markov models: a computational perspective, > _Bayesian Analysis_ 3 (4), 659--688. > A. Berlinet and C. Roland (2009). Parabolic acceleration of the EM > algorithm. _Statistics and Computing_ 19 (1), 35--48. > Chen, Lin S., Prentice, Ross L., and Wang, Pei (2014). A penalized > EM algorithm incorporating missing data mechanism for Gaussian > parameter estimation, _Biometrics_ 70 (2), 312--322. <!-- %% **** data example? --> --- ### Theoretical Properties of the EM Algorithm The conditional density of the unobserved data given the observed data is `\begin{equation*} k(z|x, \theta) = \frac{f(x,z|\theta)}{g(x|\theta)}. 
\end{equation*}` -- Define, taking expectations with respect to `\(k(z|x,\phi)\)`, `\begin{align*} Q(\theta|\phi,x) &= E_\phi[\log f(x,z|\theta)|x]\\ H(\theta|\phi,x) &= E_\phi[\log k(z|x, \theta)|x]. \end{align*}` --- For any `\(\phi\)` `\begin{equation*} \log L(\theta|x) = Q(\theta|\phi,x) - H(\theta|\phi,x) \end{equation*}` -- since for any `\(z\)` `\begin{align*} \log L(\theta|x) &= \log g(x|\theta)\\ &= \log f(x,z|\theta) - \log f(x, z|\theta) + \log g(x|\theta)\\ &= \log f(x,z|\theta) - \log \frac{f(x,z|\theta)}{g(x|\theta)}\\ &= \log f(x, z|\theta) - \log k(z|x,\theta). \end{align*}` -- and therefore, taking expectations with respect to `\(k(z|x,\phi)\)`, `\begin{align*} \log L(\theta|x) &= E_\phi[\log f(x, z|\theta)] - E_\phi[\log k(z|x,\theta)]\\ &= Q(\theta|\phi,x) - H(\theta|\phi,x). \end{align*}` --- Furthermore, for all `\(\theta\)` `\begin{equation*} H(\theta|\phi) \le H(\phi|\phi) \end{equation*}` -- since, by Jensen's inequality, `\begin{align*} H(\theta|\phi) - H(\phi|\phi) &= E_\phi[\log k(z|x,\theta)] - E_\phi[\log k(z|x,\phi)]\\ &= E_\phi\left[\log\frac{k(z|x,\theta)}{k(z|x,\phi)}\right]\\ &\le \log E_\phi\left[\frac{k(z|x,\theta)}{k(z|x,\phi)}\right]\\ &= \log \int \frac{k(z|x,\theta)}{k(z|x,\phi)} k(z|x,\phi) dz\\ &= \log \int k(z|x,\theta) dz = 0. \end{align*}` --- This implies that for any `\(\theta\)` and `\(\phi\)` `\begin{equation*} \log L(\theta|x) \ge Q(\theta|\phi,x) - H(\phi|\phi,x) \end{equation*}` with equality when `\(\theta = \phi\)`. -- Therefore if `\(\widehat{\theta}\)` maximizes `\(L(\theta|x)\)` then `\(\widehat{\theta}\)` maximizes `\(Q(\theta|\widehat{\theta},x)\)` with respect to `\(\theta\)`: `\begin{align*} Q(\theta|\widehat{\theta},x) - H(\widehat{\theta}|\widehat{\theta},x) &\le \log L(\theta|x)\\ &\le \log L(\widehat{\theta}|x) = Q(\widehat{\theta}|\widehat{\theta},x) - H(\widehat{\theta}|\widehat{\theta},x). \end{align*}` -- If `\(\widehat{\theta}(\phi)\)` maximizes `\(Q(\theta|\phi,x)\)` for a given `\(\phi\)` then `\(\log L(\phi|x) \le \log L(\widehat{\theta}(\phi)| x)\)`: `\begin{align*} \log L(\phi|x) &= Q(\phi|\phi,x) - H(\phi|\phi,x)\\ &\le Q(\widehat{\theta}(\phi)|\phi,x) - H(\phi| \phi,x)\\ & \le Q(\widehat{\theta}(\phi)|\phi,x) - H(\widehat{\theta}(\phi)| \phi,x)\\ & = \log L(\widehat{\theta}(\phi)| x). \end{align*}` --- In the following figures the black curve corresponds to `\(\log L(\theta|x)\)` and the red curve to the _surrogate function_ `\(Q(\theta|\phi, x) - H(\phi|\phi, x)\)`. <img src="optim_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- layout: true ## MM Algorithms --- The EM algorithm can be viewed as a special case of an MM algorithm. -- MM stands for -- * _Minimization and Majorization_, or * _Maximization and Minorization_. -- Suppose the objective is to _maximize_ a function `\(f(\theta)\)`. -- The assumption is that a _surrogate function_ `\(g(\theta | \phi)\)` is available such that `\begin{equation*} f(\theta) \ge g(\theta|\phi) \end{equation*}` for all `\(\theta, \phi\)`, with equality when `\(\theta = \phi\)`. -- The function `\(g\)` is said to _minorize_ the function `\(f\)`. --- The MM algorithm starts with an initial guess `\(\widehat{\theta}^{(0)}\)` and then computes `\begin{equation*} \widehat{\theta}^{(i + 1)} = \text{argmax}_\theta g(\theta|\widehat{\theta}^{(i)}) \end{equation*}` -- As in the EM algorithm, -- * if `\(\widehat{\theta}\)` maximizes `\(f(\theta)\)`, then `\(\widehat{\theta}\)` maximizes `\(g(\theta|\widehat{\theta})\)`; -- * if `\(\widehat{\theta}(\phi)\)` maximizes `\(g(\theta|\phi)\)`, then $$ f(\phi) \le f(\widehat{\theta}(\phi)).$$ -- The objective function values will increase, and under reasonable conditions the algorithm will converge to a local maximizer. -- Full maximization of `\(g\)` is not needed as long as sufficient progress is made (single Newton steps are often sufficient). --- The art in designing an MM algorithm is finding a surrogate minorizing function `\(g\)` that -- * produces an efficient algorithm -- * is easy to maximize -- In `\(p\)`-dimensional problems it is often possible to choose minorizing functions of the form `\begin{equation*} g(\theta|\phi) = \sum_{j=1}^p g_j(\theta_j|\phi) \end{equation*}` so that `\(g\)` can be maximized by separate one-dimensional maximizations of the `\(g_j\)`. --- ### Example: Bradley Terry Model In the Bradley Terry competition model each team has a strength `\(\theta_i\)` and $$ P(\text{$i$ beats `\(j\)`}) = \frac{\theta_i}{\theta_i + \theta_j} $$ with `\(\theta_1 = 1\)` for identifiability. -- Assuming independent games in which `\(y_{ij}\)` is the number of times `\(i\)` beats `\(j\)` the log likelihood is `\begin{equation*} \log L(\theta) = \sum_{i,j} y_{ij} \left(\log(\theta_i) - \log(\theta_i+\theta_j)\right). \end{equation*}` --- Since the logarithm is concave, for any `\(x\)` and `\(y\)` `\begin{equation*} -\log y \ge -\log x - \frac{y-x}{x} \end{equation*}` -- and therefore for any `\(\theta\)` and `\(\phi\)` `\begin{align*} \log L(\theta) &\ge \sum_{i,j} y_{ij} \left(\log(\theta_i) - \log(\phi_i+\phi_j) - \frac{\theta_i+\theta_j-\phi_i-\phi_j}{\phi_i+\phi_j}\right)\\ &= g(\theta|\phi)\\ &= \sum_i g_i(\theta_i|\phi) \end{align*}` -- with `\begin{equation*} g_i(\theta_i|\phi) = \sum_j y_{ij}\log\theta_i - \sum_j(y_{ij}+y_{ji})\frac{\theta_i-\phi_i}{\phi_i+\phi_j}. \end{equation*}` --- The parameters are separated in `\(g\)`, and the one-dimensional maximizers can be easily computed as `\begin{equation*} \widehat{\theta}_i(\phi) = \frac{\sum_{j} y_{ij}}{\sum_{j}(y_{ij}+y_{ji})/(\phi_i+\phi_j)}. \end{equation*}` -- Some References on MM algorithms: > Hunter DR and Lange K (2004), [A Tutorial on MM > Algorithms](https://personal.psu.edu/drh20/papers/mmtutorial.pdf). > _The American Statistician_, 58: 30-37. > Lange, K (2004). _Optimization_, Springer-Verlag, New York.
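---

A single MM update for this model takes only a few lines of R. In this sketch `Y` is a hypothetical `\(M \times M\)` matrix of win counts, with `Y[i, j]` the number of times team `\(i\)` beats team `\(j\)` and a zero diagonal; the strengths are rescaled so the first is one.

```r
## One MM update for the Bradley-Terry strengths.
BTupdate <- function(theta, Y) {
    S <- outer(theta, theta, "+")       ## S[i, j] = theta_i + theta_j
    num <- rowSums(Y)                   ## sum_j y_ij
    den <- rowSums((Y + t(Y)) / S)      ## sum_j (y_ij + y_ji) / (theta_i + theta_j)
    new <- num / den
    new / new[1]                        ## impose theta_1 = 1
}
## Iterate theta <- BTupdate(theta, Y) until the change is negligible.
```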
--- layout: true ## Gradient Descent Methods --- Gradient descent, or steepest descent, methods have one advantage: only the gradient needs to be computed. -- A simple gradient descent approach uses $$ x_{n+1} = x_n - \alpha \nabla f(x_n). $$ -- For a quadratic function `\(f(x) = \frac{1}{2} x^T A x - x^T b\)` with strictly positive definite second derivative matrix `\(A\)` this algorithm converges linearly if $$ \alpha < \frac{2}{\rho(A)} $$ where `\(\rho(A)\)` is the spectral radius of `\(A\)`, i.e. the largest eigenvalue. --- The optimal choice of `\(\alpha\)` is $$ \alpha = \frac{2}{\rho(A)} \times \frac{\kappa(A)}{\kappa(A) + 1} $$ where `\(\kappa(A)\)` is the condition number of `\(A\)`. -- The corresponding rate of convergence is $$ \frac{\kappa(A) - 1}{\kappa(A) + 1}. $$ -- This provides some guidance for use in the non-quadratic case: -- * `\(\alpha\)` needs to be small enough but not too small; -- * making sure the problem is well-conditioned is very important.
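---

A small numerical check of the step size condition, using an arbitrary positive definite matrix `\(A\)`:

```r
## Fixed-step gradient descent on f(x) = x^T A x / 2 - x^T b.
A <- matrix(c(2, 1, 1, 3), 2, 2)
b <- c(1, 1)
alpha <- 1.9 / max(eigen(A)$values)     ## just below 2 / rho(A)
x <- c(0, 0)
for (k in 1 : 500)
    x <- x - alpha * drop(A %*% x - b)  ## the gradient is A x - b
x
solve(A, b)                             ## exact minimizer for comparison
```

With `alpha` larger than `2 / max(eigen(A)$values)` the iterates diverge; with a much smaller `alpha` they still converge, but slowly.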
--- The step size `\(\alpha\)` can be chosen adaptively by a line search or by _backtracking_, e.g. starting with `\(\alpha = \alpha_0\)` and reducing `\(\alpha\)` by a factor of 1/2 until $$ f(x - \alpha \nabla f(x)) \le f(x) - \frac{1}{2}\alpha\|\nabla f(x)\|^2 $$ -- Other adaptive strategies try to avoid evaluating `\(f(x)\)`. -- _Momentum_ methods try to reduce zig-zagging by using a combination of the previous step direction and the current gradient. --- _Stochastic gradient descent_ methods try to reduce the cost of each step in cases where $$ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) $$ by at each step choosing a sample `\(i_1, \dots, i_m\)` and computing the gradient of `$$\frac{1}{m} \sum_{k=1}^m f_{i_k}(x).$$` -- Choosing `\(m = 1\)` sometimes works. --- Variants of stochastic gradient descent are used widely in machine learning algorithms, in particular for fitting deep neural networks. -- In machine learning the step size `\(\alpha\)` is called the _learning rate_. -- Stochastic gradient descent will converge almost surely under mild conditions if used with a _learning rate decay_, e.g. with a learning rate `\(\alpha_k\)` at iteration `\(k\)` of the form $$ \alpha_k = \alpha_0 \frac{1}{1 + \delta k}. $$ --- layout: true ## Constrained Optimization --- Equality constraints arise, for example, in -- * restricted maximum likelihood estimation needed for the null hypothesis in likelihood ratio tests -- * estimating probabilities that sum to one -- Inequality constraints arise in estimating -- * probabilities or rates that are constrained to be non-negative -- * monotone functions or monotone sets of parameters -- * convex functions -- Box constraints are the simplest and the most common. -- Linear inequality constraints also occur frequently. --- Inequality constraints are often handled by converting a constrained problem to an unconstrained one: -- * Minimizing `\(f(x)\)` subject to `\(g_i(x) \ge 0\)` for `\(i = 1, \dots, M\)` is equivalent to minimizing `$$F(x) = \begin{cases} f(x) & g_i(x) \ge 0: i = 1, \dots, M\\ \infty & \text{otherwise.} \end{cases}$$` <!-- \begin{equation*} F(x) = \begin{cases} f(x) & \text{if `\(g_i(x) \ge 0\)` for `\(i = 1, \dots, M\)` }\\ \infty & \text{otherwise.} \end{cases} \end{equation*} --> -- * This objective function is not continuous. -- Sometimes a complicated unconstrained problem can be changed to a simpler constrained one. --- ### Example: `\(L_1\)` Regression The `\(L_1\)` regression estimator is defined by the minimization problem `\begin{equation*} \widehat{b} = \underset{b}{\text{argmin}} \sum |y_i - x_i^T b| \end{equation*}` -- This unconstrained optimization problem with a non-differentiable objective function can be reformulated as a constrained linear programming problem: --- For a vector `\(z = (z_1, \dots, z_n)^T\)` let `\([z]_+\)` denote the vector of positive parts of `\(z\)`, i.e.
`\begin{equation*} ([z]_+)_i = \max(z_i, 0) \end{equation*}` and let `\begin{align*} b_+ &= [b]_+\\ b_- &= [-b]_+\\ u &= [y - Xb]_+\\ v &= [Xb - y]_+ \end{align*}` -- Then the `\(L_1\)` estimator satisfies `\begin{equation*} \widehat{b} = \underset{b}{\text{argmin}} \; \mathbf{1}^T u + \mathbf{1}^T v \end{equation*}` subject to the constraints `\begin{align*} &y = Xb_+ - Xb_- + u - v\\ &u \ge 0, v \ge 0, b_+ \ge 0, b_- \ge 0 \end{align*}` --- This linear programming problem can be solved by the standard simplex algorithm. -- A specialized simplex algorithm taking advantage of structure in the constraint matrix allows larger data sets to be handled (Barrodale and Roberts, 1974). -- Interior point methods can also be used and are effective for large `\(n\)` and moderate `\(p\)`. -- An alternative approach called _smoothing_ is based on approximating the absolute value function by a twice continuously differentiable function and can be effective for moderate `\(n\)` and large `\(p\)`. -- Chen and Wei (2005) provide an overview in the context of the more general _quantile regression_ problem. --- Quantile regression computes estimates as `\begin{equation*} \widehat{b} = \underset{b}{\text{argmin}} \sum \rho_\tau(y_i - x_i^Tb) \end{equation*}` with `\begin{equation*} \rho_\tau(y) = y\left(\tau - 1_{\{y < 0\}}\right) = \tau [y]_+ + (1 - \tau) [-y]_+. \end{equation*}` for `\(0 < \tau < 1\)`. For `\(\tau = \frac{1}{2}\)` this is `\(\rho_{\frac{1}{2}}(y) = \frac{1}{2}|y|\)`. <img src="optim_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- Reformulation as a constrained optimization problem is also sometimes used for LASSO and SVM computations. -- ### Some Approaches and Algorithms If the optimum is in the interior of the feasible region then using standard methods on `\(F\)` may work, especially if they use backtracking when they encounter infinite values. -- For interior optima, transformations to an unbounded feasible region may help. -- Standard algorithms can often be modified to handle box constraints. --- Other constraints are often handled using _barrier methods_ that approximate `\(F\)` by a smooth function `\begin{equation*} F_\gamma(x) = \begin{cases} f(x) - \gamma \sum_{i = 1}^M \log g_i(x) & g(x) \ge 0\\ \infty & \text{otherwise} \end{cases} \end{equation*}` <!-- `\begin{equation*} F_\gamma(x) = \begin{cases} f(x) - \gamma \sum_{i = 1}^M \log g_i(x) & \text{if `\(g(x) \ge 0\)`}\\ \infty & \text{otherwise} \end{cases} \end{equation*}` --> for some `\(\gamma > 0\)`. -- The algorithm starts with a feasible solution, minimizes `\(F_\gamma\)` for a given `\(\gamma\)`, reduces `\(\gamma\)`, and repeats until convergence. -- Barrier methods are also called path-following algorithms. -- Path-following algorithms are useful in other settings where a harder problem can be approached through a sequence of easier problems. -- Path-following algorithms usually start the search for the next `\(\gamma\)` at the solution for the previous `\(\gamma\)`; this is sometimes called a _warm start_ approach. -- The intermediate results for positive `\(\gamma\)` are in the interior of the feasible region, hence this is an _interior point method_. -- More sophisticated interior point methods are available in particular for convex optimization problems. --- ### An Adaptive Barrier Algorithm Consider the problem of minimizing `\(f(x)\)` subject to the linear inequality constraints `\begin{equation*} g_i(x) = b_i^T x - c_i \ge 0 \end{equation*}` for `\(i = 1, \dots, M\)`.
-- For an interior point `\(x^{(k)}\)` of the feasible region define the surrogate function `\begin{equation*} R(x | x^{(k)}) = f(x) - \mu \sum_{i=1}^M \left[g_i(x^{(k)}) \log(g_i(x)) - b_i^T x\right] \end{equation*}` for some `\(\mu > 0\)`. --- As a function of `\(x\)` the barrier function `\begin{equation*} f(x) - R(x | x^{(k)}) = \mu \sum_{i=1}^M \left[g_i(x^{(k)}) \log(g_i(x)) - b_i^T x\right] \end{equation*}` is concave and maximized at `\(x = x^{(k)}\)`. -- Thus if `\begin{equation*} x^{(k+1)} = \underset{x}{\text{argmin}} \; R(x | x^{(k)}) \end{equation*}` -- then `\begin{align*} f(x^{(k+1)}) & = R(x^{(k+1)}|x^{(k)}) + f(x^{(k+1)}) - R(x^{(k+1)}|x^{(k)})\\ &\le R(x^{(k)}|x^{(k)}) + f(x^{(k+1)}) - R(x^{(k+1)}|x^{(k)})\\ &\le R(x^{(k)}|x^{(k)}) + f(x^{(k)}) - R(x^{(k)}|x^{(k)})\\ & = f(x^{(k)}) \end{align*}` -- The values of the objective function are non-increasing. --- This has strong similarities to an EM algorithm argument. -- The coefficient of a logarithmic barrier term decreases to zero if `\(x^{(k)}\)` approaches the boundary. -- If `\(f\)` is convex and has a unique minimizer `\(x^*\)` then `\(x^{(k)}\)` converges to `\(x^*\)`. -- This algorithm was introduced in K. Lange (1994). --- layout: true ## Optimization in R --- Several optimization functions are available in the standard R distribution: -- * `optim` implements a range of methods: * Nelder-Mead simplex * Quasi-Newton with the BFGS update * A modified Quasi-Newton BFGS algorithm that allows box constraints * A conjugate gradient algorithm * A version of simulated annealing. -- * Some notes: * A variety of iteration control options are available. * Analytical derivatives can be supplied to methods that use them. * Hessian matrices at the optimum can be requested. -- * `nlminb` provides an interface to the FORTRAN [PORT](https://web.archive.org/web/20070508140716/http://netlib.bell-labs.com/cm/cs/cstr/153.pdf) library developed at Bell Labs. Box constraints are supported. -- * `nlm` implements a modified Newton algorithm as described in Dennis and Schnabel (1983). -- * `constrOptim` implements the adaptive barrier method of Lange (1994) using `optim` for the optimizations within iterations. --- The [Optimization Task View](https://cran.r-project.org/view=Optimization) on CRAN describes a number of other methods available in contributed packages. -- Simple examples illustrating `optim` and `constrOptim` are [available](../examples/optimpath.R).
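---

A minimal sketch (not the linked course example) of an `optim` call using the BFGS method with an analytic gradient, applied to the standard Rosenbrock test function:

```r
## Minimize the Rosenbrock function with optim() and BFGS.
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
gr <- function(x)
    c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
       200 * (x[2] - x[1]^2))
fit <- optim(c(-1.2, 1), fr, gr, method = "BFGS", hessian = TRUE)
fit$par                     ## should be close to c(1, 1)
fit$convergence             ## 0 indicates successful convergence
eigen(fit$hessian)$values   ## positive at a local minimum
```

For a one-dimensional problem `optimize` plays the corresponding role, e.g. `optimize(function(t) (t - 2)^2, c(0, 5))`.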
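---

A similarly minimal sketch for `constrOptim`, with a hypothetical objective and linear inequality constraints specified through `ui %*% x - ci >= 0`:

```r
## Minimize (x1 - 2)^2 + (x2 - 1)^2 subject to
## x1 >= 0, x2 >= 0, and x1 + x2 <= 1.
fn <- function(x) (x[1] - 2)^2 + (x[2] - 1)^2
gr <- function(x) c(2 * (x[1] - 2), 2 * (x[2] - 1))
ui <- rbind(c( 1,  0),    ## x1 >= 0
            c( 0,  1),    ## x2 >= 0
            c(-1, -1))    ## x1 + x2 <= 1
ci <- c(0, 0, -1)
fit <- constrOptim(c(0.25, 0.25), fn, gr, ui = ui, ci = ci)
fit$par   ## the solution lies on the boundary x1 + x2 = 1
```

The starting value must be strictly inside the feasible region; `constrOptim` then follows the adaptive barrier strategy described earlier, calling `optim` for the optimizations within each iteration.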