class: center, middle, title-slide .title[ # Optimization ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-19 ] --- layout: true <link rel="stylesheet" href="stat7400.css" type="text/css" /> `$$\renewcommand{\Real}{\mathbb{R}}$$` <!-- $ make emacs sane --> ## Preliminaries --- Many statistical problems involve minimizing (or maximizing) a function `\(f : \mathcal{X} \to \Real\)`, i.e. finding `\begin{equation*} x^* = \underset{x \in \mathcal{X}}{\text{argmin}} f(x) \end{equation*}` -- Maximizing `\(f(x)\)` is equivalent to minimizing `\(-f(x)\)`. -- The domain `\(\mathcal{X}\)` can be * a finite set --- combinatorial optimization; -- * a continuous, usually connected, subset of `\(\Real^n\)`. --- The function can be * continuous; -- * twice continuously differentiable; -- * unrestricted. -- The function can * have a single local minimum that is also a global minimum; -- * have one or more local minima but no global minimum; -- * have many local minima and a global minimum. --- Algorithms can be designed to * find a local minimum from some specified starting point; -- * find the global minimum. -- The objective can be * to find a global minimum; -- * to find a local minimum; -- * to improve on an initial guess. --- Very good software is available for many problems * general purpose optimizers will often do well; -- * occasionally it is useful to write your own, usually to exploit special structure; -- * understanding general strategies is useful even when using canned software. -- Optimization problems often require some tuning; proper scaling is often critical. --- layout: true ## One-Dimensional Optimization --- Given an initial point `\(x^{(k)}\)` and a direction `\(d\)` we can define `\begin{equation*} x(t) = x^{(k)} + t d \end{equation*}` -- and set `\begin{equation*} x^{(k+1)} = x^{(k)} + t^* d \end{equation*}` -- where `\begin{equation*} t^* = \text{argmin}_t f(x(t)) \end{equation*}` -- with `\(t \in \Real\)` restricted to satisfy `\(x(t) \in \mathcal{X}\)`. -- This is often called a _line search_. --- Many one-dimensional optimization methods are available. -- Many are based on finding a root of the derivative. -- Newton's method approximates `\(f(x(t))\)` by a quadratic. -- Once the function `\(f\)` has been minimized in the direction `\(d\)` a new direction can be tried. -- It is usually not necessary to find the minimizer `\(t^*\)` exactly: a few steps in an iterative algorithm are often sufficient. -- R provides -- * `optimize` for minimizing a function over an interval -- * `uniroot` for finding the root of a function in an interval --- layout: true ## Choosing Search Directions --- Several methods are available for choosing search directions: * _Cyclic descent, or coordinate descent_: -- If `\(x = (x_1,\dots,x_n)\)` then minimize first with respect to `\(x_1\)`, then `\(x_2\)`, and so on through `\(x_n\)`, and repeat until convergence. -- * This method is also called _non-linear Gauss-Seidel iteration_ or _backfitting_. It is similar in spirit to Gibbs sampling. -- * This may take more iterations than other methods, but often the individual steps are very fast. -- * A recent [paper](https://web.stanford.edu/~hastie/Papers/Sparsenet/jasa_MFH_final.pdf) and the associated `sparsenet` package take advantage of this for fitting sparse linear models. --- * _Steepest descent_, or _gradient descent_: -- Take `\(d = - \nabla f(x^{(k)})\)`. -- * Steepest descent can be slow to converge due to zig-zagging.
-- * It can work well for problems with nearly circular contours. -- * Preconditioning to improve the shape of the contours can help. -- * _Conjugate gradient_: -- Start with the steepest descent direction and then choose directions conjugate to a strictly positive definite matrix `\(A\)`, often a Hessian matrix. -- * This can reduce the zig-zagging. -- * It needs to be restarted at least every `\(n\)` steps. -- * Unless the objective function `\(f\)` is quadratic a new matrix `\(A\)` is typically computed at each restart. --- Most of these methods will be _linearly convergent_: -- Under suitable regularity conditions, and if `\(x^{(0)}\)` is close enough to a local minimizer `\(x^*\)`, then -- `\begin{equation*} \lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|} = a \end{equation*}` with `\(0 < a < 1\)`. -- `\(a\)` is the _rate of convergence_. -- A method for which `\(a = 0\)` is said to converge _super-linearly_. --- layout: true ## Newton's Method --- Newton's method approximates `\(f\)` by the quadratic `\begin{equation*} f^{(k)}(x) = f(x^{(k)}) + \nabla f(x^{(k)}) (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})^T \nabla^2f (x^{(k)}) (x - x^{(k)}) \end{equation*}` -- and computes `\(x^{(k+1)}\)` to minimize the quadratic approximation `\(f^{(k)}\)`: `\begin{equation*} x^{(k+1)} = x^{(k)} - \left(\nabla^2f (x^{(k)})\right)^{-1} \nabla f(x^{(k)}) \end{equation*}` --- Convergence can be very fast: -- If * `\(f\)` is twice continuously differentiable near a local minimizer `\(x^*\)` -- * `\(\nabla^2f (x^*)\)` is strictly positive definite -- * `\(x^{(0)}\)` is sufficiently close to `\(x^*\)` -- then `\begin{equation*} \lim_{k \to \infty} \frac{\|x^{(k+1)}-x^*\|}{\|x^{(k)}-x^*\|^2} = a \end{equation*}` with `\(0 < a < \infty\)`. -- This is called _quadratic convergence_. --- If these conditions fail, then Newton's method may not converge, [even for a convex unimodal function](../examples/newtonbad.Rmd). -- If the new value `\(x^{(k+1)}\)` does not improve `\(f\)` then one can use a line search in the direction of the Newton step. -- Newton's method requires first and second derivatives: -- Numerical derivatives can be used. -- Computing the derivatives can be very expensive if `\(n\)` is large. --- Many implementations -- * modify the second derivative matrix if it is not strictly positive definite; -- * switch to an alternate method, such as steepest descent, if the Newton direction does not produce sufficient improvement. -- Computation of the Newton step is usually done using a Cholesky factorization of the Hessian. -- A modified factorization is often used if the Hessian is not numerically strictly positive definite. -- If -- * `\(\widetilde{\theta}\)` is a consistent estimate of `\(\theta\)`, and -- * `\(\widehat{\theta}\)` is computed as a single Newton step from `\(\widetilde{\theta}\)`, -- then, under mild regularity conditions, `\(\widehat{\theta}\)` will have the same asymptotic normal distribution as the MLE. --- layout: true ## Quasi-Newton Methods --- Quasi-Newton methods compute the next step as `\begin{equation*} x^{(k+1)} = x^{(k)} - B_k^{-1} \nabla f(x^{(k)}) \end{equation*}` where `\(B_k\)` is an approximation to the Hessian matrix `\(\nabla^2 f(x^{(k)})\)`. -- `\(B_{k+1}\)` is usually chosen so that `\begin{equation*} \nabla f(x^{(k+1)}) = \nabla f(x^{(k)}) + B_{k+1} (x^{(k+1)} - x^{(k)}) \end{equation*}` -- For `\(n = 1\)` this leads to the _secant method_; for `\(n > 1\)` this is under-determined. 
-- For `\(n > 1\)` various strategies for low rank updating of `\(B\)` are available; often these are based on successive gradient values. -- The inverse can be updated using the _Sherman-Morrison-Woodbury formula_: If `\(A\)` is invertible then `\begin{equation*} (A + u v^T)^{-1} = A^{-1}-\frac{A^{-1} uv^T A^{-1}}{1 + v^T A^{-1} u} \end{equation*}` as long as the denominator is not zero. --- Commonly used methods include -- * Davidon-Fletcher-Powell (DFP); -- * Broyden-Fletcher-Goldfarb-Shanno (BFGS). -- Methods generally ensure that -- * `\(B_k\)` is symmetric -- * `\(B_k\)` is strictly positive definite -- Convergence of Quasi-Newton methods is generally super-linear but not quadratic. --- layout: true ## Fisher Scoring --- Maximum likelihood estimates can be computed by minimizing the negative log likelihood `\begin{equation*} f(\theta) = - \log L(\theta) \end{equation*}` -- The Hessian matrix `\(\nabla^2 f(\theta^{(k)})\)` is the _observed information matrix_ at `\(\theta^{(k)}\)`. -- Sometimes the _expected information matrix_ `\(I(\theta^{(k)})\)` is easy to compute. -- The Fisher scoring method uses a Newton step with the observed information matrix replaced by the expected information matrix: `\begin{equation*} \theta^{(k+1)} = \theta^{(k)} + I(\theta^{(k)})^{-1} \nabla \log L(\theta^{(k)}) \end{equation*}` --- ### Example: Logistic Regression Suppose `\(y_i \sim \text{Binomial}(m_i, \pi_i)\)` with `\begin{equation*} \pi_i = \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} \end{equation*}` -- The negative log likelihood, up to additive constants, is `\begin{equation*} f(\beta) = - \sum \left(y_i x_i^T\beta - m_i \log(1 + \exp(x_i^T \beta))\right). \end{equation*}` --- The gradient is `\begin{align*} \nabla f(\beta) &= - \sum \left(y_i x_i - m_i \frac{\exp(x_i^T\beta)}{1 + \exp(x_i^T\beta)} x_i\right)\\ &= - \sum \left(y_i x_i - m_i \pi_i x_i\right)\\ &= - \sum (y_i - m_i \pi_i) x_i \end{align*}` -- The Hessian is `\begin{equation*} \nabla^2 f(\beta) = \sum m_i \pi_i (1 - \pi_i)x_i x_i^T \end{equation*}` -- The Hessian is non-stochastic, so Fisher scoring and Newton's method are identical and produce the update `\begin{equation*} \beta^{(k+1)} = \beta^{(k)} + \left(\sum m_i \pi_i^{(k)} (1 - \pi_i^{(k)})x_i x_i^T\right)^{-1} \left(\sum (y_i - m_i \pi_i^{(k)}) x_i\right) \end{equation*}` --- If `\(X\)` is the matrix with rows `\(x_i^T\)`, `\(W^{(k)}\)` is the diagonal matrix with diagonal elements `\begin{equation*} W_{ii}^{(k)} = m_i\pi_i^{(k)} (1 - \pi_i^{(k)}) \end{equation*}` -- and `\(V^{(k)}\)` is the vector with elements `\begin{equation*} V_i^{(k)} = \frac{y_i - m_i \pi_i^{(k)}}{W_{ii}^{(k)}} \end{equation*}` -- then this update can be written as `\begin{equation*} \beta^{(k+1)} = \beta^{(k)} + (X^T W^{(k)} X)^{-1} X^T W^{(k)} V^{(k)} \end{equation*}` --- The change in `\(\beta\)` is the result of fitting the linear model `\(V^{(k)} \sim X\)` by weighted least squares with weights `\(W^{(k)}\)`. -- * This type of algorithm is called an _iteratively reweighted least squares_ (IRWLS) algorithm. -- * Similar results hold for all generalized linear models. -- * In these models Fisher scoring and Newton's method are identical when the _canonical link_ function is chosen which makes the natural parameter linear in `\(\beta\)`.
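---

The IRWLS update above is easy to sketch directly in R. This is a minimal illustration (the inputs `X`, `y`, and `m` are hypothetical, and there is no step-halving or check for separation):

```r
## One IRWLS iteration repeated to convergence for binomial logistic
## regression; X is the n x p model matrix, y the success counts, and
## m the binomial totals.
irwlsLogit <- function(X, y, m, maxit = 25, tol = 1e-8) {
    beta <- rep(0, ncol(X))
    for (k in 1 : maxit) {
        pr <- 1 / (1 + exp(-drop(X %*% beta)))
        W <- m * pr * (1 - pr)                  ## weights W_ii
        V <- (y - m * pr) / W                   ## working responses V_i
        delta <- solve(crossprod(X, W * X), crossprod(X, W * V))
        beta <- beta + drop(delta)
        if (sum(abs(delta)) < tol) break
    }
    beta
}
```

When both converge, the result should agree with `glm(cbind(y, m - y) ~ X - 1, family = binomial)`.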
--- layout: true ## Gauss-Newton Method --- Suppose `\begin{equation*} Y_i \sim N(\eta(x_i, \theta), \sigma^2). \end{equation*}` -- For example, `\(\eta(x, \theta)\)` might be of the form `\begin{equation*} \eta(x, \theta) = \theta_0 + \theta_1 e^{-\theta_2 x}. \end{equation*}` -- Then the MLE minimizes `\begin{equation*} f(\theta) = \sum_{i=1}^n (y_i - \eta(x_i, \theta))^2. \end{equation*}` -- We can approximate `\(\eta\)` near a current guess `\(\theta^{(k)}\)` by the Taylor series expansion `\begin{equation*} \eta_k(x_i, \theta) \approx \eta(x_i, \theta^{(k)}) + \nabla_\theta \eta(x_i, \theta^{(k)}) (\theta - \theta^{(k)}) \end{equation*}` --- This leads to the approximate objective function `\begin{align*} f_k(\theta) &\approx \sum_{i=1}^n (y_i - \eta_k(x_i, \theta))^2\\ &= \sum_{i=1}^n (y_i - \eta(x_i, \theta^{(k)}) - \nabla_\theta \eta(x_i, \theta^{(k)}) (\theta - \theta^{(k)}))^2 \end{align*}` -- This is the sum of squared deviations for a linear regression model, so is minimized by `\begin{equation*} \theta^{(k+1)} = \theta^{(k)} + (J_k^T J_k)^{-1} J_k^T(y - \eta(x, \theta^{(k)})) \end{equation*}` where `\(J_k = \nabla_\theta \eta(x, \theta^{(k)})\)` is the Jacobian matrix of the mean function. --- The Gauss-Newton algorithm works well for problems with small residuals, where it is close to Newton's method. -- Like Newton's method it can suffer from taking too large a step. -- Backtracking or line search can be used to address this. -- An alternative is the Levenberg-Marquardt algorithm that uses `\begin{equation*} \theta^{(k+1)} = \theta^{(k)} + (J_k^T J_k + \lambda I)^{-1} J_k^T(y - \eta(x, \theta^{(k)})) \end{equation*}` for some _damping parameter_ `\(\lambda\)`. -- Various strategies for choosing `\(\lambda\)` are available. -- The R function `nls` uses these approaches. --- Nonlinear models are often _partially linear_. An example: `\begin{equation*} y_i = \alpha + \beta \exp\{- x_i / \theta\}. \end{equation*}` -- For a fixed `\(\theta\)` this is a linear model. So a possible approach is: -- * Fix `\(\theta\)` and estimate `\(\alpha\)` and `\(\beta\)` by linear regression. -- * Fix `\(\alpha\)` and `\(\beta\)` and estimate `\(\theta\)` by a non-linear method. -- * Iterate until convergence.
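---

As a small illustration, a model of the partially linear form above can be fit with `nls` (the data here are simulated and the parameter values are arbitrary):

```r
## Fit y = alpha + beta * exp(-x / theta) + noise with nls().
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + 3 * exp(-x / 2) + rnorm(100, sd = 0.1)
fit <- nls(y ~ alpha + beta * exp(-x / theta),
           start = list(alpha = 1, beta = 1, theta = 1))
coef(fit)  ## should be close to alpha = 2, beta = 3, theta = 2
```

As with the other methods in this section, reasonable starting values matter; a poor choice of `theta` can keep the Gauss-Newton iterations in `nls` from converging.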
--- layout: true ## Termination and Scaling --- Optimization algorithms generally use three criteria for terminating the search: -- * relative change in `\(x\)` -- * relative or absolute change in the function values and/or derivatives -- * number of iterations -- How well these work often depends on problem scaling -- * Implementations often allow specification of scaling factors for parameters. -- * Some also allow scaling of the objective function. --- Dennis and Schnabel (1983) recommend: -- * A relative gradient criterion `$$\max_{1 \le i \le n} \left|\frac{\nabla f(x)_i \max\{|x_i|, \text{typ}x_i\}} {\max\{|f(x)|,\text{typ}f\}}\right| \le \epsilon_{\text{grad}}$$` <!-- \begin{equation*} \max_{1 \le i \le n} \left|\frac{\nabla f(x)_i \max\{|x_i|, \text{typ}x_i\}} {\max\{|f(x)|,\text{typ}f\}}\right| \le \epsilon_{\text{grad}} \end{equation*} --> where `\(\text{typ}x_i\)` and `\(\text{typ}f\)` are typical magnitudes of `\(x_i\)` and `\(f\)`. -- * A relative change in `\(x\)` criterion `$$\max_{1 \le i \le n} \frac{|\Delta x_i|}{\max\{|x_i|, \text{typ}x_i\}} \le \epsilon_{\text{step}}$$` <!-- \begin{equation*} \max_{1 \le i \le n} \frac{|\Delta x_i|}{\max\{|x_i|, \text{typ}x_i\}} \le \epsilon_{\text{step}} \end{equation*} --> -- Practical use of optimization algorithms often involves -- * trying different starting strategies -- * adjusting scaling after preliminary runs -- * other reparameterizations to improve conditioning of the problem --- layout: true ## Nelder-Mead Simplex Method --- This is _not_ related to the simplex method for linear programming! -- * The method uses only function values and is fairly robust but can be quite slow and need repeated restarts. -- It can work reasonably even if the objective function is not differentiable. -- * The method uses function values at a set of `\(n+1\)` _affinely independent_ points in `\(n\)` dimensions, which form a simplex. -- * Affine independence means that the points `\(x_1, \dots, x_{n+1}\)` are such that `\(x_1 - x_{n+1}, \dots, x_n - x_{n+1}\)` are linearly independent. -- * The method goes through a series of reflection, expansion, contraction, and reduction steps to transform the simplex until the function values do not change much or the simplex becomes very small. --- One version of the algorithm, adapted from [Wikipedia](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method): -- .pull-left.small[ 1) Order the vertices according to their function values, `$$f(x_1) \le f(x_2) \le \dots \le f(x_{n+1})$$` {{content}} ] -- 2) Calculate `\(x_0 = \frac{1}{n}\sum_{i=1}^n x_i\)`, the center of the face opposite the worst point `\(x_{n+1}\)`. {{content}} -- 3) _Reflection_: compute the reflected point `$$x_r = x_0 + \alpha (x_0 - x_{n+1})$$` for some `\(\alpha \ge 1\)`, e.g. `\(\alpha = 1\)`. {{content}} -- If `\(f(x_1) \le f(x_r) < f(x_n)\)`, i.e. `\(x_r\)` is better than the second worst point `\(x_n\)` but not better than the best point `\(x_1\)`, then replace the worst point `\(x_{n+1}\)` by `\(x_r\)` and go to Step 1. {{content}} -- 4) _Expansion:_ If `\(f(x_r) < f(x_1)\)`, i.e. `\(x_r\)` is the best point so far, then compute the expanded point `$$x_e = x_0 + \gamma (x_0 - x_{n+1})$$` for some `\(\gamma > \alpha\)`, e.g. `\(\gamma = 2\)`. If `\(f(x_e) < f(x_r)\)` then replace `\(x_{n+1}\)` by `\(x_e\)` and go to Step 1. Otherwise replace `\(x_{n+1}\)` with `\(x_r\)` and go to Step 1. -- .pull-right.small[ 5) _Contraction_: If we reach this step then we know that `\(f(x_r) \ge f(x_n)\)`. Compute the contracted point `$$x_c = x_0 + \rho (x_0 - x_{n+1})$$` for some `\(\rho < 0\)`, e.g. `\(\rho = -1/2\)`. If `\(f(x_c) < f(x_{n+1})\)` replace `\(x_{n+1}\)` by `\(x_c\)` and go to Step 1. Otherwise go to Step 6. {{content}} ] -- 6) _Reduction_: For `\(i = 2, \dots, n+1\)` set `$$x_i = x_1 + \sigma (x_i - x_1)$$` for some `\(\sigma < 1\)`, e.g. `\(\sigma = 1/2\)`. {{content}} -- The [Wikipedia](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method) page shows some examples as animations. --- layout: true ## Simulated Annealing --- Simulated annealing is motivated by an analogy to slowly cooling metals in order to reach the minimum energy state. -- It is a _stochastic global optimization_ method. -- It can be very slow but can be effective for difficult problems with many local minima. -- For any value `\(T > 0\)` the function `\begin{equation*} f_T(x) = e^{f(x)/T} \end{equation*}` has minima at the same locations as `\(f(x)\)`. -- Increasing the _temperature parameter_ `\(T\)` flattens `\(f_T\)`; decreasing `\(T\)` sharpens the local minima. --- Suppose we have a current estimate of the minimizer, `\(x^{(k)}\)`, and a mechanism for randomly generating a _proposal_ `\(y\)` for the next estimate. -- A step in the simulated annealing algorithm given a current estimate `\(x^{(k)}\)`: -- .smaller[ * Generate a proposal `\(y\)` for the next estimate. {{content}} ] -- * If `\(f(y) \le f(x^{(k)})\)` then _accept_ `\(y\)` and set `\(x^{(k+1)} = y\)`. {{content}} -- * If `\(f(y) > f(x^{(k)})\)` then with probability `\(f_T(x^{(k)}) / f_T(y)\)` _accept_ `\(y\)` and set `\(x^{(k+1)} = y\)`. {{content}} -- * Otherwise, _reject_ `\(y\)` and set `\(x^{(k+1)} = x^{(k)}\)`. {{content}} -- * Adjust the temperature according to a _cooling schedule_ and repeat. --- Occasionally accepting steps that increase the objective function allows escape from local minima. -- A common way to generate `\(y\)` is as a multivariate normal vector centered at `\(x^{(k)}\)`. -- A common choice of cooling schedule is of the form `\(T = 1 / \log(k)\)`. -- The standard deviations of the normal proposals may be tied to the current temperature setting. -- Under certain conditions this approach can be shown to converge to a global minimizer with probability one. -- Using other forms of proposals this approach can also be used for discrete optimization problems.
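---

A bare-bones version of this scheme is easy to write down. The sketch below assumes normal proposals with standard deviation tied to the temperature and the `\(T = 1/\log(k)\)` cooling schedule; none of these choices are tuned for any particular problem.

```r
## Minimal simulated annealing sketch for minimizing f starting from x0.
simAnneal <- function(f, x0, iters = 10000, scale = 1) {
    x <- x0
    fx <- f(x)
    xbest <- x
    fbest <- fx
    for (k in 2 : iters) {
        Temp <- 1 / log(k)                          ## cooling schedule
        y <- x + rnorm(length(x), 0, scale * sqrt(Temp))
        fy <- f(y)
        ## accept a worse point with probability exp((f(x) - f(y)) / T)
        if (fy <= fx || runif(1) < exp((fx - fy) / Temp)) {
            x <- y
            fx <- fy
        }
        if (fx < fbest) {
            xbest <- x
            fbest <- fx
        }
    }
    list(par = xbest, value = fbest)
}
```

The `optim` function's `method = "SANN"` option provides a more carefully engineered version of the same idea.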
--- layout: true ## EM and MCEM Algorithms --- Suppose we have a problem where _complete data_ `\(X,Z\)` has likelihood `\begin{equation*} L^c(\theta|x,z) = f(x,z|\theta) \end{equation*}` -- but we only observe _incomplete data_ `\(X\)` with likelihood `\begin{equation*} L(\theta|x) = g(x|\theta) = \int f(x,z|\theta) dz. \end{equation*}` -- Often the complete data likelihood is much simpler than the incomplete data one. -- The conditional density of the unobserved data given the observed data is `\begin{equation*} k(z|x, \theta) = \frac{f(x,z|\theta)}{g(x|\theta)}. \end{equation*}` -- Define, taking expectations with respect to `\(k(z|x,\phi)\)`, `\begin{align*} Q(\theta|\phi,x) &= E_\phi[\log f(x,z|\theta)|x]\\ H(\theta|\phi,x) &= E_\phi[\log k(z|x, \theta)|x]. \end{align*}` --- The EM algorithm consists of starting with an initial estimate `\(\widehat{\theta}^{(0)}\)` and repeating two steps until convergence: -- * _E(xpectation)-Step:_ Construct `\(Q(\theta|\widehat{\theta}^{(i)})\)` -- * _M(aximization)-Step:_ Compute `\(\widehat{\theta}^{(i+1)}=\text{argmax}_\theta\; Q(\theta|\widehat{\theta}^{(i)})\)` -- This algorithm was introduced by Dempster, Laird and Rubin (1977). -- It unified many special case algorithms in the literature. --- The EM algorithm increases the log likelihood at every step; this helps to ensure a very stable algorithm. -- The EM algorithm is typically linearly convergent. -- Acceleration methods may be useful. Givens and Hoeting describe some; others are available in the literature. -- If `\(Q(\theta|\phi,x)\)` cannot be constructed in closed form, it can often be computed by Monte Carlo methods; this is the MCEM algorithm of Wei and Tanner (1990). -- In many cases MCMC methods will have to be used in the MCEM algorithm. --- ### Example: Normal Mixture Models Suppose `\(X_1, \dots, X_n\)` are independent and identically distributed with density `\begin{equation*} f(x|\theta) = \sum_{j = 1}^M p_j f(x|\mu_j, \sigma_j^2) \end{equation*}` with `\(p_1, \dots, p_M \ge 0\)`, `\(\sum p_j = 1\)`, and `\(f(x|\mu,\sigma^2)\)` is a `\(\text{Normal}(\mu, \sigma^2)\)` density.
-- `\(M\)` is assumed known. -- We can think of `\(X_i\)` as generated in two stages: -- * First a category `\(J\)` is selected from `\(\{1, \dots, M\}\)` with probabilities `\(p_1, \dots, p_M\)`. -- * Then, given `\(J = j\)`, a normal random variable is generated from `\(f(x|\mu_j,\sigma_j^2)\)`. --- We can represent the unobserved category using indicator variables `\begin{equation*} Z_{ij} = \begin{cases} 1 & J_i = j\\ 0 & \text{otherwise.} \end{cases} \end{equation*}` <!-- `\begin{equation*} Z_{ij} = \begin{cases} 1 & \text{if `\(J_i = j\)`}\\ 0 & \text{otherwise.} \end{cases} \end{equation*}` --> -- The complete data log likelihood is `\begin{equation*} \log L(\theta | x, z) = \sum_{i=1}^n\sum_{j=1}^M z_{ij}\left[\log p_j - \log \sigma_j - \frac{1}{2}\left(\frac{x_i - \mu_j}{\sigma_j}\right)^2\right]. \end{equation*}` --- The Expectation step produces `\begin{equation*} Q(\theta|\widehat{\theta}^{(k)}) = \sum_{i=1}^n\sum_{j=1}^M \widehat{p}_{ij}^{(k)}\left[\log p_j - \log \sigma_j - \frac{1}{2}\left(\frac{x_i - \mu_j}{\sigma_j}\right)^2\right] \end{equation*}` -- with `\begin{equation*} \widehat{p}_{ij}^{(k)} = E[Z_{ij}|X = x, \theta^{(k)}] = \frac{\widehat{p}_j^{(k)} f\left(x_i\left|\widehat{\mu}_j^{(k)}, \widehat{\sigma}_j^{(k)2}\right.\right)} {\sum_{\ell = 1}^M \widehat{p}_\ell^{(k)} f\left(x_i\left|\widehat{\mu}_\ell^{(k)}, \widehat{\sigma}_\ell^{(k)2}\right.\right)}. \end{equation*}` --- The Maximization step then produces `\begin{align*} \widehat{\mu}_j^{(k+1)} &= \frac{\sum_{i=1}^n \widehat{p}_{ij}^{(k)} x_i} {\sum_{i=1}^n \widehat{p}_{ij}^{(k)}}\\ \widehat{\sigma}_j^{(k+1)2} &= \frac{\sum_{i=1}^n \widehat{p}_{ij}^{(k)} (x_i - \widehat{\mu}_j^{(k+1)})^2} {\sum_{i=1}^n \widehat{p}_{ij}^{(k)}}\\ \widehat{p}_j^{(k+1)} &= \frac{1}{n} \sum_{i=1}^n \widehat{p}_{ij}^{(k)} \end{align*}` --- A simple R function to implement a single EM step might be written as .small-code[ ```r EMmix1 <- function(x, theta) { mu <- theta$mu sigma <- theta$sigma p <- theta$p M <- length(mu) ## E step Ez <- outer(x, 1 : M, function(x, i) p[i] * dnorm(x, mu[i], sigma[i])) Ez <- sweep(Ez, 1, rowSums(Ez), "/") colSums.Ez <- colSums(Ez) ## M step xp <- sweep(Ez, 1, x, "*") mu.new <- colSums(xp) / colSums.Ez sqRes <- outer(x, mu.new, function(x, m) (x - m)^2) sigma.new <- sqrt(colSums(Ez * sqRes) / colSums.Ez) p.new <- colSums.Ez / sum(colSums.Ez) ## pack up result list(mu = mu.new, sigma = sigma.new, p = p.new) } ``` ] <!-- $ make Emacs happy --> --- Some notes: -- * Reasonable starting values are important. -- * We want a local maximum near a good starting value, not a global maximum. -- * Code to examine the log likelihood for a simplified example is [available](../examples/mixll.R) -- Some papers: -- > Tobias Ryden (2008). EM versus Markov chain Monte Carlo for > estimation of hidden Markov models: a computational perspective, > _Bayesian Analysis_ 3 (4), 659--688. > A. Berlinet and C. Roland (2009). Parabolic acceleration of the EM > algorithm. _Statistics and Computing_ 19 (1), 35--48. > Chen, Lin S., Prentice, Ross L., and Wang, Pei (2014). A penalized > EM algorithm incorporating missing data mechanism for Gaussian > parameter estimation, _Biometrics_ 70 (2), 312--322. <!-- %% **** data example? --> --- ### Theoretical Properties of the EM Algorithm The conditional density of the unobserved data given the observed data is `\begin{equation*} k(z|x, \theta) = \frac{f(x,z|\theta)}{g(x|\theta)}. 
\end{equation*}` -- Define, taking expectations with respect to `\(k(z|x,\phi)\)`, `\begin{align*} Q(\theta|\phi,x) &= E_\phi[\log f(x,z|\theta)|x]\\ H(\theta|\phi,x) &= E_\phi[\log k(z|x, \theta)|x]. \end{align*}` --- For any `\(\phi\)` `\begin{equation*} \log L(\theta|x) = Q(\theta|\phi,x) - H(\theta|\phi,x) \end{equation*}` -- since for any `\(z\)` `\begin{align*} \log L(\theta|x) &= \log g(x|\theta)\\ &= \log f(x,z|\theta) - \log f(x, z|\theta) + \log g(x|\theta)\\ &= \log f(x,z|\theta) - \log \frac{f(x,z|\theta)}{g(x|\theta)}\\ &= \log f(x, z|\theta) - \log k(z|x,\theta). \end{align*}` -- and therefore, taking expectations with respect to `\(k(z|x,\phi)\)`, `\begin{align*} \log L(\theta|x) &= E_\phi[\log f(x, z|\theta)] - E_\phi[\log k(z|x,\theta)]\\ &= Q(\theta|\phi,x) - H(\theta|\phi,x). \end{align*}` --- Furthermore, for all `\(\theta\)` `\begin{equation*} H(\theta|\phi) \le H(\phi|\phi) \end{equation*}` -- since, by Jensen's inequality, `\begin{align*} H(\theta|\phi) - H(\phi|\phi) &= E_\phi[\log k(z|x,\theta)] - E_\phi[\log k(z|x,\phi)]\\ &= E_\phi\left[\log\frac{k(z|x,\theta)}{k(z|x,\phi)}\right]\\ &\le \log E_\phi\left[\frac{k(z|x,\theta)}{k(z|x,\phi)}\right]\\ &= \log \int \frac{k(z|x,\theta)}{k(z|x,\phi)} k(z|x,\phi) dz\\ &= \log \int k(z|x,\theta) dz = 0. \end{align*}` --- This implies that for any `\(\theta\)` and `\(\phi\)` `\begin{equation*} \log L(\theta|x) \ge Q(\theta|\phi,x) - H(\phi|\phi,x) \end{equation*}` with equality when `\(\theta = \phi\)`. -- Therefore if `\(\widehat{\theta}\)` maximizes `\(L(\theta|x)\)` then `\(\widehat{\theta}\)` maximizes `\(Q(\theta|\widehat{\theta},x)\)` with respect to `\(\theta\)`: `\begin{align*} Q(\theta|\widehat{\theta},x) - H(\widehat{\theta}|\widehat{\theta},x) &\le \log L(\theta|x)\\ &\le \log L(\widehat{\theta}|x) = Q(\widehat{\theta}|\widehat{\theta},x) - H(\widehat{\theta}|\widehat{\theta},x). \end{align*}` -- If `\(\widehat{\theta}(\phi)\)` maximizes `\(Q(\theta|\phi,x)\)` for a given `\(\phi\)` then `\(\log L(\phi|x) \le \log L(\widehat{\theta}(\phi)| x)\)`: `\begin{align*} \log L(\phi|x) &= Q(\phi|\phi,x) - H(\phi|\phi,x)\\ &\le Q(\widehat{\theta}(\phi)|\phi,x) - H(\phi| \phi,x)\\ & \le Q(\widehat{\theta}(\phi)|\phi,x) - H(\widehat{\theta}(\phi)| \phi,x)\\ & = \log L(\widehat{\theta}(\phi)| x). \end{align*}` --- In the following figures the black curve corresponds to `\(\log L(\theta|x)\)` and the red curve to the _surrogate function_ `\(Q(\theta|\phi, x) - H(\phi|\phi, x)\)`. <img src="optim_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- layout: true ## MM Algorithms --- The EM algorithm can be viewed as a special case of an MM algorithm. -- MM stands for -- * _Minimization and Majorization_, or * _Maximization and Minorization_. -- Suppose the objective is to _maximize_ a function `\(f(\theta)\)`. -- The assumption is that a _surrogate function_ `\(g(\theta | \phi)\)` is available such that `\begin{equation*} f(\theta) \ge g(\theta|\phi) \end{equation*}` for all `\(\theta, \phi\)`, with equality when `\(\theta = \phi\)`. -- The function `\(g\)` is said to _minorize_ the function `\(f\)`. --- The MM algorithm starts with an initial guess `\(\widehat{\theta}^{(0)}\)` and then computes `\begin{equation*} \widehat{\theta}^{(i + 1)} = \text{argmax}_\theta g(\theta|\widehat{\theta}^{(i)}) \end{equation*}` -- As in the EM algorithm, -- * if `\(\widehat{\theta}\)` maximizes `\(f(\theta)\)`, then `\(\widehat{\theta}\)` maximizes `\(g(\theta|\widehat{\theta})\)`; -- * if `\(\widehat{\theta}(\phi)\)` maximizes `\(g(\theta|\phi)\)`, then $$ f(\phi) \le f(\widehat{\theta}(\phi)).$$ -- The objective function values will increase, and under reasonable conditions the algorithm will converge to a local maximizer. -- Full maximization of `\(g\)` is not needed as long as sufficient progress is made (single Newton steps are often sufficient). --- The art in designing an MM algorithm is finding a surrogate minorizing function `\(g\)` that -- * produces an efficient algorithm -- * is easy to maximize -- In `\(p\)`-dimensional problems it is often possible to choose minorizing functions of the form `\begin{equation*} g(\theta|\phi) = \sum_{j=1}^p g_j(\theta_j|\phi) \end{equation*}` so that `\(g\)` can be maximized by separate one-dimensional maximizations of the `\(g_j\)`. --- ### Example: Bradley Terry Model In the Bradley Terry competition model each team has a strength `\(\theta_i\)` and $$ P(\text{$i$ beats `\(j\)`}) = \frac{\theta_i}{\theta_i + \theta_j} $$ with `\(\theta_1 = 1\)` for identifiability. -- Assuming independent games in which `\(y_{ij}\)` is the number of times `\(i\)` beats `\(j\)` the log likelihood is `\begin{equation*} \log L(\theta) = \sum_{i,j} y_{ij} \left(\log(\theta_i) - \log(\theta_i+\theta_j)\right). \end{equation*}` --- Since the logarithm is concave, for any `\(x\)` and `\(y\)` `\begin{equation*} -\log y \ge -\log x - \frac{y-x}{x} \end{equation*}` -- and therefore for any `\(\theta\)` and `\(\phi\)` `\begin{align*} \log L(\theta) &\ge \sum_{i,j} y_{ij} \left(\log(\theta_i) - \log(\phi_i+\phi_j) - \frac{\theta_i+\theta_j-\phi_i-\phi_j}{\phi_i+\phi_j}\right)\\ &= g(\theta|\phi)\\ &= \sum_i g_i(\theta_i|\phi) \end{align*}` -- with `\begin{equation*} g_i(\theta_i|\phi) = \sum_j y_{ij}\log\theta_i - \sum_j(y_{ij}+y_{ji})\frac{\theta_i-\phi_i}{\phi_i+\phi_j}. \end{equation*}` --- The parameters are separated in `\(g\)`, and the one-dimensional maximizers can be easily computed as `\begin{equation*} \widehat{\theta}_i(\phi) = \frac{\sum_{j} y_{ij}}{\sum_{j}(y_{ij}+y_{ji})/(\phi_i+\phi_j)}. \end{equation*}` -- Some References on MM algorithms: > Hunter DR and Lange K (2004), [A Tutorial on MM > Algorithms](https://personal.psu.edu/drh20/papers/mmtutorial.pdf). > _The American Statistician_, 58: 30-37. > Lange, K (2004). _Optimization_, Springer-Verlag, New York.
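---

A single MM update for this model takes only a few lines of R. In this sketch `Y` is a hypothetical `\(M \times M\)` matrix of win counts, with `Y[i, j]` the number of times team `\(i\)` beats team `\(j\)` and a zero diagonal; the strengths are rescaled so the first is one.

```r
## One MM update for the Bradley-Terry strengths.
BTupdate <- function(theta, Y) {
    S <- outer(theta, theta, "+")       ## S[i, j] = theta_i + theta_j
    num <- rowSums(Y)                   ## sum_j y_ij
    den <- rowSums((Y + t(Y)) / S)      ## sum_j (y_ij + y_ji) / (theta_i + theta_j)
    new <- num / den
    new / new[1]                        ## impose theta_1 = 1
}
## Iterate theta <- BTupdate(theta, Y) until the change is negligible.
```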
--- layout: true ## Gradient Descent Methods --- Gradient descent, or steepest descent, methods have one advantage: only the gradient needs to be computed. -- A simple gradient descent approach uses $$ x_{n+1} = x_n - \alpha \nabla f(x_n). $$ -- For a quadratic function `\(f(x) = \frac{1}{2} x^T A x - x^T b\)` with strictly positive definite second derivative matrix `\(A\)` this algorithm converges linearly if $$ \alpha < \frac{2}{\rho(A)} $$ where `\(\rho(A)\)` is the spectral radius of `\(A\)`, i.e. the largest eigenvalue. --- The optimal choice of `\(\alpha\)` is $$ \alpha = \frac{2}{\rho(A)} \times \frac{\kappa(A)}{\kappa(A) + 1} $$ where `\(\kappa(A)\)` is the condition number of `\(A\)`. -- The corresponding rate of convergence is $$ \frac{\kappa(A) - 1}{\kappa(A) + 1}. $$ -- This provides some guidance for use in the non-quadratic case: -- * `\(\alpha\)` needs to be small enough but not too small; -- * making sure the problem is well-conditioned is very important.
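---

A small numerical check of the step size condition, using an arbitrary positive definite matrix `\(A\)`:

```r
## Fixed-step gradient descent on f(x) = x^T A x / 2 - x^T b.
A <- matrix(c(2, 1, 1, 3), 2, 2)
b <- c(1, 1)
alpha <- 1.9 / max(eigen(A)$values)     ## just below 2 / rho(A)
x <- c(0, 0)
for (k in 1 : 500)
    x <- x - alpha * drop(A %*% x - b)  ## the gradient is A x - b
x
solve(A, b)                             ## exact minimizer for comparison
```

With `alpha` larger than `2 / max(eigen(A)$values)` the iterates diverge; with a much smaller `alpha` they still converge, but slowly.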
--- The step size `\(\alpha\)` can be chosen adaptively by a line search or by _backtracking_, e.g. starting with `\(\alpha = \alpha_0\)` and reducing `\(\alpha\)` by a factor of 1/2 until $$ f(x - \alpha \nabla f(x)) \le f(x) - \frac{1}{2}\alpha\|\nabla f(x)\|^2 $$ -- Other adaptive strategies try to avoid evaluating `\(f(x)\)`. -- _Momentum_ methods try to reduce zig-zagging by using a combination of the previous step direction and the current gradient. --- _Stochastic gradient descent_ methods try to reduce the cost of each step in cases where $$ f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) $$ by at each step choosing a sample `\(i_1, \dots, i_m\)` and computing the gradient of `$$\frac{1}{m} \sum_{k=1}^m f_{i_k}(x).$$` -- Choosing `\(m = 1\)` sometimes works. --- Variants of stochastic gradient descent are used widely in machine learning algorithms, in particular for fitting deep neural networks. -- In machine learning the step size `\(\alpha\)` is called the _learning rate_. -- Stochastic gradient descent will converge almost surely under mild conditions if used with a _learning rate decay_, e.g. with a learning rate `\(\alpha_k\)` at iteration `\(k\)` of the form $$ \alpha_k = \alpha_0 \frac{1}{1 + \delta k}. $$ --- layout: true ## Constrained Optimization --- Equality constraints arise, for example, in -- * restricted maximum likelihood estimation needed for the null hypothesis in likelihood ratio tests -- * estimating probabilities that sum to one -- Inequality constraints arise in estimating -- * probabilities or rates that are constrained to be non-negative -- * monotone functions or monotone sets of parameters -- * convex functions -- Box constraints are the simplest and the most common. -- Linear inequality constraints also occur frequently. --- Inequality constraints are often handled by converting a constrained problem to an unconstrained one: -- * Minimizing `\(f(x)\)` subject to `\(g_i(x) \ge 0\)` for `\(i = 1, \dots, M\)` is equivalent to minimizing `$$F(x) = \begin{cases} f(x) & g_i(x) \ge 0: i = 1, \dots, M\\ \infty & \text{otherwise.} \end{cases}$$` <!-- \begin{equation*} F(x) = \begin{cases} f(x) & \text{if `\(g_i(x) \ge 0\)` for `\(i = 1, \dots, M\)` }\\ \infty & \text{otherwise.} \end{cases} \end{equation*} --> -- * This objective function is not continuous. -- Sometimes a complicated unconstrained problem can be changed to a simpler constrained one. --- ### Example: `\(L_1\)` Regression The `\(L_1\)` regression estimator is defined by the minimization problem `\begin{equation*} \widehat{b} = \underset{b}{\text{argmin}} \sum |y_i - x_i^T b| \end{equation*}` -- This unconstrained optimization problem with a non-differentiable objective function can be reformulated as a constrained linear programming problem: --- For a vector `\(z = (z_1, \dots, z_n)^T\)` let `\([z]_+\)` denote the vector of positive parts of `\(z\)`, i.e.
`\begin{equation*} ([z]_+)_i = \max(z_i, 0) \end{equation*}` and let `\begin{align*} b_+ &= [b]_+\\ b_- &= [-b]_+\\ u &= [y - Xb]_+\\ v &= [Xb - y]_+ \end{align*}` -- Then the `\(L_1\)` estimator satisfies `\begin{equation*} \widehat{b} = \underset{b}{\text{argmin}} \; \mathbf{1}^T u + \mathbf{1}^T v \end{equation*}` subject to the constraints `\begin{align*} &y = Xb_+ - Xb_- + u - v\\ &u \ge 0, v \ge 0, b_+ \ge 0, b_- \ge 0 \end{align*}` --- This linear programming problem can be solved by the standard simplex algorithm. -- A specialized simplex algorithm taking advantage of structure in the constraint matrix allows larger data sets to be handled (Barrodale and Roberts, 1974). -- Interior point methods can also be used and are effective for large `\(n\)` and moderate `\(p\)`. -- An alternative approach called _smoothing_ is based on approximating the absolute value function by a twice continuously differentiable function and can be effective for moderate `\(n\)` and large `\(p\)`. -- Chen and Wei (2005) provide an overview in the context of the more general _quantile regression_ problem. --- Quantile regression computes estimates as `\begin{equation*} \widehat{b} = \underset{b}{\text{argmin}} \sum \rho_\tau(y_i - x_i^Tb) \end{equation*}` with `\begin{equation*} \rho_\tau(y) = y\left(\tau - 1_{\{y < 0\}}\right) = \tau [y]_+ + (1 - \tau) [-y]_+. \end{equation*}` for `\(0 < \tau < 1\)`. For `\(\tau = \frac{1}{2}\)` this is `\(\rho_{\frac{1}{2}}(y) = \frac{1}{2}|y|\)`. <img src="optim_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- Reformulation as a constrained optimization problem is also sometimes used for LASSO and SVM computations. -- ### Some Approaches and Algorithms If the optimum is in the interior of the feasible region then using standard methods on `\(F\)` may work, especially if they use backtracking when they encounter infinite values. -- For interior optima, transformations to an unbounded feasible region may help. -- Standard algorithms can often be modified to handle box constraints. --- Other constraints are often handled using _barrier methods_ that approximate `\(F\)` by a smooth function `\begin{equation*} F_\gamma(x) = \begin{cases} f(x) - \gamma \sum_{i = 1}^M \log g_i(x) & g(x) \ge 0\\ \infty & \text{otherwise} \end{cases} \end{equation*}` <!-- `\begin{equation*} F_\gamma(x) = \begin{cases} f(x) - \gamma \sum_{i = 1}^M \log g_i(x) & \text{if `\(g(x) \ge 0\)`}\\ \infty & \text{otherwise} \end{cases} \end{equation*}` --> for some `\(\gamma > 0\)`. -- The algorithm starts with a feasible solution, minimizes `\(F_\gamma\)` for a given `\(\gamma\)`, reduces `\(\gamma\)`, and repeats until convergence. -- Barrier methods are also called path-following algorithms. -- Path-following algorithms are useful in other settings where a harder problem can be approached through a sequence of easier problems. -- Path-following algorithms usually start the search for the next `\(\gamma\)` at the solution for the previous `\(\gamma\)`; this is sometimes called a _warm start_ approach. -- The intermediate results for positive `\(\gamma\)` are in the interior of the feasible region, hence this is an _interior point method_. -- More sophisticated interior point methods are available in particular for convex optimization problems. --- ### An Adaptive Barrier Algorithm Consider the problem of minimizing `\(f(x)\)` subject to the linear inequality constraints `\begin{equation*} g_i(x) = b_i^T x - c_i \ge 0 \end{equation*}` for `\(i = 1, \dots, M\)`.
-- For an interior point `\(x^{(k)}\)` of the feasible region define the surrogate function `\begin{equation*} R(x | x^{(k)}) = f(x) - \mu \sum_{i=1}^M \left[g_i(x^{(k)}) \log(g_i(x)) - b_i^T x\right] \end{equation*}` for some `\(\mu > 0\)`. --- As a function of `\(x\)` the barrier function `\begin{equation*} f(x) - R(x | x^{(k)}) = \mu \sum_{i=1}^M \left[g_i(x^{(k)}) \log(g_i(x)) - b_i^T x\right] \end{equation*}` is concave and maximized at `\(x = x^{(k)}\)`. -- Thus if `\begin{equation*} x^{(k+1)} = \underset{x}{\text{argmin}} \; R(x | x^{(k)}) \end{equation*}` -- then `\begin{align*} f(x^{(k+1)}) & = R(x^{(k+1)}|x^{(k)}) + f(x^{(k+1)}) - R(x^{(k+1)}|x^{(k)})\\ &\le R(x^{(k)}|x^{(k)}) + f(x^{(k+1)}) - R(x^{(k+1)}|x^{(k)})\\ &\le R(x^{(k)}|x^{(k)}) + f(x^{(k)}) - R(x^{(k)}|x^{(k)})\\ & = f(x^{(k)}) \end{align*}` -- The values of the objective function are non-increasing. --- This has strong similarities to an EM algorithm argument. -- The coefficient of a logarithmic barrier term decreases to zero if `\(x^{(k)}\)` approaches the boundary. -- If `\(f\)` is convex and has a unique minimizer `\(x^*\)` then `\(x^{(k)}\)` converges to `\(x^*\)`. -- This algorithm was introduced in K. Lange (1994). --- layout: true ## Optimization in R --- Several optimization functions are available in the standard R distribution: -- * `optim` implements a range of methods: * Nelder-Mead simplex * Quasi-Newton with the BFGS update * A modified Quasi-Newton BFGS algorithm that allows box constraints * A conjugate gradient algorithm * A version of simulated annealing. -- * Some notes: * A variety of iteration control options are available. * Analytical derivatives can be supplied to methods that use them. * Hessian matrices at the optimum can be requested. -- * `nlminb` provides an interface to the FORTRAN [PORT](https://web.archive.org/web/20070508140716/http://netlib.bell-labs.com/cm/cs/cstr/153.pdf) library developed at Bell Labs. Box constraints are supported. -- * `nlm` implements a modified Newton algorithm as described in Dennis and Schnabel (1983). -- * `constrOptim` implements the adaptive barrier method of Lange (1994) using `optim` for the optimizations within iterations. --- The [Optimization Task View](https://cran.r-project.org/view=Optimization) on CRAN describes a number of other methods available in contributed packages. -- Simple examples illustrating `optim` and `constrOptim` are [available](../examples/optimpath.R).
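---

A minimal sketch (not the linked course example) of an `optim` call using the BFGS method with an analytic gradient, applied to the standard Rosenbrock test function:

```r
## Minimize the Rosenbrock function with optim() and BFGS.
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
gr <- function(x)
    c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
       200 * (x[2] - x[1]^2))
fit <- optim(c(-1.2, 1), fr, gr, method = "BFGS", hessian = TRUE)
fit$par                     ## should be close to c(1, 1)
fit$convergence             ## 0 indicates successful convergence
eigen(fit$hessian)$values   ## positive at a local minimum
```

For a one-dimensional problem `optimize` plays the corresponding role, e.g. `optimize(function(t) (t - 2)^2, c(0, 5))`.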
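---

A similarly minimal sketch for `constrOptim`, with a hypothetical objective and linear inequality constraints specified through `ui %*% x - ci >= 0`:

```r
## Minimize (x1 - 2)^2 + (x2 - 1)^2 subject to
## x1 >= 0, x2 >= 0, and x1 + x2 <= 1.
fn <- function(x) (x[1] - 2)^2 + (x[2] - 1)^2
gr <- function(x) c(2 * (x[1] - 2), 2 * (x[2] - 1))
ui <- rbind(c( 1,  0),    ## x1 >= 0
            c( 0,  1),    ## x2 >= 0
            c(-1, -1))    ## x1 + x2 <= 1
ci <- c(0, 0, -1)
fit <- constrOptim(c(0.25, 0.25), fn, gr, ui = ui, ci = ci)
fit$par   ## the solution lies on the boundary x1 + x2 = 1
```

The starting value must be strictly inside the feasible region; `constrOptim` then follows the adaptive barrier strategy described earlier, calling `optim` for the optimizations within each iteration.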