Statistical Learning

class: center, middle, title-slide

.title[
# Statistical Learning
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---

layout: true

## Some Machine Learning Terminology
---

Two forms of learning:

* _supervised learning_: _features_ and responses are available for
  a _training set_, and a way of predicting response from features
  of new data is to be _learned_.

* _unsupervised learning_: no distinguished responses are available;
  the goal is to discover patterns and associations among features.

Classification and regression are supervised learning methods.

Clustering, multi-dimensional scaling, and principal curves are
unsupervised learning methods.

_Data mining_ involves extracting information from large (many
cases and/or many variables), messy (many missing values, many
different kinds of variables and measurement scales) data bases.

Machine learning often emphasizes methods that are sufficiently fast
and automated for use in data mining.

Machine learning is now often considered a branch of _Artificial
  Intelligence (AI)_.

---

Tree models are popular in machine learning

* supervised: as predictors in classification and regression settings
--

* unsupervised: for describing clustering results.

Some other methods often associated with machine learning:

* Bagging
--

* Boosting
--

* Random Forests
--

* Support Vector Machines
--

* Neural Networks

---

References:

> T. Hastie, R. Tibshirani, and J. Friedman (2009).  _The Elements of
> Statistical Learning, 2nd Ed._.

> G. James, D. Witten, T. Hastie, and R. Tibshirani (2021).  _An
> Introduction to Statistical Learning, with Applications in R, 2nd
> Ed._.

> D. Hand, H, Mannila, and P. Smyth (2001). _Principles of Data
> Mining_.

> C. M. Bishop (2006). _Pattern Recognition and Machine Learning_.

> M. Shu (2008). Kernels and ensembles: perspectives on statistical
> learning, _The American Statistician_ 62(2), 97--109.

Some examples are available on line
[on line](../examples/learning.Rmd).

---
layout: true
## Tree Models
---

Tree models were popularized by a book and software named CART, for
_Classification and Regression Trees_.

The name CART was trademarked and could not be used by other
implementations.

Tree models partition the predictor space based on a series of binary
splits.

Leaf nodes predict a response:

* a category for _classification trees_;
--

* a numerical value for _regression trees_.

Regression trees may also use a simple linear model within leaf nodes
of the partition.

---

.pull-left[
Using `rpart` a tree model for predicting union membership can
be constructed by

```r
data(trade.union, package = "SemiPar")
library(rpart)
trade.union$member.fac <-
    as.factor(ifelse(trade.union$union.member,
                     "yes",
                     "no"))
fit <- rpart(member.fac ~ wage + age + years.educ,
             data = trade.union)
plot(fit)
text(fit, use.n = TRUE)
```
]

.pull-right[
<img src="learning_files/figure-html/rpart-unions-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
Regression trees use a constant fit by default.

A regression tree for the California air pollution data:

```r
data(calif.air.poll, package = "SemiPar")
library(rpart)
fit2 <-
    rpart(ozone.level ~ daggett.pressure.gradient +
              inversion.base.height +
              inversion.base.temp,
          data = calif.air.poll)
plot(fit2)
text(fit2)
```
]
.pull-right[
<img src="learning_files/figure-html/rpart-calpol-1.png" style="display: block; margin: auto;" />
]

---

Tree models are flexible but simple; results are easy to explain to
non-specialists.

Small changes in data

* can change tree structure substantially;
--

* usually do not change predictive performance much.

Fitting procedures usually consist of two phases:

* growing a large tree;
--

* pruning back the tree to reduce over-fitting.

Tree growing usually uses a greedy approach.
Pruning usually minimizes a penalized goodness of fit measure

`\begin{equation*}
  R(\mathcal{T}) + \lambda \text{size}(\mathcal{T})
\end{equation*}`

with `$R$` a raw measure of goodness of fit.

The parameter `$\lambda$` can be chosen by some form of
cross-validation.

---

For regression trees, mean square prediction error is usually used for
both growing and pruning.

For classification trees

* growing usually uses a loss function that rewards class purity,
  e.g. a Gini index
  `$$G_m = \sum_{k=1}^K \widehat{p}_{mk}(1 - \widehat{p}_{mk})$$`
  or a cross-entropy
  `$$D_m = - \sum_{k=1}^K \widehat{p}_{mk}\log \widehat{p}_{mk}$$`
  with `$\widehat{p}_{mk}$` the proportion of training observations in
  region `$m$` that are in class `$k$`.

* Pruning usually focuses on minimizing classification error rates.

The `rpart` package provides one implementation; the `tree` and
`party` packages are also available, among others.

---
layout: true
## Bagging, Boosting, and Random Forests
---

All three are _ensemble methods_: They combine weaker predictors,
or _learners_, to form a stronger one.

A related idea is _Bayesian Model Averaging (BMA)_.

---

### Bagging: Bootstrap AGGregation

Bootstrapping in prediction models produces a sample of predictors

`\begin{equation*}
  T_1^*(x),\dots,T_R^*(x).    
\end{equation*}`

Usually bootstrapping is viewed as a way of assessing the variability
of the predictor `$T(x)$` based on the original sample.

For predictors that are not linear in the data an aggregated estimator
such as

`\begin{equation*}
  T_\text{BAG}(x) = \frac{1}{R}\sum_{i=1}^R T_i^*(x)
\end{equation*}`

may be an improvement.

Other aggregations are possible; for classification trees two options
are

* averaging probabilities;
* majority vote.

---

Bagging can be particularly effective for tree models.

Less pruning, or even no pruning, is needed since variance is
reduced by averaging.

Each bootstrap sample will use about 2/3 of the observations; about
1/3 will be _out of bag_, or OOB. The OOB observations can be
used to construct an error estimate.

For tree methods:

* The resulting predictors are more accurate than simple trees,
  but lose the simple interpretability.
--

* The total reduction in RSS or the Gini index due to splits on a
  particular variable can be used as a measure of variable importance.

_Bumping_ (Bootstrap umbrella of model parameters) is another
approach:

* Given a bootstrap sample of predictors
  `$T_1^*(x),\dots,T_R^*(x)$` choose the one that best fits the
  original data.
--

* The original sample is included in the bootstrap sample so the
  original predictor can be chosen if it is best.

---

### Random Forests

Introduced by Breiman (2001).

Also covered by a trademark.

Similar to bagging for regression or classification trees.

Draws `$n_\text{tree}$` bootstrap samples.
For each sample a tree is grown _without_ pruning.

* At each node `$m_\text{try}$` out of `$p$` available predictors are
  sampled at random.
--

* A common choice is `$m_\text{try} \approx \sqrt{p}$`.
--

* The best split among the sampled predictors is used.

Form an ensemble predictor by aggregating the trees.

Error rates are measured by

* at each bootstrap iteration predict data not in the sample
  (out-of-bag, OOB, data);
--

* combine the OOB error measures across samples.

---

Bagging without pruning for tree models is equivalent to a random
forest with `$m_\text{try} = p$`.

A motivation is to reduce correlation among the bootstrap trees and so
increase the benefit of averaging.

The R package `randomForest` provides an interface to FORTRAN
code of Breiman and Cutler.

The software provides measures of

* "importance" of each predictor variable;
--

* similarity of observations.

Some details are available in A. Liaw and M. Wiener (2002).
"Classification and Regression by randomForest," _R News_.

[Other packages](http://philipppro.github.io/More_complete_list/)
implementing random forests are a available as well.

A recent addition is the
[`ranger`](https://www.jstatsoft.org/article/view/v077i01)
package.

---

## Boosting

Boosting is a way of improving on a weak supervised learner.

The basic learner needs to be able to work with weighted data.

The simplest version applies to binary classification with responses
`$y_i = \pm 1$`.

A binary classifier produced from a set of weighted training data is a
function

`\begin{equation*}
  G(x): \mathcal{X} \to \{-1,+1\}
\end{equation*}`

---

The _AdaBoost.M1_ (adaptive boosting) algorithm:

1. Initialize observation weights `$w_i = 1/n, i = 1,\dots,n$`.
2. For `$m = 1, \dots, M$` do

a. Fit a classifier `$G_m(x)$` to the training data with weights `$w_i$`.

b. Compute the weighted error rate
    `$$\text{err}_m = \frac{\sum_{i=1}^n w_i 1_{\{y_i \neq G_m(x_i)\}}}{\sum_{i=1}^n w_i}.$$`
 c. Compute `$\alpha_m = \log((1 - \text{err}_m) / \text{err}_m)$`.

d. Set `$w_i \leftarrow w_i \exp(\alpha_m 1_{\{y_i \neq G_m(x_i)\}})$`

3. Output `$G(x) = \text{sign}\left(\sum_{i=1}^M \alpha_m G_m(x)\right)$`

The weights are adjusted to put more weight on points that were
classified incorrectly.

---

These ideas extend to multiple categories and to continuous responses.

Empirical evidence suggests boosting is successful in a range of
problems.

Theoretical investigations support this.

The resulting classifiers are closely related to additive models
constructed from a set of elementary basis functions.

The number of steps `$M$` plays the role of a model selection parameter

* too small a value produces a poor fit
--

* too large a value fits the training data too well

Some form of regularization, e.g. based on a validation sample, is
needed.

Other forms of regularization, e.g. variants of shrinkage, are
possible as well.

---

.pull-left.smaller.width-60[
A variant for boosting regression trees:

1. Set `$\widehat{f}(x) = 0$` and `$r_i = y_i$` for all `$i$` in the
   training set.
2. For `$m = 1, \dots, M$`:

a. Fit a tree `$\widehat{f}^m(x)$` with `$d$` splits to the
     training data `$X, r$`.

b. Update `$\widehat{f}$` by adding a shrunken version of `$\widehat{f}^m(x)$`,
     `$$\widehat{f}(x) \leftarrow \widehat{f}(x) + \lambda \widehat{f}^m(x).$$`
  c. Update the residuals
     `$$r_i \leftarrow r_i -  \lambda \widehat{f}^m(x)$$`

3. Return the boosted model
   `$$\widehat{f}(x) = \sum_{m = 1}^{M}  \lambda \widehat{f}^m(x)$$`
]

.pull-right.smaller.width-40[
Using a fairly small `$d$` often works well.
{{content}}
]
--

With `$d = 1$` this fits an additive model.
{{content}}
--

Small values of `$\lambda$`, e.g. 0.01 or 0.001, often work well.
{{content}}
--

`$M$` is generally chosen by cross-validation.

---

### References on Boosting

P. Buehlmann and T. Hothorn (2007). "Boosting algorithms:
regularization, prediction and model fitting (with discussion),"
_Statistical Science_, 22(4),477--522.

Andreas Mayr, Harald Binder, Olaf Gefeller, Matthias Schmid
(2014). "The evolution of boosting algorithms - from machine learning
to statistical modelling," _Methods of Information in Medicine_
53(6), [arXiv:1403.1452](https://arxiv.org/abs/1403.1452).

Tianqi Chen and Carlos Guestrin (2016). XGBoost: A Scalable Tree
Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and
Data Mining, [arXiv:1603.02754](https://arxiv.org/abs/1603.02754).

---

### California Air Pollution Data

Load data and split out a training sample:

```r
data(calif.air.poll, package = "SemiPar")
library(mgcv)
train <- sample(nrow(calif.air.poll), nrow(calif.air.poll) / 2)
```

Fit the additive linear model to the training data and compute
the mean square prediction error for the test data:

```r
fit <- gam(ozone.level ~ s(daggett.pressure.gradient)
                       + s(inversion.base.height)
                       + s(inversion.base.temp),
           data = calif.air.poll[train, ])
(gam.mse <- mean((calif.air.poll$ozone.level[-train] -
                  predict(fit, calif.air.poll[-train, ]))^2))
## [1] 19.52757
```

---

Fit a tree to the training data using all pedictors:

```r
library(rpart)
tree.ca <- rpart(ozone.level ~ .,
                 data = calif.air.poll[train, ])
(tree.mse <-
     mean((calif.air.poll$ozone.level[-train] -
           predict(tree.ca, calif.air.poll[-train, ]))^2))
## [1] 23.40834
```

Use bagging on the training set:

```r
library(randomForest)
bag.ca <- randomForest(ozone.level ~ .,
                       data = calif.air.poll[train, ],
                       mtry = ncol(calif.air.poll) - 1)
(bag.mse <-
     mean((calif.air.poll$ozone.level[-train] -
           predict(bag.ca, calif.air.poll[-train, ]))^2))
## [1] 21.31767
```

---

Fit a random forest:

```r
rf.ca <- randomForest(ozone.level ~ .,
                      data = calif.air.poll[train, ])
(rf.mse <- mean((calif.air.poll$ozone.level[-train] -
                predict(rf.ca, calif.air.poll[-train, ]))^2))
## [1] 20.87033
```

---

Use `gbm` from the `gbm` package to fit boosted
regression trees:

```r
library(gbm)
boost.ca <- gbm(ozone.level ~ ., data = calif.air.poll[train, ],
                n.trees = 100)
## Distribution not specified, assuming gaussian ...
(gbm1.mse <- mean((calif.air.poll$ozone.level[-train] -
                   predict(boost.ca, calif.air.poll[-train, ],
                           n.trees = 100))^2))
## [1] 21.46409
boost.ca2 <- gbm(ozone.level ~ ., data = calif.air.poll[train, ],
                 n.trees = 100, interaction.depth = 2)
## Distribution not specified, assuming gaussian ...
(gbm2.mse <- mean((calif.air.poll$ozone.level[-train] -
                   predict(boost.ca2, calif.air.poll[-train, ],
                           n.trees = 100))^2))
## [1] 20.73196
```

---

Results:

| Method      | MSE                  |                 |
| :---------- | ------------------:  | --------------- |
|gam          | 19.53     |                 |
|tree         | 23.41    |                 |
|bagged       | 21.32     |                 |
|randomForest | 20.87      |                 |
|boosted      | 21.46   | `$M = 100$`       |
|             | 20.73    | `$M = 100, d = 2$`|

These results were obtained without first re-scaling the predictors.

Some examples are available on line
[on line](../examples/learning.Rmd).

---
layout: true
## Support Vector Machines
---

Support vector machines are a method of classification.

The simplest form is for binary classification with training data
`$(x_1,y_1),\dots,(x_n,y_n)$` with

`\begin{align*}
  \renewcommand{\Real}{\mathbb{R}}
  x_i &\in \Real^p\\
  y_i &\in \{-1,+1\}
\end{align*}`

Various extensions to multiple classes are available; one uses a form
of majority vote among all pairwise classifiers.

Extensions to continuous resposes are also available.

An R implementation is `svm` in package `e1071`.

---

### Support Vector Classifiers

A linear binary classifier is of the form

`\begin{equation*}
  G(x) = \text{sign}(x^T\beta + \beta_0)
\end{equation*}`

One way to choose a classifier is to minimize a penalized measure of
misclassification

`\begin{equation*}
  \renewcommand{\Norm}[1]{\|{#1}\|}
  \min_{\beta,\beta_0} \sum_{i=1}^n(1 - y_i f(x))_+ + \lambda \Norm{\beta}^2
\end{equation*}`

with `$f(x) = x^T\beta + \beta_0$`.

* The misclassification cost is zero for correctly classified
  points far from the bundary
--

* The cost increases for misclassified points farther from the
  boundary.

---

The misclassification cost is qualitatively similar to the negative
log-likelihood for a logistic regression model,

$$
  \rho(y_i, f(x)) = - y_i f(x) + \log\left(1 + e^{y_i f(x)}\right) =
  \log\left(1 + e^{-y_i f(x)}\right)
$$

The support vector classifier loss function is sometimes called
_hinge loss_.

---

Via rewriting in terms of equivalent convex optimization problems it
can be shown that the minimizer `$\widehat{\beta}$` has the form

`\begin{equation*}
  \widehat{\beta} = \sum_{i=1}^n \widehat{\alpha}_i y_i x_i
\end{equation*}`

for some values `$\widehat{\alpha}_i \in [0,1/(2\lambda)]$`.

Therefore

`\begin{equation*}
\newcommand{\ip}[2]{\langle{#1},{#2}\rangle}
\widehat{f}(x) = x^T\widehat{\beta} + \widehat{\beta_0}
= \widehat{\beta_0} + \sum_{i=1}^n \widehat{\alpha}_i y_i x^T x_i
= \widehat{\beta_0} + \sum_{i=1}^n \widehat{\alpha}_i y_i \ip{x}{x_i}
\end{equation*}`

The values of `$\widehat{\alpha_i}$` are only non-zero for `$x_i$` close
to the plane `$f(x) = 0$`.

These `$x_i$` are called _support vectors_.

---

### The "Kernel Trick"

To allow for non-linear decision boundaries, we can use an extended
feature set

$$
  h(x_i) = (h_1(x_i),\dots,h_M(x_i))
$$

A linear boundary in `$\Real^M$` maps down to a nonlinear boundary
in `$\Real^p$`.

For example, for `$p=2$` and

$$
  h(x) = (x_1, x_2, x_1 x_2, x_1^2, x_2^2)
$$

then `$M=5$` and a linear boundary in `$\Real^5$` maps down to a quadratic
boundary in `$\Real^2$`.

---

The estimated classification function will be of the form

`\begin{equation*}
  \widehat{f}(x) = 
  \widehat{\beta}_0 + \sum_{i=1}^n \widehat{\alpha}_i y_i \ip{h(x)}{h(x_i)} =
  \widehat{\beta}_0 + \sum_{i=1}^n \widehat{\alpha}_i y_i K(x,x_i)
\end{equation*}`

where the _kernel function_  `$K$` is

$$
  K(x, x') = \ip{h(x)}{h(x')}
$$

The kernel function is symmetric and positive semi-definite.

We do not need to specify `$h$` explicitly, only `$K$` is needed.

Any symmetric, positive semi-definite function can be used.

---

Some common choices:

`\begin{align*}
  \renewcommand{\ip}[2]{\langle{#1},{#2}\rangle}
  \renewcommand{\Norm}[1]{\|{#1}\|}
  \text{$d$th degree polynimial:} & \qquad K(x,x') = (1+\ip{x}{x'})^d\\
  \text{radial basis:} & \qquad K(x,x') = \exp(-\Norm{x-x'}^2/c)\\
  \text{neural network:}& \qquad K(x,x') = \text{tanh}(a\ip{x}{x'}+b)
\end{align*}`

The parameter `$\lambda$` in the optimization criterion is a
regularization parameter.  It can be chosen by cross-validation.

Particular kernels and their parameters also need to be chosen.

This is analogous/equivalent to choosing sets of basis functions.

Smoothing splines can be expressed in terms of kernels as well.

* This leads to _reproducing kernel Hilbert spaces_.
--

* This does not lead to the sparseness of the SVM approach.

---

### An Artificial Example

.pull-left[
Classify random data as above or below a parabola:

```r
x1 <- runif(100)
x2 <- runif(100)
z <- ifelse(x2 > 2 * (x1 - .5)^2 + .5, 1, 0)
plot(x1, x2, col = ifelse(z, "red", "blue"))
x <- seq(0, 1, len = 101)
lines(x, 2 * (x - .5)^2 + .5, lty = 2)
```
]

.pull-right[
<img src="learning_files/figure-html/svm-example-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
Fit a support vector classifier using
`$\lambda = \frac{1}{2\text{cost}}$`:

```r
library(e1071)
fit <- svm(factor(z) ~ x1 + x2, cost = 10)
fit
## 
## Call:
## svm(formula = factor(z) ~ x1 + x2, cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  15
```

```r
plot(fit, data.frame(z = z, x1 = x1, x2 = x2),
     formula = x2 ~ x1, grid = 100)
```
]

.pull-right[
<img src="learning_files/figure-html/svm-fit-plot-1.png" style="display: block; margin: auto;" />
]

---
layout: true
## Neural Networks
---

.pull-left[
Neural networks are flexible nonlinear models.
{{content}}
]
--

They are motivated by simple models for the working of neurons.
{{content}}
--

They connect _input nodes_ to _output nodes_ through one or
more layers of _hidden nodes_
{{content}}
--

The simplest form is the feed-forward network with one hidden layer,
inputs `$x_1,\dots,x_p$` and outputs `$y_1,\dots,y_k$`.
{{content}}
--

A graphical representation:

.pull-right[

<img src="img/nnfig.png" width="80%" style="display: block; margin: auto;" />
]

---

Mathematical form:

`\begin{align*}
  z_m &= h(\alpha_{0m} + x^T \alpha_m)\\
  t_k &= \beta_{0k} + z^T \beta_k\\
  f_k(x) &= g_k(T)
\end{align*}`

The _activation function_ `$h$` is often a _sigmoidal_ function, like
the logistic CDF

`\begin{equation*}
  h(x) = 1 / (1 + e^{-x})
\end{equation*}`

For regression there is usually one output with `$g_1(t)$` the
identity function.

For binary classification there is usually one output with
`$g_1(t) = 1 / (1 + e^{-t})$`

For `$k$`-class classification with `$k > 2$` usually there are `$k$`
outputs, corresponding to binary class indicator data, with

`\begin{equation*}
  g_k(t) = \frac{e^{t_k}}{\sum_j e^{t_j}}
\end{equation*}`

This is often called a _softmax_ criterion.

---

By increasing the size of the hidden layer `$M$` a neural network can
uniformly approximate any continuous function on a compact set
arbitrarily well.

Some examples, fit to `$n = 101$` data points using function
`neuralnet` from package `neuralnet` with a hidden layer with
`$M=5$` nodes:

---

Fitting is done by maximizing a log likelihood `$L(\alpha,\beta)$`
assuming

* normal errors for regression
--

* a logistic model for classification

The likelihood is highly multimodal and the parameters are not
identified

* relabeling hidden nodes does not change the model, for example
--

* random starting values are usually used
--

* parameters are not interpretable

If `$M$` is large enough to allow flexible fitting then over-fitting is
a risk.

---

Regularization is used to control overfitting: a penalized log
likelihood of the form

`\begin{equation*}
  L(\alpha,\beta) - \lambda(\sum_m \|\alpha_m\|^2 + \sum_k \|\beta_k\|^2)
\end{equation*}`

is maximized.

* For this to make sense it is important to center and scale the
  features to have comparable units.
--

* This approach is referred to as _weight decay_ and
  `$\lambda$` is the decay parameter.

As long as `$M$` is large enough and regularization is used, the
specific value of `$M$` seems to matter little.

The weight decay parameter is often determined by `$N$`-fold cross
validation, often with `$N=10$`

Because of the random starting points, results in repeated runs can
differ.

* one option is to make several runs and pick the best fit
--

* another is to combine results from several runs by averaging
  or majority voting.

---

Fitting a neural net to the artificial data example:

```r
library(neuralnet)
d <- data.frame(x1, x2, z)
fit <- neuralnet(z ~ x1 + x2, d, linear.output = FALSE, hidden = 10)
```
<img src="learning_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

.pull-left[
### Example: Recognizing Handwritten Digits
{{content}}
]
--

Data consists of scanned ZIP code digist from the U.S. postal
service.
{{content}}
--

Data is available at <http://yann.lecun.com/exdb/mnist/> as a
binary file.
{{content}}
--

Training data consist of a small number of original images, around
300, and additional images generated by random shifts.
{{content}}
--

Data are `$28 \times 28$` gray-scale images, along with labels.

.pull-right[
<img src="img/digits.png" width="80%" style="display: block; margin: auto;" />
]

.pull-left[
This has become a standard machine learning test example.
]

.pull-right[
Data can be read into R using `readBin`.
]

---

The fit, using 6000 observations and `$M=100$` nodes in the hidden layer
took 11.5 hours on `r-lnx400`:

```r
library(nnet)
fit <- nnet(X, class.ind(lab), size = 100,
            MaxNWts = 100000, softmax = TRUE)
```
--

It produced a training misclassification rate of about 8% and a test
misclassification rate of about 12%.

Other implementations are faster and better for large problems.

---
layout: true
## Deep Learning
---

Deep learning models are multi-level non-linear models

A supervised model with observed responses `$Y$` and features `$X$` with
`$M$` layers would be
`\begin{equation*}
  Y \sim f_1(y|Z_1), Z_1 \sim f_2(z_1| Z_2), \dots, Z_M \sim f_M(z_M|X)
\end{equation*}`
with `$Z_1, \dots, Z_M$` unobserved latent values.

An unsupervised model with observed features `$X$` would be

`\begin{equation*}
  X \sim f_1(x|Z_1), Z_1 \sim f_2(z_1| Z_2), \dots, Z_M \sim f_M(z_M)
\end{equation*}`

These need to be nonlinear so they don't collapse into one big linear
model.

---

The layers are often viewed as capturing features at different levels
of granularity.

For image classification these might be

* `$X$`: pixel intensities
--

* `$Z_1$`: edges
--

* `$Z_2$`: object parts (e.g. eyes, noses)
--

* `$Z_3$`: whole objects (e.g. faces)

Multi-layer, or deep, neural networks are one approach, that has
become very successful.

<!--
A common approach to supervised learning problems seems to be

* use an unsupervised deep learning method to extract a reduced set
    of features
  * fit a linear or logistic model for the response to these features.
-->

Deep learning methods have become very successful in recent years due
to a combination of increased computing power and algorithm
improvements.

---

Some key algorithm developments include:

* Use of _stochastic gradient descent_ for optimization.

* Backpropagation for efficient gradient evaluation.

* Using the piece-wise linear _Rectified Linear Unit
  (ReLU)_ activation function
  `$$\text{ReLU}(x) = \begin{cases} x & x \ge 0\\ 0 & \text{otherwise.} \end{cases}$$`

* Specialized structures, such as convolutional and recurrent
  neural networks.

* Use of dropout, regularization, and early stopping to avoid
  over-fitting.

---

### Stochastic Gradient Descent

_Gradient descent_ for minimizing a function `$f$` tries to
improve a current guess by taking a step in the direction of the
negative gradient:
`\begin{equation*}
  x' = x - \eta \nabla f(x)
\end{equation*}`

The step size `$\eta$` is sometimes called the _learning rate_.

In one dimension the best step size near the minimum is `$1/f''(x)$`.

A step size that is too small converges too slowly; a step size too
large may not converge at all.

Line search is possible but may be expensive.

Using a fixed step size, with monitoring to avoid divergence, or using
a slowly decreasing step size are common choices.

---

For a DNN the function to be minimized with respect to parameters `$A$`
is typically of the form

`\begin{equation*}
  \sum_{i=1}^n L_i(y_i, x_i, A)
\end{equation*}`

for large `$n$`.

Computing function and gradient values for all `$n$` training cases can
be very costly.

_Stochastic gradient descent_ at each step chooses a random
_minibatch_ of `$B$` of the training cases and computes a new step
based on the loss function for the minibatch.

The minibatch size can be as small as `$B = 1$`.

Stochastic gradient descent optimizations are usually divided into
_epochs_, with each epoch expected to use each training case
once.

---

### Backpropagation

Derivatives of the objective function are computed by the chain rule.

This is done most efficiently by working backwards; this corresponds
to the _reverse mode_ of automatic differentiation.

A DNN with two hidden layers can be represented as

`\begin{equation*}
  F(x; A) = G(A_3 H_2(A_2 H_1(A_1 x)))
\end{equation*}`

If `$G$` is elementwise the identity, and the `$H_i$` are elementwise
ReLU, then this is a piece-wise linear function of `$x$`.

The computation of `$w = F(x; A)$` can be broken down into intermediate
steps as
`\begin{align*}
  t_1 &= A_1 x   & z_1 &= H_1(t_1)\\
  t_2 &= A_2 z_1 & z_2 &= H_2(t_2)\\
  t_3 &= A_3 z_2 & w &= G(t_3)
\end{align*}`

This is called _forward propagation_.

---

The gradient components are then computed in reverse, by
_backpropagation_, as

`\begin{align*}
  B_3 &= \nabla G(t_3) &
  \frac{\partial w}{\partial A_3}
  &= \nabla G(t_3) z_2 = B_3 z_2\\
  B_2 &= B_3 A_3  \nabla H_2 (t_2) &
  \frac{\partial w}{\partial A_2}
  &= \nabla G(t_3) A_3 \nabla H_2 (t_2) z_1 = B_2 z_1\\
  B_1 &= B_2 A_2 \nabla H_1 (t_1) &
  \frac{\partial w}{\partial A_1}
  &= \nabla G(t_3) A_3 \nabla H_2 (t_2) A_2 \nabla H_1(t_1) x = B_1 x
\end{align*}`

For ReLU activations the elements of `$\nabla H_i(t_i)$` will be 0 or 1.

For `$n$` parameters the computation will typically be of order `$O(n)$`.

Many of the computations can be effectively parallelized.

---

### Convolutional and Recurrent Neural Networks

In image processing features (pixel intensities) have a neighborhood
structure.

A _convolutional neural network_ uses one or more hidden layers that
are:

* only locally connected;
--

* use the same parameters at each location.

A simple convolution layer might use a pixel and each of its 4
neighbors with

`\begin{equation*}
  t = (a_1 R + a_2 L + a_3 U + a_4 D) z
\end{equation*}`

where, e.g.

`\begin{equation*}
  R_{ij} =
  \begin{cases}
    1 & \text{if pixel i is immediately to the right of pixel j}\\
    0 & \text{otherwise}.
  \end{cases}
\end{equation*}`

With only a small nunber of parameters per layer it is feasible to add
tens of layers.

Similarly, a _recurrent neural network_ can be designed to handle
temporal dependencies for time series or speech recognition.

---

### Avoiding Over-Fitting

Both `$L_1$` and `$L_2$` regularization are used.

Another strategy is _dropout_:

* In each epoch keep a node with probability `$p$` and drop with
  probability `$1-p$`.

* In the final fit multiply each node's output by `$p$`.

This simulates an ensemble method fitting many networks, but costs
much less.

Random starts are an important component of fitting networks.

Stopping early, combined with random starts and randomness from
stochastic gradient descent, is also thought to be an effective
regularization.

Cross-validation during training can be used to determine when to
stop.

---

### Notes and References

Deep learning methods have been very successful in a number of areas,
such as:

* Image classification and face
  recognition. [_AlexNet_](https://en.wikipedia.org/wiki/AlexNet) is
  a very successful image classifier.

* [_Google Translate_](https://translate.google.com/) is now based
  on a [deep neural network
  approach](https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html).

* [Speech
  recognition](https://ai.googleblog.com/2015/08/the-neural-networks-behind-google-voice.html).

* Playing [Go](https://web.archive.org/web/20210830105857/https://deepmind.com/research/case-studies/alphago-the-story-so-far) and chess.

Being able to effectively handle large data sets is an important
consideration in this research.

Highly parallel GPU based and distributed architectures are often
needed.

---

Some issues:

* Very large training data sets are often needed.

* In high dimensional problems having a high signal to noise ratio
  seems to be needed.

* Models can be very brittle -- small data perturbations can lead to
  [very wrong results](https://arxiv.org/pdf/1610.08401.pdf).

* Biases in data will lead to biases in predictions. A [probably
  harmless example](http://karpathy.github.io/2015/10/25/selfie/)
  deals with evaluating selfies in social media; there are much more
  serious examples.

Some R packages for deep learning include `darch`,
`deepnet`, `deepr`, `domino`, `h2o`,
`keras`.

---

Some references:

* A nice introduction was provided by Thomas Lumley in a [2019 Ihaka
  Lecture](https://www.auckland.ac.nz/en/science/about-the-faculty/department-of-statistics/ihaka-lecture-series.html).

* [deeplearning.net](https://web.archive.org/web/20180322004528/https://www.crunchbase.com/organization/deep-learning)
  web site.

* Li Deng and Dong Yu (2014), [_Deep Learning: Methods and
  Applications_](https://www.microsoft.com/en-us/research/publication/deep-learning-methods-and-applications/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F209355%2Fdeeplearning-nowpublishing-vol7-sig-039.pdf).

* Charu Aggarwal (2018), _Neural Networks and Deep Learning_.

* [A Primer on Deep
  Learning](https://www.datarobot.com/blog/a-primer-on-deep-learning/).

* A [blog
  post](http://www.rblog.uni-freiburg.de/2017/02/07/deep-learning-in-r/)
  on deep learning software in R.

* A nice [simulator](http://playground.tensorflow.org).

Some examples are available 
[on line](../examples/keras.Rmd).

---
layout: true
## Mixture of Experts
---

Mixture models for prediction of `$y$` based on fearures `$x$` produce
predictive distributions of the form
`\begin{equation*}
  f(y|x) = \sum_{i=1}^M f_i(y|x) \pi_i
\end{equation*}`
with `$f_i$` depending on parameters that need to be learned from
training data.

A generalization allows the mixing probabilities to depend on the
features:

`\begin{equation*}
  f(y|x) = \sum_{i=1}^M f_i(y|x) \pi_i(x)
\end{equation*}`

with `$f_i$` and `$\pi_i$` depending on parameters that need to be
learned.

---

The `$f_i$` are referred to as _experts_, with different experts
being better informed about different ranges of `$x$` values, and `$f$`
this is called a _mixture of experts_.

Tree models can be viewed as a special case of a mixture of experts
with `$\pi_i(x) \in \{0,1\}$`.

The mixtures `$\pi_i$` can themselves be modeled as a mixture of
experts.  This is the _hierarchical mixture of experts_ (HME)
model.

//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505