class: center, middle, title-slide

.title[
# Additive Models
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---
layout: true

<link rel="stylesheet" href="stat7400.css" type="text/css" />
<style type="text/css">
.remark-code { font-size: 75%; }
</style>

<!-- $ -- get emacs sane again -->

## Basics

---

One approach to flexible modeling with multiple predictors is to use
_additive models_:

`\begin{equation*}
Y = \beta_0 + f_1(x_1) + \dots + f_p(x_p) + \epsilon
\end{equation*}`

where the `\(f_j\)` are assumed smooth.

--

Variations include

--

* some linear and some smooth terms
  `$$Y = \beta_0 + \beta_1 x_1 + f_2(x_2) + \epsilon$$`

--

* some bivariate smooth terms
  `$$Y = f_1(x_1) + f_{23}(x_2,x_3) + \epsilon$$`

---

A joint model using basis functions would be of the form

$$
Y = X_0\beta + X_1\delta_1 + \dots + X_p\delta_p + \epsilon
$$

--

with penalized objective function

`\begin{equation*}
\renewcommand{\Norm}[1]{\|{#1}\|}
\Norm{Y - X_0\beta - \sum_{i=1}^pX_i\delta_i}^2 +
\sum_{i=1}^p \lambda_i \delta_i^T D_i \delta_i
\end{equation*}`

--

The model can be fit using the mixed model formulation with `\(p\)`
independent variance components.

--

An alternative is the _backfitting algorithm_.

---
layout: true

## Backfitting Algorithm

---

For a model

`\begin{equation*}
f(x) = \beta_0 + \sum_{j=1}^p f_j(x_j)
\end{equation*}`

with data `\(y_i, x_{ij}\)` and smoothers `\(S_j\)`

--

* initialize `\(\widehat{\beta}_0 = \overline{y}\)`

--

* repeat
  `$$\begin{align*}
  \widehat{f}_j &\leftarrow S_j\left[\{y_i-\widehat{\beta}_0-\sum_{k \neq j} \widehat{f}_k(x_{ik})\}_1^n\right]\\
  \widehat{f}_j &\leftarrow \widehat{f}_j - \frac{1}{n}\sum_{i=1}^n \widehat{f}_j(x_{ij})
  \end{align*}$$`
  until the changes in the `\(\widehat{f}_j\)` are below some threshold.

---

A more complex linear (parametric) term is handled analogously.

--

For penalized linear smoothers with fixed smoothing parameters this can be
viewed as solving the equations for the minimizer by a block Gauss-Seidel
algorithm.

--

Different smoothers can be used on each variable.

--

Smoothing parameters can be adjusted during each pass or jointly.

--

* `bruto` (Hastie and Tibshirani, 1990) uses a variable selection/smoothing
  parameter selection pass based on approximate GCV.

--

* `gam` from package `mgcv` uses GCV.

--

Backfitting may allow larger models to be fit.

--

Backfitting can be viewed as one of several ways of fitting penalized/mixed
models.

--

Some examples are available [online](../examples/additive.Rmd).

---
layout: true

## Example: Ozone Levels

---

.pull-left.width-40[

The data set relates ozone levels to the Daggett pressure gradient, inversion
base height, and inversion base temperature.

{{content}}
]

--

A `gam` fit is produced by

```r
library(mgcv)
data(calif.air.poll, package = "SemiPar")
fit <- gam(ozone.level ~
               s(daggett.pressure.gradient) +
               s(inversion.base.height) +
               s(inversion.base.temp),
           data = calif.air.poll)
```
{{content}}

--

The default plot method produces

--

.pull-right.width-60[
<img src="additive_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
layout: false

## Mixed Additive Models

Mixed additive models can be written as

`\begin{equation*}
Y = X_0 \beta + Z U + X_1 \delta_1 + \dots + X_p\delta_p + \epsilon
\end{equation*}`

--

where `\(U\)` is a "traditional" random effects term with

`\begin{equation*}
U \sim N(0,\Sigma(\theta))
\end{equation*}`

--

for some parameter `\(\theta\)` and the terms
`\(X_1 \delta_1 + \dots + X_p\delta_p\)` represent smooth additive terms.

--

In principle these can be fit with ordinary penalized least squares or mixed
models software.
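---

## Mixed Additive Models

A minimal sketch of such a fit using `mgcv::gamm` (the data frame `d`,
response `y`, covariates `z`, `x1`, `x2`, and grouping factor `g` are
hypothetical placeholders, not part of the original examples):

```r
library(mgcv)

## y ~ z + s(x1) + s(x2): parametric part X_0 beta plus two smooth terms;
## random = list(g = ~ 1): a random intercept U for each level of g.
fit <- gamm(y ~ z + s(x1) + s(x2),
            random = list(g = ~ 1),
            data = d)

summary(fit$gam)  # smooth and parametric terms
summary(fit$lme)  # underlying mixed model, including variance components
```

The Sitka pines example that follows is a concrete instance of this pattern.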
---
layout: true

## Example: Sitka Pines Experiment

---

An experiment on Sitka pines measured size over time for 79 trees grown in an
ozone-rich environment and a control environment.

--

Measurements were taken at 13 time points.

--

.pull-left[
```r
data(sitka, package = "SemiPar")
library(lattice)
sitka$ozone.char <- ifelse(sitka$ozone, "ozone", "control")
xyplot(log.size ~ days | ozone.char, groups = id.num, type = "b", data = sitka)
```
]
.pull-right[
<img src="additive_files/figure-html/sitka-data-1.png" style="display: block; margin: auto;" />
]

---

The plot suggests a model with

--

* a smooth term for time

--

* a mean shift for ozone

--

* a random intercept for trees

--

* perhaps also a random slope for trees

--

The random intercept model can be fit with `spm` using (not working at present)

```r
library(SemiPar)
attach(sitka)
fit <- spm(log.size ~ ozone + f(days), random = ~ 1, group = id.num)
```

--

and by `gamm` with

```r
trees <- as.factor(sitka$id.num)
fit <- gamm(log.size ~ ozone + s(days),
            random = list(trees = ~ 1), data = sitka)
```
<!-- $ make emacs sane -->

---

`spm` cannot fit a more complex random effects structure at this point.

--

Using `gamm` we can fit random slope and intercept with

```r
fit <- gamm(log.size ~ ozone + s(days),
            random = list(trees = ~ 1 + days), data = sitka)
```

--

.pull-left[
Residuals don't show any further obvious pattern.

Autocorrelated errors over time might be worth considering.
]
.pull-right[
<img src="additive_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
layout: false

## Generalized Additive Models

Standard generalized linear models include

`\begin{equation*}
y_i \sim \text{Bernoulli}\left(\frac{\exp\{(X\beta)_i\}}{1+\exp\{(X\beta)_i\}}\right)
\end{equation*}`

--

and

`\begin{equation*}
y_i \sim \text{Poisson}\left(\exp\{(X\beta)_i\}\right)
\end{equation*}`

--

Maximum likelihood estimates can be computed by _iteratively reweighted least
squares (IRWLS)_.

--

Penalized maximum likelihood estimates maximize

`\begin{equation*}
\text{Loglik}(y, X_0\beta+X_1\delta) - \frac{1}{2}\lambda \delta^T D \delta
\end{equation*}`

--

This has a mixed model/Bayesian interpretation.

--

GLMM (generalized linear mixed model) software can be used.

--

The IRWLS algorithm can be merged with backfitting.

---
layout: true

## Example: Trade Union Membership

---

Data relating union membership to various characteristics are available.

--

A Bernoulli generalized additive model relates the probability of union
membership to the available predictor variables.

--

One possible model is fit by

```r
data(trade.union, package = "SemiPar")
fit <- gam(union.member ~
               s(wage) + s(years.educ) + s(age) + female + race + south,
           family = binomial,
           subset = wage < 40, # remove high leverage point
           data = trade.union)
```

---

.pull-left.width-45[

The estimated smooth terms are

<img src="additive_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

--

.pull-right.width-55.tiny-code[

Some summary information on the smooth terms:

```r
summary(fit)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## union.member ~ s(wage) + s(years.educ) + s(age) + female + race + 
##     south
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.2434     0.4614  -0.527  0.59785   
## female       -0.7101     0.2670  -2.660  0.00782 **
## race         -0.3939     0.1615  -2.439  0.01472 * 
## south        -0.5209     0.2950  -1.765  0.07750 . 
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df Chi.sq  p-value    
## s(wage)       2.814  3.520 22.420 0.000117 ***
## s(years.educ) 2.951  3.716  6.020 0.204710    
## s(age)        1.000  1.000  2.279 0.131177    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.113   Deviance explained = 12.7%
## UBRE = -0.1362  Scale est. = 1         n = 533
```
]
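---

Fitted membership probabilities can then be obtained with `predict`; a small
illustrative sketch (not part of the original example):

```r
## predicted probabilities of union membership on the response scale
p.hat <- predict(fit, type = "response")

## average predicted probability versus the observed membership rate
mean(p.hat)
mean(trade.union$union.member[trade.union$wage < 40])
```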
<!--
%
% ## Inference for Penalized Splines
%
%%**** skip this section or replace with stuff from Wood?
%
% Assume that
% \begin{equation*}
%   Y_i = m(x_i) + \epsilon_i
% \end{equation*}
% with the `\(\epsilon_i\)` independent `\(N(0,\sigma_\epsilon^2)\)`.
%
% ## Linear Smoothers and Equivalent Kernels
% \begin{itemize}
% \item For linear smoothers
%   \begin{equation*}
%     \widehat{m}(x) = \ell_x^T Y
%   \end{equation*}
%   for some vector `\(\ell_x\)`.
% \item For the Nadaraya-Watson estimator, `\(\ell_x = (w_1,\dots,w_n)^T\)` with
%   \begin{equation*}
%     w_i = \frac{K((x - x_i)/h)}{\sum_j K((x-x_j)/h)}
%   \end{equation*}
% \item For a penalized spline of order 3
%   \begin{equation*}
%     \ell_x = C (C^T C + \lambda D)^{-1}C_x^T
%   \end{equation*}
%   with `\(C_x = (1, x, |x-\kappa_1|^3,\dots,|x-\kappa_K|^3)\)` and
%   \begin{equation*}
%     C =
%     \begin{bmatrix}
%       C_{x_1}\\
%       \vdots\\
%       C_{x_n}
%     \end{bmatrix}
%   \end{equation*}
%
% \item `\(\ell_x\)` is sometimes called the _equivalent kernel_ or
%   _equivalent local kernel_ of the smoother.
%   \begin{center}
%     \includegraphics[width=4in]{equivKern}
%   \end{center}
%
% lambda <- 0.3928105
%
% plot(predict(smooth.spline(sapply(1:18, function(i) smooth.spline(ifelse(1:18 == i, 1, 0),spar=lambda)$y[10]),spar=0),seq(1,18,len=100)),type="l",xlab="x",ylab="Equivalent Kernels")
%
% lines(predict(smooth.spline(sapply(1:18, function(i) smooth.spline(ifelse(1:18 == i, 1, 0),spar=lambda)$y[5]),spar=0),seq(1,18,len=100)),col="red")
%
% lines(predict(smooth.spline(sapply(1:18, function(i) smooth.spline(ifelse(1:18 == i, 1, 0),spar=lambda)$y[15]),spar=0),seq(1,18,len=100)),col="blue")
%
% title("Smoothing Spline with 18 Equally Spaced Knots")
%
% dev.copy2eps(file="equivKern.eps")
% \end{itemize}
%
% ## Pointwise Confidence and Prediction Intervals
% \begin{itemize}
% \item The standard deviation of `\(\widehat{m}(x) = \ell_x^T Y\)` is
%   $$
%   \SD(\widehat{m}(x)) = \sqrt{\ell_x^T (\sigma_\epsilon^2 I) \ell_x}
%   = \sigma_\epsilon \Norm{\ell_x}
%   $$
% \item A large sample confidence interval for `\(E[\widehat{m}(x)]\)` is
%   $$
%   \widehat{m}(x) \pm z_{1-\alpha/2} \widehat{\sigma}_\epsilon \Norm{\ell_x}
%   $$
% \item A reasonable small sample interval for `\(E[\widehat{m}(x)]\)` is
%   $$
%   \widehat{m}(x) \pm t_{\text{df}_\text{res},1-\alpha/2}
%   \widehat{\sigma}_\epsilon \Norm{\ell_x}
%   $$
% \item Corresponding prediction intervals are
%   $$
%   \widehat{m}(x) \pm z_{1-\alpha/2} \widehat{\sigma}_\epsilon
%   \sqrt{1+\Norm{\ell_x}^2}
%   $$
%   and
%   $$
%   \widehat{m}(x) \pm t_{\text{df}_\text{res},1-\alpha/2}
%   \widehat{\sigma}_\epsilon \sqrt{1+\Norm{\ell_x}^2}
%   $$
% \item These intervals do not account for uncertainty in the smoothing
%   parameter.
% \item Because of bias these confidence intervals are not strictly
%   intervals for `\(m(x)\)`.
% \item Ruppert, Wand, and Carroll argue that for penalized splines a
%   standard deviation of the form
%   \begin{equation*}
%     \SD(\widehat{m}(x) - m(x)) =
%     \sigma_\epsilon\sqrt{C_x(C^T C + \lambda D)^{-1}C_x^T}
%   \end{equation*}
%   incorporates expected bias based on the randomness of `\(\delta\)` in
%   the mixed model formulation. The intervals
%   \begin{equation*}
%     \widehat{m}(x) \pm z_{1-\alpha/2} \widehat{\sigma}_\epsilon
%     \sqrt{C_x(C^T C + \lambda D)^{-1}C_x^T}
%   \end{equation*}
%   are slightly wider at `\(x\)` values with high curvature and bias and
%   arguably a better choice.
% \item These intervals are also Bayesian HPD intervals for a flat prior
%   on `\(\beta\)` and an independent `\(N(0,(\lambda D)^{-1})\)`
%   prior on `\(\delta\)`.
% \item Wood (2017) describes some approaches for accounting for
%   uncertainty about `\(\lambda\)`.
% \end{itemize}
%
% ### Simultaneous Confidence Bands
% \begin{itemize}
% \item The sampling distribution of the coefficient estimates is
%   approximately
%   \begin{equation*}
%     \begin{bmatrix}
%       \widehat{\beta}-\beta\\
%       \widehat{\delta}-\delta
%     \end{bmatrix}
%     \sim N(0,\sigma_\epsilon^2(C^T C + \lambda D)^{-1})
%   \end{equation*}
% \item For a grid of values `\(g = \{g_1,\dots,g_M\}\)` the distribution of
%   \begin{equation*}
%     \max_{x \in g} \left(
%     \frac{\widehat{m}(x)-m(x)}{\widehat{SD}(\widehat{m}(x)-m(x))}
%     \right)
%   \end{equation*}
%   is approximately pivotal and approximately equal to the distribution of
%   \begin{equation*}
%     W_g = \max_{x \in g} \left(
%     \frac{C_x
%     \begin{bmatrix}
%       \widehat{\beta}-\beta\\
%       \widehat{\delta}-\delta
%     \end{bmatrix}}{\widehat{SD}(\widehat{m}(x)-m(x))}
%     \right)
%   \end{equation*}
% \item The upper `\(\alpha\)` quantile `\(W_{g,1-\alpha}\)` of the distribution
%   of `\(W_g\)` can be estimated by simulation.
% \item An approximate simultaneous confidence band is then
%   \begin{equation*}
%     \widehat{m}(x) \pm W_{g,1-\alpha} \widehat{SD}(\widehat{m}(x)-m(x))
%   \end{equation*}
%   for all `\(x \in g\)`.
% \item Alternative asymptotic approximations to `\(W_{g,1-\alpha}\)` based
%   on _upcrossing theory_ are also available.
% \end{itemize}
-->
---
layout: true

## Alternative Penalties

---

Bases for function spaces are infinite dimensional.

--

Some form of penalty or _regularization_ is needed.

--

Penalties often have a useful Bayesian interpretation.

--

Most common penalties on coefficients `\(\delta\)`:

* quadratic, `\(\sum \delta_i^2\)` or, more generally, `\(\delta^T D \delta\)`

--

* absolute value, `\(L_1\)`, LASSO: `\(\sum |\delta_i|\)`

---

### Ridge Regression

_Ridge regression_ uses the `\(L_2\)` penalty `\(\lambda \sum \delta_i^2\)`.

--

Using a quadratic penalty `\(\delta^T D \delta\)` with strictly positive
definite `\(D\)` is sometimes called _generalized ridge regression_.

--

The minimizer of `\(\Norm{Y - X\delta}^2 + \lambda \delta^T D \delta\)` is

`\begin{equation*}
\widehat{\delta}_\lambda = (X^T X + \lambda D)^{-1} X^T Y
\end{equation*}`

--

This shrinks the OLS estimate towards zero as `\(\lambda \to \infty\)`.

--

If `\(X^TX = D = I\)` then the ridge regression estimate is

`\begin{equation*}
\widehat{\delta}_\lambda = \frac{1}{1+\lambda}\widehat{\delta}_\text{OLS}
\end{equation*}`

---

### LASSO

The LASSO (Least Absolute Shrinkage and Selection Operator) or
`\(L_1\)`-penalized minimization problem

$$
\min_\delta\{\Norm{Y - X\delta}^2 + 2\lambda\sum |\delta_i|\}
$$

does not in general have a closed form solution.

--

But if `\(X^T X = I\)` then

`\begin{equation*}
\widehat{\delta}_{i,\lambda} =
\text{sign}(\widehat{\delta}_{i,\text{OLS}})
(|\widehat{\delta}_{i,\text{OLS}}| - \lambda)_{+}
\end{equation*}`

--

The OLS estimates are shifted towards zero and truncated at zero.

--

The `\(L_1\)` penalty approach has a Bayesian interpretation as a posterior
mode for a Laplace or double exponential prior.
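---

### LASSO

In the orthonormal case the solution is just coordinatewise soft thresholding;
a small illustrative sketch (not part of the original notes):

```r
## soft-thresholding operator: sign(b) * (|b| - lambda)_+
soft.threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

delta.ols <- c(-3, -0.5, 0.2, 1.5)
soft.threshold(delta.ols, lambda = 1)
## components with |delta| <= lambda are set to zero;
## the others are moved toward zero by lambda
```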
---

The variable selection property of the `\(L_1\)` penalty is particularly
appealing when the number of regressors is large, possibly larger than the
number of observations.

--

For least squares regression with the LASSO penalty

--

* the _solution path_ as `\(\lambda\)` varies is piecewise linear

--

* there are algorithms for computing the entire solution path efficiently

--

* common practice is to plot the coefficients `\(\beta_j(\lambda)\)` against
  the _shrinkage factor_ `\(s = \Norm{\beta(\lambda)}_1/\Norm{\beta(\infty)}_1\)`

--

R packages implementing general `\(L_1\)`-penalized regression include
`lars`, `lasso2`, and `glmnet`.

--

A [paper](https://web.archive.org/web/20180723050722/http://statweb.stanford.edu/~tibs/ftp/covtest.pdf),
[talk slides](https://web.archive.org/web/20180723050728/http://statweb.stanford.edu/~tibs/ftp/covtest-talk.pdf),
and [R package](https://cran.r-project.org/package=covTest) present a
significance test for coefficients entering the model.

---

### Elastic Net

The _elastic net_ penalty is a combination of the LASSO and Ridge penalties:

`\begin{equation*}
\lambda \left[(1 - \alpha) \sum \delta_i^2 + 2 \alpha \sum |\delta_i|\right]
\end{equation*}`

--

* Ridge regression corresponds to `\(\alpha = 0\)`.

--

* LASSO corresponds to `\(\alpha = 1\)`.

--

`\(\lambda\)` and `\(\alpha\)` can be estimated by cross-validation.

--

Elastic net was introduced to address some shortcomings of LASSO, including

--

* inability to select more than `\(n\)` predictors in `\(p > n\)` problems;

--

* tendency to select only one of a group of correlated predictors.

--

The `glmnet` package implements elastic net regression.

--

Scaling of predictors is important; by default `glmnet` standardizes before
fitting.

---

### Non-Convex Penalties

The elastic net penalties are convex for all `\(\alpha\)`.

--

This greatly simplifies the optimization problem to be solved.

--

LASSO and other elastic net fits tend to select more variables than needed.

--

Some non-convex penalties have the theoretical property of consistently
estimating the set of covariates with non-zero coefficients under some
asymptotic formulations.

--

Some also reduce the bias for the non-zero coefficient estimates.

--

Some examples are

* smoothly clipped absolute deviation (SCAD);

--

* minimax concave penalty (MCP).

---

MCP is of the form `\(\sum \rho(\delta_i, \lambda, \gamma)\)` with

`\begin{equation*}
\rho(x, \lambda, \gamma) =
\begin{cases}
\lambda |x| - \frac{x^2}{2\gamma} & |x| \le \gamma\lambda\\
\frac{1}{2}\gamma\lambda^2 & \text{otherwise}
\end{cases}
\end{equation*}`

for `\(\gamma > 1\)`.

--

This behaves like `\(\lambda |x|\)` for small `\(|x|\)` and smoothly
transitions to a constant for large `\(|x|\)`. SCAD is similar in shape.

--

Jian Huang and Patrick Breheny have worked extensively on these.
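---

### Software

A brief sketch of fitting these penalized regressions in R, on simulated data
(illustrative only, not part of the original notes):

```r
library(glmnet)   # LASSO / elastic net
library(ncvreg)   # SCAD and MCP

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)

## elastic net; alpha = 0 is ridge, alpha = 1 is LASSO,
## lambda is chosen by cross-validation
cvfit <- cv.glmnet(X, y, alpha = 0.5)
coef(cvfit, s = "lambda.min")

## MCP fit along a lambda path
mfit <- ncvreg(X, y, penalty = "MCP")
```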
---
layout: true

## Alternative Bases

---

Many other bases are used, including

* polynomials

--

* trigonometric polynomials (Fourier basis)

--

* wavelet bases

--

Different bases are more suitable for modeling different functions.

--

General idea: choose a basis in which the target can be approximated well
with a small number of basis elements.

---

### Wavelets

Wavelet smoothing often assumes observations at `\(N=2^J\)` equally spaced
points and uses an orthonormal basis of `\(N\)` vectors organized in `\(J\)`
levels.

--

Some wavelet bases:

<img src="additive_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

A common approach for wavelet smoothing is to use `\(L_1\)` shrinkage with

`\begin{equation*}
\lambda = \widehat{\sigma} \sqrt{2\log N}
\end{equation*}`

--

A variant is to use different levels of smoothing at each level of detail.

--

`\(\widehat{\sigma}\)` is usually estimated by assuming the highest frequency
level is pure noise.

--

Several R packages are available for wavelet modeling, including `waveslim`,
`rwt`, `wavethresh`, and `wavelets`.

--

Matlab has very good wavelet toolboxes.

--

S-Plus also used to have a good wavelet library.

<!--
An example of a recent paper on wavelet methodology is

> M. I. Knight and G. P. Nason (2009) A 'nondecimated' lifting transform,
> Statistics and Computing 19(1), 1--16.
-->

---
layout: false

## Other Approaches

--

MARS, multivariate adaptive regression splines. Available in the `mda`
package.

--

`polymars` in package `polspline`.

--

Smoothing spline ANOVA.

--

Projection pursuit regression.

--

Single and multiple index models.

--

Neural networks.

--

Tree models.

--

Some examples are available [online](../examples/additive.Rmd).
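---

## Other Approaches

A quick sketch of two of these, reusing the ozone data from earlier
(illustrative only, not part of the original notes):

```r
data(calif.air.poll, package = "SemiPar")

## projection pursuit regression (base R)
fit.ppr <- ppr(ozone.level ~ daggett.pressure.gradient +
                   inversion.base.height + inversion.base.temp,
               data = calif.air.poll, nterms = 2)

## MARS via the mda package
library(mda)
x <- as.matrix(calif.air.poll[, c("daggett.pressure.gradient",
                                  "inversion.base.height",
                                  "inversion.base.temp")])
fit.mars <- mars(x, calif.air.poll$ozone.level)
```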
<!--
adapted from Emi Tanaka's gist at
https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505
-->