class: center, middle, title-slide

.title[
# Additive Models
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-19
]

---
layout: true

<link rel="stylesheet" href="stat7400.css" type="text/css" />
<style type="text/css">
.remark-code { font-size: 75%; }
</style>

<!-- $ -- get emacs sane again -->

## Basics

---

One approach to flexible modeling with multiple predictors is to use
_additive models_:

`\begin{equation*}
Y = \beta_0 + f_1(x_1) + \dots + f_p(x_p) + \epsilon
\end{equation*}`

where the `\(f_j\)` are assumed smooth.

--

Variations include

--

* some linear and some smooth terms
  `$$Y = \beta_0 + \beta_1 x_1 + f_2(x_2) + \epsilon$$`

--

* some bivariate smooth terms
  `$$Y = f_1(x_1) + f_{23}(x_2,x_3) + \epsilon$$`

---

A joint model using basis functions would be of the form

$$
Y = X_0\beta + X_1\delta_1 + \dots + X_p\delta_p + \epsilon
$$

--

with penalized objective function

`\begin{equation*}
\renewcommand{\Norm}[1]{\|{#1}\|}
\Norm{Y - X_0\beta - \sum_{i=1}^pX_i\delta_i}^2 +
\sum_{i=1}^p \lambda_i \delta_i^T D_i \delta_i
\end{equation*}`

--

The model can be fit using the mixed model formulation with `\(p\)`
independent variance components.

--

An alternative is the _backfitting algorithm_.

---
layout: true

## Backfitting Algorithm

---

For a model

`\begin{equation*}
f(x) = \beta_0 + \sum_{j=1}^p f_j(x_j)
\end{equation*}`

with data `\(y_i, x_{ij}\)` and smoothers `\(S_j\)`

--

* initialize `\(\widehat{\beta}_0 = \overline{y}\)`

--

* repeat
  `$$\begin{align*}
  \widehat{f}_j &\leftarrow S_j\left[\{y_i-\widehat{\beta}_0-\sum_{k \neq j} \widehat{f}_k(x_{ik})\}_1^n\right]\\
  \widehat{f}_j &\leftarrow \widehat{f}_j - \frac{1}{n}\sum_{i=1}^n \widehat{f}_j(x_{ij})
  \end{align*}$$`
  until the changes in the `\(\widehat{f}_j\)` are below some threshold.

---

A more complex linear (parametric) term is handled analogously.

--

For penalized linear smoothers with fixed smoothing parameters this can be
viewed as solving the equations for the minimizer by a block Gauss-Seidel
algorithm.

--

Different smoothers can be used on each variable.

--

Smoothing parameters can be adjusted during each pass or jointly.

--

* `bruto` (Hastie and Tibshirani, 1990) uses a variable selection/smoothing
  parameter selection pass based on approximate GCV.

--

* `gam` from package `mgcv` uses GCV.

--

Backfitting may allow larger models to be fit.

--

Backfitting can be viewed as one of several ways of fitting penalized/mixed
models.

--

Some examples are available [online](../examples/additive.Rmd).

---
layout: true

## Example: Ozone Levels

---

.pull-left.width-40[

The data set relates ozone levels to the Daggett pressure gradient, inversion
base height, and inversion base temperature.

{{content}}
]

--

A `gam` fit is produced by

```r
library(mgcv)
data(calif.air.poll, package = "SemiPar")
fit <- gam(ozone.level ~
               s(daggett.pressure.gradient) +
               s(inversion.base.height) +
               s(inversion.base.temp),
           data = calif.air.poll)
```
{{content}}

--

The default plot method produces

--

.pull-right.width-60[
<img src="additive_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---
layout: false

## Mixed Additive Models

Mixed additive models can be written as

`\begin{equation*}
Y = X_0 \beta + Z U + X_1 \delta_1 + \dots + X_p\delta_p + \epsilon
\end{equation*}`

--

where `\(U\)` is a "traditional" random effects term with

`\begin{equation*}
U \sim N(0,\Sigma(\theta))
\end{equation*}`

--

for some parameter `\(\theta\)` and the terms
`\(X_1 \delta_1 + \dots + X_p\delta_p\)` represent smooth additive terms.

--

In principle these can be fit with ordinary penalized least squares or mixed
models software.
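---

## Mixed Additive Models

A minimal sketch of such a fit using `mgcv::gamm` (the data frame `d`,
response `y`, covariates `z`, `x1`, `x2`, and grouping factor `g` are
hypothetical placeholders, not part of the original examples):

```r
library(mgcv)

## y ~ z + s(x1) + s(x2): parametric part X_0 beta plus two smooth terms;
## random = list(g = ~ 1): a random intercept U for each level of g.
fit <- gamm(y ~ z + s(x1) + s(x2),
            random = list(g = ~ 1),
            data = d)

summary(fit$gam)  # smooth and parametric terms
summary(fit$lme)  # underlying mixed model, including variance components
```

The Sitka pines example that follows is a concrete instance of this pattern.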
---
layout: true

## Example: Sitka Pines Experiment

---

An experiment on Sitka pines measured size over time for 79 trees grown in an
ozone-rich environment and a control environment.

--

Measurements were taken at 13 time points.

--

.pull-left[
```r
data(sitka, package = "SemiPar")
library(lattice)
sitka$ozone.char <- ifelse(sitka$ozone, "ozone", "control")
xyplot(log.size ~ days | ozone.char, groups = id.num, type = "b", data = sitka)
```
]
.pull-right[
<img src="additive_files/figure-html/sitka-data-1.png" style="display: block; margin: auto;" />
]

---

The plot suggests a model with

--

* a smooth term for time

--

* a mean shift for ozone

--

* a random intercept for trees

--

* perhaps also a random slope for trees

--

The random intercept model can be fit with `spm` using (not working at present)

```r
library(SemiPar)
attach(sitka)
fit <- spm(log.size ~ ozone + f(days), random = ~ 1, group = id.num)
```

--

and by `gamm` with

```r
trees <- as.factor(sitka$id.num)
fit <- gamm(log.size ~ ozone + s(days),
            random = list(trees = ~ 1), data = sitka)
```
<!-- $ make emacs sane -->

---

`spm` cannot fit a more complex random effects structure at this point.

--

Using `gamm` we can fit random slope and intercept with

```r
fit <- gamm(log.size ~ ozone + s(days),
            random = list(trees = ~ 1 + days), data = sitka)
```

--

.pull-left[
Residuals don't show any further obvious pattern.

Autocorrelated errors over time might be worth considering.
]
.pull-right[
<img src="additive_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
layout: false

## Generalized Additive Models

Standard generalized linear models include

`\begin{equation*}
y_i \sim \text{Bernoulli}\left(\frac{\exp\{(X\beta)_i\}}{1+\exp\{(X\beta)_i\}}\right)
\end{equation*}`

--

and

`\begin{equation*}
y_i \sim \text{Poisson}\left(\exp\{(X\beta)_i\}\right)
\end{equation*}`

--

Maximum likelihood estimates can be computed by _iteratively reweighted least
squares (IRWLS)_.

--

Penalized maximum likelihood estimates maximize

`\begin{equation*}
\text{Loglik}(y, X_0\beta+X_1\delta) - \frac{1}{2}\lambda \delta^T D \delta
\end{equation*}`

--

This has a mixed model/Bayesian interpretation.

--

GLMM (generalized linear mixed model) software can be used.

--

The IRWLS algorithm can be merged with backfitting.

---
layout: true

## Example: Trade Union Membership

---

Data relating union membership to various characteristics are available.

--

A Bernoulli generalized additive model relates the probability of union
membership to the available predictor variables.

--

One possible model is fit by

```r
data(trade.union, package = "SemiPar")
fit <- gam(union.member ~
               s(wage) + s(years.educ) + s(age) + female + race + south,
           family = binomial,
           subset = wage < 40, # remove high leverage point
           data = trade.union)
```

---

.pull-left.width-45[

The estimated smooth terms are

<img src="additive_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

--

.pull-right.width-55.tiny-code[

Some summary information on the smooth terms:

```r
summary(fit)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## union.member ~ s(wage) + s(years.educ) + s(age) + female + race + 
##     south
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.2434     0.4614  -0.527  0.59785   
## female       -0.7101     0.2670  -2.660  0.00782 **
## race         -0.3939     0.1615  -2.439  0.01472 * 
## south        -0.5209     0.2950  -1.765  0.07750 . 
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df Chi.sq  p-value    
## s(wage)       2.814  3.520 22.420 0.000117 ***
## s(years.educ) 2.951  3.716  6.020 0.204710    
## s(age)        1.000  1.000  2.279 0.131177    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.113   Deviance explained = 12.7%
## UBRE = -0.1362  Scale est. = 1         n = 533
```
]
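---

Fitted membership probabilities can then be obtained with `predict`; a small
illustrative sketch (not part of the original example):

```r
## predicted probabilities of union membership on the response scale
p.hat <- predict(fit, type = "response")

## average predicted probability versus the observed membership rate
mean(p.hat)
mean(trade.union$union.member[trade.union$wage < 40])
```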
<!--
%
% ## Inference for Penalized Splines
%
%%**** skip this section or replace with stuff from Wood?
%
% Assume that
% \begin{equation*}
%   Y_i = m(x_i) + \epsilon_i
% \end{equation*}
% with the `\(\epsilon_i\)` independent `\(N(0,\sigma_\epsilon^2)\)`.
%
% ## Linear Smoothers and Equivalent Kernels
% \begin{itemize}
% \item For linear smoothers
%   \begin{equation*}
%     \widehat{m}(x) = \ell_x^T Y
%   \end{equation*}
%   for some vector `\(\ell_x\)`.
% \item For the Nadaraya-Watson estimator, `\(\ell_x = (w_1,\dots,w_n)^T\)` with
%   \begin{equation*}
%     w_i = \frac{K((x - x_i)/h)}{\sum_j K((x-x_j)/h)}
%   \end{equation*}
% \item For a penalized spline of order 3
%   \begin{equation*}
%     \ell_x = C (C^T C + \lambda D)^{-1}C_x^T
%   \end{equation*}
%   with `\(C_x = (1, x, |x-\kappa_1|^3,\dots,|x-\kappa_K|^3)\)` and
%   \begin{equation*}
%     C =
%     \begin{bmatrix}
%       C_{x_1}\\
%       \vdots\\
%       C_{x_n}
%     \end{bmatrix}
%   \end{equation*}
%
% \item `\(\ell_x\)` is sometimes called the _equivalent kernel_ or
%   _equivalent local kernel_ of the smoother.
%   \begin{center}
%     \includegraphics[width=4in]{equivKern}
%   \end{center}
%
% lambda <- 0.3928105
%
% plot(predict(smooth.spline(sapply(1:18, function(i) smooth.spline(ifelse(1:18 == i, 1, 0),spar=lambda)$y[10]),spar=0),seq(1,18,len=100)),type="l",xlab="x",ylab="Equivalent Kernels")
%
% lines(predict(smooth.spline(sapply(1:18, function(i) smooth.spline(ifelse(1:18 == i, 1, 0),spar=lambda)$y[5]),spar=0),seq(1,18,len=100)),col="red")
%
% lines(predict(smooth.spline(sapply(1:18, function(i) smooth.spline(ifelse(1:18 == i, 1, 0),spar=lambda)$y[15]),spar=0),seq(1,18,len=100)),col="blue")
%
% title("Smoothing Spline with 18 Equally Spaced Knots")
%
% dev.copy2eps(file="equivKern.eps")
% \end{itemize}
%
% ## Pointwise Confidence and Prediction Intervals
% \begin{itemize}
% \item The standard deviation of `\(\widehat{m}(x) = \ell_x^T Y\)` is
%   $$
%   \SD(\widehat{m}(x)) = \sqrt{\ell_x^T (\sigma_\epsilon^2 I) \ell_x}
%   = \sigma_\epsilon \Norm{\ell_x}
%   $$
% \item A large sample confidence interval for `\(E[\widehat{m}(x)]\)` is
%   $$
%   \widehat{m}(x) \pm z_{1-\alpha/2} \widehat{\sigma}_\epsilon \Norm{\ell_x}
%   $$
% \item A reasonable small sample interval for `\(E[\widehat{m}(x)]\)` is
%   $$
%   \widehat{m}(x) \pm t_{\text{df}_\text{res},1-\alpha/2}
%   \widehat{\sigma}_\epsilon \Norm{\ell_x}
%   $$
% \item Corresponding prediction intervals are
%   $$
%   \widehat{m}(x) \pm z_{1-\alpha/2} \widehat{\sigma}_\epsilon
%   \sqrt{1+\Norm{\ell_x}^2}
%   $$
%   and
%   $$
%   \widehat{m}(x) \pm t_{\text{df}_\text{res},1-\alpha/2}
%   \widehat{\sigma}_\epsilon \sqrt{1+\Norm{\ell_x}^2}
%   $$
% \item These intervals do not account for uncertainty in the smoothing
%   parameter.
% \item Because of bias these confidence intervals are not strictly
%   intervals for `\(m(x)\)`.
% \item Ruppert, Wand, and Carroll argue that for penalized splines a
%   standard deviation of the form
%   \begin{equation*}
%     \SD(\widehat{m}(x) - m(x)) =
%     \sigma_\epsilon\sqrt{C_x(C^T C + \lambda D)^{-1}C_x^T}
%   \end{equation*}
%   incorporates expected bias based on the randomness of `\(\delta\)` in
%   the mixed model formulation. The intervals
%   \begin{equation*}
%     \widehat{m}(x) \pm z_{1-\alpha/2} \widehat{\sigma}_\epsilon
%     \sqrt{C_x(C^T C + \lambda D)^{-1}C_x^T}
%   \end{equation*}
%   are slightly wider at `\(x\)` values with high curvature and bias and
%   arguably a better choice.
% \item These intervals are also Bayesian HPD intervals for a flat prior
%   on `\(\beta\)` and an independent `\(N(0,(\lambda D)^{-1})\)`
%   prior on `\(\delta\)`.
% \item Wood (2017) describes some approaches for accounting for
%   uncertainty about `\(\lambda\)`.
% \end{itemize}
%
% ### Simultaneous Confidence Bands
% \begin{itemize}
% \item The sampling distribution of the coefficient estimates is
%   approximately
%   \begin{equation*}
%     \begin{bmatrix}
%       \widehat{\beta}-\beta\\
%       \widehat{\delta}-\delta
%     \end{bmatrix}
%     \sim N(0,\sigma_\epsilon^2(C^T C + \lambda D)^{-1})
%   \end{equation*}
% \item For a grid of values `\(g = \{g_1,\dots,g_M\}\)` the distribution of
%   \begin{equation*}
%     \max_{x \in g} \left(
%     \frac{\widehat{m}(x)-m(x)}{\widehat{SD}(\widehat{m}(x)-m(x))}
%     \right)
%   \end{equation*}
%   is approximately pivotal and approximately equal to the distribution of
%   \begin{equation*}
%     W_g = \max_{x \in g} \left(
%     \frac{C_x
%     \begin{bmatrix}
%       \widehat{\beta}-\beta\\
%       \widehat{\delta}-\delta
%     \end{bmatrix}}{\widehat{SD}(\widehat{m}(x)-m(x))}
%     \right)
%   \end{equation*}
% \item The upper `\(\alpha\)` quantile `\(W_{g,1-\alpha}\)` of the distribution
%   of `\(W_g\)` can be estimated by simulation.
% \item An approximate simultaneous confidence band is then
%   \begin{equation*}
%     \widehat{m}(x) \pm W_{g,1-\alpha} \widehat{SD}(\widehat{m}(x)-m(x))
%   \end{equation*}
%   for all `\(x \in g\)`.
% \item Alternative asymptotic approximations to `\(W_{g,1-\alpha}\)` based
%   on _upcrossing theory_ are also available.
% \end{itemize}
-->
---
layout: true

## Alternative Penalties

---

Bases for function spaces are infinite dimensional.

--

Some form of penalty or _regularization_ is needed.

--

Penalties often have a useful Bayesian interpretation.

--

Most common penalties on coefficients `\(\delta\)`:

* quadratic, `\(\sum \delta_i^2\)` or, more generally, `\(\delta^T D \delta\)`

--

* absolute value, `\(L_1\)`, LASSO: `\(\sum |\delta_i|\)`

---

### Ridge Regression

_Ridge regression_ uses the `\(L_2\)` penalty `\(\lambda \sum \delta_i^2\)`.

--

Using a quadratic penalty `\(\delta^T D \delta\)` with strictly positive
definite `\(D\)` is sometimes called _generalized ridge regression_.

--

The minimizer of `\(\Norm{Y - X\delta}^2 + \lambda \delta^T D \delta\)` is

`\begin{equation*}
\widehat{\delta}_\lambda = (X^T X + \lambda D)^{-1} X^T Y
\end{equation*}`

--

This shrinks the OLS estimate towards zero as `\(\lambda \to \infty\)`.

--

If `\(X^TX = D = I\)` then the ridge regression estimate is

`\begin{equation*}
\widehat{\delta}_\lambda = \frac{1}{1+\lambda}\widehat{\delta}_\text{OLS}
\end{equation*}`

---

### LASSO

The LASSO (Least Absolute Shrinkage and Selection Operator) or
`\(L_1\)`-penalized minimization problem

$$
\min_\delta\{\Norm{Y - X\delta}^2 + 2\lambda\sum |\delta_i|\}
$$

does not in general have a closed form solution.

--

But if `\(X^T X = I\)` then

`\begin{equation*}
\widehat{\delta}_{i,\lambda} =
\text{sign}(\widehat{\delta}_{i,\text{OLS}})
(|\widehat{\delta}_{i,\text{OLS}}| - \lambda)_{+}
\end{equation*}`

--

The OLS estimates are shifted towards zero and truncated at zero.

--

The `\(L_1\)` penalty approach has a Bayesian interpretation as a posterior
mode for a Laplace or double exponential prior.
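---

### LASSO

In the orthonormal case the solution is just coordinatewise soft thresholding;
a small illustrative sketch (not part of the original notes):

```r
## soft-thresholding operator: sign(b) * (|b| - lambda)_+
soft.threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

delta.ols <- c(-3, -0.5, 0.2, 1.5)
soft.threshold(delta.ols, lambda = 1)
## components with |delta| <= lambda are set to zero;
## the others are moved toward zero by lambda
```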
---

The variable selection property of the `\(L_1\)` penalty is particularly
appealing when the number of regressors is large, possibly larger than the
number of observations.

--

For least squares regression with the LASSO penalty

--

* the _solution path_ as `\(\lambda\)` varies is piecewise linear

--

* there are algorithms for computing the entire solution path efficiently

--

* common practice is to plot the coefficients `\(\beta_j(\lambda)\)` against
  the _shrinkage factor_ `\(s = \Norm{\beta(\lambda)}_1/\Norm{\beta(\infty)}_1\)`

--

R packages implementing general `\(L_1\)`-penalized regression include
`lars`, `lasso2`, and `glmnet`.

--

A [paper](https://web.archive.org/web/20180723050722/http://statweb.stanford.edu/~tibs/ftp/covtest.pdf),
[talk slides](https://web.archive.org/web/20180723050728/http://statweb.stanford.edu/~tibs/ftp/covtest-talk.pdf),
and [R package](https://cran.r-project.org/package=covTest) present a
significance test for coefficients entering the model.

---

### Elastic Net

The _elastic net_ penalty is a combination of the LASSO and Ridge penalties:

`\begin{equation*}
\lambda \left[(1 - \alpha) \sum \delta_i^2 + 2 \alpha \sum |\delta_i|\right]
\end{equation*}`

--

* Ridge regression corresponds to `\(\alpha = 0\)`.

--

* LASSO corresponds to `\(\alpha = 1\)`.

--

`\(\lambda\)` and `\(\alpha\)` can be estimated by cross-validation.

--

Elastic net was introduced to address some shortcomings of LASSO, including

--

* inability to select more than `\(n\)` predictors in `\(p > n\)` problems;

--

* tendency to select only one of a group of correlated predictors.

--

The `glmnet` package implements elastic net regression.

--

Scaling of predictors is important; by default `glmnet` standardizes before
fitting.

---

### Non-Convex Penalties

The elastic net penalties are convex for all `\(\alpha\)`.

--

This greatly simplifies the optimization problem to be solved.

--

LASSO and other elastic net fits tend to select more variables than needed.

--

Some non-convex penalties have the theoretical property of consistently
estimating the set of covariates with non-zero coefficients under some
asymptotic formulations.

--

Some also reduce the bias for the non-zero coefficient estimates.

--

Some examples are

* smoothly clipped absolute deviation (SCAD);

--

* minimax concave penalty (MCP).

---

MCP is of the form `\(\sum \rho(\delta_i, \lambda, \gamma)\)` with

`\begin{equation*}
\rho(x, \lambda, \gamma) =
\begin{cases}
\lambda |x| - \frac{x^2}{2\gamma} & |x| \le \gamma\lambda\\
\frac{1}{2}\gamma\lambda^2 & \text{otherwise}
\end{cases}
\end{equation*}`

for `\(\gamma > 1\)`.

--

This behaves like `\(\lambda |x|\)` for small `\(|x|\)` and smoothly
transitions to a constant for large `\(|x|\)`. SCAD is similar in shape.

--

Jian Huang and Patrick Breheny have worked extensively on these.
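---

### Software

A brief sketch of fitting these penalized regressions in R, on simulated data
(illustrative only, not part of the original notes):

```r
library(glmnet)   # LASSO / elastic net
library(ncvreg)   # SCAD and MCP

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)

## elastic net; alpha = 0 is ridge, alpha = 1 is LASSO,
## lambda is chosen by cross-validation
cvfit <- cv.glmnet(X, y, alpha = 0.5)
coef(cvfit, s = "lambda.min")

## MCP fit along a lambda path
mfit <- ncvreg(X, y, penalty = "MCP")
```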
---
layout: true

## Alternative Bases

---

Many other bases are used, including

* polynomials

--

* trigonometric polynomials (Fourier basis)

--

* wavelet bases

--

Different bases are more suitable for modeling different functions.

--

General idea: choose a basis in which the target can be approximated well
with a small number of basis elements.

---

### Wavelets

Wavelet smoothing often assumes observations at `\(N=2^J\)` equally spaced
points and uses an orthonormal basis of `\(N\)` vectors organized in `\(J\)`
levels.

--

Some wavelet bases:

<img src="additive_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

A common approach for wavelet smoothing is to use `\(L_1\)` shrinkage with

`\begin{equation*}
\lambda = \widehat{\sigma} \sqrt{2\log N}
\end{equation*}`

--

A variant is to use different levels of smoothing at each level of detail.

--

`\(\widehat{\sigma}\)` is usually estimated by assuming the highest frequency
level is pure noise.

--

Several R packages are available for wavelet modeling, including `waveslim`,
`rwt`, `wavethresh`, and `wavelets`.

--

Matlab has very good wavelet toolboxes.

--

S-Plus also used to have a good wavelet library.

<!--
An example of a recent paper on wavelet methodology is

> M. I. Knight and G. P. Nason (2009) A 'nondecimated' lifting transform,
> Statistics and Computing 19(1), 1--16.
-->

---
layout: false

## Other Approaches

--

MARS, multivariate adaptive regression splines. Available in the `mda`
package.

--

`polymars` in package `polspline`.

--

Smoothing spline ANOVA.

--

Projection pursuit regression.

--

Single and multiple index models.

--

Neural networks.

--

Tree models.

--

Some examples are available [online](../examples/additive.Rmd).
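---

## Other Approaches

A quick sketch of two of these, reusing the ozone data from earlier
(illustrative only, not part of the original notes):

```r
data(calif.air.poll, package = "SemiPar")

## projection pursuit regression (base R)
fit.ppr <- ppr(ozone.level ~ daggett.pressure.gradient +
                   inversion.base.height + inversion.base.temp,
               data = calif.air.poll, nterms = 2)

## MARS via the mda package
library(mda)
x <- as.matrix(calif.air.poll[, c("daggett.pressure.gradient",
                                  "inversion.base.height",
                                  "inversion.base.temp")])
fit.mars <- mars(x, calif.air.poll$ozone.level)
```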
<!--
adapted from Emi Tanaka's gist at
https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505
-->