class: center, middle, title-slide

.title[
# Numerical Linear Algebra
]

.author[
### Luke Tierney
]

.institute[
### University of Iowa
]

.date[
### 2023-05-19
]

---
layout: true

<link rel="stylesheet" href="stat7400.css" type="text/css" />

`$$\renewcommand{\machunit}{\mathbf{u}}$$`

<!-- **** does this belong here, farther down, or in arith.Rmd?? -->

## Preliminaries

---

Much of this material is based on

> G. H. Golub and C. F. Van Loan (2013), _Matrix Computations_, 4th
> Ed., JHU Press.

--

### Conditioning and Stability

Some problems are inherently difficult: no algorithm involving
rounding of inputs can be expected to work well. Such problems are
called _ill-conditioned_.

---

A numerical measure of conditioning, called a _condition number_, can
sometimes be defined:

--

* Suppose the objective is to compute `\(y = f(x)\)`.

--

* If `\(x\)` is perturbed by `\(\Delta x\)` then the result is changed by
  `$$\Delta y = f(x + \Delta x) - f(x).$$`

--

* If
  `$$\frac{|\Delta y|}{|y|} \approx \kappa \frac{|\Delta x|}{|x|}$$`
  for small perturbations `\(\Delta x\)` then `\(\kappa\)` is the
  _condition number_ for the problem of computing `\(f(x)\)`.

--

A particular algorithm for computing an approximation `\(\tilde{f}(x)\)`
to `\(f(x)\)` is _numerically stable_ if the computed result is close to
`\(f(\tilde{x})\)` for some slightly perturbed input
`\(\tilde{x} = x + \Delta x\)`, i.e. if it is nearly the exact result
for a nearby problem.

---

### Error Analysis

Analyzing how errors accumulate and propagate through a computation,
called _forward error analysis_, is sometimes possible but often very
difficult.

--

_Backward error analysis_ tries to show that the computed result
`\begin{equation*}
  \tilde{y} = \tilde{f}(x)
\end{equation*}`
is the exact solution to a slightly perturbed problem,

--

i.e.
`\begin{equation*}
  \tilde{y} = f(\tilde{x})
\end{equation*}`
for some `\(\tilde{x} \approx x\)`.

---

If

* the problem of computing `\(f(x)\)` is well conditioned, and

--

* the algorithm `\(\tilde{f}\)` is stable,

--

then

<!-- ***** -->
<!--
`\begin{align*}
  \tilde{y} &= \tilde{f}(x) & \text{computed result}\\
  & = f(\tilde{x}) & \text{exact result for some `\(\tilde{x} \approx x\)`}\\
  &\approx f(x) & \text{since `\(f\)` is well-conditioned}
\end{align*}`
-->
`\begin{align*}
  \tilde{y} &= \tilde{f}(x) & \text{computed result}\\
  & = f(\tilde{x}) & \text{exact result}\\
  &\approx f(x) & \text{well-conditioned}
\end{align*}`

--

Backward error analysis is used heavily in numerical linear algebra.

---
layout: true

## Solving Linear Systems

---

Many problems involve solving linear systems of the form
`\begin{equation*}
  A x = b
\end{equation*}`

--

* least squares normal equations:
  `$$X^TX \beta = X^T y$$`

--

* stationary distribution of a Markov chain:
  `$$\begin{align*}
  \pi P &= \pi\\
  \sum \pi_i &= 1
  \end{align*}$$`

--

If `\(A\)` is `\(n \times n\)` and non-singular then in principle the
solution is
`\begin{equation*}
  x = A^{-1} b
\end{equation*}`

--

This is not usually a good numerical approach because

* it can be numerically inaccurate;

* it is inefficient except for very small `\(n\)`.

---

### Triangular Systems

Triangular systems are easy to solve. The upper triangular system
`\begin{equation*}
  \begin{bmatrix} 5 & 3 \\ 0 & 2 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 4 \end{bmatrix}
\end{equation*}`

--

has solution
`\begin{align*}
  x_2 & = 4 / 2 = 2\\
  x_1 &= (16 - 3 x_2)/5 = 10/5 = 2
\end{align*}`

--

This is called _back substitution_.

--

Lower triangular systems are solved by _forward substitution_.
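---

In R, `backsolve` and `forwardsolve` solve upper and lower triangular
systems. A quick check on the example above (a minimal sketch):

```r
R <- matrix(c(5, 0, 3, 2), 2, 2)   ## the upper triangular matrix above
backsolve(R, c(16, 4))             ## back substitution: c(2, 2)
```

--

`forwardsolve` handles lower triangular systems analogously.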
---

If one of the diagonal elements in a triangular matrix is zero, then
the matrix is singular.

--

If one of the diagonal elements in a triangular matrix is close to
zero, then the solution is very sensitive to other inputs:
`\begin{equation*}
  \begin{bmatrix} 1 & a \\ 0 & \epsilon \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} b_1\\ b_2 \end{bmatrix}
\end{equation*}`

--

has solution
`\begin{align*}
  x_2 & = \frac{b_2}{\epsilon}\\
  x_1 &= b_1 - a \frac{b_2}{\epsilon}
\end{align*}`

--

This sensitivity for small `\(\epsilon\)` is inherent in the problem:
for small values of `\(\epsilon\)` the problem of finding the solution
`\(x\)` is ill-conditioned.

---

### Gaussian Elimination

The system
`\begin{equation*}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

can be reduced to triangular form by subtracting two times the first
equation from the second.

--

In matrix form:
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -2 & 1 \end{bmatrix}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 1 & 0\\ -2 & 1 \end{bmatrix}
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

or
`\begin{equation*}
  \begin{bmatrix} 5 & 3 \\ 0 & 2 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 4 \end{bmatrix}
\end{equation*}`

--

which is the previous triangular system.

---

For a general `\(2 \times 2\)` matrix `\(A\)` the lower triangular
matrix used for the reduction is
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -\frac{a_{21}}{a_{11}} & 1 \end{bmatrix}
\end{equation*}`

--

The ratio `\(\frac{a_{21}}{a_{11}}\)` is called a _multiplier_.

--

This strategy works as long as `\(a_{11} \neq 0\)`.

--

If `\(a_{11} \approx 0\)`, say
`\begin{equation*}
  A = \begin{bmatrix} \epsilon & 1\\ 1 & 1 \end{bmatrix}
\end{equation*}`

--

for small `\(\epsilon\)`, then the multiplier `\(1/\epsilon\)` is large
and this does not work very well, even though `\(A\)` is very well
behaved.

--

Using this approach would result in a numerically unstable algorithm
for a well-conditioned problem.
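---

A small illustration in R (a sketch; the exact values depend on the
rounding of the intermediate results):

```r
eps <- 1e-17
A <- matrix(c(eps, 1, 1, 1), 2, 2)   ## the matrix above; true solution near c(1, 1)
b <- c(1, 2)

## eliminate without pivoting: the multiplier 1 / eps is huge
m <- A[2, 1] / A[1, 1]
x2 <- (b[2] - m * b[1]) / (A[2, 2] - m * A[1, 2])
x1 <- (b[1] - A[1, 2] * x2) / A[1, 1]
c(x1, x2)     ## x1 is lost to cancellation

solve(A, b)   ## partial pivoting recovers a result close to c(1, 1)
```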
---

### Partial Pivoting

We can ensure that the multiplier is less than or equal to one in
magnitude by switching rows before eliminating:
`\begin{equation*}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

or
`\begin{equation*}
  \begin{bmatrix} 10 & 8\\ 5 & 3 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 36\\ 16 \end{bmatrix}
\end{equation*}`

--

The matrix to reduce this system to triangular form is now
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -0.5 & 1 \end{bmatrix}
\end{equation*}`

---

So the final triangular system is constructed as
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -0.5 & 1 \end{bmatrix}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 1 & 0\\ -0.5 & 1 \end{bmatrix}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

or
`\begin{equation*}
  \begin{bmatrix} 10 & 8\\ 0 & -1 \\ \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 36\\ -2 \end{bmatrix}
\end{equation*}`

--

Equivalently, we can think of our original system as
`\begin{equation*}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 1 & 0\\ 0.5 & 1 \end{bmatrix}
  \begin{bmatrix} 10 & 8\\ 0 & -1 \\ \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

The decomposition of `\(A\)` as
`\begin{equation*}
  A = P L U
\end{equation*}`

--

with `\(P\)` a permutation matrix, `\(L\)` lower triangular with ones on
the diagonal, and `\(U\)` upper triangular is called a _PLU
decomposition_.

---

### PLU Decomposition

In general, we can write a square matrix `\(A\)` as
`\begin{equation*}
  A = P L U
\end{equation*}`
where

--

* `\(P\)` is a _permutation matrix_, i.e.

    * it is an identity matrix with some rows switched
    * it satisfies `\(P P^T = P^T P = I\)`, i.e. it is an _orthogonal matrix_

--

* `\(L\)` is a _unit lower triangular matrix_, i.e.

    * it is lower triangular
    * it has ones on the diagonal

--

* `\(U\)` is upper triangular

--

The permutation matrix `\(P\)` can be chosen so that the multipliers
used in forming `\(L\)` all have magnitude at most one.

---

`\(A\)` is non-singular if and only if the diagonal entries in `\(U\)`
are all non-zero.

--

If `\(A\)` is non-singular, then we can solve
`\begin{equation*}
  A x = b
\end{equation*}`

--

or
`\begin{equation*}
  PLU x = b
\end{equation*}`
in three steps:

--

1. Solve `\(P z = b\)` for `\(z = P^Tb\)` (permute the right hand side)

2. Solve `\(L y = z\)` for `\(y\)` (forward solve lower triangular system)

3. Solve `\(U x = y\)` for `\(x\)` (back solve upper triangular system)

--

Computational complexity:

--

* Computing the _PLU_ decomposition takes `\(O(n^3)\)` operations.

--

* Computing a solution from a _PLU_ decomposition takes `\(O(n^2)\)`
  operations.

---

### Condition Number

Linear systems `\(Ax=b\)` have unique solutions if `\(A\)` is
non-singular.

--

Solutions are sensitive to small perturbations if `\(A\)` is close to
singular.
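--

For example (a sketch; in exact arithmetic the solutions are
`\((2, 0)\)` and `\((-998, 1000)\)`):

```r
A <- matrix(c(1, 1, 1, 1 + 1e-10), 2, 2)   ## nearly singular
solve(A, c(2, 2))          ## approximately c(2, 0)
solve(A, c(2, 2 + 1e-7))   ## a tiny change in b changes x drastically
```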
-- We need a useful measure of closeness to singularity --- The _condition number_ is a useful measure: `\begin{align*} \renewcommand{\Norm}[1]{\|{#1}\|} \kappa(A) &= \frac{\max_{x \ne 0} \frac{\Norm{Ax}}{\Norm{x}}} {\min_{x \ne 0} \frac{\Norm{Ax}}{\Norm{x}}} \\ &= \left(\max_{x \ne 0} \frac{\Norm{Ax}}{\Norm{x}}\right) \left(\max_{x \ne 0} \frac{\Norm{A^{-1}x}}{\Norm{x}}\right) \\ &= \Norm{A}\Norm{A^{-1}} \end{align*}` -- where `\(\Norm{y}\)` is a _vector norm_ (i.e. a measure of length) of `\(y\)` -- and $$ \Norm{B} = \max_{x \ne 0} \frac{\Norm{Bx}}{\Norm{x}} $$ is the corresponding _matrix norm_ of `\(B\)`. --- Some common vector norms: `\begin{align*} \|x\|_2 &= \sqrt{\sum_{i=1}^n x_i^2} & \text{Euclidean norm}\\ \|x\|_1 &= \sum_{i=1}^n |x_i| & \text{$L_1$ norm, Manhattan norm}\\ \|x\|_\infty &= \max_i |x_i| & \text{$L_\infty$ norm} \end{align*}` --- ### Some Properties of Condition Numbers `\(\kappa(A) \ge 1\)` for all `\(A\)`. -- `\(\kappa(A) = \infty\)` if `\(A\)` is singular -- If `\(A\)` is diagonal, then `\begin{equation*} \kappa(A) = \frac{\max |a_{ii}|}{\min |a_{ii}|} \end{equation*}` -- Different norms produce different values; the values are usually qualitatively similar --- ### Sensitivity of Linear Systems Suppose `\(x\)` solves the original system and `\(x^{*}\)` solves a slightly perturbed system, `\begin{equation*} (A + \Delta A) x^{*} = b + \Delta b \end{equation*}` -- and suppose that `\begin{align*} \delta \kappa(A) &\le \frac{1}{2}\\ \frac{\Norm{\Delta A}}{\Norm{A}} & \le \delta\\ \frac{\Norm{\Delta b}}{\Norm{b}} & \le \delta \end{align*}` -- Then `\begin{equation*} \frac{\Norm{x-x^{*}}}{\Norm{x}} \le 4 \delta \kappa(A) \end{equation*}` --- ### Stability of Gaussian Elimination with Partial Pivoting Backward error analysis: The numerical solution `\(\hat{x}\)` to the system `\begin{equation*} Ax = b \end{equation*}` produced by Gaussian elimination with partial pivoting is the exact solution for a perturbed system `\begin{equation*} (A + \Delta A) \hat{x} = b \end{equation*}` with `\begin{equation*} \frac{\Norm{\Delta A}_{\infty}}{\Norm{A}_{\infty}} \le 8 n^3 \rho \machunit + O(\machunit^2) \end{equation*}` -- * The value of `\(\rho\)` is not _guaranteed_ to be small, but is rarely larger than 10 -- * The algorithm would be considered numerically stable if `\(\rho\)` were guaranteed to be bounded. -- * _Complete pivoting_ is a bit more stable, but much more work. -- * The algorithm is considered very good for practical purposes. --- ### General Linear Systems in R R provides -- * `solve` for general systems, based on LAPACK's [DGESV](https://netlib.org/lapack/explore-html/d8/d72/dgesv_8f.html). -- * DGESV uses the `\(P L U\)` decomposition. -- * `forwardsolve`, `backsolve` for triangular systems. -- * `kappa` computes an estimate of the condition number or the exact condition number based on the Euclidean norm. --- layout: true ## Cholesky Factorization --- Suppose `\(A\)` is symmetric and (strictly) positive definite, i.e. `\begin{equation*} x^T A x > 0 \end{equation*}` for all `\(x \ne 0\)`. -- Some examples: * If `\(X\)` is the `\(n \times p\)` design matrix for a linear model and `\(X\)` is of rank `\(p\)`, then `\(A = X^T X\)` is strictly positive definite. -- If `\(X\)` is not of full rank then `\(A = X^T X\)` is non-negative definite or positive semi-definite, i.e. `\(x^T A x \ge 0\)` for all `\(x\)`. 
-- * If `\(A\)` is the covariance matrix of a random vector `\(X\)` then `\(A\)` is positive semidefinite: `$$\begin{align*} c^T A c &= c^T E[(X-\mu)(X-\mu)^T] c\\ &= E[((X-\mu)^T c)^T (X - \mu)^T c]\\ &= \text{Var}((X - \mu)^T c) \ge 0 \end{align*}$$` -- The covariance matrix is strictly positive definite unless `\(P(c^T X = c^T\mu) = 1\)` for some `\(c \ne 0\)`, i.e. unless there is a perfect linear relation between some of the components of `\(X\)`. --- <!-- begin Theorem --> **Theorem:** If `\(A\)` is strictly positive definite, then there exists a unique lower triangular matrix `\(L\)` with positive diagonal entries such that `\begin{equation*} A = L L^T \end{equation*}` This is called the _Cholesky factorization_. <!-- end Theorem --> --- ### Properties of the Cholesky Factorization Algorithm It uses the symmetry to produce an efficient algorithm. -- The algorithm needs to take square roots to find the diagonal entries. -- An alternative that avoids square roots factors `\(A\)` as `\begin{equation*} A = L D L^T \end{equation*}` with `\(D\)` diagonal and `\(L\)` unit lower triangular. -- The algorithm is numerically stable, and is guaranteed not to attempt square roots of negative numbers if $$ q_n \machunit \kappa_2(A) \le 1 $$ where `\(q_n\)` is a small constant depending on the dimension `\(n\)`. --- The algorithm will fail if the matrix is not (numerically) strictly positive definite. -- Modifications using pivoting are available that can be used for nonnegative definite matrices. -- Another option is to factor `\(A_\lambda = A + \lambda I\)` with `\(\lambda > 0\)` chosen large enough to make `\(A_\lambda\)` numerically strictly positive definite. -- This is often used in optimization. --- ### Some Applications of the Cholesky Factorization * Solving the normal equations in least squares. -- This requires that the predictors be linearly independent -- * Generating multivariate normal random vectors. -- * Parameterizing strictly positive definite matrices: -- Any lower triangular matrix `\(L\)` with arbitrary values below the diagonal and positive diagonal entries determines and is uniquely determined by the positive definite matrix `\(A = L L^T\)` --- ### Cholesky Factorization in R -- The function `chol` computes the Cholesky factorization. -- The returned value is the upper triangular matrix `\(R = L^T\)`. -- LAPACK is used for the computation. --- layout: true ## QR Factorization --- An `\(m \times n\)` matrix `\(A\)` with `\(m \ge n\)` can be written as `\begin{equation*} A = Q R \end{equation*}` where -- * `\(Q\)` is `\(m \times n\)` with orthonormal columns, i.e. `\(Q^T Q = I_n\)` -- * `\(R\)` is upper triangular -- Several algorithms are available for computing the QR decomposition: -- * Modified Gram-Schmidt; -- * Householder transformations (reflections); -- * Givens transformations (rotations). --- Each has advantages and disadvantages. -- LINPACK `dqrdc` and LAPACK `DGEQP3` use Householder transformations. -- The QR decomposition exists regardless of the rank of `\(A\)`. -- The rank of `\(A\)` is `\(n\)` if and only if the diagonal elements of `\(R\)` are all non-zero. --- ### Householder Transformations A Householder transformation is a matrix of the form `\begin{equation*} P = I - 2 v v^T/ v^T v \end{equation*}` where `\(v\)` is a nonzero vector. -- * `\(Px\)` is the reflection of `\(x\)` in the hyperplane orthogonal to `\(v\)`. 
-- * Given a vector `\(x \neq 0\)`, choosing `\(v = x + \alpha e_1\)` with `$$\alpha = \pm \|x\|_2$$` and `\(e_1\)` the first unit vector (first column of the identity) produces `$$Px = \mp\|x\|_2 e_1$$` --- This can be used to zero all but the first element of the first column of a matrix: `\begin{equation*} P \begin{bmatrix} \times & \times & \times\\ \times & \times & \times\\ \times & \times & \times\\ \times & \times & \times\\ \times & \times & \times\\ \end{bmatrix} = \begin{bmatrix} \times & \times & \times\\ 0 & \times & \times\\ 0 & \times & \times\\ 0 & \times & \times\\ 0 & \times & \times\\ \end{bmatrix} \end{equation*}` -- This is the first step in computing the `\(QR\)` factorization. -- The denominator `\(v^Tv\)` can be written as `\begin{equation*} v^Tv = x^Tx + 2 \alpha x_1 + \alpha^2 \end{equation*}` -- Choosing `\(\alpha = \text{sign}(x_1)\|x\|_2\)` ensures that all terms are non-negative and avoids cancellation. --- With the right choice of sign Householder transformations are very stable. -- The computation `\(Py\)` would be carried out without creating the `\(P\)` matrix: `\begin{equation*} Py = (I - 2 v v^T/ v^T v)y = y - \left(2 \frac{v^T y}{v^T v} \right) v. \end{equation*}` --- ### Givens Rotations A Givens rotation is a matrix `\(G\)` that is equal to the identity except for elements `\(G_{ii},G_{ij},G_{ji},G_{jj}\)`, which are `\begin{equation*} \begin{bmatrix} G_{ii} & G_{ij} \\ G_{ji} & G_{jj} \end{bmatrix} = \begin{bmatrix} c & s\\ -s & c \end{bmatrix} \end{equation*}` with `\(c = \cos(\theta)\)` and `\(s = \sin(\theta)\)` for some `\(\theta\)`. -- Premultiplication by `\(G^T\)` is a clockwise rotation by `\(\theta\)` radians in the `\((i,j)\)` coordinate plane. -- Given scalars `\(a,b\)` one can compute `\(c, s\)` so that `\begin{equation*} \begin{bmatrix} c & s\\ -s & c \end{bmatrix}^T \begin{bmatrix} a\\ b \end{bmatrix} = \begin{bmatrix} r\\ 0 \end{bmatrix} \end{equation*}` -- This allows `\(G\)` to zero one element while changing only one other element. --- A stable way to choose `\(c, s\)`: **if** `\(b = 0\)`<BR> `\(c = 1; s = 0\)`<BR> **else** **if** `\(|b| > |a|\)`<BR> `\(\tau = -a/b; s = 1/\sqrt{1+\tau^2}; c = s\tau\)`<BR> **else**<BR> `\(\tau = -b/a;\; c = 1/\sqrt{1 + \tau^2};\; s = c \tau\)`<BR> **end** A sequence of Givens rotations can be used to compute the `\(QR\)` factorization. * The zeroing can be done working down columns or across rows. * Working across rows is useful for incrementally adding more observations. --- ### Applications The QR decomposition can be used for solving `\(n \times n\)` systems of equations `\begin{equation*} Ax = b \end{equation*}` -- since `\(Q^{-1} = Q^T\)` and so `\begin{equation*} A x = Q R x = b \end{equation*}` is equivalent to the upper triangular system `\begin{equation*} R x = Q^T b \end{equation*}` --- The QR decomposition can also be used to solve the normal equations in linear regression: -- If `\(X\)` is the `\(n \times p\)` design matrix then the normal equations are `\begin{equation*} X^T X b = X^T y \end{equation*}` -- If `\(X = QR\)` is the QR decomposition of `\(X\)`, then `\(X^T = R^T Q^T\)` and therefore `\begin{align*} X^TX &= R^T Q^T Q R = R^T R\\ X^T y &= R^T Q^T y \end{align*}` -- If `\(X\)` is of full rank then `\(R^T\)` is invertible, and the normal equations are equivalent to the upper triangular system `\begin{equation*} R b = Q^T y \end{equation*}` -- This approach avoids computing `\(X^T X\)`. 
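--

A quick sketch using the built-in `mtcars` data (illustrative only):

```r
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg
qrX <- qr(X)
b <- backsolve(qr.R(qrX), t(qr.Q(qrX)) %*% y)     ## solve R b = Q^T y
cbind(b, coef(lm(mpg ~ wt + hp, data = mtcars)))  ## same coefficients
```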
--

If `\(X\)` is of full rank then `\(R^T R\)` is the Cholesky
factorization of `\(X^T X\)` (up to multiplications of rows of `\(R\)`
by `\(\pm 1\)`).

---

### QR with Column Pivoting

Sometimes the columns of `\(X\)` are linearly dependent or nearly so.

--

By permuting columns we can produce a factorization
`\begin{equation*}
  A = Q R P
\end{equation*}`
where

--

* `\(P\)` is a permutation matrix

--

* `\(R\)` is upper triangular and the diagonal elements of `\(R\)` have
  non-increasing magnitudes, i.e.
  `$$|r_{ii}| \ge |r_{jj}|$$`
  if `\(i \le j\)`.

---

If some of the diagonal entries of `\(R\)` are zero, then `\(R\)` will
be of the form
`\begin{equation*}
  R = \begin{bmatrix} R_{11} & R_{12}\\ 0 & 0 \end{bmatrix}
\end{equation*}`
where `\(R_{11}\)` is upper triangular with non-zero diagonal elements
non-increasing in magnitude.

--

The rank of the matrix is the number of non-zero rows in `\(R\)`.

--

The _numerical rank_ of a matrix can be determined by

--

* computing its QR factorization with column pivoting

--

* specifying a tolerance level `\(\epsilon\)` such that all diagonal
  entries `\(|r_{ii}| < \epsilon\)` are considered numerically zero.

--

* modifying the computed QR factorization to zero all rows
  corresponding to numerically zero `\(r_{ii}\)` values.

---

### Some Regression Diagnostics

The projection matrix, or hat matrix, is
`\begin{equation*}
  H = X (X^T X)^{-1} X^T = QR (R^T R)^{-1} R^T Q^T = QQ^T
\end{equation*}`

--

The diagonal elements of the hat matrix are therefore
`\begin{equation*}
  h_i = \sum_{j=1}^p q_{ij}^2
\end{equation*}`

---

If `\(\hat{e}_i = y_i - \hat{y}_i\)` is the residual, then
`\begin{align*}
  s_{-i}^2 &= \frac{\text{SSE} - \hat{e}_i^2/(1-h_i)}{n-p-1}
    = \text{ estimate of variance without obs. i}\\
  t_i &= \frac{\hat{e}_i}{s_{-i}\sqrt{1-h_i}}
    = \text{externally studentized residual}\\
  D_i &= \frac{\hat{e}_i^2 h_i}{(1-h_i)^2s^2p}
    = \text{Cook's distance}
\end{align*}`

---

### QR Decomposition and Least Squares in R

The R function `qr` uses either LINPACK or LAPACK to compute QR
factorizations.

--

A modified LINPACK routine is the default.

--

The core linear model fitting function `lm.fit` uses QR factorization
with column pivoting.

---
layout: true

## Singular Value Decomposition

---

An `\(m \times n\)` matrix `\(A\)` with `\(m \ge n\)` can be factored as
`\begin{equation*}
  A = U D V^T
\end{equation*}`
where

--

* `\(U\)` is `\(m \times n\)` with orthonormal columns, i.e. `\(U^T U = I_n\)`.

--

* `\(V\)` is `\(n \times n\)` orthogonal, i.e. `\(V V^T = V^T V = I_n\)`.

--

* `\(D = \text{diag}(d_1,\dots,d_n)\)` is `\(n \times n\)` diagonal with
  `\(d_1 \ge d_2 \ge \dots \ge d_n \ge 0\)`.

--

This is the _singular value decomposition_, or _SVD_ of `\(A\)`.

--

* The values `\(d_1, \dots, d_n\)` are the _singular values_ of `\(A\)`.

--

* The columns of `\(U\)` are the _left singular vectors_ of `\(A\)`.

--

* The columns of `\(V\)` are the _right singular vectors_ of `\(A\)`.

---

If the columns of `\(A\)` have been centered so the column sums of
`\(A\)` are zero, then the columns of `\(UD\)` are the _principal
components_ of `\(A\)`.

--

Excellent algorithms are available for computing the SVD.

--

These algorithms are usually several times slower than the QR
algorithms.

---

### Some Properties of the SVD

The Euclidean matrix norm of `\(A\)` is defined as
`\begin{equation*}
  \Norm{A}_2 = \max_{x \neq 0}\frac{\Norm{Ax}_2}{\Norm{x}_2}
\end{equation*}`
with `\(\Norm{x}_2 = \sqrt{x^T x}\)` the Euclidean vector norm.
--

If `\(A\)` has SVD `\(A=UDV^T\)`, then
`\begin{equation*}
  \Norm{A}_2 = d_1
\end{equation*}`

---

If `\(k < \text{rank}(A)\)` and
`\begin{equation*}
  A_k = \sum_{i=1}^k d_i u_i v_i^T
\end{equation*}`
where `\(u_i\)` is the `\(i\)`-th column of `\(U\)` and `\(v_i\)` is the
`\(i\)`-th column of `\(V\)`

--

then
`\begin{equation*}
  \min_{B:\text{rank}(B) \le k} \Norm{A - B}_2 = \Norm{A - A_k}_2 = d_{k+1}
\end{equation*}`

--

In particular,

--

* `\(d_1 u_1 v_1^T\)` is the best rank one approximation to `\(A\)` (in
  the Euclidean matrix norm).

--

* `\(A_k\)` is the best rank `\(k\)` approximation to `\(A\)`.

--

* If `\(m = n\)` then `\(d_n = \min\{d_1,\dots,d_n\}\)` is the distance
  between `\(A\)` and the set of singular matrices.

---

If `\(A\)` is square then the condition number based on the Euclidean
norm is
`\begin{equation*}
  \kappa_2(A) = \Norm{A}_2\Norm{A^{-1}}_2 = \frac{d_1}{d_n}
\end{equation*}`

--

For an `\(m \times n\)` matrix with `\(m > n\)` we also have
`\begin{equation*}
  \kappa_2(A) =
  \frac{\max_{x \ne 0} \frac{\Norm{Ax}_2}{\Norm{x}_2}}
       {\min_{x \ne 0} \frac{\Norm{Ax}_2}{\Norm{x}_2}} = \frac{d_1}{d_n}
\end{equation*}`

--

* This can be used to relate `\(\kappa_2(A^T A)\)` to `\(\kappa_2(A)\)`.

--

* This has implications for regression computations.

--

The singular values are the non-negative square roots of the
eigenvalues of `\(A^T A\)` and the columns of `\(V\)` are the
corresponding eigenvectors.

---

### Moore-Penrose Generalized Inverse

.pull-left[

Suppose `\(A\)` has rank `\(r \le n\)` and SVD `\(A = U D V^T\)`.

{{content}}

]

--

Then
`\begin{equation*}
  d_{r+1} = \dots = d_n = 0
\end{equation*}`

{{content}}

--

Let
`\begin{equation*}
  D^+ = \text{diag}\left(\frac{1}{d_1}, \dots, \frac{1}{d_r}, 0, \dots, 0\right)
\end{equation*}`

{{content}}

--

and let
`\begin{equation*}
  A^+ = V D^+ U^T
\end{equation*}`

--

.pull-right[

Then `\(A^+\)` satisfies
`\begin{align*}
  A A^+ A &= A\\
  A^+ A A^+ &= A^+\\
  (A A^+)^T &= A A^+\\
  (A^+A)^T &= A^+ A
\end{align*}`

{{content}}

]

--

`\(A^+\)` is the unique matrix with these properties and is called the
Moore-Penrose _generalized inverse_ or _pseudo-inverse_.

---

### SVD and Least Squares

If `\(X\)` is an `\(n \times p\)` design matrix of less than full rank,
then there are infinitely many values of `\(b\)` that minimize
`\begin{equation*}
  \Norm{y - Xb}_2^2
\end{equation*}`

--

Among these solutions,
`\begin{equation*}
  b = (X^T X)^+ X^T y
\end{equation*}`
minimizes `\(\Norm{b}_2\)`.

--

This is related to _penalized regression_ where one might choose
`\(b\)` to minimize
`\begin{equation*}
  \Norm{y - Xb}_2^2 + \lambda \Norm{b}_2^2
\end{equation*}`
for some choice of `\(\lambda > 0\)`.

---

### SVD and Principal Components Analysis

Let `\(X\)` be an `\(n \times p\)` matrix of `\(n\)` observations on
`\(p\)` variables.

--

Principal components analysis involves estimating the eigenvectors and
eigenvalues of the covariance matrix.

--

Let `\(\widetilde{X}\)` be the data matrix with columns centered at zero
by subtracting the column means.

--

The sample covariance matrix is
`\begin{equation*}
  S = \frac{1}{n-1} \widetilde{X}^T\widetilde{X}
\end{equation*}`

--

Let `\(\widetilde{X} = U D V^T\)` be the SVD of the centered data matrix
`\(\widetilde{X}\)`.

---

Then
`\begin{equation*}
  S = \frac{1}{n-1} V D U^T U D V^T = \frac{1}{n-1} V D^2 V^T
\end{equation*}`

--

* The diagonal elements of `\(\frac{1}{n-1}D^2\)` are the eigenvalues of `\(S\)`.

--

* The columns of `\(V\)` are the eigenvectors of `\(S\)`.
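--

A small check (a sketch using a built-in data set):

```r
Xc <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
s <- svd(Xc)
s$d^2 / (nrow(Xc) - 1)        ## eigenvalues of S from the SVD
eigen(cov(USArrests))$values  ## same values, up to rounding
```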
--

Using the SVD of `\(\widetilde{X}\)` is more numerically stable than

--

* forming `\(\widetilde{X}^T\widetilde{X}\)`

--

* computing the eigenvalues and eigenvectors.

---

### SVD and Numerical Rank

The rank of a matrix `\(A\)` is equal to the number of non-zero
singular values.

--

Exact zeros may not occur in the SVD due to rounding.

--

Numerical rank determination can be based on the SVD. All
`\(d_i \le \delta\)` can be set to zero for some choice of
`\(\delta\)`. Golub and Van Loan recommend using
`$$\delta = \machunit \Norm{A}_\infty$$`

--

If the entries of `\(A\)` are only accurate to `\(d\)` decimal digits,
then Golub and Van Loan recommend
`\begin{equation*}
  \delta = 10^{-d} \Norm{A}_\infty
\end{equation*}`

--

If the numerical rank of `\(A\)` is `\(\hat{r}\)` and
`\(d_{\hat{r}} \gg \delta\)` then `\(\hat{r}\)` can be used with some
confidence; otherwise caution is needed.

---

.pull-left.tiny-code[

### Other Applications

The SVD is used in many areas of numerical analysis.

{{content}}

]

--

It is also often useful as a theoretical tool.

{{content}}

--

Some approaches to compressing `\(m \times n\)` images are based on the
SVD.

{{content}}

--

A simple example using the
[`volcano`](https://teara.govt.nz/en/photograph/3920/maungawhau-mt-eden)
data:

<!-- $ -- get emacs sane again -->


```r
s <- svd(volcano)
head(s$d, 10)
## [1] 9644.28782 488.60992 341.18358 298.76602 141.83363
## [6] 72.12443 43.55698 33.52319 27.38376 19.97622
tail(s$d, 2)
## [1] 1.0526941 0.9545092
```

The relative error in the final image is 0.056.

--

.pull-right[
<img src="linalg_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---

### SVD in R

R provides the function `svd` to compute the SVD.

--

The implementation uses LAPACK.

--

You can ask for the singular values only; this will be faster for
larger problems.

---
layout: true

## Eigenvalues and Eigenvectors

---

Let `\(A\)` be an `\(n \times n\)` matrix. `\(\lambda\)` is an
_eigenvalue_ of `\(A\)` if
`\begin{equation*}
  A v = \lambda v
\end{equation*}`
for some `\(v \neq 0\)`; `\(v\)` is an _eigenvector_ of `\(A\)`.

--

If `\(A\)` is a real `\(n \times n\)` matrix then it has `\(n\)`
eigenvalues.

--

* Several eigenvalues may be identical.

--

* Some eigenvalues may be complex; if so, then they come in conjugate
  pairs.

--

* The set of eigenvalues is called the _spectrum_.

--

If `\(A\)` is symmetric then the eigenvalues are real.

---

If `\(A\)` is symmetric then

--

* `\(A\)` is strictly positive definite if and only if all eigenvalues
  are positive.

--

* `\(A\)` is positive semi-definite if and only if all eigenvalues are
  non-negative.

--

* There exists an orthogonal matrix `\(V\)` such that
  $$ A = V \Lambda V^T $$
  with `\(\Lambda = \text{diag}(\lambda_1,\dots,\lambda_n)\)`; the
  columns of `\(V\)` are the corresponding normalized eigenvectors.

--

* This is called the _spectral decomposition_ of `\(A\)`.

---

Some problems require only the largest eigenvalue or the largest few;
sometimes the corresponding eigenvectors are also needed.

--

* The stationary distribution of an irreducible finite state-space
  Markov chain is the unique left eigenvector of the transition matrix,
  normalized to sum to one, corresponding to the largest eigenvalue
  `\(\lambda = 1\)`.

--

* The speed of convergence to the stationary distribution depends on
  the magnitude of the second largest eigenvalue.

--

The R function `eigen` can be used to compute eigenvalues and,
optionally, eigenvectors.
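--

A small sketch for a two-state chain (the transition matrix here is
made up for illustration):

```r
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), 2, 2, byrow = TRUE)
e <- eigen(t(P))                    ## pi P = pi, so use the transpose
v <- Re(e$vectors[, which.max(Re(e$values))])
v / sum(v)                          ## stationary distribution, (2/3, 1/3)
```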
--

The package `svd` provides routines for computing a specified number
of the largest eigenvalues and corresponding eigenvectors.

---
layout: true

## Determinants

---

Theoretically the determinant can be computed as the product of

--

* the diagonals of `\(U\)` in the `\(PLU\)` decomposition;

--

* the squares of the diagonals of `\(L\)` in the Cholesky factorization;

--

* the diagonals of `\(R\)` in the QR decomposition;

--

* the eigenvalues.

--

Numerically these are almost always bad ideas.

--

It is almost always better to work out the sign and compute the sum of
the logarithms of the magnitudes of the factors.

---

The R functions `det` and `determinant` compute the determinant.

--

`determinant` is more complicated to use, but has a `logarithm` option.

--

Likelihood and Bayesian analyses often involve a determinant.

--

* Usually the log likelihood and log determinant should be used.

--

* Usually the log determinant can be computed from a decomposition
  needed elsewhere in the log likelihood calculation, e.g. a Cholesky
  factorization.

---

### Example: Multivariate Normal Log-Likelihood Computation

The density at `\(x\)` of a `\(d\)`-dimensional normal distribution with
mean vector `\(\mu\)` and covariance matrix `\(C\)` is
`\begin{equation*}
  \frac{1}{\sqrt{(2 \pi)^d \det{C}}}
  \exp\left\{-\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu)\right\}.
\end{equation*}`

--

The log-density (dropping the additive constant) is
`\begin{equation*}
  -\frac{1}{2} \log\det{C} -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu).
\end{equation*}`

--

If `\(C = L L^T\)` is the Cholesky factorization of `\(C\)`, then
`\(C^{-1} = L^{-T}L^{-1}\)` and the quadratic form is
`\begin{equation*}
  (x - \mu)^T C^{-1} (x - \mu) = (x - \mu)^T L^{-T}L^{-1} (x - \mu)
  = \|L^{-1} (x - \mu)\|_2^2.
\end{equation*}`

---

So to compute the log density at `\(x\)`:

--

* Compute `\(y = L^{-1} (x - \mu)\)` by solving `\(L y = (x - \mu)\)`
  for `\(y\)`.

--

* Compute `\(\frac{1}{2} \log\det{C} = \log \det{L}\)`, which is the sum
  of the logarithms of the diagonal elements of `\(L\)`.

--

If `\(C^{-1}\)` is given, then the Cholesky factorization of
`\(C^{-1}\)` can be used to compute
`\begin{equation*}
  \log\det{C} = -\log\det{C^{-1}}.
\end{equation*}`

---
layout: true

## Non-Negative Matrix Factorization

---

<!-- https://www.r-bloggers.com/quick-intro-to-nmf-the-method-and-the-r-package/ -->

A number of problems lead to the desire to approximate a non-negative
matrix `\(X\)` by a product
`\begin{equation*}
  X \approx W H
\end{equation*}`
where `\(W\)`, `\(H\)` are non-negative matrices of low rank, i.e. with
few columns.

--

There are a number of algorithms available, most of the form
`\begin{equation*}
  \min_{W, H} [D(X, WH) + R(W, H)]
\end{equation*}`
where `\(D\)` is a loss function and `\(R\)` is a possible penalty for
encouraging desirable characteristics of `\(W\)` and `\(H\)`, such as
smoothness or sparseness.

--

The R package `NMF` provides one approach, and a vignette in the
package provides some background and references.
--- .pull-left.tiny-code[ As an example, using default settings in the NMF package the `volcano` image can be approximated with factorizations of rank `\(1, \dots, 5\)` by <!-- ## nolint start: semicolon --> ```r library(NMF) nmf1 <- nmf(volcano, 1); V1 <- nmf1@fit@W %*% nmf1@fit@H nmf2 <- nmf(volcano, 2); V2 <- nmf2@fit@W %*% nmf2@fit@H nmf3 <- nmf(volcano, 3); V3 <- nmf3@fit@W %*% nmf3@fit@H nmf4 <- nmf(volcano, 4); V4 <- nmf4@fit@W %*% nmf4@fit@H nmf5 <- nmf(volcano, 5); V5 <- nmf5@fit@W %*% nmf5@fit@H ``` <!-- ## nolint end --> The relative error for the final image is ```r max(abs(volcano - V5)) / max(abs(volcano)) ## [1] 0.03096702 ``` ] .pull-right[ The images: <img src="linalg_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] --- Another application is _recommender systems_. <!-- https://matloff.wordpress.com/2016/03/05/quick-intro-to-nmf-the-method-and-the-r-package/ http://bigdata-doctor.com/recommender-systems-101-practical-example-in-r/ --> -- For example, `\(X\)` might be ratings of movies (columns) by viewers (rows). -- * The set of actual values would be very sparse as each viewer will typically rate only a small subset of all movies. -- * `\(W\)` would be a user preference matrix, `\(H\)` a corresponding movie feature matrix. -- * The product `\(WH\)` would provide predicted ratings for movies the users have not yet seen. --- layout:false ## Other Factorizations Many other factorizations of matrices are available and being developed. -- Some examples are * Robust variants of the SVD -- * Sparse variants, e.g. Dan Yang, Zongming Ma, and Andreas Buja (2014), "A Sparse Singular Value Decomposition Method for High-Dimensional Data," Journal of Computational and Graphical Statistics 23(4), 923--942. -- * Constrained factorizations, e.g. C. Ding, T. Li, and M. I. Jordan (2010), "Convex and Semi-Nonnegative Matrix Factorizations," IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 45--55. --- layout: true ## Exploiting Special Structure --- Specialized algorithms can sometimes be used for matrices with special structure. -- ### Toeplitz Systems .pull-left[ Stationary time series have covariance matrices that look like `\begin{equation*} \begin{bmatrix} \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 & \dots\\ \sigma_1 & \sigma_0 & \sigma_1 & \sigma_2 & \dots\\ \sigma_2 & \sigma_1 & \sigma_0 & \sigma_1 & \dots\\ \sigma_3 & \sigma_2 & \sigma_1 & \sigma_0 & \ddots\\ \ddots & \ddots & \ddots & \ddots & \ddots & \end{bmatrix} \end{equation*}` {{content}} ] -- This is a _Toeplitz_ matrix. -- .pull-right[ This matrix is also symmetric --- this is not required for a Toeplitz matrix. {{content}} ] -- Special algorithms requiring `\(O(n^2)\)` operations are available for Toeplitz systems. {{content}} -- General Cholesky factorization requires `\(O(n^3)\)` operations. {{content}} -- The R function `toeplitz` creates Toeplitz matrices. --- ### Circulant Systems .pull-left[ Some problems give rise to matrices that look like `\begin{equation*} C_n = \begin{bmatrix} a_1 & a_2 & a_3 & \dots & a_n\\ a_n & a_1 & a_2 & \dots & a_{n-1}\\ a_{n-1} & a_n & a_1 & \dots & a_{n-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ a_2 & a_3 & a_4 & \dots & a_1 \end{bmatrix} \end{equation*}` {{content}} ] -- This is a _circulant_ matrix, a subclass of Toeplitz matrices. 
--

.pull-right[

Circulant matrices satisfy
`\begin{equation*}
  C_n = F_n^* \, \text{diag}(\sqrt{n} F_n a) \, F_n
\end{equation*}`

{{content}}

]

--

where `\(F_n\)` is the _Fourier matrix_ with
`\begin{equation*}
  F_n(j,k) = \frac{1}{\sqrt{n}}e^{-(j-1)(k-1)2\pi\sqrt{-1}/n}
\end{equation*}`

{{content}}

--

and `\(F_n^*\)` is the _conjugate transpose_, _Hermitian transpose_, or
_adjoint matrix_ of `\(F_n\)`.

{{content}}

--

<!-- http://www.cs.utk.edu/~dongarra/etemplates/node384.html -->
The eigenvalues are the elements of `\(\sqrt{n} F_n a\)`.

---

Products `\(F_n\, x\)` and `\(F_n^* x\)` can be computed with the _fast
Fourier transform (FFT)_.

--

In R
`\begin{align*}
  \sqrt{n} F_n x &= \texttt{fft(x)}\\
  \sqrt{n} F_n^* x &= \texttt{fft(x, inverse = TRUE)}\\
\end{align*}`

--

These computations are generally `\(O(n \log n)\)` in complexity.

--

Circulant systems can be used to approximate other systems.

--

Multi-dimensional analogs exist as well.

--

A simple example is available [on line](../examples/circulant.Rmd).

---

### Sparse Systems

Many problems lead to large systems in which only a small fraction of
coefficients are non-zero.

--

Some methods are available for general sparse systems.

--

Specialized methods are available for structured sparse systems such as

* tri-diagonal systems;

--

* block diagonal systems;

--

* banded systems.

--

Careful choice of row and column permutations can often turn general
sparse systems into banded ones.

---

### Sparse and Structured Systems in R

Sparse matrix support in R is improving.

--

Some packages, like `nlme`, provide utilities they need.

--

One basic package available on [CRAN](https://cran.r-project.org) is
`SparseM`.

--

A more extensive package is `Matrix`.

--

`Matrix` is the engine for mixed effects/multi-level model fitting in
`lme4`.

---
layout: false

## Update Formulas

Update formulas are available for most decompositions that allow for
efficient adding or dropping of rows or columns.

--

These can be useful for example in cross-validation and variable
selection computations.

--

They can also be useful for fitting linear models to very large data
sets; the package `biglm` uses this approach.

--

I am not aware of any convenient implementations in R at this point but
they may exist.

---
layout: true

## Iterative Methods

---

Iterative methods can be useful in large, sparse problems.

--

Iterative methods for sparse problems can also often be parallelized
effectively.

--

Iterative methods are also useful when

* `\(A x\)` can be computed efficiently for any given `\(x\)`, but

--

* it is expensive or impossible to compute `\(A\)` explicitly.

--

Some simple examples are available [on line](../examples/itersolve.Rmd).

---

### Gauss-Seidel Iteration

Choose an initial solution `\(x^{(0)}\)` to
`\begin{equation*}
  Ax = b
\end{equation*}`

--

and then update from `\(x^{(k)}\)` to `\(x^{(k+1)}\)` by
`\begin{equation*}
  x^{(k+1)}_i = \frac{1}{a_{ii}}
  \left(b_i - \sum_{j=1}^{i-1} a_{ij} x_{j}^{(k+1)}
  - \sum_{j=i+1}^n a_{ij} x_{j}^{(k)}\right)
\end{equation*}`
for `\(i = 1, \dots, n\)`.

--

This is similar in spirit to Gibbs sampling.
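---

A direct transcription of this update (a sketch; an implementation
based on the matrix splitting form follows later):

```r
## one Gauss-Seidel sweep, updating x in place
gs_sweep <- function(A, b, x) {
    for (i in seq_along(b))
        x[i] <- (b[i] - sum(A[i, -i] * x[-i])) / A[i, i]
    x
}
```

--

Repeating sweeps until the change in `x` is small gives the iteration;
each sweep is analogous to one scan of a Gibbs sampler.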
---

.pull-left[

This can be written in matrix form as
`\begin{equation*}
  x^{(k+1)} = (D+L)^{-1}(-Ux^{(k)}+b)
\end{equation*}`
with

]

.pull-right[
`\begin{align*}
  L &= \begin{bmatrix}
    0 & 0 & \dots & \dots & 0\\
    a_{21} & 0 & \dots & & \vdots\\
    a_{31} & a_{32} & \ddots & & 0\\
    \vdots & & & 0 & 0 \\
    a_{n1} & a_{n2} & \dots & a_{n,n-1} & 0
  \end{bmatrix}\\
  D &= \text{diag}(a_{11},\dots, a_{nn})\\
  U &= \begin{bmatrix}
    0 & a_{12} & \dots & \dots & a_{1n}\\
    0 & 0 & \dots & & \vdots\\
    0 & 0 & \ddots & & a_{n-2,n}\\
    \vdots & & & & a_{n-1,n}\\
    0 & 0 & \dots & 0 & 0
  \end{bmatrix}
\end{align*}`
]

---

### Splitting Methods

The Gauss-Seidel method is a member of a class of _splitting methods_
where
`\begin{equation*}
  M x^{(k+1)} = N x^{(k)} + b
\end{equation*}`
with `\(A = M - N\)`.

--

For the Gauss-Seidel method
`\begin{align*}
  M &= D + L\\
  N &= -U.
\end{align*}`

--

Other members include Jacobi iterations with
`\begin{align*}
  M_J &= D\\
  N_J &= -(L+U)
\end{align*}`

--

Splitting methods are practical if solving linear systems with matrix
`\(M\)` is easy.

---

### Convergence

A splitting method for a non-singular matrix `\(A\)` will converge to
the unique solution of `\(Ax=b\)` if
`\begin{equation*}
  \rho(M^{-1} N) < 1
\end{equation*}`
where `\(\rho(G)\)` is the _spectral radius_ of `\(G\)`,

--

i.e. the maximal absolute value of the eigenvalues of `\(G\)`.

--

This is true, for example, for the Gauss-Seidel method if `\(A\)` is
strictly positive definite.

--

Convergence can be very slow if `\(\rho(M^{-1} N)\)` is close to one.

---

.pull-left.tiny-code[

### A Simple Example

A simple implementation of Gauss-Seidel iteration using the splitting
representation:

<!-- ## nolint start: semicolon -->


```r
gssolve <- function(A, b, itermax = length(b),
                    tol = sqrt(.Machine$double.eps)) {
    n <- length(b)
    M <- A; M[rep(1:n, n) < rep(1:n, each = n)] <- 0
    N <- -A; N[rep(1:n, n) >= rep(1:n, each = n)] <- 0
    x <- numeric(n)
    for (i in seq_len(itermax)) {
        x.old <- x
        x <- forwardsolve(M, N %*% x + b)
        if (max(abs(x - x.old) / (tol + abs(x.old))) < tol &&
            max(abs(A %*% x - b)) < tol) {
            message(i, " iterations")
            return(x)
        }
    }
    message("iteration limit reached")
    x
}
```

<!-- ## nolint end -->
]

--

.pull-right.tiny-code[

With a well-conditioned matrix Gauss-Seidel iteration can be faster
than a dense PLU decomposition.


```r
n <- 2000
A <- crossprod(matrix(rnorm(n * n), n)) / n + 5 * diag(n)
b <- rnorm(n)
system.time(solve(A, b))
##    user  system elapsed
##   1.475   0.012   1.488
system.time(gssolve(A, b))
## 13 iterations
##    user  system elapsed
##   0.490   0.008   0.498
```

Whether the Gauss-Seidel method is faster on this example may depend on
the BLAS/LAPACK in use:


```r
sessionInfo()$BLAS
## [1] "/usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0"
sessionInfo()$LAPACK
## [1] "/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0"
```
]

---

### Successive Over-Relaxation

A variation is to define
`\begin{equation*}
  x^{(k+1)}_i = \frac{\omega}{a_{ii}}
  \left(b_i - \sum_{j=1}^{i-1} a_{ij} x_{j}^{(k+1)}
  - \sum_{j=i+1}^n a_{ij} x_{j}^{(k)}\right) + (1-\omega) x^{(k)}_i
\end{equation*}`
or, in matrix form,
`\begin{equation*}
  M_\omega x^{(k+1)} = N_\omega x^{(k)} + \omega b
\end{equation*}`
with
`\begin{align*}
  M_\omega &= D + \omega L\\
  N_\omega &= (1-\omega)D - \omega U
\end{align*}`
for some `\(\omega\)`, usually with `\(1 < \omega < 2\)`.

---

This is called _successive over-relaxation (SOR)_, from its first
application in a structural engineering problem.
-- For some choices of `\(\omega\)` we can have `\begin{equation*} \rho(M_\omega^{-1}N_\omega) \ll \rho(M^{-1}N) \end{equation*}` and thus faster convergence. -- For some special but important problems the value of `\(\omega\)` that minimizes `\(\rho(M_\omega^{-1}N_\omega)\)` is known or can be computed. --- ### Conjugate Gradient Method .pull-left[ If `\(A\)` is symmetric and strictly positive definite then the unique solution to `\(Ax=b\)` is the unique minimizer of the quadratic function `\begin{equation*} f(x) = \frac{1}{2}x^T A x - x^T b \end{equation*}` {{content}} ] -- Any nonlinear or quadratic optimization method can be used to find the minimum. {{content}} -- The most common one used in this context is the conjugate gradient method. -- .pull-right[ Choose an initial `\(x_0\)`, set `\(d_0 = -g_0 = b - A x_0\)`. {{content}} ] -- Then, while `\(g_k \neq 0\)`, for `\(k = 0, 1, \dots\)` compute `\begin{align*} \alpha_k &= -\frac{g_k^T d_k}{d_k^T A d_k}\\ x_{k+1} &= x_k + \alpha_k d_k\\ g_{k+1} &= A x_{k+1} - b\\ \beta_{k+1} &= \frac{g_{k+1}^T A d_k}{d_k^T A d_k}\\ d_{k+1} &= -g_{k+1} + \beta_{k+1} d_k \end{align*}` --- Some properties: -- * An alternate form of `\(g_{k+1}\)` is $$ g_{k+1} = g_k + \alpha_k A d_k $$ -- This means only one matrix-vector multiplication is needed per iteration. -- * The vector `\(g_k\)` is the gradient of `\(f\)` at `\(x_k\)`. -- * The initial direction `\(d_0 = -g_0\)` is the _direction of steepest descent_ from `\(x_0\)` -- * The directions `\(d_0, d_1, \dots\)` are `\(A\)`-_conjugate_, i.e. `\(d_i^T A d_j = 0\)` for `\(i \neq j\)`. -- * The directions `\(d_0, d_1, \dots, d_{n-1}\)` are linearly independent. --- ### Convergence With exact arithmetic, `\begin{equation*} A x_n = b \end{equation*}` -- That is, the conjugate gradient algorithm terminates with the exact solution in `\(n\)` steps. -- Numerically this does not happen. -- Numerically, the directions will not be exactly `\(A\)`-conjugate. -- A convergence tolerance is used for termination; this can be based on the relative change in the solution `\begin{equation*} \frac{\Norm{x_{k+1}-x_{k}}}{\Norm{x_k}} \end{equation*}` -- or the residual or gradient `\begin{equation*} g_k = A x_k - b \end{equation*}` -- or some combination; an iteration count limit is also a good idea. --- If the algorithm does not terminate within `\(n\)` steps it is a good idea to restart it with a steepest descent step from the current `\(x_k\)`. -- In many sparse and structured problems the algorithm will terminate in far fewer than `\(n\)` steps for reasonable tolerances. -- Convergence is faster if the condition number of `\(A\)` is closer to one. -- The error can be bounded as `\begin{equation*} \Norm{x - x_k}_A \leq 2 \Norm{x - x_0}_A \left(\frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1}\right)^k \end{equation*}` with `\(\Norm{x}_A = \sqrt{x^T A x}\)`. --- _Preconditioning_ strategies can improve convergence. -- These transform the original problem to one with `\(\tilde{A} = C^{-1} A C^{-1}\)` for some symmetric strictly positive definite `\(C\)`, and then use the conjugate gradient method for `\(\tilde{A}\)` -- Simple choices of `\(C\)` are most useful; sometimes a diagonal matrix will do. -- Good preconditioners can sometimes be designed for specific problems. 
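---

A small sketch of the idea behind diagonal (Jacobi) preconditioning:
rescaling a badly scaled matrix to unit diagonal can sharply reduce
`\(\kappa_2\)`, which drives the error bound above. (The matrix here is
made up for illustration.)

```r
set.seed(1)
D <- diag(10 ^ (0:7))                                ## badly scaled
A <- D %*% (crossprod(matrix(rnorm(64), 8)) + diag(8)) %*% D
Cinv <- diag(1 / sqrt(diag(A)))                      ## C = diag(sqrt(a_ii))
kappa(A, exact = TRUE)                               ## very large
kappa(Cinv %*% A %*% Cinv, exact = TRUE)             ## much smaller
```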
--- .pull-left.tiny-code[ ### A Simple Implementation ```r cg <- function(A, b, x, done) { dot <- function(x, y) crossprod(x, y)[1] n <- 0 g <- A(x) - b d <- -g repeat { h <- A(d) u <- dot(d, h) a <- -dot(g, d) / u n <- n + 1 x.old <- x x <- x + a * d g <- g + a * h b <- dot(h, g) / u d <- -g + b * d if (done(g, x, x.old, n)) return(list(x = as.vector(x), g = as.vector(g), n = n)) } } ``` ] .pull-right.tiny-code[ * The linear transformation and the termination condition are specified as functions. * The termination condition can use a combination of the gradient, current solution, previous solution, or iteration count. A simple example: ```r X <- crossprod(matrix(rnorm(25), 5)) y <- rnorm(5) cg(A = function(x) X %*% x, b = y, x = rep(0, 5), done = function(g, x, x.old, n) n >= 5) ## $x ## [1] -2.004109 -8.240581 4.410103 -6.264520 -2.547753 ## ## $g ## [1] -2.037259e-14 -1.693090e-15 -1.690315e-14 -5.551115e-16 ## [5] -1.154632e-14 ## ## $n ## [1] 5 solve(X, y) ## [1] -2.004109 -8.240581 4.410103 -6.264520 -2.547753 ``` ] --- .pull-left.tiny-code[ A simple conjugate gradient based `cgsolve` function might be defined as ```r cgsolve <- function(X, y, tol = sqrt(.Machine$double.eps)) { A <- function(x) X %*% x done <- function(g, x, x.old, n) n >= length(x) || (max(abs(x - x.old) / (tol + abs(x.old))) < tol && max(abs(g)) < tol) x <- rep(0, length(y)) v <- cg(A, y, x, done = done) if (v$n >= length(y)) message("iteration limit reached") else message(v$n, " iterations") v$x } ``` ] -- .pull-right.tiny-code[ With a positive definite matrix with a large diagonal component `cgsolve` can be very effective: ```r N <- 2000 X <- crossprod(matrix(rnorm(N * N), N)) XX <- X / N + 5 * diag(N) y <- rnorm(N) system.time(v0 <- solve(XX, y)) ## user system elapsed ## 1.479 0.012 1.490 system.time(v <- cgsolve(XX, y)) ## 15 iterations ## user system elapsed ## 0.080 0.000 0.081 max(abs((v - v0) / v0)) ## [1] 3.637447e-09 ``` ] --- layout: true ## Linear Algebra Software --- ### Some Standard Packages Open source packages developed at national laboratories: -- * LINPACK for linear equations and least squares -- * EISPACK for eigenvalue problems -- * LAPACK newer package for linear equations and eigenvalues -- Designed for high performance. Available from Netlib at <https://netlib.org/> -- Commercial packages: * IMSL used more in US * NAG used more in UK * ... --- ### BLAS: Basic Linear Algebra Subroutines Modern BLAS has three levels: -- * **Level 1:** Vector and vector/vector operations such as * dot product `\(x^T y\)`; * scalar multiply and add (axpy): `\(\alpha x + y\)`; * Givens rotations; * ... -- * **Level 2:** Matrix/vector operations, such as `\(A x\)` -- * **Level 3:** Matrix/matrix operations, such as `\(A B\)` -- LINPACK uses only Level 1; LAPACK uses all three levels. -- BLAS defines the interface. -- Standard reference implementations are available from Netlib. -- Highly optimized versions are available from hardware vendors and research organizations. --- ### Cholesky Factorization in LAPACK The core of the [`DPOTRF`](https://netlib.org/lapack/explore-3.1.1-html/dpotf2.f.html) subroutine: .pull-left.width-70.tiny-code[ <!-- ## nolint start: brace --> <pre class = 'f'><code>* * Compute the Cholesky factorization A = L*L'. * DO 20 J = 1, N * * Compute L(J,J) and test for non-positive-definiteness. 
*
         AJJ = A( J, J ) - <span style='color: red;'>DDOT</span>( J-1, A( J, 1 ), LDA, A( J, 1 ),
     $         LDA )
         IF( AJJ.LE.ZERO ) THEN
            A( J, J ) = AJJ
            GO TO 30
         END IF
         AJJ = SQRT( AJJ )
         A( J, J ) = AJJ
*
*        Compute elements J+1:N of column J.
*
         IF( J.LT.N ) THEN
            CALL <span style='color: blue;'>DGEMV</span>( 'No transpose', N-J, J-1, -ONE, A( J+1, 1 ),
     $                  LDA, A( J, 1 ), LDA, ONE, A( J+1, J ), 1 )
            CALL <span style='color: red;'>DSCAL</span>( N-J, ONE / AJJ, A( J+1, J ), 1 )
         END IF
   20 CONTINUE
</code></pre>

<!-- ## nolint end -->
]

--

.pull-right.width-30[

* <span style='color: red;'>DDOT</span> and
  <span style='color: red;'>DSCAL</span> are Level 1 BLAS routines

* <span style='color: blue;'>DGEMV</span> is a Level 2 BLAS routine

]

---

### ATLAS: Automatically Tuned Linear Algebra Software

Available at <https://math-atlas.sourceforge.net/>.

* Analyzes machine for properties such as cache characteristics.

* Runs extensive tests to determine performance trade-offs.

* Creates Fortran and C versions of BLAS and some LAPACK routines
  tailored to the particular machine.

* Provides some routines that take advantage of multiple processors
  using _worker threads_.

---

### OpenBLAS

Another high-performance BLAS library was developed and maintained by
Kazushige Goto.

--

This is now being developed and maintained as the OpenBLAS project,
available from <https://github.com/xianyi/OpenBLAS>.

--

Also provides versions that take advantage of multiple processors.

--

### Vendor Libraries

Intel provides the Math Kernel Library (MKL)

--

AMD has a similar library.

--

Apple provides accelerated BLAS libraries for its platforms.

---

### Using a High-Performance BLAS with R

R comes with a basic default BLAS.

--

R can be built to use a specified BLAS.

--

Once built one can change the BLAS R uses by replacing the shared
library R uses.

--

Some simple computations using the default and MKL vendor BLAS for the
data

```r
N <- 1000
X <- matrix(rnorm(N^2), N)
XX <- crossprod(X)
```

---

Results:

| Timing Expression                                | Reference | MKL/SEQ | MKL/THR |
| :----------------------------------------------- | --------: | ------: | ------: |
| `system.time(for (i in 1:5) crossprod(X))`        |     2.107 |   0.405 |   0.145 |
| `system.time(for (i in 1:5) X %*% X)`             |     3.401 |   0.742 |   0.237 |
| `system.time(svd(X))`                             |     3.273 |   0.990 |   0.542 |
| `system.time(for (i in 1:5) qr(X))`               |     2.290 |   1.094 |   1.107 |
| `system.time(for (i in 1:5) qr(X, LAPACK=TRUE))`  |     2.629 |   0.834 |   0.689 |
| `system.time(for (i in 1:20) chol(XX))`           |     2.824 |   0.556 |   0.186 |

--

These results were obtained using the non-threaded and threaded Intel
Math Kernel Library (MKL) a few years ago.

--

Versions of the current R using MKL for BLAS may be available as

```
/group/statsoft/R-patched/build-MKL-seq/bin/R
/group/statsoft/R-patched/build-MKL-thr/bin/R
```

--

Currently the standard version of R on our Linux systems uses OpenBLAS
via
[`flexiblas`](https://www.mpi-magdeburg.mpg.de/projects/flexiblas).
The R [`flexiblas` package](https://cran.r-project.org/package=flexiblas)
can be used to adjust the BLAS used in the current R session.
--- layout: false ### Using `flexiblas` on the CLAS Linux Systems .pull-left.tiny-code[ Using default `flexiblas` settings: ```r N <- 1000 X <- matrix(rnorm(N^2), N) XX <- crossprod(X) > system.time(chol(XX)) user system elapsed 0.146 0.022 0.020 library(flexiblas) > flexiblas_current_backend() [1] "OPENBLAS-OPENMP" ``` {{content}} ] -- Switching to the reference BLAS: ```r > flexiblas_list() [1] "NETLIB" "__FALLBACK__" "OPENBLAS-OPENMP" > flexiblas_list_loaded() [1] "OPENBLAS-OPENMP" > flexiblas_load_backend("NETLIB") [1] 2 > flexiblas_switch(2) > flexiblas_current_backend() [1] "NETLIB" > system.time(chol(XX)) user system elapsed 0.125 0.000 0.126 ``` -- .pull-right.tiny-code[ Back to OpenMP ```r > flexiblas_switch(1) > flexiblas_current_backend() [1] "OPENBLAS-OPENMP" > flexiblas_get_num_threads() [1] 16 > system.time(chol(XX)) user system elapsed 0.136 0.023 0.016 ``` {{content}} ] -- Adjusting the number of threads: ```r > flexiblas_set_num_threads(4) > system.time(chol(XX)) user system elapsed 0.015 0.002 0.005 > flexiblas_set_num_threads(8) > system.time(chol(XX)) user system elapsed 0.023 0.009 0.006 ``` --- layout: false ## Final Notes Most reasonable approaches will be accurate for reasonable problems. -- Choosing good scaling and centering can make problems more reasonable (both numerically and statistically) -- Most methods are efficient enough for our purposes. -- In some problems worrying about efficiency is important if reasonable problem sizes are to be handled. -- Making sure you are using the right approach to the right problem is much more important than efficiency. -- Some quotes: -- D. E. Knuth, restating a comment by C. A. R. Hoare: > We should forget about small efficiencies, say about 97% of the > time: premature optimization is the root of all evil. -- W. A. Wulf: > More computing sins are committed in the name of efficiency (without > necessarily achieving it) than for any other single reason --- > including blind stupidity.