class: center, middle, title-slide

.title[
# Numerical Linear Algebra
]

.author[
### Luke Tierney
]

.institute[
### University of Iowa
]

.date[
### 2023-05-19
]

---
layout: true

<link rel="stylesheet" href="stat7400.css" type="text/css" />

`$$\renewcommand{\machunit}{\mathbf{u}}$$`

<!-- **** does this belong here, farther down, or in arith.Rmd?? -->

## Preliminaries

---

Much of this material is based on

> G. H. Golub and C. F. Van Loan (2013), _Matrix Computations_, 4th
> Ed., JHU Press.

--

### Conditioning and Stability

Some problems are inherently difficult: no algorithm involving
rounding of inputs can be expected to work well. Such problems are
called _ill-conditioned_.

---

A numerical measure of conditioning, called a _condition number_, can
sometimes be defined:

--

* Suppose the objective is to compute `\(y = f(x)\)`.

--

* If `\(x\)` is perturbed by `\(\Delta x\)` then the result is changed by
  `$$\Delta y = f(x + \Delta x) - f(x).$$`

--

* If
  `$$\frac{|\Delta y|}{|y|} \approx \kappa \frac{|\Delta x|}{|x|}$$`
  for small perturbations `\(\Delta x\)` then `\(\kappa\)` is the
  _condition number_ for the problem of computing `\(f(x)\)`.

--

A particular algorithm for computing an approximation `\(\tilde{f}(x)\)`
to `\(f(x)\)` is _numerically stable_ if the computed result is close to
`\(f(\tilde{x})\)` for some slightly perturbed input
`\(\tilde{x} = x + \Delta x\)`, i.e. if it is nearly the exact result
for a nearby problem.

---

### Error Analysis

Analyzing how errors accumulate and propagate through a computation,
called _forward error analysis_, is sometimes possible but often very
difficult.

--

_Backward error analysis_ tries to show that the computed result
`\begin{equation*}
  \tilde{y} = \tilde{f}(x)
\end{equation*}`
is the exact solution to a slightly perturbed problem,

--

i.e.
`\begin{equation*}
  \tilde{y} = f(\tilde{x})
\end{equation*}`
for some `\(\tilde{x} \approx x\)`.

---

If

* the problem of computing `\(f(x)\)` is well conditioned, and

--

* the algorithm `\(\tilde{f}\)` is stable,

--

then

<!-- ***** -->
<!--
`\begin{align*}
  \tilde{y} &= \tilde{f}(x) & \text{computed result}\\
  & = f(\tilde{x}) & \text{exact result for some `\(\tilde{x} \approx x\)`}\\
  &\approx f(x) & \text{since `\(f\)` is well-conditioned}
\end{align*}`
-->
`\begin{align*}
  \tilde{y} &= \tilde{f}(x) & \text{computed result}\\
  & = f(\tilde{x}) & \text{exact result}\\
  &\approx f(x) & \text{well-conditioned}
\end{align*}`

--

Backward error analysis is used heavily in numerical linear algebra.

---
layout: true

## Solving Linear Systems

---

Many problems involve solving linear systems of the form
`\begin{equation*}
  A x = b
\end{equation*}`

--

* least squares normal equations:
  `$$X^TX \beta = X^T y$$`

--

* stationary distribution of a Markov chain:
  `$$\begin{align*}
  \pi P &= \pi\\
  \sum \pi_i &= 1
  \end{align*}$$`

--

If `\(A\)` is `\(n \times n\)` and non-singular then in principle the
solution is
`\begin{equation*}
  x = A^{-1} b
\end{equation*}`

--

This is not usually a good numerical approach because

* it can be numerically inaccurate;

* it is inefficient except for very small `\(n\)`.

---

### Triangular Systems

Triangular systems are easy to solve. The upper triangular system
`\begin{equation*}
  \begin{bmatrix} 5 & 3 \\ 0 & 2 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 4 \end{bmatrix}
\end{equation*}`

--

has solution
`\begin{align*}
  x_2 & = 4 / 2 = 2\\
  x_1 &= (16 - 3 x_2)/5 = 10/5 = 2
\end{align*}`

--

This is called _back substitution_.

--

Lower triangular systems are solved by _forward substitution_.
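---

In R, `backsolve` and `forwardsolve` solve upper and lower triangular
systems. A quick check on the example above (a minimal sketch):

```r
R <- matrix(c(5, 0, 3, 2), 2, 2)   ## the upper triangular matrix above
backsolve(R, c(16, 4))             ## back substitution: c(2, 2)
```

--

`forwardsolve` handles lower triangular systems analogously.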
---

If one of the diagonal elements in a triangular matrix is zero, then
the matrix is singular.

--

If one of the diagonal elements in a triangular matrix is close to
zero, then the solution is very sensitive to other inputs:
`\begin{equation*}
  \begin{bmatrix} 1 & a \\ 0 & \epsilon \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} b_1\\ b_2 \end{bmatrix}
\end{equation*}`

--

has solution
`\begin{align*}
  x_2 & = \frac{b_2}{\epsilon}\\
  x_1 &= b_1 - a \frac{b_2}{\epsilon}
\end{align*}`

--

This sensitivity for small `\(\epsilon\)` is inherent in the problem:
for small values of `\(\epsilon\)` the problem of finding the solution
`\(x\)` is ill-conditioned.

---

### Gaussian Elimination

The system
`\begin{equation*}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

can be reduced to triangular form by subtracting two times the first
equation from the second.

--

In matrix form:
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -2 & 1 \end{bmatrix}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 1 & 0\\ -2 & 1 \end{bmatrix}
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

or
`\begin{equation*}
  \begin{bmatrix} 5 & 3 \\ 0 & 2 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 4 \end{bmatrix}
\end{equation*}`

--

which is the previous triangular system.

---

For a general `\(2 \times 2\)` matrix `\(A\)` the lower triangular
matrix used for the reduction is
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -\frac{a_{21}}{a_{11}} & 1 \end{bmatrix}
\end{equation*}`

--

The ratio `\(\frac{a_{21}}{a_{11}}\)` is called a _multiplier_.

--

This strategy works as long as `\(a_{11} \neq 0\)`.

--

If `\(a_{11} \approx 0\)`, say
`\begin{equation*}
  A = \begin{bmatrix} \epsilon & 1\\ 1 & 1 \end{bmatrix}
\end{equation*}`

--

for small `\(\epsilon\)`, then the multiplier `\(1/\epsilon\)` is large
and this does not work very well, even though `\(A\)` is very well
behaved.

--

Using this approach would result in a numerically unstable algorithm
for a well-conditioned problem.
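---

A small illustration in R (a sketch; the exact values depend on the
rounding of the intermediate results):

```r
eps <- 1e-17
A <- matrix(c(eps, 1, 1, 1), 2, 2)   ## the matrix above; true solution near c(1, 1)
b <- c(1, 2)

## eliminate without pivoting: the multiplier 1 / eps is huge
m <- A[2, 1] / A[1, 1]
x2 <- (b[2] - m * b[1]) / (A[2, 2] - m * A[1, 2])
x1 <- (b[1] - A[1, 2] * x2) / A[1, 1]
c(x1, x2)     ## x1 is lost to cancellation

solve(A, b)   ## partial pivoting recovers a result close to c(1, 1)
```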
---

### Partial Pivoting

We can ensure that the multiplier is less than or equal to one in
magnitude by switching rows before eliminating:
`\begin{equation*}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

or
`\begin{equation*}
  \begin{bmatrix} 10 & 8\\ 5 & 3 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 36\\ 16 \end{bmatrix}
\end{equation*}`

--

The matrix to reduce this system to triangular form is now
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -0.5 & 1 \end{bmatrix}
\end{equation*}`

---

So the final triangular system is constructed as
`\begin{equation*}
  \begin{bmatrix} 1 & 0\\ -0.5 & 1 \end{bmatrix}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 5 & 3 \\ 10 & 8 \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 1 & 0\\ -0.5 & 1 \end{bmatrix}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

or
`\begin{equation*}
  \begin{bmatrix} 10 & 8\\ 0 & -1 \\ \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 36\\ -2 \end{bmatrix}
\end{equation*}`

--

Equivalently, we can think of our original system as
`\begin{equation*}
  \begin{bmatrix} 0 & 1\\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} 1 & 0\\ 0.5 & 1 \end{bmatrix}
  \begin{bmatrix} 10 & 8\\ 0 & -1 \\ \end{bmatrix}
  \begin{bmatrix} x_1\\ x_2 \end{bmatrix} =
  \begin{bmatrix} 16\\ 36 \end{bmatrix}
\end{equation*}`

--

The decomposition of `\(A\)` as
`\begin{equation*}
  A = P L U
\end{equation*}`

--

with `\(P\)` a permutation matrix, `\(L\)` lower triangular with ones on
the diagonal, and `\(U\)` upper triangular is called a _PLU
decomposition_.

---

### PLU Decomposition

In general, we can write a square matrix `\(A\)` as
`\begin{equation*}
  A = P L U
\end{equation*}`
where

--

* `\(P\)` is a _permutation matrix_, i.e.

    * it is an identity matrix with some rows switched
    * it satisfies `\(P P^T = P^T P = I\)`, i.e. it is an _orthogonal matrix_

--

* `\(L\)` is a _unit lower triangular matrix_, i.e.

    * it is lower triangular
    * it has ones on the diagonal

--

* `\(U\)` is upper triangular

--

The permutation matrix `\(P\)` can be chosen so that the multipliers
used in forming `\(L\)` all have magnitude at most one.

---

`\(A\)` is non-singular if and only if the diagonal entries in `\(U\)`
are all non-zero.

--

If `\(A\)` is non-singular, then we can solve
`\begin{equation*}
  A x = b
\end{equation*}`

--

or
`\begin{equation*}
  PLU x = b
\end{equation*}`
in three steps:

--

1. Solve `\(P z = b\)` for `\(z = P^Tb\)` (permute the right hand side)

2. Solve `\(L y = z\)` for `\(y\)` (forward solve lower triangular system)

3. Solve `\(U x = y\)` for `\(x\)` (back solve upper triangular system)

--

Computational complexity:

--

* Computing the _PLU_ decomposition takes `\(O(n^3)\)` operations.

--

* Computing a solution from a _PLU_ decomposition takes `\(O(n^2)\)`
  operations.

---

### Condition Number

Linear systems `\(Ax=b\)` have unique solutions if `\(A\)` is
non-singular.

--

Solutions are sensitive to small perturbations if `\(A\)` is close to
singular.
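--

For example (a sketch; in exact arithmetic the solutions are
`\((2, 0)\)` and `\((-998, 1000)\)`):

```r
A <- matrix(c(1, 1, 1, 1 + 1e-10), 2, 2)   ## nearly singular
solve(A, c(2, 2))          ## approximately c(2, 0)
solve(A, c(2, 2 + 1e-7))   ## a tiny change in b changes x drastically
```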
-- We need a useful measure of closeness to singularity --- The _condition number_ is a useful measure: `\begin{align*} \renewcommand{\Norm}[1]{\|{#1}\|} \kappa(A) &= \frac{\max_{x \ne 0} \frac{\Norm{Ax}}{\Norm{x}}} {\min_{x \ne 0} \frac{\Norm{Ax}}{\Norm{x}}} \\ &= \left(\max_{x \ne 0} \frac{\Norm{Ax}}{\Norm{x}}\right) \left(\max_{x \ne 0} \frac{\Norm{A^{-1}x}}{\Norm{x}}\right) \\ &= \Norm{A}\Norm{A^{-1}} \end{align*}` -- where `\(\Norm{y}\)` is a _vector norm_ (i.e. a measure of length) of `\(y\)` -- and $$ \Norm{B} = \max_{x \ne 0} \frac{\Norm{Bx}}{\Norm{x}} $$ is the corresponding _matrix norm_ of `\(B\)`. --- Some common vector norms: `\begin{align*} \|x\|_2 &= \sqrt{\sum_{i=1}^n x_i^2} & \text{Euclidean norm}\\ \|x\|_1 &= \sum_{i=1}^n |x_i| & \text{$L_1$ norm, Manhattan norm}\\ \|x\|_\infty &= \max_i |x_i| & \text{$L_\infty$ norm} \end{align*}` --- ### Some Properties of Condition Numbers `\(\kappa(A) \ge 1\)` for all `\(A\)`. -- `\(\kappa(A) = \infty\)` if `\(A\)` is singular -- If `\(A\)` is diagonal, then `\begin{equation*} \kappa(A) = \frac{\max |a_{ii}|}{\min |a_{ii}|} \end{equation*}` -- Different norms produce different values; the values are usually qualitatively similar --- ### Sensitivity of Linear Systems Suppose `\(x\)` solves the original system and `\(x^{*}\)` solves a slightly perturbed system, `\begin{equation*} (A + \Delta A) x^{*} = b + \Delta b \end{equation*}` -- and suppose that `\begin{align*} \delta \kappa(A) &\le \frac{1}{2}\\ \frac{\Norm{\Delta A}}{\Norm{A}} & \le \delta\\ \frac{\Norm{\Delta b}}{\Norm{b}} & \le \delta \end{align*}` -- Then `\begin{equation*} \frac{\Norm{x-x^{*}}}{\Norm{x}} \le 4 \delta \kappa(A) \end{equation*}` --- ### Stability of Gaussian Elimination with Partial Pivoting Backward error analysis: The numerical solution `\(\hat{x}\)` to the system `\begin{equation*} Ax = b \end{equation*}` produced by Gaussian elimination with partial pivoting is the exact solution for a perturbed system `\begin{equation*} (A + \Delta A) \hat{x} = b \end{equation*}` with `\begin{equation*} \frac{\Norm{\Delta A}_{\infty}}{\Norm{A}_{\infty}} \le 8 n^3 \rho \machunit + O(\machunit^2) \end{equation*}` -- * The value of `\(\rho\)` is not _guaranteed_ to be small, but is rarely larger than 10 -- * The algorithm would be considered numerically stable if `\(\rho\)` were guaranteed to be bounded. -- * _Complete pivoting_ is a bit more stable, but much more work. -- * The algorithm is considered very good for practical purposes. --- ### General Linear Systems in R R provides -- * `solve` for general systems, based on LAPACK's [DGESV](https://netlib.org/lapack/explore-html/d8/d72/dgesv_8f.html). -- * DGESV uses the `\(P L U\)` decomposition. -- * `forwardsolve`, `backsolve` for triangular systems. -- * `kappa` computes an estimate of the condition number or the exact condition number based on the Euclidean norm. --- layout: true ## Cholesky Factorization --- Suppose `\(A\)` is symmetric and (strictly) positive definite, i.e. `\begin{equation*} x^T A x > 0 \end{equation*}` for all `\(x \ne 0\)`. -- Some examples: * If `\(X\)` is the `\(n \times p\)` design matrix for a linear model and `\(X\)` is of rank `\(p\)`, then `\(A = X^T X\)` is strictly positive definite. -- If `\(X\)` is not of full rank then `\(A = X^T X\)` is non-negative definite or positive semi-definite, i.e. `\(x^T A x \ge 0\)` for all `\(x\)`. 
-- * If `\(A\)` is the covariance matrix of a random vector `\(X\)` then `\(A\)` is positive semidefinite: `$$\begin{align*} c^T A c &= c^T E[(X-\mu)(X-\mu)^T] c\\ &= E[((X-\mu)^T c)^T (X - \mu)^T c]\\ &= \text{Var}((X - \mu)^T c) \ge 0 \end{align*}$$` -- The covariance matrix is strictly positive definite unless `\(P(c^T X = c^T\mu) = 1\)` for some `\(c \ne 0\)`, i.e. unless there is a perfect linear relation between some of the components of `\(X\)`. --- <!-- begin Theorem --> **Theorem:** If `\(A\)` is strictly positive definite, then there exists a unique lower triangular matrix `\(L\)` with positive diagonal entries such that `\begin{equation*} A = L L^T \end{equation*}` This is called the _Cholesky factorization_. <!-- end Theorem --> --- ### Properties of the Cholesky Factorization Algorithm It uses the symmetry to produce an efficient algorithm. -- The algorithm needs to take square roots to find the diagonal entries. -- An alternative that avoids square roots factors `\(A\)` as `\begin{equation*} A = L D L^T \end{equation*}` with `\(D\)` diagonal and `\(L\)` unit lower triangular. -- The algorithm is numerically stable, and is guaranteed not to attempt square roots of negative numbers if $$ q_n \machunit \kappa_2(A) \le 1 $$ where `\(q_n\)` is a small constant depending on the dimension `\(n\)`. --- The algorithm will fail if the matrix is not (numerically) strictly positive definite. -- Modifications using pivoting are available that can be used for nonnegative definite matrices. -- Another option is to factor `\(A_\lambda = A + \lambda I\)` with `\(\lambda > 0\)` chosen large enough to make `\(A_\lambda\)` numerically strictly positive definite. -- This is often used in optimization. --- ### Some Applications of the Cholesky Factorization * Solving the normal equations in least squares. -- This requires that the predictors be linearly independent -- * Generating multivariate normal random vectors. -- * Parameterizing strictly positive definite matrices: -- Any lower triangular matrix `\(L\)` with arbitrary values below the diagonal and positive diagonal entries determines and is uniquely determined by the positive definite matrix `\(A = L L^T\)` --- ### Cholesky Factorization in R -- The function `chol` computes the Cholesky factorization. -- The returned value is the upper triangular matrix `\(R = L^T\)`. -- LAPACK is used for the computation. --- layout: true ## QR Factorization --- An `\(m \times n\)` matrix `\(A\)` with `\(m \ge n\)` can be written as `\begin{equation*} A = Q R \end{equation*}` where -- * `\(Q\)` is `\(m \times n\)` with orthonormal columns, i.e. `\(Q^T Q = I_n\)` -- * `\(R\)` is upper triangular -- Several algorithms are available for computing the QR decomposition: -- * Modified Gram-Schmidt; -- * Householder transformations (reflections); -- * Givens transformations (rotations). --- Each has advantages and disadvantages. -- LINPACK `dqrdc` and LAPACK `DGEQP3` use Householder transformations. -- The QR decomposition exists regardless of the rank of `\(A\)`. -- The rank of `\(A\)` is `\(n\)` if and only if the diagonal elements of `\(R\)` are all non-zero. --- ### Householder Transformations A Householder transformation is a matrix of the form `\begin{equation*} P = I - 2 v v^T/ v^T v \end{equation*}` where `\(v\)` is a nonzero vector. -- * `\(Px\)` is the reflection of `\(x\)` in the hyperplane orthogonal to `\(v\)`. 
-- * Given a vector `\(x \neq 0\)`, choosing `\(v = x + \alpha e_1\)` with `$$\alpha = \pm \|x\|_2$$` and `\(e_1\)` the first unit vector (first column of the identity) produces `$$Px = \mp\|x\|_2 e_1$$` --- This can be used to zero all but the first element of the first column of a matrix: `\begin{equation*} P \begin{bmatrix} \times & \times & \times\\ \times & \times & \times\\ \times & \times & \times\\ \times & \times & \times\\ \times & \times & \times\\ \end{bmatrix} = \begin{bmatrix} \times & \times & \times\\ 0 & \times & \times\\ 0 & \times & \times\\ 0 & \times & \times\\ 0 & \times & \times\\ \end{bmatrix} \end{equation*}` -- This is the first step in computing the `\(QR\)` factorization. -- The denominator `\(v^Tv\)` can be written as `\begin{equation*} v^Tv = x^Tx + 2 \alpha x_1 + \alpha^2 \end{equation*}` -- Choosing `\(\alpha = \text{sign}(x_1)\|x\|_2\)` ensures that all terms are non-negative and avoids cancellation. --- With the right choice of sign Householder transformations are very stable. -- The computation `\(Py\)` would be carried out without creating the `\(P\)` matrix: `\begin{equation*} Py = (I - 2 v v^T/ v^T v)y = y - \left(2 \frac{v^T y}{v^T v} \right) v. \end{equation*}` --- ### Givens Rotations A Givens rotation is a matrix `\(G\)` that is equal to the identity except for elements `\(G_{ii},G_{ij},G_{ji},G_{jj}\)`, which are `\begin{equation*} \begin{bmatrix} G_{ii} & G_{ij} \\ G_{ji} & G_{jj} \end{bmatrix} = \begin{bmatrix} c & s\\ -s & c \end{bmatrix} \end{equation*}` with `\(c = \cos(\theta)\)` and `\(s = \sin(\theta)\)` for some `\(\theta\)`. -- Premultiplication by `\(G^T\)` is a clockwise rotation by `\(\theta\)` radians in the `\((i,j)\)` coordinate plane. -- Given scalars `\(a,b\)` one can compute `\(c, s\)` so that `\begin{equation*} \begin{bmatrix} c & s\\ -s & c \end{bmatrix}^T \begin{bmatrix} a\\ b \end{bmatrix} = \begin{bmatrix} r\\ 0 \end{bmatrix} \end{equation*}` -- This allows `\(G\)` to zero one element while changing only one other element. --- A stable way to choose `\(c, s\)`: **if** `\(b = 0\)`<BR> `\(c = 1; s = 0\)`<BR> **else** **if** `\(|b| > |a|\)`<BR> `\(\tau = -a/b; s = 1/\sqrt{1+\tau^2}; c = s\tau\)`<BR> **else**<BR> `\(\tau = -b/a;\; c = 1/\sqrt{1 + \tau^2};\; s = c \tau\)`<BR> **end** A sequence of Givens rotations can be used to compute the `\(QR\)` factorization. * The zeroing can be done working down columns or across rows. * Working across rows is useful for incrementally adding more observations. --- ### Applications The QR decomposition can be used for solving `\(n \times n\)` systems of equations `\begin{equation*} Ax = b \end{equation*}` -- since `\(Q^{-1} = Q^T\)` and so `\begin{equation*} A x = Q R x = b \end{equation*}` is equivalent to the upper triangular system `\begin{equation*} R x = Q^T b \end{equation*}` --- The QR decomposition can also be used to solve the normal equations in linear regression: -- If `\(X\)` is the `\(n \times p\)` design matrix then the normal equations are `\begin{equation*} X^T X b = X^T y \end{equation*}` -- If `\(X = QR\)` is the QR decomposition of `\(X\)`, then `\(X^T = R^T Q^T\)` and therefore `\begin{align*} X^TX &= R^T Q^T Q R = R^T R\\ X^T y &= R^T Q^T y \end{align*}` -- If `\(X\)` is of full rank then `\(R^T\)` is invertible, and the normal equations are equivalent to the upper triangular system `\begin{equation*} R b = Q^T y \end{equation*}` -- This approach avoids computing `\(X^T X\)`. 
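--

A quick sketch using the built-in `mtcars` data (illustrative only):

```r
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$mpg
qrX <- qr(X)
b <- backsolve(qr.R(qrX), t(qr.Q(qrX)) %*% y)     ## solve R b = Q^T y
cbind(b, coef(lm(mpg ~ wt + hp, data = mtcars)))  ## same coefficients
```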
--

If `\(X\)` is of full rank then `\(R^T R\)` is the Cholesky
factorization of `\(X^T X\)` (up to multiplications of rows of `\(R\)`
by `\(\pm 1\)`).

---

### QR with Column Pivoting

Sometimes the columns of `\(X\)` are linearly dependent or nearly so.

--

By permuting columns we can produce a factorization
`\begin{equation*}
  A = Q R P
\end{equation*}`
where

--

* `\(P\)` is a permutation matrix

--

* `\(R\)` is upper triangular and the diagonal elements of `\(R\)` have
  non-increasing magnitudes, i.e.
  `$$|r_{ii}| \ge |r_{jj}|$$`
  if `\(i \le j\)`.

---

If some of the diagonal entries of `\(R\)` are zero, then `\(R\)` will
be of the form
`\begin{equation*}
  R = \begin{bmatrix} R_{11} & R_{12}\\ 0 & 0 \end{bmatrix}
\end{equation*}`
where `\(R_{11}\)` is upper triangular with non-zero diagonal elements
non-increasing in magnitude.

--

The rank of the matrix is the number of non-zero rows in `\(R\)`.

--

The _numerical rank_ of a matrix can be determined by

--

* computing its QR factorization with column pivoting

--

* specifying a tolerance level `\(\epsilon\)` such that all diagonal
  entries `\(|r_{ii}| < \epsilon\)` are considered numerically zero.

--

* modifying the computed QR factorization to zero all rows
  corresponding to numerically zero `\(r_{ii}\)` values.

---

### Some Regression Diagnostics

The projection matrix, or hat matrix, is
`\begin{equation*}
  H = X (X^T X)^{-1} X^T = QR (R^T R)^{-1} R^T Q^T = QQ^T
\end{equation*}`

--

The diagonal elements of the hat matrix are therefore
`\begin{equation*}
  h_i = \sum_{j=1}^p q_{ij}^2
\end{equation*}`

---

If `\(\hat{e}_i = y_i - \hat{y}_i\)` is the residual, then
`\begin{align*}
  s_{-i}^2 &= \frac{\text{SSE} - \hat{e}_i^2/(1-h_i)}{n-p-1}
    = \text{ estimate of variance without obs. i}\\
  t_i &= \frac{\hat{e}_i}{s_{-i}\sqrt{1-h_i}}
    = \text{externally studentized residual}\\
  D_i &= \frac{\hat{e}_i^2 h_i}{(1-h_i)^2s^2p}
    = \text{Cook's distance}
\end{align*}`

---

### QR Decomposition and Least Squares in R

The R function `qr` uses either LINPACK or LAPACK to compute QR
factorizations.

--

A modified LINPACK routine is the default.

--

The core linear model fitting function `lm.fit` uses QR factorization
with column pivoting.

---
layout: true

## Singular Value Decomposition

---

An `\(m \times n\)` matrix `\(A\)` with `\(m \ge n\)` can be factored as
`\begin{equation*}
  A = U D V^T
\end{equation*}`
where

--

* `\(U\)` is `\(m \times n\)` with orthonormal columns, i.e. `\(U^T U = I_n\)`.

--

* `\(V\)` is `\(n \times n\)` orthogonal, i.e. `\(V V^T = V^T V = I_n\)`.

--

* `\(D = \text{diag}(d_1,\dots,d_n)\)` is `\(n \times n\)` diagonal with
  `\(d_1 \ge d_2 \ge \dots \ge d_n \ge 0\)`.

--

This is the _singular value decomposition_, or _SVD_ of `\(A\)`.

--

* The values `\(d_1, \dots, d_n\)` are the _singular values_ of `\(A\)`.

--

* The columns of `\(U\)` are the _left singular vectors_ of `\(A\)`.

--

* The columns of `\(V\)` are the _right singular vectors_ of `\(A\)`.

---

If the columns of `\(A\)` have been centered so the column sums of
`\(A\)` are zero, then the columns of `\(UD\)` are the _principal
components_ of `\(A\)`.

--

Excellent algorithms are available for computing the SVD.

--

These algorithms are usually several times slower than the QR
algorithms.

---

### Some Properties of the SVD

The Euclidean matrix norm of `\(A\)` is defined as
`\begin{equation*}
  \Norm{A}_2 = \max_{x \neq 0}\frac{\Norm{Ax}_2}{\Norm{x}_2}
\end{equation*}`
with `\(\Norm{x}_2 = \sqrt{x^T x}\)` the Euclidean vector norm.
--

If `\(A\)` has SVD `\(A=UDV^T\)`, then
`\begin{equation*}
  \Norm{A}_2 = d_1
\end{equation*}`

---

If `\(k < \text{rank}(A)\)` and
`\begin{equation*}
  A_k = \sum_{i=1}^k d_i u_i v_i^T
\end{equation*}`
where `\(u_i\)` is the `\(i\)`-th column of `\(U\)` and `\(v_i\)` is the
`\(i\)`-th column of `\(V\)`

--

then
`\begin{equation*}
  \min_{B:\text{rank}(B) \le k} \Norm{A - B}_2 = \Norm{A - A_k}_2 = d_{k+1}
\end{equation*}`

--

In particular,

--

* `\(d_1 u_1 v_1^T\)` is the best rank one approximation to `\(A\)` (in
  the Euclidean matrix norm).

--

* `\(A_k\)` is the best rank `\(k\)` approximation to `\(A\)`.

--

* If `\(m = n\)` then `\(d_n = \min\{d_1,\dots,d_n\}\)` is the distance
  between `\(A\)` and the set of singular matrices.

---

If `\(A\)` is square then the condition number based on the Euclidean
norm is
`\begin{equation*}
  \kappa_2(A) = \Norm{A}_2\Norm{A^{-1}}_2 = \frac{d_1}{d_n}
\end{equation*}`

--

For an `\(m \times n\)` matrix with `\(m > n\)` we also have
`\begin{equation*}
  \kappa_2(A) =
  \frac{\max_{x \ne 0} \frac{\Norm{Ax}_2}{\Norm{x}_2}}
       {\min_{x \ne 0} \frac{\Norm{Ax}_2}{\Norm{x}_2}} = \frac{d_1}{d_n}
\end{equation*}`

--

* This can be used to relate `\(\kappa_2(A^T A)\)` to `\(\kappa_2(A)\)`.

--

* This has implications for regression computations.

--

The singular values are the non-negative square roots of the
eigenvalues of `\(A^T A\)` and the columns of `\(V\)` are the
corresponding eigenvectors.

---

### Moore-Penrose Generalized Inverse

.pull-left[

Suppose `\(A\)` has rank `\(r \le n\)` and SVD `\(A = U D V^T\)`.

{{content}}

]

--

Then
`\begin{equation*}
  d_{r+1} = \dots = d_n = 0
\end{equation*}`

{{content}}

--

Let
`\begin{equation*}
  D^+ = \text{diag}\left(\frac{1}{d_1}, \dots, \frac{1}{d_r}, 0, \dots, 0\right)
\end{equation*}`

{{content}}

--

and let
`\begin{equation*}
  A^+ = V D^+ U^T
\end{equation*}`

--

.pull-right[

Then `\(A^+\)` satisfies
`\begin{align*}
  A A^+ A &= A\\
  A^+ A A^+ &= A^+\\
  (A A^+)^T &= A A^+\\
  (A^+A)^T &= A^+ A
\end{align*}`

{{content}}

]

--

`\(A^+\)` is the unique matrix with these properties and is called the
Moore-Penrose _generalized inverse_ or _pseudo-inverse_.

---

### SVD and Least Squares

If `\(X\)` is an `\(n \times p\)` design matrix of less than full rank,
then there are infinitely many values of `\(b\)` that minimize
`\begin{equation*}
  \Norm{y - Xb}_2^2
\end{equation*}`

--

Among these solutions,
`\begin{equation*}
  b = (X^T X)^+ X^T y
\end{equation*}`
minimizes `\(\Norm{b}_2\)`.

--

This is related to _penalized regression_ where one might choose
`\(b\)` to minimize
`\begin{equation*}
  \Norm{y - Xb}_2^2 + \lambda \Norm{b}_2^2
\end{equation*}`
for some choice of `\(\lambda > 0\)`.

---

### SVD and Principal Components Analysis

Let `\(X\)` be an `\(n \times p\)` matrix of `\(n\)` observations on
`\(p\)` variables.

--

Principal components analysis involves estimating the eigenvectors and
eigenvalues of the covariance matrix.

--

Let `\(\widetilde{X}\)` be the data matrix with columns centered at zero
by subtracting the column means.

--

The sample covariance matrix is
`\begin{equation*}
  S = \frac{1}{n-1} \widetilde{X}^T\widetilde{X}
\end{equation*}`

--

Let `\(\widetilde{X} = U D V^T\)` be the SVD of the centered data matrix
`\(\widetilde{X}\)`.

---

Then
`\begin{equation*}
  S = \frac{1}{n-1} V D U^T U D V^T = \frac{1}{n-1} V D^2 V^T
\end{equation*}`

--

* The diagonal elements of `\(\frac{1}{n-1}D^2\)` are the eigenvalues of `\(S\)`.

--

* The columns of `\(V\)` are the eigenvectors of `\(S\)`.
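--

A small check (a sketch using a built-in data set):

```r
Xc <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
s <- svd(Xc)
s$d^2 / (nrow(Xc) - 1)        ## eigenvalues of S from the SVD
eigen(cov(USArrests))$values  ## same values, up to rounding
```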
--

Using the SVD of `\(\widetilde{X}\)` is more numerically stable than

--

* forming `\(\widetilde{X}^T\widetilde{X}\)`

--

* computing the eigenvalues and eigenvectors.

---

### SVD and Numerical Rank

The rank of a matrix `\(A\)` is equal to the number of non-zero
singular values.

--

Exact zeros may not occur in the SVD due to rounding.

--

Numerical rank determination can be based on the SVD. All
`\(d_i \le \delta\)` can be set to zero for some choice of
`\(\delta\)`. Golub and Van Loan recommend using
`$$\delta = \machunit \Norm{A}_\infty$$`

--

If the entries of `\(A\)` are only accurate to `\(d\)` decimal digits,
then Golub and Van Loan recommend
`\begin{equation*}
  \delta = 10^{-d} \Norm{A}_\infty
\end{equation*}`

--

If the numerical rank of `\(A\)` is `\(\hat{r}\)` and
`\(d_{\hat{r}} \gg \delta\)` then `\(\hat{r}\)` can be used with some
confidence; otherwise caution is needed.

---

.pull-left.tiny-code[

### Other Applications

The SVD is used in many areas of numerical analysis.

{{content}}

]

--

It is also often useful as a theoretical tool.

{{content}}

--

Some approaches to compressing `\(m \times n\)` images are based on the
SVD.

{{content}}

--

A simple example using the
[`volcano`](https://teara.govt.nz/en/photograph/3920/maungawhau-mt-eden)
data:

<!-- $ -- get emacs sane again -->


```r
s <- svd(volcano)
head(s$d, 10)
## [1] 9644.28782 488.60992 341.18358 298.76602 141.83363
## [6] 72.12443 43.55698 33.52319 27.38376 19.97622
tail(s$d, 2)
## [1] 1.0526941 0.9545092
```

The relative error in the final image is 0.056.

--

.pull-right[
<img src="linalg_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---

### SVD in R

R provides the function `svd` to compute the SVD.

--

The implementation uses LAPACK.

--

You can ask for the singular values only; this will be faster for
larger problems.

---
layout: true

## Eigenvalues and Eigenvectors

---

Let `\(A\)` be an `\(n \times n\)` matrix. `\(\lambda\)` is an
_eigenvalue_ of `\(A\)` if
`\begin{equation*}
  A v = \lambda v
\end{equation*}`
for some `\(v \neq 0\)`; `\(v\)` is an _eigenvector_ of `\(A\)`.

--

If `\(A\)` is a real `\(n \times n\)` matrix then it has `\(n\)`
eigenvalues.

--

* Several eigenvalues may be identical.

--

* Some eigenvalues may be complex; if so, then they come in conjugate
  pairs.

--

* The set of eigenvalues is called the _spectrum_.

--

If `\(A\)` is symmetric then the eigenvalues are real.

---

If `\(A\)` is symmetric then

--

* `\(A\)` is strictly positive definite if and only if all eigenvalues
  are positive.

--

* `\(A\)` is positive semi-definite if and only if all eigenvalues are
  non-negative.

--

* There exists an orthogonal matrix `\(V\)` such that
  $$ A = V \Lambda V^T $$
  with `\(\Lambda = \text{diag}(\lambda_1,\dots,\lambda_n)\)`; the
  columns of `\(V\)` are the corresponding normalized eigenvectors.

--

* This is called the _spectral decomposition_ of `\(A\)`.

---

Some problems require only the largest eigenvalue or the largest few;
sometimes the corresponding eigenvectors are also needed.

--

* The stationary distribution of an irreducible finite state-space
  Markov chain is the unique left eigenvector of the transition matrix,
  normalized to sum to one, corresponding to the largest eigenvalue
  `\(\lambda = 1\)`.

--

* The speed of convergence to the stationary distribution depends on
  the magnitude of the second largest eigenvalue.

--

The R function `eigen` can be used to compute eigenvalues and,
optionally, eigenvectors.
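--

A small sketch for a two-state chain (the transition matrix here is
made up for illustration):

```r
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), 2, 2, byrow = TRUE)
e <- eigen(t(P))                    ## pi P = pi, so use the transpose
v <- Re(e$vectors[, which.max(Re(e$values))])
v / sum(v)                          ## stationary distribution, (2/3, 1/3)
```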
--

The package `svd` provides routines for computing a specified number
of the largest eigenvalues and corresponding eigenvectors.

---
layout: true

## Determinants

---

Theoretically the determinant can be computed as the product of

--

* the diagonals of `\(U\)` in the `\(PLU\)` decomposition;

--

* the squares of the diagonals of `\(L\)` in the Cholesky factorization;

--

* the diagonals of `\(R\)` in the QR decomposition;

--

* the eigenvalues.

--

Numerically these are almost always bad ideas.

--

It is almost always better to work out the sign and compute the sum of
the logarithms of the magnitudes of the factors.

---

The R functions `det` and `determinant` compute the determinant.

--

`determinant` is more complicated to use, but has a `logarithm` option.

--

Likelihood and Bayesian analyses often involve a determinant.

--

* Usually the log likelihood and log determinant should be used.

--

* Usually the log determinant can be computed from a decomposition
  needed elsewhere in the log likelihood calculation, e.g. a Cholesky
  factorization.

---

### Example: Multivariate Normal Log-Likelihood Computation

The density at `\(x\)` of a `\(d\)`-dimensional normal distribution with
mean vector `\(\mu\)` and covariance matrix `\(C\)` is
`\begin{equation*}
  \frac{1}{\sqrt{(2 \pi)^d \det{C}}}
  \exp\left\{-\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu)\right\}.
\end{equation*}`

--

The log-density (dropping the additive constant) is
`\begin{equation*}
  -\frac{1}{2} \log\det{C} -\frac{1}{2} (x - \mu)^T C^{-1} (x - \mu).
\end{equation*}`

--

If `\(C = L L^T\)` is the Cholesky factorization of `\(C\)`, then
`\(C^{-1} = L^{-T}L^{-1}\)` and the quadratic form is
`\begin{equation*}
  (x - \mu)^T C^{-1} (x - \mu) = (x - \mu)^T L^{-T}L^{-1} (x - \mu)
  = \|L^{-1} (x - \mu)\|_2^2.
\end{equation*}`

---

So to compute the log density at `\(x\)`:

--

* Compute `\(y = L^{-1} (x - \mu)\)` by solving `\(L y = (x - \mu)\)`
  for `\(y\)`.

--

* Compute `\(\frac{1}{2} \log\det{C} = \log \det{L}\)`, which is the sum
  of the logarithms of the diagonal elements of `\(L\)`.

--

If `\(C^{-1}\)` is given, then the Cholesky factorization of
`\(C^{-1}\)` can be used to compute
`\begin{equation*}
  \log\det{C} = -\log\det{C^{-1}}.
\end{equation*}`

---
layout: true

## Non-Negative Matrix Factorization

---

<!-- https://www.r-bloggers.com/quick-intro-to-nmf-the-method-and-the-r-package/ -->

A number of problems lead to the desire to approximate a non-negative
matrix `\(X\)` by a product
`\begin{equation*}
  X \approx W H
\end{equation*}`
where `\(W\)`, `\(H\)` are non-negative matrices of low rank, i.e. with
few columns.

--

There are a number of algorithms available, most of the form
`\begin{equation*}
  \min_{W, H} [D(X, WH) + R(W, H)]
\end{equation*}`
where `\(D\)` is a loss function and `\(R\)` is a possible penalty for
encouraging desirable characteristics of `\(W\)` and `\(H\)`, such as
smoothness or sparseness.

--

The R package `NMF` provides one approach, and a vignette in the
package provides some background and references.
--- .pull-left.tiny-code[ As an example, using default settings in the NMF package the `volcano` image can be approximated with factorizations of rank `\(1, \dots, 5\)` by <!-- ## nolint start: semicolon --> ```r library(NMF) nmf1 <- nmf(volcano, 1); V1 <- nmf1@fit@W %*% nmf1@fit@H nmf2 <- nmf(volcano, 2); V2 <- nmf2@fit@W %*% nmf2@fit@H nmf3 <- nmf(volcano, 3); V3 <- nmf3@fit@W %*% nmf3@fit@H nmf4 <- nmf(volcano, 4); V4 <- nmf4@fit@W %*% nmf4@fit@H nmf5 <- nmf(volcano, 5); V5 <- nmf5@fit@W %*% nmf5@fit@H ``` <!-- ## nolint end --> The relative error for the final image is ```r max(abs(volcano - V5)) / max(abs(volcano)) ## [1] 0.03096702 ``` ] .pull-right[ The images: <img src="linalg_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] --- Another application is _recommender systems_. <!-- https://matloff.wordpress.com/2016/03/05/quick-intro-to-nmf-the-method-and-the-r-package/ http://bigdata-doctor.com/recommender-systems-101-practical-example-in-r/ --> -- For example, `\(X\)` might be ratings of movies (columns) by viewers (rows). -- * The set of actual values would be very sparse as each viewer will typically rate only a small subset of all movies. -- * `\(W\)` would be a user preference matrix, `\(H\)` a corresponding movie feature matrix. -- * The product `\(WH\)` would provide predicted ratings for movies the users have not yet seen. --- layout:false ## Other Factorizations Many other factorizations of matrices are available and being developed. -- Some examples are * Robust variants of the SVD -- * Sparse variants, e.g. Dan Yang, Zongming Ma, and Andreas Buja (2014), "A Sparse Singular Value Decomposition Method for High-Dimensional Data," Journal of Computational and Graphical Statistics 23(4), 923--942. -- * Constrained factorizations, e.g. C. Ding, T. Li, and M. I. Jordan (2010), "Convex and Semi-Nonnegative Matrix Factorizations," IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 45--55. --- layout: true ## Exploiting Special Structure --- Specialized algorithms can sometimes be used for matrices with special structure. -- ### Toeplitz Systems .pull-left[ Stationary time series have covariance matrices that look like `\begin{equation*} \begin{bmatrix} \sigma_0 & \sigma_1 & \sigma_2 & \sigma_3 & \dots\\ \sigma_1 & \sigma_0 & \sigma_1 & \sigma_2 & \dots\\ \sigma_2 & \sigma_1 & \sigma_0 & \sigma_1 & \dots\\ \sigma_3 & \sigma_2 & \sigma_1 & \sigma_0 & \ddots\\ \ddots & \ddots & \ddots & \ddots & \ddots & \end{bmatrix} \end{equation*}` {{content}} ] -- This is a _Toeplitz_ matrix. -- .pull-right[ This matrix is also symmetric --- this is not required for a Toeplitz matrix. {{content}} ] -- Special algorithms requiring `\(O(n^2)\)` operations are available for Toeplitz systems. {{content}} -- General Cholesky factorization requires `\(O(n^3)\)` operations. {{content}} -- The R function `toeplitz` creates Toeplitz matrices. --- ### Circulant Systems .pull-left[ Some problems give rise to matrices that look like `\begin{equation*} C_n = \begin{bmatrix} a_1 & a_2 & a_3 & \dots & a_n\\ a_n & a_1 & a_2 & \dots & a_{n-1}\\ a_{n-1} & a_n & a_1 & \dots & a_{n-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ a_2 & a_3 & a_4 & \dots & a_1 \end{bmatrix} \end{equation*}` {{content}} ] -- This is a _circulant_ matrix, a subclass of Toeplitz matrices. 
--

.pull-right[

Circulant matrices satisfy
`\begin{equation*}
  C_n = F_n^* \, \text{diag}(\sqrt{n} F_n a) \, F_n
\end{equation*}`

{{content}}

]

--

where `\(F_n\)` is the _Fourier matrix_ with
`\begin{equation*}
  F_n(j,k) = \frac{1}{\sqrt{n}}e^{-(j-1)(k-1)2\pi\sqrt{-1}/n}
\end{equation*}`

{{content}}

--

and `\(F_n^*\)` is the _conjugate transpose_, _Hermitian transpose_, or
_adjoint matrix_ of `\(F_n\)`.

{{content}}

--

<!-- http://www.cs.utk.edu/~dongarra/etemplates/node384.html -->
The eigenvalues are the elements of `\(\sqrt{n} F_n a\)`.

---

Products `\(F_n\, x\)` and `\(F_n^* x\)` can be computed with the _fast
Fourier transform (FFT)_.

--

In R
`\begin{align*}
  \sqrt{n} F_n x &= \texttt{fft(x)}\\
  \sqrt{n} F_n^* x &= \texttt{fft(x, inverse = TRUE)}\\
\end{align*}`

--

These computations are generally `\(O(n \log n)\)` in complexity.

--

Circulant systems can be used to approximate other systems.

--

Multi-dimensional analogs exist as well.

--

A simple example is available [on line](../examples/circulant.Rmd).

---

### Sparse Systems

Many problems lead to large systems in which only a small fraction of
coefficients are non-zero.

--

Some methods are available for general sparse systems.

--

Specialized methods are available for structured sparse systems such as

* tri-diagonal systems;

--

* block diagonal systems;

--

* banded systems.

--

Careful choice of row and column permutations can often turn general
sparse systems into banded ones.

---

### Sparse and Structured Systems in R

Sparse matrix support in R is improving.

--

Some packages, like `nlme`, provide utilities they need.

--

One basic package available on [CRAN](https://cran.r-project.org) is
`SparseM`.

--

A more extensive package is `Matrix`.

--

`Matrix` is the engine for mixed effects/multi-level model fitting in
`lme4`.

---
layout: false

## Update Formulas

Update formulas are available for most decompositions that allow for
efficient adding or dropping of rows or columns.

--

These can be useful for example in cross-validation and variable
selection computations.

--

They can also be useful for fitting linear models to very large data
sets; the package `biglm` uses this approach.

--

I am not aware of any convenient implementations in R at this point but
they may exist.

---
layout: true

## Iterative Methods

---

Iterative methods can be useful in large, sparse problems.

--

Iterative methods for sparse problems can also often be parallelized
effectively.

--

Iterative methods are also useful when

* `\(A x\)` can be computed efficiently for any given `\(x\)`, but

--

* it is expensive or impossible to compute `\(A\)` explicitly.

--

Some simple examples are available [on line](../examples/itersolve.Rmd).

---

### Gauss-Seidel Iteration

Choose an initial solution `\(x^{(0)}\)` to
`\begin{equation*}
  Ax = b
\end{equation*}`

--

and then update from `\(x^{(k)}\)` to `\(x^{(k+1)}\)` by
`\begin{equation*}
  x^{(k+1)}_i = \frac{1}{a_{ii}}
  \left(b_i - \sum_{j=1}^{i-1} a_{ij} x_{j}^{(k+1)}
  - \sum_{j=i+1}^n a_{ij} x_{j}^{(k)}\right)
\end{equation*}`
for `\(i = 1, \dots, n\)`.

--

This is similar in spirit to Gibbs sampling.
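---

A direct transcription of this update (a sketch; an implementation
based on the matrix splitting form follows later):

```r
## one Gauss-Seidel sweep, updating x in place
gs_sweep <- function(A, b, x) {
    for (i in seq_along(b))
        x[i] <- (b[i] - sum(A[i, -i] * x[-i])) / A[i, i]
    x
}
```

--

Repeating sweeps until the change in `x` is small gives the iteration;
each sweep is analogous to one scan of a Gibbs sampler.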
---

.pull-left[

This can be written in matrix form as
`\begin{equation*}
  x^{(k+1)} = (D+L)^{-1}(-Ux^{(k)}+b)
\end{equation*}`
with

]

.pull-right[
`\begin{align*}
  L &= \begin{bmatrix}
    0 & 0 & \dots & \dots & 0\\
    a_{21} & 0 & \dots & & \vdots\\
    a_{31} & a_{32} & \ddots & & 0\\
    \vdots & & & 0 & 0 \\
    a_{n1} & a_{n2} & \dots & a_{n,n-1} & 0
  \end{bmatrix}\\
  D &= \text{diag}(a_{11},\dots, a_{nn})\\
  U &= \begin{bmatrix}
    0 & a_{12} & \dots & \dots & a_{1n}\\
    0 & 0 & \dots & & \vdots\\
    0 & 0 & \ddots & & a_{n-2,n}\\
    \vdots & & & & a_{n-1,n}\\
    0 & 0 & \dots & 0 & 0
  \end{bmatrix}
\end{align*}`
]

---

### Splitting Methods

The Gauss-Seidel method is a member of a class of _splitting methods_
where
`\begin{equation*}
  M x^{(k+1)} = N x^{(k)} + b
\end{equation*}`
with `\(A = M - N\)`.

--

For the Gauss-Seidel method
`\begin{align*}
  M &= D + L\\
  N &= -U.
\end{align*}`

--

Other members include Jacobi iterations with
`\begin{align*}
  M_J &= D\\
  N_J &= -(L+U)
\end{align*}`

--

Splitting methods are practical if solving linear systems with matrix
`\(M\)` is easy.

---

### Convergence

A splitting method for a non-singular matrix `\(A\)` will converge to
the unique solution of `\(Ax=b\)` if
`\begin{equation*}
  \rho(M^{-1} N) < 1
\end{equation*}`
where `\(\rho(G)\)` is the _spectral radius_ of `\(G\)`,

--

i.e. the maximal absolute value of the eigenvalues of `\(G\)`.

--

This is true, for example, for the Gauss-Seidel method if `\(A\)` is
strictly positive definite.

--

Convergence can be very slow if `\(\rho(M^{-1} N)\)` is close to one.

---

.pull-left.tiny-code[

### A Simple Example

A simple implementation of Gauss-Seidel iteration using the splitting
representation:

<!-- ## nolint start: semicolon -->


```r
gssolve <- function(A, b, itermax = length(b),
                    tol = sqrt(.Machine$double.eps)) {
    n <- length(b)
    M <- A; M[rep(1:n, n) < rep(1:n, each = n)] <- 0
    N <- -A; N[rep(1:n, n) >= rep(1:n, each = n)] <- 0
    x <- numeric(n)
    for (i in seq_len(itermax)) {
        x.old <- x
        x <- forwardsolve(M, N %*% x + b)
        if (max(abs(x - x.old) / (tol + abs(x.old))) < tol &&
            max(abs(A %*% x - b)) < tol) {
            message(i, " iterations")
            return(x)
        }
    }
    message("iteration limit reached")
    x
}
```

<!-- ## nolint end -->
]

--

.pull-right.tiny-code[

With a well-conditioned matrix Gauss-Seidel iteration can be faster
than a dense PLU decomposition.


```r
n <- 2000
A <- crossprod(matrix(rnorm(n * n), n)) / n + 5 * diag(n)
b <- rnorm(n)
system.time(solve(A, b))
##    user  system elapsed
##   1.475   0.012   1.488
system.time(gssolve(A, b))
## 13 iterations
##    user  system elapsed
##   0.490   0.008   0.498
```

Whether the Gauss-Seidel method is faster on this example may depend on
the BLAS/LAPACK in use:


```r
sessionInfo()$BLAS
## [1] "/usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0"
sessionInfo()$LAPACK
## [1] "/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0"
```
]

---

### Successive Over-Relaxation

A variation is to define
`\begin{equation*}
  x^{(k+1)}_i = \frac{\omega}{a_{ii}}
  \left(b_i - \sum_{j=1}^{i-1} a_{ij} x_{j}^{(k+1)}
  - \sum_{j=i+1}^n a_{ij} x_{j}^{(k)}\right) + (1-\omega) x^{(k)}_i
\end{equation*}`
or, in matrix form,
`\begin{equation*}
  M_\omega x^{(k+1)} = N_\omega x^{(k)} + \omega b
\end{equation*}`
with
`\begin{align*}
  M_\omega &= D + \omega L\\
  N_\omega &= (1-\omega)D - \omega U
\end{align*}`
for some `\(\omega\)`, usually with `\(1 < \omega < 2\)`.

---

This is called _successive over-relaxation (SOR)_, from its first
application in a structural engineering problem.
-- For some choices of `\(\omega\)` we can have `\begin{equation*} \rho(M_\omega^{-1}N_\omega) \ll \rho(M^{-1}N) \end{equation*}` and thus faster convergence. -- For some special but important problems the value of `\(\omega\)` that minimizes `\(\rho(M_\omega^{-1}N_\omega)\)` is known or can be computed. --- ### Conjugate Gradient Method .pull-left[ If `\(A\)` is symmetric and strictly positive definite then the unique solution to `\(Ax=b\)` is the unique minimizer of the quadratic function `\begin{equation*} f(x) = \frac{1}{2}x^T A x - x^T b \end{equation*}` {{content}} ] -- Any nonlinear or quadratic optimization method can be used to find the minimum. {{content}} -- The most common one used in this context is the conjugate gradient method. -- .pull-right[ Choose an initial `\(x_0\)`, set `\(d_0 = -g_0 = b - A x_0\)`. {{content}} ] -- Then, while `\(g_k \neq 0\)`, for `\(k = 0, 1, \dots\)` compute `\begin{align*} \alpha_k &= -\frac{g_k^T d_k}{d_k^T A d_k}\\ x_{k+1} &= x_k + \alpha_k d_k\\ g_{k+1} &= A x_{k+1} - b\\ \beta_{k+1} &= \frac{g_{k+1}^T A d_k}{d_k^T A d_k}\\ d_{k+1} &= -g_{k+1} + \beta_{k+1} d_k \end{align*}` --- Some properties: -- * An alternate form of `\(g_{k+1}\)` is $$ g_{k+1} = g_k + \alpha_k A d_k $$ -- This means only one matrix-vector multiplication is needed per iteration. -- * The vector `\(g_k\)` is the gradient of `\(f\)` at `\(x_k\)`. -- * The initial direction `\(d_0 = -g_0\)` is the _direction of steepest descent_ from `\(x_0\)` -- * The directions `\(d_0, d_1, \dots\)` are `\(A\)`-_conjugate_, i.e. `\(d_i^T A d_j = 0\)` for `\(i \neq j\)`. -- * The directions `\(d_0, d_1, \dots, d_{n-1}\)` are linearly independent. --- ### Convergence With exact arithmetic, `\begin{equation*} A x_n = b \end{equation*}` -- That is, the conjugate gradient algorithm terminates with the exact solution in `\(n\)` steps. -- Numerically this does not happen. -- Numerically, the directions will not be exactly `\(A\)`-conjugate. -- A convergence tolerance is used for termination; this can be based on the relative change in the solution `\begin{equation*} \frac{\Norm{x_{k+1}-x_{k}}}{\Norm{x_k}} \end{equation*}` -- or the residual or gradient `\begin{equation*} g_k = A x_k - b \end{equation*}` -- or some combination; an iteration count limit is also a good idea. --- If the algorithm does not terminate within `\(n\)` steps it is a good idea to restart it with a steepest descent step from the current `\(x_k\)`. -- In many sparse and structured problems the algorithm will terminate in far fewer than `\(n\)` steps for reasonable tolerances. -- Convergence is faster if the condition number of `\(A\)` is closer to one. -- The error can be bounded as `\begin{equation*} \Norm{x - x_k}_A \leq 2 \Norm{x - x_0}_A \left(\frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1}\right)^k \end{equation*}` with `\(\Norm{x}_A = \sqrt{x^T A x}\)`. --- _Preconditioning_ strategies can improve convergence. -- These transform the original problem to one with `\(\tilde{A} = C^{-1} A C^{-1}\)` for some symmetric strictly positive definite `\(C\)`, and then use the conjugate gradient method for `\(\tilde{A}\)` -- Simple choices of `\(C\)` are most useful; sometimes a diagonal matrix will do. -- Good preconditioners can sometimes be designed for specific problems. 
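---

A small sketch of the idea behind diagonal (Jacobi) preconditioning:
rescaling a badly scaled matrix to unit diagonal can sharply reduce
`\(\kappa_2\)`, which drives the error bound above. (The matrix here is
made up for illustration.)

```r
set.seed(1)
D <- diag(10 ^ (0:7))                                ## badly scaled
A <- D %*% (crossprod(matrix(rnorm(64), 8)) + diag(8)) %*% D
Cinv <- diag(1 / sqrt(diag(A)))                      ## C = diag(sqrt(a_ii))
kappa(A, exact = TRUE)                               ## very large
kappa(Cinv %*% A %*% Cinv, exact = TRUE)             ## much smaller
```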
--- .pull-left.tiny-code[ ### A Simple Implementation ```r cg <- function(A, b, x, done) { dot <- function(x, y) crossprod(x, y)[1] n <- 0 g <- A(x) - b d <- -g repeat { h <- A(d) u <- dot(d, h) a <- -dot(g, d) / u n <- n + 1 x.old <- x x <- x + a * d g <- g + a * h b <- dot(h, g) / u d <- -g + b * d if (done(g, x, x.old, n)) return(list(x = as.vector(x), g = as.vector(g), n = n)) } } ``` ] .pull-right.tiny-code[ * The linear transformation and the termination condition are specified as functions. * The termination condition can use a combination of the gradient, current solution, previous solution, or iteration count. A simple example: ```r X <- crossprod(matrix(rnorm(25), 5)) y <- rnorm(5) cg(A = function(x) X %*% x, b = y, x = rep(0, 5), done = function(g, x, x.old, n) n >= 5) ## $x ## [1] -2.004109 -8.240581 4.410103 -6.264520 -2.547753 ## ## $g ## [1] -2.037259e-14 -1.693090e-15 -1.690315e-14 -5.551115e-16 ## [5] -1.154632e-14 ## ## $n ## [1] 5 solve(X, y) ## [1] -2.004109 -8.240581 4.410103 -6.264520 -2.547753 ``` ] --- .pull-left.tiny-code[ A simple conjugate gradient based `cgsolve` function might be defined as ```r cgsolve <- function(X, y, tol = sqrt(.Machine$double.eps)) { A <- function(x) X %*% x done <- function(g, x, x.old, n) n >= length(x) || (max(abs(x - x.old) / (tol + abs(x.old))) < tol && max(abs(g)) < tol) x <- rep(0, length(y)) v <- cg(A, y, x, done = done) if (v$n >= length(y)) message("iteration limit reached") else message(v$n, " iterations") v$x } ``` ] -- .pull-right.tiny-code[ With a positive definite matrix with a large diagonal component `cgsolve` can be very effective: ```r N <- 2000 X <- crossprod(matrix(rnorm(N * N), N)) XX <- X / N + 5 * diag(N) y <- rnorm(N) system.time(v0 <- solve(XX, y)) ## user system elapsed ## 1.479 0.012 1.490 system.time(v <- cgsolve(XX, y)) ## 15 iterations ## user system elapsed ## 0.080 0.000 0.081 max(abs((v - v0) / v0)) ## [1] 3.637447e-09 ``` ] --- layout: true ## Linear Algebra Software --- ### Some Standard Packages Open source packages developed at national laboratories: -- * LINPACK for linear equations and least squares -- * EISPACK for eigenvalue problems -- * LAPACK newer package for linear equations and eigenvalues -- Designed for high performance. Available from Netlib at <https://netlib.org/> -- Commercial packages: * IMSL used more in US * NAG used more in UK * ... --- ### BLAS: Basic Linear Algebra Subroutines Modern BLAS has three levels: -- * **Level 1:** Vector and vector/vector operations such as * dot product `\(x^T y\)`; * scalar multiply and add (axpy): `\(\alpha x + y\)`; * Givens rotations; * ... -- * **Level 2:** Matrix/vector operations, such as `\(A x\)` -- * **Level 3:** Matrix/matrix operations, such as `\(A B\)` -- LINPACK uses only Level 1; LAPACK uses all three levels. -- BLAS defines the interface. -- Standard reference implementations are available from Netlib. -- Highly optimized versions are available from hardware vendors and research organizations. --- ### Cholesky Factorization in LAPACK The core of the [`DPOTRF`](https://netlib.org/lapack/explore-3.1.1-html/dpotf2.f.html) subroutine: .pull-left.width-70.tiny-code[ <!-- ## nolint start: brace --> <pre class = 'f'><code>* * Compute the Cholesky factorization A = L*L'. * DO 20 J = 1, N * * Compute L(J,J) and test for non-positive-definiteness. 
*
         AJJ = A( J, J ) - <span style='color: red;'>DDOT</span>( J-1, A( J, 1 ), LDA, A( J, 1 ),
     $         LDA )
         IF( AJJ.LE.ZERO ) THEN
            A( J, J ) = AJJ
            GO TO 30
         END IF
         AJJ = SQRT( AJJ )
         A( J, J ) = AJJ
*
*        Compute elements J+1:N of column J.
*
         IF( J.LT.N ) THEN
            CALL <span style='color: blue;'>DGEMV</span>( 'No transpose', N-J, J-1, -ONE, A( J+1, 1 ),
     $                  LDA, A( J, 1 ), LDA, ONE, A( J+1, J ), 1 )
            CALL <span style='color: red;'>DSCAL</span>( N-J, ONE / AJJ, A( J+1, J ), 1 )
         END IF
   20 CONTINUE
</code></pre>

<!-- ## nolint end -->
]

--

.pull-right.width-30[

* <span style='color: red;'>DDOT</span> and
  <span style='color: red;'>DSCAL</span> are Level 1 BLAS routines

* <span style='color: blue;'>DGEMV</span> is a Level 2 BLAS routine

]

---

### ATLAS: Automatically Tuned Linear Algebra Software

Available at <https://math-atlas.sourceforge.net/>.

* Analyzes machine for properties such as cache characteristics.

* Runs extensive tests to determine performance trade-offs.

* Creates Fortran and C versions of BLAS and some LAPACK routines
  tailored to the particular machine.

* Provides some routines that take advantage of multiple processors
  using _worker threads_.

---

### OpenBLAS

Another high-performance BLAS library was developed and maintained by
Kazushige Goto.

--

This is now being developed and maintained as the OpenBLAS project,
available from <https://github.com/xianyi/OpenBLAS>.

--

Also provides versions that take advantage of multiple processors.

--

### Vendor Libraries

Intel provides the Math Kernel Library (MKL)

--

AMD has a similar library.

--

Apple provides accelerated BLAS libraries for its platforms.

---

### Using a High-Performance BLAS with R

R comes with a basic default BLAS.

--

R can be built to use a specified BLAS.

--

Once built one can change the BLAS R uses by replacing the shared
library R uses.

--

Some simple computations using the default and MKL vendor BLAS for the
data

```r
N <- 1000
X <- matrix(rnorm(N^2), N)
XX <- crossprod(X)
```

---

Results:

| Timing Expression                                | Reference | MKL/SEQ | MKL/THR |
| :----------------------------------------------- | --------: | ------: | ------: |
| `system.time(for (i in 1:5) crossprod(X))`        |     2.107 |   0.405 |   0.145 |
| `system.time(for (i in 1:5) X %*% X)`             |     3.401 |   0.742 |   0.237 |
| `system.time(svd(X))`                             |     3.273 |   0.990 |   0.542 |
| `system.time(for (i in 1:5) qr(X))`               |     2.290 |   1.094 |   1.107 |
| `system.time(for (i in 1:5) qr(X, LAPACK=TRUE))`  |     2.629 |   0.834 |   0.689 |
| `system.time(for (i in 1:20) chol(XX))`           |     2.824 |   0.556 |   0.186 |

--

These results were obtained using the non-threaded and threaded Intel
Math Kernel Library (MKL) a few years ago.

--

Versions of the current R using MKL for BLAS may be available as

```
/group/statsoft/R-patched/build-MKL-seq/bin/R
/group/statsoft/R-patched/build-MKL-thr/bin/R
```

--

Currently the standard version of R on our Linux systems uses OpenBLAS
via
[`flexiblas`](https://www.mpi-magdeburg.mpg.de/projects/flexiblas).
The R [`flexiblas` package](https://cran.r-project.org/package=flexiblas)
can be used to adjust the BLAS used in the current R session.
--- layout: false ### Using `flexiblas` on the CLAS Linux Systems .pull-left.tiny-code[ Using default `flexiblas` settings: ```r N <- 1000 X <- matrix(rnorm(N^2), N) XX <- crossprod(X) > system.time(chol(XX)) user system elapsed 0.146 0.022 0.020 library(flexiblas) > flexiblas_current_backend() [1] "OPENBLAS-OPENMP" ``` {{content}} ] -- Switching to the reference BLAS: ```r > flexiblas_list() [1] "NETLIB" "__FALLBACK__" "OPENBLAS-OPENMP" > flexiblas_list_loaded() [1] "OPENBLAS-OPENMP" > flexiblas_load_backend("NETLIB") [1] 2 > flexiblas_switch(2) > flexiblas_current_backend() [1] "NETLIB" > system.time(chol(XX)) user system elapsed 0.125 0.000 0.126 ``` -- .pull-right.tiny-code[ Back to OpenMP ```r > flexiblas_switch(1) > flexiblas_current_backend() [1] "OPENBLAS-OPENMP" > flexiblas_get_num_threads() [1] 16 > system.time(chol(XX)) user system elapsed 0.136 0.023 0.016 ``` {{content}} ] -- Adjusting the number of threads: ```r > flexiblas_set_num_threads(4) > system.time(chol(XX)) user system elapsed 0.015 0.002 0.005 > flexiblas_set_num_threads(8) > system.time(chol(XX)) user system elapsed 0.023 0.009 0.006 ``` --- layout: false ## Final Notes Most reasonable approaches will be accurate for reasonable problems. -- Choosing good scaling and centering can make problems more reasonable (both numerically and statistically) -- Most methods are efficient enough for our purposes. -- In some problems worrying about efficiency is important if reasonable problem sizes are to be handled. -- Making sure you are using the right approach to the right problem is much more important than efficiency. -- Some quotes: -- D. E. Knuth, restating a comment by C. A. R. Hoare: > We should forget about small efficiencies, say about 97% of the > time: premature optimization is the root of all evil. -- W. A. Wulf: > More computing sins are committed in the name of efficiency (without > necessarily achieving it) than for any other single reason --- > including blind stupidity.