---
title: "QQ Plots and PP Plots"
output:
  html_document:
    toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
```{r, include=FALSE}
library(UsingR)
```
## QQ Plots
### QQ Plot Basics
One way to assess how well a particular theoretical model describes a
data distribution is to plot data quantiles against theoretical quantiles.
Base graphics provides `qqnorm`, `lattice` has `qqmath`, and
`ggplot2` has `geom_qq`.
The default theoretical distribution used in these is a standard
normal, but, except for `qqnorm`, these allow you to specify an
alternative.
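For example, `geom_qq` takes a `distribution` argument (a quantile function) and a `dparams` list of parameters for it; here is a quick sketch comparing an exponential sample to exponential rather than normal quantiles:

```{r}
## use exponential quantiles instead of the default normal ones
p_exp <- ggplot() + geom_qq(aes(sample = rexp(1000)), distribution = qexp)
p_exp
```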
For a large sample from the theoretical distribution the plot should be
close to a straight line through the origin with slope 1:
```{r}
n <- 10000
ggplot() + geom_qq(aes(sample = rnorm(n)))
```
If the plot is a straight line with a different slope or intercept,
then the data distribution corresponds to a location-scale
transformation of the theoretical distribution.
The slope is the scale and the intercept is the location:
```{r}
ggplot() +
    geom_qq(aes(sample = rnorm(n, 10, 4))) +
    geom_abline(intercept = 10, slope = 4,
                color = "red", size = 1.5, alpha = 0.8)
```
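The slope and intercept interpretation follows from the quantile function of a location-scale transformation: if $X = m + sZ$, then
$$
Q_X(p) = m + s\,Q_Z(p),
$$
so data quantiles plotted against the quantiles of $Z$ fall near a line with intercept $m$ and slope $s$.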
The QQ plot can be constructed directly as a scatterplot of the sorted
sample values $x_{(i)}$, $i = 1, \dots, n$, against the theoretical
quantiles for the probabilities
$$
p_i = \frac{i}{n} - \frac{1}{2n} = \frac{i - 1/2}{n}
$$
```{r}
p <- (1 : n) / n - 0.5 / n
y <- rnorm(n, 10, 4)
ggplot() + geom_point(aes(x = qnorm(p), y = sort(y)))
```
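These plotting positions are what base R's `ppoints` computes; for $n > 10$ it uses exactly this formula (for smaller $n$ it applies a slightly different continuity correction):

```{r}
all.equal(ppoints(n), (1 : n) / n - 0.5 / n)   # TRUE for n = 10000
```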
### Some Examples
The histograms and density estimates for the `duration` variable in
the `geyser` data set showed that the distribution is far from a
normal distribution, and the normal QQ plot shows this as well:
```{r}
ggplot(geyser) + geom_qq(aes(sample = duration))
```
Except for rounding, the `parent` heights in the `Galton` data seemed
not too far from normally distributed:
```{r}
ggplot(Galton) + geom_qq(aes(sample = parent))
```
* Rounding both interferes more with this visualization and is more
  visible in it than in a histogram or a density plot: tied values show
  up as horizontal runs of points.
Another Galton data set available in the `UsingR` package with less
rounding is `father.son`:
```{r}
ggplot(father.son) + geom_qq(aes(sample = fheight))
```
The middle seems to be fairly straight, but the ends are somewhat wiggly.
How can you calibrate your judgment?
### Calibrating the Variability
One approach is to use simulation, sometimes called a _graphical
bootstrap_.
The `nboot` function will simulate `R` samples from a normal
distribution that match a variable `x` on sample size, sample mean,
and sample SD.
The result is returned in a data frame suitable for plotting:
```{r}
## generate a normal sample of size n rescaled to have mean exactly m
## and standard deviation exactly s
nsim <- function(n, m = 0, s = 1) {
    z <- rnorm(n)
    m + s * ((z - mean(z)) / sd(z))
}
## simulate R such samples matched to x; return a stacked data frame of
## sorted values, plotting positions, and a simulation index
nboot <- function(x, R) {
    n <- length(x)
    m <- mean(x)
    s <- sd(x)
    do.call(rbind,
            lapply(1 : R,
                   function(i) {
                       xx <- sort(nsim(n, m, s))
                       p <- seq_along(x) / n - 0.5 / n
                       data.frame(x = xx, p = p, sim = i)
                   }))
}
```
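Note that the rescaling in `nsim` matches the requested mean and SD exactly, not just on average:

```{r}
x <- nsim(20, m = 10, s = 4)
c(mean(x), sd(x))   # exactly 10 and 4, up to floating point rounding
```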
Plotting these as lines shows the variability in shapes we can expect
when sampling from the theoretical normal distribution:
```{r}
gb <- nboot(father.son$fheight, 50)
ggplot() +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb)
```
We can then insert this simulation behind our data to help calibrate
the visualization:
```{r}
ggplot(father.son) +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb) +
    geom_qq(aes(sample = fheight))
```
### Scalability
For large sample sizes overplotting will occur:
```{r}
ggplot(diamonds) + geom_qq(aes(sample = price))
```
This can be alleviated by using a grid of quantiles:
```{r}
nq <- 100
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = qnorm(p), y = quantile(diamonds$price, p)))
```
A more reasonable model might be an exponential distribution:
```{r}
ggplot() + geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p)))
```
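If the exponential model held, the points would fall near a line through the origin with slope equal to the exponential mean $1/\lambda$; using the sample mean as a plug-in estimate of the slope gives a rough reference line:

```{r}
plt <- ggplot() +
    geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p))) +
    geom_abline(intercept = 0, slope = mean(diamonds$price),
                color = "red", alpha = 0.8)
plt
```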
### Comparing Two Distributions
The QQ plot can also be used to compare two distributions based on a
sample from each.
If the samples are the same size then this is just a plot of the
ordered sample values against each other.
Choosing a fixed set of quantiles allows samples of unequal size to be
compared.
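For equal sample sizes the sorted-values plot can be built directly; `x1` and `x2` here are simulated for illustration, not from the text:

```{r}
x1 <- rnorm(200)
x2 <- rnorm(200, mean = 2, sd = 0.5)
## with equal sizes the order statistics can be plotted directly
plt12 <- ggplot() + geom_point(aes(x = sort(x1), y = sort(x2)))
plt12
```

Since these two distributions differ only by location and scale, the points should fall near a line with intercept 2 and slope 0.5.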
Using a small set of quantiles we can compare the distributions of
waiting times between eruptions of Old Faithful from the two different
data sets we have looked at:
```{r}
nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
wg <- geyser$waiting
wf <- faithful$waiting
ggplot() + geom_point(aes(x = quantile(wg, p), y = quantile(wf, p)))
```
## PP Plots
The PP plot for comparing a sample to a theoretical model plots the
theoretical proportion less than or equal to each observed value
against the corresponding empirical proportion.
For a theoretical cumulative distribution function $F$ this means plotting
$$
F(x_{(i)}) \sim p_i
$$
For the `fheight` variable in the `father.son` data:
```{r}
m <- mean(father.son$fheight)
s <- sd(father.son$fheight)
n <- nrow(father.son)
p <- (1 : n) / n - 0.5 / n
ggplot(father.son) + geom_point(aes(x = p, y = sort(pnorm(fheight, m, s))))
```
* The values on the vertical axis are the _probability integral
transform_ of the data for the theoretical distribution.
* If the data are a sample from the theoretical distribution then
these transforms would be uniformly distributed on $[0, 1]$.
* The PP plot is a QQ plot of these transformed values against a
uniform distribution.
* The PP plot goes through the points $(0, 0)$ and $(1, 1)$ and so is
much less variable in the tails:
```{r}
pp <- ggplot() +
    geom_line(aes(x = p, y = pnorm(x, m, s), group = sim),
              color = "gray", data = gb)
pp
```
Adding the data:
```{r}
pp +
    geom_point(aes(x = p, y = sort(pnorm(fheight, m, s))), data = father.son)
```
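The probability integral transform claim above can be checked directly with simulated data: applying `pnorm` with the right parameters to a normal sample yields values close to uniform on $[0, 1]$:

```{r}
z <- rnorm(10000, 10, 4)
u <- pnorm(z, 10, 4)   # probability integral transform
c(min(u), max(u), mean(u))   # roughly 0, 1, and 0.5
```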
The PP plot is also less sensitive to deviations in the tails.
A compromise between the QQ and PP plots uses the _arcsine square root_
variance-stabilizing transformation, which makes the variability
approximately constant across the range of the plot:
```{r}
vpp <- ggplot() +
    geom_line(aes(x = asin(sqrt(p)), y = asin(sqrt(pnorm(x, m, s))), group = sim),
              color = "gray", data = gb)
vpp
```
Adding the data:
```{r}
vpp +
    geom_point(aes(x = asin(sqrt(p)), y = sort(asin(sqrt(pnorm(fheight, m, s))))),
               data = father.son)
```
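The variance-stabilizing property follows from the delta method: an empirical proportion $\hat p$ based on $n$ observations has variance approximately $p(1 - p)/n$, and for $g(p) = \arcsin\sqrt{p}$,
$$
g'(p) = \frac{1}{2\sqrt{p(1 - p)}},
\qquad
\operatorname{Var}(g(\hat p)) \approx g'(p)^2 \, \frac{p(1 - p)}{n} = \frac{1}{4n},
$$
which does not depend on $p$.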
## Plots For Assessing Model Fit
* Both QQ and PP plots can be used to assess how well a theoretical
  family of models fits your data, or your residuals.
* To use a PP plot you have to estimate the parameters first.
* For a location-scale family, like the normal distribution family,
you can use a QQ plot with a standard member of the family.
* Some other families can use other transformations that lead to
straight lines for family members:
* The Weibull family is widely used in reliability modeling; its
CDF is
$$ F(t) = 1 - \exp\left\{-\left(\frac{t}{b}\right)^a\right\}$$
* The logarithms of Weibull random variables form a location-scale
family.
* Special paper used to be available for [Weibull probability
plots](http://weibull.com/hotwire/issue8/relbasics8.htm).
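The location-scale claim for the logarithms follows directly from the CDF: if $T$ has the Weibull CDF above, then
$$
P(\log T \le x) = P(T \le e^x)
= 1 - \exp\left\{-\left(\frac{e^x}{b}\right)^a\right\}
= 1 - \exp\left\{-e^{a(x - \log b)}\right\},
$$
so $\log T$ has a fixed (Gumbel-type) distribution shifted by $\log b$ and scaled by $1/a$.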
A Weibull QQ plot for `price` in the `diamonds` data:
```{r}
n <- nrow(diamonds)
p <- (1 : n) / n - 0.5 / n
ggplot(diamonds) +
    geom_point(aes(x = log10(qweibull(p, 1, 1)), y = log10(sort(price))))
```
* The lower tail does not match a Weibull distribution.
* Is this important?
* In engineering applications it often is.
* In selecting a reasonable model to capture the shape of this
distribution it may not be.
* QQ plots are helpful for understanding departures from a theoretical
model.
* No data will fit a theoretical model perfectly.
* Case-specific judgment is needed to decide whether departures are
important.
* George Box: All models are wrong but some are useful.
## Some References
Adam Loy, Lendie Follett, and Heike Hofmann (2016),
"Variations of Q–Q plots: The power of our eyes!",
_The American Statistician_.
John R. Michael (1983), "The stabilized probability plot," _Biometrika_
[JSTOR](https://www.jstor.org/stable/2335939?seq=1#page_scan_tab_contents).
M. B. Wilk and R. Gnanadesikan (1968), "Probability plotting methods
for the analysis of data," _Biometrika_
[JSTOR](https://www.jstor.org/stable/2334448?seq=1#page_scan_tab_contents).
Box, G. E. P. (1979), "Robustness in the strategy of scientific model
building", in Launer, R. L.; Wilkinson, G. N., _Robustness in
Statistics_, Academic Press, pp. 201–236.
Thomas Lumley (2019), ["What have I got against the Shapiro-Wilk
test?"](https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/)