---
title: "QQ Plots and PP Plots"
output:
  html_document:
    toc: yes
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

```{r, include=FALSE}
library(UsingR)
```

## QQ Plots

### QQ Plot Basics

One way to assess how well a particular theoretical model describes a data distribution is to plot data quantiles against theoretical quantiles.

Base graphics provides qqnorm, lattice has qqmath, and ggplot2 has geom_qq. The default theoretical distribution used in these is a standard normal but, except for qqnorm, they allow you to specify an alternative.

For a large sample from the theoretical distribution the plot should be a straight line through the origin with slope 1:

```{r}
n <- 10000
ggplot() + geom_qq(aes(sample = rnorm(n)))
```

If the plot is a straight line with a different slope or intercept, then the data distribution corresponds to a location-scale transformation of the theoretical distribution. The slope is the scale and the intercept is the location:

```{r}
ggplot() + geom_qq(aes(sample = rnorm(n, 10, 4))) +
    geom_abline(intercept = 10, slope = 4,
                color = "red", size = 1.5, alpha = 0.8)
```

The QQ plot can be constructed directly as a scatterplot of the sorted sample $x_{(i)}$ for $i = 1, \dots, n$ against theoretical quantiles for

$$p_i = \frac{i}{n} - \frac{1}{2n}$$

```{r}
p <- (1 : n) / n - 0.5 / n
y <- rnorm(n, 10, 4)
ggplot() + geom_point(aes(x = qnorm(p), y = sort(y)))
```

### Some Examples

The histograms and density estimates for the duration variable in the geyser data set showed that the distribution is far from a normal distribution, and the normal QQ plot shows this as well:

```{r}
ggplot(geyser) + geom_qq(aes(sample = duration))
```

Except for rounding, the parent heights in the Galton data seemed not too far from normally distributed:

```{r}
ggplot(Galton) + geom_qq(aes(sample = parent))
```

* Rounding interferes more with this visualization than with a histogram or a density plot.
* Rounding is more visible with this visualization than with a histogram or a density plot.
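To see why rounding stands out so clearly in a QQ plot, we can mimic the Galton rounding with simulated data; the sample size and height parameters below are illustrative, not estimated from the data:

```{r}
library(ggplot2)
set.seed(42)
z <- rnorm(1000, 68, 2.5)  # hypothetical heights, in inches
zr <- round(z)             # round to whole inches, as in the Galton data
## Rounding collapses the sample onto a coarse grid, so tied values
## share sample quantiles and produce horizontal stair-step runs:
ggplot() + geom_qq(aes(sample = zr))
```

A histogram of zr with unit-width bins would look almost identical to one of z, which is why the QQ plot is the more sensitive display here.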
Another Galton data set available in the UsingR package, with less rounding, is father.son:

```{r}
ggplot(father.son) + geom_qq(aes(sample = fheight))
```

The middle seems to be fairly straight, but the ends are somewhat wiggly. How can you calibrate your judgment?

### Calibrating the Variability

One approach is to use simulation, sometimes called a _graphical bootstrap_.

The nboot function will simulate R samples from a normal distribution that match a variable x on sample size, sample mean, and sample SD. The result is returned in a data frame suitable for plotting:

```{r}
nsim <- function(n, m = 0, s = 1) {
    z <- rnorm(n)
    m + s * ((z - mean(z)) / sd(z))
}
nboot <- function(x, R) {
    n <- length(x)
    m <- mean(x)
    s <- sd(x)
    do.call(rbind,
            lapply(1 : R, function(i) {
                xx <- sort(nsim(n, m, s))
                p <- seq_along(x) / n - 0.5 / n
                data.frame(x = xx, p = p, sim = i)
            }))
}
```

Plotting these as lines shows the variability in shapes we can expect when sampling from the theoretical normal distribution:

```{r}
gb <- nboot(father.son$fheight, 50)
ggplot() + geom_line(aes(x = qnorm(p), y = x, group = sim),
                     color = "gray", data = gb)
```

We can then insert this simulation behind our data to help calibrate the visualization:

```{r}
ggplot(father.son) +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb) +
    geom_qq(aes(sample = fheight))
```

### Scalability

For large sample sizes overplotting will occur:

```{r}
ggplot(diamonds) + geom_qq(aes(sample = price))
```

This can be alleviated by using a grid of quantiles:

```{r}
nq <- 100
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = qnorm(p), y = quantile(diamonds$price, p)))
```

A more reasonable model might be an exponential distribution:

```{r}
ggplot() + geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p)))
```

### Comparing Two Distributions

The QQ plot can also be used to compare two distributions based on a sample from each. If the samples are the same size then this is just a plot of the ordered sample values against each other.
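For the equal-size case, a minimal sketch with simulated samples (the sample sizes and distribution parameters here are illustrative):

```{r}
library(ggplot2)
set.seed(1)
x <- rnorm(200)          # first sample
y <- rnorm(200, 10, 4)   # second sample: same shape, shifted and scaled
## With equal sample sizes the two-sample QQ plot is just sort(y)
## against sort(x); a location-scale relation between the two
## distributions shows up as an approximately straight line.
qq <- data.frame(qx = sort(x), qy = sort(y))
ggplot(qq) + geom_point(aes(x = qx, y = qy))
```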
Choosing a fixed set of quantiles allows samples of unequal size to be compared. Using a small set of quantiles we can compare the distributions of waiting times between eruptions of Old Faithful from the two different data sets we have looked at:

```{r}
nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
wg <- geyser$waiting
wf <- faithful$waiting
ggplot() + geom_point(aes(x = quantile(wg, p), y = quantile(wf, p)))
```

## PP Plots

The PP plot for comparing a sample to a theoretical model plots the theoretical proportion less than or equal to each observed value against the actual proportion. For a theoretical cumulative distribution function $F$ this means plotting

$$F(x_{(i)}) \sim p_i$$

For the fheight variable in the father.son data:

```{r}
m <- mean(father.son$fheight)
s <- sd(father.son$fheight)
n <- nrow(father.son)
p <- (1 : n) / n - 0.5 / n
ggplot(father.son) + geom_point(aes(x = p, y = sort(pnorm(fheight, m, s))))
```

* The values on the vertical axis are the _probability integral transform_ of the data for the theoretical distribution.
* If the data are a sample from the theoretical distribution then these transforms would be uniformly distributed on $[0, 1]$.
* The PP plot is a QQ plot of these transformed values against a uniform distribution.
* The PP plot goes through the points $(0, 0)$ and $(1, 1)$ and so is much less variable in the tails:

```{r}
pp <- ggplot() + geom_line(aes(x = p, y = pnorm(x, m, s), group = sim),
                           color = "gray", data = gb)
pp
```

Adding the data:

```{r}
pp + geom_point(aes(x = p, y = sort(pnorm(fheight, m, s))), data = father.son)
```

The PP plot is also less sensitive to deviations in the tails.
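The probability integral transform underlying the PP plot is easy to check directly: applying the fitted normal CDF to a sample that really is normal gives values that are roughly uniform on $[0, 1]$. A sketch with simulated data (sample size and parameters are illustrative):

```{r}
library(ggplot2)
set.seed(7)
x <- rnorm(500, 10, 4)
u <- pnorm(x, mean(x), sd(x))  # probability integral transform
## If the model fits, u is approximately uniform, so a QQ plot of u
## against uniform quantiles should be close to the identity line:
ggplot() + geom_qq(aes(sample = u), distribution = qunif)
```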
A compromise between the QQ and PP plots uses the _arcsine square root_ variance-stabilizing transformation, which makes the variability approximately constant across the range of the plot:

```{r}
vpp <- ggplot() +
    geom_line(aes(x = asin(sqrt(p)), y = asin(sqrt(pnorm(x, m, s))),
                  group = sim),
              color = "gray", data = gb)
vpp
```

Adding the data:

```{r}
vpp + geom_point(aes(x = asin(sqrt(p)),
                     y = sort(asin(sqrt(pnorm(fheight, m, s))))),
                 data = father.son)
```

## Plots For Assessing Model Fit

* Both QQ and PP plots can be used to assess how well a theoretical family of models fits your data, or your residuals.
* To use a PP plot you have to estimate the parameters first.
* For a location-scale family, like the normal distribution family, you can use a QQ plot with a standard member of the family.
* Some other families can use other transformations that lead to straight lines for family members:
    * The Weibull family is widely used in reliability modeling; its CDF is
      $$F(t) = 1 - \exp\left\{-\left(\frac{t}{b}\right)^a\right\}$$
    * The logarithms of Weibull random variables form a location-scale family.
    * Special paper used to be available for [Weibull probability plots](http://weibull.com/hotwire/issue8/relbasics8.htm).

A Weibull QQ plot for price in the diamonds data:

```{r}
n <- nrow(diamonds)
p <- (1 : n) / n - 0.5 / n
ggplot(diamonds) +
    geom_point(aes(x = log10(qweibull(p, 1, 1)), y = log10(sort(price))))
```

* The lower tail does not match a Weibull distribution.
* Is this important?
    * In engineering applications it often is.
    * In selecting a reasonable model to capture the shape of this distribution it may not be.
* QQ plots are helpful for understanding departures from a theoretical model.
* No data will fit a theoretical model perfectly.
* Case-specific judgment is needed to decide whether departures are important.
* George Box: All models are wrong but some are useful.
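The claim above that logarithms of Weibull random variables form a location-scale family can be checked numerically: if $T$ is Weibull with shape $a$ and scale $b$, then $\log T = \log b + \frac{1}{a} \log E$ where $E$ is standard exponential, so log Weibull quantiles are an affine function of log exponential quantiles. A quick check with arbitrary parameter values:

```{r}
a <- 2   # arbitrary shape
b <- 3   # arbitrary scale
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
## qweibull(p, a, b) = b * (-log(1 - p))^(1/a) and qexp(p) = -log(1 - p),
## so log(qweibull(p, a, b)) = log(b) + log(qexp(p)) / a:
lq <- log(qweibull(p, shape = a, scale = b))
all.equal(lq, log(b) + log(qexp(p)) / a)
```

This is why plotting log quantiles, as in the diamonds example above, turns any Weibull member into a straight line.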
## Some References

Adam Loy, Lendie Follett, and Heike Hofmann (2016), "Variations of Q–Q plots: The power of our eyes!", _The American Statistician_.

John R. Michael (1983), "The stabilized probability plot," _Biometrika_ [JSTOR](https://www.jstor.org/stable/2335939?seq=1#page_scan_tab_contents).

M. B. Wilk and R. Gnanadesikan (1968), "Probability plotting methods for the analysis of data," _Biometrika_ [JSTOR](https://www.jstor.org/stable/2334448?seq=1#page_scan_tab_contents).

Box, G. E. P. (1979), "Robustness in the strategy of scientific model building", in Launer, R. L.; Wilkinson, G. N., _Robustness in Statistics_, Academic Press, pp. 201–236.

Thomas Lumley (2019), ["What have I got against the Shapiro-Wilk test?"](https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/)