---
title: "QQ Plots and PP Plots"
output:
  html_document:
    toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
```{r, include=FALSE}
library(UsingR)
```
## QQ Plots
### QQ Plot Basics
One way to assess how well a particular theoretical model describes a
data distribution is to plot data quantiles against theoretical quantiles.
Base graphics provides `qqnorm`, `lattice` has `qqmath`, and
`ggplot2` has `geom_qq`.
The default theoretical distribution used in these is a standard
normal, but, except for `qqnorm`, these allow you to specify an
alternative.
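For example, `geom_qq` takes a `distribution` argument (a quantile function) and a `dparams` list of parameters for it; here is a quick sketch comparing an exponential sample to exponential rather than normal quantiles:

```{r}
## use exponential quantiles instead of the default normal ones
p_exp <- ggplot() + geom_qq(aes(sample = rexp(1000)), distribution = qexp)
p_exp
```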
For a large sample from the theoretical distribution the plot should be
close to a straight line through the origin with slope 1:
```{r}
n <- 10000
ggplot() + geom_qq(aes(sample = rnorm(n)))
```
If the plot is a straight line with a different slope or intercept,
then the data distribution corresponds to a location-scale
transformation of the theoretical distribution.
The slope is the scale and the intercept is the location:
```{r}
ggplot() +
    geom_qq(aes(sample = rnorm(n, 10, 4))) +
    geom_abline(intercept = 10, slope = 4,
                color = "red", size = 1.5, alpha = 0.8)
```
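The slope and intercept interpretation follows from the quantile function of a location-scale transformation: if $X = m + sZ$, then
$$
Q_X(p) = m + s\,Q_Z(p),
$$
so data quantiles plotted against the quantiles of $Z$ fall near a line with intercept $m$ and slope $s$.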
The QQ plot can be constructed directly as a scatterplot of the sorted
sample values $x_{(i)}$, $i = 1, \dots, n$, against the theoretical
quantiles for the probabilities
$$
p_i = \frac{i}{n} - \frac{1}{2n} = \frac{i - 1/2}{n}
$$
```{r}
p <- (1 : n) / n - 0.5 / n
y <- rnorm(n, 10, 4)
ggplot() + geom_point(aes(x = qnorm(p), y = sort(y)))
```
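These plotting positions are what base R's `ppoints` computes; for $n > 10$ it uses exactly this formula (for smaller $n$ it applies a slightly different continuity correction):

```{r}
all.equal(ppoints(n), (1 : n) / n - 0.5 / n)   # TRUE for n = 10000
```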
### Some Examples
The histograms and density estimates for the `duration` variable in
the `geyser` data set showed that the distribution is far from a
normal distribution, and the normal QQ plot shows this as well:
```{r}
ggplot(geyser) + geom_qq(aes(sample = duration))
```
Except for rounding, the `parent` heights in the `Galton` data seemed
not too far from normally distributed:
```{r}
ggplot(Galton) + geom_qq(aes(sample = parent))
```
* Rounding both interferes more with this visualization and is more
  visible in it than in a histogram or a density plot: tied values show
  up as horizontal runs of points.
Another Galton data set available in the `UsingR` package with less
rounding is `father.son`:
```{r}
ggplot(father.son) + geom_qq(aes(sample = fheight))
```
The middle seems to be fairly straight, but the ends are somewhat wiggly.
How can you calibrate your judgment?
### Calibrating the Variability
One approach is to use simulation, sometimes called a _graphical
bootstrap_.
The `nboot` function will simulate `R` samples from a normal
distribution that match a variable `x` on sample size, sample mean,
and sample SD.
The result is returned in a data frame suitable for plotting:
```{r}
## generate a normal sample of size n rescaled to have mean exactly m
## and standard deviation exactly s
nsim <- function(n, m = 0, s = 1) {
    z <- rnorm(n)
    m + s * ((z - mean(z)) / sd(z))
}
## simulate R such samples matched to x; return a stacked data frame of
## sorted values, plotting positions, and a simulation index
nboot <- function(x, R) {
    n <- length(x)
    m <- mean(x)
    s <- sd(x)
    do.call(rbind,
            lapply(1 : R,
                   function(i) {
                       xx <- sort(nsim(n, m, s))
                       p <- seq_along(x) / n - 0.5 / n
                       data.frame(x = xx, p = p, sim = i)
                   }))
}
```
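Note that the rescaling in `nsim` matches the requested mean and SD exactly, not just on average:

```{r}
x <- nsim(20, m = 10, s = 4)
c(mean(x), sd(x))   # exactly 10 and 4, up to floating point rounding
```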
Plotting these as lines shows the variability in shapes we can expect
when sampling from the theoretical normal distribution:
```{r}
gb <- nboot(father.son$fheight, 50)
ggplot() +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb)
```
We can then insert this simulation behind our data to help calibrate
the visualization:
```{r}
ggplot(father.son) +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb) +
    geom_qq(aes(sample = fheight))
```
### Scalability
For large sample sizes overplotting will occur:
```{r}
ggplot(diamonds) + geom_qq(aes(sample = price))
```
This can be alleviated by using a grid of quantiles:
```{r}
nq <- 100
p <- (1 : nq) / nq - 0.5 / nq
ggplot() + geom_point(aes(x = qnorm(p), y = quantile(diamonds$price, p)))
```
A more reasonable model might be an exponential distribution:
```{r}
ggplot() + geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p)))
```
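If the exponential model held, the points would fall near a line through the origin with slope equal to the exponential mean $1/\lambda$; using the sample mean as a plug-in estimate of the slope gives a rough reference line:

```{r}
plt <- ggplot() +
    geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p))) +
    geom_abline(intercept = 0, slope = mean(diamonds$price),
                color = "red", alpha = 0.8)
plt
```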
### Comparing Two Distributions
The QQ plot can also be used to compare two distributions based on a
sample from each.
If the samples are the same size then this is just a plot of the
ordered sample values against each other.
Choosing a fixed set of quantiles allows samples of unequal size to be
compared.
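For equal sample sizes the sorted-values plot can be built directly; `x1` and `x2` here are simulated for illustration, not from the text:

```{r}
x1 <- rnorm(200)
x2 <- rnorm(200, mean = 2, sd = 0.5)
## with equal sizes the order statistics can be plotted directly
plt12 <- ggplot() + geom_point(aes(x = sort(x1), y = sort(x2)))
plt12
```

Since these two distributions differ only by location and scale, the points should fall near a line with intercept 2 and slope 0.5.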
Using a small set of quantiles we can compare the distributions of
waiting times between eruptions of Old Faithful from the two different
data sets we have looked at:
```{r}
nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
wg <- geyser$waiting
wf <- faithful$waiting
ggplot() + geom_point(aes(x = quantile(wg, p), y = quantile(wf, p)))
```
## PP Plots
The PP plot for comparing a sample to a theoretical model plots the
theoretical proportion less than or equal to each observed value
against the corresponding empirical proportion.
For a theoretical cumulative distribution function $F$ this means plotting
$$
F(x_{(i)}) \sim p_i
$$
For the `fheight` variable in the `father.son` data:
```{r}
m <- mean(father.son$fheight)
s <- sd(father.son$fheight)
n <- nrow(father.son)
p <- (1 : n) / n - 0.5 / n
ggplot(father.son) + geom_point(aes(x = p, y = sort(pnorm(fheight, m, s))))
```
* The values on the vertical axis are the _probability integral
transform_ of the data for the theoretical distribution.
* If the data are a sample from the theoretical distribution then
these transforms would be uniformly distributed on $[0, 1]$.
* The PP plot is a QQ plot of these transformed values against a
uniform distribution.
* The PP plot goes through the points $(0, 0)$ and $(1, 1)$ and so is
much less variable in the tails:
```{r}
pp <- ggplot() +
    geom_line(aes(x = p, y = pnorm(x, m, s), group = sim),
              color = "gray", data = gb)
pp
```
Adding the data:
```{r}
pp +
    geom_point(aes(x = p, y = sort(pnorm(fheight, m, s))), data = father.son)
```
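The probability integral transform claim above can be checked directly with simulated data: applying `pnorm` with the right parameters to a normal sample yields values close to uniform on $[0, 1]$:

```{r}
z <- rnorm(10000, 10, 4)
u <- pnorm(z, 10, 4)   # probability integral transform
c(min(u), max(u), mean(u))   # roughly 0, 1, and 0.5
```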
The PP plot is also less sensitive to deviations in the tails.
A compromise between the QQ and PP plots uses the _arcsine square root_
variance-stabilizing transformation, which makes the variability
approximately constant across the range of the plot:
```{r}
vpp <- ggplot() +
    geom_line(aes(x = asin(sqrt(p)), y = asin(sqrt(pnorm(x, m, s))), group = sim),
              color = "gray", data = gb)
vpp
```
Adding the data:
```{r}
vpp +
    geom_point(aes(x = asin(sqrt(p)), y = sort(asin(sqrt(pnorm(fheight, m, s))))),
               data = father.son)
```
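The variance-stabilizing property follows from the delta method: an empirical proportion $\hat p$ based on $n$ observations has variance approximately $p(1 - p)/n$, and for $g(p) = \arcsin\sqrt{p}$,
$$
g'(p) = \frac{1}{2\sqrt{p(1 - p)}},
\qquad
\operatorname{Var}(g(\hat p)) \approx g'(p)^2 \, \frac{p(1 - p)}{n} = \frac{1}{4n},
$$
which does not depend on $p$.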
## Plots For Assessing Model Fit
* Both QQ and PP plots can be used to assess how well a theoretical
  family of models fits your data, or your residuals.
* To use a PP plot you have to estimate the parameters first.
* For a location-scale family, like the normal distribution family,
you can use a QQ plot with a standard member of the family.
* Some other families can use other transformations that lead to
straight lines for family members:
* The Weibull family is widely used in reliability modeling; its
CDF is
$$ F(t) = 1 - \exp\left\{-\left(\frac{t}{b}\right)^a\right\}$$
* The logarithms of Weibull random variables form a location-scale
family.
* Special paper used to be available for [Weibull probability
plots](http://weibull.com/hotwire/issue8/relbasics8.htm).
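The location-scale claim for the logarithms follows directly from the CDF: if $T$ has the Weibull CDF above, then
$$
P(\log T \le x) = P(T \le e^x)
= 1 - \exp\left\{-\left(\frac{e^x}{b}\right)^a\right\}
= 1 - \exp\left\{-e^{a(x - \log b)}\right\},
$$
so $\log T$ has a fixed (Gumbel-type) distribution shifted by $\log b$ and scaled by $1/a$.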
A Weibull QQ plot for `price` in the `diamonds` data:
```{r}
n <- nrow(diamonds)
p <- (1 : n) / n - 0.5 / n
ggplot(diamonds) +
    geom_point(aes(x = log10(qweibull(p, 1, 1)), y = log10(sort(price))))
```
* The lower tail does not match a Weibull distribution.
* Is this important?
* In engineering applications it often is.
* In selecting a reasonable model to capture the shape of this
distribution it may not be.
* QQ plots are helpful for understanding departures from a theoretical
model.
* No data will fit a theoretical model perfectly.
* Case-specific judgment is needed to decide whether departures are
important.
* George Box: All models are wrong but some are useful.
## Some References
Adam Loy, Lendie Follett, and Heike Hofmann (2016),
"Variations of Q–Q plots: The power of our eyes!",
_The American Statistician_.
John R. Michael (1983), "The stabilized probability plot," _Biometrika_
[JSTOR](https://www.jstor.org/stable/2335939?seq=1#page_scan_tab_contents).
M. B. Wilk and R. Gnanadesikan (1968), "Probability plotting methods
for the analysis of data," _Biometrika_
[JSTOR](https://www.jstor.org/stable/2334448?seq=1#page_scan_tab_contents).
Box, G. E. P. (1979), "Robustness in the strategy of scientific model
building", in Launer, R. L.; Wilkinson, G. N., _Robustness in
Statistics_, Academic Press, pp. 201–236.
Thomas Lumley (2019), ["What have I got against the Shapiro-Wilk
test?"](https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/)