class: center, middle, title-slide .title[ # ECDF, QQ, and PP Plots ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-06 ] --- layout: true <link rel="stylesheet" href="stat4580.css" type="text/css" /> <!-- title based on Wilke's chapter --> ## Cumulative Distribution and Survival Functions --- .pull-left[ The _empirical cumulative distribution function_ (ECDF) of a numeric sample computes the proportion of the sample at or below a specified value. {{content}} ] -- For the `yields` of the `barley` data: -- .pull-right[ .hide-code[ ```r library(ggplot2) data(barley, package = "lattice") thm <- theme_minimal() + theme(text = element_text(size = 16)) + theme(panel.border = element_rect(color = "grey30", fill = NA)) p <- ggplot(barley) + stat_ecdf(aes(x = yield)) + ylab("cumulative proportion") + thm p ``` <img src="qqpp_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Flipping the axes produces an _empirical quantile plot_: .hide-code[ ```r p + coord_flip() ``` <img src="qqpp_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ Both make it easy to look up: {{content}} ] -- * medians, quartiles, and other quantiles; {{content}} -- * the proportion of the sample below a particular value; {{content}} -- * the proportion above a particular value (one minus the proportion below). --- .pull-left[ An ECDF plot can also be constructed as a step function plot of the _relative rank_ (rank over sample size) against the observed values: ] -- .pull-right[ .hide-code[ ```r ggplot(barley) + geom_step( aes(x = yield, y = rank(yield) / length(yield))) + ylab("cumulative proportion") + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Reversing the relative ranks produces a plot of the _empirical survival function_: ] -- .pull-right[ .hide-code[ ```r ggplot(barley) + geom_step( aes(x = yield, y = rank(-yield) / length(yield))) + ylab("surviving proportion") + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-left[ Survival plots are often used for data representing time to failure in engineering or time to death or disease recurrence in medicine. ] --- For a highly skewed distribution, such as the distribution of `price` in the `diamonds` data, transforming the axis to a square root or log scale may help. -- .pull-left[ .hide-code[ ```r library(patchwork) p1 <- ggplot(diamonds) + stat_ecdf(aes(x = price)) + ylab("cumulative proportion") + thm p2 <- p1 + scale_x_log10() p1 + p2 ``` <img src="qqpp_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ There is a downside: Interpolating on a non-linear scale is much harder. ] --- layout: true ## QQ Plots --- ### Basics -- One way to assess how well a particular theoretical model describes a data distribution is to plot data quantiles against theoretical quantiles. -- This corresponds to transforming the ECDF horizontal axis to the scale of the theoretical distribution. -- The result is a plot of sample quantiles against theoretical quantiles, and should be close to a 45-degree straight line if the model fits the data well. -- Such a plot is called a quantile-quantile plot, or a QQ plot for short. --- .pull-left[ Usually a QQ plot {{content}} ] -- * uses points rather than a step function, and {{content}} -- * 1/2 is subtracted from the ranks before calculating relative ranks (this makes the rank range more symmetric): {{content}} -- For the `barley` data: -- .pull-right[ .hide-code[ ```r p <- ggplot(barley) + geom_point( aes(y = yield, x = qnorm((rank(yield) - 0.5) / length(yield)))) + xlab("theoretical quantile") + ylab("sample quantile") + thm p ``` <img src="qqpp_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ For a location-scale family of models, like the normal family, a QQ plot against standard normal quantiles should be close to a straight line if the model is a good fit. {{content}} ] -- For the normal family the intercept will be the mean and the slope will be the standard deviation. {{content}} -- Adding a line can help judge the quality of the fit: -- .pull-right[ .hide-code[ ```r p + geom_abline(aes(intercept = mean(yield), slope = sd(yield)), color = "red") ``` <img src="qqpp_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] ] -- `ggplot` provides `geom_qq` that makes this a little easier; base graphics provides `qqnorm` and `lattice` has `qqmath`. --- ### Some Examples -- .pull-left[ The histograms and density estimates for the `duration` variable in the `geyser` data set showed that the distribution is far from a normal distribution, and the normal QQ plot shows this as well: ] -- .pull-right[ .hide-code[ ```r data(geyser, package = "MASS") ggplot(geyser) + geom_qq(aes(sample = duration)) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> ] ] --- Except for rounding the `parent` heights in the `Galton` data seemed not too far from normally distributed: .pull-left[ .hide-code[ ```r data(Galton, package = "HistData") ggplot(Galton) + geom_qq(aes(sample = parent)) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ * Rounding interferes more with this visualization than with a histogram or a density plot. {{content}} ] -- * Rounding is more visible with this visualization than with a histogram or a density plot. --- Another Gatlton dataset available in the `UsingR` package with less rounding is `father.son`: .pull-left[ .hide-code[ ```r data(father.son, package = "UsingR") ggplot(father.son) + geom_qq(aes(sample = fheight)) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ The middle seems to be fairly straight, but the ends are somewhat wiggly. {{content}} ] -- How can you calibrate your judgment? --- ### Calibrating the Variability -- .pull-left[ One approach is to use simulation, sometimes called a _graphical bootstrap_. {{content}} ] -- The `nboot` function will simulate `R` samples from a normal distribution that match a variable `x` on sample size, sample mean, and sample SD. {{content}} -- The result is returned in a data frame suitable for plotting: -- .pull-right.code-80[ ```r library(dplyr) nsim <- function(n, m = 0, s = 1) { z <- rnorm(n) m + s * ((z - mean(z)) / sd(z)) } nboot <- function(x, R) { n <- length(x) m <- mean(x) s <- sd(x) sim <- function(i) { xx <- sort(nsim(n, m, s)) p <- (seq_along(x) - 0.5) / n data.frame(x = xx, p = p, sim = i) } bind_rows(lapply(1 : R, sim)) } ``` ] --- .pull-left[ Plotting these as lines shows the variability in shapes we can expect when sampling from the theoretical normal distribution: ] .pull-right[ .hide-code[ ```r gb <- nboot(father.son$fheight, 50) ggplot() + geom_line(aes(x = qnorm(p), y = x, group = sim), color = "gray", data = gb) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ We can then insert this simulation behind our data to help calibrate the visualization: ] .pull-right[ .hide-code[ ```r ggplot(father.son) + geom_line(aes(x = qnorm(p), y = x, group = sim), color = "gray", data = gb) + * geom_qq(aes(sample = fheight)) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Scalability For large sample sizes, such as `price` from the `diamonds` data, overplotting will occur: ] .pull-right[ .hide-code[ ```r ggplot(diamonds) + geom_qq(aes(sample = price)) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ This can be alleviated by using a grid of quantiles: ] .pull-right[ .hide-code[ ```r nq <- 100 p <- ((1 : nq) - 0.5) / nq ggplot() + geom_point(aes(x = qnorm(p), y = quantile(diamonds$price, p))) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A more reasonable model might be an exponential distribution: ] .pull-right[ .hide-code[ ```r ggplot() + geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p))) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Comparing Two Distributions {{content}} ] -- The QQ plot can also be used to compare two distributions based on a sample from each. {{content}} -- If the samples are the same size then this is just a plot of the ordered sample values against each other. {{content}} -- Choosing a fixed set of quantiles allows samples of unequal size to be compared. {{content}} -- Using a small set of quantiles we can compare the distributions of waiting times between eruptions of Old Faithful from the two different data sets we have looked at: -- .pull-right[ .hide-code[ ```r nq <- 31 p <- (1 : nq) / nq - 0.5 / nq wg <- geyser$waiting wf <- faithful$waiting ggplot() + geom_point(aes(x = quantile(wg, p), y = quantile(wf, p))) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Adding a 45-degree line: ] .pull-right[ .hide-code[ ```r ggplot() + geom_abline(intercept = 0, slope = 1, lty = 2) + geom_point(aes(x = quantile(wg, p), y = quantile(wf, p))) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## PP Plots --- .pull-left[ The PP plot for comparing a sample to a theoretical model plots the theoretical proportion less than or equal to each observed value against the actual proportion. {{content}} ] -- For a theoretical cumulative distribution function `\(F\)` this means plotting $$ F(x_i) \sim p_i $$ with $$ p_i = \frac{r_i - 1/2}{n} $$ where `\(r_i\)` is the `\(i\)`-th observation's rank. {{content}} -- For the `fheight` variable in the `father.son` data: -- .pull-right[ .hide-code[ ```r m <- mean(father.son$fheight) s <- sd(father.son$fheight) ggplot(father.son) + geom_point(aes(x = (rank(fheight) - 0.5) / length(fheight), y = pnorm(fheight, m, s))) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ * The values on the vertical axis are the _probability integral transform_ of the data for the theoretical distribution. {{content}} ] -- * If the data are a sample from the theoretical distribution then these transforms would be uniformly distributed on `\([0, 1]\)`. {{content}} -- * The PP plot is a QQ plot of these transformed values against a uniform distribution. {{content}} -- * The PP plot goes through the points `\((0, 0)\)` and `\((1, 1)\)` and so is much less variable in the tails. {{content}} -- Using the simulated data: -- .pull-right[ .hide-code[ ```r pp <- ggplot() + geom_line(aes(x = p, y = pnorm(x, m, s), group = sim), color = "gray", data = gb) + thm pp ``` <img src="qqpp_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Adding the `father.son` data: ] .pull-right[ .hide-code[ ```r pp + geom_point(aes(x = (rank(fheight) - 0.5) / length(fheight), y = pnorm(fheight, m, s)), data = (father.son)) ``` <img src="qqpp_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> ] ] -- The PP plot is also less sensitive to deviations in the tails. --- .pull-left[ A compromise between the QQ and PP plots uses the _arcsine square root_ variance-stabilizing transformation, which makes the variability approximately constant across the range of the plot: ] .pull-right[ .hide-code[ ```r vpp <- ggplot() + geom_line(aes(x = asin(sqrt(p)), y = asin(sqrt(pnorm(x, m, s))), group = sim), color = "gray", data = gb) + thm vpp ``` <img src="qqpp_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Adding the data: ] .pull-right[ .hide-code[ ```r vpp + geom_point( aes(x = asin(sqrt((rank(fheight) - 0.5) / length(fheight))), y = asin(sqrt(pnorm(fheight, m, s)))), data = (father.son)) ``` <img src="qqpp_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Plots For Assessing Model Fit --- Both QQ and PP plots can be used to asses how well a theoretical family of models fits your data, or your residuals. -- To use a PP plot you have to estimate the parameters first. -- For a location-scale family, like the normal distribution family, you can use a QQ plot with a standard member of the family. -- Some other families can use other transformations that lead to straight lines for family members. -- The Weibull family is widely used in reliability modeling; its CDF is $$ F(t) = 1 - \exp\left\(-\left(\frac{t}{b}\right)^a\right\)$$ -- The logarithms of Weibull random variables form a location-scale family. -- Special paper used to be available for [Weibull probability plots](https://weibull.com/hotwire/issue8/relbasics8.htm). --- .pull-left[ A Weibull QQ plot for `price` in the `diamonds` data: .hide-code[ ```r n <- nrow(diamonds) p <- (1 : n) / n - 0.5 / n ggplot(diamonds) + geom_point(aes(x = log10(qweibull(p, 1, 1)), y = log10(sort(price)))) + thm ``` <img src="qqpp_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] ] .pull-right[ The lower tail does not match a Weibull distribution. {{content}} ] -- Is this important? {{content}} -- In engineering applications it often is. {{content}} -- In selecting a reasonable model to capture the shape of this distribution it may not be. --- QQ plots are helpful for understanding departures from a theoretical model. -- No data will fit a theoretical model perfectly. -- Case-specific judgment is needed to decide whether departures are important. -- > George Box: All models are wrong but some are useful. --- layout: false ## Some References Adam Loy, Lendie Follett, and Heike Hofmann (2016), "Variations of Q–Q plots: The power of our eyes!", _The American Statistician_; ([preprint](https://arxiv.org/abs/1503.02098)). John R. Michael (1983), "The stabilized probability plot," _Biometrika_ [JSTOR](https://www.jstor.org/stable/2335939?seq=1#page_scan_tab_contents). M. B. Wilk and R. Gnanadesikan (1968), "Probability plotting methods for the analysis of data," _Biometrika_ [JSTOR](https://www.jstor.org/stable/2334448?seq=1#page_scan_tab_contents). Box, G. E. P. (1979), "Robustness in the strategy of scientific model building", in Launer, R. L.; Wilkinson, G. N., _Robustness in Statistics_, Academic Press, pp. 201–236. Thomas Lumley (2019), ["What have I got against the Shapiro-Wilk test?"](https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/) --- ## Readings Chapter [_Visualizing distributions: Empirical cumulative distribution functions and q-q plots_](https://clauswilke.com/dataviz/ecdf-qq.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). --- layout: true --- ## Exercises 1) The data set `heights` in package `dslabs` contains self-reported heights for a number of female and male students. The plot below shows the empirical CDF for male heights: .pull-left[ <img src="qqpp_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> ] .pull-right[ Based on the plot, what percentage of males are taller than 75 inches? * a. 100% * b. 15% * c. 20% * d. 3% ] --- 2. Consider the following normal QQ plots. .hide-code[ ```r library(dplyr) library(ggplot2) library(patchwork) thm <- theme_minimal() + theme(text = element_text(size = 16)) data(heights, package = "dslabs") set.seed(12345) p1 <- ggplot(NULL, aes(sample = rnorm(200))) + geom_qq() + labs(title = "Sample 1", x = "theoretical quantile", y = "sample quantile") + thm p2 <- ggplot(faithful, aes(sample = eruptions)) + geom_qq() + labs(title = "Sample 2", x = "theoretical quantile", y = "sample quantile") + thm p1 + p2 ``` <img src="qqpp_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> ] Would a normal distribution be a reasonable model for either of these samples? * a. No for Sample 1. Yes for Sample 2. * b. Yes for Sample 1. No for Sample 2. * c. Yes for both. * d. No for both.
//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505