ECDF, QQ, and PP Plots

class: center, middle, title-slide

.title[
# ECDF, QQ, and PP Plots
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-01-22
]

---

layout: true

## Cumulative Distribution and Survival Functions
---

.pull-left[
The _empirical cumulative distribution function_ (ECDF) of a numeric
sample computes the proportion of the sample at or below a specified
value.
{{content}}
]
--
    
For the `yields` of the `barley` data:

--
.pull-right[
.hide-code[

```r
library(ggplot2)
data(barley, package = "lattice")
thm <- theme_minimal() +
    theme(text = element_text(size = 16)) +
    theme(panel.border =
          element_rect(color = "grey30",
                       fill = NA))
p <- ggplot(barley) +
    stat_ecdf(aes(x = yield)) +
    ylab("cumulative proportion") +
    thm
p
```

<img src="qqpp_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Flipping the axes produces an _empirical quantile plot_:

.hide-code[

```r
p + coord_flip()
```

<img src="qqpp_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
Both make it easy to look up:
{{content}}
]
--

* medians, quartiles, and other quantiles;
{{content}}
--

* the proportion of the sample below a particular value;
{{content}}
--

* the proportion above a particular value (one minus the proportion below).

---

.pull-left[
An ECDF plot can also be constructed as a step function plot of the
_relative rank_ (rank over sample size) against the observed values:
]

.pull-right[
.hide-code[

```r
ggplot(barley) +
    geom_step(
        aes(x = yield,
            y = rank(yield) /
                length(yield))) +
    ylab("cumulative proportion") +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Reversing the relative ranks produces a plot of the _empirical survival
function_:
]

.pull-right[
.hide-code[

```r
ggplot(barley) +
    geom_step(
        aes(x = yield,
            y = rank(-yield) /
                length(yield))) +
    ylab("surviving proportion") +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]
]

.pull-left[
Survival plots are often used for data representing time to failure
in engineering or time to death or disease recurrence in
medicine.
]

---

For a highly skewed distribution, such as the distribution of `price`
in the `diamonds` data, transforming the axis to a square root or log
scale may help.

.pull-left[
.hide-code[

```r
library(patchwork)
p1 <- ggplot(diamonds) +
    stat_ecdf(aes(x = price)) +
    ylab("cumulative proportion") +
    thm
p2 <- p1 + scale_x_log10()
p1 + p2
```

<img src="qqpp_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
There is a downside: Interpolating on a non-linear scale is much harder.
]

---
layout: true
## QQ Plots
---

### Basics

One way to assess how well a particular theoretical model describes a
data distribution is to plot data quantiles against theoretical quantiles.

This corresponds to transforming the ECDF horizontal axis to the scale of
the theoretical distribution.

The result is a plot of sample quantiles against theoretical
quantiles, and should be close to a 45-degree straight line if the
model fits the data well.

Such a plot is called a quantile-quantile plot, or a QQ plot for short.

---

.pull-left[
Usually a QQ plot
{{content}}
]
--

* uses points rather than a step function, and
{{content}}
--

* 1/2 is subtracted from the ranks before calculating relative ranks (this
  makes the rank range more symmetric):
{{content}}
--

For the `barley` data:

.pull-right[
.hide-code[

```r
p <- ggplot(barley) +
    geom_point(
        aes(y = yield,
            x = qnorm((rank(yield) - 0.5) /
                      length(yield)))) +
    xlab("theoretical quantile") +
    ylab("sample quantile") +
    thm
p
```

<img src="qqpp_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
For a location-scale family of models, like the normal family, a QQ
plot against standard normal quantiles should be close to a straight
line if the model is a good fit.
{{content}}
]
--

For the normal family the intercept will be the mean and the slope
will be the standard deviation.
{{content}}
--

Adding a line can help judge the quality of the fit:

.pull-right[
.hide-code[

```r
p + geom_abline(aes(intercept = mean(yield),
                    slope = sd(yield)),
                color = "red")
```

<img src="qqpp_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]
]

`ggplot` provides `geom_qq` that makes this a little easier; base
graphics provides `qqnorm` and `lattice` has `qqmath`.

---

### Some Examples

.pull-left[
The histograms and density estimates for the `duration` variable in
the `geyser` data set showed that the distribution is far from a
normal distribution, and the normal QQ plot shows this as well:
]

.pull-right[
.hide-code[

```r
data(geyser, package = "MASS")
ggplot(geyser) +
    geom_qq(aes(sample = duration)) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]
]

---

Except for rounding the `parent` heights in the `Galton` data seemed
not too far from normally distributed:

.pull-left[
.hide-code[

```r
data(Galton, package = "HistData")
ggplot(Galton) +
    geom_qq(aes(sample = parent)) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
* Rounding interferes more with this visualization than with a histogram or a
  density plot.
{{content}}
]
--

* Rounding is more visible with this visualization than with a histogram or a
  density plot.

---

Another Gatlton dataset available in the `UsingR` package with less
rounding is `father.son`:

.pull-left[
.hide-code[

```r
data(father.son, package = "UsingR")
ggplot(father.son) +
    geom_qq(aes(sample = fheight)) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
The middle seems to be fairly straight, but the ends are somewhat wiggly.
{{content}}
]

How can you calibrate your judgment?

---

### Calibrating the Variability

.pull-left[
One approach is to use simulation, sometimes called a _graphical
bootstrap_.
{{content}}
]
--

The `nboot` function will simulate `R` samples from a normal
distribution that match a variable `x` on sample size, sample mean,
and sample SD.
{{content}}
--

The result is returned in a data frame suitable for plotting:

.pull-right.code-80[

```r
library(dplyr)
nsim <- function(n, m = 0, s = 1) {
    z <- rnorm(n)
    m + s * ((z - mean(z)) / sd(z))
}

nboot <- function(x, R) {
    n <- length(x)
    m <- mean(x)
    s <- sd(x)
    sim <- function(i) {
        xx <- sort(nsim(n, m, s))
        p <- (seq_along(x) - 0.5) / n
        data.frame(x = xx, p = p, sim = i)
    }
    bind_rows(lapply(1 : R, sim))
}
```
]

---

.pull-left[
Plotting these as lines shows the variability in shapes we can expect
when sampling from the theoretical normal distribution:
]

.pull-right[
.hide-code[

```r
gb <- nboot(father.son$fheight, 50)
ggplot() +
    geom_line(aes(x = qnorm(p),
                  y = x,
                  group = sim),
              color = "gray", data = gb) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
We can then insert this simulation behind our data to help calibrate
the visualization:
]

.pull-right[
.hide-code[

```r
ggplot(father.son) +
    geom_line(aes(x = qnorm(p),
                  y = x,
                  group = sim),
              color = "gray", data = gb) +
*   geom_qq(aes(sample = fheight)) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Scalability

For large sample sizes, such as `price` from the `diamonds` data,
overplotting will occur:
]

.pull-right[
.hide-code[

```r
ggplot(diamonds) +
    geom_qq(aes(sample = price)) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
This can be alleviated by using a grid of quantiles:
]

.pull-right[
.hide-code[

```r
nq <- 100
p <- ((1 : nq) - 0.5) / nq
ggplot() +
    geom_point(aes(x = qnorm(p),
                   y = quantile(diamonds$price, p))) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A more reasonable model might be an exponential distribution:
]

.pull-right[
.hide-code[

```r
ggplot() +
    geom_point(aes(x = qexp(p), y = quantile(diamonds$price, p))) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Comparing Two Distributions
{{content}}
]
--

The QQ plot can also be used to compare two distributions based on a
sample from each.
{{content}}
--

If the samples are the same size then this is just a plot of the
ordered sample values against each other.
{{content}}
--

Choosing a fixed set of quantiles allows samples of unequal size to be
compared.
{{content}}
--

Using a small set of quantiles we can compare the distributions of
waiting times between eruptions of Old Faithful from the two different
data sets we have looked at:

.pull-right[
.hide-code[

```r
nq <- 31
p <- (1 : nq) / nq - 0.5 / nq
wg <- geyser$waiting
wf <- faithful$waiting
ggplot() +
    geom_point(aes(x = quantile(wg, p),
                   y = quantile(wf, p))) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adding a 45-degree line:
]

.pull-right[
.hide-code[

```r
ggplot() +
    geom_abline(intercept = 0, slope = 1, lty = 2) +
    geom_point(aes(x = quantile(wg, p), y = quantile(wf, p))) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]
]

---
layout: true
## PP Plots
---

.pull-left[
The PP plot for comparing a sample to a theoretical model plots the
theoretical proportion less than or equal to each observed value
against the actual proportion.
{{content}}
]
--

For a theoretical cumulative distribution function `$F$` this means plotting

$$
F(x_i) \sim p_i
$$

with

$$
p_i = \frac{r_i - 1/2}{n}
$$

where `$r_i$` is the `$i$`-th observation's rank.
{{content}}
--

For the `fheight` variable in the `father.son` data:

.pull-right[
.hide-code[

```r
m <- mean(father.son$fheight)
s <- sd(father.son$fheight)
ggplot(father.son) +
    geom_point(aes(x = (rank(fheight) - 0.5) /
                       length(fheight),
                   y = pnorm(fheight, m, s))) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
* The values on the vertical axis are the _probability integral
  transform_ of the data for the theoretical distribution.
{{content}}
]
--

* If the data are a sample from the theoretical distribution then
  these transforms would be uniformly distributed on `$[0, 1]$`.
{{content}}
--

* The PP plot is a QQ plot of these transformed values against a
  uniform distribution.
{{content}}
--

* The PP plot goes through the points `$(0, 0)$` and `$(1, 1)$` and so is
  much less variable in the tails.
{{content}}
--

Using the simulated data:

.pull-right[
.hide-code[

```r
pp <- ggplot() +
    geom_line(aes(x = p,
                  y = pnorm(x, m, s),
                  group = sim),
              color = "gray",
              data = gb) +
    thm
pp
```

<img src="qqpp_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adding the `father.son` data:
]

.pull-right[
.hide-code[

```r
pp +
    geom_point(aes(x = (rank(fheight) - 0.5) /
                       length(fheight),
                   y = pnorm(fheight, m, s)),
               data = (father.son))
```

<img src="qqpp_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]
]

The PP plot is also less sensitive to deviations in the tails.

---

.pull-left[
A compromise between the QQ and PP plots uses the _arcsine square root_
variance-stabilizing transformation, which makes the variability
approximately constant across the range of the plot:
]

.pull-right[
.hide-code[

```r
vpp <- ggplot() +
    geom_line(aes(x = asin(sqrt(p)),
                  y = asin(sqrt(pnorm(x, m, s))),
                  group = sim),
              color = "gray", data = gb) +
    thm
vpp
```

<img src="qqpp_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adding the data:
]

.pull-right[
.hide-code[

```r
vpp +
    geom_point(
        aes(x = asin(sqrt((rank(fheight) - 0.5) /
                          length(fheight))),
            y = asin(sqrt(pnorm(fheight, m, s)))),
        data = (father.son))
```

<img src="qqpp_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]
]

---
layout: true
## Plots For Assessing Model Fit
---

Both QQ and PP plots can be used to asses how well a theoretical
family of models fits your data, or your residuals.

To use a PP plot you have to estimate the parameters first.

For a location-scale family, like the normal distribution family, you
can use a QQ plot with a standard member of the family.

Some other families can use other transformations that lead to
straight lines for family members.

The Weibull family is widely used in reliability modeling; its
CDF is

$$ F(t) = 1 - \exp\left$-\left(\frac{t}{b}\right)^a\right$$$

The logarithms of Weibull random variables form a location-scale
family.

Special paper used to be available for [Weibull probability
plots](https://web.archive.org/web/20130207071825/https://weibull.com/hotwire/issue8/relbasics8.htm).

---

.pull-left[
A Weibull QQ plot for `price` in the `diamonds` data:

.hide-code[

```r
n <- nrow(diamonds)
p <- (1 : n) / n - 0.5 / n
ggplot(diamonds) +
    geom_point(aes(x = log10(qweibull(p, 1, 1)), y = log10(sort(price)))) +
    thm
```

<img src="qqpp_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
The lower tail does not match a Weibull distribution.
{{content}}
]
--

Is this important?
{{content}}
--

In engineering applications it often is.
{{content}}
--

In selecting a reasonable model to capture the shape of this
distribution it may not be.

---

QQ plots are helpful for understanding departures from a theoretical
model.

No data will fit a theoretical model perfectly.

Case-specific judgment is needed to decide whether departures are
important.

> George Box: All models are wrong but some are useful.

---
layout: false
## Some References

Adam Loy, Lendie Follett, and Heike Hofmann (2016), "Variations of Q–Q
plots: The power of our eyes!", _The American Statistician_;
([preprint](https://arxiv.org/abs/1503.02098)).

John R. Michael (1983), "The stabilized probability plot," _Biometrika_
[JSTOR](https://www.jstor.org/stable/2335939?seq=1#page_scan_tab_contents).

M. B. Wilk and R. Gnanadesikan (1968), "Probability plotting methods
for the analysis of data," _Biometrika_
[JSTOR](https://www.jstor.org/stable/2334448?seq=1#page_scan_tab_contents).

Box, G. E. P. (1979), "Robustness in the strategy of scientific model
building", in Launer, R. L.; Wilkinson, G. N., _Robustness in
Statistics_, Academic Press, pp. 201–236.

Thomas Lumley (2019), ["What have I got against the Shapiro-Wilk
test?"](https://notstatschat.rbind.io/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/)

---

## Readings

Chapter [_Visualizing distributions: Empirical cumulative distribution
  functions and q-q
  plots_](https://clauswilke.com/dataviz/ecdf-qq.html) in
  [_Fundamentals of Data
  Visualization_](https://clauswilke.com/dataviz/).

---
layout: true
---

## Exercises

1) The data set `heights` in package `dslabs` contains self-reported
    heights for a number of female and male students. The plot below
    shows the empirical CDF for male heights:

.pull-left[
<img src="qqpp_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

.pull-right[
Based on the plot, what percentage of males are taller than 75 inches?

*   a. 100%
*   b. 15%
*   c. 20%
*   d. 3%
]

---

2. Consider the following normal QQ plots.

.hide-code[

```r
library(dplyr)
library(ggplot2)
library(patchwork)
thm <- theme_minimal() + theme(text = element_text(size = 16))
data(heights, package = "dslabs")
set.seed(12345)
p1 <- ggplot(NULL, aes(sample = rnorm(200))) +
    geom_qq() +
    labs(title = "Sample 1",
         x = "theoretical quantile",
         y = "sample quantile") +
    thm
p2 <- ggplot(faithful, aes(sample = eruptions)) +
    geom_qq() +
    labs(title = "Sample 2",
         x = "theoretical quantile",
         y = "sample quantile") +
    thm
p1 + p2
```

<img src="qqpp_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

Would a normal distribution be a reasonable model for either of
these samples?

*   a. No for Sample 1. Yes for Sample 2.
*   b. Yes for Sample 1. No for Sample 2.
*   c. Yes for both.
*   d. No for both.

//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505