Visualizing Distributions

class: center, middle, title-slide

.title[
# Visualizing Distributions
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-04-01
]

---

layout: true

## Introduction
---

Once there are more than a handful of numeric data values it is often
useful to step back and look at the _distribution_ of the data values:

* Where is the bulk of the data located?

* Is there a single area of concentration or are there several?

* Is the data distribution symmetric or is it skewed, i.e. trails off
  more slowly in one direction or another?

* Are there extreme, or outlying, values?

* Are there any suspicious or impossible values?

* Are there gaps in the data?

* Is there rounding, e.g. to integer values, or _heaping_, i.e. a
  few particular values occur very frequently?

---

Plots for visualizing distributions include

* Strip plots.

* Histograms.

* Density plots.

* Box plots.

* Violin plots.

* Swarm plots.

* Density ridges

---
layout: true
## Strip Plots
---

.pull-left.small-code[
### Strip Plot Basics

A variant of the dot plot is known as a _strip plot_.

A strip plot for the city temperature data is

.hide-code[

```r
thm <- theme_minimal() +
    theme(text = element_text(size = 16))
ggplot(citytemps) +
    geom_point(aes(x = temp, y = "All")) +
    thm +
    theme(axis.title.y = element_blank(),
          axis.text.y = element_blank())
```

<img src="dists_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]
]

.pull-right.small-code[
The strip plot can reveal gaps and outliers.
{{content}}
]

After looking at the plot we might want to examine the high and
low values:
{{content}}
--

```r
filter(citytemps, temp > 85)
##             city temp
## 1       Asuncion   95
## 2        Caracas   90
## 3  Dar es Salaam   86
## 4       Kinshasa   86
## 5          Lagos   91
## 6        Managua   88
## 7 Rio de Janeiro   88
## 8      São Paulo   90
filter(citytemps, temp < 10)
##       city temp
## 1   Anadyr    3
## 2  Calgary    1
## 3 Edmonton  -11
## 4     Kyiv    9
## 5    Minsk    0
## 6   Moscow    3
## 7 Winnipeg    8
```

---

For the eruption durations in the `faithful` data a strip plot shows
the two modes around 2 and 4 minutes:

.small-code[
.hide-code[

```r
ggplot(faithful) +
    geom_point(aes(x = eruptions, y = "All")) +
    thm +
    theme(axis.title.y = element_blank(),
          axis.text.y = element_blank())
```

<img src="dists_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Multiple Groups
{{content}}
]
--

Strip  plots are most useful for showing subsets corresponding to a
categorical variable.
{{content}}
--

A strip plot for the yields for different varieties in the barley data
is

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_point(aes(x = yield, y = variety)) +
    theme_minimal() +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Scalability
{{content}}
]
--

Scalability in this form is limited due to over-plotting.
{{content}}
--

A simple strip plot of `price` within the different `cut` levels in
the `diamonds` data is not very helpful:
--

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_point(aes(x = price, y = cut)) +
    thm +
    theme(axis.title.y = element_blank())
```

<img src="dists_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Several approaches are available to reduce the impact of over-plotting:
{{content}}
]
--

* reduce the point size;
{{content}}
--

* random displacement of points, called _jittering_;
{{content}}
--

* making the points translucent, or _alpha blending_.
{{content}}
--

Combining all three for examining `price` within `cut` for the
`diamonds` data produces

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2,
               position = position_jitter(width = 0),
               alpha = 0.2) +
    thm + theme(axis.title.y = element_blank())
```

<img src="dists_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Skewness of the price distributions can be seen in this plot, though
other approaches will show this more clearly.
{{content}}
]
--

A peculiar feature reveled by this plot is the gap below
2000.
{{content}}
--

Examining the subset with `price < 2000` shows the gap is
roughly symmetric around 1500:

.pull-right.small-code[
.hide-code[

```r
ggplot(filter(diamonds, price < 2000)) +
    geom_point(aes(x = price, y = cut),
               size = 0.2,
               position = position_jitter(width = 0),
               alpha = 0.2) +
    thm +
    theme(axis.title.y = element_blank())
```

<img src="dists_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A plot along these lines was used on the New York Times [front page for
February 21, 2021](../img/NYT-2021-02-21.jpeg).
]

.pull-right[
<img src="../img/NYT-2021-02-21.jpeg" width="55%" style="display: block; margin: auto;" />
]

---

### Some Notes

* With a good combination of point size choice, jittering, and alpha
  blending the strip plot for groups of data can scale to several
  hundred thousand observations and ten to twenty of groups.

* For very large datat sets it can be useful to look at a strip plot
  of a sample of the data.

* Strip plots can reveal gaps, outliers, and data outside of the
  expected range.

* Skewness and multi-modality can be seen, but other visualizations
  show these more clearly.

* Storage needed for vector graphics images grows linearly with the
  number of observations.

* Base graphics provides `stripchart` and lattice provides `stripplot`.

---
layout: true
## Histograms
---

### Histogram Basics

.pull-left[
Historams are constructed by binning the data and counting the number
of observations in each bin.
{{content}}
]
--

The objective is usually to visualize the shape of the distribution.
{{content}}
--

The number of bins needs to be
{{content}}
--

* small enough to reveal interesting features;
{{content}}
--

* large enough not to be too noisy.
{{content}}
--

A very small bin width can be used to look for rounding or heaping.

.pull-right[
Common choices for the vertical scale are:
{{content}}
]
--

* bin counts, or frequencies;
{{content}}
--

* counts per unit, or densities.
{{content}}

The count scale is more intepretable for lay viewers.
{{content}}
--

The density scale is more suited for comparison to mathematical
density models.
{{content}}
--

Constructing histograms with unequal bin widths is possible but
rarely a good idea.
---

.pull-left[
### Histograms in R
{{content}}
]
--

There are many ways to plot histograms in R:
{{content}}
--

* the `hist()` function in the base `graphics` package;
{{content}}
--

* `truehist()` in package `MASS`;
{{content}}
--

* `histogram()` in package `lattice`;
{{content}}
--

* `geom_histogram()` in package `ggplot2`.
{{content}}
--

A histogram of eruption durations for another data set on Old Faithful
eruptions, this one from package `MASS`:
--

.pull-right.small-code[
.hide-code[

```r
data(geyser, package = "MASS")
ggplot(geyser) +
    geom_histogram(aes(x = duration)) +
    thm
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="dists_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
The default settings using `geom_histogram` are less than ideal.
{{content}}
]
--

Using a binwidth of 0.5 and customized `fill` and `color` settings
produces a better result:
--

.pull-right.small-code[
.hide-code[

```r
ggplot(geyser) +
    geom_histogram(aes(x = duration),
                   binwidth = 0.5,
                   fill = "grey",
                   color = "black") +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left.small-code[
Reducing the bin width shows an interesting feature:

.hide-code[

```r
ggplot(geyser) +
    geom_histogram(aes(x = duration),
                   binwidth = 0.05,
                   fill = "grey",
                   color = "black") +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
* Eruptions were sometimes classified as _short_ or _long_; these were
  coded as 2 and 4 minutes.
{{content}}
]
--

* For many purposes this kind of heaping or rounding does not matter.
{{content}}
--

* It would matter if we wanted to estimate means and standard
  deviations of the durations of the long and short eruptions.
{{content}}
--

* More data and information about geysers is available at
  https://geysertimes.org/.
  
{{content}}
--

* For exploration there is no one "correct" bin width or number of
  bins.
{{content}}
--

* It would be very useful to be able to change this parameter
  interactively.

---

### Superimposing a Density

A histogram can be used to compare the data distribution to a
theoretical model, such as a normal distribution.

This requires using a _density scale_ for the vertical axis.

The `Galton` data frame in the `HistData` package is one of several
data sets used by Galton to study the heights of parents and their
children.

Adding a normal density curve to a `ggplot` histogram involves:

* computing the parameters of the density;

* creating the histogram with a density scale using the computed
  variable `after_stat(density)`;

* adding the function curve using `geom_function`, `stat_function`, or
  `geom_line`.

---

.pull-left.small-code[
Create the histogram with a density scale using the _computed varlable_
`after_stat(density)`:

{{content}}
]
--

```r
data(Galton, package = "HistData")
ggplot(Galton) +
    geom_histogram(aes(x = parent,
                       y = after_stat(density)),
                   binwidth = 1,
                   fill = "grey",
                   color = "black") +
    thm
```

.pull-right[
<img src="dists_files/figure-html/galton-hist-1.png" style="display: block; margin: auto;" />
]

---

.pull-left.small-code[
Then compute the mean and standard deviation and add the normal
density curve:

```r
data(Galton, package = "HistData")
*p_mean <- mean(Galton$parent)
*p_sd <- sd(Galton$parent)
*p_dens <- function(x) dnorm(x, p_mean, p_sd)
ggplot(Galton) +
    geom_histogram(aes(x = parent,
                       y = after_stat(density)),
                   binwidth = 1,
                   fill = "grey",
                   color = "black") +
*   geom_function(fun = p_dens, color = "red") +
    thm
```
]

.pull-right[
<img src="dists_files/figure-html/galton-hist-dens-1.png" style="display: block; margin: auto;" />
]

---

.pull-left.width-40[
### Multiple Groups
{{content}}
]
--

Faceting works well for showing comparative histograms for multiple
groups.
{{content}}
--

Histograms of `price` within `cut` for the `diamonds` data:

.pull-right.width-60.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_histogram(aes(x = price),
                   binwidth = 1000,
                   color = "black", fill = "grey") +
    facet_wrap(~ cut) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left.width-40[
These histograms show counts on the vertical axis, and their sizes
reflect the total counts for the groups.
{{content}}
]
--

Together the plots represent a view of the joint distribution of `cut`
and `price`.
{{content}}
--

Switching to a density scale by using `after_stat(density)` for the `y`
aesthetic allows the conditional distributions of `price` within
groups to be compared:
--

.pull-right.width-60.small-code[
.hide-code[

```r
p <- ggplot(diamonds) +
    geom_histogram(aes(x = price,
                       y = after_stat(density)),
                   binwidth = 1000,
                   color = "black",
                   fill = "grey") +
    thm
p + facet_wrap(~ cut)
```

<img src="dists_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
By mapping the `fill` aesthetic to `cut` it is possible to produce a
stacked histogram or a superimposed histogram

* `position = "stack"`, the default, for stacked;
* `position = "identity"` for superimposed.
{{content}}
]
--

But neither works very well visually.
{{content}}
--

For comparing locations of features it can help to facet with a single
column.
{{content}}
--

But this may create aspect ratios that are not ideal.

.pull-right.small-code[
<img src="dists_files/figure-html/hist-facet-one-col-1.png" style="display: block; margin: auto;" />
]

---

### Scalability

Histograms scale very well.

* The visual performance does not deteriorate with increasing numbers
  of observations.

* The computational effort needed is linear in the number of observations.

* The amount of storage needed for an image object is linear in the
  number of bins.

---
layout: true
## Density Plots
---

### Density Plot Basics

.pull-left[
Density plots can be thought of as plots of smoothed histograms.
{{content}}
]
--

The smoothness is controlled by a _bandwidth_ parameter that is
analogous to the histogram binwidth.
{{content}}
--

Most density plots use a [_kernel density
estimate_](https://en.wikipedia.org/wiki/Kernel_density_estimation),
but there are other possible strategies; qualitatively the particular
strategy rarely matters.
{{content}}
--

A density plot of the `geyser` `duration` variable with default
bandwidth:

.pull-right.small-code[
.hide-code[

```r
ggplot(geyser) +
    geom_density(aes(x = duration)) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Using a smaller bandwidth shows the heaping at 2 and 4 minutes:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(geyser) +
    geom_density(aes(x = duration), bw = 0.05) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
For a moderate number of observations a useful addition is a jittered
_rug plot_:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(geyser) +
    geom_density(aes(x = duration)) +
    geom_rug(aes(x = duration, y = 0),
             position =
                 position_jitter(height = 0)) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]
]

---

### Scalability

.pull-left[
Visual scalability is very good.
{{content}}
]

For the `diamonds` data `price` variable:

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_density(aes(x = price)) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]
]

---

Density estimates are generally computed at a grid of points and
interpolated.

Defaults in R vary from 50 to 512 points.

Computational effort for a density estimate at a point is proportional
to the number of observations.

Storage needed for an image is proportional to the number of points
where the density is estimated.

---

.pull-left[
### Multiple Groups
{{content}}
]
--

Density estimates for several groups can be shown in a single plot by
mapping a group index to an aesthetic, such as `color`:

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density(aes(x = yield,
                     color = site)) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Using `fill` and `alpha` can also be useful:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density(aes(x = yield,
                     fill = site),
                 alpha = 0.2)
```

<img src="dists_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Multiple densities in a single plot works best with a smaller number of
categories, say 2 or 3:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density(aes(x = yield,
                     fill = year),
                 alpha = 0.4) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Using small multiples, or faceting, may be a better option:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) + geom_density(aes(x = yield)) + facet_wrap(~ site) + thm
```

<img src="dists_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
These ideas can be combined:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density(aes(x = yield, color = year)) +
    facet_wrap(~ site) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />
]
{{content}}
]
--

These plots again show lower yields for 1932 than for 1931 for all
sites except Morris.

---

.pull-left[
Density plots default to using the density scale.
{{content}}
]
--

For the diamonds data a density plot of `price` faceted on `cut` shows
the conditional distributions of `price` at the different `cut`
levels:

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_density(aes(x = price)) +
    facet_wrap(~ cut) + thm
```

<img src="dists_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Mapping the `y` aesthetic to `after_stat(count)` shows the joint distribution
of `price` and `cut`:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_density(aes(x = price,
                     y = after_stat(count))) +
    facet_wrap(~ cut) + thm
```

<img src="dists_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A stacked density plot is sometimes useful but often hard to read:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_density(aes(x = price,
                     y = after_stat(count),
                     fill = cut),
                 position = "stack") +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
An intermediate option: A faceted plot on the count scale with a muted
plot for the full data to allow proportions of the whole to be
assessed:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_density(aes(x = price, y = after_stat(count)),
                 fill = "lightgrey", color = NA,
                 data = mutate(diamonds, cut = NULL)) +
    geom_density(aes(x = price,
                     y = after_stat(count),
                     fill = cut),
                 position = "stack", color = NA) +
    facet_wrap(~ cut) +
    scale_fill_viridis_d(guide = "none") +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A filled density plot provides a vew of the conditional distribution
of `cut` at the different price levels:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(diamonds) +
    geom_density(aes(x = price, y = after_stat(count), fill = cut),
                 position = "fill") +
    ylab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]
{{content}}
]
--

This is called a _CD plot_, or a _conditional density plot_.

---

### Some Notes

Computations are generally done with the base R function `density`.

`plot` has a method for the results returned by this function, so a
density plot can be created with

```r
plot(density(geyser$duration))
```

The `lattice` package provides the function `densityplot`.

---

### Interactive Bandwidth Choice

Being able to chose the bandwidth of a density plot, or the binwidth
of a histogram, interactively is useful for exploration.

One way to do this in R (which unfortunately does not work on the
RStudio server):

```r
data(geyser, package = "MASS")
source("https://stat.uiowa.edu/~luke/classes/STAT7400/examples/tkdens.R")
tkdens(geyser$duration, tkrplot = TRUE)
```

Another option:

```r
data(geyser, package = "MASS")
source("https://stat.uiowa.edu/~luke/classes/STAT7400/examples/shinydens.R")
shinyDens(geyser$duration)
```

<!--

ggplot(barley) +
    geom_density(aes(x = yield, y = -after_stat(count), fill = year),
                 data = filter(barley, year == 1931)) +
    geom_density(aes(x = yield, y = after_stat(count), fill = year),
                 data = filter(barley, year == 1932)) +
    coord_flip()
ggplot(barley) +
    geom_density(aes(x = yield, y = -after_stat(density), fill = year),
                 data = filter(barley, year == 1931)) +
    geom_density(aes(x = yield, fill = year),
                 data = filter(barley, year == 1932)) +
    coord_flip() +
    facet_wrap(~site)

## From Claus Wilke's book:
data(Titanic, package = "Stat2Data")
library(dplyr)
library(tidyr)
library(cowplot)
library(ggplot2)

Titanic |>
    select(-SexCode) |>
    rename(name = Name, class = PClass, age = Age, sex = Sex,
           survived = Survived) |>
    mutate(name = as.character(name),
           class = as.character(class),
           sex = as.character(sex)) -> titanic_all

##titanic <- select(titanic_train, age = Age, sex = Sex)
titanic <- titanic_all

data.frame(
    age = (1:28)*3 - 1.5, 
    male = hist(filter(titanic, sex == "male")$age,
                breaks = (0:28)*3 + .01, plot = FALSE)$counts,
    female = hist(filter(titanic, sex == "female")$age,
                  breaks = (0:28)*3 + .01, plot = FALSE)$counts
) |>
    gather(gender, count, -age) -> gender_counts

ggplot(gender_counts,
       aes(x = age, y = ifelse(gender == "male",-1, 1)*count, fill = gender)) + 
    geom_col() +
    scale_x_continuous(name = "age (years)", limits = c(0, 75),
                       expand = c(0, 0)) +
    scale_y_continuous(name = "count", breaks = 20*(-2:1),
                       labels = c("40", "20", "0", "20")) +
    scale_fill_manual(values = c("#D55E00", "#0072B2"), guide = "none") +
    draw_text(x = 70, y = -39, "male", hjust = 0) +
    draw_text(x = 70, y = 21, "female", hjust = 0) +
    coord_flip()
-->

---
layout: true
## Boxplots
---

.pull-left[
_Boxplots_, or _box-and-whisker_ plots, provide a skeletal
representation of a distribution.
{{content}}
]
--

They are very well suited for showing distributions for multiple
groups.
{{content}}
--

There are many variations of boxplots:
{{content}}
--

* Most start with a box from the first to the third quartiles and
  divided by the median.
{{content}}
--

* The simplest form then adds a whisker from the lower quartile to the
  minimum and from the upper quartile to the maximum.
{{content}}
--

* More common is to draw the upper whisker to the largest point below
  the upper quartile `$+ 1.5 * IQR$`, and the lower whisker analogously.
{{content}}
--

* _Outliers_ falling outside the range of the whiskers are then drawn
  directly:

.pull-right.small-code[
.hide-code[

```r
library(gapminder)
library(ggplot2)
ggplot(gapminder) +
    geom_boxplot(aes(x = continent, y = gdpPercap)) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
There are variants that distinguish between _mild outliers_ and
_extreme outliers_.
{{content}}
]
--

A common variant is to show an approximate 95% confidence interval for
the population median as a _notch_:

.pull-right.small-code[
.hide-code[

```r
ggplot(gapminder) +
    geom_boxplot(aes(x = continent, y = gdpPercap),
                 notch = TRUE) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Another variant is to use a width proportional to the square root of
the sample size to reflect the strength of evidence in the data:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(gapminder) +
    geom_boxplot(aes(x = continent, y = gdpPercap),
                 notch = TRUE, varwidth = TRUE) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
With moderate sample sizes it can be useful to super-impose the
original data, perhaps with jittering and alpha blending.
{{content}}
]
--

The outliers in the box plot can be turned off with `outlier.color =
NA` so they are not shown twice:

.pull-right.small-code[
.hide-code[

```r
p <- ggplot(gapminder) +
    geom_boxplot(aes(x = continent, y = gdpPercap),
                 notch = TRUE, varwidth = TRUE,
                 outlier.color = NA) +
    xlab(NULL) +
    thm
p + geom_point(aes(x = continent, y = gdpPercap),
               position =
                   position_jitter(width = 0.1),
               alpha = 0.1)
```

<img src="dists_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
]
]

---
layout: true
## Violin Plots
---

.pull-left[
A variant of the boxplot is the _violin plot_:
{{content}}
]
--

> Hintze, J. L., Nelson, R. D. (1998), "Violin Plots: A Box
> Plot-Density Trace Synergism," _The American Statistician_ 52,
> 181-184.
{{content}}
--

The violin plot uses density estimates to show the distributions:

.pull-right.small-code[
.hide-code[

```r
ggplot(gapminder) +
    geom_violin(aes(x = continent, y = gdpPercap)) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
By default the "violins" are scaled to have the same area.
{{content}}
]
--

They can also be scaled to have the same maximum height or to have
areas proportional to sample sizes.
{{content}}
--

This is done by adding

* `scale = "width"` or
* `scale = "count"`

to the `geom_violin` call.
{{content}}
--

A comparison of boxplots and violin plots:

.pull-right.small-code[
.hide-code[

```r
ggplot(gapminder) +
    geom_boxplot(aes(x = continent, y = gdpPercap)) +
    geom_violin(aes(x = continent, y = gdpPercap),
                fill = NA, scale = "width",
                linetype = 2) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A combination of boxplots and violin plots:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(gapminder) +
    geom_violin(aes(x = continent, y = gdpPercap),
                scale = "width") +
    geom_boxplot(aes(x = continent, y = gdpPercap),
                 width = .1) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" />
]
{{content}}
]

There are other variations, e.g. _vase plots_.

---

.pull-left[
Boxplots do not reflect the shape of a distribution.
{{content}}
]
--

For the `eruptions` in the `faithful` data set:

.pull-right.small-code[
.hide-code[

```r
ggplot(faithful) +
    geom_boxplot(aes(y = eruptions, x = "Box")) +
    geom_violin(aes(y = eruptions, x = "Violin"),
                trim = FALSE) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]
]

---
layout: true
## Swarm Plots
---

.pull-left[
Swarm plots show the full data in a form that also shows the density.
{{content}}
]
--

There are a number of variations and names, including _beeswarm
plots_, _violin scatterplots_, _violin strip charts_, and _sina plots_
{{content}}
--

[Sina
plots](https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1366914)
are available as `geom_sina` in the `ggforce` package:

--

.pull-right.small-code[
.hide-code[

```r
library(ggforce)
ggplot(gapminder,
       aes(x = continent, y = gdpPercap)) +
    geom_sina(size = 0.2) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Combined with a width-scaled violin plot:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(gapminder,
       aes(x = continent, y = gdpPercap)) +
    geom_violin(scale = "width") +
    geom_sina(color = "blue",
              size = 0.4,
              scale = FALSE) +
    xlab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
]
]

---
layout: true
## Effectiveness and Scalability
---

.pull-left.smaller[
* Boxplots are very simple and easy to compare.
{{content}}
]
--

* Boxplots strongly emphasize the middle half of the data.
{{content}}
--

* Boxplots may not be easy for a lay viewer to understand.

.pull-right.smaller[
* Box plots scale fairly well visually and computationally in the
  number of observations; over-plotting/storage of outliers becomes an
  issue for larger data sets
{{content}}
]
--

* Violin plots scale well both visually and computationally in the
  number of observations.

.small-code[
.hide-code[

```r
library(patchwork)
p1 <- ggplot(diamonds) +
    geom_boxplot(aes(x = cut, y = price)) +
    xlab(NULL) +
    thm
p2 <- ggplot(diamonds) +
    geom_violin(aes(x = cut, y = price)) +
    xlab(NULL) +
    thm
p1 + p2
```

<img src="dists_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />
]
]

---

* Scalability in the number of cases for swarm or sina plots is more
  limited.

* The number of groups that can be handled for comparison by these plots
  is in the range of a few dozen.

.hide-code[

```r
library(lattice)
p1 <- ggplot(barley) +
    geom_boxplot(aes(x = site, y = yield, fill = year)) +
    xlab(NULL) +
    thm
p2 <- ggplot(barley) +
    geom_violin(aes(x = site, y = yield, fill = year)) +
    xlab(NULL) +
    thm
p1 + p2
```

<img src="dists_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />
]

---

Axes can be flipped to avoid overplotting of labels:

.hide-code[

```r
library(lattice)
p3 <- p1 + coord_flip() + guides(fill = "none")
p4 <- p2 + coord_flip()
p3 + p4
```

<img src="dists_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />
]

---

Faceting can also be used to arrange groups of boxplots or violin plots.

For life expectancy by continent over the years in the `gapminder` data:

.hide-code[

```r
library(dplyr)
ggplot(filter(gapminder,
              year %% 10 == 2,
              continent != "Oceania")) +
    geom_boxplot(aes(x = lifeExp, y = factor(year))) +
    facet_wrap(~ continent, ncol = 1) +
    theme_minimal() +
    theme(text = element_text(size = 12)) +
    theme(strip.text.x = element_text(hjust = 0)) +
    ylab(NULL)
```

<img src="dists_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" />
]

A related visualization motivated by a graph in the Economist is
available [here](https://cinc.rud.is/web/packages/ggeconodist/).

---
layout: true
## Ridgeline Plots
---

.pull-left[
[Ridgeline
plots](https://blog.revolutionanalytics.com/2017/07/joyplots.html),
also called _ridge plots_ or _joy plots_, are another way to show
density estimates for a number of groups that has become popular
recently.
{{content}}
]
--

An early example appears in [this NYT
article](https://www.nytimes.com/interactive/2017/06/12/upshot/the-politics-of-americas-religious-leaders.html).

{{content}}
--

The package `ggridges` defines `geom_density_ridges` for creating
these plots:

.pull-right.small-code[
.hide-code[

```r
library(ggridges)
ggplot(barley) +
    geom_density_ridges(aes(x = yield,
                            y = site,
                            group = site)) +
    ylab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Grouping by an interaction with a categorical variable, `year`,
produces separate density estimates for each level.
{{content}}
]
--

Mapping the `fill` aesthetic to `year` allows the separate densities
to be identified:

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density_ridges(
        aes(x = yield,
            y = site,
            group = interaction(year, site),
            fill = year)) +
    ylab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Alpha blending may sometimes help:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density_ridges(
        aes(x = yield,
            y = site,
            group = interaction(year, site),
            fill = year),
        alpha = 0.8) +
    ylab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adjusting the vertical scale may also help:
]

.pull-right.small-code[
.hide-code[

```r
ggplot(barley) +
    geom_density_ridges(
        aes(x = yield,
            y = site,
            group = interaction(year, site),
            fill = year),
        scale = 0.8) +
    ylab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Sometimes reordering the grouping variable, `year` in this case, can
help.
{{content}}
]
--

The factor levels of `year` can be reordered to match the order of
average yealds within each year by

```r
reorder(year, yield)
```
{{content}}
--

Using `-yield` produces the reverse order.

.pull-right.small-code[
.hide-code[

```r
library(dplyr)
ggplot(mutate(barley, year = reorder(year, -yield))) +
    geom_density_ridges(aes(x = yield, y = site,
                            group = interaction(year, site),
                            fill = year), scale = 0.8) +
    ylab(NULL) +
    thm
```

<img src="dists_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" />
]
]

---

With some tuning ridgeline plots can scale well to many distributions.
An example from [Claus Wilke's book](https://clauswilke.com/dataviz/):

The `ggplot2movies` package provides data from
[IMDB](https://imdb.com/) on a large number of movies, including their
lengths, in a tibble `movies`:

.small-code[

```r
library(ggplot2movies)
dim(movies)
## [1] 58788    24
head(movies)
## # A tibble: 6 × 24
##   title      year length budget rating votes    r1    r2    r3    r4    r5    r6
##   <chr>     <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 $          1971    121     NA    6.4   348   4.5   4.5   4.5   4.5  14.5  24.5
## 2 $1000 a …  1939     71     NA    6      20   0    14.5   4.5  24.5  14.5  14.5
## 3 $21 a Da…  1941      7     NA    8.2     5   0     0     0     0     0    24.5
## 4 $40,000    1996     70     NA    8.2     6  14.5   0     0     0     0     0  
## 5 $50,000 …  1975     71     NA    3.4    17  24.5   4.5   0    14.5  14.5   4.5
## 6 $pent      2000     91     NA    4.3    45   4.5   4.5   4.5  14.5  14.5  14.5
## # ℹ 12 more variables: r7 <dbl>, r8 <dbl>, r9 <dbl>, r10 <dbl>, mpaa <chr>,
## #   Action <int>, Animation <int>, Comedy <int>, Drama <int>,
## #   Documentary <int>, Romance <int>, Short <int>
```
]

---

.pull-left.small-code[
A ridgeline plot of the movie lengths for each year:

.hide-code[

```r
library(dplyr)
mv12 <- filter(movies, year > 1912)
ggplot(mv12, aes(x = length, y = year, group = year)) +
    geom_density_ridges(scale = 10, size = 0.25, rel_min_height = 0.03) +
    scale_x_continuous(limits = c(0, 200)) +
    scale_y_reverse(breaks = c(2000, 1980, 1960, 1940, 1920)) +
    theme_minimal()
## Warning in geom_density_ridges(scale = 10, size = 0.25, rel_min_height = 0.03):
## Ignoring unknown parameters: `size`
## Warning: Removed 250 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).
```

<img src="dists_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" />
]
]
--

.pull-right[
This shows that since the early 1960's feature film lengths have
stabilized to a distribution centered around 90 minutes:
]

---

Another nice example:
[DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_%28scaling_method%29)
scores for measuring political position of members of congress over
the years:

[Original code]( http://rpubs.com/ianrmcdonald/293304) by Ian
McDonald; another version is provided in [Claus Wilke's
book](https://clauswilke.com/dataviz/).

---
layout: false
## Reading

Chapter [_Visualizing distributions: Histograms and density
  plots_](https://clauswilke.com/dataviz/histograms-density-plots.html)
  in [_Fundamentals of Data
  Visualization_](https://clauswilke.com/dataviz/).

Section [_Histograms and density
plots_](https://socviz.co/groupfacettx.html#histograms) in [_Data
Visualization_](https://socviz.co/).

Chapter [_Visualizing data distributions_](https://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/distributions.html)
in [_Introduction to Data Science: Data Analysis and Prediction
Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook-part-1/).

---
layout: false

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](../tutorials/dists.Rmd).

You can run the tutorial with

```r
STAT4580::runTutorial("dists")
```

You can install the current version of the `STAT4580` package with

```r
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

---
layout: true
## Exercises
---

1) Consider the code

```r
library(ggplot2)
data(Galton, package = "HistData")
ggplot(Galton, aes(x = parent)) +
    geom_histogram(---, fill = "grey", color = "black")
```

Which of the following replacements for `---` produces a histogram
with bins that are one inch wide and start at whole integers?
    
*   a. `binwidth = 1`
*   b. `binwidth = 1, center = 66.5`
*   c. `binwidth = 2, center = 66`
*   d. `center = 66`

---

2) Consider the code

```r
library(ggplot2)
ggplot(faithful, aes(x = eruptions)) + geom_density(---)
```
    
Which of the following replacements for `---` produces a density
plot with the area under the density in blue and no black border?

*   a. `color = "lightblue"`
*   b. `fill = "black", color = "lightblue"`
*   c. `fill = "lightblue", color = NA`
*   d. `fill = NA, color = "black"`

---

3) Consider the code

```r
library(ggplot2)
library(gapminder)
p <- ggplot(gapminder, aes(y = continent, x = lifeExp))
```

Which of the following produces violin plots without trimming at
the smallest and largest observations, and including a line at the
median?

*   a. `p + geom_violin(trim = FALSE)`
*   b. `p + geom_violin(trim = TRUE, show_median = TRUE)`
*   c. `p + geom_violin(trim = FALSE, draw_quantiles = 0.5)`
*   d. `p + geom_violin(trim = TRUE, show_quantiles = 0.5)`

---

4) Density ridges can also show quantiles, but the details of how to
request this are different. Consider this code:

```r
library(ggplot2)
library(ggridges)
library(gapminder)
ggplot(gapminder, aes(x = lifeExp, y = year, group = year)) +
    geom_density_ridges(---)
```
    
Which of the following replacements for `---` produces density
ridges with lines showing the locations of the medians?
    
*   a. `quantiles = 0.5`
*   b. `quantile_lines = TRUE, quantiles = 0.5`
*   c. `quantile_lines = TRUE`
*   d. `draw_quantiles = 0.5`

<!--
boxplots
violin plots

po <- function(prob, quant) {
    q25 <- quant(0.25)
    q75 <- quant(0.75)
    md <- quant(0.5)
    IQR <- q75 - q25
    uop <- prob(q75 + 1.5 * IQR, lower.tail = FALSE)
    lop <- prob(q25 - 1.5 * IQR)
    c(lop, uop)
}
-->

//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505