---
title: "Box Plots and Relations"
output:
html_document:
toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
## Boxplots
Boxplots, or box-and-whisker plots, provide a skeletal representation
of a distribution.
They are very well suited for showing distributions for multiple
groups.
There are many variations of boxplots:
* Most start with a box from the first to the third quartiles and
divided by the median.
* The simplest form then adds a whisker from the lower quartile to the
minimum and from the upper quartile to the maximum.
* More common is to draw the upper whisker to the largest point below
the upper quartile $+ 1.5 * IQR$, and the lower whisker analogously.
* _Outliers_ falling outside the range of the whiskers are then drawn
directly:
```{r}
library(gapminder)
library(ggplot2)
ggplot(gapminder) + geom_boxplot(aes(x = continent, y = gdpPercap))
```
* There are variants that distinguish between _mild outliers_ and
_extreme outliers_.
* A common variant is to show an approximate 95% confidence interval for
the population median as a _notch_:
```{r}
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap), notch = TRUE)
```
* Another variant is to use a width proportional to the square root of
the sample size to reflect the strength of evidence in the data:
```{r}
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE, varwidth = TRUE)
```
With moderate sample sizes it can be useful to super-impose the
original data, perhaps with jittering and alpha blending. The outliers
in the box plot can be turned off with `outlier.color = NA` so they
are not shown twice:
```{r}
p <- ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE, varwidth = TRUE,
outlier.color = NA)
p + geom_point(aes(x = continent, y = gdpPercap),
position = position_jitter(width = 0.1), alpha = 0.1)
```
## Violin Plots
A variant of the boxplot is the _violin plot_:
Hintze, J. L., Nelson, R. D. (1998), "Violin Plots: A Box Plot-Density
Trace Synergism," _The American Statistician_ 52, 181-184.
The violin plot uses density estimates to show the distributions:
```{r}
ggplot(gapminder) + geom_violin(aes(x = continent, y = gdpPercap))
```
By default the "violins" are scaled to have the same area. They can
also be scaled to have the same maximum height or to have areas
proportional to sample sizes. This is done by adding `scale = "width"`
or `scale = "count"` to the `geom_violin` call.
A comparison of boxplots and violin plots:
```{r}
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap)) +
geom_violin(aes(x = continent, y = gdpPercap),
fill = NA, scale = "width", linetype = 2)
```
A Combination of boxplots and violin plots:
```{r}
ggplot(gapminder) +
geom_violin(aes(x = continent, y = gdpPercap), scale = "width") +
geom_boxplot(aes(x = continent, y = gdpPercap), width = .1)
```
There are other variations, e.g. _vase plots_.
## Beeswarm Plots
Beeswarm plots, also called violin scatterplots, try to show both the
full data and the density of the data distribution.
The [`ggbeeswarm`](https://github.com/eclarke/ggbeeswarm) package
provides one implementation with two variants:
* `geom_quasirandom`
* `geom_beeswarm`
The `quasi_random` version seems to work a bit better on many examples, e.g.
```{r}
library(ggbeeswarm)
ggplot(gapminder) +
geom_quasirandom(aes(x = continent, y = gdpPercap), size = 0.2)
```
Combined with a width-scaled violin plot:
```{r}
ggplot(gapminder) +
geom_violin(aes(x = continent, y = gdpPercap), scale = "width") +
geom_quasirandom(aes(x = continent, y = gdpPercap), color = "blue")
```
## Effectiveness and Scalability
* Boxplots are very simple and easy to compare.
* Boxplots strongly emphasize the middle half of the data.
* Boxplots may not be easy for a lay viewer to understand.
* Box plots scale fairly well visually and computationally in the
number of observations; over-plotting/storage of outliers becomes an
issue for larger data sets
* Violin plots scale well both visually and computationally in the
number of observations.
```{r, fig.width = 10}
library(gridExtra)
p1 <- ggplot(diamonds) + geom_boxplot(aes(x = cut, y = price))
p2 <- ggplot(diamonds) + geom_violin(aes(x = cut, y = price))
grid.arrange(p1, p2, nrow = 1)
```
* Scalability in the number of cases for beeswarm plots is more limited.
* The number of groups that can be handled for comparison by these plots
is in the range of a few dozen.
```{r, fig.width = 10}
library(lattice)
p1 <- ggplot(barley) + geom_boxplot(aes(x = site, y = yield, fill = year))
p2 <- ggplot(barley) + geom_violin(aes(x = site, y = yield, fill = year))
grid.arrange(p1, p2, nrow = 1)
```
Axes can be flipped to avoid overplotting of labels:
```{r, fig.width = 10}
library(lattice)
p3 <- p1 + coord_flip()
p4 <- p2 + coord_flip()
grid.arrange(p3, p4, nrow = 1)
```
## Ridgeline Plots
[Ridgeline
plots](http://blog.revolutionanalytics.com/2017/07/joyplots.html),
also called _ridge plots_ or _joy plots_, are another way to show
density estimates for a number of groups that has become popular
recently.
The package `ggridges` defines `geom_density_ridges` for creating
these plots:
```{r, message = FALSE}
library(ggridges)
ggplot(barley) + geom_density_ridges(aes(x = yield, y = site, group = site))
```
Grouping by an interaction with a categorical variable, `year`,
produces separate density estimates for each level.
Mapping the `fill` aesthetic to `year` allows the separate densities
to be identified:
```{r}
ggplot(barley) +
geom_density_ridges(aes(x = yield, y = site,
group = interaction(year, site),
fill = year))
```
Alpha blending may sometimes help:
```{r}
ggplot(barley) +
geom_density_ridges(aes(x = yield, y = site,
group = interaction(year, site),
fill = year), alpha = 0.8)
```
Adjusting the vertical scale may also help:
```{r}
ggplot(barley) +
geom_density_ridges(aes(x = yield, y = site,
group = interaction(year, site),
fill = year), scale = 0.8)
```
Sometimes reordering the grouping variable, `year` in this case, can
help. The factor levels of `year` can be reordered to match the
order of average yealds within each year by
```{r, eval = FALSE}
reorder(year, yield)
```
Using `-yield` produces the reverse order.
```{r, message = FALSE}
library(dplyr)
ggplot(mutate(barley, year = reorder(year, -yield))) +
geom_density_ridges(aes(x = yield, y = site,
group = interaction(year, site),
fill = year), scale = 0.8)
```
With some tuning ridgeline plots can scale well to many distributions.
An example from [Claus Wilke's book](https://serialmentor.com/dataviz/):
The `ggplot2movies` package provides data from
[IMDB](http://imdb.com/) on a large number of movies, including their
lengths, in a tibble `movies`:
```{r}
library(ggplot2movies)
dim(movies)
head(movies)
```
A ridgeline plot of the movie lengths for each year:
```{r, message = FALSE}
library(dplyr)
mv12 <- filter(movies, year > 1912)
ggplot(mv12, aes(x = length, y = year, group = year)) +
geom_density_ridges(scale = 10, size = 0.25, rel_min_height = 0.03) +
scale_x_continuous(limits = c(0, 200)) +
scale_y_reverse(breaks = c(2000, 1980, 1960, 1940, 1920)) +
theme_minimal()
```
This shown that since the early 1960's feature film lengths have
stabilized to a distribution centered around 90 minutes:
Another nice example:
[DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_(scaling_method))
scores for measuring political position of members of congress over
the years:
![](img/polarization.jpeg)
[Original code]( http://rpubs.com/ianrmcdonald/293304) by Ian
McDonald; another version is provided in [Claus Wilke's
book](https://serialmentor.com/dataviz/).