---
title: "Visualizing Distributions"
output:
  html_document:
    toc: yes
    code_folding: show
    code_download: true
---
```{r setup, include = FALSE, message = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
fig.height = 5, fig.width = 6, fig.align = "center")
set.seed(12345)
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
source(here::here("datasets.R"))
```
## Introduction
Once there are more than a handful of numeric data values it is often
useful to step back and look at the _distribution_ of the data values:
* Where is the bulk of the data located?
* Is there a single area of concentration or are there several?
* Is the data distribution symmetric or is it skewed, i.e. does it trail
off more slowly in one direction than the other?
* Are there extreme, or outlying, values?
* Are there any suspicious or impossible values?
* Are there gaps in the data?
* Is there rounding, e.g. to integer values, or _heaping_, i.e. a
few particular values occur very frequently?
Plots for visualizing distributions include
* Strip plots.
* Histograms.
* Density plots.
* Box plots.
* Violin plots.
* Swarm plots.
* Density ridges.
## Strip Plots
### Strip Plot Basics
A variant of the dot plot is known as a _strip plot_.
A strip plot for the city temperature data is
```{r, fig.height = 2, warning = FALSE, class.source = "fold-hide"}
thm <- theme_minimal() +
theme(text = element_text(size = 16))
ggplot(citytemps) +
geom_point(aes(x = temp, y = "All")) +
thm +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank())
```
The strip plot can reveal gaps and outliers.
After looking at the plot we might want to examine the high and
low values:
```{r}
filter(citytemps, temp > 85)
filter(citytemps, temp < 10)
```
For the eruption durations in the `faithful` data a strip plot shows
the two modes around 2 and 4 minutes:
```{r, fig.height = 2, warning = FALSE, class.source = "fold-hide"}
ggplot(faithful) +
geom_point(aes(x = eruptions, y = "All")) +
thm +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank())
```
### Multiple Groups
Strip plots are most useful for showing subsets corresponding to a
categorical variable.
A strip plot for the yields for different varieties in the barley data
is
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_point(aes(x = yield, y = variety)) +
theme_minimal() +
thm
```
### Scalability
Scalability in this form is limited due to over-plotting.
A simple strip plot of `price` within the different `cut` levels in
the `diamonds` data is not very helpful:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_point(aes(x = price, y = cut)) +
thm +
theme(axis.title.y = element_blank())
```
Several approaches are available to reduce the impact of over-plotting:
* reduce the point size;
* random displacement of points, called _jittering_;
* making the points translucent, or _alpha blending_.
Combining all three for examining `price` within `cut` for the
`diamonds` data produces
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_point(aes(x = price, y = cut),
size = 0.2,
position = position_jitter(width = 0),
alpha = 0.2) +
thm + theme(axis.title.y = element_blank())
```
Skewness of the price distributions can be seen in this plot, though
other approaches will show this more clearly.
A peculiar feature revealed by this plot is the gap below
2000.
Examining the subset with `price < 2000` shows the gap is
roughly symmetric around 1500:
```{r, class.source = "fold-hide"}
ggplot(filter(diamonds, price < 2000)) +
geom_point(aes(x = price, y = cut),
size = 0.2,
position = position_jitter(width = 0),
alpha = 0.2) +
thm +
theme(axis.title.y = element_blank())
```
A plot along these lines was used on the New York Times [front page for
February 21, 2021](`r IMG("NYT-2021-02-21.jpeg")`).
```{r, echo = FALSE, out.width = "55%"}
knitr::include_graphics(IMG("NYT-2021-02-21.jpeg"))
```
### Some Notes
* With a good combination of point size choice, jittering, and alpha
blending the strip plot for groups of data can scale to several
hundred thousand observations and ten to twenty groups.
* For very large data sets it can be useful to look at a strip plot
of a sample of the data.
* Strip plots can reveal gaps, outliers, and data outside of the
expected range.
* Skewness and multi-modality can be seen, but other visualizations
show these more clearly.
* Storage needed for vector graphics images grows linearly with the
number of observations.
* Base graphics provides `stripchart` and lattice provides `stripplot`.
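As a minimal sketch of the sampling idea, using the base and lattice
functions named above: the sample size of 5000 and the point settings
are illustrative assumptions, not recommendations.

```{r, eval = FALSE}
library(ggplot2)   # only for the diamonds data
library(lattice)

set.seed(1)
# Illustrative sample size; adjust to the data and the display.
d_samp <- diamonds[sample(nrow(diamonds), 5000), ]

# Base graphics: jittered strip chart of price within cut.
stripchart(price ~ cut, data = d_samp,
           method = "jitter", jitter = 0.3,
           pch = 16, cex = 0.3, col = rgb(0, 0, 0, 0.2))

# lattice: the same idea with stripplot().
stripplot(cut ~ price, data = d_samp,
          jitter.data = TRUE, pch = 16, cex = 0.3, alpha = 0.2)
```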
## Histograms
### Histogram Basics
Histograms are constructed by binning the data and counting the number
of observations in each bin.
The objective is usually to visualize the shape of the distribution.
The number of bins needs to be
* large enough to reveal interesting features;
* small enough not to be too noisy.
A very small bin width can be used to look for rounding or heaping.
Common choices for the vertical scale are:
* bin counts, or frequencies;
* counts per unit, or densities.
The count scale is more interpretable for lay viewers.
The density scale is more suited for comparison to mathematical
density models.
Constructing histograms with unequal bin widths is possible but
rarely a good idea.
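The relation between the two vertical scales can be checked
numerically. The sketch below uses base `hist()` with `plot = FALSE`
on the `faithful` eruption durations; the break points are chosen only
for illustration.

```{r, eval = FALSE}
h <- hist(faithful$eruptions, breaks = seq(1, 6, by = 0.5), plot = FALSE)
n <- sum(h$counts)
width <- diff(h$breaks)

# On the density scale each bar height is count / (n * bin width) ...
all.equal(h$density, h$counts / (n * width))

# ... so the bar areas add up to one.
sum(h$density * width)
```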
### Histograms in R
There are many ways to plot histograms in R:
* the `hist()` function in the base `graphics` package;
* `truehist()` in package `MASS`;
* `histogram()` in package `lattice`;
* `geom_histogram()` in package `ggplot2`.
A histogram of eruption durations for another data set on Old Faithful
eruptions, this one from package `MASS`:
```{r, message = TRUE, class.source = "fold-hide"}
data(geyser, package = "MASS")
ggplot(geyser) +
geom_histogram(aes(x = duration)) +
thm
```
The default settings using `geom_histogram` are less than ideal.
Using a binwidth of 0.5 and customized `fill` and `color` settings
produces a better result:
```{r, class.source = "fold-hide"}
ggplot(geyser) +
geom_histogram(aes(x = duration),
binwidth = 0.5,
fill = "grey",
color = "black") +
thm
```
Reducing the bin width shows an interesting feature:
```{r, class.source = "fold-hide"}
ggplot(geyser) +
geom_histogram(aes(x = duration),
binwidth = 0.05,
fill = "grey",
color = "black") +
thm
```
* Eruptions were sometimes classified as _short_ or _long_; these were
coded as 2 and 4 minutes.
* For many purposes this kind of heaping or rounding does not matter.
* It would matter if we wanted to estimate means and standard
deviations of the durations of the long and short eruptions.
* More data and information about geysers is available at
http://geysertimes.org/.
* For exploration there is no one "correct" bin width or number of
bins.
* It would be very useful to be able to change this parameter
interactively.
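The heaping at 2 and 4 minutes can also be checked without a plot by
tabulating the most frequent recorded values. This is a small sketch
using base R only; showing the top five values is an arbitrary choice.

```{r, eval = FALSE}
data(geyser, package = "MASS")

# Most frequently recorded durations.
tab <- sort(table(geyser$duration), decreasing = TRUE)
head(tab, 5)

# Share of observations recorded as exactly 2 or 4 minutes.
mean(geyser$duration %in% c(2, 4))
```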
### Superimposing a Density
A histogram can be used to compare the data distribution to a
theoretical model, such as a normal distribution.
This requires using a density scale for the vertical axis.
The `Galton` data frame in the `HistData` package is one of several
data sets used by Galton to study the heights of parents and their
children.
Adding a normal density curve to a `ggplot` histogram involves:
* computing the parameters of the density;
* creating the histogram with a density scale using the computed
variable `after_stat(density)`;
* adding the function curve using `geom_function`, `stat_function`, or
`geom_line`.
Create the histogram with a density scale using the _computed variable_
`after_stat(density)`:
```{r galton-hist, eval = FALSE}
data(Galton, package = "HistData")
ggplot(Galton) +
geom_histogram(aes(x = parent,
y = after_stat(density)),
binwidth = 1,
fill = "grey",
color = "black") +
thm
```
```{r galton-hist, echo = FALSE}
```
Then compute the mean and standard deviation and add the normal
density curve:
```{r galton-hist-dens, eval = FALSE}
data(Galton, package = "HistData")
p_mean <- mean(Galton$parent)
p_sd <- sd(Galton$parent)
p_dens <- function(x) dnorm(x, p_mean, p_sd)
ggplot(Galton) +
geom_histogram(aes(x = parent,
y = after_stat(density)),
binwidth = 1,
fill = "grey",
color = "black") +
geom_function(fun = p_dens, color = "red") +
thm
```
```{r galton-hist-dens, echo = FALSE}
```
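The `geom_line` route mentioned above can be sketched as follows. The
density is evaluated on an explicit grid; the grid size of 200 points
and the name `dens_grid` are illustrative choices.

```{r, eval = FALSE}
library(ggplot2)
data(Galton, package = "HistData")
p_mean <- mean(Galton$parent)
p_sd <- sd(Galton$parent)

# Evaluate the fitted normal density on a grid spanning the data.
dens_grid <- data.frame(x = seq(min(Galton$parent),
                                max(Galton$parent),
                                length.out = 200))
dens_grid$dens <- dnorm(dens_grid$x, p_mean, p_sd)

ggplot(Galton) +
    geom_histogram(aes(x = parent, y = after_stat(density)),
                   binwidth = 1, fill = "grey", color = "black") +
    geom_line(aes(x = x, y = dens), data = dens_grid, color = "red") +
    theme_minimal()
```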
### Multiple Groups
Faceting works well for showing comparative histograms for multiple
groups.
Histograms of `price` within `cut` for the `diamonds` data:
```{r, fig.width = 8, class.source = "fold-hide"}
ggplot(diamonds) +
geom_histogram(aes(x = price),
binwidth = 1000,
color = "black", fill = "grey") +
facet_wrap(~ cut) +
thm
```
These histograms show counts on the vertical axis, and their sizes
reflect the total counts for the groups.
Together the plots represent a view of the joint distribution of `cut`
and `price`.
Switching to a density scale by using `after_stat(density)` for the `y`
aesthetic allows the conditional distributions of `price` within
groups to be compared:
```{r, fig.width = 8, class.source = "fold-hide"}
p <- ggplot(diamonds) +
geom_histogram(aes(x = price,
y = after_stat(density)),
binwidth = 1000,
color = "black",
fill = "grey") +
thm
p + facet_wrap(~ cut)
```
By mapping the `fill` aesthetic to `cut` it is possible to produce a
stacked histogram or a superimposed histogram
* `position = "stack"`, the default, for stacked;
* `position = "identity"` for superimposed.
But neither works very well visually.
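A sketch of the two variants for the `diamonds` data; the `alpha`
value used for the superimposed version is an illustrative choice.

```{r, eval = FALSE}
library(ggplot2)
p_fill <- ggplot(diamonds, aes(x = price, fill = cut))

# Stacked: bars for the cut levels are piled on top of each other.
p_stack <- p_fill +
    geom_histogram(binwidth = 1000, position = "stack")

# Superimposed: bars are drawn over each other, so translucency is needed.
p_ident <- p_fill +
    geom_histogram(binwidth = 1000, position = "identity", alpha = 0.4)

p_stack
p_ident
```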
For comparing locations of features it can help to facet with a single
column.
But this may create aspect ratios that are not ideal.
```{r hist-facet-one-col, echo = FALSE, fig.height = 7}
p + facet_wrap(~ cut, ncol = 1) +
coord_fixed(1.5 * 1e7)
```
### Scalability
Histograms scale very well.
* The visual performance does not deteriorate with increasing numbers
of observations.
* The computational effort needed is linear in the number of observations.
* The amount of storage needed for an image object is linear in the
number of bins.
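The storage point can be illustrated with base `hist()`: the binned
summary has one count per bin no matter how many observations went
into it. The simulated normal data and the break points are
assumptions made only for this sketch.

```{r, eval = FALSE}
set.seed(1)
breaks <- seq(-10, 10, by = 0.5)  # 40 bins, far wider than the data

h_small <- hist(rnorm(1e3), breaks = breaks, plot = FALSE)
h_large <- hist(rnorm(1e6), breaks = breaks, plot = FALSE)

# Both summaries hold one count per bin, regardless of sample size.
length(h_small$counts)
length(h_large$counts)
```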
## Density Plots
### Density Plot Basics
Density plots can be thought of as plots of smoothed histograms.
The smoothness is controlled by a _bandwidth_ parameter that is
analogous to the histogram binwidth.
Most density plots use a [_kernel density
estimate_](https://en.wikipedia.org/wiki/Kernel_density_estimation),
but there are other possible strategies; qualitatively the particular
strategy rarely matters.
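For reference, a kernel density estimate with kernel $K$ and bandwidth
$h$ is
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right).$$
The sketch below evaluates this formula directly with a Gaussian
kernel and compares it to base R's `density()`, which uses a binned
approximation, so the two agree only approximately. The evaluation
points and the helper name `kde_at` are arbitrary.

```{r, eval = FALSE}
x <- faithful$eruptions
h <- bw.nrd0(x)  # the default bandwidth rule used by density()

# Direct evaluation: average of normal densities centered at the data.
kde_at <- function(t) mean(dnorm(t, mean = x, sd = h))
pts <- seq(2, 5, by = 0.5)
direct <- sapply(pts, kde_at)

# density() result, interpolated to the same points.
d <- density(x, bw = h)
binned <- approx(d$x, d$y, xout = pts)$y

cbind(pts, direct, binned)
```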
A density plot of the `geyser` `duration` variable with default
bandwidth:
```{r, class.source = "fold-hide"}
ggplot(geyser) +
geom_density(aes(x = duration)) +
thm
```
Using a smaller bandwidth shows the heaping at 2 and 4 minutes:
```{r, class.source = "fold-hide"}
ggplot(geyser) +
geom_density(aes(x = duration), bw = 0.05) +
thm
```
For a moderate number of observations a useful addition is a jittered
_rug plot_:
```{r, class.source = "fold-hide"}
ggplot(geyser) +
geom_density(aes(x = duration)) +
geom_rug(aes(x = duration, y = 0),
position =
position_jitter(height = 0)) +
thm
```
### Scalability
Visual scalability is very good.
For the `diamonds` data `price` variable:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_density(aes(x = price)) +
thm
```
Density estimates are generally computed at a grid of points and
interpolated.
Defaults in R vary from 50 to 512 points.
Computational effort for a density estimate at a point is proportional
to the number of observations.
Storage needed for an image is proportional to the number of points
where the density is estimated.
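With base R's `density()` the grid size is set by the `n` argument. A
brief sketch; the coarse value of 64 is only for illustration.

```{r, eval = FALSE}
library(ggplot2)  # only for the diamonds data

d_default <- density(diamonds$price)
length(d_default$x)   # grid points with the default n = 512

d_coarse <- density(diamonds$price, n = 64)
length(d_coarse$x)    # the result is stored at 64 points
```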
### Multiple Groups
Density estimates for several groups can be shown in a single plot by
mapping a group index to an aesthetic, such as `color`:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density(aes(x = yield,
color = site)) +
thm
```
Using `fill` and `alpha` can also be useful:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density(aes(x = yield,
fill = site),
alpha = 0.2)
```
Multiple densities in a single plot work best with a smaller number of
categories, say 2 or 3:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density(aes(x = yield,
fill = year),
alpha = 0.4) +
thm
```
Using small multiples, or faceting, may be a better option:
```{r, class.source = "fold-hide"}
ggplot(barley) + geom_density(aes(x = yield)) + facet_wrap(~ site) + thm
```
These ideas can be combined:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density(aes(x = yield, color = year)) +
facet_wrap(~ site) +
thm
```
These plots again show lower yields for 1932 than for 1931 for all
sites except Morris.
Density plots default to using the density scale.
For the diamonds data a density plot of `price` faceted on `cut` shows
the conditional distributions of `price` at the different `cut`
levels:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_density(aes(x = price)) +
facet_wrap(~ cut) + thm
```
Mapping the `y` aesthetic to `after_stat(count)` shows the joint distribution
of `price` and `cut`:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_density(aes(x = price,
y = after_stat(count))) +
facet_wrap(~ cut) + thm
```
A stacked density plot is sometimes useful but often hard to read:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_density(aes(x = price,
y = after_stat(count),
fill = cut),
position = "stack") +
thm
```
An intermediate option: A faceted plot on the count scale with a muted
plot for the full data to allow proportions of the whole to be
assessed:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_density(aes(x = price, y = after_stat(count)),
fill = "lightgrey", color = NA,
data = mutate(diamonds, cut = NULL)) +
geom_density(aes(x = price,
y = after_stat(count),
fill = cut),
position = "stack", color = NA) +
facet_wrap(~ cut) +
scale_fill_viridis_d(guide = "none") +
thm
```
A filled density plot provides a view of the conditional distribution
of `cut` at the different price levels:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_density(aes(x = price, y = after_stat(count), fill = cut),
position = "fill") +
ylab(NULL) +
thm
```
This is called a _CD plot_, or a _conditional density plot_.
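Base R's `graphics` package also provides `cdplot()` for this kind of
display. A minimal sketch using its default settings, which need not
match the `ggplot2` version above:

```{r, eval = FALSE}
library(ggplot2)  # only for the diamonds data

# Conditional density of cut given price; the fitted functions are
# returned invisibly.
cd <- cdplot(cut ~ price, data = diamonds)
```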
### Some Notes
Computations are generally done with the base R function `density`.
`plot` has a method for the results returned by this function, so a
density plot can be created with
```{r, eval = FALSE}
plot(density(geyser$duration))
```
The `lattice` package provides the function `densityplot`.
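A sketch of a grouped, faceted `densityplot` call for the `barley`
data; the choice of conditioning on `site` and grouping by `year` is
only one possible arrangement.

```{r, eval = FALSE}
library(lattice)

# One panel per site, one curve per year; suppress the point strip.
dp <- densityplot(~ yield | site, data = barley, groups = year,
                  plot.points = FALSE, auto.key = TRUE)
dp
```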
### Interactive Bandwidth Choice
Being able to choose the bandwidth of a density plot, or the binwidth
of a histogram, interactively is useful for exploration.
One way to do this in R (which unfortunately does not work on the
RStudio server):
```{r, eval = FALSE}
data(geyser, package = "MASS")
source("http://homepage.divms.uiowa.edu/~luke/classes/STAT7400/examples/tkdens.R")
tkdens(geyser$duration, tkrplot = TRUE)
```
Another option:
```{r, eval = FALSE}
data(geyser, package = "MASS")
source("http://homepage.divms.uiowa.edu/~luke/classes/STAT7400/examples/shinydens.R")
shinyDens(geyser$duration)
```
## Boxplots
_Boxplots_, or _box-and-whisker_ plots, provide a skeletal
representation of a distribution.
They are very well suited for showing distributions for multiple
groups.
There are many variations of boxplots:
* Most start with a box from the first to the third quartiles and
divided by the median.
* The simplest form then adds a whisker from the lower quartile to the
minimum and from the upper quartile to the maximum.
* More common is to draw the upper whisker to the largest point below
the upper quartile $+\, 1.5 \times \mathrm{IQR}$, and the lower whisker analogously.
* _Outliers_ falling outside the range of the whiskers are then drawn
directly:
```{r, class.source = "fold-hide"}
library(gapminder)
library(ggplot2)
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap)) +
xlab(NULL) +
thm
```
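The whisker rule described above can be worked through by hand for one
group. This sketch uses `quantile()` with its default method, which is
an assumption about how the quartiles are defined; other definitions,
such as the hinges used by `boxplot.stats()`, can give slightly
different values.

```{r, eval = FALSE}
library(gapminder)
x <- gapminder$gdpPercap[gapminder$continent == "Europe"]

q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Whiskers end at the most extreme data points within 1.5 * IQR of the box.
upper <- max(x[x <= q[2] + 1.5 * iqr])
lower <- min(x[x >= q[1] - 1.5 * iqr])

# Points beyond the whiskers are drawn individually as outliers.
outliers <- x[x > upper | x < lower]
c(lower = lower, upper = upper, n_outliers = length(outliers))
```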
There are variants that distinguish between _mild outliers_ and
_extreme outliers_.
A common variant is to show an approximate 95% confidence interval for
the population median as a _notch_:
```{r, class.source = "fold-hide"}
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE) +
xlab(NULL) +
thm
```
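The `geom_boxplot` documentation describes the notches as extending
$1.58 \times \mathrm{IQR} / \sqrt{n}$ on either side of the median. A
sketch of that calculation for one continent, using base R's `IQR()`
with its default quantile method:

```{r, eval = FALSE}
library(gapminder)
x <- gapminder$gdpPercap[gapminder$continent == "Africa"]

med <- median(x)
half_width <- 1.58 * IQR(x) / sqrt(length(x))

# Approximate notch limits around the median.
c(lower = med - half_width, median = med, upper = med + half_width)
```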
Another variant is to use a width proportional to the square root of
the sample size to reflect the strength of evidence in the data:
```{r, class.source = "fold-hide"}
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE, varwidth = TRUE) +
xlab(NULL) +
thm
```
With moderate sample sizes it can be useful to super-impose the
original data, perhaps with jittering and alpha blending.
The outliers in the box plot can be turned off with `outlier.color =
NA` so they are not shown twice:
```{r, class.source = "fold-hide"}
p <- ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE, varwidth = TRUE,
outlier.color = NA) +
xlab(NULL) +
thm
p + geom_point(aes(x = continent, y = gdpPercap),
position =
position_jitter(width = 0.1),
alpha = 0.1)
```
## Violin Plots
A variant of the boxplot is the _violin plot_:
> Hintze, J. L., Nelson, R. D. (1998), "Violin Plots: A Box
> Plot-Density Trace Synergism," _The American Statistician_ 52,
> 181-184.
The violin plot uses density estimates to show the distributions:
```{r, class.source = "fold-hide"}
ggplot(gapminder) +
geom_violin(aes(x = continent, y = gdpPercap)) +
xlab(NULL) +
thm
```
By default the "violins" are scaled to have the same area.
They can also be scaled to have the same maximum width or to have
areas proportional to sample sizes.
This is done by adding
* `scale = "width"` or
* `scale = "count"`
to the `geom_violin` call.
A comparison of boxplots and violin plots:
```{r, class.source = "fold-hide"}
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap)) +
geom_violin(aes(x = continent, y = gdpPercap),
fill = NA, scale = "width",
linetype = 2) +
xlab(NULL) +
thm
```
A combination of boxplots and violin plots:
```{r, class.source = "fold-hide"}
ggplot(gapminder) +
geom_violin(aes(x = continent, y = gdpPercap),
scale = "width") +
geom_boxplot(aes(x = continent, y = gdpPercap),
width = .1) +
xlab(NULL) +
thm
```
There are other variations, e.g. _vase plots_.
Boxplots do not reflect the shape of a distribution.
For the `eruptions` in the `faithful` data set:
```{r, class.source = "fold-hide"}
ggplot(faithful) +
geom_boxplot(aes(y = eruptions, x = "Box")) +
geom_violin(aes(y = eruptions, x = "Violin")) +
xlab(NULL) +
thm
```
## Swarm Plots
Swarm plots show the full data in a form that also shows the density.
There are a number of variations and names, including _beeswarm
plots_, _violin scatterplots_, _violin strip charts_, and _sina plots_.
[Sina
plots](https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1366914)
are available as `geom_sina` in the `ggforce` package:
```{r, class.source = "fold-hide"}
library(ggforce)
ggplot(gapminder,
aes(x = continent, y = gdpPercap)) +
geom_sina(size = 0.2) +
xlab(NULL) +
thm
```
Combined with a width-scaled violin plot:
```{r, class.source = "fold-hide"}
ggplot(gapminder,
aes(x = continent, y = gdpPercap)) +
geom_violin(scale = "width") +
geom_sina(color = "blue",
size = 0.4,
scale = FALSE) +
xlab(NULL) +
thm
```
## Effectiveness and Scalability
* Boxplots are very simple and easy to compare.
* Boxplots strongly emphasize the middle half of the data.
* Boxplots may not be easy for a lay viewer to understand.
* Boxplots scale fairly well visually and computationally in the
number of observations; over-plotting and storage of outliers become an
issue for larger data sets.
* Violin plots scale well both visually and computationally in the
number of observations.
```{r, fig.width = 11, fig.height = 4, class.source = "fold-hide"}
library(patchwork)
p1 <- ggplot(diamonds) +
geom_boxplot(aes(x = cut, y = price)) +
xlab(NULL) +
thm
p2 <- ggplot(diamonds) +
geom_violin(aes(x = cut, y = price)) +
xlab(NULL) +
thm
p1 + p2
```
* Scalability in the number of cases for swarm or sina plots is more
limited.
* The number of groups that can be handled for comparison by these plots
is in the range of a few dozen.
```{r, fig.width = 11, fig.height = 5, class.source = "fold-hide"}
library(lattice)
p1 <- ggplot(barley) +
geom_boxplot(aes(x = site, y = yield, fill = year)) +
xlab(NULL) +
thm
p2 <- ggplot(barley) +
geom_violin(aes(x = site, y = yield, fill = year)) +
xlab(NULL) +
thm
p1 + p2
```
Axes can be flipped to avoid overplotting of labels:
```{r, fig.width = 11, class.source = "fold-hide"}
library(lattice)
p3 <- p1 + coord_flip() + guides(fill = "none")
p4 <- p2 + coord_flip()
p3 + p4
```
Faceting can also be used to arrange groups of boxplots or violin plots.
For life expectancy by continent over the years in the `gapminder` data:
```{r, class.source = "fold-hide"}
library(dplyr)
ggplot(filter(gapminder,
year %% 10 == 2,
continent != "Oceania")) +
geom_boxplot(aes(x = lifeExp, y = factor(year))) +
facet_wrap(~ continent, ncol = 1) +
theme_minimal() +
theme(text = element_text(size = 12)) +
theme(strip.text.x = element_text(hjust = 0)) +
ylab(NULL)
```
A related visualization motivated by a graph in the Economist is
available [here](https://cinc.rud.is/web/packages/ggeconodist/).
## Ridgeline Plots
[Ridgeline
plots](http://blog.revolutionanalytics.com/2017/07/joyplots.html),
also called _ridge plots_ or _joy plots_, are another way to show
density estimates for a number of groups that has become popular
recently.
The package `ggridges` defines `geom_density_ridges` for creating
these plots:
```{r, message = FALSE, class.source = "fold-hide"}
library(ggridges)
ggplot(barley) +
geom_density_ridges(aes(x = yield,
y = site,
group = site)) +
ylab(NULL) +
thm
```
Grouping by an interaction with a categorical variable, `year`,
produces separate density estimates for each level.
Mapping the `fill` aesthetic to `year` allows the separate densities
to be identified:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density_ridges(
aes(x = yield,
y = site,
group = interaction(year, site),
fill = year)) +
ylab(NULL) +
thm
```
Alpha blending may sometimes help:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density_ridges(
aes(x = yield,
y = site,
group = interaction(year, site),
fill = year),
alpha = 0.8) +
ylab(NULL) +
thm
```
Adjusting the vertical scale may also help:
```{r, class.source = "fold-hide"}
ggplot(barley) +
geom_density_ridges(
aes(x = yield,
y = site,
group = interaction(year, site),
fill = year),
scale = 0.8) +
ylab(NULL) +
thm
```
Sometimes reordering the grouping variable, `year` in this case, can
help.
The factor levels of `year` can be reordered to match the order of
average yields within each year by
```{r, eval = FALSE}
reorder(year, yield)
```
Using `-yield` produces the reverse order.
```{r, message = FALSE, class.source = "fold-hide"}
library(dplyr)
ggplot(mutate(barley, year = reorder(year, -yield))) +
geom_density_ridges(aes(x = yield, y = site,
group = interaction(year, site),
fill = year), scale = 0.8) +
ylab(NULL) +
thm
```
With some tuning ridgeline plots can scale well to many distributions.
An example from [Claus Wilke's book](https://clauswilke.com/dataviz/):
The `ggplot2movies` package provides data from
[IMDB](http://imdb.com/) on a large number of movies, including their
lengths, in a tibble `movies`:
```{r}
library(ggplot2movies)
dim(movies)
head(movies)
```
A ridgeline plot of the movie lengths for each year:
```{r, message = FALSE, class.source = "fold-hide"}
library(dplyr)
mv12 <- filter(movies, year > 1912)
ggplot(mv12, aes(x = length, y = year, group = year)) +
geom_density_ridges(scale = 10, size = 0.25, rel_min_height = 0.03) +
scale_x_continuous(limits = c(0, 200)) +
scale_y_reverse(breaks = c(2000, 1980, 1960, 1940, 1920)) +
theme_minimal()
```
This shows that since the early 1960s feature film lengths have
stabilized to a distribution centered around 90 minutes.
Another nice example:
[DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_%28scaling_method%29)
scores for measuring political position of members of congress over
the years:
```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("polarization.jpeg"))
```
[Original code](http://rpubs.com/ianrmcdonald/293304) by Ian
McDonald; another version is provided in [Claus Wilke's
book](https://clauswilke.com/dataviz/).
## Reading
Chapter [_Visualizing distributions: Histograms and density
plots_](https://clauswilke.com/dataviz/histograms-density-plots.html)
in [_Fundamentals of Data
Visualization_](https://clauswilke.com/dataviz/).
Section [_Histograms and density
plots_](https://socviz.co/groupfacettx.html#histograms) in [_Data
Visualization_](https://socviz.co/).
Chapter [_Visualizing data distributions_](https://rafalab.github.io/dsbook/distributions.html)
in [_Introduction to Data Science: Data Analysis and Prediction
Algorithms with R_](https://rafalab.github.io/dsbook).
## Interactive Tutorial
An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](`r WLNK("tutorials/dists.Rmd")`).
You can run the tutorial with
```{r, eval = FALSE}
STAT4580::runTutorial("dists")
```
You can install the current version of the `STAT4580` package with
```{r, eval = FALSE}
remotes::install_gitlab("luke-tierney/STAT4580")
```
You may need to install the `remotes` package from CRAN first.
## Exercises
1. Consider the code
```{r, eval = FALSE}
library(ggplot2)
data(Galton, package = "HistData")
ggplot(Galton, aes(x = parent)) +
geom_histogram(---, fill = "grey", color = "black")
```
Which of the following replacements for `---` produces a histogram
with bins that are one inch wide and start at whole integers?
a. `binwidth = 1`
b. `binwidth = 1, center = 66.5`
c. `binwidth = 2, center = 66`
d. `center = 66`
2. Consider the code
```{r, eval = FALSE}
library(ggplot2)
ggplot(faithful, aes(x = eruptions)) + geom_density(---)
```
Which of the following replacements for `---` produces a density
plot with the area under the density in blue and no black border?
a. `color = "lightblue"`
b. `fill = "black", color = "lightblue"`
c. `fill = "lightblue", color = NA`
d. `fill = NA, color = "black"`
3. Consider the code
```{r, eval = FALSE}
library(ggplot2)
library(gapminder)
p <- ggplot(gapminder, aes(y = continent, x = lifeExp))
```
Which of the following produces violin plots without trimming at
the smallest and largest observations, and including a line at the
median?
a. `p + geom_violin(trim = FALSE)`
b. `p + geom_violin(trim = TRUE, show_median = TRUE)`
c. `p + geom_violin(trim = FALSE, draw_quantiles = 0.5)`
d. `p + geom_violin(trim = TRUE, show_quantiles = 0.5)`
4. Density ridges can also show quantiles, but the details of how to
request this are different. Consider this code:
```{r, eval = FALSE}
library(ggplot2)
library(ggridges)
library(gapminder)
ggplot(gapminder, aes(x = lifeExp, y = year, group = year)) +
geom_density_ridges(---)
```
Which of the following replacements for `---` produces density
ridges with lines showing the locations of the medians?
a. `quantiles = 0.5`
b. `quantile_lines = TRUE, quantiles = 0.5`
c. `quantile_lines = TRUE`
d. `draw_quantiles = 0.5`