Introduction

Once there are more than a handful of numeric data values it is often useful to step back and look at the distribution of the data values:

• Where is the bulk of the data located?

• Is there a single area of concentration or are there several?

• Is the data distribution symmetric or is it skewed, i.e. does it trail off more slowly in one direction than in the other?

• Are there extreme, or outlying, values?

• Are there any suspicious or impossible values?

• Are there gaps in the data?

• Is there rounding, e.g. to integer values, or heaping, i.e. a few particular values occur very frequently?

Plots for visualizing distributions include

• Strip plots.

• Histograms.

• Density plots.

• Box plots.

• Violin plots.

• Swarm plots.

• Density ridges.

Strip Plots

Strip Plot Basics

A variant of the dot plot is known as a strip plot.

A strip plot for the city temperature data is

thm <- theme_minimal() +
theme(text = element_text(size = 16))
ggplot(citytemps) +
geom_point(aes(x = temp, y = "All")) +
thm +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank())

The strip plot can reveal gaps and outliers.

After looking at the plot we might want to examine the high and low values:

filter(citytemps, temp > 85)
##            city temp
## 1         Accra   91
## 2      Asuncion   93
## 3 Dar es Salaam   86
## 4         Lagos   95
## 5         Perth   86
filter(citytemps, temp < 10)
##       city temp
## 2 Montréal    3
## 3   Ottawa   -4
## 4  Toronto    1
## 5 Winnipeg  -20

For the eruption durations in the faithful data a strip plot shows the two modes around 2 and 4 minutes:

ggplot(faithful) +
geom_point(aes(x = eruptions, y = "All")) +
thm +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank())

Multiple Groups

Strip plots are most useful for showing subsets corresponding to a categorical variable.

A strip plot for the yields for different varieties in the barley data is

ggplot(barley) +
geom_point(aes(x = yield, y = variety)) +
theme_minimal() +
thm

Scalability

Scalability in this form is limited due to over-plotting.

A simple strip plot of price within the different cut levels in the diamonds data is not very helpful:

ggplot(diamonds) +
geom_point(aes(x = price, y = cut)) +
thm +
theme(axis.title.y = element_blank())

Several approaches are available to reduce the impact of over-plotting:

• reducing the point size;

• randomly displacing the points, called jittering;

• making the points translucent, or alpha blending.

Combining all three for examining price within cut for the diamonds data produces

ggplot(diamonds) +
geom_point(aes(x = price, y = cut),
size = 0.2,
position = position_jitter(width = 0),
alpha = 0.2) +
thm + theme(axis.title.y = element_blank())

Skewness of the price distributions can be seen in this plot, though other approaches will show this more clearly.

A peculiar feature revealed by this plot is the gap below 2000.

Examining the subset with price < 2000 shows the gap is roughly symmetric around 1500:

ggplot(filter(diamonds, price < 2000)) +
geom_point(aes(x = price, y = cut),
size = 0.2,
position = position_jitter(width = 0),
alpha = 0.2) +
thm +
theme(axis.title.y = element_blank())
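
The location of the gap can also be checked numerically. The following is a minimal sketch, assuming the diamonds data from ggplot2; the variable names are illustrative only. It looks for the widest interval below 2000 that contains no observed price:

```r
library(ggplot2)  # provides the diamonds data

# Distinct prices below 2000, in increasing order
low_prices <- sort(unique(diamonds$price[diamonds$price < 2000]))

# Differences between consecutive distinct prices; the largest one is the gap
gaps <- diff(low_prices)
i <- which.max(gaps)
gap <- c(lower = low_prices[i], upper = low_prices[i + 1])
gap
```

The two values returned are the last price observed before the gap and the first price observed after it.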

A plot along these lines was used on the New York Times front page for February 21, 2021.

Some Notes

• With a good combination of point size choice, jittering, and alpha blending the strip plot for groups of data can scale to several hundred thousand observations and ten to twenty groups.

• For very large data sets it can be useful to look at a strip plot of a sample of the data.

• Strip plots can reveal gaps, outliers, and data outside of the expected range.

• Skewness and multi-modality can be seen, but other visualizations show these more clearly.

• Storage needed for vector graphics images grows linearly with the number of observations.

• Base graphics provides stripchart and lattice provides stripplot.
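
The suggestion of plotting a sample can be sketched as follows. This is only an illustration: the sample size of 5000 and the seed are arbitrary choices, and the names d_samp and p_samp are made up for this example.

```r
library(ggplot2)

set.seed(4580)   # arbitrary seed so the sample is reproducible
n_samp <- 5000   # illustrative sample size, not a recommendation

# Simple random sample of rows, without replacement
d_samp <- diamonds[sample(nrow(diamonds), n_samp), ]

p_samp <- ggplot(d_samp) +
  geom_point(aes(x = price, y = cut),
             size = 0.2,
             position = position_jitter(width = 0),
             alpha = 0.2) +
  theme_minimal() +
  theme(axis.title.y = element_blank())
p_samp
```

Because the sample is random, rare features such as isolated outliers may not appear; drawing a few different samples can help judge which features are stable.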

Histograms

Histogram Basics

Histograms are constructed by binning the data and counting the number of observations in each bin.

The objective is usually to visualize the shape of the distribution.

The bin width needs to be

• small enough to reveal interesting features;

• large enough not to be too noisy.

A very small bin width can be used to look for rounding or heaping.

Common choices for the vertical scale are:

• bin counts, or frequencies;

• counts per unit, or densities.

The count scale is more interpretable for lay viewers.

The density scale is more suited for comparison to mathematical density models.

Constructing histograms with unequal bin widths is possible but rarely a good idea.
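
The relation between the count scale and the density scale can be sketched with the base R function hist. This is a small illustration, assuming equal-width bins and using the faithful eruption durations; the break points are an arbitrary choice that covers the data.

```r
x <- faithful$eruptions
breaks <- seq(1.5, 5.5, by = 0.5)   # equal-width bins covering the data
h <- hist(x, breaks = breaks, plot = FALSE)

binwidth <- diff(breaks)[1]

# Density scale: proportion of observations per unit of x
dens <- h$counts / (length(x) * binwidth)

all.equal(dens, h$density)   # matches the densities computed by hist()
sum(h$density * binwidth)    # bar areas add up to one
```

With equal bin widths the two scales differ only by a constant factor, so the shape of the histogram is the same on either scale.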

Histograms in R

There are many ways to plot histograms in R:

• the hist() function in the base graphics package;

• truehist() in package MASS;

• histogram() in package lattice;

• geom_histogram() in package ggplot2.

A histogram of eruption durations for another data set on Old Faithful eruptions, this one from package MASS:

data(geyser, package = "MASS")
ggplot(geyser) +
geom_histogram(aes(x = duration)) +
thm
## stat_bin() using bins = 30. Pick better value with binwidth.

The default settings using geom_histogram are less than ideal.

Using a binwidth of 0.5 and customized fill and color settings produces a better result:

ggplot(geyser) +
geom_histogram(aes(x = duration),
binwidth = 0.5,
fill = "grey",
color = "black") +
thm

Reducing the bin width shows an interesting feature:

ggplot(geyser) +
geom_histogram(aes(x = duration),
binwidth = 0.05,
fill = "grey",
color = "black") +
thm

• Eruptions were sometimes classified as short or long; these were coded as 2 and 4 minutes.

• For many purposes this kind of heaping or rounding does not matter.

• It would matter if we wanted to estimate means and standard deviations of the durations of the long and short eruptions.

• More data and information about geysers is available at http://geysertimes.org/.

• For exploration there is no one “correct” bin width or number of bins.

• It would be very useful to be able to change this parameter interactively.
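
The heaping at 2 and 4 minutes can also be checked by counting the durations recorded as exactly 2 or exactly 4. A minimal sketch, assuming the geyser data from MASS; the name heaped is made up for this example:

```r
data(geyser, package = "MASS")

d <- geyser$duration

# Number of durations recorded as exactly 2 or exactly 4 minutes
heaped <- c(short = sum(d == 2), long = sum(d == 4))
heaped

# Share of all recorded durations that fall on one of the two coded values
sum(heaped) / length(d)
```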

Superimposing a Density

A histogram can be used to compare the data distribution to a theoretical model, such as a normal distribution.

This requires using a density scale for the vertical axis.

The Galton data frame in the HistData package is one of several data sets used by Galton to study the heights of parents and their children.

Adding a normal density curve to a ggplot histogram involves:

• computing the parameters of the density;

• creating the histogram with a density scale using the computed variable after_stat(density);

• adding the function curve using geom_function, stat_function, or geom_line.

Create the histogram with a density scale using the computed variable after_stat(density):

data(Galton, package = "HistData")
ggplot(Galton) +
geom_histogram(aes(x = parent,
y = after_stat(density)),
binwidth = 1,
fill = "grey",
color = "black") +
thm

Then compute the mean and standard deviation and add the normal density curve:

data(Galton, package = "HistData")
p_mean <- mean(Galton$parent)
p_sd <- sd(Galton$parent)
p_dens <- function(x) dnorm(x, p_mean, p_sd)
ggplot(Galton) +
geom_histogram(aes(x = parent,
y = after_stat(density)),
binwidth = 1,
fill = "grey",
color = "black") +
geom_function(fun = p_dens, color = "red") +
thm

Multiple Groups

Faceting works well for showing comparative histograms for multiple groups.

Histograms of price within cut for the diamonds data:

ggplot(diamonds) +
geom_histogram(aes(x = price),
binwidth = 1000,
color = "black", fill = "grey") +
facet_wrap(~ cut) +
thm

These histograms show counts on the vertical axis, and their sizes reflect the total counts for the groups.

Together the plots represent a view of the joint distribution of cut and price.

Switching to a density scale by using after_stat(density) for the y aesthetic allows the conditional distributions of price within groups to be compared:

p <- ggplot(diamonds) +
geom_histogram(aes(x = price,
y = after_stat(density)),
binwidth = 1000,
color = "black",
fill = "grey") +
thm
p + facet_wrap(~ cut)

By mapping the fill aesthetic to cut it is possible to produce a stacked histogram or a superimposed histogram:

• position = "stack", the default, for stacked;
• position = "identity" for superimposed.

But neither works very well visually.
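
For reference, the two position settings could be used as in the following sketch. The bin width matches the earlier examples; the alpha value and the names p_stack and p_ident are illustrative choices.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = price, fill = cut))

# Stacked: bars for the cut levels are piled on top of each other
p_stack <- base +
  geom_histogram(binwidth = 1000, position = "stack")

# Superimposed: all bars start at zero; alpha lets overlapping bars show through
p_ident <- base +
  geom_histogram(binwidth = 1000, position = "identity", alpha = 0.4)

p_stack
p_ident
```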

For comparing locations of features it can help to facet with a single column.

But this may create aspect ratios that are not ideal.
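
A single-column layout can be requested with the ncol argument of facet_wrap. A brief sketch, repeating the density-scale histogram from above; the name p_col is made up for this example:

```r
library(ggplot2)

p_col <- ggplot(diamonds) +
  geom_histogram(aes(x = price, y = after_stat(density)),
                 binwidth = 1000, color = "black", fill = "grey") +
  facet_wrap(~ cut, ncol = 1)   # one column, so all panels share the x axis
p_col
```

The aspect ratio of the panels then depends on the size of the graphics device, so the height of the output may need adjusting.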

Scalability

Histograms scale very well.

• The visual performance does not deteriorate with increasing numbers of observations.

• The computational effort needed is linear in the number of observations.

• The amount of storage needed for an image object is linear in the number of bins.
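
The storage point can be illustrated with simulated data. In this sketch the sample sizes, the seed, and the break points are arbitrary; the same bins are used for a small and a large sample.

```r
set.seed(1)
x_small <- rnorm(1e3)
x_large <- rnorm(1e6)

breaks <- seq(-7, 7, by = 0.25)   # the same bins for both samples

h_small <- hist(x_small, breaks = breaks, plot = FALSE)
h_large <- hist(x_large, breaks = breaks, plot = FALSE)

# What is needed to draw either histogram is one count per bin
length(h_small$counts)
length(h_large$counts)
```

Both results hold the same number of counts even though the second sample is a thousand times larger.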

Density Plots

Density Plot Basics

Density plots can be thought of as plots of smoothed histograms.

The smoothness is controlled by a bandwidth parameter that is analogous to the histogram binwidth.

Most density plots use a kernel density estimate, but there are other possible strategies; qualitatively the particular strategy rarely matters.
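
A kernel density estimate with a normal kernel is an average of normal densities centered at the data values, with the bandwidth as the standard deviation. The following sketch computes this directly and compares it with the base R function density; the helper name kde_at is made up for this example.

```r
x <- faithful$eruptions
bw <- bw.nrd0(x)   # the default bandwidth rule used by density()

# Estimate at a single point x0: average of normal densities centered at the data
kde_at <- function(x0) mean(dnorm(x0, mean = x, sd = bw))

d <- density(x, bw = bw)   # estimate on a grid of points
manual <- vapply(d$x, kde_at, numeric(1))

# Largest discrepancy; not exactly zero because density() uses a fast approximation
max(abs(manual - d$y))
```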

A density plot of the geyser duration variable with default bandwidth:

ggplot(geyser) +
geom_density(aes(x = duration)) +
thm

Using a smaller bandwidth shows the heaping at 2 and 4 minutes:

ggplot(geyser) +
geom_density(aes(x = duration), bw = 0.05) +
thm

For a moderate number of observations a useful addition is a jittered rug plot:

ggplot(geyser) +
geom_density(aes(x = duration)) +
geom_rug(aes(x = duration, y = 0),
position =
position_jitter(height = 0)) +
thm

Scalability

Visual scalability is very good.

For the diamonds data price variable:

ggplot(diamonds) +
geom_density(aes(x = price)) +
thm

Density estimates are generally computed at a grid of points and interpolated.

Defaults in R vary from 50 to 512 points.

Computational effort for a density estimate at a point is proportional to the number of observations.

Storage needed for an image is proportional to the number of points where the density is estimated.
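
For the base R function density the size of the grid is set by the n argument. A small sketch using the faithful eruption durations:

```r
x <- faithful$eruptions

d_default <- density(x)          # default grid
d_coarse <- density(x, n = 50)   # coarser grid

# Number of points at which each estimate is returned
length(d_default$x)
length(d_coarse$x)
```

The size of the returned object depends on the grid, not on the number of observations.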

Multiple Groups

Density estimates for several groups can be shown in a single plot by mapping a group index to an aesthetic, such as color:

ggplot(barley) +
geom_density(aes(x = yield,
color = site)) +
thm

Using fill and alpha can also be useful:

ggplot(barley) +
geom_density(aes(x = yield,
fill = site),
alpha = 0.2)

Multiple densities in a single plot work best with a smaller number of categories, say 2 or 3:

ggplot(barley) +
geom_density(aes(x = yield,
fill = year),
alpha = 0.4) +
thm

Using small multiples, or faceting, may be a better option:

ggplot(barley) + geom_density(aes(x = yield)) + facet_wrap(~ site) + thm

These ideas can be combined:

ggplot(barley) +
geom_density(aes(x = yield, color = year)) +
facet_wrap(~ site) +
thm

These plots again show lower yields for 1932 than for 1931 for all sites except Morris.
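
The comparison between years can be checked against the mean yields. A minimal sketch, assuming the barley data from lattice; the name m is made up for this example:

```r
library(lattice)   # provides the barley data

# Mean yield for each site and year (rows are sites, columns are years)
m <- with(barley, tapply(yield, list(site, year), mean))
round(m, 1)

# Sites where the 1932 mean exceeds the 1931 mean
rownames(m)[m[, "1932"] > m[, "1931"]]
```

Means are only a summary; the density plots show whether the whole distribution shifts or only part of it.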

Density plots default to using the density scale.

For the diamonds data a density plot of price faceted on cut shows the conditional distributions of price at the different cut levels:

ggplot(diamonds) +
geom_density(aes(x = price)) +
facet_wrap(~ cut) + thm

Mapping the y aesthetic to after_stat(count) shows the joint distribution of price and cut:

ggplot(diamonds) +
geom_density(aes(x = price,
y = after_stat(count))) +
facet_wrap(~ cut) + thm

A stacked density plot is sometimes useful but often hard to read:

ggplot(diamonds) +
geom_density(aes(x = price,
y = after_stat(count),
fill = cut),
position = "stack") +
thm

An intermediate option: A faceted plot on the count scale with a muted plot for the full data to allow proportions of the whole to be assessed:

ggplot(diamonds) +
geom_density(aes(x = price, y = after_stat(count)),
fill = "lightgrey", color = NA,
data = mutate(diamonds, cut = NULL)) +
geom_density(aes(x = price,
y = after_stat(count),
fill = cut),
position = "stack", color = NA) +
facet_wrap(~ cut) +
scale_fill_viridis_d(guide = "none") +
thm

A filled density plot provides a view of the conditional distribution of cut at the different price levels:

ggplot(diamonds) +
geom_density(aes(x = price, y = after_stat(count), fill = cut),
position = "fill") +
ylab(NULL) +
thm

This is called a CD plot, or a conditional density plot.
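
Base graphics also has a function cdplot for this kind of plot. The following is a sketch only, using default settings; its smoothing and appearance will differ from the ggplot version above.

```r
library(ggplot2)   # used here only for the diamonds data

# cdplot() needs a factor response; cut is an ordered factor
cd <- cdplot(cut ~ price, data = diamonds)
```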

Some Notes

Computations are generally done with the base R function density.

plot has a method for the results returned by this function, so a density plot can be created with

plot(density(geyser$duration))

The lattice package provides the function densityplot.

Interactive Bandwidth Choice

Being able to choose the bandwidth of a density plot, or the binwidth of a histogram, interactively is useful for exploration.

One way to do this in R (which unfortunately does not work on the RStudio server):

data(geyser, package = "MASS")
source("http://homepage.divms.uiowa.edu/~luke/classes/STAT7400/examples/tkdens.R")
tkdens(geyser$duration, tkrplot = TRUE)

Another option:

data(geyser, package = "MASS")
source("http://homepage.divms.uiowa.edu/~luke/classes/STAT7400/examples/shinydens.R")
shinyDens(geyser$duration)

Boxplots

Boxplots, or box-and-whisker plots, provide a skeletal representation of a distribution.

They are very well suited for showing distributions for multiple groups.

There are many variations of boxplots:

• Most start with a box from the first to the third quartiles and divided by the median.

• The simplest form then adds a whisker from the lower quartile to the minimum and from the upper quartile to the maximum.

• More common is to draw the upper whisker to the largest point below the upper quartile + 1.5 × IQR, and the lower whisker analogously.

• Outliers falling outside the range of the whiskers are then drawn directly:

library(gapminder)
library(ggplot2)
ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap)) +
xlab(NULL) +
thm

There are variants that distinguish between mild outliers and extreme outliers.

A common variant is to show an approximate 95% confidence interval for the population median as a notch:

ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE) +
xlab(NULL) +
thm

Another variant is to use a width proportional to the square root of the sample size to reflect the strength of evidence in the data:

ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE,
varwidth = TRUE) +
xlab(NULL) +
thm

With moderate sample sizes it can be useful to super-impose the original data, perhaps with jittering and alpha blending.

The outliers in the box plot can be turned off with outlier.color = NA so they are not shown twice:

p <- ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap),
notch = TRUE,
varwidth = TRUE,
outlier.color = NA) +
xlab(NULL) +
thm
p + geom_point(aes(x = continent, y = gdpPercap),
position = position_jitter(width = 0.1),
alpha = 0.1)

Violin Plots

A variant of the boxplot is the violin plot:

Hintze, J. L., Nelson, R. D. (1998), “Violin Plots: A Box Plot-Density Trace Synergism,” The American Statistician 52, 181-184.

The violin plot uses density estimates to show the distributions:

ggplot(gapminder) +
geom_violin(aes(x = continent, y = gdpPercap)) +
xlab(NULL) +
thm

By default the “violins” are scaled to have the same area.

They can also be scaled to have the same maximum height or to have areas proportional to sample sizes.

This is done by adding

• scale = "width" or

• scale = "count"

to the geom_violin call.

A comparison of boxplots and violin plots:

ggplot(gapminder) +
geom_boxplot(aes(x = continent, y = gdpPercap)) +
geom_violin(aes(x = continent, y = gdpPercap),
fill = NA,
scale = "width",
linetype = 2) +
xlab(NULL) +
thm

A combination of boxplots and violin plots:

ggplot(gapminder) +
geom_violin(aes(x = continent, y = gdpPercap),
scale = "width") +
geom_boxplot(aes(x = continent, y = gdpPercap),
width = .1) +
xlab(NULL) +
thm

There are other variations, e.g. vase plots.

Boxplots do not reflect the shape of a distribution.

For the eruptions in the faithful data set:

ggplot(faithful) +
geom_boxplot(aes(y = eruptions, x = "Box")) +
geom_violin(aes(y = eruptions, x = "Violin")) +
xlab(NULL) +
thm

Swarm Plots

Swarm plots show the full data in a form that also shows the density.

There are a number of variations and names, including beeswarm plots, violin scatterplots, violin strip charts, and sina plots.

Sina plots are available as geom_sina in the ggforce package:

library(ggforce)
ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
geom_sina(size = 0.2) +
xlab(NULL) +
thm

Combined with a width-scaled violin plot:

ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
geom_violin(scale = "width") +
geom_sina(color = "blue", size = 0.4, scale = FALSE) +
xlab(NULL) +
thm

Effectiveness and Scalability

• Boxplots are very simple and easy to compare.

• Boxplots strongly emphasize the middle half of the data.

• Boxplots may not be easy for a lay viewer to understand.

• Box plots scale fairly well visually and computationally in the number of observations; over-plotting/storage of outliers becomes an issue for larger data sets.

• Violin plots scale well both visually and computationally in the number of observations.

library(patchwork)
p1 <- ggplot(diamonds) +
geom_boxplot(aes(x = cut, y = price)) +
xlab(NULL) +
thm
p2 <- ggplot(diamonds) +
geom_violin(aes(x = cut, y = price)) +
xlab(NULL) +
thm
p1 + p2

• Scalability in the number of cases for swarm or sina plots is more limited.

• The number of groups that can be handled for comparison by these plots is in the range of a few dozen.

library(lattice)
p1 <- ggplot(barley) +
geom_boxplot(aes(x = site, y = yield, fill = year)) +
xlab(NULL) +
thm
p2 <- ggplot(barley) +
geom_violin(aes(x = site, y = yield, fill = year)) +
xlab(NULL) +
thm
p1 + p2

Axes can be flipped to avoid overplotting of labels:

library(lattice)
p3 <- p1 + coord_flip() + guides(fill = "none")
p4 <- p2 + coord_flip()
p3 + p4

Faceting can also be used to arrange groups of boxplots or violin plots.

For life expectancy by continent over the years in the gapminder data:

library(dplyr)
ggplot(filter(gapminder, year %% 10 == 2, continent != "Oceania")) +
geom_boxplot(aes(x = lifeExp, y = factor(year))) +
facet_wrap(~ continent, ncol = 1) +
theme_minimal() +
theme(text = element_text(size = 12)) +
theme(strip.text.x = element_text(hjust = 0)) +
ylab(NULL)

A related visualization motivated by a graph in the Economist is available here.

Ridgeline Plots

Ridgeline plots, also called ridge plots or joy plots, are another way to show density estimates for a number of groups that has become popular recently.

The package ggridges defines geom_density_ridges for creating these plots:

library(ggridges)
ggplot(barley) +
geom_density_ridges(aes(x = yield, y = site, group = site)) +
ylab(NULL) +
thm

Grouping by an interaction with a categorical variable, year, produces separate density estimates for each level.

Mapping the fill aesthetic to year allows the separate densities to be identified:

ggplot(barley) +
geom_density_ridges(
aes(x = yield, y = site,
group = interaction(year, site),
fill = year)) +
ylab(NULL) +
thm

Alpha blending may sometimes help:

ggplot(barley) +
geom_density_ridges(
aes(x = yield, y = site,
group = interaction(year, site),
fill = year),
alpha = 0.8) +
ylab(NULL) +
thm

Adjusting the vertical scale may also help:

ggplot(barley) +
geom_density_ridges(
aes(x = yield, y = site,
group = interaction(year, site),
fill = year),
scale = 0.8) +
ylab(NULL) +
thm

Sometimes reordering the grouping variable, year in this case, can help.

The factor levels of year can be reordered to match the order of average yields within each year by

reorder(year, yield)

Using -yield produces the reverse order.

library(dplyr)
ggplot(mutate(barley, year = reorder(year, -yield))) +
geom_density_ridges(aes(x = yield, y = site,
group = interaction(year, site),
fill = year),
scale = 0.8) +
ylab(NULL) +
thm

With some tuning ridgeline plots can scale well to many distributions.

An example from Claus Wilke’s book:

The ggplot2movies package provides data from IMDB on a large number of movies, including their lengths, in a tibble movies:

library(ggplot2movies)
dim(movies)
## [1] 58788    24
head(movies)
## # A tibble: 6 × 24
##   title     year length budget rating votes    r1    r2    r3    r4    r5    r6
##   <chr>    <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 $         1971    121     NA    6.4   348   4.5   4.5   4.5   4.5  14.5  24.5
## 2 $1000 a…  1939     71     NA    6      20   0    14.5   4.5  24.5  14.5  14.5
## 3 $21 a Da… 1941      7     NA    8.2     5   0     0     0     0     0    24.5
## 4 $40,000   1996     70     NA    8.2     6  14.5   0     0     0     0     0
## 5 $50,000 … 1975     71     NA    3.4    17  24.5   4.5   0    14.5  14.5   4.5
## 6 $pent     2000     91     NA    4.3    45   4.5   4.5   4.5  14.5  14.5  14.5
## # … with 12 more variables: r7 <dbl>, r8 <dbl>, r9 <dbl>, r10 <dbl>,
## #   mpaa <chr>, Action <int>, Animation <int>, Comedy <int>, Drama <int>,
## #   Documentary <int>, Romance <int>, Short <int>

A ridgeline plot of the movie lengths for each year:

library(dplyr)
mv12 <- filter(movies, year > 1912)
ggplot(mv12, aes(x = length, y = year, group = year)) +
geom_density_ridges(scale = 10, size = 0.25, rel_min_height = 0.03) +
scale_x_continuous(limits = c(0, 200)) +
scale_y_reverse(breaks = c(2000, 1980, 1960, 1940, 1920)) +
theme_minimal()
## Warning: Removed 250 rows containing non-finite values (stat_density_ridges).

This shows that since the early 1960s feature film lengths have stabilized to a distribution centered around 90 minutes.

Another nice example: DW-NOMINATE scores for measuring political position of members of congress over the years:

Original code by Ian McDonald; another version is provided in Claus Wilke’s book.

Interactive Tutorial

An interactive learnr tutorial for these notes is available.

You can run the tutorial with

STAT4580::runTutorial("dists")

You can install the current version of the STAT4580 package with

remotes::install_gitlab("luke-tierney/STAT4580")

You may need to install the remotes package from CRAN first.

Exercises

1. Consider the code

    library(ggplot2)
data(Galton, package = "HistData")
ggplot(Galton, aes(x = parent)) +
geom_histogram(---, fill = "grey", color = "black")

Which of the following replacements for --- produces a histogram with bins that are one inch wide and start at whole integers?

    1. binwidth = 1
    2. binwidth = 1, center = 66.5
    3. binwidth = 2, center = 66
    4. center = 66

2. Consider the code

library(ggplot2)
ggplot(faithful, aes(x = eruptions)) + geom_density(---)

Which of the following replacements for --- produces a density plot with the area under the density in blue and no black border?

    1. color = "lightblue"
    2. fill = "black", color = "lightblue"
    3. fill = "lightblue", color = NA
    4. fill = NA, color = "black"

3. Consider the code

library(ggplot2)
library(gapminder)
p <- ggplot(gapminder, aes(y = continent, x = lifeExp))

Which of the following produces violin plots without trimming at the smallest and largest observations, and including a line at the median?

    1. p + geom_violin(trim = FALSE)
    2. p + geom_violin(trim = TRUE, show_median = TRUE)
    3. p + geom_violin(trim = FALSE, draw_quantiles = 0.5)
    4. p + geom_violin(trim = TRUE, show_quantiles = 0.5)

4. Density ridges can also show quantiles, but the details of how to request this are different. Consider this code:

library(ggplot2)
library(ggridges)
library(gapminder)
ggplot(gapminder, aes(x = lifeExp, y = year, group = year)) +
geom_density_ridges(---)

Which of the following replacements for --- produces density ridges with lines showing the locations of the medians?

1. quantiles = 0.5
2. quantile_lines = TRUE, quantiles = 0.5
3. quantile_lines = TRUE
4. draw_quantiles = 0.5