class: center, middle, title-slide .title[ # Visualizing Distributions ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-06 ] --- layout: true <link rel="stylesheet" href="stat4580.css" type="text/css" /> <!-- title based on Wilke's chapter --> ## Introduction --- Once there are more than a handful of numeric data values it is often useful to step back and look at the _distribution_ of the data values: -- * Where is the bulk of the data located? -- * Is there a single area of concentration or are there several? -- * Is the data distribution symmetric or is it skewed, i.e. trails off more slowly in one direction or another? -- * Are there extreme, or outlying, values? -- * Are there any suspicious or impossible values? -- * Are there gaps in the data? -- * Is there rounding, e.g. to integer values, or _heaping_, i.e. a few particular values occur very frequently? --- Plots for visualizing distributions include -- * Strip plots. -- * Histograms. -- * Density plots. -- * Box plots. -- * Violin plots. -- * Swarm plots. -- * Density ridges --- layout: true ## Strip Plots --- .pull-left.small-code[ ### Strip Plot Basics A variant of the dot plot is known as a _strip plot_. A strip plot for the city temperature data is .hide-code[ ```r thm <- theme_minimal() + theme(text = element_text(size = 16)) ggplot(citytemps) + geom_point(aes(x = temp, y = "All")) + thm + theme(axis.title.y = element_blank(), axis.text.y = element_blank()) ``` <img src="dists_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right.small-code[ The strip plot can reveal gaps and outliers. {{content}} ] -- After looking at the plot we might want to examine the high and low values: {{content}} -- ```r filter(citytemps, temp > 85) ## city temp ## 1 Caracas 88 ## 2 Managua 91 ## 3 Rio de Janeiro 86 ## 4 San Salvador 86 ## 5 Santo Domingo 86 filter(citytemps, temp < 10) ## city temp ## 1 Almaty -6 ## 2 Anadyr 0 ## 3 Tashkent 1 ``` --- For the eruption durations in the `faithful` data a strip plot shows the two modes around 2 and 4 minutes: .small-code[ .hide-code[ ```r ggplot(faithful) + geom_point(aes(x = eruptions, y = "All")) + thm + theme(axis.title.y = element_blank(), axis.text.y = element_blank()) ``` <img src="dists_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Multiple Groups {{content}} ] -- Strip plots are most useful for showing subsets corresponding to a categorical variable. {{content}} -- A strip plot for the yields for different varieties in the barley data is -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_point(aes(x = yield, y = variety)) + theme_minimal() + thm ``` <img src="dists_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Scalability {{content}} ] -- Scalability in this form is limited due to over-plotting. {{content}} -- A simple strip plot of `price` within the different `cut` levels in the `diamonds` data is not very helpful: -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_point(aes(x = price, y = cut)) + thm + theme(axis.title.y = element_blank()) ``` <img src="dists_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Several approaches are available to reduce the impact of over-plotting: {{content}} ] -- * reduce the point size; {{content}} -- * random displacement of points, called _jittering_; {{content}} -- * making the points translucent, or _alpha blending_. {{content}} -- Combining all three for examining `price` within `cut` for the `diamonds` data produces -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_point(aes(x = price, y = cut), size = 0.2, position = position_jitter(width = 0), alpha = 0.2) + thm + theme(axis.title.y = element_blank()) ``` <img src="dists_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Skewness of the price distributions can be seen in this plot, though other approaches will show this more clearly. {{content}} ] -- A peculiar feature reveled by this plot is the gap below 2000. {{content}} -- Examining the subset with `price < 2000` shows the gap is roughly symmetric around 1500: -- .pull-right.small-code[ .hide-code[ ```r ggplot(filter(diamonds, price < 2000)) + geom_point(aes(x = price, y = cut), size = 0.2, position = position_jitter(width = 0), alpha = 0.2) + thm + theme(axis.title.y = element_blank()) ``` <img src="dists_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A plot along these lines was used on the New York Times [front page for February 21, 2021](../img/NYT-2021-02-21.jpeg). ] -- .pull-right[ <img src="../img/NYT-2021-02-21.jpeg" width="55%" style="display: block; margin: auto;" /> ] --- ### Some Notes -- * With a good combination of point size choice, jittering, and alpha blending the strip plot for groups of data can scale to several hundred thousand observations and ten to twenty of groups. -- * For very large datat sets it can be useful to look at a strip plot of a sample of the data. -- * Strip plots can reveal gaps, outliers, and data outside of the expected range. -- * Skewness and multi-modality can be seen, but other visualizations show these more clearly. -- * Storage needed for vector graphics images grows linearly with the number of observations. -- * Base graphics provides `stripchart` and lattice provides `stripplot`. --- layout: true ## Histograms --- ### Histogram Basics -- .pull-left[ Historams are constructed by binning the data and counting the number of observations in each bin. {{content}} ] -- The objective is usually to visualize the shape of the distribution. {{content}} -- The number of bins needs to be {{content}} -- * small enough to reveal interesting features; {{content}} -- * large enough not to be too noisy. {{content}} -- A very small bin width can be used to look for rounding or heaping. -- .pull-right[ Common choices for the vertical scale are: {{content}} ] -- * bin counts, or frequencies; {{content}} -- * counts per unit, or densities. {{content}} -- The count scale is more intepretable for lay viewers. {{content}} -- The density scale is more suited for comparison to mathematical density models. {{content}} -- Constructing histograms with unequal bin widths is possible but rarely a good idea. --- .pull-left[ ### Histograms in R {{content}} ] -- There are many ways to plot histograms in R: {{content}} -- * the `hist()` function in the base `graphics` package; {{content}} -- * `truehist()` in package `MASS`; {{content}} -- * `histogram()` in package `lattice`; {{content}} -- * `geom_histogram()` in package `ggplot2`. {{content}} -- A histogram of eruption durations for another data set on Old Faithful eruptions, this one from package `MASS`: -- .pull-right.small-code[ .hide-code[ ```r data(geyser, package = "MASS") ggplot(geyser) + geom_histogram(aes(x = duration)) + thm ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="dists_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ The default settings using `geom_histogram` are less than ideal. {{content}} ] -- Using a binwidth of 0.5 and customized `fill` and `color` settings produces a better result: -- .pull-right.small-code[ .hide-code[ ```r ggplot(geyser) + geom_histogram(aes(x = duration), binwidth = 0.5, fill = "grey", color = "black") + thm ``` <img src="dists_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.small-code[ Reducing the bin width shows an interesting feature: .hide-code[ ```r ggplot(geyser) + geom_histogram(aes(x = duration), binwidth = 0.05, fill = "grey", color = "black") + thm ``` <img src="dists_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ * Eruptions were sometimes classified as _short_ or _long_; these were coded as 2 and 4 minutes. {{content}} ] -- * For many purposes this kind of heaping or rounding does not matter. {{content}} -- * It would matter if we wanted to estimate means and standard deviations of the durations of the long and short eruptions. {{content}} -- * More data and information about geysers is available at https://geysertimes.org/. <!-- THis seems no longer maintained: http://www.geyserstudy.org/geyser.aspx?pGeyserNo=OLDFAITHFUL. --> {{content}} -- * For exploration there is no one "correct" bin width or number of bins. {{content}} -- * It would be very useful to be able to change this parameter interactively. --- ### Superimposing a Density A histogram can be used to compare the data distribution to a theoretical model, such as a normal distribution. -- This requires using a _density scale_ for the vertical axis. -- The `Galton` data frame in the `HistData` package is one of several data sets used by Galton to study the heights of parents and their children. -- Adding a normal density curve to a `ggplot` histogram involves: -- * computing the parameters of the density; -- * creating the histogram with a density scale using the computed variable `after_stat(density)`; -- * adding the function curve using `geom_function`, `stat_function`, or `geom_line`. --- .pull-left.small-code[ Create the histogram with a density scale using the _computed varlable_ `after_stat(density)`: <!-- http://stackoverflow.com/questions/25075428/ggplot2-stat-function-with-calculated-argument-for-different-data-subset-inside --> {{content}} ] -- ```r data(Galton, package = "HistData") ggplot(Galton) + geom_histogram(aes(x = parent, y = after_stat(density)), binwidth = 1, fill = "grey", color = "black") + thm ``` -- .pull-right[ <img src="dists_files/figure-html/galton-hist-1.png" style="display: block; margin: auto;" /> ] --- .pull-left.small-code[ Then compute the mean and standard deviation and add the normal density curve: ```r data(Galton, package = "HistData") *p_mean <- mean(Galton$parent) *p_sd <- sd(Galton$parent) *p_dens <- function(x) dnorm(x, p_mean, p_sd) ggplot(Galton) + geom_histogram(aes(x = parent, y = after_stat(density)), binwidth = 1, fill = "grey", color = "black") + * geom_function(fun = p_dens, color = "red") + thm ``` ] .pull-right[ <img src="dists_files/figure-html/galton-hist-dens-1.png" style="display: block; margin: auto;" /> ] --- .pull-left.width-40[ ### Multiple Groups {{content}} ] -- Faceting works well for showing comparative histograms for multiple groups. {{content}} -- Histograms of `price` within `cut` for the `diamonds` data: -- .pull-right.width-60.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_histogram(aes(x = price), binwidth = 1000, color = "black", fill = "grey") + facet_wrap(~ cut) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-40[ These histograms show counts on the vertical axis, and their sizes reflect the total counts for the groups. {{content}} ] -- Together the plots represent a view of the joint distribution of `cut` and `price`. {{content}} -- Switching to a density scale by using `after_stat(density)` for the `y` aesthetic allows the conditional distributions of `price` within groups to be compared: -- .pull-right.width-60.small-code[ .hide-code[ ```r p <- ggplot(diamonds) + geom_histogram(aes(x = price, y = after_stat(density)), binwidth = 1000, color = "black", fill = "grey") + thm p + facet_wrap(~ cut) ``` <img src="dists_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ By mapping the `fill` aesthetic to `cut` it is possible to produce a stacked histogram or a superimposed histogram * `position = "stack"`, the default, for stacked; * `position = "identity"` for superimposed. {{content}} ] -- But neither works very well visually. {{content}} -- For comparing locations of features it can help to facet with a single column. {{content}} -- But this may create aspect ratios that are not ideal. -- .pull-right.small-code[ <img src="dists_files/figure-html/hist-facet-one-col-1.png" style="display: block; margin: auto;" /> ] --- ### Scalability Histograms scale very well. -- * The visual performance does not deteriorate with increasing numbers of observations. -- * The computational effort needed is linear in the number of observations. -- * The amount of storage needed for an image object is linear in the number of bins. --- layout: true ## Density Plots --- ### Density Plot Basics .pull-left[ Density plots can be thought of as plots of smoothed histograms. {{content}} ] -- The smoothness is controlled by a _bandwidth_ parameter that is analogous to the histogram binwidth. {{content}} -- Most density plots use a [_kernel density estimate_](https://en.wikipedia.org/wiki/Kernel_density_estimation), but there are other possible strategies; qualitatively the particular strategy rarely matters. {{content}} -- A density plot of the `geyser` `duration` variable with default bandwidth: -- .pull-right.small-code[ .hide-code[ ```r ggplot(geyser) + geom_density(aes(x = duration)) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Using a smaller bandwidth shows the heaping at 2 and 4 minutes: ] .pull-right.small-code[ .hide-code[ ```r ggplot(geyser) + geom_density(aes(x = duration), bw = 0.05) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ For a moderate number of observations a useful addition is a jittered _rug plot_: ] .pull-right.small-code[ .hide-code[ ```r ggplot(geyser) + geom_density(aes(x = duration)) + geom_rug(aes(x = duration, y = 0), position = position_jitter(height = 0)) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] ] --- ### Scalability .pull-left[ Visual scalability is very good. {{content}} ] -- For the `diamonds` data `price` variable: -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_density(aes(x = price)) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ] ] --- Density estimates are generally computed at a grid of points and interpolated. -- Defaults in R vary from 50 to 512 points. -- Computational effort for a density estimate at a point is proportional to the number of observations. -- Storage needed for an image is proportional to the number of points where the density is estimated. --- .pull-left[ ### Multiple Groups {{content}} ] -- Density estimates for several groups can be shown in a single plot by mapping a group index to an aesthetic, such as `color`: -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density(aes(x = yield, color = site)) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Using `fill` and `alpha` can also be useful: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density(aes(x = yield, fill = site), alpha = 0.2) ``` <img src="dists_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Multiple densities in a single plot works best with a smaller number of categories, say 2 or 3: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density(aes(x = yield, fill = year), alpha = 0.4) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Using small multiples, or faceting, may be a better option: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density(aes(x = yield)) + facet_wrap(~ site) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ These ideas can be combined: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density(aes(x = yield, color = year)) + facet_wrap(~ site) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- These plots again show lower yields for 1932 than for 1931 for all sites except Morris. --- .pull-left[ Density plots default to using the density scale. {{content}} ] -- For the diamonds data a density plot of `price` faceted on `cut` shows the conditional distributions of `price` at the different `cut` levels: -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_density(aes(x = price)) + facet_wrap(~ cut) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Mapping the `y` aesthetic to `after_stat(count)` shows the joint distribution of `price` and `cut`: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_density(aes(x = price, y = after_stat(count))) + facet_wrap(~ cut) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A stacked density plot is sometimes useful but often hard to read: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_density(aes(x = price, y = after_stat(count), fill = cut), position = "stack") + thm ``` <img src="dists_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ An intermediate option: A faceted plot on the count scale with a muted plot for the full data to allow proportions of the whole to be assessed: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_density(aes(x = price, y = after_stat(count)), fill = "lightgrey", color = NA, data = mutate(diamonds, cut = NULL)) + geom_density(aes(x = price, y = after_stat(count), fill = cut), position = "stack", color = NA) + facet_wrap(~ cut) + scale_fill_viridis_d(guide = "none") + thm ``` <img src="dists_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A filled density plot provides a vew of the conditional distribution of `cut` at the different price levels: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(diamonds) + geom_density(aes(x = price, y = after_stat(count), fill = cut), position = "fill") + ylab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- This is called a _CD plot_, or a _conditional density plot_. --- ### Some Notes Computations are generally done with the base R function `density`. -- `plot` has a method for the results returned by this function, so a density plot can be created with ```r plot(density(geyser$duration)) ``` -- The `lattice` package provides the function `densityplot`. --- ### Interactive Bandwidth Choice Being able to chose the bandwidth of a density plot, or the binwidth of a histogram, interactively is useful for exploration. -- One way to do this in R (which unfortunately does not work on the RStudio server): ```r data(geyser, package = "MASS") source("https://stat.uiowa.edu/~luke/classes/STAT7400/examples/tkdens.R") tkdens(geyser$duration, tkrplot = TRUE) ``` -- Another option: ```r data(geyser, package = "MASS") source("https://stat.uiowa.edu/~luke/classes/STAT7400/examples/shinydens.R") shinyDens(geyser$duration) ``` <!-- ggplot(barley) + geom_density(aes(x = yield, y = -after_stat(count), fill = year), data = filter(barley, year == 1931)) + geom_density(aes(x = yield, y = after_stat(count), fill = year), data = filter(barley, year == 1932)) + coord_flip() ggplot(barley) + geom_density(aes(x = yield, y = -after_stat(density), fill = year), data = filter(barley, year == 1931)) + geom_density(aes(x = yield, fill = year), data = filter(barley, year == 1932)) + coord_flip() + facet_wrap(~site) ## From Claus Wilke's book: data(Titanic, package = "Stat2Data") library(dplyr) library(tidyr) library(cowplot) library(ggplot2) Titanic %>% select(-SexCode) %>% rename(name = Name, class = PClass, age = Age, sex = Sex, survived = Survived) %>% mutate(name = as.character(name), class = as.character(class), sex = as.character(sex)) -> titanic_all ##titanic <- select(titanic_train, age = Age, sex = Sex) titanic <- titanic_all data.frame( age = (1:28)*3 - 1.5, male = hist(filter(titanic, sex == "male")$age, breaks = (0:28)*3 + .01, plot = FALSE)$counts, female = hist(filter(titanic, sex == "female")$age, breaks = (0:28)*3 + .01, plot = FALSE)$counts ) %>% gather(gender, count, -age) -> gender_counts ggplot(gender_counts, aes(x = age, y = ifelse(gender == "male",-1, 1)*count, fill = gender)) + geom_col() + scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) + scale_y_continuous(name = "count", breaks = 20*(-2:1), labels = c("40", "20", "0", "20")) + scale_fill_manual(values = c("#D55E00", "#0072B2"), guide = "none") + draw_text(x = 70, y = -39, "male", hjust = 0) + draw_text(x = 70, y = 21, "female", hjust = 0) + coord_flip() --> --- layout: true ## Boxplots --- .pull-left[ _Boxplots_, or _box-and-whisker_ plots, provide a skeletal representation of a distribution. {{content}} ] -- They are very well suited for showing distributions for multiple groups. {{content}} -- There are many variations of boxplots: {{content}} -- * Most start with a box from the first to the third quartiles and divided by the median. {{content}} -- * The simplest form then adds a whisker from the lower quartile to the minimum and from the upper quartile to the maximum. {{content}} -- * More common is to draw the upper whisker to the largest point below the upper quartile `\(+ 1.5 * IQR\)`, and the lower whisker analogously. {{content}} -- * _Outliers_ falling outside the range of the whiskers are then drawn directly: -- .pull-right.small-code[ .hide-code[ ```r library(gapminder) library(ggplot2) ggplot(gapminder) + geom_boxplot(aes(x = continent, y = gdpPercap)) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ There are variants that distinguish between _mild outliers_ and _extreme outliers_. {{content}} ] -- A common variant is to show an approximate 95% confidence interval for the population median as a _notch_: -- .pull-right.small-code[ .hide-code[ ```r ggplot(gapminder) + geom_boxplot(aes(x = continent, y = gdpPercap), notch = TRUE) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Another variant is to use a width proportional to the square root of the sample size to reflect the strength of evidence in the data: ] .pull-right.small-code[ .hide-code[ ```r ggplot(gapminder) + geom_boxplot(aes(x = continent, y = gdpPercap), notch = TRUE, varwidth = TRUE) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ With moderate sample sizes it can be useful to super-impose the original data, perhaps with jittering and alpha blending. {{content}} ] -- The outliers in the box plot can be turned off with `outlier.color = NA` so they are not shown twice: -- .pull-right.small-code[ .hide-code[ ```r p <- ggplot(gapminder) + geom_boxplot(aes(x = continent, y = gdpPercap), notch = TRUE, varwidth = TRUE, outlier.color = NA) + xlab(NULL) + thm p + geom_point(aes(x = continent, y = gdpPercap), position = position_jitter(width = 0.1), alpha = 0.1) ``` <img src="dists_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Violin Plots --- .pull-left[ A variant of the boxplot is the _violin plot_: {{content}} ] -- > Hintze, J. L., Nelson, R. D. (1998), "Violin Plots: A Box > Plot-Density Trace Synergism," _The American Statistician_ 52, > 181-184. {{content}} -- The violin plot uses density estimates to show the distributions: -- .pull-right.small-code[ .hide-code[ ```r ggplot(gapminder) + geom_violin(aes(x = continent, y = gdpPercap)) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ By default the "violins" are scaled to have the same area. {{content}} ] -- They can also be scaled to have the same maximum height or to have areas proportional to sample sizes. {{content}} -- This is done by adding * `scale = "width"` or * `scale = "count"` to the `geom_violin` call. {{content}} -- A comparison of boxplots and violin plots: -- .pull-right.small-code[ .hide-code[ ```r ggplot(gapminder) + geom_boxplot(aes(x = continent, y = gdpPercap)) + geom_violin(aes(x = continent, y = gdpPercap), fill = NA, scale = "width", linetype = 2) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A combination of boxplots and violin plots: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(gapminder) + geom_violin(aes(x = continent, y = gdpPercap), scale = "width") + geom_boxplot(aes(x = continent, y = gdpPercap), width = .1) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- There are other variations, e.g. _vase plots_. --- .pull-left[ Boxplots do not reflect the shape of a distribution. {{content}} ] -- For the `eruptions` in the `faithful` data set: -- .pull-right.small-code[ .hide-code[ ```r ggplot(faithful) + geom_boxplot(aes(y = eruptions, x = "Box")) + geom_violin(aes(y = eruptions, x = "Violin")) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Swarm Plots --- .pull-left[ Swarm plots show the full data in a form that also shows the density. {{content}} ] -- There are a number of variations and names, including _beeswarm plots_, _violin scatterplots_, _violin strip charts_, and _sina plots_ {{content}} -- [Sina plots](https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1366914) are available as `geom_sina` in the `ggforce` package: <!-- http://moderngraphics11.pbworks.com/f/wilkinson_1999.DotPlots.pdf --> -- .pull-right.small-code[ .hide-code[ ```r library(ggforce) ggplot(gapminder, aes(x = continent, y = gdpPercap)) + geom_sina(size = 0.2) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Combined with a width-scaled violin plot: ] -- .pull-right.small-code[ .hide-code[ ```r ggplot(gapminder, aes(x = continent, y = gdpPercap)) + geom_violin(scale = "width") + geom_sina(color = "blue", size = 0.4, scale = FALSE) + xlab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Effectiveness and Scalability --- .pull-left.smaller[ * Boxplots are very simple and easy to compare. {{content}} ] -- * Boxplots strongly emphasize the middle half of the data. {{content}} -- * Boxplots may not be easy for a lay viewer to understand. -- .pull-right.smaller[ * Box plots scale fairly well visually and computationally in the number of observations; over-plotting/storage of outliers becomes an issue for larger data sets {{content}} ] -- * Violin plots scale well both visually and computationally in the number of observations. -- .small-code[ .hide-code[ ```r library(patchwork) p1 <- ggplot(diamonds) + geom_boxplot(aes(x = cut, y = price)) + xlab(NULL) + thm p2 <- ggplot(diamonds) + geom_violin(aes(x = cut, y = price)) + xlab(NULL) + thm p1 + p2 ``` <img src="dists_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> ] ] --- * Scalability in the number of cases for swarm or sina plots is more limited. -- * The number of groups that can be handled for comparison by these plots is in the range of a few dozen. -- .hide-code[ ```r library(lattice) p1 <- ggplot(barley) + geom_boxplot(aes(x = site, y = yield, fill = year)) + xlab(NULL) + thm p2 <- ggplot(barley) + geom_violin(aes(x = site, y = yield, fill = year)) + xlab(NULL) + thm p1 + p2 ``` <img src="dists_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" /> ] --- Axes can be flipped to avoid overplotting of labels: .hide-code[ ```r library(lattice) p3 <- p1 + coord_flip() + guides(fill = "none") p4 <- p2 + coord_flip() p3 + p4 ``` <img src="dists_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> ] --- Faceting can also be used to arrange groups of boxplots or violin plots. -- For life expectancy by continent over the years in the `gapminder` data: -- .hide-code[ ```r library(dplyr) ggplot(filter(gapminder, year %% 10 == 2, continent != "Oceania")) + geom_boxplot(aes(x = lifeExp, y = factor(year))) + facet_wrap(~ continent, ncol = 1) + theme_minimal() + theme(text = element_text(size = 12)) + theme(strip.text.x = element_text(hjust = 0)) + ylab(NULL) ``` <img src="dists_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" /> ] -- A related visualization motivated by a graph in the Economist is available [here](https://cinc.rud.is/web/packages/ggeconodist/). --- layout: true ## Ridgeline Plots --- .pull-left[ [Ridgeline plots](https://blog.revolutionanalytics.com/2017/07/joyplots.html), also called _ridge plots_ or _joy plots_, are another way to show density estimates for a number of groups that has become popular recently. {{content}} ] -- The package `ggridges` defines `geom_density_ridges` for creating these plots: -- .pull-right.small-code[ .hide-code[ ```r library(ggridges) ggplot(barley) + geom_density_ridges(aes(x = yield, y = site, group = site)) + ylab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Grouping by an interaction with a categorical variable, `year`, produces separate density estimates for each level. {{content}} ] -- Mapping the `fill` aesthetic to `year` allows the separate densities to be identified: -- .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density_ridges( aes(x = yield, y = site, group = interaction(year, site), fill = year)) + ylab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Alpha blending may sometimes help: ] .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density_ridges( aes(x = yield, y = site, group = interaction(year, site), fill = year), alpha = 0.8) + ylab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Adjusting the vertical scale may also help: ] .pull-right.small-code[ .hide-code[ ```r ggplot(barley) + geom_density_ridges( aes(x = yield, y = site, group = interaction(year, site), fill = year), scale = 0.8) + ylab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Sometimes reordering the grouping variable, `year` in this case, can help. {{content}} ] -- The factor levels of `year` can be reordered to match the order of average yealds within each year by ```r reorder(year, yield) ``` {{content}} -- Using `-yield` produces the reverse order. -- .pull-right.small-code[ .hide-code[ ```r library(dplyr) ggplot(mutate(barley, year = reorder(year, -yield))) + geom_density_ridges(aes(x = yield, y = site, group = interaction(year, site), fill = year), scale = 0.8) + ylab(NULL) + thm ``` <img src="dists_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" /> ] ] --- With some tuning ridgeline plots can scale well to many distributions. An example from [Claus Wilke's book](https://clauswilke.com/dataviz/): -- The `ggplot2movies` package provides data from [IMDB](https://imdb.com/) on a large number of movies, including their lengths, in a tibble `movies`: .small-code[ ```r library(ggplot2movies) dim(movies) ## [1] 58788 24 head(movies) ## # A tibble: 6 × 24 ## title year length budget rating votes r1 r2 r3 r4 r5 r6 ## <chr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 ## 2 $1000 a … 1939 71 NA 6 20 0 14.5 4.5 24.5 14.5 14.5 ## 3 $21 a Da… 1941 7 NA 8.2 5 0 0 0 0 0 24.5 ## 4 $40,000 1996 70 NA 8.2 6 14.5 0 0 0 0 0 ## 5 $50,000 … 1975 71 NA 3.4 17 24.5 4.5 0 14.5 14.5 4.5 ## 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 ## # ℹ 12 more variables: r7 <dbl>, r8 <dbl>, r9 <dbl>, r10 <dbl>, mpaa <chr>, ## # Action <int>, Animation <int>, Comedy <int>, Drama <int>, ## # Documentary <int>, Romance <int>, Short <int> ``` ] --- .pull-left.small-code[ A ridgeline plot of the movie lengths for each year: .hide-code[ ```r library(dplyr) mv12 <- filter(movies, year > 1912) ggplot(mv12, aes(x = length, y = year, group = year)) + geom_density_ridges(scale = 10, size = 0.25, rel_min_height = 0.03) + scale_x_continuous(limits = c(0, 200)) + scale_y_reverse(breaks = c(2000, 1980, 1960, 1940, 1920)) + theme_minimal() ## Warning: Removed 250 rows containing non-finite values ## (`stat_density_ridges()`). ``` <img src="dists_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ This shows that since the early 1960's feature film lengths have stabilized to a distribution centered around 90 minutes: ] --- Another nice example: [DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_%28scaling_method%29) scores for measuring political position of members of congress over the years: <img src="../img/polarization.jpeg" width="70%" style="display: block; margin: auto;" /> [Original code]( http://rpubs.com/ianrmcdonald/293304) by Ian McDonald; another version is provided in [Claus Wilke's book](https://clauswilke.com/dataviz/). --- layout: false ## Reading Chapter [_Visualizing distributions: Histograms and density plots_](https://clauswilke.com/dataviz/histograms-density-plots.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). Section [_Histograms and density plots_](https://socviz.co/groupfacettx.html#histograms) in [_Data Visualization_](https://socviz.co/). Chapter [_Visualizing data distributions_](https://rafalab.dfci.harvard.edu/dsbook/distributions.html) in [_Introduction to Data Science Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook/). --- layout: false ## Interactive Tutorial An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](../tutorials/dists.Rmd). You can run the tutorial with ```r STAT4580::runTutorial("dists") ``` You can install the current version of the `STAT4580` package with ```r remotes::install_gitlab("luke-tierney/STAT4580") ``` You may need to install the `remotes` package from CRAN first. --- layout: true ## Exercises --- 1) Consider the code <!-- ## nolint start --> ```r library(ggplot2) data(Galton, package = "HistData") ggplot(Galton, aes(x = parent)) + geom_histogram(---, fill = "grey", color = "black") ``` <!-- ## nolint end --> Which of the following replacements for `---` produces a histogram with bins that are one inch wide and start at whole integers? * a. `binwidth = 1` * b. `binwidth = 1, center = 66.5` * c. `binwidth = 2, center = 66` * d. `center = 66` --- 2) Consider the code ```r library(ggplot2) ggplot(faithful, aes(x = eruptions)) + geom_density(---) ``` Which of the following replacements for `---` produces a density plot with the area under the density in blue and no black border? * a. `color = "lightblue"` * b. `fill = "black", color = "lightblue"` * c. `fill = "lightblue", color = NA` * d. `fill = NA, color = "black"` --- 3) Consider the code ```r library(ggplot2) library(gapminder) p <- ggplot(gapminder, aes(y = continent, x = lifeExp)) ``` Which of the following produces violin plots without trimming at the smallest and largest observations, and including a line at the median? * a. `p + geom_violin(trim = FALSE)` * b. `p + geom_violin(trim = TRUE, show_median = TRUE)` * c. `p + geom_violin(trim = FALSE, draw_quantiles = 0.5)` * d. `p + geom_violin(trim = TRUE, show_quantiles = 0.5)` --- 4) Density ridges can also show quantiles, but the details of how to request this are different. Consider this code: ```r library(ggplot2) library(ggridges) library(gapminder) ggplot(gapminder, aes(x = lifeExp, y = year, group = year)) + geom_density_ridges(---) ``` Which of the following replacements for `---` produces density ridges with lines showing the locations of the medians? * a. `quantiles = 0.5` * b. `quantile_lines = TRUE, quantiles = 0.5` * c. `quantile_lines = TRUE` * d. `draw_quantiles = 0.5` <!-- boxplots violin plots po <- function(prob, quant) { q25 <- quant(0.25) q75 <- quant(0.75) md <- quant(0.5) IQR <- q75 - q25 uop <- prob(q75 + 1.5 * IQR, lower.tail = FALSE) lop <- prob(q25 - 1.5 * IQR) c(lop, uop) } -->
//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505