--- title: "Histograms and Density Plots" output: html_document: toc: yes --- ```{r global_options, include = FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` ## Feature to Look For Some things to keep an eye out for when looking at data on a numeric variable: * skewness, multimodality * gaps, outliers * rounding, e.g. to integer values, or heaping, i.e. a few particular values occur very frequently * impossible or suspicious values ## Histograms ### Histogram Basics * Historams are constructed by binning the data and counting the number of observations in each bin. * The objective is usually to visualize the shape of the distribution. * The number of bins needs to be * large enough to reveal interesting features; * small enough not to be too noisy. * A very small bin width can be used to look for rounding or heaping. * Common choices for the vertical scale are * bin counts, or frequencies * counts per unit, or densities * The count scale is more intepretable for lay viewers. * The density scale is more suited for comparison to mathematical density models. * Constructing histograms with unequal bin widths is possible but rarely a good idea. ### Histograms in R There are many ways to plot histograms in R: * the `hist` function in the base `graphics` package; * `truehist` in package `MASS`; * `histogram` in package `lattice`; * `geom_histogram` in package `ggplot2`. ```{r, echo = FALSE} library(lattice) library(ggplot2) ``` A histogram of eruption durations for another data set on Old Faithful eruptions, this one from package `MASS`: ```{r} library(MASS) histogram(~ duration, data = geyser) ``` The default setting using `geom_histogram` are less than ideal: ```{r} ggplot(geyser) + geom_histogram(aes(x = duration)) ``` Using a binwidth of 0.5 and customized `fill` and `color` settings produces a better result: ```{r} ggplot(geyser) + geom_histogram(aes(x = duration), binwidth = 0.5, fill = "grey", color = "black") ``` Reducing the bin width shows an interesting feature: ```{r} ggplot(geyser) + geom_histogram(aes(x = duration), binwidth = 0.05, fill = "grey", color = "black") ``` * Eruptions were sometimes classified as _short_ or _long_; these were coded as 2 and 4 minutes. * For many purposes this kind of heaping or rounding does not matter. * It would matter if we wanted to estimate means and standard deviation of the durations of the long eruptions. * More data and information about geysers is available at https://geysertimes.org/ and https://gosa.org/. * For exploration there is no one "correct" bin width or number of bins. It would be very useful to be able to change this parameter interactively. ### Superimposing a Density A histogram can be used to compare the data distribution to a theoretical model, such as a normal distribution. This requires using a density scale for the vertical axis. The `Galton` data frame in the `UsingR` package is one of several data sets used by Galton to study the heights of parents and their children. Using the base graphics `hist` function we can compare the data distribution of parent heights to a normal distribution with mean and standard deviation corresponding to the data: ```{r, message = FALSE} library(UsingR) hist(Galton$parent, freq = FALSE) x <- seq(64, 74, length.out = 100) y <- with(Galton, dnorm(x, mean(parent), sd(parent))) lines(x, y, col = "red") ``` Adding a normal density curve to a `ggplot` histogram is similar: * create the histogram with a density scale; * create the curve data in a separate data frame; * add the curve as another layer. Create the histogram with a density scale using the _computed varlable_ `..density..`: ```{r} p <- ggplot(Galton) + geom_histogram(aes(x = parent, y = ..density..), binwidth = 1, fill = "grey", color = "black") p ``` Create the curve data: ```{r} x <- seq(64, 74, length.out = 100) df <- with(Galton, data.frame(x = x, y = dnorm(x, mean(parent), sd(parent)))) ``` Add the curve to the base plot: ```{r} p + geom_line(data = df, aes(x = x, y = y), color = "red") ``` For a lattice histogram, the curve would be added in a _panel function_: ```{r} histogram(~ parent, data = Galton, type = "density", panel = function(x, ...) { panel.histogram(x, ...) xn <- seq(min(x), max(x), length.out = 100) yn <- dnorm(xn, mean(x), sd(x)) panel.lines(xn, yn, col = "red") }) ``` ### Scalability Histograms scale very well. ```{r} hist(diamonds$price) ``` * The visual performance does not deteriorate with increasing numbers of observations. * The computational effort needed is linear in the number of observations. * The amount of storage needed for an image object is linear in the number of bins. ## Density Plots ### Density Plot Basics * Density plots can be thought of as plots of smoothed histograms. * The smoothness is controlled by a _bandwidth_ parameter that is analogous to the histogram binwidth. * Most density plots use a [_kernel density estimate_](https://en.wikipedia.org/wiki/Kernel_density_estimation), but there are other possible strategies; qualitatively the particular strategy rarely matters. Using base graphics, a density plot of the `geyser` `duration` variable with default bandwidth: ```{r} plot(density(geyser$duration)) ``` Using a smaller bandwidth shows the heaping at 2 and 4 minutes: ```{r} plot(density(geyser$duration, bw = 0.05)) ``` For a moderate number of observations a useful addition is a jittered _rug plot_: ```{r} plot(density(geyser$duration, bw = 0.05)) rug(jitter(geyser$duration)) ``` The `lattice` `densityplot` function by default adds a jittered strip plot of the data to the bottom: ```{r} densityplot(~ duration, data = geyser) ``` To produce a density plot with a jittered rug in `ggplot`: ```{r} ggplot(geyser) + geom_density(aes(x = duration)) + geom_rug(aes(x = duration, y = 0), position = position_jitter(height = 0)) ``` ### Scalability Visual scalability is very good ```{r} plot(density(diamonds$price)) ``` * Density estimates are generally computed at a grid of points and interpolated. Defaults in R vary from 50 to 512 points. * Computational effort for a density estimate at a point is proportional to the number of observations. * Storage needed for an image is proportional to the number of point where the density is estimated. ### Grouping and Faceting Both `ggplot` and `lattice` make it easy to show multiple densities for different subgroups in a single plot. `lattice` uses the `group` argument: ```{r} densityplot(~ yield, group = site, data = barley, auto.key = TRUE) ``` In `ggplot` you can map the `site` variable to an aesthetic, such as `color`: ```{r} ggplot(barley) + geom_density(aes(x = yield, color = site)) ``` Using `fill` and `alpha` can also be useful: ```{r} ggplot(barley) + geom_density(aes(x = yield, fill = site), alpha = 0.2) ``` Multiple densities in a single plot works best with a smaller number of categories, say 2 or 3. Often a more effective approach is to use the idea of [small multiples](https://en.wikipedia.org/wiki/Small_multiple), collections of charts designed to facilitate comparisons. Lattice uses the term _lattice plots_ or _trellis plots_. These plots are specified using the `|` operator in a formula: ```{r} densityplot(~ yield | site, data = barley) ``` Comparison is facilitated by using common axes. These ideas can be combined: ```{r} densityplot(~ yield | site, group = year, data = barley, auto.key = TRUE) ``` `ggplot` uses the notion of _faceting_: ```{r} ggplot(barley) + geom_density(aes(x = yield)) + facet_wrap(~ site) ``` Again this can be combined with the `color` aesthetic: ```{r} ggplot(barley) + geom_density(aes(x = yield, color = year)) + facet_wrap(~ site) ``` Both the `lattice` and `ggplot` versions show lower yields for 1932 than for 1931 for all sites except Morris. * Cleveland suggest this may indicate a data entry error for Morris. * A [recent paper](https://blog.revolutionanalytics.com/2014/07/theres-no-mistake-in-the-barley-data.html) suggests there may be no error. ### Interactive Bandwidth Choice Being able to chose the bandwidth of a density plot, or the binwidth of a histogram interactively is useful for exploration. One way to do this in R: ```{r, eval = FALSE} library(UsingR) source("https://stat.uiowa.edu/~luke/classes/STAT7400/examples/tkdens.R") tkdens(geyser$duration, tkrplot = TRUE) tkdens(Galton$parent, tkrplot = TRUE) ```