---
title: "Histograms and Density Plots"
output:
html_document:
toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
## Feature to Look For
Some things to keep an eye out for when looking at data on a numeric
variable:
* skewness, multimodality
* gaps, outliers
* rounding, e.g. to integer values, or heaping, i.e. a few particular
values occur very frequently
* impossible or suspicious values
## Histograms
### Histogram Basics
* Historams are constructed by binning the data and counting the
number of observations in each bin.
* The objective is usually to visualize the shape of the distribution.
* The number of bins needs to be
* large enough to reveal interesting features;
* small enough not to be too noisy.
* A very small bin width can be used to look for rounding or heaping.
* Common choices for the vertical scale are
* bin counts, or frequencies
* counts per unit, or densities
* The count scale is more intepretable for lay viewers.
* The density scale is more suited for comparison to mathematical
density models.
* Constructing histograms with unequal bin widths is possible but
rarely a good idea.
### Histograms in R
There are many ways to plot histograms in R:
* the `hist` function in the base `graphics` package;
* `truehist` in package `MASS`;
* `histogram` in package `lattice`;
* `geom_histogram` in package `ggplot2`.
```{r, echo = FALSE}
library(lattice)
library(ggplot2)
```
A histogram of eruption durations for another data set on Old Faithful
eruptions, this one from package `MASS`:
```{r}
library(MASS)
histogram(~ duration, data = geyser)
```
The default setting using `geom_histogram` are less than ideal:
```{r}
ggplot(geyser) + geom_histogram(aes(x = duration))
```
Using a binwidth of 0.5 and customized `fill` and `color` settings
produces a better result:
```{r}
ggplot(geyser) +
geom_histogram(aes(x = duration),
binwidth = 0.5, fill = "grey", color = "black")
```
Reducing the bin width shows an interesting feature:
```{r}
ggplot(geyser) +
geom_histogram(aes(x = duration),
binwidth = 0.05, fill = "grey", color = "black")
```
* Eruptions were sometimes classified as _short_ or _long_; these were
coded as 2 and 4 minutes.
* For many purposes this kind of heaping or rounding does not matter.
* It would matter if we wanted to estimate means and standard
deviation of the durations of the long eruptions.
* More data and information about geysers is available at
http://geysertimes.org/ and
http://www.geyserstudy.org/geyser.aspx?pGeyserNo=OLDFAITHFUL.
* For exploration there is no one "correct" bin width or number of
bins. It would be very useful to be able to change this parameter
interactively.
### Superimposing a Density
A histogram can be used to compare the data distribution to a
theoretical model, such as a normal distribution.
This requires using a density scale for the vertical axis.
The `Galton` data frame in the `UsingR` package is one of several data
sets used by Galton to study the heights of parents and their
children.
Using the base graphics `hist` function we can compare the data
distribution of parent heights to a normal distribution with mean and
standard deviation corresponding to the data:
```{r, message = FALSE}
library(UsingR)
hist(Galton$parent, freq = FALSE)
x <- seq(64, 74, length.out=100)
y <- with(Galton, dnorm(x, mean(parent), sd(parent)))
lines(x, y, col = "red")
```
Adding a normal density curve to a `ggplot` histogram is similar:
* create the histogram with a density scale;
* create the curve data in a separate data frame;
* add the curve as another layer.
Create the histogram with a density scale using the _computed varlable_
`..density..`:
```{r}
p <- ggplot(Galton) +
geom_histogram(aes(x = parent, y = ..density..),
binwidth = 1, fill = "grey", color = "black")
p
```
Create the curve data:
```{r}
x <- seq(64, 74, length.out=100)
df <- with(Galton, data.frame(x = x, y = dnorm(x, mean(parent), sd(parent))))
```
Add the curve to the base plot:
```{r}
p + geom_line(data = df, aes(x = x, y = y), color = "red")
```
For a lattice histogram, the curve would be added in a _panel function_:
```{r}
histogram(~ parent, data = Galton, type = "density", panel = function(x, ...) {
panel.histogram(x, ...)
xn <- seq(min(x), max(x), length.out = 100)
yn <- dnorm(xn, mean(x), sd(x))
panel.lines(xn, yn, col = "red")
})
```
### Scalability
Histograms scale very well.
```{r}
hist(diamonds$price)
```
* The visual performance does not deteriorate with increasing numbers
of observations.
* The computational effort needed is linear in the number of observations.
* The amount of storage needed for an image object is linear in the
number of bins.
## Density Plots
### Density Plot Basics
* Density plots can be thought of as plots of smoothed histograms.
* The smoothness is controlled by a _bandwidth_ parameter that is
analogous to the histogram binwidth.
* Most density plots use a [_kernel density
estimate_](https://en.wikipedia.org/wiki/Kernel_density_estimation),
but there are other possible strategies; qualitatively the
particular strategy rarely matters.
Using base graphics, a density plot of the `geyser` `duration` variable
with default bandwidth:
```{r}
plot(density(geyser$duration))
```
Using a smaller bandwidth shows the heaping at 2 and 4 minutes:
```{r}
plot(density(geyser$duration, bw = 0.05))
```
For a moderate number of observations a useful addition is a jittered
_rug plot_:
```{r}
plot(density(geyser$duration, bw = 0.05))
rug(jitter(geyser$duration))
```
The `lattice` `densityplot` function by default adds a jittered strip
plot of the data to the bottom:
```{r}
densityplot(~ duration, data = geyser)
```
To produce a density plot with a jittered rug in `ggplot`:
```{r}
ggplot(geyser) +
geom_density(aes(x = duration)) +
geom_rug(aes(x = duration, y = 0), position = position_jitter(height = 0))
```
### Scalability
Visual scalability is very good
```{r}
plot(density(diamonds$price))
```
* Density estimates are generally computed at a grid of points and
interpolated. Defaults in R vary from 50 to 512 points.
* Computational effort for a density estimate at a point is
proportional to the number of observations.
* Storage needed for an image is proportional to the number of point
where the density is estimated.
### Grouping and Faceting
Both `ggplot` and `lattice` make it easy to show multiple densities
for different subgroups in a single plot.
`lattice` uses the `group` argument:
```{r}
densityplot(~ yield, group = site, data = barley, auto.key = TRUE)
```
In `ggplot` you can map the `site` variable to an aesthetic, such as
`color`:
```{r}
ggplot(barley) + geom_density(aes(x = yield, color = site))
```
Using `fill` and `alpha` can also be useful:
```{r}
ggplot(barley) + geom_density(aes(x = yield, fill = site), alpha = 0.2)
```
Multiple densities in a single plot works best with a smaller number of
categories, say 2 or 3.
Often a more effective approach is to use the idea of
[small multiples](https://en.wikipedia.org/wiki/Small_multiple),
collections of charts designed to facilitate comparisons.
Lattice uses the term _lattice plots_ or _trellis plots_. These plots
are specified using the `|` operator in a formula:
```{r}
densityplot(~ yield | site, data = barley)
```
Comparison is facilitated by using common axes.
These ideas can be combined:
```{r}
densityplot(~ yield | site, group = year, data = barley, auto.key = TRUE)
```
`ggplot` uses the notion of _faceting_:
```{r}
ggplot(barley) + geom_density(aes(x = yield)) + facet_wrap(~site)
```
Again this can be combined with the `color` aesthetic:
```{r}
ggplot(barley) + geom_density(aes(x = yield, color = year)) + facet_wrap(~site)
```
Both the `lattice` and `ggplot` versions show lower yields for 1932
than for 1931 for all sites except Morris.
* Cleveland suggest this may indicate a data entry error for Morris.
* A [recent
paper](http://blog.revolutionanalytics.com/2014/07/theres-no-mistake-in-the-barley-data.html)
suggests there may be no error.
### Interactive Bandwidth Choice
Being able to chose the bandwidth of a density plot, or the binwidth
of a histogram interactively is useful for exploration.
One way to do this in R:
```{r, eval = FALSE}
library(UsingR)
source("http://homepage.divms.uiowa.edu/~luke/classes/STAT7400/examples/tkdens.R")
tkdens(geyser$duration, tkrplot=TRUE)
tkdens(Galton$parent, tkrplot=TRUE)
```