---
title: "Strip Plots"
output:
  html_document:
    toc: yes
---
```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```
## Some Features to Look For
Some things to keep an eye out for when looking at data on a numeric
variable:
* skewness, multimodality
* gaps, outliers
* rounding, e.g. to integer values, or _heaping_, i.e. a few particular
values occur very frequently
* impossible or suspicious values
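
These features can also be checked numerically before plotting. A minimal sketch, assuming the `carat` variable of the `diamonds` data from `ggplot2` as a stand-in for any numeric variable; the choice of variable and the particular checks are illustrative only:

```{r}
library(ggplot2)  # assumption: used here only for the diamonds data
x <- diamonds$carat  # stand-in numeric variable
# share of values that are whole numbers: a crude check for rounding
mean(x == round(x))
# the five most frequent distinct values: a crude check for heaping
head(sort(table(x), decreasing = TRUE), 5)
# smallest and largest values: a quick check for impossible values
range(x)
```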
## Strip Plots
### Basics
```{r, include = FALSE}
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
if (! file.exists("citytemps.dat"))
    download.file("http://www.stat.uiowa.edu/~luke/data/citytemps.dat",
                  "citytemps.dat")
citytemps <- read.table("citytemps.dat", header = TRUE)
```
A variant of the dot plot is known as a _strip plot_. A strip plot for
the city temperature data, drawn with `lattice` on the left and
`ggplot2` on the right, is
```{r, warning = FALSE}
p1 <- stripplot(~ temp, data = citytemps)
p2 <- ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
grid.arrange(p1, p2, nrow = 1)
```
A single strip needs less vertical space than the default figure
height provides. One way to reduce the vertical space is to use the
_chunk option_ `fig.height = 2`, which produces
```{r, fig.height = 2, echo = FALSE, warning = FALSE}
ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
```
The strip plot can reveal gaps and outliers.
After looking at the plot we might want to examine the high and
low values:
```{r}
filter(citytemps, temp > 85)
filter(citytemps, temp < 10)
```
### Multiple Samples
The strip plot is most useful for comparing subsets defined by the
levels of a categorical variable.
A strip plot of the yields for the different varieties in the `barley`
data from the `lattice` package is
```{r}
ggplot(barley) + geom_point(aes(x = yield, y = variety))
```
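The order of the groups affects how easy the strips are to compare. A possible refinement, sketched here under the assumption that ordering the varieties by median yield is useful; the use of `reorder` and the name `variety_by_median` are choices made for this illustration:

```{r}
library(ggplot2)
library(lattice)  # provides the barley data
# order the variety levels by increasing median yield
variety_by_median <- with(barley, reorder(variety, yield, FUN = median))
p <- ggplot(barley) +
    geom_point(aes(x = yield, y = variety_by_median)) +
    labs(y = "variety")
p
```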
### Scalability
In this basic form the strip plot does not scale well to larger data
sets because of over-plotting.
A simple strip plot of `price` within the different `cut` levels in
the `diamonds` data is not very helpful:
```{r}
ggplot(diamonds) + geom_point(aes(x = price, y = cut))
```
Several approaches are available to reduce the impact of over-plotting:
* reducing the point size;
* randomly displacing the points, called _jittering_;
* making the points translucent, called _alpha blending_.
Combining all three produces
```{r}
ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
```
Skewness of the price distributions can be seen in this plot, though
other approaches will show this more clearly.
A peculiar feature revealed by this plot is the gap below 2000.
Examining the subset with `price < 2000` shows the gap is
roughly symmetric around 1500:
```{r}
ggplot(filter(diamonds, price < 2000)) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
```
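The location of the gap can also be checked numerically. A rough sketch, assuming the gap is the widest jump between consecutive distinct prices below 2000; the names `prices`, `jumps`, and `i` are introduced for this illustration:

```{r}
library(ggplot2)  # for the diamonds data
# distinct prices below 2000, in increasing order
prices <- sort(unique(diamonds$price[diamonds$price < 2000]))
# differences between consecutive distinct prices
jumps <- diff(prices)
# the widest jump marks the two edges of the gap
i <- which.max(jumps)
c(lower = prices[i], upper = prices[i + 1])
```

The midpoint of the two values returned can be compared with the value of 1500 read off the plot.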
### Some Notes
* With a good combination of point size, jittering, and alpha
blending, the strip plot for groups of data can scale to several
hundred thousand observations and ten to twenty groups.
* Strip plots can reveal gaps, outliers, and data outside of the
expected range.
* Skewness and multimodality can be seen, but other visualizations
show these more clearly.
* Storage needed for vector graphics images grows linearly with the
number of observations.
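
As an illustration of the first note, jittering can be restricted to the categorical direction so that the plotted prices are not perturbed. A sketch, assuming `geom_jitter` with `width = 0`; the value `height = 0.3` is an arbitrary choice that may need tuning for other data:

```{r}
library(ggplot2)
p <- ggplot(diamonds) +
    # width = 0 leaves price unchanged; height spreads points within each cut
    geom_jitter(aes(x = price, y = cut),
                width = 0, height = 0.3,
                size = 0.2, alpha = 0.2)
p
```

Keeping the jitter height well below 0.5 leaves a visible band of white space between adjacent groups.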
Base graphics provides `stripchart`:
```{r}
stripchart(yield ~ variety, data = barley)
```
Lattice provides `stripplot`:
```{r}
stripplot(variety ~ yield, data = barley)
```
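Both functions can also jitter the points. A sketch, assuming the `method` and `jitter` arguments of `stripchart` and the `jitter.data` argument of `stripplot`; the jitter amount and the plotting symbol are arbitrary choices:

```{r}
library(lattice)  # provides stripplot and the barley data
# base graphics: jitter the points within each variety
stripchart(yield ~ variety, data = barley,
           method = "jitter", jitter = 0.2, pch = 1)
# lattice: the same idea via jitter.data
p <- stripplot(variety ~ yield, data = barley, jitter.data = TRUE)
p
```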