---
title: "Strip Plots"
output:
  html_document:
    toc: yes
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

## Some Features to Look For

Some things to keep an eye out for when looking at data on a numeric
variable:

* skewness, multimodality

* gaps, outliers

* rounding, e.g. to integer values, or _heaping_, i.e. a few particular
  values occur very frequently

* impossible or suspicious values


## Strip Plots

### Basics

```{r, include = FALSE}
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)

if (! file.exists("citytemps.dat"))
    download.file("http://www.stat.uiowa.edu/~luke/data/citytemps.dat",
                  "citytemps.dat")
citytemps <- read.table("citytemps.dat", header = TRUE)
```

A variant of the dot plot is known as a _strip plot_. Strip plots of
the city temperature data, drawn with lattice (left) and ggplot2
(right), are
```{r, warning = FALSE}
p1 <- stripplot(~ temp, data = citytemps)
p2 <- ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
grid.arrange(p1, p2, nrow = 1)
```

A single strip does not need much vertical space. One way to reduce
the figure height is the knitr _chunk option_ `fig.height = 2`, which
produces

```{r, fig.height = 2, echo = FALSE, warning = FALSE}
ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
```

The strip plot can reveal gaps and outliers.

After looking at the plot we might want to examine the high and
low values:

```{r}
filter(citytemps, temp > 85)
filter(citytemps, temp < 10)
```

### Multiple Samples

The strip plot is most useful for showing subsets corresponding to a
categorical variable.

A strip plot for the yields for different varieties in the barley data
is

```{r}
ggplot(barley) + geom_point(aes(x = yield, y = variety))
```

### Scalability

The basic strip plot does not scale well to large data sets because of
over-plotting.

A simple strip plot of `price` within the different `cut` levels in
the `diamonds` data is not very helpful:

```{r}
ggplot(diamonds) + geom_point(aes(x = price, y = cut))
```

Several approaches are available to reduce the impact of over-plotting:

* reducing the point size;

* randomly displacing the points, called _jittering_;

* making the points translucent, called _alpha blending_.

Combining all three produces

```{r}
ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
```

Skewness of the price distributions can be seen in this plot, though
other approaches will show this more clearly.
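
With `position = "jitter"` ggplot2 jitters in both directions, by
default by 40% of the resolution of the data. For a strip plot it is
often preferable to jitter only along the categorical axis so the
numeric values stay exact. A minimal sketch using `position_jitter`;
the jitter height of 0.3 and the seed are arbitrary choices made here
for illustration:

```{r}
library(ggplot2)  # provides the diamonds data

# Jitter vertically only: width = 0 leaves the prices untouched,
# height = 0.3 keeps each strip within its own band.
# The seed makes the random displacement reproducible.
p_jit <- ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, alpha = 0.2,
               position = position_jitter(width = 0, height = 0.3,
                                          seed = 1))
p_jit
```

Increasing `height` toward 0.5 fills more of the space between strips;
values above 0.5 would let neighboring strips overlap.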

A peculiar feature revealed by the jittered plot is a gap in the
prices below 2000. Examining the subset with `price < 2000` shows the
gap is roughly symmetric around 1500:
```{r}
ggplot(filter(diamonds, price < 2000)) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
```
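
The location of the gap can also be checked numerically. One possible
approach, sketched here, is to sort the distinct prices below 2000 and
look for the largest jump between consecutive values. The variable
names are chosen for this illustration only:

```{r}
library(ggplot2)  # provides the diamonds data

# Distinct prices below 2000, in increasing order
p <- sort(unique(diamonds$price[diamonds$price < 2000]))

# Position of the largest jump between consecutive distinct prices
i <- which.max(diff(p))

# Prices on either side of the largest jump, and their midpoint
gap <- c(lower = p[i], upper = p[i + 1])
gap
mean(gap)
```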

### Some Notes

* With a good combination of point size, jittering, and alpha
  blending, the strip plot for groups of data can scale to several
  hundred thousand observations and ten to twenty groups.

* Strip plots can reveal gaps, outliers, and data outside of the
  expected range.

* Skewness and multi-modality can be seen, but other visualizations
  show these more clearly.

* Storage needed for vector graphics images grows linearly with the
  number of observations.
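
The growth in storage for vector graphics can be checked by saving the
same plot as PDF for increasing numbers of rows. This is a rough
sketch; `pdf_size` is a helper defined here for illustration, and the
exact byte counts will depend on the device and its compression
settings:

```{r}
library(ggplot2)  # provides the diamonds data

# Size in bytes of a PDF strip plot of the first n rows of diamonds
pdf_size <- function(n) {
    f <- tempfile(fileext = ".pdf")
    on.exit(unlink(f))  # remove the temporary file afterwards
    p <- ggplot(diamonds[seq_len(n), ]) +
        geom_point(aes(x = price, y = cut))
    ggsave(f, p, width = 6, height = 3)
    file.size(f)
}

n <- c(1000, 5000, 25000)
sizes <- sapply(n, pdf_size)
data.frame(n = n, bytes = sizes)
```

A raster format such as PNG avoids this growth, since its size depends
on the image dimensions and content rather than on the number of
points drawn.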

Base graphics provides `stripchart`:

```{r}
stripchart(yield ~ variety, data = barley)
```

Lattice provides `stripplot`:

```{r}
stripplot(variety ~ yield, data = barley)
```
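
Both functions also support jittering. A brief sketch, assuming the
`barley` data from lattice as above; the jitter amount and `alpha`
value are arbitrary choices for illustration:

```{r}
library(lattice)  # provides stripplot and the barley data

# Base graphics: method = "jitter" displaces points within each strip
stripchart(yield ~ variety, data = barley,
           method = "jitter", jitter = 0.2, pch = 1)

# Lattice: jitter.data = TRUE jitters along the categorical axis;
# alpha makes the points translucent
p_lat <- stripplot(variety ~ yield, data = barley,
                   jitter.data = TRUE, alpha = 0.5)
p_lat
```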