---
title: "Strip Plots"
output:
  html_document:
    toc: yes
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

## Some Features to Look For

Some things to keep an eye out for when looking at data on a numeric variable:

* skewness, multimodality

* gaps, outliers

* rounding, e.g. to integer values, or _heaping_, i.e. a few particular values occur very frequently

* impossible or suspicious values

## Strip Plots

### Basics

```{r, include = FALSE}
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
if (! file.exists("citytemps.dat"))
    download.file("http://www.stat.uiowa.edu/~luke/data/citytemps.dat",
                  "citytemps.dat")
citytemps <- read.table("citytemps.dat", header = TRUE)
```

A variant of the dot plot is known as a _strip plot_.

A strip plot for the city temperature data is

```{r, warning = FALSE}
p1 <- stripplot(~ temp, data = citytemps)
p2 <- ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
grid.arrange(p1, p2, nrow = 1)
```

One way to reduce the vertical space is to use the _chunk option_ `fig.height = 2`, which produces

```{r, fig.height = 2, echo = FALSE, warning = FALSE}
ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
```

The strip plot can reveal gaps and outliers.

After looking at the plot we might want to examine the high and low values:

```{r}
filter(citytemps, temp > 85)
filter(citytemps, temp < 10)
```

### Multiple Samples

The strip plot is most useful for showing subsets corresponding to a categorical variable.

A strip plot for the yields for different varieties in the barley data is

```{r}
ggplot(barley) + geom_point(aes(x = yield, y = variety))
```

### Scalability

Scalability in this form is limited due to over-plotting.
A simple strip plot of `price` within the different `cut` levels in the `diamonds` data is not very helpful:

```{r}
ggplot(diamonds) + geom_point(aes(x = price, y = cut))
```

Several approaches are available to reduce the impact of over-plotting:

* reducing the point size;

* randomly displacing the points, called _jittering_;

* making the points translucent, or _alpha blending_.

Combining all three produces

```{r}
ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
```

Skewness of the price distributions can be seen in this plot, though other approaches will show this more clearly.

A peculiar feature revealed by this plot is the gap below 2000. Examining the subset with `price < 2000` shows the gap is roughly symmetric around 1500:

```{r}
ggplot(filter(diamonds, price < 2000)) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
```

### Some Notes

* With a good combination of point size, jittering, and alpha blending, the strip plot for groups of data can scale to several hundred thousand observations and ten to twenty groups.

* Strip plots can reveal gaps, outliers, and data outside of the expected range.

* Skewness and multi-modality can be seen, but other visualizations show these more clearly.

* Storage needed for vector graphics images grows linearly with the number of observations.

Base graphics provides `stripchart`:

```{r}
stripchart(yield ~ variety, data = barley)
```

Lattice provides `stripplot`:

```{r}
stripplot(variety ~ yield, data = barley)
```
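
A note on the jittered `diamonds` plots above: `position = "jitter"` displaces points in both directions by default. For a strip plot it may be preferable to jitter only along the categorical axis, so the plotted numeric values stay exact. The following is a minimal sketch of that variant using `position_jitter`; the `height` and `seed` values are illustrative assumptions, not settings taken from these notes.

```r
library(ggplot2)

## Jitter only vertically (along the categorical `cut` axis) so the
## plotted prices are not perturbed. The seed makes the random
## displacement reproducible. height = 0.3 and seed = 1 are
## illustrative choices.
pj <- position_jitter(width = 0, height = 0.3, seed = 1)

p <- ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, alpha = 0.2, position = pj)
p
```

Fixing the seed also means the same figure is produced each time the document is knit, which can make it easier to compare versions of a plot.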