---
title: "Dot Plots and Bar Charts"
output:
  html_document:
    toc: yes
    code_download: true
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

```{r, include = FALSE}
source("datasets.R")
```

## Considerations to Keep in Mind

As we look at visualizations, a reminder of some considerations:

* Task levels for visualization, from highest to lowest level:

    * **Analyze**: Identify patterns, distributions, presence of outliers or clusters, other interesting features.

    * **Search**: Look up aspects of a feature known in advance or revealed by the visualization.

    * **Query**: Identify, compare features of individual items.

    Each higher level builds on the levels below.

* Scalability:

    * How well do these visualizations work for larger data sets?

    * Are there variations that can help for larger data sets?

## Dot Plots

### Basics

One of the simplest visualizations of a single numerical variable with a modest number of observations and labels for the observations is a _dot plot_, or _Cleveland dot plot_:

```{r}
library(ggplot2)
ggplot(Playfair) +
    geom_point(aes(x = population, y = city))
```

This visualization

* shows the overall distribution of the data, and

* makes it easy to locate the population of a particular city.

A useful variation is to order the vertical position by rank:

```{r}
ggplot(Playfair) +
    geom_point(aes(x = population, y = reorder(city, population)))
```

With this ordering

* locating a particular city is a little more difficult; but

* the shape of the distribution is more apparent, and

* approximate median and quartiles can be read off.

This visualization is often very useful for group summaries.

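The claim that the median and quartiles can be read off a rank-ordered dot plot can be checked numerically. The following is a minimal sketch using made-up values (the vector `pop` is hypothetical and is not the `Playfair` data): with nine observations, the rows 3, 5, and 7 of the ordered plot correspond to the quartiles and median that `quantile` computes by default.

```r
# Hypothetical populations (in thousands) for nine made-up cities
pop <- c(A = 80, B = 950, C = 120, D = 300, E = 60,
         F = 210, G = 540, H = 150, I = 90)

# Sorting gives the bottom-to-top order used in a rank-ordered dot plot
ordered_pop <- sort(pop)

# With n = 9 rows, the default quantile rule puts the lower quartile,
# median, and upper quartile at positions 1 + (n - 1) * p = 3, 5, 7
rows <- 1 + (length(pop) - 1) * c(0.25, 0.5, 0.75)
read_off <- ordered_pop[rows]

# The values computed directly, for comparison
computed <- quantile(pop, c(0.25, 0.5, 0.75))

read_off
computed
```

For other sample sizes the positions generally fall between rows, which is why the values read off the plot are only approximate.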
### Larger Data Sets

Using labels can become impractical for larger numbers of observations:

```{r}
ggplot(citytemps) +
    geom_point(aes(x = temp, y = reorder(city, temp)))
```

Instead we can use ranks:

```{r}
ggplot(citytemps) +
    geom_point(aes(x = temp, y = rank(temp)))
```

Using ranks or percentiles, this visualization in principle scales to much larger data sets:

```{r}
ggplot(diamonds) +
    geom_point(aes(x = price, y = 100 * rank(price) / length(price)))
```

With a data set of this size there is little visual difference between a dot plot and a plot that interpolates the points:

```{r}
ggplot(diamonds) +
    geom_line(aes(x = price, y = 100 * rank(price) / length(price)))
```

Both the dot plot and the interpolated version use a lot of resources for computing and storing the plot, with diminishing visual returns.

A more effective approach for larger data sets is to fix a set of percentages and plot against the corresponding percentiles:

```{r}
p <- seq(0, 1, length.out = 100)
dm <- data.frame(pct = 100 * p, price = quantile(diamonds$price, p))
ggplot(dm) + geom_line(aes(x = price, y = pct))
```

A similar result can be obtained with `stat_ecdf`.

### Some Variations

The size of the dots can be used to encode an additional numeric variable.

We can compute the approximate area for the cities in the `Playfair` data frame as `pi * (diameter / 2) ^ 2`:

```{r, message = FALSE}
library(dplyr)
PlayfairA <- mutate(Playfair, area = pi * (diameter / 2) ^ 2)
ggplot(PlayfairA) +
    geom_point(aes(x = population,
                   y = reorder(city, population),
                   size = area)) +
    scale_size_area()
```

For the `barley` data we have two yield measures, one per year, for each site/variety combination. It can be useful to

* place both measures for a site/variety combination on one line, and

* identify the year using color:

```{r, message = FALSE}
barley <- mutate(barley, sitevar = paste(site, variety, sep = ", "))
ggplot(barley) +
    geom_point(aes(x = yield, y = sitevar, color = year))
```

Reordering the lines based on the yield in 1931 may be useful.

First step: Isolate the 1931 yields:

```{r}
b31 <- filter(barley, year == 1931)
b31 <- select(b31, yield31 = yield, sitevar)
head(b31)
```

Second step: Use `left_join` to merge the 1931 yields with the barley data frame:

```{r}
## sitevar is the only column the two data frames share
barley31 <- left_join(barley, b31, by = "sitevar")
head(barley31)
```

Now plot the data:

```{r}
ggplot(barley31) +
    geom_point(aes(x = yield,
                   y = reorder(sitevar, yield31),
                   color = year))
```

### Base and Lattice Graphics

Base graphics provides the `dotchart` function:

```{r}
dotchart(Playfair$population, labels = Playfair$city)
```

The `lattice` package provides `dotplot`:

```{r}
library(lattice)
dotplot(reorder(city, population) ~ population, data = Playfair)
```

Most lattice plots support a `groups` argument that is usually mapped to color:

```{r}
dotplot(reorder(sitevar, yield31) ~ yield,
        groups = year,
        data = barley31,
        auto.key = TRUE)
```

## Bar Charts

### Basics

Bar charts are most commonly used to show frequencies for categorical data.

They are also usually drawn vertically:

```{r}
ggplot(diamonds) + geom_bar(aes(x = color))
```

It is possible to persuade `geom_bar` to encode the value of a numeric variable as bar height by

* assigning the variable to the `y` aesthetic, and

* specifying `stat = "identity"`.

The default `stat`, `"count"`, counts the cases at each `x` value and uses the counts as the `y` aesthetic.

```{r}
ggplot(Playfair) +
    geom_bar(aes(y = population, x = reorder(city, population)),
             stat = "identity")
```

Slightly simpler is to use `geom_col`:

```{r}
ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population)))
```

To make labels readable we can flip the plot:

```{r}
ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population))) +
    coord_flip()
```

[ In `ggplot2` versions before 3.3.0, switching `x` and `y` doesn't do what you want.]

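In `ggplot2` 3.3.0 and later, `geom_col` and `geom_bar` infer the bar orientation from the aesthetic mapping, so mapping the numeric variable to `x` and the categories to `y` draws horizontal bars without `coord_flip()`. The following is a minimal sketch, assuming `ggplot2` 3.3.0 or later; the data frame `cities` and its values are made up for the example and stand in for `Playfair`.

```r
library(ggplot2)

# Hypothetical data (made-up values), standing in for Playfair
cities <- data.frame(city = c("Alpha", "Beta", "Gamma"),
                     population = c(120, 450, 80))

# Numeric variable on x, categories on y: with ggplot2 >= 3.3.0 the
# bars run horizontally from zero, with no coord_flip() needed
p_horiz <- ggplot(cities) +
    geom_col(aes(x = population, y = reorder(city, population)))

# The computed layer data: each bar spans xmin = 0 to xmax = population
bars <- layer_data(p_horiz)
```

With older releases the `coord_flip()` approach shown in these notes remains the way to get horizontal bars.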
Reducing the bar width may be helpful:

```{r}
ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population)),
             width = 0.3) +
    coord_flip()
```

Creating this basic bar chart in `lattice` is easier:

```{r}
barchart(reorder(city, population) ~ population, data = Playfair)
```

Base graphics also provide a bar chart with the `barplot` function.

```{r}
opar <- par(mar = c(5, 5, 4, 2) + 0.1)
with(Playfair,
     barplot(population,
             horiz = TRUE,
             names.arg = city,
             las = 1,
             cex.names = 0.7,
             cex.axis = 0.7))
par(opar)
```

### Comparisons and the Zero Baseline Issue

Bar charts seem to be used much more than dot plots in the popular media.

But they are less widely applicable, and have one dangerous feature, sometimes called the _zero baseline issue_.

Because of the way our perception works, when we look at a bar chart we focus on the length of the bar, the distance from the baseline, even when the baseline is not meaningful.

Look at these views of temperatures in the subset of cities where temperatures are between 60 and 70 degrees.

```{r, fig.width = 10, message = FALSE}
library(gridExtra)
c6070 <- filter(citytemps, temp >= 60 & temp <= 70)
p1 <- barchart(city ~ temp, c6070)
p2 <- dotplot(city ~ temp, c6070)
grid.arrange(p1, p2, nrow = 1)
```

```{r, include = FALSE}
## sanity check
local({
    d <- as.list(citytemps$temp)
    names(d) <- citytemps$city
    stopifnot(d$Cairo == 61,
              d$"Los Angeles" == 67,
              d$"Mexico City" == 66,
              d$"San Fransisco" == 60)
})
```

Take the Cairo-San Francisco pair and the Mexico City-Los Angeles pair.

* Their temperature differences are about the same.

* The bar chart makes the Cairo-San Francisco contrast look much more extreme than the Mexico City-Los Angeles contrast.

With a subset like this the focus is on the difference, not the ratio: in this context these temperatures are _interval data_.

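The effect can be made concrete with a little arithmetic. The sketch below uses the four temperatures from the sanity check in these notes and a hypothetical baseline of 59 degrees; that baseline is an assumption chosen for illustration, since the baseline `lattice` actually draws depends on its axis padding.

```r
# Temperatures for the two pairs of cities discussed above
temps <- c(Cairo = 61, "San Francisco" = 60,
           "Mexico City" = 66, "Los Angeles" = 67)

# Differences within each pair: what a dot plot lets the eye compare
diff_cairo_sf <- temps[["Cairo"]] - temps[["San Francisco"]]
diff_la_mc <- temps[["Los Angeles"]] - temps[["Mexico City"]]

# Hypothetical non-zero baseline just below the smallest value
baseline <- 59
bar_len <- temps - baseline

# Ratios of bar lengths: what a bar chart pushes the eye to compare
ratio_cairo_sf <- bar_len[["Cairo"]] / bar_len[["San Francisco"]]
ratio_la_mc <- bar_len[["Los Angeles"]] / bar_len[["Mexico City"]]

c(diff_cairo_sf, diff_la_mc)
round(c(ratio_cairo_sf, ratio_la_mc), 2)
```

Under this assumed baseline both differences are 1 degree, yet the Cairo bar is twice as long as the San Francisco bar while the Los Angeles bar is only about 1.14 times as long as the Mexico City bar.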
As another example, take the cities in the `Playfair` data with population between 100 and 500 thousand:

```{r, fig.width = 10}
P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15)
p2 <- barchart(city ~ population, data = P15)
grid.arrange(p1, p2, nrow = 1)
```

Someone choosing to look at a restricted range like this is most likely focusing on the differences in population, not the ratio.

The ratio comparisons emphasized by the bar chart are not meaningful.

Setting the origin to zero rescues the bar chart:

```{r, fig.width = 10}
P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15, origin = 0)
p2 <- barchart(city ~ population, data = P15, origin = 0)
grid.arrange(p1, p2, nrow = 1)
```

Either chart allows a ratio comparison or a difference comparison.

* The ratio comparison is a little easier with a bar chart.

* The difference comparison is harder because of the distraction created by the bars.

* For quantity measurements ratios are usually at least meaningful, even if they might not be the primary focus.

Some notes:

* When using bar charts for positive numbers where ratios are meaningful the baseline should _always_ be zero.

* The only exception is when there is a natural non-zero baseline value, such as 32 degrees Fahrenheit (i.e. 0 degrees Celsius) for temperatures.

* You may need to intervene with your software's defaults to make this happen.

* Bar charts always push the viewer to ratio comparisons, whether they are meaningful or not.

* Using a non-zero baseline can therefore [mislead the viewer](https://www.storytellingwithdata.com/blog/2012/09/bar-charts-must-have-zero-baseline).

* Some news organizations seem particularly prone to taking advantage of/falling prey to this issue.

* I used `lattice` in these examples because it is hard to get `geom_bar` or `geom_col` to use a non-zero baseline (which is a good thing!).
* You _can_ create a bar chart with a non-zero baseline using `geom_segment`.

### Data With Both Positive and Negative Values

Bar charts can be used for data containing both positive and negative values:

```{r, fig.width = 10, eval = FALSE, include = FALSE}
p1 <- dotplot(city ~ temp - 32, data = citytemps, origin = 0,
              panel = function(x, y) {
                  panel.xyplot(x, y)
                  panel.abline(v = 0, lty = 2)
              })
p2 <- barchart(city ~ temp - 32, data = citytemps, origin = 0)
```

```{r, fig.width = 10, warning = FALSE}
p1 <- ggplot(citytemps) +
    geom_point(aes(y = city, x = temp - 32)) +
    geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
    geom_col(aes(x = city, y = temp - 32)) +
    coord_flip()
grid.arrange(p1, p2, nrow = 1)
```

* This puts a strong emphasis on the baseline.

* This can be useful if the baseline is meaningful, such as

    * [an average level](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/time-series/global)

    * zero degrees Celsius

* Zero degrees Fahrenheit is not meaningful:

```{r, fig.width = 10, eval = FALSE, include = FALSE}
p1 <- dotplot(city ~ temp, data = citytemps, origin = 0,
              panel = function(x, y) {
                  panel.xyplot(x, y)
                  panel.abline(v = 0, lty = 2)
              })
p2 <- barchart(city ~ temp, data = citytemps, origin = 0)
```

```{r, fig.width = 10, warning = FALSE}
p1 <- ggplot(citytemps) +
    geom_point(aes(y = city, x = temp)) +
    geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
    geom_col(aes(x = city, y = temp)) +
    coord_flip()
grid.arrange(p1, p2, nrow = 1)
```

## `ggplot` Documentation

* The `ggplot2` package includes help pages, but the variants available at are a bit more accessible.

* The definitive guide is Hadley Wickham's book.

* Some useful recipes are available at .

* A more extensive collection is available in Winston Chang's (2018) _R Graphics Cookbook_.

* [_R for Data Science_](https://r4ds.had.co.nz/) also contains material on `ggplot2`.

## Group Summaries

Dot plots, and sometimes bar charts, can be very useful for showing group summaries.

Two approaches for computing summaries:

* Use the `tapply`, `by`, and `aggregate` functions from base R.

* Use tools in the `tidyverse`, in particular from the `dplyr` package.

I will use the `dplyr` approach.

This uses `group_by` to create a _grouped table_, followed by `summarize`.

Here is how to compute the average `yield` values for each `variety` in the `barley` data:

```{r}
barley_by_variety <- group_by(barley, variety)
barley_variety_means <- summarize(barley_by_variety, avg_yield = mean(yield))
head(barley_variety_means)
ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety)))
```

The ordering of the `variety` factor created by the two approaches is a little different.

An alternate way of specifying the `dplyr` computation uses the _pipe operator_ `%>%`:

```{r, eval = FALSE}
barley %>%
    group_by(variety) %>%
    summarize(avg_yield = mean(yield))
```

In this approach the result on the left of `%>%` is passed implicitly as the first argument to the function called on the right.

Some like this approach a lot; others do not. I do not care for it.

## Variations in Appearance

Base and lattice dot plots use only horizontal grid lines.

This corresponds to the version introduced by W. S. Cleveland.

Lattice and ggplot allow features such as this to be customized using _themes_.

`ggplot2` provides a number of alternate themes; the `ggthemes` package provides more.

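As a small illustration of swapping themes, the sketch below applies two themes that ship with `ggplot2`, `theme_bw()` and `theme_minimal()`, to the same dot plot. The data frame `grp` and its values are made up for the example and are not one of the data sets used in these notes.

```r
library(ggplot2)

# Hypothetical group summaries (made-up values)
grp <- data.frame(group = c("a", "b", "c", "d"),
                  avg = c(3.2, 4.1, 2.7, 5.0))

# A basic dot plot of the summaries, ordered by value
p <- ggplot(grp) +
    geom_point(aes(x = avg, y = reorder(group, avg)))

# The same plot with two different built-in themes; only the
# non-data elements (background, grid lines, borders) change
p_bw <- p + theme_bw()
p_min <- p + theme_minimal()
```

Adding a theme changes only the appearance; the plotted data stay the same.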
The _Wall Street Journal_ theme `ggthemes::theme_wsj` produces

```{r}
ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety))) +
    ggthemes::theme_wsj()
```

A theme to closely match the style used in Cleveland's 1993 book _Visualizing Data_ and used by the base and lattice functions can be defined as

```{r}
theme_dotplotx <- function() {
    theme(
        ## remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(color = "black", linetype = 3),
        axis.text.y = element_text(size = rel(1.2)),
        ## use a white background
        panel.background = element_rect(fill = "white", colour = NA),
        panel.border = element_rect(fill = NA, colour = "grey20"))
}
```

This produces

```{r}
ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety))) +
    theme_dotplotx()
```

```{r, echo = FALSE, eval = FALSE}
# A dotplot: pretty close approximation to the style in Cleveland's book.
# The theme arguments refer to the FINAL x and y axes,
# not the pre-coord_flip axes.
ggplot(hdata) +
    geom_point(aes(x = state), stat = "bin") +
    coord_flip() +
    theme(
        # remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(linetype = 3, color = "darkgray"),
        axis.text.y = element_text(size = rel(0.8)))

ggplot(Playfair) +
    geom_point(aes(x = city, y = population)) +
    coord_flip() +
    theme(
        # remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(linetype = 3, color = "darkgray"),
        axis.text.y = element_text(size = rel(0.8)))

ggplot(Playfair) +
    geom_point(aes(y = city, x = population)) +
    theme(
        # remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(linetype = 3, color = "darkgray"),
        axis.text.y = element_text(size = rel(0.8)))

ggplot(Playfair) +
    geom_point(aes(y = city, x = population)) +
    theme(
        # remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(color = "lightgray"),
        axis.text.y = element_text(size = rel(0.8)))

theme_dotplotx <- function() {
    theme(
        # remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(color = "lightgray"),
        axis.text.y = element_text(size = rel(0.8)))
}
```