---
title: "Dot Plots and Bar Charts"
output:
  html_document:
    toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
```{r, include = FALSE}
source("datasets.R")
```
## Considerations to Keep in Mind
As we look at visualizations, a reminder of some considerations:
* Task levels for visualization, from highest to lowest level:
    * **Analyze**: Identify patterns, distributions, presence of outliers
      or clusters, other interesting features.
    * **Search**: Look up aspects of a feature known in advance or
      revealed by the visualization.
    * **Query**: Identify, compare features of individual items.

    Each higher level builds on the levels below.
* Scalability:
    * How well do these visualizations work for larger data sets?
    * Are there variations that can help for larger data sets?
## Dot Plots
### Basics
One of the simplest visualizations of a single numerical variable with
a modest number of observations and labels for the observations is a
_dot plot_, or _Cleveland dot plot_:
```{r}
library(ggplot2)
ggplot(Playfair) +
geom_point(aes(x = population, y = city))
```
This visualization
* shows the overall distribution of the data, and
* makes it easy to locate the population of a particular city.
A useful variation is to order the vertical position by rank:
```{r}
ggplot(Playfair) +
geom_point(aes(x = population, y = reorder(city, population)))
```
With this ordering,
* locating a particular city is a little more difficult; but
* the shape of the distribution is more apparent, and
* approximate median and quartiles can be read off.
This visualization is often very useful for group summaries.
### Larger Data Sets
Using labels can become impractical for larger numbers of observations:
```{r}
ggplot(citytemps) + geom_point(aes(x = temp, y = reorder(city, temp)))
```
Instead we can use ranks:
```{r}
ggplot(citytemps) + geom_point(aes(x = temp, y = rank(temp)))
```
Using ranks or percentiles, this visualization in principle scales to
much larger data sets:
```{r}
ggplot(diamonds) +
geom_point(aes(x = price, y = 100 * rank(price) / length(price)))
```
With a data set of this size there is little visual difference between
a dot plot and a plot that interpolates the points:
```{r}
ggplot(diamonds) +
geom_line(aes(x = price, y = 100 * rank(price) / length(price)))
```
Both the dot plot and the interpolated version use a lot of resources
for computing and storing the plot with diminishing visual returns.
A more effective approach for larger data sets is to fix a set of
percentages and plot against the corresponding percentiles:
```{r}
p <- seq(0, 1, length.out = 100)
dm <- data.frame(pct = 100 * p, price = quantile(diamonds$price, p))
ggplot(dm) + geom_line(aes(x = price, y = pct))
```
A similar result can be obtained with `stat_ecdf`.
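As a rough sketch of the `stat_ecdf` route, assuming the `diamonds` data that ships with `ggplot2`: `stat_ecdf` computes the proportion of observations at or below each price, so the vertical scale runs from 0 to 1 rather than from 0 to 100. The name `p_ecdf` is only for illustration.

```{r}
library(ggplot2)
## Empirical CDF of price: y is the proportion of diamonds with price <= x
p_ecdf <- ggplot(diamonds, aes(x = price)) +
    stat_ecdf(geom = "step") +
    labs(y = "proportion at or below price")
p_ecdf
```

Like the percentile version, this summarizes the data before drawing, so the plot does not need one graphical element per observation.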
### Some Variations
The size of the dots can be used to encode an additional numeric
variable.
We can compute the approximate area for the cities in the `Playfair`
data frame as `pi * (diameter / 2) ^ 2`:
```{r, message = FALSE}
library(dplyr)
PlayfairA <- mutate(Playfair, area = pi * (diameter / 2) ^ 2)
ggplot(PlayfairA) +
geom_point(aes(x = population, y = reorder(city, population),
size = area)) +
scale_size_area()
```
For the `barley` data we have two yield measurements for each site/variety combination, one for each year. It can be useful to
* place both measures for a site/variety combination on one line, and
* identify the year using color:
```{r, message = FALSE}
barley <- mutate(barley, sitevar = paste(site, variety, sep = ", "))
ggplot(barley) +
geom_point(aes(x = yield, y = sitevar, color = year))
```
Reordering the lines based on the yield in 1931 may be useful.
First step: Isolate 1931 yields:
```{r}
b31 <- filter(barley, year == 1931)
b31 <- select(b31, yield31 = yield, sitevar)
head(b31)
```
Second step: Use `left_join` to merge the 1931 yields with the barley
data frame:
```{r}
## Join on the shared key explicitly rather than relying on the default
barley31 <- left_join(barley, b31, by = "sitevar")
head(barley31)
```
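To see what the two steps do, here is a minimal sketch on a toy data frame. The values and the names `toy`, `toy31`, and `toy_joined` are made up for illustration and are not the real barley yields.

```{r, message = FALSE}
library(dplyr)
## Hypothetical data: two site/variety combinations, two years each
toy <- data.frame(sitevar = rep(c("A, v1", "B, v1"), each = 2),
                  year = factor(rep(c(1931, 1932), times = 2)),
                  yield = c(30, 25, 40, 45))
## Step 1: keep the 1931 rows and rename yield to yield31
toy31 <- select(filter(toy, year == 1931), yield31 = yield, sitevar)
## Step 2: attach the 1931 yield to every row with the same sitevar
toy_joined <- left_join(toy, toy31, by = "sitevar")
toy_joined
```

Every row for a given `sitevar` now carries that combination's 1931 yield, which is what `reorder` needs to sort the lines.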
Now plot the data:
```{r}
ggplot(barley31) +
geom_point(aes(x = yield, y = reorder(sitevar, yield31), color = year))
```
### Base and Lattice Graphics
Base graphics provides the `dotchart` function:
```{r}
dotchart(Playfair$population, labels = Playfair$city)
```
The `lattice` package provides `dotplot`:
```{r}
library(lattice)
dotplot(reorder(city, population) ~ population, data = Playfair)
```
Most lattice plots support a `groups` argument that is usually mapped
to color:
```{r}
dotplot(reorder(sitevar, yield31) ~ yield,
        groups = year, data = barley31, auto.key = TRUE)
```
## Bar Charts
### Basics
Bar charts are most commonly used to show frequencies for categorical
data. They are also usually drawn vertically:
```{r}
ggplot(diamonds) + geom_bar(aes(x = color))
```
It is possible to persuade `geom_bar` to encode the value of a numeric
variable as bar height by
* assigning the variable to the `y` aesthetic, and
* specifying `stat = "identity"`.
The default `stat` for `geom_bar` is `"count"`, which counts the
observations at each `x` value and uses the counts as the `y` aesthetic.
```{r}
ggplot(Playfair) +
geom_bar(aes(y = population, x = reorder(city, population)),
stat = "identity")
```
Slightly simpler is to use `geom_col`:
```{r}
ggplot(Playfair) +
geom_col(aes(y = population, x = reorder(city, population)))
```
To make labels readable we can flip the plot:
```{r}
ggplot(Playfair) +
geom_col(aes(y = population, x = reorder(city, population))) +
coord_flip()
```
[With versions of `ggplot2` before 3.3.0, switching `x` and `y` doesn't do what you want; later versions infer the orientation from the aesthetics.]
Reducing the bar width may be helpful:
```{r}
ggplot(Playfair) +
geom_col(aes(y = population, x = reorder(city, population)),
width = 0.3) +
coord_flip()
```
Creating this basic bar chart in `lattice` is easier:
```{r}
barchart(reorder(city, population) ~ population, data = Playfair)
```
Base graphics also provide a bar chart with the `barplot` function.
```{r}
opar <- par(mar = c(5, 5, 4, 2) + 0.1)
with(Playfair,
barplot(population, horiz = TRUE, names.arg = city,
las = 1, cex.names = 0.7, cex.axis = 0.7))
par(opar)
```
### Comparisons and the Zero Baseline Issue
Bar charts seem to be used much more than dot plots in the popular
media.
But they are less widely applicable, and have one dangerous
feature, sometimes called the _zero baseline issue_.
Because of the way our perception works, when we look at a bar chart
we focus on the length of the bar, that is, the distance from the
baseline, even when the baseline is not meaningful.
Look at these views of temperatures in the subset of cities where
temperatures are between 60 and 70 degrees.
```{r, fig.width = 10, message = FALSE}
library(gridExtra)
c6070 <- filter(citytemps, temp >= 60 & temp <= 70)
p1 <- barchart(city ~ temp, c6070)
p2 <- dotplot(city ~ temp, c6070)
grid.arrange(p1, p2, nrow = 1)
```
Take the Melbourne-Los Angeles pair and the Algiers-Addis Ababa pair.
* Their temperature differences are about the same.
* The bar chart makes the Algiers-Addis Ababa contrast look much more
extreme than the Melbourne-Los Angeles contrast.
With a subset like this the focus is on the difference, not the
ratio: in this context these temperatures are _interval data_.
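A small arithmetic sketch shows how a truncated axis distorts the comparison. The two temperatures and the axis start are hypothetical, chosen only for illustration.

```{r}
## Hypothetical temperatures in degrees Fahrenheit
temps <- c(cityA = 62, cityB = 68)
axis_start <- 60                    ## assumed left edge of the bar chart
bar_lengths <- temps - axis_start   ## what the eye compares in a bar chart
bar_lengths[["cityB"]] / bar_lengths[["cityA"]]  ## apparent ratio of bar lengths
temps[["cityB"]] / temps[["cityA"]]              ## ratio of the raw values
temps[["cityB"]] - temps[["cityA"]]              ## the difference of interest
```

The bars suggest a four-fold contrast, while the values differ by only six degrees.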
As another example, take the cities in the `Playfair` data with
population between 100 and 500 thousand:
```{r, fig.width = 10}
P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15)
p2 <- barchart(city ~ population, data = P15)
grid.arrange(p1, p2, nrow = 1)
```
Someone choosing to look at a restricted range like this is most
likely focusing on the differences in population, not the ratio.
The ratio comparisons emphasized by the bar chart are not meaningful.
Setting the origin to zero rescues the bar chart:
```{r, fig.width = 10}
P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15, origin = 0)
p2 <- barchart(city ~ population, data = P15, origin = 0)
grid.arrange(p1, p2, nrow = 1)
```
Either chart allows a ratio comparison or a difference comparison.
* The ratio comparison is a little easier with a bar chart.
* The difference comparison is harder because of the distraction
created by the bars.
* For quantity measurements, ratios are usually at least meaningful
even if they might not be the primary focus.
Some notes:
* When using bar charts for positive numbers where ratios are
  meaningful the baseline should _always_ be zero.
    * The only exception is when there is a natural non-zero baseline value,
      such as 32 degrees Fahrenheit (i.e. 0 degrees Celsius) for temperatures.
    * You may need to intervene with your software's defaults to make this
      happen.
* Bar charts always push the viewer to ratio comparisons, whether
  they are meaningful or not.
    * Using a non-zero baseline can therefore
      [mislead the viewer](http://www.storytellingwithdata.com/blog/2012/09/bar-charts-must-have-zero-baseline).
    * Some news organizations seem particularly prone to taking advantage
      of/falling prey to this issue.
* I used `lattice` in these examples because it is hard to get
  `geom_bar` or `geom_col` to use a non-zero baseline (which is a good thing!).
    * You _can_ create a bar chart with a non-zero baseline using
      `geom_segment`.
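One possible sketch of the `geom_segment` approach, using the built-in `airquality` data as a stand-in and 32 degrees Fahrenheit as the assumed natural baseline. The names `mt`, `baseline`, and `p_seg` are illustrative, and the `linewidth` argument assumes `ggplot2` 3.4.0 or later (earlier versions use `size`).

```{r}
library(ggplot2)
## Mean daily temperature (degrees F) by month, May through September
mt <- aggregate(Temp ~ Month, data = airquality, FUN = mean)
mt$Month <- factor(month.abb[mt$Month], levels = month.abb[5:9])
baseline <- 32   ## freezing point, the assumed meaningful baseline
## Each "bar" is a thick segment from the baseline to the observed value
p_seg <- ggplot(mt) +
    geom_segment(aes(x = baseline, xend = Temp, y = Month, yend = Month),
                 linewidth = 6) +
    geom_vline(xintercept = baseline, linetype = 2)
p_seg
```

Because the segments start at a chosen value, the choice of baseline is explicit in the code rather than left to the software's defaults.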
### Data With Both Positive and Negative Values
Bar charts can be used for data containing both positive and negative
values:
```{r, fig.width = 10, eval = FALSE, include = FALSE}
p1 <- dotplot(city ~ temp - 32, data = citytemps, origin = 0,
panel = function(x, y) {
panel.xyplot(x, y)
panel.abline(v = 0, lty = 2)
})
p2 <- barchart(city ~ temp - 32, data = citytemps, origin = 0)
```
```{r, fig.width = 10, warning = FALSE}
p1 <- ggplot(citytemps) +
geom_point(aes(y = city, x = temp - 32)) +
geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
geom_col(aes(x = city, y = temp - 32)) +
coord_flip()
grid.arrange(p1, p2, nrow = 1)
```
* This puts a strong emphasis on the baseline.
* This can be useful if the baseline is meaningful, such as
    * [an average level](https://www.ncdc.noaa.gov/cag/time-series/global)
    * zero degrees Celsius
* Zero degrees Fahrenheit is not meaningful:
```{r, fig.width = 10, eval = FALSE, include = FALSE}
p1 <- dotplot(city ~ temp, data = citytemps, origin = 0,
panel = function(x, y) {
panel.xyplot(x, y)
panel.abline(v = 0, lty = 2)
})
p2 <- barchart(city ~ temp, data = citytemps, origin = 0)
```
```{r, fig.width = 10, warning = FALSE}
p1 <- ggplot(citytemps) +
geom_point(aes(y = city, x = temp)) +
geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
geom_col(aes(x = city, y = temp)) +
coord_flip()
grid.arrange(p1, p2, nrow = 1)
```
## `ggplot` Documentation
* The `ggplot2` package includes help pages, but the online versions
are a bit more accessible.
* The definitive guide is Hadley Wickham's book.
* Some useful recipes are available online.
* A more extensive collection is available in Winston Chang's (2013)
_R Graphics Cookbook_.
* [_R for Data Science_](http://r4ds.had.co.nz/) also contains
material on `ggplot2`.
## Group Summaries
Dot plots, and sometimes bar charts, can be very useful for showing
group summaries. Two approaches for computing summaries:
* Use the `tapply`, `by`, and `aggregate` functions from base R.
* Use tools in the `tidyverse`, in particular from the `dplyr`
package.
I will use the `dplyr` approach. This uses `group_by` to create a
_grouped table_, followed by `summarize`.
Here is how to compute the average `yield` values for each `variety`
in the `barley` data:
```{r}
barley_by_variety <- group_by(barley, variety)
barley_variety_means <- summarize(barley_by_variety, avg_yield = mean(yield))
head(barley_variety_means)
ggplot(barley_variety_means) +
geom_point(aes(x = avg_yield, y = as.character(variety)))
```
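For comparison, a sketch of the base R route using `aggregate` and `tapply`. This assumes the copy of `barley` that ships with `lattice`, which may differ from the version loaded by `datasets.R`; the name `bv` is illustrative.

```{r}
## aggregate returns a data frame with one row per variety
bv <- aggregate(yield ~ variety, data = lattice::barley, FUN = mean)
names(bv)[2] <- "avg_yield"
head(bv)
## tapply returns a named vector instead of a data frame
with(lattice::barley, tapply(yield, variety, mean))
```

The data frame from `aggregate` can be passed to `ggplot` in the same way as the `dplyr` result.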
The ordering of the `variety` factor created by the two approaches is
a little different.
An alternate way of specifying the `dplyr` computation uses the _pipe
operator_ `%>%`:
```{r, eval = FALSE}
barley %>%
group_by(variety) %>%
summarize(avg_yield = mean(yield))
```
In this approach the result on the left of `%>%` is passed implicitly
as the first argument to the function called on the right.
Some like this approach a lot; others do not. I do not care for it.
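A quick check that the piped and nested forms compute the same thing, using the built-in `mtcars` data as a stand-in. The names `piped` and `nested` are illustrative.

```{r, message = FALSE}
library(dplyr)
## Piped form: each result is passed on as the first argument of the next call
piped <- mtcars %>%
    group_by(cyl) %>%
    summarize(avg_mpg = mean(mpg))
## Nested form: the same calls written inside out
nested <- summarize(group_by(mtcars, cyl), avg_mpg = mean(mpg))
all.equal(as.data.frame(piped), as.data.frame(nested))
```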
## Variations in Appearance
Base and lattice dot plots use only horizontal grid lines. This
corresponds to the version introduced by W. S. Cleveland.
Lattice and ggplot allow features such as this to be customized using
_themes_.
`ggplot2` provides a number of alternate themes; the `ggthemes`
package provides more.
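As a minimal sketch of switching themes, here is one of the built-in alternatives, `theme_bw`, applied to a dot plot of a group summary. The `mpg` data from `ggplot2` is used as a stand-in, and the names `cls` and `p_theme` are illustrative.

```{r}
library(ggplot2)
## Mean highway mileage for each vehicle class
cls <- aggregate(hwy ~ class, data = mpg, FUN = mean)
## Adding a theme function to the plot replaces the default theme
p_theme <- ggplot(cls) +
    geom_point(aes(x = hwy, y = reorder(class, hwy))) +
    theme_bw()
p_theme
```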
The _Wall Street Journal_ theme `ggthemes::theme_wsj` produces
```{r}
ggplot(barley_variety_means) +
geom_point(aes(x = avg_yield, y = as.character(variety))) +
ggthemes::theme_wsj()
```
A theme to closely match the style used in Cleveland's 1993 book
_Visualizing Data_ and used by the base and lattice functions can be
defined as
```{r}
theme_dotplotx <- function() {
theme( ## remove the vertical grid lines
panel.grid.major.x = element_blank() ,
panel.grid.minor.x = element_blank() ,
## explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(color="black", linetype = 3),
axis.text.y = element_text(size=rel(1.2)),
## use a white background
panel.background = element_rect(fill = "white", colour = NA),
panel.border = element_rect(fill = NA, colour = "grey20"))
}
```
This produces
```{r}
ggplot(barley_variety_means) +
geom_point(aes(x = avg_yield, y = as.character(variety))) +
theme_dotplotx()
```
```{r, echo = FALSE, eval = FALSE}
#http://www.win-vector.com/blog/2013/02/revisiting-clevelands-the-elements-of-graphing-data-in-ggplot2/
# A dotplot: pretty close approximation to the style in Cleveland's book.
# The theme arguments refer to the FINAL x and y axes,
# not the pre-coord_flip axes.
ggplot(hdata)+ geom_point(aes(x=state), stat="bin") +
coord_flip() +
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(linetype=3, color="darkgray"),
axis.text.y=element_text(size=rel(0.8)) )
ggplot(Playfair) + geom_point(aes(x=city, y = population)) +
coord_flip() +
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(linetype=3, color="darkgray"),
axis.text.y=element_text(size=rel(0.8)) )
ggplot(Playfair) + geom_point(aes(y=city, x = population)) +
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(linetype=3, color="darkgray"),
axis.text.y=element_text(size=rel(0.8)) )
ggplot(Playfair) + geom_point(aes(y=city, x = population)) +
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(color="lightgray"),
axis.text.y=element_text(size=rel(0.8)) )
theme_dotplotx <- function() {
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
panel.grid.minor.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(color="lightgray"),
axis.text.y=element_text(size=rel(0.8)) )
}
```