Considerations to Keep in Mind

As we look at visualizations, a reminder of some considerations:

Dot Plots

Basics

One of the simplest visualizations of a single numerical variable with a modest number of observations and labels for the observations is a dot plot, or Cleveland dot plot:

library(ggplot2)
ggplot(Playfair) +
    geom_point(aes(x = population, y = city))

This visualization

  • shows the overall distribution of the data, and
  • makes it easy to locate the population of a particular city.

A useful variation is to order the vertical position by rank:

ggplot(Playfair) +
    geom_point(aes(x = population, y = reorder(city, population)))

With this ordering:

  • locating a particular city is a little more difficult; but
  • the shape of the distribution is more apparent, and
  • approximate median and quartiles can be read off.
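
To see why quartiles can be read off a rank-ordered dot plot, note that the point a fraction p of the way up the vertical axis sits at approximately the p-th quantile. A minimal sketch with hypothetical values (the vector x is made up for illustration and is not the Playfair data):

```r
# Hypothetical data: 21 made-up values, sorted as they would be
# along the vertical axis of a rank-ordered dot plot.
x <- sort(c(12, 47, 33, 25, 90, 18, 61, 40, 29, 75, 22,
            55, 38, 15, 68, 31, 44, 27, 83, 36, 50))
n <- length(x)

# Vertical positions 25%, 50% and 75% of the way from the bottom
# point (rank 1) to the top point (rank n).
probs <- c(0.25, 0.50, 0.75)
pos <- 1 + (n - 1) * probs

# With n = 21 these positions are whole numbers (6, 11, 16), so the
# values at those ranks coincide with R's default quantiles.
read_off <- x[pos]
quartiles <- unname(quantile(x, probs))
all.equal(read_off, quartiles)
```

When the number of observations does not give whole-number positions, the values read off the plot are only approximations to the quantiles.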

This visualization is often very useful for group summaries.

Larger Data Sets

Using labels can become impractical for larger numbers of observations:

ggplot(citytemps) + geom_point(aes(x = temp, y = reorder(city, temp)))

Instead we can use ranks:

ggplot(citytemps) + geom_point(aes(x = temp, y = rank(temp)))

Using ranks or percentiles, this visualization in principle scales to much larger data sets:

ggplot(diamonds) +
    geom_point(aes(x = price, y = 100 * rank(price) / length(price)))

With a data set of this size there is little visual difference between a dot plot and a plot that interpolates the points:

ggplot(diamonds) +
    geom_line(aes(x = price, y = 100 * rank(price) / length(price)))

Both the dot plot and the interpolated version use a lot of resources for computing and storing the plot, with diminishing visual returns.

A more effective approach for larger data sets is to fix a set of percentages and plot against the corresponding percentiles:

p <- seq(0, 1, length.out = 100)
dm <- data.frame(pct = 100 * p, price = quantile(diamonds$price, p))
ggplot(dm) + geom_line(aes(x = price, y = pct))

A similar result can be obtained with stat_ecdf.
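
As a sketch of the stat_ecdf alternative: stat_ecdf computes the empirical cumulative distribution function of the x aesthetic, so the vertical axis is a proportion between 0 and 1 rather than a percentage. The relabeled axis and the pad = FALSE setting are choices made here for illustration, and the name p_ecdf is introduced only for this example.

```r
library(ggplot2)

# Empirical CDF of diamond prices; stat_ecdf computes the proportions.
# pad = FALSE drops the extra points that would be added at -Inf and Inf.
p_ecdf <- ggplot(diamonds) +
    stat_ecdf(aes(x = price), pad = FALSE) +
    ## relabel proportions as percentages to match the earlier plots
    scale_y_continuous(labels = function(v) 100 * v) +
    labs(y = "pct")
p_ecdf
```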

Some Variations

The size of the dots can be used to encode an additional numeric variable.

We can compute the approximate area for the cities in the Playfair data frame as pi * (diameter / 2) ^ 2:

library(dplyr)
PlayfairA <- mutate(Playfair, area = pi * (diameter / 2) ^ 2)
ggplot(PlayfairA) +
    geom_point(aes(x = population, y = reorder(city, population),
                   size = area)) +
    scale_size_area()

For the barley data we have two yield measurements for each site/variety combination, one for each of two years. It can be useful to

  • place both measures for a site/variety combination on one line, and
  • identify the year using color:

library(lattice)  # provides the barley data
barley <- mutate(barley, sitevar = paste(site, variety, sep = ", "))
ggplot(barley) +
    geom_point(aes(x = yield, y = sitevar, color = year))

Reordering the lines based on the yield in 1931 may be useful.

First step: Isolate 1931 yields:

b31 <- filter(barley, year == 1931)
b31 <- select(b31, yield31 = yield, sitevar)
head(b31)
##    yield31                    sitevar
## 1 27.00000 University Farm, Manchuria
## 2 48.86667          Waseca, Manchuria
## 3 27.43334          Morris, Manchuria
## 4 39.93333       Crookston, Manchuria
## 5 32.96667    Grand Rapids, Manchuria
## 6 28.96667          Duluth, Manchuria

Second step: Use left_join to merge the 1931 yields with the barley data frame:

barley31 <- left_join(barley, b31, by = "sitevar")
head(barley31)
##      yield   variety year            site                    sitevar  yield31
## 1 27.00000 Manchuria 1931 University Farm University Farm, Manchuria 27.00000
## 2 48.86667 Manchuria 1931          Waseca          Waseca, Manchuria 48.86667
## 3 27.43334 Manchuria 1931          Morris          Morris, Manchuria 27.43334
## 4 39.93333 Manchuria 1931       Crookston       Crookston, Manchuria 39.93333
## 5 32.96667 Manchuria 1931    Grand Rapids    Grand Rapids, Manchuria 32.96667
## 6 28.96667 Manchuria 1931          Duluth          Duluth, Manchuria 28.96667

Now plot the data:

ggplot(barley31) +
    geom_point(aes(x = yield, y = reorder(sitevar, yield31), color = year))
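
The same yield31 column can also be computed without a join. The following is a sketch of one alternative, not the approach used above: it groups by sitevar and picks out the 1931 yield within each group, assuming each site/variety combination has exactly one 1931 row (which holds for the barley data in lattice). The names barley_sv and barley31_alt are introduced here for illustration.

```r
library(dplyr)
library(lattice)  # provides the barley data

barley_sv <- mutate(barley, sitevar = paste(site, variety, sep = ", "))

# Within each site/variety group, copy the single 1931 yield to
# every row of that group, then drop the grouping again.
barley31_alt <- ungroup(
    mutate(group_by(barley_sv, sitevar),
           yield31 = yield[year == "1931"]))
head(barley31_alt)
```

The join makes the two steps explicit; the grouped version is more compact but fails if a group has no 1931 row or more than one.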

Base and Lattice Graphics

Base graphics provides the dotchart function:

dotchart(Playfair$population, labels = Playfair$city)

The lattice package provides dotplot:

library(lattice)
dotplot(reorder(city, population) ~ population, data = Playfair)

Most lattice plots support a groups argument that is usually mapped to color:

dotplot(reorder(sitevar, yield31) ~ yield,
        groups = year, data = barley31, auto.key = TRUE)

Bar Charts

Basics

Bar charts are most commonly used to show frequencies for categorical data. They are also usually drawn vertically:

ggplot(diamonds) + geom_bar(aes(x = color))

It is possible to persuade geom_bar to encode the value of a numeric variable as bar height by

  • assigning the variable to the y aesthetic, and
  • specifying stat = "identity".

The default stat, "count", counts the observations at each x value and uses the counts as the y aesthetic.

ggplot(Playfair) +
    geom_bar(aes(y = population, x = reorder(city, population)),
             stat = "identity")

Slightly simpler is to use geom_col:

ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population)))
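
To make the difference between the two stats concrete, here is a small check (a sketch; the names p_count, counts, and tab are introduced here): the default count stat of geom_bar tabulates the observations in each category, which matches what table computes.

```r
library(ggplot2)

p_count <- ggplot(diamonds) + geom_bar(aes(x = color))

# layer_data() returns the data computed for the first layer;
# its count column holds the bar heights.
counts <- layer_data(p_count)$count

# The same numbers from a direct tabulation of the variable.
tab <- as.vector(table(diamonds$color))
all.equal(counts, tab)
```

With geom_col, or with stat = "identity", no tabulation happens and the y values are used as given.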

To make labels readable we can flip the plot:

ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population))) +
    coord_flip()

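[Before ggplot2 3.3.0, switching x and y does not do what you want; more recent versions detect the orientation from the aesthetics and draw horizontal bars.]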

Reducing the bar width may be helpful:

ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population)),
             width = 0.3) +
    coord_flip()

Creating this basic bar chart in lattice is easier:

barchart(reorder(city, population) ~ population, data = Playfair)

Base graphics also provides a bar chart with the barplot function:

opar <- par(mar = c(5, 5, 4, 2) + 0.1)
with(Playfair,
     barplot(population, horiz = TRUE, names.arg = city,
             las = 1, cex.names = 0.7, cex.axis = 0.7))

par(opar)

Comparisons and the Zero Baseline Issue

Bar charts seem to be used much more than dot plots in the popular media.

But they are less widely applicable, and have one dangerous feature, sometimes called the zero baseline issue.

Because of the way our perception works, when we look at a bar chart we focus on the length of the bar, that is, the distance from the baseline, even when the baseline is not meaningful.

Look at these views of temperatures in the subset of cities where temperatures are between 60 and 70 degrees:

library(gridExtra)
c6070 <- filter(citytemps, temp >= 60 & temp <= 70)
p1 <- barchart(city ~ temp, c6070)
p2 <- dotplot(city ~ temp, c6070)
grid.arrange(p1, p2, nrow = 1)

Take the Cairo-San Francisco pair and the Mexico City-Los Angeles pair.

  • Their temperature differences are about the same.

  • The bar chart makes the Cairo-San Francisco contrast look much more extreme than the Mexico City-Los Angeles contrast.

With a subset like this the focus is on the difference, not the ratio: in this context these temperatures are interval data.

As another example, take the cities in the Playfair data with population between 100 and 500 thousand:

P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15)
p2 <- barchart(city ~ population, data = P15)
grid.arrange(p1, p2, nrow = 1)

Someone choosing to look at a restricted range like this is most likely focusing on the differences in population, not the ratio.

The ratio comparisons emphasized by the bar chart are not meaningful.

Setting the origin to zero rescues the bar chart:

P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15, origin = 0)
p2 <- barchart(city ~ population, data = P15, origin = 0)
grid.arrange(p1, p2, nrow = 1)

Either chart allows a ratio comparison or a difference comparison.

  • The ratio comparison is a little easier with a bar chart.
  • The difference comparison is harder because of the distraction created by the bars.
  • For quantity measurements, ratios are usually at least meaningful, even if they might not be the primary focus.

Some notes:

  • When using bar charts for positive numbers where ratios are meaningful the baseline should always be zero.
  • The only exception is when there is a natural non-zero baseline value, such as 32 degrees Fahrenheit (i.e. 0 degrees Celsius) for temperatures.
  • You may need to intervene with your software’s defaults to make this happen.
  • Bar charts always push the viewer to ratio comparisons, whether they are meaningful or not.
  • Using a non-zero baseline can therefore mislead the viewer.
  • Some news organizations seem particularly prone to taking advantage of/falling prey to this issue.
  • I used lattice in these examples because it is hard to get geom_bar or geom_col to use a non-zero baseline (which is a good thing!).
  • You can create a bar chart with a non-zero baseline using geom_segment.
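
A minimal sketch of the geom_segment approach, using the natural baseline of 32 degrees Fahrenheit mentioned above. The data frame temps and its values are hypothetical, made up for illustration; the linewidth argument assumes ggplot2 3.4.0 or later (older versions use size for line thickness).

```r
library(ggplot2)

# Hypothetical temperatures in degrees Fahrenheit, for illustration only.
temps <- data.frame(city = c("A", "B", "C", "D"),
                    temp = c(25, 41, 32, 68))
baseline <- 32  # freezing point of water

# Each bar is a thick segment running from the baseline to the value.
p_seg <- ggplot(temps) +
    geom_segment(aes(x = baseline, xend = temp,
                     y = city, yend = city),
                 linewidth = 6) +
    geom_vline(xintercept = baseline, linetype = 2)
p_seg
```

Values below the baseline produce segments that extend to the left of the dashed line, so the direction of each bar shows whether the city is above or below freezing.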

Data With Both Positive and Negative Values

Bar charts can be used for data containing both positive and negative values:

p1 <- ggplot(citytemps) +
    geom_point(aes(y = city, x = temp - 32)) +
    geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
    geom_col(aes(x = city, y = temp - 32)) +
    coord_flip()
grid.arrange(p1, p2, nrow = 1)

  • This puts a strong emphasis on the baseline.
  • This can be useful if the baseline is meaningful, such as the freezing point of water (32 degrees Fahrenheit) used here.
  • Zero degrees Fahrenheit is not meaningful:

p1 <- ggplot(citytemps) +
    geom_point(aes(y = city, x = temp)) +
    geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
    geom_col(aes(x = city, y = temp)) +
    coord_flip()
grid.arrange(p1, p2, nrow = 1)

Group Summaries

Dot plots, and sometimes bar charts, can be very useful for showing group summaries. Two approaches for computing summaries:

I will use the dplyr approach. This uses group_by to create a grouped table, followed by summarize.

Here is how to compute the average yield values for each variety in the barley data:

barley_by_variety <- group_by(barley, variety)
barley_variety_means <- summarize(barley_by_variety, avg_yield = mean(yield))
head(barley_variety_means)
## # A tibble: 6 × 2
##   variety   avg_yield
##   <fct>         <dbl>
## 1 Svansota       30.4
## 2 No. 462        35.4
## 3 Manchuria      31.5
## 4 No. 475        31.8
## 5 Velvet         33.1
## 6 Peatland       34.2

ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety)))

The ordering of the variety factor created by the two approaches is a little different.
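
These notes show only the dplyr computation. As an assumption about what a second approach might look like, one common base R alternative is aggregate with a formula; this is a sketch, and the name barley_variety_agg is introduced here for illustration.

```r
library(lattice)  # provides the barley data

# Mean yield for each variety, using base R's formula interface.
barley_variety_agg <- aggregate(yield ~ variety, data = barley, FUN = mean)

# Rename the summary column to match the dplyr result.
names(barley_variety_agg)[2] <- "avg_yield"
head(barley_variety_agg)
```

The result is an ordinary data frame with one row per variety, so it can be passed to ggplot in the same way as barley_variety_means.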

An alternate way of specifying the dplyr computation uses the pipe operator %>%:

barley %>%
    group_by(variety) %>%
    summarize(avg_yield = mean(yield))

In this approach the result on the left of %>% is passed implicitly as the first argument to the function called on the right.
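
As a small check of this equivalence (a sketch; the names piped and nested are introduced here), the piped computation and the explicitly nested calls give the same summary table:

```r
library(dplyr)
library(lattice)  # provides the barley data

# Pipe form: each result becomes the first argument of the next call.
piped <- barley %>%
    group_by(variety) %>%
    summarize(avg_yield = mean(yield))

# The same computation written as nested function calls.
nested <- summarize(group_by(barley, variety), avg_yield = mean(yield))

all.equal(piped, nested)
```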

Some like this approach a lot; others do not. I do not care for it.

Variations in Appearance

Base and lattice dot plots use only horizontal grid lines. This corresponds to the version introduced by W. S. Cleveland.

Lattice and ggplot allow features such as this to be customized using themes.

ggplot2 provides a number of alternate themes; the ggthemes package provides more.

The Wall Street Journal theme ggthemes::theme_wsj produces

ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety))) +
    ggthemes::theme_wsj()

A theme to closely match the style used in Cleveland’s 1993 book Visualizing Data and used by the base and lattice functions can be defined as

theme_dotplotx <- function() {
    theme( ## remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(color = "black", linetype = 3),
        axis.text.y = element_text(size = rel(1.2)),
        ## use a white background
        panel.background = element_rect(fill = "white", colour = NA),
        panel.border = element_rect(fill = NA, colour = "grey20"))
}

This produces

ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety))) +
    theme_dotplotx()

