## Considerations to Keep in Mind

As we look at visualizations, a reminder of some considerations:

• Task levels for visualization, from highest to lowest level:

• Analyze: Identify patterns, distributions, presence of outliers or clusters, other interesting features.

• Search: Look up aspects of a feature known in advance or revealed by the visualization.

• Query: Identify, compare features of individual items.

Each higher level builds on the levels below.

• Scalability:

• How well do these visualizations work for larger data sets?

• Are there variations that can help for larger data sets?

## Dot Plots

### Basics

One of the simplest visualizations of a single numerical variable with a modest number of observations and labels for the observations is a dot plot, or Cleveland dot plot:

library(ggplot2)
ggplot(Playfair) +
geom_point(aes(x = population, y = city))

This visualization

• shows the overall distribution of the data, and
• makes it easy to locate the population of a particular city.

A useful variation is to order the vertical position by rank:

ggplot(Playfair) +
geom_point(aes(x = population, y = reorder(city, population)))

With this ordering

• locating a particular city is a little more difficult; but
• the shape of the distribution is more apparent; and
• approximate median and quartiles can be read off.

This visualization is often very useful for group summaries.
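The ordering above relies on reorder(), which sorts the levels of a factor by the values of a second variable. A minimal sketch with a hypothetical data frame (the city names and populations are made up, not the Playfair data):

```r
# Hypothetical data, not the Playfair data used above.
cities <- data.frame(
  city = c("Avon", "Brill", "Corby", "Derby"),
  population = c(300, 100, 400, 200)
)

# By default the levels of a factor are in alphabetical order.
levels(factor(cities$city))

# reorder() sorts the levels by the second argument (its mean within
# each level, which is just the value when each city appears once).
by_pop <- reorder(cities$city, cities$population)
levels(by_pop)
```

ggplot2 places the first level at the bottom of a discrete y axis, so the smallest value ends up on the bottom line.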

### Larger Data Sets

Using labels can become impractical for larger numbers of observations:

ggplot(citytemps) + geom_point(aes(x = temp, y = reorder(city, temp)))

ggplot(citytemps) + geom_point(aes(x = temp, y = rank(temp)))
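The second plot replaces the city labels with rank(temp). A small sketch with hypothetical prices (made up for illustration) shows what rank() returns and how dividing by the number of observations turns ranks into percentages:

```r
# Hypothetical prices, including one tie.
price <- c(40, 10, 20, 20)

# rank() gives each value its position in sorted order;
# tied values share the average of their positions.
r <- rank(price)

# Dividing by the number of observations rescales ranks to percentages.
pct <- 100 * r / length(price)

data.frame(price, rank = r, pct)
```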

Using ranks or percentiles, this visualization in principle scales to much larger data sets:

ggplot(diamonds) +
geom_point(aes(x = price, y = 100 * rank(price) / length(price)))

With a data set of this size there is little visual difference between a dot plot and a plot that interpolates the points:

ggplot(diamonds) +
geom_line(aes(x = price, y = 100 * rank(price) / length(price)))

Both the dot plot and the interpolated version use a lot of resources for computing and storing the plot, with diminishing visual returns.

A more effective approach for larger data sets is to fix a set of percentages and plot against the corresponding percentiles:

p <- seq(0, 1, length.out = 100)
dm <- data.frame(pct = 100 * p, price = quantile(diamonds$price, p))
ggplot(dm) + geom_line(aes(x = price, y = pct))

A similar result can be obtained with stat_ecdf.

### Some Variations

The size of the dots can be used to encode an additional numeric variable.

We can compute the approximate area for the cities in the Playfair data frame as pi * (diameter / 2) ^ 2:

library(dplyr)
PlayfairA <- mutate(Playfair, area = pi * (diameter / 2) ^ 2)
ggplot(PlayfairA) +
geom_point(aes(x = population, y = reorder(city, population), size = area)) +
scale_size_area()

For the Barley data we have two measures per year. It can be useful to

• place both measures for a site/variety combination on one line, and
• identify the year using color:

barley <- mutate(barley, sitevar = paste(site, variety, sep = ", "))
ggplot(barley) + geom_point(aes(x = yield, y = sitevar, color = year))

Reordering the lines based on the yield in 1931 may be useful.

First step: Isolate 1931 yields:

b31 <- filter(barley, year == 1931)
b31 <- select(b31, yield31 = yield, sitevar)
head(b31)
##    yield31                    sitevar
## 1 27.00000 University Farm, Manchuria
## 2 48.86667          Waseca, Manchuria
## 3 27.43334          Morris, Manchuria
## 4 39.93333       Crookston, Manchuria
## 5 32.96667    Grand Rapids, Manchuria
## 6 28.96667          Duluth, Manchuria

Second step: Use left_join to merge the 1931 yields with the barley data frame:

barley31 <- left_join(barley, b31)
## Joining, by = "sitevar"
head(barley31)
##      yield   variety year            site                    sitevar
## 1 27.00000 Manchuria 1931 University Farm University Farm, Manchuria
## 2 48.86667 Manchuria 1931          Waseca          Waseca, Manchuria
## 3 27.43334 Manchuria 1931          Morris          Morris, Manchuria
## 4 39.93333 Manchuria 1931       Crookston       Crookston, Manchuria
## 5 32.96667 Manchuria 1931    Grand Rapids    Grand Rapids, Manchuria
## 6 28.96667 Manchuria 1931          Duluth          Duluth, Manchuria
##    yield31
## 1 27.00000
## 2 48.86667
## 3 27.43334
## 4 39.93333
## 5 32.96667
## 6 28.96667

Now plot the data:

ggplot(barley31) +
geom_point(aes(x = yield, y = reorder(sitevar, yield31), color = year))

### Base and Lattice Graphics

Base graphics provides the dotchart function:

dotchart(Playfair$population, labels = Playfair$city)
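dotchart draws the values in the order they are given, with the first value on the bottom line, so an ordered version needs the data sorted first. A minimal sketch, using a hypothetical data frame in place of Playfair:

```r
# Hypothetical data standing in for the Playfair data frame.
cities <- data.frame(
  city = c("Avon", "Brill", "Corby", "Derby"),
  population = c(300, 100, 400, 200)
)

# Sort the rows by population so the smallest value is drawn at the bottom.
ord <- order(cities$population)
sorted <- cities[ord, ]

dotchart(sorted$population, labels = sorted$city, xlab = "population")
```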

The lattice package provides dotplot:

library(lattice)
dotplot(reorder(city, population) ~ population, data = Playfair)

Most lattice plots support a groups argument that is usually mapped to color:

dotplot(reorder(sitevar, yield31) ~ yield,
groups = year, data = barley31, auto.key = TRUE)
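A self-contained variant that does not depend on the barley31 data frame built earlier might look like the following sketch. It uses the barley data shipped with lattice; restricting it to a single site (Waseca) is a choice made here purely to keep the plot small:

```r
library(lattice)

# One site only, so each variety has exactly two points (one per year).
waseca <- subset(barley, site == "Waseca")

# groups = year gives the two years different colors;
# auto.key = TRUE adds a legend for them.
p <- dotplot(variety ~ yield, data = waseca,
             groups = year, auto.key = TRUE)
print(p)
```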

## Bar Charts

### Basics

Bar charts are most commonly used to show frequencies for categorical data. They are also usually drawn vertically:

ggplot(diamonds) + geom_bar(aes(x = color))

It is possible to persuade geom_bar to encode the value of a numeric variable as bar height by

• assigning the variable to the y aesthetic, and
• specifying stat = "identity".

The default stat, stat_count, counts the observations at each distinct x value and uses those counts as the y aesthetic.

ggplot(Playfair) +
geom_bar(aes(y = population, x = reorder(city, population)),
stat = "identity")

Slightly simpler is to use geom_col:

ggplot(Playfair) +
geom_col(aes(y = population, x = reorder(city, population)))
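The difference between the default stat and stat = "identity" can be checked on a small example. This is a sketch with a hypothetical data frame (not the Playfair data); layer_data() is used only to inspect the bar heights that ggplot2 computes:

```r
library(ggplot2)

# Hypothetical data, one row per city.
cities <- data.frame(
  city = c("Avon", "Brill", "Corby"),
  population = c(300, 100, 400)
)

# Default geom_bar: bar height is the number of rows for each city.
p_count <- ggplot(cities) + geom_bar(aes(x = city))

# Bar height taken from a column: two equivalent spellings.
p_identity <- ggplot(cities) +
  geom_bar(aes(x = city, y = population), stat = "identity")
p_col <- ggplot(cities) + geom_col(aes(x = city, y = population))

# Heights used for drawing in each case.
layer_data(p_count)$y
layer_data(p_identity)$y
layer_data(p_col)$y
```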

To make labels readable we can flip the plot:

ggplot(Playfair) +
geom_col(aes(y = population, x = reorder(city, population))) +
coord_flip()

[Switching x and y doesn’t do what you want in ggplot2 versions before 3.3.0; later versions infer the bar orientation from the aesthetics.]

Reducing the bar width may be helpful:

ggplot(Playfair) +
geom_col(aes(y = population, x = reorder(city, population)),
width = 0.3) +
coord_flip()

Creating this basic bar chart in lattice is easier:

barchart(reorder(city, population) ~ population, data = Playfair)

Base graphics also provide a bar chart with the barplot function.

opar <- par(mar = c(5, 5, 4, 2) + 0.1)
with(Playfair,
barplot(population, horiz = TRUE, names.arg = city,
las = 1, cex.names = 0.7, cex.axis = 0.7))

par(opar)

### Comparisons and the Zero Baseline Issue

Bar charts seem to be used much more than dot plots in the popular media.

But they are less widely applicable, and have one dangerous feature, sometimes called the zero baseline issue.

Because of the way our perception works, when we look at a bar chart we focus on the length of the bar, that is, the distance from the baseline, even when the baseline is not meaningful.

Look at these views of temperatures in the subset of cities where temperatures are between 60 and 70 degrees.

library(gridExtra)
c6070 <- filter(citytemps, temp >= 60 & temp <= 70)
p1 <- barchart(city ~ temp, c6070)
p2 <- dotplot(city ~ temp, c6070)
grid.arrange(p1, p2, nrow = 1)

Take the Melbourne-Los Angeles pair and the Algiers-Addis Ababa pair.

• Their temperature differences are about the same.

• The bar chart makes the Algiers-Addis Ababa contrast look much more extreme than the Melbourne-Los Angeles contrast.

With a subset like this the focus is on the difference, not the ratio: in this context these temperatures are interval data.
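The effect can be worked through with a little arithmetic. The temperatures below are hypothetical (not values from citytemps), and the bars are assumed to start at an axis limit of 60:

```r
# Hypothetical temperatures in degrees Fahrenheit.
baseline <- 60            # assumed left edge of the bars
pair_low <- c(61, 63)
pair_high <- c(67, 69)

# The temperature differences are identical ...
diff(pair_low)    # 2
diff(pair_high)   # 2

# ... but bar lengths are measured from the baseline,
bars_low <- pair_low - baseline     # 1 3
bars_high <- pair_high - baseline   # 7 9

# so the length ratios the eye picks up are very different.
bars_low[2] / bars_low[1]     # 3
bars_high[2] / bars_high[1]   # about 1.29
```

Two pairs with the same 2 degree difference appear as a 3 to 1 contrast in one case and roughly a 1.3 to 1 contrast in the other.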

As another example, take the cities in the Playfair data with population between 100 and 500 thousand:

P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15)
p2 <- barchart(city ~ population, data = P15)
grid.arrange(p1, p2, nrow = 1)

Someone choosing to look at a restricted range like this is most likely focusing on the differences in population, not the ratio.

The ratio comparisons emphasized by the bar chart are not meaningful.

Setting the origin to zero rescues the bar chart:

P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15, origin = 0)
p2 <- barchart(city ~ population, data = P15, origin = 0)
grid.arrange(p1, p2, nrow = 1)

Either chart allows a ratio comparison or a difference comparison.

• The ratio comparison is a little easier with a bar chart.
• The difference comparison is harder because of the distraction created by the bars.
• For quantity measurements, ratios are usually at least meaningful even if they might not be the primary focus.

Some notes:

• When using bar charts for positive numbers where ratios are meaningful the baseline should always be zero.
• The only exception is when there is a natural non-zero baseline value, such as 32 degrees Fahrenheit (i.e. 0 degrees Celsius) for temperatures.
• You may need to intervene with your software’s defaults to make this happen.
• Bar charts always push the viewer to ratio comparisons, whether they are meaningful or not.
• Using a non-zero baseline can therefore mislead the viewer.
• Some news organizations seem particularly prone to taking advantage of/falling prey to this issue.
• I used lattice in these examples because it is hard to get geom_bar or geom_col to use a non-zero baseline (which is a good thing!).
• You can create a bar chart with a non-zero baseline using geom_segment.
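The geom_segment approach in the last note could look like the following sketch. The data frame and its city names are hypothetical, and the linewidth argument assumes ggplot2 3.4.0 or later (older versions use size for line thickness):

```r
library(ggplot2)

# Hypothetical temperatures in degrees Fahrenheit.
temps <- data.frame(
  city = c("Avon", "Brill", "Corby"),
  temp = c(25, 40, 61),
  baseline = 32    # freezing point, chosen here as the natural baseline
)

# Each "bar" is a thick segment running from the baseline to the value.
p <- ggplot(temps) +
  geom_segment(aes(x = baseline, xend = temp, y = city, yend = city),
               linewidth = 6) +
  geom_vline(xintercept = 32, linetype = 2)
print(p)
```

Values below the baseline extend to the left and values above it to the right, as in a bar chart of temp - 32.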

### Data With Both Positive and Negative Values

Bar charts can be used for data containing both positive and negative values:

p1 <- ggplot(citytemps) +
geom_point(aes(y = city, x = temp - 32)) +
geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
geom_col(aes(x = city, y = temp - 32)) +
coord_flip()
grid.arrange(p1, p2, nrow = 1)

• This puts a strong emphasis on the baseline.
• This can be useful if the baseline is meaningful, such as the freezing point of water (32 degrees Fahrenheit) used here.
• Zero degrees Fahrenheit is not a meaningful baseline:
p1 <- ggplot(citytemps) +
geom_point(aes(y = city, x = temp)) +
geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
geom_col(aes(x = city, y = temp)) +
coord_flip()
grid.arrange(p1, p2, nrow = 1)

## ggplot Documentation

• The ggplot2 package includes help pages, but the variants available at http://docs.ggplot2.org/ are a bit more accessible.
• The definitive guide is Hadley Wickham’s book.
• Some useful recipes are available at http://www.cookbook-r.com/Graphs/.
• A more extensive collection is available in Winston Chang’s (2013) R Graphics Cookbook.
• R for Data Science also contains material on ggplot2.

## Group Summaries

Dot plots, and sometimes bar charts, can be very useful for showing group summaries. Two approaches for computing summaries:

• Use the tapply, by, and aggregate functions from base R.

• Use tools in the tidyverse, in particular from the dplyr package.

I will use the dplyr approach. This uses group_by to create a grouped table, followed by summarize.

Here is how to compute the average yield values for each variety in the barley data:

barley_by_variety <- group_by(barley, variety)
barley_variety_means <- summarize(barley_by_variety, avg_yield = mean(yield))
## # A tibble: 6 x 2
##   variety   avg_yield
##   <fct>         <dbl>
## 1 Svansota       30.4
## 2 No. 462        35.4
## 3 Manchuria      31.5
## 4 No. 475        31.8
## 5 Velvet         33.1
## 6 Peatland       34.2
ggplot(barley_variety_means) +
geom_point(aes(x = avg_yield, y = as.character(variety)))

The ordering of the variety factor created by the two approaches is a little different.
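The base R approach is not shown above. A minimal sketch, assuming the barley data as shipped with the lattice package, computes the same per-variety means with tapply and aggregate:

```r
library(lattice)   # provides the barley data

# tapply returns a named array with one mean per level of variety.
means_tapply <- tapply(barley$yield, barley$variety, mean)

# aggregate returns a data frame with one row per variety.
means_agg <- aggregate(yield ~ variety, data = barley, FUN = mean)

head(means_agg)
```

Both results follow the order of the factor levels of variety.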

An alternate way of specifying the dplyr computation uses the pipe operator %>%:

barley %>%
group_by(variety) %>%
summarize(avg_yield = mean(yield))

In this approach the result on the left of %>% is passed implicitly as the first argument to the function called on the right.
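The equivalence between the piped and the nested form can be checked directly. This sketch uses a small hypothetical data frame instead of the barley data:

```r
library(dplyr)

# Hypothetical data frame with two groups.
d <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Nested calls: read from the inside out.
nested <- summarize(group_by(d, group), avg = mean(value))

# Piped calls: the result on the left becomes the first argument on the right.
piped <- d %>%
  group_by(group) %>%
  summarize(avg = mean(value))

# Compare the two results as plain data frames.
identical(as.data.frame(nested), as.data.frame(piped))
```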

Some like this approach a lot; others do not. I do not care for it.

## Variations in Appearance

Base and lattice dot plots use only horizontal grid lines. This corresponds to the version introduced by W. S. Cleveland.

Lattice and ggplot allow features such as this to be customized using themes.

ggplot2 provides a number of alternate themes; the ggthemes package provides more.

The Wall Street Journal theme ggthemes::theme_wsj produces

ggplot(barley_variety_means) +
geom_point(aes(x = avg_yield, y = as.character(variety))) +
ggthemes::theme_wsj()

A theme to closely match the style used in Cleveland’s 1993 book Visualizing Data and used by the base and lattice functions can be defined as

theme_dotplotx <- function() {
theme( ## remove the vertical grid lines
panel.grid.major.x = element_blank() ,
panel.grid.minor.x = element_blank() ,
## explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(color="black", linetype = 3),
axis.text.y = element_text(size=rel(1.2)),
## use a white background
panel.background = element_rect(fill = "white", colour = NA),
panel.border = element_rect(fill = NA, colour = "grey20"))
}

This produces

ggplot(barley_variety_means) +
geom_point(aes(x = avg_yield, y = as.character(variety))) +
theme_dotplotx()