Some topics we did not have time to look at:
Working with models (Chapter 6 in Healy, 2018; Chapter 25 in R4DS).
Visualizing uncertainty (Chapter 16 of Wilke, 2019; and below).
Plot annotation, plot ensembles, and dashboards (Part II of Wilke, 2019; Chapter 5 of Healy, 2018; and below).
All estimates from data are associated with some degree of uncertainty.
Effectively communicating that uncertainty in visualizations is challenging and an active area of research.
From Cairo (2019); images from a blog post by the author.
The cone of uncertainty:
The NHC forecast cone is designed so that two-thirds of historical official forecast errors over a 5-year sample fall within the cone at a particular time point.
When published in the media, these visualizations are routinely misinterpreted, something like this:
A more effective representation might be something like this, showing an ensemble of possible tracks:
An animated version may be more effective, if the presentation medium permits.
Developing better visualizations for hurricane forecasting, especially targeting the public, is an active area of research.
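The two-thirds rule behind the forecast cone described above can be sketched numerically. This is a minimal illustration, assuming simulated (not actual NHC) forecast errors for a single lead time; the error distribution and the names `errors`, `radius`, and `coverage` are hypothetical. Under that assumption, the cone radius at a given lead time is the two-thirds quantile of the historical errors.

```r
## Simulated track forecast errors (nautical miles) for one lead time.
## These are made-up values, not NHC data.
set.seed(1)
errors <- abs(rnorm(500, mean = 0, sd = 60))

## Radius chosen so that about two-thirds of past errors fall inside it.
radius <- quantile(errors, probs = 2 / 3, names = FALSE)

## Fraction of the simulated errors within the radius; close to 2/3.
coverage <- mean(errors <= radius)
coverage
```

By construction, about one-third of past errors fall outside the radius, so a storm center leaving the cone is not an unusual event.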
Expert ratings, on a scale from 0 to 5, for chocolate bars manufactured in several countries:
library(dplyr)       ## filter, group_by, summarize, mutate
library(tidyr)       ## unnest
library(ggplot2)
library(colorspace)  ## desaturate, darken, lighten
## thm is assumed to be a ggplot2 theme object defined elsewhere in these notes.
data(cacao, package = "dviz.supp")
countries <- c("U.S.A.", "Austria", "Belgium", "Canada", "Peru", "Switzerland")
col80 <- desaturate(darken("#0072B2", .2), .3)
col95 <- desaturate(lighten("#0072B2", .2), .3)
col99 <- desaturate(lighten("#0072B2", .4), .3)
colP <- col95
colM <- "#D55E00"
c1 <- filter(cacao, location %in% countries)
## per-country mean, standard deviation, and sample size
c1sums <- group_by(c1, location) %>%
    summarize(m = mean(rating),
              s = sd(rating),
              n = n()) %>%
    ungroup()
## t-based confidence intervals for the mean at three confidence levels
c1CI <- mutate(data.frame(level = c(0.8, 0.95, 0.99)),
               df = lapply(level,
                           function(lev)
                               with(c1sums, {
                                   h <- s * qt(1 - (1 - lev) / 2, n - 1) /
                                       sqrt(n)
                                   cbind(c1sums, data.frame(xmin = m - h,
                                                            xmax = m + h))
                               }))) %>%
    unnest("df")
ggplot(c1, aes(rating, reorder(location, rating))) +
    geom_point(position = position_jitter(height = 0.3, width = 0.05),
               size = 0.5, color = colP) +
    ##geom_point(aes(m, location), data = c1sums, size = 2.5, color = colM) +
    geom_segment(aes(x = m, xend = m,
                     y = as.integer(reorder(location, m)) - 0.3,
                     yend = as.integer(reorder(location, m)) + 0.3),
                 size = 2, color = colM, data = c1sums) +
    thm +
    ylab("") +
    ggtitle("Ratings for Chocolate Bars", "Bars are sample means.")
The standard deviations of the data distributions are comparable, but the lengths of confidence intervals for the mean vary because of the different sample sizes:
p <- ggplot(filter(c1CI, level == 0.95),
            aes(m, reorder(location, m), xmin = xmin, xmax = xmax)) +
    geom_errorbarh(height = 0) +
    geom_point(size = 2.5, color = colM) +
    thm +
    ylab("") +
    ggtitle("Confidence Intervals for the Mean", "Confidence level 95%")
p + scale_x_continuous(limits = c(1, 4), name = "mean rating")
The same plot with a reduced horizontal range:
p + scale_x_continuous(limits = c(2.5, 3.8), name = "mean rating")
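The effect of sample size on interval length noted above can be checked directly from the t-interval formula used to build `c1CI`, which is mean ± t × s / √n. The following is a minimal sketch, assuming a common standard deviation of 0.5 and a few hypothetical sample sizes; `ci_half_width` is an illustrative helper, not a function from any package.

```r
## Half-width of a t-based confidence interval for a mean,
## given the sample standard deviation s and sample size n.
ci_half_width <- function(s, n, level = 0.95) {
    s * qt(1 - (1 - level) / 2, df = n - 1) / sqrt(n)
}

## Hypothetical sample sizes with the same standard deviation.
n <- c(4, 25, 100, 400)
widths <- data.frame(n = n, half_width = ci_half_width(s = 0.5, n = n))
widths
```

With the standard deviation held fixed, the half-width shrinks roughly in proportion to 1 / √n, which is why countries with few rated bars get the longest intervals.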
A more elaborate display with confidence intervals at several levels:
## based on code for Wilke's Fig. 16.7
arrange(c1CI, desc(level)) %>%
    mutate(level = paste0(100 * level, "%"),
           location = reorder(location, m)) %>%
    ggplot(aes(m, location, xmin = xmin, xmax = xmax)) +
    geom_errorbarh(aes(size = level, color = level), height = 0) +
    geom_errorbarh(aes(color = level), height = 0.1) +
    geom_point(size = 2.5, color = colM) +
    scale_x_continuous(limits = c(2.5, 3.8), name = "mean rating") +
    scale_size_manual(name = "confidence level",
                      values = c(`80%` = 2.25, `95%` = 1.5, `99%` = 0.75),
                      guide = guide_legend(direction = "horizontal",
                                           title.position = "top",
                                           label.position = "bottom")) +
    scale_color_manual(name = "confidence level",
                       values = c(`80%` = col80, `95%` = col95, `99%` = col99),
                       guide = guide_legend(direction = "horizontal",
                                            title.position = "top",
                                            label.position = "bottom")) +
    thm +
    theme(legend.position = c(1, 0.01), legend.justification = c(1, 0)) +
    ylab("") +
    ggtitle("Confidence Intervals for the Mean")
Confidence densities, or confidence distributions, as proposed in
Adrian W. Bowman. Graphics for Uncertainty. J. R. Statist. Soc. A 182:1-16, 2018.
## based on code for Wilke's Fig. 16.9 (e)
library(ungeviz)  ## provides stat_confidence_density
ggplot(filter(c1CI, level == 0.95),
       aes(x = m, y = reorder(location, m))) +
    ## after_stat() replaces the deprecated stat() wrapper
    stat_confidence_density(aes(moe = xmax - m, fill = after_stat(ndensity)),
                            height = 0.7, confidence = 0.95, alpha = NA) +
    geom_segment(aes(x = m, xend = m,
                     y = as.integer(reorder(location, m)) - 0.35,
                     yend = as.integer(reorder(location, m)) + 0.35),
                 size = 2, color = colM) +
    scale_fill_gradient(low = "#81A7D600", high = "#345A7FD0") +
    scale_x_continuous(limits = c(2.5, 3.8), name = "mean rating") +
    thm +
    ylab("")
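For a mean, one simple way to think about a confidence density is as a normal density centered at the estimate, scaled so that the interval estimate ± margin of error has the stated confidence level. The sketch below assumes this normal approximation; the helper `confidence_density` and the values used for the estimate and margin of error are hypothetical, and the code is not meant to reproduce the internals of stat_confidence_density.

```r
## Normal-approximation confidence density for a mean.
## estimate: point estimate; moe: margin of error at the given confidence level.
confidence_density <- function(x, estimate, moe, confidence = 0.95) {
    se <- moe / qnorm(1 - (1 - confidence) / 2)  # implied standard error
    dnorm(x, mean = estimate, sd = se)
}

## Hypothetical estimate and margin of error on the rating scale.
x <- seq(2.5, 3.8, length.out = 200)
d <- confidence_density(x, estimate = 3.1, moe = 0.2)

## Density rescaled to [0, 1], suitable for mapping to a fill gradient.
ndens <- d / max(d)

## Area under the density over estimate +/- moe; close to 0.95.
area <- integrate(confidence_density, lower = 2.9, upper = 3.3,
                  estimate = 3.1, moe = 0.2)$value
area
```

A larger margin of error gives a wider, flatter density, so the shading fades more gradually for the less precise estimates.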
One drawback of all of these methods: The least precise measurement draws the most attention.
Using the very old mtcars data set to illustrate estimating a smooth relationship.
A default geom_smooth shows an estimate along with a point-wise confidence band.
p <- ggplot(mtcars, aes(disp, mpg)) +
geom_point() +
thm
p + geom_smooth()
This may not give the best sense of the joint uncertainty: if the curve is higher in some places, it may need to be lower in others.
Showing an ensemble of curves that all are plausible can be a better choice.
Here the ensemble is generated using a case-based bootstrap.
These plots are called ensemble plots (also spaghetti plots, for obvious reasons).
## ten bootstrap samples of the rows of mtcars, stacked and labeled by sample
mts <- lapply(seq_len(10),
              function(i) mutate(sample_frac(mtcars, 1, replace = TRUE),
                                 sample = i)) %>%
    bind_rows()
p2 <- p +
    geom_smooth(color = NA) +
    geom_smooth(aes(group = sample),
                se = FALSE, size = 0.3, color = "#3366FF", data = mts)
p2
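The case-based bootstrap behind the ensemble above can also be carried out by hand, which makes the resampling step explicit. This is a minimal sketch in base R, assuming a quadratic `lm` fit as a simple stand-in for the smoother that geom_smooth uses; the names `grid`, `boot_curve`, `curves`, and `spread` are illustrative.

```r
set.seed(123)

## Common grid of displacement values on which every curve is evaluated.
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp),
                              length.out = 50))

## One bootstrap replicate: resample rows (cases) with replacement,
## refit the curve, and predict on the common grid.
boot_curve <- function(i) {
    rows <- sample(nrow(mtcars), replace = TRUE)
    fit <- lm(mpg ~ disp + I(disp^2), data = mtcars[rows, ])
    data.frame(sample = i, disp = grid$disp,
               mpg = predict(fit, newdata = grid))
}

## Ensemble of ten plausible curves in long format.
curves <- do.call(rbind, lapply(seq_len(10), boot_curve))

## Range of the fitted values across curves at each grid point.
spread <- tapply(curves$mpg, curves$disp, function(v) diff(range(v)))
```

The `curves` data frame has one row per grid point per replicate, so it can be drawn with geom_line(aes(group = sample)) in the same way as the smooths above.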
If animation is available, an alternative is to show the curves one at a time in an animation.
This is an example of a hypothetical outcomes plot, or HOP, as introduced in
Hullman, Jessica, Paul Resnick, and Eytan Adar. “Hypothetical outcome plots outperform error bars and violin plots for inferences about reliability of variable ordering.” PloS one 10, no. 11 (2015).
Again a bootstrap is used to produce the estimates.
library(gganimate)
animate(p2 + transition_states(sample, transition_length = 2, state_length = 1))
A visualization can accurately reflect data but still be misleading if the data are faulty.
A recent NY Times article shows a choropleth map of the estimated share of adults who would “definitely” or “probably” get the COVID-19 vaccine.
Cutoffs for the map's color scale: 49, 60, 65, 70, 75, 80, 91%.
The map may accurately reflect the estimates, but the estimates have obvious problems.
Plot annotations can create popout and help focus the viewer’s attention.
Here are a few examples for the mpg data:
library(ggforce)  ## provides the geom_mark_* annotations
## geom_mark_hull also requires the concaveman package to be installed
ggplot(mpg, aes(displ, hwy)) +
    geom_point() +
    geom_mark_hull(aes(filter = class == "2seater"),
                   fill = "blue",
                   description = "2-Seaters have high displacement values, but also high fuel efficiency for their displacement.") +
    geom_mark_rect(aes(filter = hwy > 40),
                   fill = "green",
                   description = "These are Volkswagens") +
    geom_mark_circle(aes(filter = hwy == 12),
                     fill = "red",
                     description = "Three pickups and an SUV.")
It is often useful to use several graphics to present an analysis.
Collections of related graphs are sometimes called ensemble graphics.
Online presentations of analyses involving multiple visualizations and, typically, some interactive features are also called dashboards.
To aid the viewer it is usually best to design these visualizations together, with common axis choices and color mappings.
Unwin’s Fig 12.1 provides a simple example:
library(ggplot2)
library(GGally)
library(gridExtra)
data(coffee, package = "pgmm")
coffee <- within(coffee, Type <- ifelse(Variety == 1,
                                        "Arabica", "Robusta"))
names(coffee) <- abbreviate(names(coffee), 8)
## bar chart of the number of samples of each type
a <- ggplot(coffee, aes(x = Type)) + geom_bar(aes(fill = Type)) +
    scale_fill_manual(values = c("grey70", "red")) +
    guides(fill = "none") + ylab("")
## scatterplot of two of the chemical constituents
b <- ggplot(coffee, aes(x = Fat, y = Caffine, colour = Type)) +
    geom_point(size = 3) +
    scale_colour_manual(values = c("grey70", "red"))
## parallel coordinate plot of all twelve constituents
c <- ggparcoord(coffee[order(coffee$Type), ], columns = 3 : 14,
                groupColumn = "Type", scale = "uniminmax") +
    xlab("") + ylab("") +
    theme(legend.position = "none") +
    scale_colour_manual(values = c("grey", "red")) +
    theme(axis.ticks.y = element_blank(),
          axis.text.y = element_blank())
grid.arrange(arrangeGrob(a, b, ncol = 2, widths = c(1, 2)),
             c, nrow = 2)
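One way to keep color mappings identical across the panels of an ensemble is to define a named palette and a single scale object once and add it to every plot. The following is a minimal sketch using the mpg data from ggplot2 rather than the coffee data; the palette `class_colours` and the choice of the two vehicle classes are hypothetical.

```r
library(ggplot2)

## Named palette: each level always gets the same color,
## regardless of which levels appear in a given panel.
class_colours <- c(compact = "grey70", suv = "red")
d <- subset(mpg, class %in% names(class_colours))

## One scale object covering both the colour and fill aesthetics.
shared_colour <- scale_colour_manual(values = class_colours,
                                     aesthetics = c("colour", "fill"))

p1 <- ggplot(d, aes(class, fill = class)) +
    geom_bar() +
    shared_colour
p2 <- ggplot(d, aes(displ, hwy, colour = class)) +
    geom_point() +
    shared_colour
```

Naming the palette entries ties each color to a level rather than to its position, so a panel in which one level is missing or reordered still uses the same colors as the others.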
Data on the chemical composition of coffee samples collected from around the world, comprising 43 samples from 29 countries. Each sample is either of the Arabica or Robusta variety. Twelve of the thirteen chemical constituents reported in the study are given. The omitted variable is total chlorogenic acid; it is generally the sum of the chlorogenic, neochlorogenic and isochlorogenic acid values.
Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In Association Scientifique International du Cafe, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61-72.
Some of the areas we covered:
Many different types of graphs.
Perception
A little on interaction, animation.
Emphasis on techniques useful for exploration, scientific reporting.
rmarkdown for integrating code and reporting.
git, GitHub.
Class notes will remain available, in some form, at the class web site.
Some books to look at:
Alberto Cairo (2019) How Charts Lie: Getting Smarter about Visual Information, W. W. Norton & Company.
Claus O. Wilke (2019) Fundamentals of Data Visualization, O’Reilly, Inc. (Book source on GitHub; supporting materials on GitHub)
Kieran Healy (2018) Data Visualization: A Practical Introduction, Princeton University Press.
Winston Chang (in preparation), R Graphics Cookbook, 2nd edition, O’Reilly. (Book source on GitHub)
Some blogs to check out:
Keep a critical eye out for good (and not so good) uses of data visualization in the media.