---
title: "Perception and Visualization"
output:
html_document:
toc: yes
code_folding: show
code_download: true
---
```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
fig.height = 5, fig.width = 6, fig.align = "center")
library(lattice)
library(tidyverse)
library(gridExtra)
library(scales)
set.seed(12345)
```
## Some River Flow Data
```{r, echo = TRUE}
river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat")
plot(river)
```
## A Simple Model of Visual Perception
The eyes acquire an image, which is processed through three stages of
memory:
* Iconic memory
* Working memory, or short-term memory
* Long-term memory
### Iconic Memory
The first processing stage of an image happens in iconic memory.
* Images remain in iconic memory for less than a second.
* Processing in iconic memory is massively parallel and automatic.
* This is called _preattentive processing_.
Preattentive processing is a fast recognition process.
### Working Memory
Meaningful visual chunks are moved from iconic memory to short term memory.
These chunks are used by conscious, or attentive, processing.
Attentive processing often involves conscious comparisons or search.
Short term memory is limited;
* information is retained for only a few seconds;
* only three or fours chunks can be held at a time.
Chunks can be of varying size; a coherent pattern can form a single
chunk even if it is quite large.
If more chunks are needed or chunks are needed longer they need to
be reacquired or retrieved from long term memory.
### Long Term Memory
Long term visual memory is built up over a lifetime, though
infrequently used visual chunks may become lost.
Chunks processed repeatedly in working memory may be transferred to
long term memory.
Common patterns and contextual information can be retrieved from
long term memory for attentive processing in working memory.
### Visual Design Implications
Try to make as much use of preattentive features as possible.
Recognize when preattentive features might mislead.
For features that require attentive processing keep in mind that
working memory is limited.
## Some Examples of Challenges
### Context Matters
Which of the inner circles is larger, or are they the same size?
```{r, echo = FALSE}
circle <- local({
phi <- seq(0, 2 * pi, length.out = 100)
x <- cos(phi)
y <- sin(phi)
function(cx = 0, cy = 0, r = 1) list(x = r * x + cx, y = r * y + cy)
})
setup <- function(w = 5)
plot(0, type = "n", asp = 1,
xlim = c(-w, w), ylim = c(-w, w),
axes = FALSE, xlab = "", ylab = "",
mar = c(0, 0, 0, 0))
setup()
polygon(circle(-2, 0, 3))
polygon(circle(-2, 0), col = "black")
polygon(circle(3, 0, 1.3))
polygon(circle(3, 0), col = "black")
```
```{r, echo = FALSE}
setup(10)
x1 <- -4
x2 <- 6
polygon(circle(x1, 0))
for (a in (2 * pi * (1 : 6) / 6))
polygon(circle(4 * cos(a) + x1, 4 * sin(a), 2))
polygon(circle(x2, 0))
for (a in (2 * pi * (1 : 10) / 10))
polygon(circle(1.5 * cos(a) + x2, 1.5 * sin(a), 0.5))
```
Which of the lines is longer, or are they the same length?
```{r, echo = FALSE}
setup()
polygon(circle(-4, -2))
polygon(circle(4, -2))
polygon(circle(-2, 2))
polygon(circle(2, 2))
segments(-3, -2, 3, -2, lwd = 3)
segments(-3, +2, 3, +2, lwd = 3)
width <- 3
ah <- function(x, y, size = 1, dir = 1) {
ax <- c(-size * dir, 0, -size * dir)
ay <- c(-size, 0, size)
lines(x + ax, y + ay, lwd = width)
}
```
```{r, echo = FALSE}
setup()
segments(-3, -2, 3, -2, lwd = width)
ah(3, -2)
ah(-3, -2, dir = -1)
segments(-3, +2, 3, +2, lwd = width)
ah(3, 2, dir = -1)
ah(-3, 2)
```
The sine Illusion: which of the bars are longer, or are they the same
length?
```{r, class.source = "fold-hide"}
x <- seq(0, 5 * pi, length.out = 100)
w <- 0.5
plot(x, sin(x), ylim = c(-1, 1 + w), type = "n")
segments(x0 = x, y0 = sin(x), y1 = sin(x) + w, lwd = 3)
```
Which of the squares A and B is darker, or are they the same shade?
```{r echo = FALSE}
knitr::include_graphics(IMG("chess1.png"))
```
```{r echo = FALSE}
knitr::include_graphics(IMG("chess2.png"))
```
### Some Optical Illusions
[R implementations](http://rpubs.com/kohske/R-de-illusion) of some
optical illusions by Kohske Takahashi:
Are these lines parallel?
```{r, echo = FALSE}
library(grid)
rs <- expand.grid(x = seq(0, 1, 1 / 10), y = seq(0, 1, 1 / 10))
grid.rect(rs$x, rs$y, 1 / 10 / 2, 1 / 10 / 2,
gp = gpar(fill = "black", col = NA))
grid.rect(rs$x + 1 / 10 / 4, rs$y + 1 / 10 / 2, 1 / 10 / 2, 1 / 10 / 2,
gp = gpar(fill = "black", col = NA))
ls <- expand.grid(x = 0 : 1, y = seq(0, 1, 1 / 20) - 1 / 20 / 2)
grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2),
gp = gpar(col = "grey50", lwd = 1))
```
Again, are these lines parallel?
```{r, echo = FALSE}
grid.newpage()
n <- 10; ny <- 8; L <- 0.01
c <- seq(0, 1, length = n); d <- 1.2 * diff(c)[1] / 2
col <- c("black", "white")
x <- c(c - d, c, c + d, c)
y <- rep(c(0, -d, 0, d), each = n)
w <- c(c - d, c - d + L, c + d, c + d - L)
z <- c(0, L, 0, -L)
ys <- seq(0, 1, length = ny)
grid.rect(gp = gpar(fill = gray(0.5), col = NA))
plyr::l_ply(1 : ny, function(i) {
if (i %% 2 == 0) {
co <- rev(col)
z <- -z
} else
co <- col
grid.polygon(x, y + ys[i], id = rep(1 : n, 4),
gp = gpar(fill = co, col = NA))
grid.polygon(w, rep(z, each = n) + ys[i], id = rep(1 : n, 4),
gp = gpar(fill = rev(co), col = NA))
})
```
[Scintillating grid
illusion](https://psychology.wikia.org/wiki/Grid_illusion#Scintillating_grid_illusion):
Black dots at the intersections appear and disappear; are they real?
```{r, echo = FALSE}
nx <- 6; ny <- 6; lwd <- 10; cr <- 1.2 / 100
grid.newpage()
## nolint start
## xlim <- ylim <- c(0.05, 0.95)
## pushViewport(viewport( # like plot.window()
## x=0.5, y=0.5, # a centered viewport
## width=unit(min(1,diff(xlim)/diff(ylim)), "snpc"), # aspect ratio preserved
## height=unit(min(1,diff(ylim)/diff(xlim)), "snpc"),
## xscale=xlim, # cf. xlim
## yscale=ylim # cf. ylim
##))
## nolint end
grid.rect(0.5, 0.5, 1, 1, gp = gpar(fill = "black"))
ls <- expand.grid(x = 0:1, y = seq(0, 1, 1 / nx / 2) - 1 / nx / 2 / 2)
grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2),
gp = gpar(col = "grey", lwd = lwd))
ls <- expand.grid(y = 0 : 1, x = seq(0, 1, 1 / ny / 2) - 1 / ny / 2 / 2)
grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2),
gp = gpar(col = "grey", lwd = lwd))
ls <- expand.grid(x = seq(0, 1, 1 / nx / 2) - 1 / nx / 2 / 2,
y = seq(0, 1, 1 / ny / 2) - 1 / ny / 2 / 2)
grid.circle(ls$x, ls$y, r = cr, gp = gpar(col = NA, fill = "white"))
```
[Hermann grid
illusion](https://psychology.wikia.org/wiki/Grid_illusion#Hermann_grid_illusion):
"Ghost-like" grey blobs at the intersections of white lines.
```{r, echo = FALSE}
grid.newpage()
grid.rect(0.5, 0.5, 1, 1, gp = gpar(fill = "black"))
gp <- gpar(col = "white", lwd = 20)
ls <- expand.grid(x = 0 : 1, y = seq(0, 1, len = nx))
grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2), gp = gp)
grid.polyline(ls$y, ls$x, id = gl(nrow(ls) / 2, 2), gp = gp)
```
Grid illusion in a cartogram:
```{r, echo = FALSE, message = FALSE, warning = FALSE, out.width = "60%"}
library(ggtext)
library(statebins)
library(hrbrthemes)
library(tidyverse)
## jsonlite::fromJSON(
## "https://static01.nyt.com/newsgraphics/2020/06/02/testing-dashboard/36a7f9f71fb6271078df301b8012185016c851cf/states.json"
## ) -> cvdf
cvdf <- jsonlite::fromJSON(WLNK("nyt-states-2020-06-02.json"))
cvdf %>%
unnest(data) %>%
mutate(date = as.Date(date)) %>%
filter(date == max(date)) %>%
mutate(test_pct = tests_new / est_tests_needed) %>%
select(postal, test_pct) %>%
mutate(
status = case_when(
test_pct > 1.2 ~ "above",
test_pct < 0.8 ~ "below",
TRUE ~ "near"
)
) %>%
ggplot() +
geom_statebins(
aes(state = postal, fill = status)
) +
coord_fixed() +
scale_fill_manual(
name = NULL,
values = c(
"near" = "#febf82",
"below" = "#c7615b",
"above" = "#bacfa2"
)
) +
labs(
title = "Is Your State Doing Enough Coronavirus Testing?",
subtitle = "**12** states meet the testing target; **5** are near the target; **34** are below the target;",
caption = "Data source: NYTimes "
) +
theme_ipsum_es(grid="", plot_title_size = 15) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_markdown(family = font_es, face = "plain", hjust = 0.5, size = 10)
) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
theme(legend.position = "none", plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
```
A case where there are more dots than we can see at once:
```{r echo = FALSE, out.width = "40%"}
knitr::include_graphics(IMG("disappearing-dots.jpeg"))
```
Some illusions in pictures:
```{r echo = FALSE, out.width = "50%", fig.cap = "A man running into the snow"}
knitr::include_graphics(IMG("man_in_snow.jpeg"))
```
```{r echo = FALSE, out.width = "50%", fig.cap = "Once you see Cookie Monster, you can’t unsee it"}
knitr::include_graphics(IMG("cookie-monster.jpeg"))
```
A large collection of optical illusions is available at
.
Other links to optical illusions can be found [here](https://onlinemasters.ohio.edu/masters-health-administration/health-brain-games-optical-illusions/).
### Motion
```{r chunk-label, fig.width = 10, fig.show='animate', ffmpeg.format='gif', dev='jpeg', interval = 0.1, class.source = "fold-hide"}
n <- 50
x <- 2 * (1 : n)
y <- rep(2, n)
lim <- c(min(x) + 0.1 * (max(x) - min(x)), max(x) - 0.1 * (max(x) - min(x)))
v <- TRUE
for (i in 1 : 2) {
plot(x + v, y, xlim = lim)
v <- ! v
Sys.sleep(0.1)
}
```
```{r, include = FALSE}
gif_file <- "3dscatter_percep.gif"
if (! file.exists(IMG(gif_file))) {
library(rgl)
d <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))
points3d(d)
par3d(FOV = 1) ## removes perspective distortion
movie3d(
movie = sub("\\..*", "", gif_file),
spin3d(axis = c(0, 0, 1), rpm = 20),
duration = 3,
type = "gif",
dir = IMG(""),
clean = TRUE
)
## re-save so it loops indefinitely
img <- magick::image_read(IMG(gif_file))
magick::image_write(img, path = IMG(gif_file))
}
```
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics(IMG("3dscatter_percep.gif"))
```
Images contain no information about the direction of the rotation.
Your mind picks a direction.
```{r, include = FALSE}
knitr::knit_hooks$set(webgl = rgl::hook_webgl)
options(rgl.useNULL = TRUE)
```
```{r, webgl = TRUE, class.source = "fold-hide"}
library(rgl)
d <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000))
points3d(d)
par3d(FOV = 1) ## removes perspective distortion
if (interactive())
play3d(spin3d(axis = c(0, 0, 1), rpm = 30), duration = 20)
```
Interactive rotation can help.
Depth cueing and perspective can also help.
### Popout and Distractors
Where is the red dot:
```{r, echo = FALSE}
thm <- theme_classic()
thm <- list(theme_void(),
theme(panel.border = element_rect(fill = NA, color = "grey20")),
guides(color = "none", shape = "none"),
scale_shape_manual(values = c(A = 16, B = 15)),
coord_equal(), xlim(c(0, 1)), ylim(c(0, 1)))
pts <- function(n, version, sz = 3) {
idx <- seq_len(n)
if (version == 1) {
v1 <- ifelse(idx == 1, "A", "B")
v2 <- "A"
}
else if (version == 2) {
v1 <- "A"
v2 <- ifelse(idx == 1, "A", "B")
}
else if (version == 3) {
v1 <- ifelse(idx == 1 | idx > (n / 2), "A", "B")
v2 <- ifelse(idx <= (n / 2), "A", "B")
}
d <- data.frame(x = runif(n), y = runif(n), v1 = v1, v2 = v2)
ggplot(d, aes(x, y, color = v1, shape = v2)) + geom_point(size = sz) + thm
}
p1 <- pts(10, 1)
p2 <- pts(100, 1)
p3 <- pts(10, 2)
p4 <- pts(100, 2)
p5 <- pts(10, 3)
p6 <- pts(100, 3)
```
```{r, echo = FALSE}
grid.arrange(p1, p2, nrow = 1)
```
```{r, echo = FALSE}
grid.arrange(p3, p4, nrow = 1)
```
```{r, echo = FALSE}
grid.arrange(p5, p6, nrow = 1)
```
## Items, Attributes, Marks, and Channels
To evaluate or design a visualization it is useful to have some terms
for the components.
Several schools have developed different but similar sets of terms.
Some References:
* Bertin, J. (1967), _The Semiology of Graphics_.
* Cleveland, W. S. (1988), _The Elements of Graphing Data_.
* Cleveland, W. S. (1993), _Visualizing Data_.
* Few, S. (2012), _Show Me the Numbers: Designing Tables and Graphs to
Enlighten_, 2nd ed.
* Munzner, T. (2014), _Visualization Analysis and Design_
* Ware, C. (2012), _Information Visualization: Perception for Design_, 3rd ed.
* Wilkinson, L. (2005), _The Grammar of Graphics_, 2nd ed.
Munzner uses the terminology of items, attributes, links, marks, and
channels:
* _Items_ are the basic units on which data is collected: the entities
represented by the rows in a _tidy_ data frame.
* _Attributes_ are the numerical or categorical features of the data
items we want to represent; the variables in a tidy data frame.
* _Links_ are relations among items: e.g. months within a year, or
countries within a continent.
* _Marks_ are the geometric entities used to represent items: points,
lines, areas. These correspond to the simple `geom` objects in
`ggplot`.
* _Visual channels_ are features of marks that can be used to reflect
values of attributes.
Channels correspond approximately to aesthetics in `ggplot` but are
more focused on the visual aspect:
* An `x` aesthetic that is transformed to polar coordinates is a
different channel than an `x` aesthetic representing a position on a
standard linear scale.
Channels are used to _encode_ attributes (_aesthetic mappings_).
A single attribute can be encoded in several channels.
* This makes the attribute easier to perceive.
* But is also uses up available channels.
Some channels are well suited to encode quantitative or ordered
values; they are _quantitatively perceived_.
Others are only suited for nominal values.
A useful classification, adapted from Few(2012):
| Type | Channel | Quantitatively Perceived? |
|:---------|:------------|:--------------------------- |
| Form | Length | Yes |
| | Width | Yes, but limited |
| | Orientation | No |
| | Size | Yes, but limited |
| | Shape | No |
| Color | Hue | No |
| | Intensity | Yes, but limited |
| Position | 2D position | Yes |
Munzner uses the terms _magnitude channels_ and _identity channels_.
Ideally,
* numeric and ordered attributes should be represented by quantitatively
perceived channels;
* unordered categorical attributes should be represented by
non-quantitative channels
A useful principle: The most important attributes should be mapped to
the most effective channels.
## Channel Effectiveness
Some questions about channels:
* What kind and how much information can a channel encode?
* Are some channels better than others?
* Can channels be used independently, or do they interfere?
Some criteria for evaluating channels:
* Accuracy: How well can a viewer decode the information in the channel?
* Discriminability: How easily can differences between attribute
levels be perceived?
* Separability: Can channels be used independently or is there
interference?
* Popout: can a channel provide popout where a difference is perceived
preattentively?
* Grouping: can a channel show perceptual grouping of items?
### Channel Accuracy
Stevens (1957) argues that accuracy of magnitude channels can be
described by a power law:
$$ \text{perceived sensation} = (\text{physical intensity})^\gamma $$
Experiments by Stevens suggest these values for some visual channels:
```{r, echo = FALSE, fig.height = 4}
n <- 100
x <- seq(0, 5, len = n)
gamma <- c(0.5, 0.67, 0.7, 1, 1.7)
chan <- c("Brightness", "Depth", "Area", "Length", "Staturation")
label <- sprintf("%s (%.2f)", chan, gamma)
nc <- length(gamma)
d <- data.frame(Intensity = rep(x, nc),
gamma = rep(gamma, each = n),
label = rep(label, each = n))
p <- ggplot(d, aes(x = Intensity, y = Intensity ^ gamma)) +
geom_line(aes(group = gamma, color = label), size = 1.5) + ylim(c(0, 5)) +
labs(x = "Phisical Intensity", y = "Perceived Sensation") +
theme_minimal() +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
suppressWarnings(print(p))
```
[Others have raised concerns](https://en.wikipedia.org/wiki/Stevens%27s_power_law)
about the validity of these findings.
```{r, echo = FALSE, eval = FALSE}
## Color wheel for hue:
n <- 72
pie(rep(1, n), col = hsv((1 : n) / n, 1, 1), border = NA, labels = NA)
```
```{r, echo = FALSE, eval = FALSE}
## Value and saturation
plot(0, type = "n", asp = 1,
xlim = c(0, 1), ylim = c(0, 1),
axes = FALSE, xlab = "value/lightness", ylab = "saturation",
mar = c(0, 0, 0, 0))
n <- 32
x <- seq(0, 1, length.out = n)[-n]
hue <- rgb2hsv(col2rgb("blue"))[["h", 1]]
delx <- mean(diff(x))
dely <- 1 / n
for (i in 1 : n)
rect(x, dely * (i - 1), x + delx, dely * i,
col = hsv(hue, dely * i, x + delx), border = NA)
```
```{r, echo = FALSE, eval = FALSE}
## Another approach
opar <- par(mfrow = c(2, 1))
n <- 64
x <- seq(0, 1, length.out = n)[-n]
hue <- rgb2hsv(col2rgb("blue"))[["h", 1]]
delx <- mean(diff(x))
plot(0, type = "n", xlim = c(0, 1), ylim = c(0, 1),
axes = FALSE, xlab = "", ylab = "", mar = c(0, 0, 0, 0))
rect(x, 0.2, x + delx, 0.7, col = hsv(hue, x, 1), border = NA)
title("Saturation")
plot(0, type = "n", xlim = c(0, 1), ylim = c(0, 1),
axes = FALSE, xlab = "", ylab = "", mar = c(0, 0, 0, 0))
rect(x, 0.2, x + delx, 0.7, col = hsv(hue, 1, x), border = NA)
title("Value")
par(opar)
```
Another approach has used controlled experiments to assess accuracy of
various channels used in visualizations:
* William S. Cleveland and Robert McGill (1984), "Graphical
Perception: Theory, Experimentation, and Application to the
Development of Graphical Methods," _Journal of the American
Statistical Association_ 79, 531–554.
* William S. Cleveland and Robert McGill (1987), "Graphical
Perception: The Visual Decoding of Quantitative Information on
Graphical Displays of Data" _Journal of the Royal Statistical
Society. Series A_, 192-229.
* Jeffrey Heer and Michael Bostock (2010) "Crowdsourcing Graphical
Perception: Using Mechanical Turk to Assess Visualization Design,"
_Proceedings of the SIGCHI_, 203-212.
Munzner's ordering by accuracy:
```{r, echo = FALSE}
knitr::include_graphics(IMG("tmmchannels.jpeg"))
```
This ordering is sometimes referred to as a _perceptual ladder_.
Line width is another channel; not sure there is agreement on its
accuracy, but it is not high.
### Discriminability
Many channels, in particular identity channels, can only support a
limited number of discriminable levels.
* Line width is one of the most limited with perhaps 3 levels.
* Using more than 5 or 6 color hues is not recommended.
* Similarly, using more than 5 or 6 symbol shapes can create difficulties.
If the number of levels that can be represented by a channel is
smaller than the number of attribute levels then some form of
meaningful aggregation is needed.
### Separability
Some encodings can be used independently of each other; others
interfere with each other to some degree.
```{r, echo = FALSE, out.width = "55%"}
knitr::include_graphics(IMG("TMfig5.10.jpg"))
```
* Vertical an horizontal position can be used independently.
* Color (hue) and position can be used independently
* Size and hue interfere somewhat; hue is harder to perceive on
smaller objects.
* Width and height do not function well independently; the result is
perceived primarily as shape.
* Encoding two different values in the red and green channels as a hue
does not work at all.
### Popout
Many channels support visual popout: having one item or a few items
immediately stand out from the others.
* Color (hue and intensity) do this well.
* Shape and size can also be used effectively to create popout.
Annotation can also be used to create popout.
### Grouping
Perceptual grouping can be achieved in several ways:
* Using an identity channel to to represent items as a group.
* Using _link marks_.
* By _enclosure_.
* By spatial proximity.
```{r, echo = FALSE}
d1 <- data.frame(x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(1, 2, 1, 1.5, 6, 5.5, 6, 5),
v = rep(c("A", "B"), each = 4))
p1 <- ggplot(d1, aes(x, y, color = v)) + geom_point(size = 4) +
theme_void() + guides(color = "none") +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
```{r, echo = FALSE}
d1 <- data.frame(x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(1, 2, 1, 1.5, 6, 5.5, 6, 5),
v = rep(c("A", "B"), each = 4))
p2 <- ggplot(d1, aes(x, y, color = v)) +
geom_line(aes(group = v), color = "black") +
geom_point(size = 4) + theme_void() + guides(color = "none") +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
```{r, echo = FALSE}
d3 <- data.frame(x = c(1, 1, 3, 3),
y = c(1, -1, 1, -1),
s = c(5, 10, 10, 5))
d3$i <- factor(seq_along(d3$x) %% 2)
p3 <- ggplot(d3) +
geom_rect(aes(xmin = 0.6, ymin = -1.5, xmax = 1.5, ymax = 1.5),
fill = "gray") +
geom_point(aes(x = x, y = y, size = s, color = i)) +
scale_size_area(max_size = 10) +
guides(size = "none", color = "none") + theme_void() +
xlim(c(0, 4)) + ylim(c(-2, 2)) +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
```{r, echo = FALSE}
d4 <- data.frame(x = c(1, 2, 3, 10, 10, 12),
y = c(1, -1, 1, 6, 8, 6),
s = c(5, 10, 10, 5, 5, 10))
d4$i <- factor(seq_along(d4$x) %% 2)
p4 <- ggplot(d4) + geom_point(aes(x = x, y = y, size = s, color = i)) +
scale_size_area(max_size = 15) +
guides(size = "none", color = "none") + theme_void() +
ylim(c(-2, 9)) + xlim(c(0, 13)) +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
```{r, echo = FALSE}
grid.arrange(p1, p2, p3, p4, nrow = 2)
```
## Experimental Evidence
### Cleveland-McGill
The 1984 paper is available from
[JSTOR](https://www.jstor.org/stable/2288400).
The paper formulates a theory for ranking _Elementary Perceptual
Tasks_; these correspond to channel mappings.
Some orderings were addressed by informal experiments (obvious to the
authors at least).
Others were assessed by formal experiments with about 50 subjects.
Experiments focused on accuracy of decoding, though this is not viewed
as the primary purpose of a graph:
> One must be careful not to fall into a conceptual trap of adopting
> accuracy as a criterion. ... The power of a graph is its ability to
> enable one to take in the quantitative information, organize it, and
> see patterns and structure not readily revealed by other means of
> studying the data.
Their premise:
> A graphical form that involves elementary perceptual tasks that lead
> to more accurate judgments than another graphical form (with the
> same quantitative information) will result in better organization
> and increase the chances of a correct perception of patterns and
> behavior."
The tasks: For each setting
* Identify which of two marked items is smaller.
* Estimate the percentage the smaller is of the larger.
`)
`)
Results:
Percent large errors:
`)
Absolute error:
`)
### Heer and Bostock
[Heer and Bostock
(2010)](http://idl.cs.washington.edu/papers/crowdsourcing-graphical-perception/)
set out to replicate the Cleveland McGill experiment using crowd
sourcing via [Amazon Mechanical Turk](https://www.mturk.com/)
They used the five position stimuli from Cleveland and McGill and
some new ones:
`)
50 subjects were recruited for each task.
Results were consistent with Cleveland-McGill results:
`)
Use of Mechanical Turk was deemed a success.
### Pie Chart Experiments
[Pie charts](https://eagereyes.org/pie-charts) are popular but
somewhat controversial.
* Pie charts are inferior for comparisons to bar charts.
* Pie charts are quite good at representing part-whole relationships.
* Cleveland and McGill suggested pie charts are read by angle.
* [Kosara and Skau](https://eagereyes.org/papers/a-pair-of-pie-chart-papers)
report experiments that suggest this is not the case.
* If it were,
[_donut charts_](http://www.datavizcatalogue.com/methods/donut_chart.html)
would be even less effective, but they seem to be very comparable.
* [Kosara's blog](https://eagereyes.org) provides a
[review](https://eagereyes.org/blog/2016/an-illustrated-tour-of-the-pie-chart-study-results)
of other pie chart studies.
## Improving Some Common Charts
Cleveland and McGill set out to suggest improvements to some common charts.
This is a selection of their examples.
### Dot Charts
Cleveland and McGill use their perceptual ladder to argue strongly for
using dot charts in place of bar charts and pie charts.
### Playfair's Balance of Trade plots
Playfair presented a number of plots showing imports and exports
between England and other nations.
A primary goal was to show the balance of trade, the difference
between exports and imports:
`)
Assessing the differences from a plot showing exports and imports as
separate curves requires length judgments, which are less accurate
than comparisons to a common stale.
```{r, include = FALSE}
# nolint start
if (! file.exists("playfair-balance-of-trade-data.tsv"))
download.file("http://www.stat.uiowa.edu/~luke/data/playfair-balance-of-trade-data.tsv",
"playfair-balance-of-trade-data.tsv")
# nolint end
```
```{r, echo = FALSE}
pimex <- read.delim("playfair-balance-of-trade-data.tsv")
mutate(pimex, difference = NULL) %>%
pivot_longer(-year, names_to = "which", values_to = "trade") %>%
ggplot(aes(x = year, y = trade, group = which, color = which)) +
geom_line() +
theme_minimal() +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
Plotting the difference makes the balance of trade much easier to assess:
```{r, echo = FALSE}
ggplot(pimex) + geom_line(aes(x = year, y = imports - exports)) +
theme_minimal() +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
### Framed Unaligned Bars
It is difficult to compare lengths of unaligned rectangles when the
lengths are close.
```{r, echo = FALSE}
h1 <- 7.4
h2 <- 6.7
hb <- 10
off <- 3
p1 <- ggplot() +
geom_rect(aes(xmin = 0, ymin = 0, xmax = 1, ymax = h1),
fill = muted("blue")) +
geom_rect(aes(xmin = 2, ymin = off, xmax = 3, ymax = off + h2),
fill = muted("blue")) +
theme_void() +
ylim(c(-1, off + hb + 1)) + xlim(c(-2, 5))
p1
```
Adding a frame moves the task up the perceptual ladder to an unaligned
comparison against a common scale.
```{r, echo = FALSE}
p1 + geom_rect(aes(xmin = 0, ymin = 0, xmax = 1, ymax = hb),
fill = NA, color = "black") +
geom_rect(aes(xmin = 2, ymin = off, xmax = 3, ymax = off + hb),
color = "black", fill = NA)
```
Comparing to a common scale is still the most effective approach:
```{r, echo = FALSE}
ggplot() +
geom_rect(aes(xmin = 0, ymin = 0, xmax = 1, ymax = h1),
fill = muted("blue")) +
geom_rect(aes(xmin = 2, ymin = 0, xmax = 3, ymax = h2),
fill = muted("blue")) + theme_void() +
ylim(c(-1, off + hb + 1)) + xlim(c(-2, 5))
```
But this does suggest that using unaligned framed rectangles to encode
a _third_ variable, with position encoding the two primary variables
may be effective.
### Framed Rectangle Maps
A _choropleth map_ is a common way to depict a quantitative variable in
a geographic context.
Shading is quite low on the perceptual ladder.
Cleveland and McGill suggest the use of framed rectangles positioned
on the map as an alternative.
This does not seem to have caught on so far, though you do sometimes
see the use of other glyphs, such as pie charts.
```{r, echo = FALSE, out.width = "90%"}
knitr::include_graphics(IMG("CMmap.jpg"))
```
## Analyzing a Design
Graph layout involves several levels:
* Primary data representation
* items, attributes
* marks, channels
* Supporting features
* trend lines, reference lines, annotations
* axes, legends
A useful structure for describing the primary features:
* What are the data items?
* What are the attributes?
* What marks are used?
* What channels are used?
* Which attribute is matched to channel 1
* Which attribute is matched to channel 2
* ...
Useful questions:
* Are the most important attributes mapped to the strongest channels?
* Do the mappings do a good job of conveying the primary message?
* If not, can the graph be improved by adjusting the mappings?
### A Gapminder Plot
One of the frames of a plot from the [GapMinder
site](https://www.gapminder.org):
`)
* Items are countries (in a particular year)
* Attributes in the plot:
* life expectancy
* income per person
* population
* continent
* country name
* year
* Marks: points (or bubbles).
* Channels and mappings:
* horizontal position, mapped to log income
* vertical position, mapped to life expectancy
* area, mapped to population
* color (hue), mapped to continent
* interactive: mouse-over, mapped to country name
* interactive: time (or frame), mapped to year
The basic plot can be created with `ggplot` and aesthetic mappings:
```{r gapminder-basic, eval = FALSE}
library(gapminder)
ggplot(filter(gapminder, year == 2007)) +
geom_point(aes(x = gdpPercap,
y = lifeExp,
size = pop,
color = continent)) +
scale_x_log10() +
ylim(c(20, 85))
```
```{r gapminder-basic, echo = FALSE}
```
The data in the `gapminder` package differ somewhat from the data used
by the Gapminder site, but overall the plot designs are very close.
With some adjustments the basic plot can be brought close to the
Gapminder version in appearance:
```{r gapminder-full, eval = FALSE}
gm2007 <- filter(gapminder, year == 2007) %>%
arrange(desc(pop)) ## sort to avoid over-plotting
ggplot(gm2007) +
geom_point(aes(x = gdpPercap,
y = lifeExp,
size = pop,
fill = continent),
shape = 21) + ## to allow the using `fill` aesthetic
scale_x_log10() +
ylim(c(20, 85)) +
scale_size_area(max_size = 20,
labels = comma,
breaks = c(0.25 * 10 ^ 9, 0.5 * 10 ^ 9, 10 ^ 9)) +
scale_fill_manual(values = c(Africa = "deepskyblue",
Asia = "red",
Americas = "green",
Europe = "gold",
Oceania = "brown")) +
labs(x = "Income", y = "Life expectancy") +
theme(text = element_text(size = 16)) +
guides(fill = guide_legend(title = "Continent",
override.aes = list(size = 5),
order = 1),
size = guide_legend(title = "Population",
label.hjust = 1,
order = 2)) +
theme_minimal() +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
```
```{r gapminder-full, echo = FALSE, fig.height = 6, fig.width = 8}
```
Some notes:
* Using larger bubbles makes the plot more engaging.
* Using larger bubbles makes differences in population easier to
assess, but makes the strength of the relationship between life
expectancy and income harder to assess.
* Using larger bubbles also increases the risk of
over-plotting. Sorting the rows so larger bubble are drawn first
helps reduce the risk somewhat.
These plots show a fairly strong marginal association between life
expectancy and income.
There does not seem to be a strong association of population with the
other two variables, but these plots are not ideal for that assessment.
To judge the marginal association between life expectancy and
population size we can change the channel mapping:
* map the horizontal axis to population;
* map area to income.
```{r gapminder-lifepop, eval = FALSE}
ggplot(filter(gapminder, year == 2007)) +
geom_point(aes(x = pop,
y = lifeExp,
size = gdpPercap)) +
scale_x_log10() +
ylim(c(20, 85)) +
theme_minimal() +
theme(panel.border =
element_rect(fill = NA,
color = "grey20"))
```
```{r gapminder-lifepop, echo = FALSE}
```
This confirms that there is very little association between life
expectancy and population.
The association between life expectancy and income is still visible,
but is easier to assess when these two variables are mapped to 2D
position.
### Michelin Stars
This image, from a
[blog post](http://createhtml5map.com/interactive-map-blog/interactive-bubble-chart-countries-receive-the-most-michelin-stars-each-year/),
shows the total number of stars for different countries:
`)
* Items: Countries (in a particular year)
* Attributes:
* country name
* number of stars
* Marks: bubbles, text
* Channels and mappings:
* bubble color, mapped to country
* bubble area, mapped to number of stars
* text, mapped to country name (where possible)
* text, mapped to number of stars (where possible)
Observations:
* None of the channels are very strong.
* The strongest channels, 2D position, are not used.
* The number of colors used is too high.
A simple _dot plot_ would convey the distribution better.
A dot plot using [2017 data](michelin.dat) from an
[article](https://www.telegraph.co.uk/travel/maps-and-graphics/map-michelin-star-restaurants-countries-with-the-most/)
in The Telegraph:
```{r}
michelin <- read.table(WLNK("michelin.dat"),
head = TRUE)
michelin <- mutate(michelin,
stars = one + 2 * two + 3 * three,
country = reorder(country, stars))
ggplot(michelin, aes(x = stars, y = country)) +
geom_point() +
labs(x = "Stars", y = NULL) +
theme_minimal() +
theme(text = element_text(size = 16)) +
theme(panel.border =
element_rect(fill = NA,
color = "grey20"))
```
Even if a bubble plot is desired for aesthetic reasons, position
could be used to
* group countries by continent;
* show countries on a map.
A plot like the original can be constructed by
* computing locations for a set of packed spheres with specified radii;
* using `geom_point` and `geom_text`.
```{r, fig.width = 8, class.source = "fold-hide"}
## create randomly located packed spheres with specified areas
library(packcircles)
set.seed(54321)
packing <- circleProgressiveLayout(michelin$stars)
## merge circle locations with starts data
mcirc <- bind_cols(packing, michelin) %>%
mutate(country = factor(country)) ## clean out stray attributes
## compute some colors to use
nr <- nrow(mcirc)
pal <- colorRampPalette(RColorBrewer::brewer.pal(9, "Set1"))
cols <- sample(pal(nr), nr)
## these need tuning for the screen or R markdown
csize <- 45
tsize <- 2.5
## the basic plot
## uses shape = 19 and `color` aesthetic
p <- ggplot(mcirc, aes(x = x, y = y)) +
geom_point(aes(size = radius ^ 2, color = country), shape = 19) +
scale_size_area(max_size = csize) +
geom_text(aes(label = paste(country, stars, sep = "\n")),
data = filter(mcirc, stars >= 120),
size = tsize) +
coord_fixed()
## adjustments
p + guides(size = "none") +
scale_color_manual(values = cols) +
xlim(with(mcirc, range(x) + c(-1, 1) * max(radius))) +
ylim(with(mcirc, range(y) + c(-1, 1) * max(radius))) +
theme_void()
```
A somewhat more robust approach
* creates vertex data for polygon approximations to the circles;
* merges the stars data with the polygon data;
* uses `geom_polygon` and `geom_text`.
This is very similar to the way simple _choropleth maps_ are created.
```{r, fig.width = 8, class.source = "fold-hide"}
## compute polygon approximations to spheres
mcircpoly <- circleLayoutVertices(mcirc, idcol = "country", npoints = 100) %>%
rename(country = id)
## merge the stars data into the polygon data
sttab <- select(michelin, country, stars)
mcircpoly <- left_join(mcircpoly, sttab, "country")
## create the plot
ggplot(mcircpoly, aes(x = x, y = y)) +
geom_polygon(aes(fill = country)) +
geom_text(aes(label = paste(country, stars, sep = "\n")),
data = filter(mcirc, stars >= 120),
size = tsize) +
coord_fixed() +
scale_fill_manual(values = cols) +
theme_void()
```
## Aspect Ratio and Perception
The river flow data shows how important aspect ratio can be to our
ability to detect patterns:
```{r, echo = FALSE}
library(gridExtra)
river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat")
r <- data.frame(flow = river, month = seq_along(river))
p0 <- ggplot(r, aes(x = month, y = flow)) +
theme_minimal() +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
p1 <- p0 + geom_point()
grid.arrange(p1 + coord_fixed(ratio = 35),
p1 + coord_fixed(ratio = 4), heights = c(3, 1))
```
Using a line plot the basic periodicity becomes apparent even in the
first aspect ratio.
```{r, class.source = "fold-hide"}
p2 <- p0 + geom_line()
p2 + coord_fixed(ratio = 35)
```
But the steeper increase/shallower decrease of most periods is easier
to see in the second aspect ratio:
```{r, fig.height = 2, class.source = "fold-hide"}
p2 + coord_fixed(ratio = 4)
```
The aspect ratio also influences interpretation of results.
Some alternative views of (suspect) data on the number of people on
government assistance over a time period:
```{r, message = FALSE, fig.width = 8, class.source = "fold-hide"}
library(readr)
w <- read_csv(WLNK("hw2-welfare.csv"))
w <- mutate(w, onAssistance = onAssistance / 10 ^ 6)
w <- mutate(w,
date = seq(as.Date("2009-01-01"), by = "quarter", length.out = 10))
p0 <- ggplot(w, aes(x = date, y = onAssistance)) +
theme_minimal() +
theme(panel.border = element_rect(fill = NA, color = "grey20"))
p1 <- p0 + geom_line(aes(group = 1))
grid.arrange(p1, p1 + coord_fixed(ratio = 8),
p1 + ylim(0, max(w$onAssistance)),
p0 + geom_col(width = 40), ncol = 2)
```
Some notes:
* Automated choices of axis scaling can affect the aspect ratio of
the content of a plot.
* A zero base line supports ratio comparisons.
* The preattentive response to bar charts is always to compare
ratios, so using a zero base line is important.
* Using a non-zero base line for line plots and scatter plots
encourages interval, or difference, comparisons.
* Research on the effect of aspact ratio on perception has focused
on accuracy of slope comparisons.
* The general message is that keeping away from slopes that are too
steep or too shallow is best.
* _Banking to 45 degrees_, or choosing an aspect ratio so the slope
magnitudes are distributed around 45 degrees is often recommended.
* This also tends to be a useful "neutral ground" when political
implications are involved.
Some references:
* A [blog post](https://eagereyes.org/basics/banking-45-degrees) by
Robert Kosara.
* William S. Cleveland, Marylyn E. McGill and Robert McGill (1988),
"The Shape Parameter of a Two-Variable Graph", _Journal of the
American Statistical Association_
([JSTOR](http://www.jstor.org/stable/2288843?seq=1#page_scan_tab_contents))
* Justin Talbot, John Gerth, Pat Hanrahan (2012), "An Empirical Model
of Slope Ratio Comparisons", _IEEE Trans. Visualization &
Comp. Graphics (Proc. InfoVis)_
([PDF](http://vis.stanford.edu/files/2012-SlopeComparison-InfoVis.pdf))
## Ensemble Plots and Faceting
Using multiple channels allows a single plot to show a lot of information.
But over-plotting and interference can become problems.
One alternative is to use several related views in a useful arrangement.
Such arrangements are sometimes called _ensemble plots_.
There are a number of variations; a few are introduced below.
### Similar Plots with Different Variables and Shared Encodings
One way to show three continuous variables is with two plots that
share an axis:
```{r, fig.height = 3, fig.width = 8, class.source = "fold-hide"}
library(patchwork)
pe_thm <- theme_minimal() +
theme(panel.background = element_rect(color = "black",
size = 0.5,
fill = NA))
p1 <- ggplot(gm2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10() +
guides(color = "none") +
pe_thm
p2 <- ggplot(gm2007, aes(x = pop / 10 ^ 6, y = lifeExp, color = continent)) +
geom_point() +
scale_x_log10() +
pe_thm +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p1 + p2
```
Life expectancy is mapped to the vertical position in both plots.
Continent is mapped to color in both plots.
Guides can be shared when encodings are shared.
### Different Plots with Shared Encodings
Multiple views of the data are often helpful.
Sharing encodings makes the relations between views easier to perceive.
```{r, fig.height = 3, fig.width = 8, class.source = "fold-hide"}
p3 <- ggplot(gm2007,
aes(x = pop / 10 ^ 6,
y = reorder(continent, pop, FUN = sum),
fill = continent)) +
geom_col() +
ylab(NULL) +
pe_thm
p1 + p3
```
### Small Multiples
_Small multiples_ refers to a collection of plots with identical
structure showing different subsets of the data and organized in a
useful way.
These plot collections are also called _trellis plots_, _lattice
plots_, or _faceted plots_.
A plot of life expectancy against income per capita in 2007 faceted by
continent:
```{r, fig.width = 8, class.source = "fold-hide"}
gd <- filter(gapminder, year %in% c(1977, 1987, 1997, 2007))
gd2007 <- filter(gapminder, year == 2007)
fct_thm <- theme_minimal() +
theme(panel.background = element_rect(color = "black",
size = 0.5,
fill = NA))
ggplot(gd2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(size = 2.5) +
scale_x_log10() +
facet_wrap(~ continent) +
fct_thm
```
A useful variation is to show a muted view of the full data in the
background:
```{r, fig.width = 8, class.source = "fold-hide"}
ggplot(gd2007, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(data = mutate(gd2007, continent = NULL), color = "grey80") +
geom_point(size = 2.5) +
scale_x_log10() +
facet_wrap(~ continent) +
fct_thm
```
Data can be facetet on two variables.
This plot shows the full data
faceted by both continent and a set of years:
```{r, fig.width = 7, fig.height = 7, class.source = "fold-hide"}
ggplot(gd, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(size = 2.5) +
scale_x_log10() +
facet_grid(continent ~ year) +
fct_thm
```
Adding muted data for each year helps regonizing where each
continent group fits within a year
```{r, fig.width = 7, fig.height = 7, class.source = "fold-hide"}
ggplot(gd, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(data = mutate(gd, continent = NULL), color = "grey80") +
geom_point(size = 2.5) +
scale_x_log10() +
facet_grid(continent ~ year) +
fct_thm
```
## Reading
Section [_Perception and Data
Visualization_](https://socviz.co/lookatdata.html#perception-and-data-visualization)
in [_Data Visualization_](https://socviz.co/).
Chapter [_Data visualization
principles_](https://rafalab.github.io/dsbook/data-visualization-principles.html)
in [_Introduction to Data Science Data Analysis and Prediction
Algorithms with R_](https://rafalab.github.io/dsbook).
## Exercises
1. Which of the following channels are magnitude channels and which
are identity channels?
a. Position on a common scale
b. Length
c. Color hue (red, green, etc.)
d. Symbol shape (dot, cross, etc.)
2. Consider the following visualizations of the 2017 Michelin star
data:
```{r, include = FALSE}
plot.new()
michelin <- read.table(WLNK("michelin.dat"),
head = TRUE)
michelin <- mutate(michelin,
stars = one + 2 * two + 3 * three,
## Get strwidth from base graphics.
## Not right but may be OK for wuick use.
strwidth = 500 * strwidth(country, "inches"),
country = reorder(country, stars))
```
```{r, echo = FALSE}
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
p1 <- ggplot(michelin, aes(x = 0, y = country, label = country)) +
geom_text(hjust = 0, size = 4) +
geom_tile(aes(x = strwidth + stars / 2,
width = stars),
height = 0.5,
fill = "deepskyblue") +
xlim(c(0, 1000)) +
theme_minimal() +
theme(text = element_text(size = 14),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
ylab(NULL) +
xlab(NULL) +
ggtitle("Plot A")
p2 <- ggplot(michelin, aes(x = stars, y = country, label = country)) +
geom_tile(aes(x = stars / 2,
width = stars),
height = 0.5,
fill = "deepskyblue") +
geom_text(hjust = 0, size = 4, nudge_x = 30) +
xlim(c(0, 1000)) +
theme_minimal() +
theme(text = element_text(size = 14),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
ylab(NULL) +
xlab(NULL) +
ggtitle("Plot B")
p1 + p2
```
Which plot makes it easier to compare the numbers of stars for
different countries? Explain your conclusion by identifying the
channels used and their relative strengths.
3. Identify the items, attributes, marks, channels, and mappings use
in the following plot:
```{r, echo = FALSE}
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
mutate(mpg, cyl = factor(cyl)) %>%
group_by(cyl, year) %>%
summarize(hwy = mean(hwy), cty = mean(cty), Count = n()) %>%
ungroup() %>%
mutate(cyl = factor(cyl), year = factor(year)) %>%
ggplot(aes(x = year, y = hwy, size = Count, color = cyl)) +
geom_point() +
scale_size_area(max_size = 15) +
theme_minimal() + scale_color_viridis_d()
```