--- title: "Perception and Visualization" output: html_document: toc: yes code_folding: show code_download: true --- ```{r setup, include = FALSE} source(here::here("setup.R")) knitr::opts_chunk$set(collapse = TRUE, message = FALSE, fig.height = 5, fig.width = 6, fig.align = "center") library(lattice) library(tidyverse) library(gridExtra) library(scales) set.seed(12345) ``` ## Some River Flow Data ```{r, echo = TRUE} river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat") plot(river) ``` ## A Simple Model of Visual Perception The eyes acquire an image, which is processed through three stages of memory: * Iconic memory; * Working memory, or short-term memory; * Long-term memory. ### Iconic Memory The first processing stage of an image happens in iconic memory. * Images remain in iconic memory for less than a second. * Processing in iconic memory is massively parallel and automatic. * This is called _preattentive processing_. Preattentive processing is a fast recognition process. ### Working Memory Meaningful visual chunks are moved from iconic memory to short term memory. These chunks are used by conscious, or attentive, processing. Attentive processing often involves conscious comparisons or search. Short term memory is limited: * information is retained for only a few seconds; * only three or fours chunks can be held at a time. Chunks can be of varying size; a coherent pattern can form a single chunk even if it is quite large. If more chunks are needed or chunks are needed longer they need to be reacquired or retrieved from long term memory. ### Long Term Memory Long term visual memory is built up over a lifetime, though infrequently used visual chunks may become lost. Chunks processed repeatedly in working memory may be transferred to long term memory. Common patterns and contextual information can be retrieved from long term memory for attentive processing in working memory. ### Visual Design Implications Try to make as much use of preattentive features as possible. Recognize when preattentive features might mislead. For features that require attentive processing keep in mind that working memory is limited. ## Some Examples of Challenges ### Context Matters Which of the inner circles is larger, or are they the same size? ```{r, echo = FALSE} circle <- local({ phi <- seq(0, 2 * pi, length.out = 100) x <- cos(phi) y <- sin(phi) function(cx = 0, cy = 0, r = 1) list(x = r * x + cx, y = r * y + cy) }) setup <- function(w = 5) plot(0, type = "n", asp = 1, xlim = c(-w, w), ylim = c(-w, w), axes = FALSE, xlab = "", ylab = "", mar = c(0, 0, 0, 0)) setup() polygon(circle(-2, 0, 3)) polygon(circle(-2, 0), col = "black") polygon(circle(3, 0, 1.3)) polygon(circle(3, 0), col = "black") ``` ```{r, echo = FALSE} setup(10) x1 <- -4 x2 <- 6 polygon(circle(x1, 0)) for (a in (2 * pi * (1 : 6) / 6)) polygon(circle(4 * cos(a) + x1, 4 * sin(a), 2)) polygon(circle(x2, 0)) for (a in (2 * pi * (1 : 10) / 10)) polygon(circle(1.5 * cos(a) + x2, 1.5 * sin(a), 0.5)) ``` Which of the lines is longer, or are they the same length? ```{r, echo = FALSE} setup() polygon(circle(-4, -2)) polygon(circle(4, -2)) polygon(circle(-2, 2)) polygon(circle(2, 2)) segments(-3, -2, 3, -2, lwd = 3) segments(-3, +2, 3, +2, lwd = 3) width <- 3 ah <- function(x, y, size = 1, dir = 1) { ax <- c(-size * dir, 0, -size * dir) ay <- c(-size, 0, size) lines(x + ax, y + ay, lwd = width) } ``` ```{r, echo = FALSE} setup() segments(-3, -2, 3, -2, lwd = width) ah(3, -2) ah(-3, -2, dir = -1) segments(-3, +2, 3, +2, lwd = width) ah(3, 2, dir = -1) ah(-3, 2) ``` The sine Illusion: which of the bars are longer, or are they the same length? ```{r, class.source = "fold-hide"} x <- seq(0, 5 * pi, length.out = 100) w <- 0.5 plot(x, sin(x), ylim = c(-1, 1 + w), type = "n") segments(x0 = x, y0 = sin(x), y1 = sin(x) + w, lwd = 3) ``` Which of the squares A and B is darker, or are they the same shade? ```{r echo = FALSE} knitr::include_graphics(IMG("chess1.png")) ``` ```{r echo = FALSE} knitr::include_graphics(IMG("chess2.png")) ``` ### Some Optical Illusions [R implementations](http://rpubs.com/kohske/R-de-illusion) of some optical illusions by Kohske Takahashi: Are these lines parallel? ```{r, echo = FALSE} library(grid) rs <- expand.grid(x = seq(0, 1, 1 / 10), y = seq(0, 1, 1 / 10)) grid.rect(rs$x, rs$y, 1 / 10 / 2, 1 / 10 / 2, gp = gpar(fill = "black", col = NA)) grid.rect(rs$x + 1 / 10 / 4, rs$y + 1 / 10 / 2, 1 / 10 / 2, 1 / 10 / 2, gp = gpar(fill = "black", col = NA)) ls <- expand.grid(x = 0 : 1, y = seq(0, 1, 1 / 20) - 1 / 20 / 2) grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2), gp = gpar(col = "grey50", lwd = 1)) ``` Again, are these lines parallel? ```{r, echo = FALSE} grid.newpage() n <- 10; ny <- 8; L <- 0.01 c <- seq(0, 1, length = n); d <- 1.2 * diff(c)[1] / 2 col <- c("black", "white") x <- c(c - d, c, c + d, c) y <- rep(c(0, -d, 0, d), each = n) w <- c(c - d, c - d + L, c + d, c + d - L) z <- c(0, L, 0, -L) ys <- seq(0, 1, length = ny) grid.rect(gp = gpar(fill = gray(0.5), col = NA)) plyr::l_ply(1 : ny, function(i) { if (i %% 2 == 0) { co <- rev(col) z <- -z } else co <- col grid.polygon(x, y + ys[i], id = rep(1 : n, 4), gp = gpar(fill = co, col = NA)) grid.polygon(w, rep(z, each = n) + ys[i], id = rep(1 : n, 4), gp = gpar(fill = rev(co), col = NA)) }) ``` [Scintillating grid illusion](https://psychology.fandom.com/wiki/Grid_illusion#Scintillating_grid_illusion): Black dots at the intersections appear and disappear; are they real? ```{r, echo = FALSE} nx <- 6; ny <- 6; lwd <- 10; cr <- 1.2 / 100 grid.newpage() ## nolint start ## xlim <- ylim <- c(0.05, 0.95) ## pushViewport(viewport( # like plot.window() ## x=0.5, y=0.5, # a centered viewport ## width=unit(min(1,diff(xlim)/diff(ylim)), "snpc"), # aspect ratio preserved ## height=unit(min(1,diff(ylim)/diff(xlim)), "snpc"), ## xscale=xlim, # cf. xlim ## yscale=ylim # cf. ylim ##)) ## nolint end grid.rect(0.5, 0.5, 1, 1, gp = gpar(fill = "black")) ls <- expand.grid(x = 0:1, y = seq(0, 1, 1 / nx / 2) - 1 / nx / 2 / 2) grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2), gp = gpar(col = "grey", lwd = lwd)) ls <- expand.grid(y = 0 : 1, x = seq(0, 1, 1 / ny / 2) - 1 / ny / 2 / 2) grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2), gp = gpar(col = "grey", lwd = lwd)) ls <- expand.grid(x = seq(0, 1, 1 / nx / 2) - 1 / nx / 2 / 2, y = seq(0, 1, 1 / ny / 2) - 1 / ny / 2 / 2) grid.circle(ls$x, ls$y, r = cr, gp = gpar(col = NA, fill = "white")) ``` [Hermann grid illusion](https://psychology.fandom.com/wiki/Grid_illusion#Hermann_grid_illusion): "Ghost-like" grey blobs at the intersections of white lines. ```{r, echo = FALSE} grid.newpage() grid.rect(0.5, 0.5, 1, 1, gp = gpar(fill = "black")) gp <- gpar(col = "white", lwd = 20) ls <- expand.grid(x = 0 : 1, y = seq(0, 1, len = nx)) grid.polyline(ls$x, ls$y, id = gl(nrow(ls) / 2, 2), gp = gp) grid.polyline(ls$y, ls$x, id = gl(nrow(ls) / 2, 2), gp = gp) ``` Grid illusion in a cartogram: ```{r, echo = FALSE, message = FALSE, warning = FALSE, out.width = "60%"} library(ggtext) library(statebins) library(hrbrthemes) library(tidyverse) ## jsonlite::fromJSON( ## "https://static01.nyt.com/newsgraphics/2020/06/02/testing-dashboard/36a7f9f71fb6271078df301b8012185016c851cf/states.json" ## ) -> cvdf cvdf <- jsonlite::fromJSON(WLNK("nyt-states-2020-06-02.json")) cvdf |> unnest(data) |> mutate(date = as.Date(date)) |> filter(date == max(date)) |> mutate(test_pct = tests_new / est_tests_needed) |> select(postal, test_pct) |> mutate( status = case_when( test_pct > 1.2 ~ "above", test_pct < 0.8 ~ "below", TRUE ~ "near" ) ) |> ggplot() + geom_statebins( aes(state = postal, fill = status) ) + coord_fixed() + scale_fill_manual( name = NULL, values = c( "near" = "#febf82", "below" = "#c7615b", "above" = "#bacfa2" ) ) + labs( title = "Is Your State Doing Enough Coronavirus Testing?", subtitle = "**12** states meet the testing target;
**5** are near the target; **34** are below the target;", caption = "Data source: NYTimes " ) + theme_ipsum_es(grid = "", plot_title_size = 15, caption_size = 7) + theme( plot.title = element_text(hjust = 0.5), plot.subtitle = element_markdown(family = font_es, face = "plain", hjust = 0.5, size = 10) ) + theme(axis.text.x = element_blank(), axis.text.y = element_blank()) + theme(legend.position = "none", plot.margin = unit(c(1, 1, 1, 1.5), "cm")) ``` A case where there are more dots than we can see at once: ```{r echo = FALSE, out.width = "40%"} knitr::include_graphics(IMG("disappearing-dots.jpeg")) ``` Red from black and white: ```{r echo = FALSE, out.width = "50%"} knitr::include_graphics(IMG("red_from_black_and_white.jpeg")) ``` Some illusions in pictures: ```{r echo = FALSE, out.width = "50%", fig.cap = "A man running into the snow"} knitr::include_graphics(IMG("man_in_snow.jpeg")) ``` ```{r echo = FALSE, out.width = "50%", fig.cap = "Once you see Cookie Monster, you can’t unsee it"} knitr::include_graphics(IMG("cookie-monster.jpeg")) ``` A large collection of optical illusions is available at . ### Motion ```{r chunk-label, fig.width = 10, fig.show='animate', ffmpeg.format='gif', dev='jpeg', interval = 0.1, class.source = "fold-hide"} n <- 50 x <- 2 * (1 : n) y <- rep(2, n) lim <- c(min(x) + 0.1 * (max(x) - min(x)), max(x) - 0.1 * (max(x) - min(x))) v <- TRUE for (i in 1 : 2) { plot(x + v, y, xlim = lim) v <- ! v Sys.sleep(0.1) } ``` ```{r, include = FALSE} gif_file <- "3dscatter_percep.gif" if (! file.exists(IMG(gif_file))) { library(rgl) d <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000)) points3d(d) par3d(FOV = 1) ## removes perspective distortion movie3d( movie = sub("\\..*", "", gif_file), spin3d(axis = c(0, 0, 1), rpm = 20), duration = 3, type = "gif", dir = IMG(""), clean = TRUE ) ## re-save so it loops indefinitely img <- magick::image_read(IMG(gif_file)) magick::image_write(img, path = IMG(gif_file)) } ``` ```{r, echo = FALSE, out.width = "75%"} knitr::include_graphics(IMG("3dscatter_percep.gif")) ``` Images contain no information about the direction of the rotation. Your mind picks a direction. ```{r, include = FALSE} knitr::knit_hooks$set(webgl = rgl::hook_webgl) options(rgl.useNULL = TRUE) ``` ```{r, webgl = TRUE, class.source = "fold-hide"} library(rgl) d <- data.frame(x = rnorm(1000), y = rnorm(1000), z = rnorm(1000)) points3d(d) par3d(FOV = 1) ## removes perspective distortion if (interactive()) play3d(spin3d(axis = c(0, 0, 1), rpm = 30), duration = 20) ``` Interactive rotation can help. Depth cueing and perspective can also help. ### Popout and Distractors Where is the red dot: ```{r, echo = FALSE} thm <- theme_classic() thm <- list(theme_void(), theme(panel.border = element_rect(fill = NA, color = "grey20")), guides(color = "none", shape = "none"), scale_shape_manual(values = c(A = 16, B = 15)), coord_equal(), xlim(c(0, 1)), ylim(c(0, 1))) pts <- function(n, version, sz = 3) { idx <- seq_len(n) if (version == 1) { v1 <- ifelse(idx == 1, "A", "B") v2 <- "A" } else if (version == 2) { v1 <- "A" v2 <- ifelse(idx == 1, "A", "B") } else if (version == 3) { v1 <- ifelse(idx == 1 | idx > (n / 2), "A", "B") v2 <- ifelse(idx <= (n / 2), "A", "B") } d <- data.frame(x = runif(n), y = runif(n), v1 = v1, v2 = v2) ggplot(d, aes(x, y, color = v1, shape = v2)) + geom_point(size = sz) + thm } p1 <- pts(10, 1) p2 <- pts(100, 1) p3 <- pts(10, 2) p4 <- pts(100, 2) p5 <- pts(10, 3) p6 <- pts(100, 3) ``` ```{r, echo = FALSE, dpi = 300, out.width = "60%"} grid.arrange(p1, p2, nrow = 1) ``` Shape only: ```{r, echo = FALSE, dpi = 300, out.width = "60%"} grid.arrange(p3, p4, nrow = 1) ``` Shapes and colors: ```{r, echo = FALSE, dpi = 300, out.width = "60%"} grid.arrange(p5, p6, nrow = 1) ``` ## Items, Attributes, Marks, and Channels To evaluate or design a visualization it is useful to have some terms for the components. Several schools have developed different but similar sets of terms. Some References: * Bertin, J. (1967), _The Semiology of Graphics_. * Cleveland, W. S. (1988), _The Elements of Graphing Data_. * Cleveland, W. S. (1993), _Visualizing Data_. * Few, S. (2012), _Show Me the Numbers: Designing Tables and Graphs to Enlighten_, 2nd ed. * Munzner, T. (2014), _Visualization Analysis and Design_ * Ware, C. (2012), _Information Visualization: Perception for Design_, 3rd ed. * Wilkinson, L. (2005), _The Grammar of Graphics_, 2nd ed. Munzner uses the terminology of _items_, _attributes_, _links_, _marks_, and _channels_: * _Items_ are the basic units on which data is collected: the entities represented by the rows in a _tidy_ data frame. * _Attributes_ are the numerical or categorical features of the data items we want to represent; the variables in a tidy data frame. * _Links_ are relations among items: e.g. months within a year, or countries within a continent. * _Marks_ are the geometric entities used to represent items: points, lines, areas. These correspond to the simple `geom` objects in `ggplot`. * _Visual channels_ are features of marks that can be used to reflect values of attributes. Channels correspond approximately to aesthetics in `ggplot` but are more focused on the visual aspect: * An `x` aesthetic that is transformed to polar coordinates is a different channel than an `x` aesthetic representing a position on a standard linear scale. Channels are used to _encode_ attributes (_aesthetic mappings_). A single attribute can be encoded in several channels. * This makes the attribute easier to perceive. * But it also uses up available channels. Some channels are well suited to encode quantitative or ordered values; they are _quantitatively perceived_. Others are only suited for nominal values. A useful classification, adapted from Few(2012): | Type | Channel | Quantitatively Perceived? | |:---------|:------------|:--------------------------- | | Form | Length | Yes | | | Width | Yes, but limited | | | Orientation | No | | | Size | Yes, but limited | | | Shape | No | | Color | Hue | No | | | Intensity | Yes, but limited | | Position | 2D position | Yes | Munzner uses the terms _magnitude channels_ and _identity channels_. Ideally, * numeric and ordered attributes should be represented by quantitatively perceived channels; * unordered categorical attributes should be represented by non-quantitative channels. A useful principle: The most important attributes should be mapped to the most effective channels. ## Channel Effectiveness Some questions about channels: * What kind and how much information can a channel encode? * Are some channels better than others? * Can channels be used independently, or do they interfere? Some criteria for evaluating channels: * Accuracy: How well can a viewer decode the information in the channel? * Discriminability: How easily can differences between attribute levels be perceived? * Separability: Can channels be used independently or is there interference? * Popout: can a channel provide _popout_ where a difference is perceived preattentively? * Grouping: can a channel show perceptual grouping of items? ### Channel Accuracy Stevens (1957) argues that accuracy of magnitude channels can be described by a power law: $$ \text{perceived sensation} = (\text{physical intensity})^\gamma $$ Experiments by Stevens suggest these values for some visual channels: ```{r, echo = FALSE, fig.height = 4} n <- 100 x <- seq(0, 5, len = n) gamma <- c(0.5, 0.67, 0.7, 1, 1.7) chan <- c("Brightness", "Depth", "Area", "Length", "Staturation") label <- sprintf("%s (%.2f)", chan, gamma) nc <- length(gamma) d <- data.frame(Intensity = rep(x, nc), gamma = rep(gamma, each = n), label = rep(label, each = n)) p <- ggplot(d, aes(x = Intensity, y = Intensity ^ gamma)) + geom_line(aes(group = gamma, color = label), linewidth = 1.5) + ylim(c(0, 5)) + labs(x = "Phisical Intensity", y = "Perceived Sensation") + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) suppressWarnings(print(p)) ``` [Others have raised concerns](https://en.wikipedia.org/wiki/Stevens%27s_power_law) about the validity of these findings. ```{r, echo = FALSE, eval = FALSE} ## Color wheel for hue: n <- 72 pie(rep(1, n), col = hsv((1 : n) / n, 1, 1), border = NA, labels = NA) ``` ```{r, echo = FALSE, eval = FALSE} ## Value and saturation plot(0, type = "n", asp = 1, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, xlab = "value/lightness", ylab = "saturation", mar = c(0, 0, 0, 0)) n <- 32 x <- seq(0, 1, length.out = n)[-n] hue <- rgb2hsv(col2rgb("blue"))[["h", 1]] delx <- mean(diff(x)) dely <- 1 / n for (i in 1 : n) rect(x, dely * (i - 1), x + delx, dely * i, col = hsv(hue, dely * i, x + delx), border = NA) ``` ```{r, echo = FALSE, eval = FALSE} ## Another approach opar <- par(mfrow = c(2, 1)) n <- 64 x <- seq(0, 1, length.out = n)[-n] hue <- rgb2hsv(col2rgb("blue"))[["h", 1]] delx <- mean(diff(x)) plot(0, type = "n", xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, xlab = "", ylab = "", mar = c(0, 0, 0, 0)) rect(x, 0.2, x + delx, 0.7, col = hsv(hue, x, 1), border = NA) title("Saturation") plot(0, type = "n", xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, xlab = "", ylab = "", mar = c(0, 0, 0, 0)) rect(x, 0.2, x + delx, 0.7, col = hsv(hue, 1, x), border = NA) title("Value") par(opar) ``` Another approach has used controlled experiments to assess accuracy of various channels used in visualizations: * William S. Cleveland and Robert McGill (1984), "Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods," _Journal of the American Statistical Association_ 79, 531–554. * William S. Cleveland and Robert McGill (1987), "Graphical Perception: The Visual Decoding of Quantitative Information on Graphical Displays of Data" _Journal of the Royal Statistical Society. Series A_, 192-229. * Jeffrey Heer and Michael Bostock (2010) "Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design," _Proceedings of the SIGCHI_, 203-212. Munzner's ordering by accuracy: ```{r, echo = FALSE} knitr::include_graphics(IMG("tmmchannels.jpeg")) ``` This ordering is sometimes referred to as a _perceptual ladder_. Line width is another channel; not sure there is agreement on its accuracy, but it is not high. ### Discriminability Many channels, in particular identity channels, can only support a limited number of discriminable levels. * Line width is one of the most limited with perhaps 3 levels. * Using more than 5 or 6 color hues is not recommended. * Similarly, using more than 5 or 6 symbol shapes can create difficulties. If the number of levels that can be represented by a channel is smaller than the number of attribute levels then some form of meaningful aggregation is needed. ### Separability Some encodings can be used independently of each other; others interfere with each other to some degree. ```{r, echo = FALSE, out.width = "55%"} knitr::include_graphics(IMG("TMfig5.10.jpg")) ``` * Vertical an horizontal position can be used independently. * Color (hue) and position can be used independently * Size and hue interfere somewhat; hue is harder to perceive on smaller objects. * Width and height do not function well independently; the result is perceived primarily as shape. * Encoding two different values in the red and green channels as a hue does not work at all. ### Popout Many channels support _visual popout_: having one item or a few items immediately stand out from the others. * Color (hue and intensity) do this well. * Shape and size can also be used effectively to create popout. Annotation can also be used to create popout. ### Grouping Perceptual grouping can be achieved in several ways: * Using an identity channel to to represent items as a group. * Using _link marks_. * By _enclosure_. * By spatial proximity. ```{r, echo = FALSE} d1 <- data.frame(x = c(1, 2, 3, 4, 1, 2, 3, 4), y = c(1, 2, 1, 1.5, 6, 5.5, 6, 5), v = rep(c("A", "B"), each = 4)) p1 <- ggplot(d1, aes(x, y, color = v)) + geom_point(size = 4) + theme_void() + guides(color = "none") + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` ```{r, echo = FALSE} d1 <- data.frame(x = c(1, 2, 3, 4, 1, 2, 3, 4), y = c(1, 2, 1, 1.5, 6, 5.5, 6, 5), v = rep(c("A", "B"), each = 4)) p2 <- ggplot(d1, aes(x, y, color = v)) + geom_line(aes(group = v), color = "black") + geom_point(size = 4) + theme_void() + guides(color = "none") + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` ```{r, echo = FALSE} d3 <- data.frame(x = c(1, 1, 3, 3), y = c(1, -1, 1, -1), s = c(5, 10, 10, 5)) d3$i <- factor(seq_along(d3$x) %% 2) p3 <- ggplot(d3) + geom_rect(aes(xmin = 0.6, ymin = -1.5, xmax = 1.5, ymax = 1.5), fill = "gray") + geom_point(aes(x = x, y = y, size = s, color = i)) + scale_size_area(max_size = 10) + guides(size = "none", color = "none") + theme_void() + xlim(c(0, 4)) + ylim(c(-2, 2)) + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` ```{r, echo = FALSE} d4 <- data.frame(x = c(1, 2, 3, 10, 10, 12), y = c(1, -1, 1, 6, 8, 6), s = c(5, 10, 10, 5, 5, 10)) d4$i <- factor(seq_along(d4$x) %% 2) p4 <- ggplot(d4) + geom_point(aes(x = x, y = y, size = s, color = i)) + scale_size_area(max_size = 15) + guides(size = "none", color = "none") + theme_void() + ylim(c(-2, 9)) + xlim(c(0, 13)) + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` ```{r, echo = FALSE} grid.arrange(p1, p2, p3, p4, nrow = 2) ``` ## Experimental Evidence ### Cleveland-McGill The 1984 paper is available from [JSTOR](https://www.jstor.org/stable/2288400). The paper formulates a theory for ranking _Elementary Perceptual Tasks_; these correspond to channel mappings. Some orderings were addressed by informal experiments (obvious to the authors at least). Others were assessed by formal experiments with about 50 subjects. Experiments focused on accuracy of decoding, though this is not viewed as the primary purpose of a graph: > One must be careful not to fall into a conceptual trap of adopting > accuracy as a criterion. ... The power of a graph is its ability to > enable one to take in the quantitative information, organize it, and > see patterns and structure not readily revealed by other means of > studying the data. Their premise: > A graphical form that involves elementary perceptual tasks that lead > to more accurate judgments than another graphical form (with the > same quantitative information) will result in better organization > and increase the chances of a correct perception of patterns and > behavior." The tasks: For each setting * Identify which of two marked items is smaller. * Estimate the percentage the smaller is of the larger. ![](`r IMG("CM1.jpg")`) ![](`r IMG("CM2.png")`) Results: Percent large errors: ![](`r IMG("CMLE.jpg")`) Absolute error: ![](`r IMG("cleveland-results.png")`) ### Heer and Bostock [Heer and Bostock (2010)](http://idl.cs.washington.edu/papers/crowdsourcing-graphical-perception/) set out to replicate the Cleveland McGill experiment using crowd sourcing via [Amazon Mechanical Turk](https://www.mturk.com/) They used the five position stimuli from Cleveland and McGill and some new ones: ![](`r IMG("heer-bostock-tasks.png")`) 50 subjects were recruited for each task. Results were consistent with Cleveland-McGill results: ![](`r IMG("HB1.jpeg")`) Use of Mechanical Turk was deemed a success. ### Pie Chart Experiments [Pie charts](https://eagereyes.org/pie-charts) are popular but somewhat controversial. * Pie charts are inferior for comparisons to bar charts. * Pie charts are quite good at representing part-whole relationships. * Cleveland and McGill suggested pie charts are read by angle. * [Kosara and Skau](https://eagereyes.org/papers/a-pair-of-pie-chart-papers) report experiments that suggest this is not the case. * If it were, [_donut charts_](https://datavizcatalogue.com/methods/donut_chart.html) would be even less effective, but they seem to be very comparable. * [Kosara's blog](https://eagereyes.org) provides a [review](https://eagereyes.org/blog/2016/an-illustrated-tour-of-the-pie-chart-study-results) of other pie chart studies. ## Improving Some Common Charts Cleveland and McGill set out to suggest improvements to some common charts. This is a selection of their examples. ### Dot Charts Cleveland and McGill use their perceptual ladder to argue strongly for using dot charts in place of bar charts and pie charts. ### Playfair's Balance of Trade plots Playfair presented a number of plots showing imports and exports between England and other nations. A primary goal was to show the balance of trade, the difference between exports and imports: ![](`r IMG("Playfair_Exports_Imports.jpg")`) Assessing the differences from a plot showing exports and imports as separate curves requires length judgments, which are less accurate than comparisons to a common stale. ```{r, include = FALSE} # nolint start if (! file.exists("playfair-balance-of-trade-data.tsv")) download.file("http://www.stat.uiowa.edu/~luke/data/playfair-balance-of-trade-data.tsv", "playfair-balance-of-trade-data.tsv") # nolint end ``` ```{r, echo = FALSE} pimex <- read.delim("playfair-balance-of-trade-data.tsv") pp1 <- mutate(pimex, difference = NULL) |> pivot_longer(-year, names_to = "which", values_to = "trade") |> ggplot(aes(x = year, y = trade, group = which, color = which)) + geom_line() + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) pp1 ``` Plotting the difference makes the balance of trade much easier to assess: ```{r, echo = FALSE} pp2 <- ggplot(pimex) + geom_line(aes(x = year, y = imports - exports)) + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) pp2 ``` An _ensemble plot_ showing both views may also help. ```{r, echo = FALSE, fig.height = 6} library(patchwork) pp1 / pp2 ``` ### Framed Unaligned Bars It is difficult to compare lengths of unaligned rectangles when the lengths are close. ```{r, echo = FALSE} h1 <- 7.4 h2 <- 6.7 hb <- 10 off <- 3 p1 <- ggplot() + geom_rect(aes(xmin = 0, ymin = 0, xmax = 1, ymax = h1), fill = muted("blue")) + geom_rect(aes(xmin = 2, ymin = off, xmax = 3, ymax = off + h2), fill = muted("blue")) + theme_void() + ylim(c(-1, off + hb + 1)) + xlim(c(-2, 5)) p1 ``` Adding a frame moves the task up the perceptual ladder to an unaligned comparison against a common scale. ```{r, echo = FALSE} p1 + geom_rect(aes(xmin = 0, ymin = 0, xmax = 1, ymax = hb), fill = NA, color = "black") + geom_rect(aes(xmin = 2, ymin = off, xmax = 3, ymax = off + hb), color = "black", fill = NA) ``` Comparing to a common scale is still the most effective approach: ```{r, echo = FALSE} ggplot() + geom_rect(aes(xmin = 0, ymin = 0, xmax = 1, ymax = h1), fill = muted("blue")) + geom_rect(aes(xmin = 2, ymin = 0, xmax = 3, ymax = h2), fill = muted("blue")) + theme_void() + ylim(c(-1, off + hb + 1)) + xlim(c(-2, 5)) ``` But this does suggest that using unaligned framed rectangles to encode a _third_ variable, with position encoding the two primary variables may be effective. ### Framed Rectangle Maps A _choropleth map_ is a common way to depict a quantitative variable in a geographic context. Shading is quite low on the perceptual ladder. Cleveland and McGill suggest the use of framed rectangles positioned on the map as an alternative. This does not seem to have caught on so far, though you do sometimes see the use of other glyphs, such as pie charts. ```{r, echo = FALSE, out.width = "90%"} knitr::include_graphics(IMG("CMmap.jpg")) ``` ## Analyzing a Design Graph layout involves several levels: * Primary data representation * items, attributes * marks, channels * Supporting features * trend lines, reference lines, annotations * axes, legends A useful structure for describing the primary features: * What are the data items? * What are the attributes? * What marks are used? * What channels are used? * Which attribute is matched to channel 1 * Which attribute is matched to channel 2 * ... Useful questions: * Are the most important attributes mapped to the strongest channels? * Do the mappings do a good job of conveying the primary message? * If not, can the graph be improved by adjusting the mappings? ### A Gapminder Plot One of the frames of a plot from the [GapMinder site](https://www.gapminder.org): ![](`r IMG("gapminder2.png")`) * Items are countries (in a particular year) * Attributes in the plot: * life expectancy * income per person * population * continent * country name * year * Marks: points (or bubbles). * Channels and mappings: * horizontal position, mapped to log income * vertical position, mapped to life expectancy * area, mapped to population * color (hue), mapped to continent * interactive: mouse-over, mapped to country name * interactive: time (or frame), mapped to year The basic plot can be created with `ggplot` and aesthetic mappings: ```{r gapminder-basic, eval = FALSE} library(gapminder) gm2007 <- filter(gapminder, year == 2007) ggplot(gm2007) + geom_point(aes(x = gdpPercap, y = lifeExp, size = pop, color = continent)) + scale_x_log10() + ylim(c(20, 85)) ``` ```{r gapminder-basic, echo = FALSE} ``` The data in the `gapminder` package differ somewhat from the data used by the Gapminder site, but overall the plot designs are very close. With some adjustments the basic plot can be brought close to the Gapminder version in appearance: ```{r gapminder-full, eval = FALSE} gm2007 <- filter(gapminder, year == 2007) |> arrange(desc(pop)) ## sort to avoid over-plotting ggplot(gm2007) + geom_point(aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent), shape = 21) + ## to allow the using `fill` aesthetic scale_x_log10(labels = comma) + ylim(c(20, 85)) + scale_size_area(max_size = 20, labels = comma, breaks = c(0.25 * 10 ^ 9, 0.5 * 10 ^ 9, 10 ^ 9)) + scale_fill_manual(values = c(Africa = "deepskyblue", Asia = "red", Americas = "green", Europe = "gold", Oceania = "brown")) + labs(x = "Income", y = "Life expectancy") + theme(text = element_text(size = 16)) + guides(fill = guide_legend(title = "Continent", override.aes = list(size = 5), order = 1), size = guide_legend(title = "Population", label.hjust = 1, order = 2)) + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` ```{r gapminder-full, echo = FALSE, fig.height = 6, fig.width = 8} ``` Some notes: * Using larger bubbles makes the plot more engaging. * Using larger bubbles makes differences in population easier to assess, but makes the strength of the relationship between life expectancy and income harder to assess. * Using larger bubbles also increases the risk of over-plotting. Sorting the rows so larger bubble are drawn first helps reduce the risk somewhat. These plots show a fairly strong marginal association between life expectancy and income. There does not seem to be a strong association of population with the other two variables, but these plots are not ideal for that assessment. To judge the marginal association between life expectancy and population size we can change the channel mapping: * map the horizontal axis to population; * map area to income. ```{r gapminder-lifepop, eval = FALSE} ggplot(filter(gapminder, year == 2007)) + geom_point(aes(x = pop, y = lifeExp, size = gdpPercap)) + scale_x_log10() + ylim(c(20, 85)) + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` ```{r gapminder-lifepop, echo = FALSE} ``` This confirms that there is very little association between life expectancy and population. The association between life expectancy and income is still visible, but is easier to assess when these two variables are mapped to 2D position. ### Michelin Stars This image, from a [blog post](http://createhtml5map.com/interactive-map-blog/interactive-bubble-chart-countries-receive-the-most-michelin-stars-each-year/), shows the total number of stars for different countries: ![](`r IMG("michelin-stars.png")`) * Items: Countries (in a particular year) * Attributes: * country name * number of stars * Marks: bubbles, text * Channels and mappings: * bubble color, mapped to country * bubble area, mapped to number of stars * text, mapped to country name (where possible) * text, mapped to number of stars (where possible) Observations: * None of the channels are very strong. * The strongest channels, 2D position, are not used. * The number of colors used is too high. A simple _dot plot_ would convey the distribution better. A dot plot using [2017 data](michelin.dat) from an [article](https://www.telegraph.co.uk/travel/maps-and-graphics/map-michelin-star-restaurants-countries-with-the-most/) in The Telegraph: ```{r} michelin <- read.table(WLNK("michelin.dat"), head = TRUE) michelin <- mutate(michelin, stars = one + 2 * two + 3 * three, country = reorder(country, stars)) ggplot(michelin, aes(x = stars, y = country)) + geom_point() + labs(x = "Stars", y = NULL) + theme_minimal() + theme(text = element_text(size = 16)) + theme(panel.border = element_rect(fill = NA, color = "grey20")) ``` Even if a bubble plot is desired for aesthetic reasons, position could be used to * group countries by continent; * show countries on a map. A plot like the original can be constructed by * computing locations for a set of packed spheres with specified radii; * using `geom_point` and `geom_text`. ```{r, fig.width = 8, class.source = "fold-hide"} ## create randomly located packed spheres with specified areas library(packcircles) set.seed(54321) packing <- circleProgressiveLayout(michelin$stars) ## merge circle locations with starts data mcirc <- bind_cols(packing, michelin) |> mutate(country = factor(country)) ## clean out stray attributes ## compute some colors to use nr <- nrow(mcirc) pal <- colorRampPalette(RColorBrewer::brewer.pal(9, "Set1")) cols <- sample(pal(nr), nr) ## these need tuning for the screen or R markdown csize <- 45 tsize <- 2.5 ## the basic plot ## uses shape = 19 and `color` aesthetic p <- ggplot(mcirc, aes(x = x, y = y)) + geom_point(aes(size = radius ^ 2, color = country), shape = 19) + scale_size_area(max_size = csize) + geom_text(aes(label = paste(country, stars, sep = "\n")), data = filter(mcirc, stars >= 120), size = tsize) + coord_fixed() ## adjustments p + guides(size = "none") + scale_color_manual(values = cols) + xlim(with(mcirc, range(x) + c(-1, 1) * max(radius))) + ylim(with(mcirc, range(y) + c(-1, 1) * max(radius))) + theme_void() ``` A somewhat more robust approach * creates vertex data for polygon approximations to the circles; * merges the stars data with the polygon data; * uses `geom_polygon` and `geom_text`. This is very similar to the way simple _choropleth maps_ are created. ```{r, fig.width = 8, class.source = "fold-hide"} ## compute polygon approximations to spheres mcircpoly <- circleLayoutVertices(mcirc, idcol = "country", npoints = 100) |> rename(country = id) ## merge the stars data into the polygon data sttab <- select(michelin, country, stars) mcircpoly <- left_join(mcircpoly, sttab, "country") ## create the plot ggplot(mcircpoly, aes(x = x, y = y)) + geom_polygon(aes(fill = country)) + geom_text(aes(label = paste(country, stars, sep = "\n")), data = filter(mcirc, stars >= 120), size = tsize) + coord_fixed() + scale_fill_manual(values = cols) + theme_void() ``` ## Aspect Ratio and Perception The river flow data shows how important aspect ratio can be to our ability to detect patterns: ```{r, echo = FALSE} library(gridExtra) river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat") r <- data.frame(flow = river, month = seq_along(river)) p0 <- ggplot(r, aes(x = month, y = flow)) + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) p1 <- p0 + geom_point() grid.arrange(p1 + coord_fixed(ratio = 35), p1 + coord_fixed(ratio = 4), heights = c(3, 1)) ``` Using a line plot the basic periodicity becomes apparent even in the first aspect ratio. ```{r, class.source = "fold-hide"} p2 <- p0 + geom_line() p2 + coord_fixed(ratio = 35) ``` But the steeper increase/shallower decrease of most periods is easier to see in the second aspect ratio: ```{r, fig.height = 2, class.source = "fold-hide"} p2 + coord_fixed(ratio = 4) ``` The aspect ratio also influences interpretation of results. Some alternative views of (suspect) data on the number of people on government assistance over a time period: ```{r, message = FALSE, fig.width = 8, class.source = "fold-hide"} library(readr) w <- read_csv(WLNK("hw2-welfare.csv")) w <- mutate(w, onAssistance = onAssistance / 10 ^ 6) w <- mutate(w, date = seq(as.Date("2009-01-01"), by = "quarter", length.out = 10)) p0 <- ggplot(w, aes(x = date, y = onAssistance)) + theme_minimal() + theme(panel.border = element_rect(fill = NA, color = "grey20")) p1 <- p0 + geom_line(aes(group = 1)) grid.arrange(p1, p1 + coord_fixed(ratio = 8), p1 + ylim(0, max(w$onAssistance)), p0 + geom_col(width = 40), ncol = 2) ``` Some notes: * Automated choices of axis scaling can affect the aspect ratio of the content of a plot. * A zero base line supports ratio comparisons. * The preattentive response to bar charts is always to compare ratios, so using a zero base line is important. * Using a non-zero base line for line plots and scatter plots encourages interval, or difference, comparisons. * Research on the effect of aspact ratio on perception has focused on accuracy of slope comparisons. * The general message is that keeping away from slopes that are too steep or too shallow is best. * _Banking to 45 degrees_, or choosing an aspect ratio so the slope magnitudes are distributed around 45 degrees is often recommended. * This also tends to be a useful "neutral ground" when political implications are involved. Some references: * A [blog post](https://eagereyes.org/basics/banking-45-degrees) by Robert Kosara. * William S. Cleveland, Marylyn E. McGill and Robert McGill (1988), "The Shape Parameter of a Two-Variable Graph", _Journal of the American Statistical Association_ ([JSTOR](http://www.jstor.org/stable/2288843?seq=1#page_scan_tab_contents)) * Justin Talbot, John Gerth, Pat Hanrahan (2012), "An Empirical Model of Slope Ratio Comparisons", _IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis)_ ([PDF](http://vis.stanford.edu/files/2012-SlopeComparison-InfoVis.pdf)) ## Ensemble Plots and Faceting Using multiple channels allows a single plot to show a lot of information. But over-plotting and interference can become problems. One alternative is to use several related views in a useful arrangement. Such arrangements are sometimes called _ensemble plots_. There are a number of variations; a few are introduced below. ### Similar Plots with Different Variables and Shared Encodings One way to show three continuous variables is with two plots that share an axis: ```{r, fig.height = 3, fig.width = 8, class.source = "fold-hide"} library(patchwork) pe_thm <- theme_minimal() + theme(panel.background = element_rect(color = "black", linewidth = 0.5, fill = NA)) p1 <- ggplot(gm2007, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point() + scale_x_log10() + guides(color = "none") + pe_thm p2 <- ggplot(gm2007, aes(x = pop / 10 ^ 6, y = lifeExp, color = continent)) + geom_point() + scale_x_log10() + pe_thm + theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank()) p1 + p2 ``` Life expectancy is mapped to the vertical position in both plots. Continent is mapped to color in both plots. Guides can be shared when encodings are shared. ### Different Plots with Shared Encodings Multiple views of the data are often helpful. Sharing encodings makes the relations between views easier to perceive. ```{r, fig.height = 3, fig.width = 8, class.source = "fold-hide"} p3 <- ggplot(gm2007, aes(x = pop / 10 ^ 6, y = reorder(continent, pop, FUN = sum), fill = continent)) + geom_col() + ylab(NULL) + pe_thm p1 + p3 ``` ### Small Multiples _Small multiples_ refers to a collection of plots with identical structure showing different subsets of the data and organized in a useful way. These plot collections are also called _trellis plots_, _lattice plots_, or _faceted plots_. A plot of life expectancy against income per capita in 2007 faceted by continent: ```{r, fig.width = 8, class.source = "fold-hide"} gd <- filter(gapminder, year %in% c(1977, 1987, 1997, 2007)) gd2007 <- filter(gapminder, year == 2007) fct_thm <- theme_minimal() + theme(panel.background = element_rect(color = "black", linewidth = 0.5, fill = NA)) ggplot(gd2007, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point(size = 2.5) + scale_x_log10() + facet_wrap(~ continent) + fct_thm ``` A useful variation is to show a muted view of the full data in the background: ```{r, fig.width = 8, class.source = "fold-hide"} ggplot(gd2007, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point(data = mutate(gd2007, continent = NULL), color = "grey80") + geom_point(size = 2.5) + scale_x_log10() + facet_wrap(~ continent) + fct_thm ``` Data can be facetet on two variables. This plot shows the full data faceted by both continent and a set of years: ```{r, fig.width = 7, fig.height = 7, class.source = "fold-hide"} ggplot(gd, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point(size = 2.5) + scale_x_log10() + facet_grid(continent ~ year) + fct_thm ``` Adding muted data for each year helps regonizing where each continent group fits within a year ```{r, fig.width = 7, fig.height = 7, class.source = "fold-hide"} ggplot(gd, aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point(data = mutate(gd, continent = NULL), color = "grey80") + geom_point(size = 2.5) + scale_x_log10() + facet_grid(continent ~ year) + fct_thm ``` A [recent example]( https://www.ft.com/content/29fd9b5c-2f35-41bf-9d4c-994db4e12998) from the Financial Times: ```{r echo = FALSE, out.width = "70%"} knitr::include_graphics(IMG("FT-gender-politics.jpeg")) ``` ## Reading Section [_Perception and Data Visualization_](https://socviz.co/lookatdata.html#perception-and-data-visualization) in [_Data Visualization_](https://socviz.co/). Chapter [_Data visualization principles_](https://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/dataviz-principles.html) in [_Introduction to Data Science: Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook-part-1/). ## Exercises 1. Which of the following channels are magnitude channels and which are identity channels? a. Position on a common scale b. Length c. Color hue (red, green, etc.) d. Symbol shape (dot, cross, etc.) 2. Consider the following visualizations of the 2017 Michelin star data: ```{r, include = FALSE} plot.new() michelin <- read.table(WLNK("michelin.dat"), head = TRUE) michelin <- mutate(michelin, stars = one + 2 * two + 3 * three, ## Get strwidth from base graphics. ## Not right but may be OK for wuick use. strwidth = 500 * strwidth(country, "inches"), country = reorder(country, stars)) ``` ```{r, echo = FALSE} library(dplyr, warn.conflicts = FALSE) library(ggplot2) p1 <- ggplot(michelin, aes(x = 0, y = country, label = country)) + geom_text(hjust = 0, size = 4) + geom_tile(aes(x = strwidth + stars / 2, width = stars), height = 0.5, fill = "deepskyblue") + xlim(c(0, 1000)) + theme_minimal() + theme(text = element_text(size = 14), axis.text.y = element_blank(), axis.ticks.y = element_blank()) + ylab(NULL) + xlab(NULL) + ggtitle("Plot A") p2 <- ggplot(michelin, aes(x = stars, y = country, label = country)) + geom_tile(aes(x = stars / 2, width = stars), height = 0.5, fill = "deepskyblue") + geom_text(hjust = 0, size = 4, nudge_x = 30) + xlim(c(0, 1000)) + theme_minimal() + theme(text = element_text(size = 14), axis.text.y = element_blank(), axis.ticks.y = element_blank()) + ylab(NULL) + xlab(NULL) + ggtitle("Plot B") p1 + p2 ``` Which plot makes it easier to compare the numbers of stars for different countries? Explain your conclusion by identifying the channels used and their relative strengths. 3. Identify the items, attributes, marks, channels, and mappings use in the following plot: ```{r, echo = FALSE} library(dplyr, warn.conflicts = FALSE) library(ggplot2) mutate(mpg, cyl = factor(cyl)) |> group_by(cyl, year) |> summarize(hwy = mean(hwy), cty = mean(cty), Count = n()) |> ungroup() |> mutate(cyl = factor(cyl), year = factor(year)) |> ggplot(aes(x = year, y = hwy, size = Count, color = cyl)) + geom_point() + scale_size_area(max_size = 15) + theme_minimal() + scale_color_viridis_d() ```