---
title: "The Grammar of Graphics"
output:
  html_document:
    toc: yes
    code_folding: show
    code_download: true
---

```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
set.seed(12345)
```

## Background

The _Grammar of Graphics_ is a language proposed by Leland Wilkinson for describing statistical graphs.

> Wilkinson, L. (2005), _The Grammar of Graphics_, 2nd ed., Springer.

The grammar of graphics has served as the foundation for the graphics frameworks in [SPSS](https://www.ibm.com/products/spss-statistics), [Vega-Lite](https://vega.github.io/vega-lite/) and several other systems.

`ggplot2` represents an implementation and extension of the grammar of graphics for R.

> Wickham, H. (2016), _ggplot2: Elegant Graphics for Data Analysis_,
> 2nd ed., Springer. [3rd ed. in progress](https://ggplot2-book.org/).

> Online documentation: .

> Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (2023),
> [_R for Data Science (2nd Edition)_](https://r4ds.hadley.nz/),
> O'Reilly.

> [Data visualization cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf)

> Winston Chang (2018), [_R Graphics Cookbook_, 2nd
> edition](https://r-graphics.org/), O’Reilly. ([Book source on
> GitHub](https://github.com/wch/rgcookbook))

The idea is that any basic plot can be built out of a combination of

* a data set;
* one or more geometrical representations (_geoms_);
* mappings of values to _aesthetic_ features of the geom;
* a _stat_ to produce values to be mapped;
* position adjustments;
* a coordinate system;
* a scale specification;
* a faceting scheme.

`ggplot2` provides tools for specifying these components and adjusting their features.
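
As a rough sketch of how the components listed above fit together, the code below spells out each one explicitly for a scatter plot of the `mpg` data and compares the result with a short specification. The particular choices (`stat = "identity"`, `coord_cartesian()`, and so on) are assumed here to match what `ggplot2` fills in on its own, and the names `p_full` and `p_short` are only for illustration.

```r
library(ggplot2)

## Every component of the grammar written out explicitly.
p_full <- ggplot(data = mpg) +                       # data set
    geom_point(mapping = aes(x = displ, y = hwy),    # geom and mappings
               stat = "identity",                    # stat
               position = "identity") +              # position adjustment
    coord_cartesian() +                              # coordinate system
    scale_x_continuous() +                           # scale specification
    scale_y_continuous() +
    facet_null()                                     # faceting scheme

## A short specification of the same plot.
p_short <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy))

## Compare the data computed for the (single) layer of each plot.
all.equal(layer_data(p_full), layer_data(p_short))
```
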
Many components and features are provided by default and do not need to be specified explicitly unless the defaults are to be changed.

## A Basic Template

The simplest graph needs a data set, a geom, and a mapping:

```r
ggplot(data = <DATA>) +
    <GEOM>(mapping = aes(<MAPPINGS>))
```

The appearance of geom objects is controlled by _aesthetic_ features.

Each geom has some required and some optional aesthetics.

For `geom_point` the required aesthetics are

* `x` position
* `y` position.

Optional aesthetics include

* `color`
* `fill`
* `shape`
* `size`

`geom_point` is used to produce a _scatter plot_.

## Scatter Plots Using `geom_point`

The `mpg` data set included in the `ggplot2` package includes EPA fuel economy data from 1999 to 2008 for 38 popular models of cars.

```{r}
mpg
```

```{r, include = FALSE}
fig_align <- if (using_xaringan) "left" else "center"
```

A simple scatter plot:

```{r mpg-plain, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy))
```

```{r mpg-plain, echo = FALSE, fig.width = 5.75, fig.align = fig_align}
```

Map color to vehicle class:

```{r mpg-color, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy, color = class))
```

```{r mpg-color, echo = FALSE, fig.width = 7, fig.align = fig_align}
```

And map shape to number of cylinders:

```{r mpg-color-shape, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy, color = class, shape = factor(cyl)))
```

```{r mpg-color-shape, echo = FALSE, fig.width = 7, fig.align = fig_align}
```

Perception:

* Too many colors;
* shapes are too small;
* interference between shapes and colors.

Aesthetics can be mapped to a variable or set to a fixed common value.
This can be used to override default settings:

```{r mpg-fixed, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy), color = "blue", shape = 1)
```

```{r mpg-fixed, echo = FALSE, fig.width = 7, fig.align = fig_align}
```

Changing the `size` aesthetic makes shapes easier to recognize:

```{r mpg-color-shape-large, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy, color = class, shape = factor(cyl)),
               size = 3)
```

```{r mpg-color-shape-large, echo = FALSE, fig.width = 7, fig.align = fig_align}
```

Perception: Still too many colors; still have interference.

Available point shapes are specified by number:

```{r, echo = FALSE, eval = FALSE}
generateRPointShapes <- function() {
    oldPar <- par()
    par(font = 2, mar = c(0.5, 0, 0, 0))
    y <- rev(c(rep(1, 6), rep(2, 5), rep(3, 5), rep(4, 5), rep(5, 5)))
    x <- c(rep(1 : 5, 5), 6)
    plot(x, y, pch = 0 : 25, cex = 1.5, ylim = c(1, 5.5), xlim = c(1, 6.5),
         axes = FALSE, xlab = "", ylab = "", bg = "blue")
    text(x, y, labels = 0 : 25, pos = 3)
    par(mar = oldPar$mar, font = oldPar$font)
}
generateRPointShapes()
```

```{r, echo = FALSE}
ggplot(NULL, aes(x = rep(1 : 5, 5), y = rev(rep(1 : 5, each = 5)))) +
    geom_point(shape = 1 : 25, size = 5, fill = "blue") +
    geom_text(aes(label = 1 : 25), nudge_y = 0.25, size = 6) +
    theme_void()
```

Shapes 1-20 have their color set by the `color` aesthetic and ignore the `fill` aesthetic.

For shapes 21-25 the `color` aesthetic specifies the border color and `fill` specifies the interior color.

Using `shape` 21 with `cyl` mapped to the `fill` aesthetic:

```{r mpg-fill-21, eval = FALSE}
ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl), shape = 21, size = 4)
```

```{r mpg-fill-21, echo = FALSE}
```

Perception: Borders, larger symbols, fewer colors help.

Specifying a new default is very different from specifying a constant value as an aesthetic.


Constant aesthetic: Rarely what you want:

```{r mpg-bad-color, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy, color = "blue"))
```

```{r mpg-bad-color, echo = FALSE, fig.height = 4.2}
```

Default: Probably what you want:

```{r mpg-good-color, eval = FALSE}
ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy), color = "blue")
```

```{r mpg-good-color, echo = FALSE, fig.height = 4.2}
```

## Geometric Objects

`ggplot2` provides a number of geoms:

```{r, echo = FALSE, results = "asis"}
showList <- function(v, ncol = 4, pad = 2) {
    w <- max(nchar(v)) + pad
    nrow <- ceiling(length(v) / ncol)
    v <- c(v, character(ncol * nrow - length(v)))
    cat("```r\n")
    for (i in seq_len(nrow)) {
        line <- v[ncol * (i - 1) + (1 : ncol)]
        for (j in 1 : ncol)
            if (j < ncol)
                cat(sprintf("%-*s", w, line[j]))
            else
                cat(sprintf("%s\n", line[j]))
        ## cat(sprintf("%-*s%-*s%-*s%s\n",
        ##             w, line[1], w, line[2], w, line[3], line[4]))
    }
    cat("```\n")
}
showList(ls("package:ggplot2", pat = "^geom_"))
```

Additional geoms are available in packages like `ggforce`, `ggridges`, and others described on the [`ggplot2` extensions site](https://exts.ggplot2.tidyverse.org/).

Geoms can be added as _layers_ to a plot.

Mappings common to all, or most, geoms can be specified in the `ggplot` call:

```{r mpg-smooth, eval = FALSE}
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_smooth() +
    geom_point()
```

```{r mpg-smooth, echo = FALSE, message = FALSE}
```

Geoms can also use different data sets.
One way to highlight Europe in a plot of life expectancy against log income for 2007 is to start with a plot of the full data:

```{r gm_2007, eval = FALSE}
library(dplyr)
library(gapminder)
gm_2007 <- filter(gapminder, year == 2007)
(p <- ggplot(gm_2007, aes(x = gdpPercap, y = lifeExp)) +
     geom_point() +
     scale_x_log10())
```

```{r gm_2007, echo = FALSE}
```

Then add a layer showing only Europe:

```{r gm_2007_eu, eval = FALSE}
gm_2007_eu <- filter(gm_2007, continent == "Europe")
p + geom_point(data = gm_2007_eu, color = "red", size = 3)
```

```{r gm_2007_eu, echo = FALSE}
```

## Statistical Transformations

All geoms use a statistical transformation (_stat_) to convert raw data to the values to be mapped to the object's features.

The available stats are

```{r, echo = FALSE, results = "asis"}
showList(ls("package:ggplot2", pat = "^stat_"), ncol = 3)
```

Each geom has a default stat, and each stat has a default geom.

* For `geom_point` the default stat is `stat_identity`.
* For `geom_bar` the default stat is `stat_count`.
* For `geom_histogram` the default stat is `stat_bin`.

Stats can provide _computed variables_ that can be mapped to aesthetic features.

For `stat_bin` some of the computed variables are

* `count`: number of points in bin
* `density`: density of points in bin, scaled to integrate to 1

The `density` variable can be accessed as `after_stat(density)`.

Older approaches that also work but are now discouraged:

* `stat(density)`
* `..density..`

By default, `geom_histogram` uses `y = after_stat(count)`.

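
One way to look at these computed variables directly, sketched here on the assumption that `layer_data()` returns the data frame produced by a layer's stat together with the columns added by the geom, is to extract them from a plot object. The names `h` and `d` are only for illustration.

```r
library(ggplot2)

h <- ggplot(faithful) +
    geom_histogram(aes(x = eruptions), binwidth = 0.25)

## Data computed by stat_bin for the first (and only) layer.
d <- layer_data(h)
head(d[, c("x", "xmin", "xmax", "count", "density")])

## The counts add up to the number of observations ...
sum(d$count) == nrow(faithful)

## ... and the density values integrate to (approximately) one.
all.equal(sum(d$density * (d$xmax - d$xmin)), 1)
```

The histogram itself, with the default count scale:
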

```{r geyser-count, eval = FALSE}
ggplot(faithful) +
    geom_histogram(aes(x = eruptions),
                   binwidth = 0.25,
                   fill = "grey", color = "black")
```

```{r geyser-count, echo = FALSE}
```

Explicitly specifying `y = after_stat(count)` produces the same plot:

```{r geyser-count-exp, eval = FALSE}
ggplot(faithful) +
    geom_histogram(aes(x = eruptions, y = after_stat(count)),
                   binwidth = 0.25,
                   fill = "grey", color = "black")
```

```{r geyser-count-exp, echo = FALSE}
```

Using `y = after_stat(density)` produces a density scaled axis.

```{r geyser-dentity, eval = FALSE}
(p <- ggplot(faithful) +
     geom_histogram(aes(x = eruptions, y = after_stat(density)),
                    binwidth = 0.25,
                    fill = "grey", color = "black"))
```

```{r geyser-dentity, echo = FALSE}
```

`stat_function` can be used to add a density curve specified as a mixture of two normal densities:

```{r}
(ms <- mutate(faithful, type = ifelse(eruptions < 3, "short", "long")) |>
     group_by(type) |>
     summarize(mean = mean(eruptions),
               sd = sd(eruptions),
               n = n()) |>
     mutate(p = n / sum(n)))
```

```{r geyser-hist-dens, eval = FALSE}
f <- function(x)
    ms$p[1] * dnorm(x, ms$mean[1], ms$sd[1]) +
        ms$p[2] * dnorm(x, ms$mean[2], ms$sd[2])
p + stat_function(fun = f, color = "red")
```

```{r geyser-hist-dens, echo = FALSE}
```

## Position Adjustments

The available position adjustments:

```{r, echo = FALSE, results = "asis"}
showList(ls("package:ggplot2", pat = "^position_"), ncol = 3)
```

A bar chart showing the counts for the different `cut` categories in the `diamonds` data:

```{r diamonds-cut, eval = FALSE}
ggplot(diamonds, aes(x = cut)) +
    geom_bar()
```

```{r diamonds-cut, echo = FALSE}
```

Mapping `clarity` to `fill` shows the breakdown by both `cut` and `clarity` in a _stacked bar chart_:

```{r diamonds-stack1, eval = FALSE}
ggplot(diamonds, aes(x = cut, fill = clarity)) +
    geom_bar()
```

```{r diamonds-stack1, echo = FALSE}
```

The default `position` for bar charts is `position_stack`:

```{r diamonds-stack2, eval = FALSE}
ggplot(diamonds, aes(x = cut, fill = clarity)) +
    geom_bar(position = "stack")
```

```{r diamonds-stack2, echo = FALSE}
```

`position_dodge` produces _side-by-side bar charts_:

```{r diamonds-dodge, eval = FALSE}
ggplot(diamonds, aes(x = cut, fill = clarity)) +
    geom_bar(position = "dodge")
```

```{r diamonds-dodge, echo = FALSE}
```

`position_fill` rescales all bars to be equal height to help compare proportions within bars.

```{r diamonds-fill, eval = FALSE}
ggplot(diamonds, aes(x = cut, fill = clarity)) +
    geom_bar(position = "fill")
```

```{r diamonds-fill, echo = FALSE}
```

Using the counts to scale the widths would produce a _spine plot_, a variant of a _mosaic plot_. This is easiest to do with the `ggmosaic` package.

`position_jitter` can be used with `geom_point` to avoid overplotting or break up rounding artifacts.

Another version of the Old Faithful data available as `geyser` in package `MASS` has some rounding in the `duration` variable:

```{r geyser2, eval = FALSE}
data(geyser, package = "MASS")
## Adjust for different meaning of `waiting` variable
geyser2 <- na.omit(mutate(geyser, duration = lag(duration)))
p <- ggplot(geyser2, aes(x = duration, y = waiting))
p + geom_point()
```

```{r geyser2, echo = FALSE}
```

_Jittering_ can help break up the distracting _heaping_ of values on durations of 2 and 4 minutes.

The default amount of jittering isn't quite enough in this case:

```{r geyser2-jit, eval = FALSE}
p + geom_point(position = "jitter")
```

```{r geyser2-jit, echo = FALSE}
```

To jitter only horizontally and by a larger amount you can use

```{r geyser2-jit2, eval = FALSE}
p + geom_point(position = position_jitter(height = 0, width = 0.1))
```

```{r geyser2-jit2, echo = FALSE}
```

## Coordinate Systems

Coordinate system functions include

```{r, echo = FALSE, results = "asis"}
showList(ls("package:ggplot2", pat = "^coord_"))
```

The default coordinate system is `coord_cartesian`.


### Cartesian Coordinates

`coord_cartesian` can be used to _zoom in_ on a particular region:

```{r geyser2-zoom, eval = FALSE}
p + geom_point() + coord_cartesian(xlim = c(3, 4))
```

```{r geyser2-zoom, echo = FALSE}
```

`coord_fixed` and `coord_equal` fix the _aspect ratio_ for a cartesian coordinate system.

The aspect ratio is the ratio of the number of physical display units per `y` unit to the number of physical display units per `x` unit.

The aspect ratio can be important for recognizing features and patterns.

```{r}
river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat")
r <- data.frame(flow = river, month = seq_along(river))
```

```{r river-flat, eval = FALSE}
ggplot(r, aes(x = month, y = flow)) +
    geom_point() +
    coord_fixed(ratio = 4)
```

```{r river-flat, echo = FALSE, fig.height = 2, fig.width = 8}
```

### Polar Coordinates

A filled bar chart

```{r diamonds-fill-1, eval = FALSE}
(p <- ggplot(diamonds) +
     geom_bar(aes(x = 1, fill = cut), position = "fill"))
```

```{r diamonds-fill-1, echo = FALSE}
```

is turned into a pie chart by changing to polar coordinates:

```{r diamonds-pie, eval = FALSE}
p + coord_polar(theta = "y")
```

```{r diamonds-pie, echo = FALSE}
```

### Coordinate Systems for Maps

Coordinate systems are particularly important for maps.

Polygons for many political and geographic boundaries are available through the `map_data` function.

Boundaries for the lower 48 US states can be obtained as

```{r}
usa <- map_data("state")
```

Polygon vertices are encoded by longitude and latitude.

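
A quick look at the structure of these data may help in reading the map code. This is only a sketch, and it assumes the `maps` package, which `map_data` relies on, is installed.

```r
library(ggplot2)

usa <- map_data("state")

## One row per polygon vertex: `long` and `lat` give the position,
## `group` identifies the polygon, and `order` the order in which
## the vertices are connected.
head(usa)

## Number of distinct polygons and of named regions.
length(unique(usa$group))
length(unique(usa$region))
```
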
Plotting these in the default cartesian coordinate system usually does not work well:

```{r usa-cart, eval = FALSE}
usa <- map_data("state")
m <- ggplot(usa, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "white", color = "black")
m
```

```{r usa-cart, echo = FALSE}
```

Using a fixed aspect ratio is better, but an aspect ratio of 1 does not work well:

```{r}
m + coord_equal()
```

The problem is that away from the equator a one degree change in latitude corresponds to a larger distance than a one degree change in longitude.

The ratio of one degree longitude separation to one degree latitude separation for the latitude at the middle of Iowa of 41 degrees is

```{r}
longlat <- cos(41 / 90 * pi / 2)
longlat
```

A better map is obtained using the aspect ratio `1 / longlat`:

```{r usa-fixed, eval = FALSE}
m + coord_fixed(1 / longlat)
```

```{r usa-fixed, echo = FALSE}
```

The best approach is to use a coordinate system designed specifically for maps.

There are many _projections_ used in map making.

The default projection used by `coord_map` is the [Mercator](https://en.wikipedia.org/wiki/Mercator_projection) projection.

```{r usa-mercator, eval = FALSE}
m + coord_map()
```

```{r usa-mercator, echo = FALSE}
```

Proper map projections are non-linear; this is easier to see with an Albers projection:

```{r usa-albers, eval = FALSE}
m + coord_map("albers", 20, 50)
```

```{r usa-albers, echo = FALSE}
```

## Scales

Scales are used for controlling the mapping of values to physical representations such as colors, shapes, and positions.

Scale functions are also responsible for producing _guides_ for translating physical representations back to values, such as

* axis labels and marks;
* color or shape legends.

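
As a small sketch of both roles, the scale below maps the three values of `drv` in the `mpg` data to hand-picked colors and also sets the title and labels of the legend that translates the colors back to values. The colors, labels, and the name `p_drv` are arbitrary choices made for this illustration.

```r
library(ggplot2)

p_drv <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
    geom_point() +
    scale_color_manual(name = "Drive train",               # legend title
                       breaks = c("4", "f", "r"),          # data values
                       values = c("4" = "black",           # value -> color
                                  f = "orange",
                                  r = "purple"),
                       labels = c("4-wheel", "front", "rear"))  # legend labels
p_drv

## The colors used for drawing are the ones supplied through `values`.
sort(unique(layer_data(p_drv)$colour))
```
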
There are currently `r length(ls("package:ggplot2", pat = "scale_"))` scale functions; some examples are

```r
scale_color_gradient  scale_shape_manual  scale_x_log10
scale_color_manual    scale_size_area     scale_y_log10
scale_fill_gradient                       scale_x_sqrt
scale_fill_manual                         scale_y_sqrt
```

An [experimental tool](https://ggplot2tor.com/scales/) to help choose scales has recently been introduced.

Start with a basic scatter plot:

```{r mpg-basic, eval = FALSE}
(p <- ggplot(mpg, aes(x = displ, y = hwy)) +
     geom_point())
```

```{r mpg-basic, echo = FALSE}
```

Remove the `x` tick marks and labels (this can also be done with theme settings):

```{r mpg-no-ticks-labs, eval = FALSE}
p + scale_x_continuous(labels = NULL, breaks = NULL)
```

```{r mpg-no-ticks-labs, echo = FALSE}
```

Change the tick locations and labels:

```{r mpg-new-ticks-labs, eval = FALSE}
p + scale_x_continuous(labels = paste(c(2, 4, 6), "ltr"),
                       breaks = c(2, 4, 6))
```

```{r mpg-new-ticks-labs, echo = FALSE}
```

Use a logarithmic axis:

```{r mpg-log-x, eval = FALSE}
p + scale_x_log10(labels = paste(c(2, 4, 6), "ltr"),
                  breaks = c(2, 4, 6),
                  minor_breaks = c(3, 5, 7))
```

```{r mpg-log-x, echo = FALSE}
```

The [Scales](https://r4ds.hadley.nz/communication.html#scales) section in [R for Data Science](https://r4ds.hadley.nz/) provides some more details.

Color assignment can also be controlled by scale functions.

For example, for some presidential approval ratings data

```{r, include = FALSE}
pr_appr <- data.frame(pres = c("Obama", "Carter", "Clinton",
                               "G.W. Bush", "Reagan", "G.H.W Bush", "Trump"),
                      appr = c(79, 78, 68, 65, 58, 56, 40),
                      party = c("D", "D", "D", "R", "R", "R", "R"),
                      year = c(2009, 1977, 1993, 2001, 1981, 1989, 2017))
pr_appr <- mutate(pr_appr, pres = reorder(pres, appr))
```

```{r}
pr_appr
```

the default color scale is not ideal:

```{r pr-appr0, eval = FALSE}
ggplot(pr_appr, aes(x = appr, y = pres, fill = party)) +
    geom_col()
```

```{r pr-appr0, echo = FALSE}
```

The common assignment of red for Republican and blue for Democrat can be obtained by

```{r pr-appr, eval = FALSE}
ggplot(pr_appr, aes(x = appr, y = pres, fill = party)) +
    geom_col() +
    scale_fill_manual(values = c(R = "red", D = "blue"))
```

```{r pr-appr, echo = FALSE}
```

A better choice is to use a well-designed [color palette](https://hclwizard.org/#color-palettes):

```{r pr-appr-2, eval = FALSE}
ggplot(pr_appr, aes(x = appr, y = pres, fill = party)) +
    geom_col() +
    colorspace::scale_fill_discrete_diverging(
                    palette = "Blue-Red 2")
```

```{r pr-appr-2, echo = FALSE}
```

## Facets

Faceting uses the _small multiples_ approach to introduce additional variables.

For a single variable `facet_wrap` is usually used:

```{r mpg-facet-wrap, eval = FALSE}
p <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy))
p + facet_wrap(~ class)
```

```{r mpg-facet-wrap, echo = FALSE}
```

For two variables, each with a modest number of categories, `facet_grid` can be effective:

```{r mpg-facet-grid, eval = FALSE}
p + facet_grid(factor(cyl) ~ drv)
```

```{r mpg-facet-grid, echo = FALSE}
```

To show common data in all facets make sure the data does not contain the faceting variable.

This was used to show muted views of the full data in faceted plots.

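
A minimal sketch of the idea using the `mpg` data: the background layer is given a copy of the data with the faceting variable `drv` removed, so its points should appear in every panel. The names `mpg_bg` and `p_muted` are only for illustration.

```r
library(ggplot2)

## Background data: the same rows, but without the faceting variable.
mpg_bg <- mpg[, setdiff(names(mpg), "drv")]

p_muted <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(data = mpg_bg, color = "grey80") +  # full data, muted
    geom_point() +                                  # data for each panel
    facet_wrap(~ drv)
p_muted

## The background layer repeats all rows once per panel (three panels);
## the foreground layer contains each row once.
nrow(layer_data(p_muted, 1)) == 3 * nrow(mpg)
nrow(layer_data(p_muted, 2)) == nrow(mpg)
```
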
A faceted plot of the `gapminder` data:

```{r gapminder-not-muted, eval = FALSE}
library(gapminder)
years_to_keep <- c(1977, 1987, 1997, 2007)
gd <- filter(gapminder, year %in% years_to_keep)
ggplot(gd, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point(size = 2.5) +
    scale_x_log10() +
    facet_wrap(~ year)
```

```{r gapminder-not-muted, echo = FALSE, fig.width = 8}
```

Add a muted version of the full data in the background of each panel:

```{r gapminder-muted, eval = FALSE}
library(gapminder)
years_to_keep <- c(1977, 1987, 1997, 2007)
gd <- filter(gapminder, year %in% years_to_keep)
gd_no_year <- mutate(gd, year = NULL)
ggplot(gd, aes(x = gdpPercap, y = lifeExp, color = continent)) +
    geom_point(data = gd_no_year, color = "grey80") +
    geom_point(size = 2.5) +
    scale_x_log10() +
    facet_wrap(~ year)
```

```{r gapminder-muted, echo = FALSE, fig.width = 8}
```

Usually facets use common axis scales, but one or both can be allowed to vary.

A useful approach for showing time series data with a good aspect ratio can be to split the data into facets for non-overlapping portions of the time axis.

```{r river-facet, eval = FALSE}
pd <- rep(paste(seq(1, by = 32, length.out = 4),
                seq(32, by = 32, length.out = 4),
                sep = " - "),
          each = 32)
rd <- data.frame(month = seq_along(river), flow = river, panel = pd)
ggplot(rd, aes(x = month, y = flow)) +
    geom_point() +
    facet_wrap(~ panel, scales = "free_x", ncol = 1)
```

```{r river-facet, echo = FALSE}
```

Facet arrangement can also be used to convey other information, such as geographic location.

The [`geofacet` package](https://hafen.github.io/geofacet/) allows facets to be placed in approximate locations of different geographic regions.
An example for data from US states:

```{r geofacet, eval = FALSE}
library(geofacet)
ggplot(state_unemp, aes(year, rate)) +
    geom_line() +
    facet_geo(~ state, grid = "us_state_grid2", label = "code") +
    scale_x_continuous(labels = function(x) paste0("'", substr(x, 3, 4))) +
    labs(title = "Seasonally Adjusted US Unemployment Rate 2000-2016",
         caption = "Data Source: bls.gov",
         x = "Year",
         y = "Unemployment Rate (%)") +
    theme(strip.text.x = element_text(size = 6),
          axis.text = element_text(size = 5))
```

```{r geofacet, echo = FALSE, message = FALSE}
```

Arrangement according to a calendar can also be useful.

## Themes

`ggplot2` supports the notion of _themes_ for adjusting non-data appearance aspects of a plot, such as

* plot titles
* axis and legend placement and titles
* background colors
* guide line placement

Theme elements can be customized in several ways:

* `theme()` can be used to adjust individual elements in a plot;
* `theme_set()` adjusts default settings for a session;
* pre-defined theme functions allow consistent style changes.

The [full documentation](https://ggplot2.tidyverse.org/reference/theme.html) of the `theme` function lists many customizable elements.

One simple example:

```{r theme-simple, eval = FALSE}
ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl), shape = 21, size = 3) +
    theme(legend.position = "top",
          axis.text = element_text(size = 12),
          axis.title = element_text(size = 14, face = "bold"))
```

```{r theme-simple, echo = FALSE}
```

Another example:

```{r theme-simple-2, eval = FALSE}
gthm <- theme(plot.background = element_rect(fill = "lightblue", color = NA),
              panel.background = element_rect(fill = "pink"))
p + gthm
```

```{r theme-simple-2, echo = FALSE}
```

Some alternate complete themes provided by `ggplot2` are

```r
theme_bw       theme_gray  theme_minimal  theme_void
theme_classic  theme_grey  theme_dark     theme_light
```

Some examples:

```{r alt-themes, eval = FALSE}
p_bw <- p + theme_bw() + ggtitle("BW")
p_classic <- p + theme_classic() + ggtitle("Classic")
p_min <- p + theme_minimal() + ggtitle("Minimal")
p_void <- p + theme_void() + ggtitle("Void")
library(patchwork)
(p_bw + p_classic) / (p_min + p_void)
```

```{r alt-themes, echo = FALSE}
```

The [`ggthemes`](http://www.rpubs.com/Mentors_Ubiqum/ggthemes_1) package provides some additional themes.

Some examples:

```{r ggthemes-examples, eval = FALSE}
library(ggthemes)
p_econ <- p + theme_economist() + ggtitle("Economist")
p_wsj <- p + theme_wsj() + ggtitle("WSJ")
p_tufte <- p + theme_tufte() + ggtitle("Tufte")
p_few <- p + theme_few() + ggtitle("Few")
(p_econ + p_wsj) / (p_tufte + p_few)
```

```{r ggthemes-examples, echo = FALSE}
```

`ggthemes` also provides `theme_map` that removes unnecessary elements from maps:

```{r}
m + coord_map() + theme_map()
```

The [Themes](https://r4ds.hadley.nz/communication.html#sec-themes) section in [R for Data Science](https://r4ds.hadley.nz/) provides some more details.

## A More Complete Template

```r
ggplot(data = <DATA>) +
    <GEOM>(mapping = aes(<MAPPINGS>),
           stat = <STAT>,
           position = <POSITION>) +
    < ... MORE GEOMS ... > +
    <COORDINATE_ADJUSTMENT> +
    <SCALE_ADJUSTMENT> +
    <FACETING> +
    <THEME_ADJUSTMENT>
```

## Labels and Annotations

A basic plot:

```{r mpg-ann, eval = FALSE}
p <- ggplot(mpg, aes(x = displ, y = hwy))
p1 <- p + geom_point(aes(color = factor(cyl)), size = 2.5)
p1
```

```{r mpg-ann, echo = FALSE}
```

Axis labels are based on the expressions given to `aes`.

This is convenient for exploration but usually not ideal for a report.

The `labs()` function can be used to change axis and legend labels:

```{r mpg-ann-labs, eval = FALSE}
p1 + labs(x = "Displacement (Liters)",
          y = "Highway Miles Per Gallon",
          color = "Cylinders")
```

```{r mpg-ann-labs, echo = FALSE}
```

The `labs()` function can also add a title, subtitle, and caption:

```{r mpg-ann-labs-2, eval = FALSE}
p2 <- p1 +
    labs(x = "Displacement (Liters)",
         y = "Highway Miles Per Gallon",
         color = "Cylinders",
         title = "Gas Mileage and Displacement",
         subtitle = paste("For models which had a new release every year",
                          "between 1999 and 2008"),
         caption = "Data Source: https://fueleconomy.gov/")
p2
```

```{r mpg-ann-labs-2, echo = FALSE}
```

Annotations can be used to provide popout that draws a viewer's attention to particular features.
The `annotate()` function is one option:

```{r mpg-ann-popout, eval = FALSE}
p2 +
    annotate("label", x = 2.8, y = 43, label = "Volkswagens") +
    annotate("rect", xmin = 1.7, xmax = 2.1, ymin = 40, ymax = 45,
             fill = NA, color = "black")
```

```{r mpg-ann-popout, echo = FALSE}
```

Often more convenient are some `geom_mark` objects provided by the `ggforce` package:

```{r mpg-ann-popout-2, eval = FALSE}
library(ggforce)
p2 +
    geom_mark_hull(aes(filter = class == "2seater"),
                   description = paste("2-Seaters have high displacement",
                                       "values, but also high fuel efficiency",
                                       "for their displacement.")) +
    geom_mark_rect(aes(filter = hwy > 40),
                   description = "These are Volkswagens") +
    geom_mark_circle(aes(filter = hwy == 12),
                     description = "Three pickups and an SUV.")
```

```{r mpg-ann-popout-2, echo = FALSE, fig.width = 7, fig.height = 5.5}
#| warning: false
```

These annotations can be customized in a number of ways.

## Arranging Plots

There are several tools available for assembling ensemble plots.

The [`patchwork`](https://patchwork.data-imaginist.com/) package is a good choice.

A simple example:

```{r mpg-patchwork, eval = FALSE}
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point()
p2 <- ggplot(mpg, aes(x = cyl, y = hwy, group = cyl)) +
    geom_boxplot()
p3 <- ggplot(mpg, aes(x = cyl)) +
    geom_bar()
library(patchwork)
(p1 + p2) / p3
```

```{r mpg-patchwork, echo = FALSE}
```

## Animation

The [`gganimate`](https://github.com/thomasp85/gganimate) package can be used to add animation to a `ggplot` graph.

Start with a plot `p` for all years in the `gapminder` data, with `year` in the background:

```{r}
p <- gapminder |>
    arrange(desc(pop)) |>
    ggplot(aes(x = gdpPercap, y = lifeExp)) +
    geom_text(aes(x = 5000, y = 55, label = as.character(year)),
              size = 50, color = "grey",
              hjust = "center", vjust = "center") +
    geom_point(aes(size = pop, fill = continent), shape = 21) +
    scale_x_log10(labels = scales::comma) +
    ylim(c(20, 85)) +
    scale_size_area(max_size = 20,
                    labels = scales::comma,
                    breaks = c(0.25 * 10 ^ 9, 0.5 * 10 ^ 9, 10 ^ 9)) +
    scale_fill_manual(values = c(Africa = "deepskyblue",
                                 Asia = "red",
                                 Americas = "green",
                                 Europe = "gold",
                                 Oceania = "brown")) +
    labs(x = "Income", y = "Life expectancy") +
    theme(text = element_text(size = 16)) +
    guides(fill = guide_legend(title = "Continent",
                               override.aes = list(size = 5),
                               order = 1),
           size = guide_legend(title = "Population",
                               label.hjust = 1,
                               order = 2)) +
    theme_minimal() +
    theme(panel.border = element_rect(fill = NA, color = "grey20"))
```

```{r gapminder-full, echo = FALSE, fig.height = 6, fig.width = 8}
p
```

A [GIF](https://simple.wikipedia.org/wiki/Graphics_Interchange_Format) animation:

```{r gapminder-anim, eval = FALSE}
library(gganimate)
animate(p + transition_states(
                year,
                transition_length = 2,
                state_length = 0))
```

```{r gapminder-anim, echo = FALSE, fig.height = 6, fig.width = 8}
```

A movie:

```{r gapminder-anim-movie, eval = FALSE}
animate(p + transition_states(
                year,
                transition_length = 2,
                state_length = 0,
                wrap = FALSE),
        renderer = ffmpeg_renderer())
```

```{r gapminder-anim-movie, echo = FALSE, fig.height = 6, fig.width = 8, out.width = "100%"}
```

## Interaction

### Plotly

The `ggplotly` function in the [`plotly` package](https://plotly.com/r/) can be used to add some interactive features to a plot created with `ggplot2`.

* In an R session a call to `ggplotly()` may open a browser window with the interactive plot.
* In an RStudio session the plot appears in the graphics panel.
* In an Rmarkdown document the interactive plot is embedded in the `html` file.

Another interactive plotting approach that can be used from R is described in an [Infoworld article](https://www.infoworld.com/article/3607068/plot-in-r-with-echarts4r.html).

A simple example using `ggplotly()`:

```{r mpg-plotly, eval = FALSE}
library(ggplot2)
library(plotly)
p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl),
               shape = 21, size = 3)
ggplotly(p)
```

```{r mpg-plotly, echo = FALSE, message = FALSE}
```

Adding a `text` aesthetic allows the tooltip display to be customized:

```{r mpg-plotly-2, eval = FALSE}
p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl,
                   text = paste(year, manufacturer, model)),
               shape = 21, size = 3)
ggplotly(p, tooltip = "text") |>
    style(hoverlabel = list(bgcolor = "white"))
```

```{r mpg-plotly-2, echo = FALSE, warning = FALSE, message = FALSE}
```

### Ggiraph

The [`ggiraph` package](https://davidgohel.github.io/ggiraph/) provides another approach.

```{r mpg-ggiraph, eval = FALSE}
library(ggplot2)
library(ggiraph)
p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point_interactive(
        aes(x = displ, y = hwy, fill = cyl,
            tooltip = paste(year, manufacturer, model)),
        shape = 21, size = 3)
girafe(ggobj = p)
```

```{r mpg-ggiraph, echo = FALSE}
```

### Grammar of Interactive Graphics

There have been several efforts to develop a grammar of interactive graphics, including [`ggvis`](https://ggvis.rstudio.com/) and [`animint`](https://tdhock.github.io/animint/); neither seems to be under active development at this time.

A promising approach is [Vega-Lite](https://vega.github.io/vega-lite/), with a Python interface [Altair](https://altair-viz.github.io/) and an R interface [altair](https://vegawidget.github.io/altair/) to the Python interface.

An example using the `altair` package:

```{r rubber-altair, eval = FALSE}
rub <- read.csv(here::here("rubber.csv"))
library(altair)
chartTH <- alt$Chart(rub)$
    mark_point()$
    encode(x = alt$X("H:Q", scale = alt$Scale(domain = range(rub$H))),
           y = alt$Y("T:Q", scale = alt$Scale(domain = range(rub$T))))
brush <- alt$selection_interval()
chartTH_brush <- chartTH$add_selection(brush)
chartTH_selection <-
    chartTH_brush$encode(color = alt$condition(brush,
                                               "Origin:N",
                                               alt$value("lightgray")))
chartAT <- chartTH_selection$
    encode(x = alt$X("T:Q", scale = alt$Scale(domain = range(rub$T))),
           y = alt$Y("A:Q", scale = alt$Scale(domain = range(rub$A))))
chartAT | chartTH_selection
```

The resulting linked plots:

```{r rubber-altair, echo = FALSE, error = TRUE, warning = FALSE}
```

## Notes

* A number of other [`ggplot` extensions](https://exts.ggplot2.tidyverse.org/) are available.

* A [blog post](https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535) explains how the [BBC Visual and Data Journalism](https://medium.com/bbc-visual-and-data-journalism) team creates their graphics. More details are provided in an [_R cook book_](https://bbc.github.io/rcookbook/).

* A [blog post](https://blog.revolutionanalytics.com/2016/07/data-journalism-with-r-at-538.html) describes the use of R and `ggplot` by [FiveThirtyEight](https://fivethirtyeight.com/). The `ggthemes` package includes `theme_fivethirtyeight` to emulate their style.

## Reading

Chapters [_Data visualization_](https://r4ds.hadley.nz/data-visualize.html) and [_Graphics for communication_](https://r4ds.hadley.nz/communication.html) in [_R for Data Science_](https://r4ds.hadley.nz/), O'Reilly.

Chapter [_Make a plot_](https://socviz.co/makeplot.html) in [_Data Visualization_](https://socviz.co/).

Chapter [_ggplot2_](https://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/ggplot2.html) in [_Introduction to Data Science: Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook-part-1/).

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](`r WLNK("tutorials/ggplot.Rmd")`).

You can run the tutorial with

```{r, eval = FALSE}
STAT4580::runTutorial("ggplot")
```

You can install the current version of the `STAT4580` package with

```{r, eval = FALSE}
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

## Exercises

1. In the following expression, which value of the `shape` aesthetic produces a plot with points represented as triangles outlined in black colored according to the number of cylinders?

    ```r
    library(ggplot2)
    ggplot(mpg, aes(x = displ, y = hwy, fill = factor(cyl))) +
        geom_point(size = 4, shape = ---)
    ```

    a. 15
    b. 17
    c. 21
    d. 24

2. It can sometimes be useful to plot text labels in a scatterplot instead of points. Consider the plot set up as

    ```r
    library(ggplot2)
    library(dplyr)
    data(gapminder, package = "gapminder")
    p <- filter(gapminder, year == 2007) |>
        group_by(continent) |>
        summarize(gdpPercap = mean(gdpPercap), lifeExp = mean(lifeExp)) |>
        ggplot(aes(x = gdpPercap, y = lifeExp))
    ```

    Which of the following produces a plot with continent names on white rectangles?

    a. `p + geom_text(aes(label = continent))`
    b. `p + geom_label(aes(label = continent))`
    c. `p + geom_label(label = continent)`
    d. `p + geom_text(text = continent)`

3. The following code plots a _kernel density estimate_ for the `eruptions` variable in the `faithful` data set:

    ```r
    library(ggplot2)
    ggplot(faithful, aes(x = eruptions)) +
        geom_density(bw = 0.1)
    ```

    Look at the help page for `geom_density`. Which of the following best describes what specifying a value for `bw` does:

    a. Changes the _kernel_ used to construct the estimate.
    b. Changes the _smoothing bandwidth_ to make the result more or
       less smooth.
    c. Changes the `stat` used to `stat_bw`.
    d. Has no effect on the result.

4. This code creates a map of Iowa counties.

    ```r
    library(ggplot2)
    p <- ggplot(map_data("county", "iowa"),
                aes(x = long, y = lat, group = group)) +
        geom_polygon(fill = "white", color = "black")
    ```

    Which of these produces a plot with an aspect ratio that best
    matches the map on
    [this page](https://en.wikipedia.org/w/index.php?title=List_of_counties_in_Iowa&oldid=1001171082)?

    a. `p + coord_fixed(0.5)`
    b. `p + coord_fixed(0.75)`
    c. `p + coord_fixed(1.35)`
    d. `p + coord_fixed(1.95)`

5. Consider the two plots created by this code (print the values of
   `p1` and `p2` to see the plots):

    ```r
    library(ggplot2)
    data(gapminder, package = "gapminder")
    p1 <- ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
        geom_point() +
        scale_x_continuous(name = "")
    p2 <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        scale_x_log10(labels = scales::comma, name = "")
    ```

    Which of these statements is true?

    a. The `x` axis labels are identical in both plots.
    b. The `x` axis labels in `p2` are in dollars; the labels in `p1`
       are in log dollars.
    c. The `x` axis labels in `p1` are in dollars; the labels in `p2`
       are in log dollars.
    d. There are no labels on the `x` axis in `p2`.

6. Consider the plot created by

    ```r
    library(ggplot2)
    data(gapminder, package = "gapminder")
    p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
        geom_point() +
        scale_x_log10(labels = scales::comma)
    ```

    Which of these expressions produces a plot with a white
    background?

    a. `p`
    b. `p + theme_grey()`
    c. `p + theme_classic()`
    d. `p + ggthemes::theme_economist()`

7. There are many different ways to change the `x` axis label in
   `ggplot`. Consider the plot created by

    ```r
    library(ggplot2)
    p <- ggplot(mpg, aes(x = displ, y = hwy)) +
        geom_point()
    ```

    Which of the following does **not** change the `x` axis label to
    _Displacement_?

    a.
       `p + labs(x = "Displacement")`
    b. `p + scale_x_continuous("Displacement")`
    c. `p + xlab("Displacement")`
    d. `p + theme(axis.title.x = "Displacement")`