---
title: "The Grammar of Graphics"
output:
  html_document:
    toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
```{r, include = FALSE}
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
set.seed(12345)
```
## Background
The _Grammar of Graphics_ is a language proposed by Leland Wilkinson
for describing statistical graphs.
* Wilkinson, L. (2005), _The Grammar of Graphics_, 2nd ed., Springer.
The grammar of graphics has served as the foundation for the graphics
system in SPSS and several other systems.
`ggplot2` represents an implementation and extension of the grammar
for R.
* Wickham, H. (2016), _ggplot2: Elegant Graphics for Data Analysis_,
2nd ed., Springer.
* Online documentation: .
* Wickham, H., and Grolemund, G. (2016),
[_R for Data Science_](http://r4ds.had.co.nz/), O'Reilly.
* [Data visualization cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)
The basic idea is that any basic plot can be built out of a combination of
* a data set
* one or more geometrical representations (_geoms_)
* mappings of values to _aesthetic_ features of the geom
* a _stat_ to produce values to be mapped
* position adjustments
* a coordinate system
* a scale specification
* a faceting scheme
`ggplot2` provides tools for specifying these components and adjusting
their features.
Many are provided by default and do not need to be specified
explicitly unless the defaults are to be changed.
## A Basic Template
The simplest graph needs a data set, a geom, and a mapping:
```r
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
The appearance of geom objects is controlled by _aesthetic_ features.
Each geom has some required and some optional aesthetics.
For `geom_point` the required aesthetics are
* `x` position
* `y` position.
Optional aesthetics include
* `color`
* `fill`
* `shape`
* `size`
```{r}
ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = class))
ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = class, shape = factor(cyl)))
```
Many optional aesthetics can also be used to override common defaults:
```{r}
ggplot(mpg) + geom_point(aes(x = displ, y = hwy), color = "blue", shape = 1)
```
Available point shapes are specified by number:
```{r, echo = FALSE}
generateRPointShapes <- function() {
    oldPar <- par()
    par(font = 2, mar = c(0.5, 0, 0, 0))
    y <- rev(c(rep(1, 6), rep(2, 5), rep(3, 5), rep(4, 5), rep(5, 5)))
    x <- c(rep(1:5, 5), 6)
    plot(x, y, pch = 0:25, cex = 1.5, ylim = c(1, 5.5), xlim = c(1, 6.5),
         axes = FALSE, xlab = "", ylab = "", bg = "blue")
    text(x, y, labels = 0:25, pos = 3)
    par(mar = oldPar$mar, font = oldPar$font)
}
generateRPointShapes()
```
Some of these only work properly in certain combinations. For example,
`fill` only works with shapes 21--25:
```{r}
ggplot(mutate(mpg, cyl = factor(cyl))) +
geom_point(aes(x = displ, y = hwy, fill = cyl),
shape = 21, size = 4)
```
Specifying a new default is very different from specifying a constant
value as an aesthetic, which is rarely what you want:
```{r}
ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = "blue"))
```
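One way to see the difference is to inspect the data that `ggplot2` builds for each layer. The following is a small sketch, not part of the original examples, that uses `layer_data` to compare the two specifications; the names `p_mapped` and `p_set` are made up for illustration.

```r
library(ggplot2)

# "blue" inside aes() is treated as a data value (a one-level category)
p_mapped <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy, color = "blue"))

# color outside aes() sets a constant for the whole layer
p_set <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy), color = "blue")

unique(layer_data(p_mapped)$colour)  # first color of the default discrete palette
unique(layer_data(p_set)$colour)     # the requested "blue"
```

In the mapped version the points are drawn in the first color of the default discrete palette, and a legend with the single entry `blue` is added.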
## Geometric Objects
`ggplot2` provides a number of geoms:
```r
geom_abline geom_density_2d geom_linerange geom_rug
geom_area geom_density2d geom_map geom_segment
geom_bar geom_dotplot geom_path geom_sf
geom_bin2d geom_errorbar geom_point geom_sf_label
geom_blank geom_errorbarh geom_pointrange geom_sf_text
geom_boxplot geom_freqpoly geom_polygon geom_smooth
geom_col geom_hex geom_qq geom_spoke
geom_contour geom_histogram geom_qq_line geom_step
geom_count geom_hline geom_quantile geom_text
geom_crossbar geom_jitter geom_raster geom_tile
geom_curve geom_label geom_rect geom_violin
geom_density geom_line geom_ribbon geom_vline
```
Additional geoms are available in packages like `ggbeeswarm` and `ggridges`.
Geoms can be added as layers to a plot.
Mappings common to all, or most, geoms can be specified in the `ggplot` call:
```{r}
ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth() + geom_point()
```
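Individual layers can add to the mappings given in the `ggplot` call. As a minimal sketch (the choice of `drv` and the name `p_layers` are illustrative assumptions), the points below are colored by drive type while the smooth uses only the shared `x` and `y` mappings:

```r
library(ggplot2)

p_layers <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_smooth(se = FALSE) +          # inherits x and y only
    geom_point(aes(color = drv))       # inherits x and y, adds color

# Number of distinct colors used in each layer
length(unique(layer_data(p_layers, 1)$colour))  # smooth: a single color
length(unique(layer_data(p_layers, 2)$colour))  # points: one per level of drv
```

Because `color` is mapped only in `geom_point`, a single smooth is fit to all the data rather than one per drive type.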
Geoms can also use different data sets. This was used to show simulated QQ plots
in the background behind the QQ plot for the Galton heights data:
```{r, include = FALSE}
## reproduced from qqpp.Rmd
nsim <- function(n, m = 0, s = 1) {
    z <- rnorm(n)
    m + s * ((z - mean(z)) / sd(z))
}
nboot <- function(x, R) {
    n <- length(x)
    m <- mean(x)
    s <- sd(x)
    do.call(rbind,
            lapply(1 : R,
                   function(i) {
                       xx <- sort(nsim(n, m, s))
                       p <- seq_along(x) / n - 0.5 / n
                       data.frame(x = xx, p = p, sim = i)
                   }))
}
```
```{r}
father.son <- UsingR::father.son
gb <- nboot(father.son$fheight, 50)
ggplot(father.son) +
geom_line(aes(x = qnorm(p), y = x, group = sim),
color = "gray", data = gb) +
geom_qq(aes(sample = fheight))
```
## Statistical Transformations
All geoms use a statistical transformation (_stat_) to convert raw
data to the values to be mapped to the object's features.
The available stats are
```r
stat_bin stat_density2d stat_smooth
stat_bin_2d stat_ecdf stat_spoke
stat_bin_hex stat_ellipse stat_sum
stat_bin2d stat_function stat_summary
stat_binhex stat_identity stat_summary_2d
stat_boxplot stat_qq stat_summary_bin
stat_contour stat_qq_line stat_summary_hex
stat_count stat_quantile stat_summary2d
stat_density stat_sf stat_unique
stat_density_2d stat_sf_coordinates stat_ydensity
```
Each geom has a default stat, and each stat has a default geom.
* For `geom_point` the default stat is `stat_identity`.
* For `geom_bar` the default stat is `stat_count`.
* For `geom_histogram` the default is `stat_bin`.
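The pairing of geoms and stats can be checked directly. The sketch below, using the `mpg` data and illustrative object names, builds the same bar chart once from the geom side and once from the stat side and compares the computed counts:

```r
library(ggplot2)

# Starting from the geom: geom_bar uses stat_count by default
p_geom <- ggplot(mpg) + geom_bar(aes(x = class))

# Starting from the stat: stat_count uses the bar geom by default
p_stat <- ggplot(mpg) + stat_count(aes(x = class))

# Both layers carry the same computed counts
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```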
Stats can provide _computed variables_ that can be referenced by
surrounding the variable name with double dots, as in `..count..`.
For `stat_bin` some of the computed variables are
* `count`: number of points in bin
* `density`: density of points in bin, scaled to integrate to 1
By default, `geom_histogram` uses `y = ..count..`.
```{r}
library(HistData)
ggplot(Galton) +
geom_histogram(aes(x = parent),
binwidth = 1, fill = "grey", color = "black")
```
Using `y = ..density..` produces a density-scaled axis:
```{r}
library(HistData)
p <- ggplot(Galton) +
geom_histogram(aes(x = parent, y = ..density..),
binwidth = 1, fill = "grey", color = "black")
p
```
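Recent versions of `ggplot2` (3.3.0 and later) also provide `after_stat()` for referring to computed variables, and the double-dot notation is deprecated as of version 3.4.0. The following sketch assumes such a version; it uses `mpg$hwy` rather than the `Galton` data so that it depends only on `ggplot2`, and it checks that the density-scaled bars have total area one:

```r
library(ggplot2)

p_density <- ggplot(mpg) +
    geom_histogram(aes(x = hwy, y = after_stat(density)),
                   binwidth = 2, fill = "grey", color = "black")

# Bar height times bar width, summed over bins, gives the total area
d <- layer_data(p_density)
sum(d$density * (d$xmax - d$xmin))
```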
`stat_function` can be used to add a normal density curve.
```{r}
m <- mean(Galton$parent)
s <- sd(Galton$parent)
p + stat_function(fun = function(x) dnorm(x, m, s), color = "red")
```
## Position Adjustments
Some available position adjustments:
```r
position_dodge position_identity position_nudge
position_dodge2 position_jitter position_stack
position_fill position_jitterdodge
```
For bar charts these allow choosing between stacked and side-by-side
charts.
The default is `position_stack`:
```{r}
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "stack")
```
`position_dodge` produces side-by-side bar charts:
```{r}
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "dodge")
```
`position_fill` rescales all bars to be equal height to help compare
proportions within bars.
The `y` axis is still labeled `count` by default; adding
`ylab("proportion")` produces a better `y` axis label.
```{r}
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "fill")
```
Using the counts to scale the widths produces a _spine plot_, a
variant of a _mosaic plot_. This is easiest to do with the `ggmosaic`
package.
`position_jitter` can be used with `geom_point` to avoid overplotting
or break up rounding artifacts.
```{r}
p <- ggplot(mpg, aes(x = displ, y = hwy))
p + geom_point(position = "jitter")
```
To jitter only horizontally you can use
```{r}
p + geom_point(position = position_jitter(height=0))
```
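Jittering is random, so the plot changes each time it is drawn. A minimal sketch, assuming a version of `ggplot2` in which `position_jitter` accepts a `seed` argument, makes the jitter reproducible; the `width` and `seed` values are arbitrary choices for illustration:

```r
library(ggplot2)

p_jit <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(position = position_jitter(width = 0.05, height = 0, seed = 42))

# Building the plot twice gives the same jittered positions
d1 <- layer_data(p_jit)
d2 <- layer_data(p_jit)
identical(d1$x, d2$x)

# With height = 0 the vertical positions are left unchanged
all(d1$y == mpg$hwy)
```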
## Coordinate Systems
Coordinate system functions include
```r
coord_cartesian coord_flip coord_polar coord_trans
coord_equal coord_map coord_quickmap
coord_fixed coord_munch coord_sf
```
We already used `coord_flip` and `coord_polar`.
The default coordinate system is `coord_cartesian`.
`coord_cartesian` can be used to _zoom in_ on a particular region:
```{r}
p + geom_point() + coord_cartesian(xlim=c(3,4))
```
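Zooming with `coord_cartesian` differs from setting limits on the scale: the coordinate system only changes the visible region, while scale limits replace out-of-range values with missing values before the layer is drawn. The sketch below, with illustrative object names, compares the two by looking at the built layer data:

```r
library(ggplot2)

p0 <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

p_zoom <- p0 + coord_cartesian(xlim = c(3, 4))        # zoom: all data kept
p_clip <- p0 + scale_x_continuous(limits = c(3, 4))   # limits: data censored

sum(!is.na(layer_data(p_zoom)$x))  # every row of mpg still has an x value
sum(!is.na(layer_data(p_clip)$x))  # only rows with displ between 3 and 4 remain
```

The distinction matters most for layers that compute summaries, such as `geom_smooth`, since with scale limits the summary is based only on the retained data.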
`coord_fixed` and `coord_equal` fix the _aspect ratio_ for a cartesian
coordinate system.
The aspect ratio is the ratio of the number of physical display units per
`y` unit to the number of physical display units per `x` unit.
The aspect ratio can be important for recognizing features and patterns.
In a PP plot the 45 degree line plays an important role, so using an
aspect ratio of 1 is helpful:
```{r}
library(gridExtra)
m <- mean(father.son$fheight)
s <- sd(father.son$fheight)
n <- nrow(father.son)
prop <- (1 : n) / n - 0.5 / n
pp1 <- ggplot(father.son) +
geom_point(aes(x = prop, y = sort(pnorm(fheight, m, s))))
pp2 <- pp1 + coord_fixed(ratio = 1)
grid.arrange(pp1, pp2, nrow = 1)
```
Coordinate systems are particularly important for maps.
Polygons for many political and geographic boundaries are available
through the `map_data` function.
```{r}
usa <- map_data("state")
```
Polygon vertices are encoded by longitude and latitude. Plotting these
in the default cartesian coordinate system usually does not work well:
```{r}
m <- ggplot(usa, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black")
m
```
Using a fixed aspect ratio is better, but an aspect ratio of 1 does
not work well:
```{r}
m + coord_equal()
```
The problem is that away from the equator a one degree change in
latitude corresponds to a larger distance than a one degree change in
longitude.
The ratio of one degree longitude separation to one degree latitude
separation at a latitude of 41 degrees, roughly the middle of Iowa, is
```{r}
longlat <- cos(41/90 * pi /2)
longlat
```
A better map is obtained using the aspect ratio `1 / longlat`:
```{r}
m + coord_fixed(1 / longlat)
```
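The calculation generalizes to other latitudes: one degree of longitude spans a distance proportional to the cosine of the latitude. A small sketch, in which `map_aspect` is a hypothetical helper name and not a `ggplot2` function:

```r
# Hypothetical helper: aspect ratio that equalizes distances in the
# longitude and latitude directions at a given latitude (in degrees)
map_aspect <- function(lat_degrees) {
    1 / cos(lat_degrees * pi / 180)
}

map_aspect(41)  # the same value as 1 / longlat
map_aspect(0)   # at the equator the ratio is 1
```

The result can be passed to `coord_fixed` for a map centered near the chosen latitude; the approximation degrades for maps that span a wide range of latitudes.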
The best approach is to use a coordinate system designed specifically for
maps.
```{r}
m + coord_map()
```
There are many projections used in map making; the default projection
used by `coord_map` is the
[Mercator](https://en.wikipedia.org/wiki/Mercator_projection)
projection.
Proper map projections are non-linear; this is easier to see with the
Lagrange projection:
```{r}
m + coord_map("lagrange")
```
## Scales
Scales are used for controlling the mapping of values to physical
representations such as colors, shapes, and positions.
Scale functions are also responsible for producing _guides_ for
translating physical representations back to values, such as
* axis labels and marks;
* color or shape legends.
There are `r length(ls("package:ggplot2", pat = "scale_"))` scale
functions; some examples are
```r
scale_color_gradient scale_shape_manual scale_x_log10
scale_color_identity scale_size_area scale_y_log10
scale_fill_gradient scale_x_sqrt
scale_fill_manual scale_y_sqrt
```
Start with a basic plot:
```{r}
p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p
```
Remove the tick marks and labels (this can also be done with theme settings):
```{r}
p + scale_x_continuous(labels = NULL, breaks = NULL)
```
Change the tick locations and labels:
```{r}
p + scale_x_continuous(labels = paste(c(2, 4, 6), "ltr"), breaks = c(2, 4, 6))
```
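The `labels` argument also accepts a function of the break values, which avoids repeating the breaks in the label vector. A minimal sketch, where `ltr_label` is a made-up helper name:

```r
library(ggplot2)

# Hypothetical helper: append a unit to each break value
ltr_label <- function(x) paste(x, "ltr")

p_lab <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    scale_x_continuous(breaks = c(2, 4, 6), labels = ltr_label)

ltr_label(c(2, 4, 6))  # the labels the scale will use for these breaks
```

With this form the labels stay in step with the breaks if the breaks are changed later.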
Use a logarithmic axis:
```{r}
p + scale_x_log10(labels = paste(c(2, 4, 6), "ltr"),
breaks = c(2, 4, 6),
minor_breaks = c(3, 5, 7))
```
The
[Scales](http://r4ds.had.co.nz/graphics-for-communication.html#scales)
section in [R for Data Science](http://r4ds.had.co.nz/) provides some
more details.
Color assignment can also be controlled by scale functions. For example,
for the presidential approval ratings data
```{r}
pr_appr <- data.frame(pres = c("Obama", "Carter", "Clinton",
"G.W. Bush", "Reagan", "G.H.W Bush", "Trump"),
appr = c(79, 78, 68, 65, 58, 56, 40),
party = c("D", "D", "D", "R", "R", "R", "R"),
year = c(2009, 1977, 1993, 2001, 1981, 1989, 2017))
pr_appr <- mutate(pr_appr, pres = reorder(pres, appr))
```
the common assignment of red for Republican and blue for Democrat can
be obtained by
```{r}
ggplot(pr_appr, aes(x = pres, y = appr, fill = party)) +
geom_col() + coord_flip() +
scale_fill_manual(values = c(R = "red", D = "blue"))
```
## Themes
`ggplot2` supports the notion of _themes_ for adjusting non-data
appearance aspects of a plot, such as
* plot titles
* axis and legend placement and titles
* background colors
* guide line placement
Theme elements can be customized in several ways:
* `theme` can be used to adjust individual elements in a plot;
* `theme_set` adjusts default settings for a session;
* pre-defined theme functions allow consistent style changes.
The
[full documentation](http://ggplot2.tidyverse.org/reference/theme.html)
of the `theme` function lists many customizable elements.
One simple example:
```{r}
ggplot(mutate(mpg, cyl = factor(cyl))) +
geom_point(aes(x = displ, y = hwy, fill = cyl),
shape = 21, size = 3) +
theme(legend.position = "top",
axis.text = element_text(size = 12),
axis.title = element_text(size = 14, face = "bold"))
```
Another example:
```{r}
gthm <- theme(plot.background = element_rect(fill = "lightblue", color = NA),
panel.background = element_rect(fill = "lightblue2"))
p + gthm
```
Some alternate complete themes provided by `ggplot2` are
```r
theme_bw theme_gray theme_minimal theme_void
theme_classic theme_grey theme_dark theme_light
```
```{r}
p_bw <- p + theme_bw() + ggtitle("BW")
p_classic <- p + theme_classic() + ggtitle("Classic")
p_min <- p + theme_minimal() + ggtitle("Minimal")
p_void <- p + theme_void() + ggtitle("Void")
grid.arrange(p_bw, p_classic, p_min, p_void, nrow = 2)
```
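A complete theme can also be made the default for the rest of a session with `theme_set`, which returns the previous default so that it can be restored. A minimal sketch, with illustrative object names:

```r
library(ggplot2)

# Make theme_bw the session default; keep the previous default
old_theme <- theme_set(theme_bw())

# A plot with no theme of its own uses the current default when it is drawn
p_default <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Restore the previous default
theme_set(old_theme)
```

Since the default theme is looked up when a plot is drawn, printing `p_default` after the last line uses the restored theme, not `theme_bw`.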
The
[`ggthemes`](http://www.rpubs.com/Mentors_Ubiqum/ggthemes_1)
package provides some additional themes. Some examples:
```{r}
library(ggthemes)
p_econ <- p + theme_economist() + ggtitle("Economist")
p_wsj <- p + theme_wsj() + ggtitle("WSJ")
p_tufte <- p + theme_tufte() + ggtitle("Tufte")
p_few <- p + theme_few() + ggtitle("Few")
grid.arrange(p_econ, p_wsj, p_tufte, p_few, nrow = 2)
```
`ggthemes` also provides `theme_map` that removes unnecessary elements
from maps:
```{r}
m + coord_map() + theme_map()
```
The
[Themes](http://r4ds.had.co.nz/graphics-for-communication.html#themes)
section in [R for Data Science](http://r4ds.had.co.nz/) provides some
more details.
## Facets
Faceting uses the _small multiples_ approach to introduce additional
variables.
For a single variable `facet_wrap` is usually used:
```{r}
p <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy))
p + facet_wrap(~ class)
```
For two variables, each with a modest number of categories,
`facet_grid` can be effective:
```{r}
p + facet_grid(factor(cyl) ~ drv)
```
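By default all facets share the same axis ranges. When the groups differ a lot in level, the `scales` argument lets each panel use its own range; the sketch below frees the `y` axis and also sets the number of columns, both of which are illustrative choices:

```r
library(ggplot2)

p_free <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy)) +
    facet_wrap(~ class, ncol = 4, scales = "free_y")

# One panel is created for each level of class
length(unique(layer_data(p_free)$PANEL))
```

Free scales make patterns within each panel easier to see, at the cost of making comparisons across panels harder.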
Facet arrangement can also be used to convey other information, such
as geographic location.
The [`geofacet` package](https://hafen.github.io/geofacet/) allows
facets to be placed in approximate locations of different geographic
regions.
An example for data from US states:
```{r}
library(geofacet)
ggplot(state_unemp, aes(year, rate)) +
geom_line() +
facet_geo(~ state, grid = "us_state_grid2", label = "name") +
scale_x_continuous(labels = function(x) paste0("'", substr(x, 3, 4))) +
labs(title = "Seasonally Adjusted US Unemployment Rate 2000-2016",
caption = "Data Source: bls.gov",
x = "Year",
y = "Unemployment Rate (%)") +
theme(strip.text.x = element_text(size = 6))
```
Arrangement according to a calendar is also useful.
## A More Complete Template
```r
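ggplot(data = <DATA>) +
    <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
                    stat = <STAT>,
                    position = <POSITION>) +
    < ... MORE GEOMS ... > +
    <COORDINATE_FUNCTION> +
    <SCALE_FUNCTION> +
    <FACET_FUNCTION> +
    <THEME_FUNCTION>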
```
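As a hedged illustration, one way to fill in every slot of such a template with the `mpg` data is sketched below; the particular geoms, colors, limits, and faceting variable are arbitrary choices made for this example:

```r
library(ggplot2)

p_full <- ggplot(data = mpg) +
    # first geom, with its stat and position given explicitly
    geom_point(mapping = aes(x = displ, y = hwy, color = drv),
               stat = "identity",
               position = "jitter") +
    # a second geom
    geom_smooth(mapping = aes(x = displ, y = hwy),
                method = "lm", se = FALSE) +
    # coordinate system, scale, facets, and theme
    coord_cartesian(ylim = c(10, 45)) +
    scale_color_manual(values = c("4" = "black", f = "red", r = "blue")) +
    facet_wrap(~ year) +
    theme_bw()

nrow(layer_data(p_full, 1))                    # one point per row of mpg
length(unique(layer_data(p_full, 1)$PANEL))    # one panel per model year
```

Printing `p_full` draws the plot; in practice most of these components are left at their defaults.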
## Interaction
The `ggplotly` function in the [`plotly` package](https://plot.ly/r/)
can be used to add some interactive features to a plot created with
`ggplot2`.
* In an R session a call to `ggplotly` opens a browser window with the
interactive plot.
* In an R Markdown document the interactive plot is embedded in the
`html` file.
```{r, message = FALSE}
library(plotly)
p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
geom_point(aes(x = displ, y = hwy, fill = cyl),
shape = 21, size = 3)
ggplotly(p)
```
Adding a `text` aesthetic allows the tooltip display to be customized:
```{r, message = FALSE}
p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
geom_point(aes(x = displ, y = hwy, fill = cyl,
text = paste(year, manufacturer, model)),
shape = 21, size = 3)
ggplotly(p, tooltip = "text")
```
## Notes
* There have been several efforts to develop a grammar of interactive
graphics, including [`ggvis`](http://ggvis.rstudio.com/) and
[`animint`](https://tdhock.github.io/animint/); neither seems to be
under active development at this time.
* A recent project
[`gganimate`](https://github.com/thomasp85/gganimate) to add
animation to `ggplot` looks very promising.
* A number of other [`ggplot`
extensions](https://www.ggplot2-exts.org/) are available.
* A [recent blog
post](https://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535)
explains how the [BBC Visual and Data
Journalism](https://medium.com/bbc-visual-and-data-journalism) team
creates their graphics. More details are provided in an [_R cook
book_](https://bbc.github.io/rcookbook/).
* A [blog
post](https://blog.revolutionanalytics.com/2016/07/data-journalism-with-r-at-538.html)
on use of R and `ggplot` by
[FiveThirtyEight](https://fivethirtyeight.com/). The `ggthemes`
package includes `theme_fivethirtyeight` to emulate their style.