Background

The Grammar of Graphics is a language proposed by Leland Wilkinson for describing statistical graphs.

The grammar of graphics has served as the foundation for the graphics system in SPSS and several other systems.

ggplot2 represents an implementation and extension of the grammar for R.

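The basic idea is that any plot can be built out of a combination of a data set, one or more geometric objects (geoms), mappings of variables to aesthetic features of the geoms, statistical transformations (stats), position adjustments, a coordinate system, scales, and a faceting scheme.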

ggplot2 provides tools for specifying these components and adjusting their features.

Many are provided by default and do not need to be specified explicitly unless the defaults are to be changed.

A Basic Template

The simplest graph needs a data set, a geom, and a mapping:

ggplot(data = <DATA>) + <GEOM>(mapping = aes(<MAPPINGS>))

The appearance of geom objects is controlled by aesthetic features.

Each geom has some required and some optional aesthetics.

For geom_point the required aesthetics are x and y.

Optional aesthetics include alpha, colour (or color), fill, group, shape, size, and stroke.

ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = class))

ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = class, shape = factor(cyl)))

Aesthetics can also be set to constant values outside of aes() to override the defaults for all points:

ggplot(mpg) + geom_point(aes(x = displ, y = hwy), color = "blue", shape = 1)

Available point shapes are specified by number, from 0 to 25.

Some aesthetics only work properly in certain combinations. For example, fill only works with shapes 21–25:

ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl),
               shape = 21, size = 4)

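Specifying a new default is very different from mapping a constant value inside aes(), which is rarely what you want. Inside aes() the string "blue" is treated as a variable with a single level, so the color scale chooses the color and a legend is added: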

ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = "blue"))
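
The difference can be checked by inspecting the data ggplot2 computes for each layer. The following is a minimal sketch; the names p_set and p_mapped are chosen here for illustration, and layer_data() is used only to look at the computed colours.

```r
library(ggplot2)

# Constant set outside aes(): every point is drawn in blue.
p_set <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy), color = "blue")

# Constant mapped inside aes(): "blue" is treated as a one-level
# categorical variable, so the colour scale picks the colour.
p_mapped <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

The first call should report "blue"; the second reports whatever the default discrete colour scale assigns to its first level.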

Geometric Objects

ggplot2 provides a number of geoms:

geom_abline      geom_density_2d  geom_linerange   geom_rug
geom_area        geom_density2d   geom_map         geom_segment
geom_bar         geom_dotplot     geom_path        geom_sf
geom_bin2d       geom_errorbar    geom_point       geom_sf_label
geom_blank       geom_errorbarh   geom_pointrange  geom_sf_text
geom_boxplot     geom_freqpoly    geom_polygon     geom_smooth
geom_col         geom_hex         geom_qq          geom_spoke
geom_contour     geom_histogram   geom_qq_line     geom_step
geom_count       geom_hline       geom_quantile    geom_text
geom_crossbar    geom_jitter      geom_raster      geom_tile
geom_curve       geom_label       geom_rect        geom_violin
geom_density     geom_line        geom_ribbon      geom_vline

Additional geoms are available in packages like ggbeeswarm and ggridges.

Geoms can be added as layers to a plot.

Mappings common to all, or most, geoms can be specified in the ggplot call:

ggplot(mpg, aes(x = displ, y = hwy)) +  geom_smooth() + geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
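
Mappings given in ggplot() are inherited by every layer, and a layer can add to them in its own aes() call. A small sketch, assuming we want the points coloured by drv but a single smooth for all the data (the name p_layers is illustrative):

```r
library(ggplot2)

p_layers <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_smooth(se = FALSE) +          # uses only the shared x and y
    geom_point(aes(color = drv))       # adds color for this layer only
p_layers
```

Because color is mapped only in geom_point, the smooth is not split by drv.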

Geoms can also use different data sets. This was used to show simulated QQ plots in the background behind the QQ plot for the Galton heights data:

father.son <- UsingR::father.son
gb <- nboot(father.son$fheight, 50)
ggplot(father.son) +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb) +
    geom_qq(aes(sample = fheight))
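
The helper nboot is not defined in this section. One plausible reading, offered here only as an assumption, is that it simulates R normal samples with the mean and standard deviation of the data and returns the sorted values (x), the plotting positions (p), and a simulation index (sim) in one data frame. A sketch under that assumption, using simulated heights in place of the father.son data so that it only needs ggplot2:

```r
library(ggplot2)

# Hypothetical stand-in for the nboot helper used above.
nboot <- function(x, R) {
    n <- length(x)
    p <- (seq_len(n) - 0.5) / n            # plotting positions
    sims <- lapply(seq_len(R), function(i) {
        data.frame(x = sort(rnorm(n, mean(x), sd(x))),
                   p = p, sim = i)
    })
    do.call(rbind, sims)
}

set.seed(1)
heights <- data.frame(fheight = rnorm(200, mean = 68, sd = 2.7))  # made-up data
gb <- nboot(heights$fheight, 50)

# Gray lines: simulated normal samples; points: QQ plot of the data.
ggplot(heights) +
    geom_line(aes(x = qnorm(p), y = x, group = sim),
              color = "gray", data = gb) +
    geom_qq(aes(sample = fheight))
```

The data argument of geom_line replaces the plot's data set for that layer only; geom_qq still uses heights.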

Statistical Transformations

All geoms use a statistical transformation (stat) to convert raw data to the values to be mapped to the object's features.

The available stats are

stat_bin             stat_density2d       stat_smooth
stat_bin_2d          stat_ecdf            stat_spoke
stat_bin_hex         stat_ellipse         stat_sum
stat_bin2d           stat_function        stat_summary
stat_binhex          stat_identity        stat_summary_2d
stat_boxplot         stat_qq              stat_summary_bin
stat_contour         stat_qq_line         stat_summary_hex
stat_count           stat_quantile        stat_summary2d
stat_density         stat_sf              stat_unique
stat_density_2d      stat_sf_coordinates  stat_ydensity

Each geom has a default stat, and each stat has a default geom.
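
For example, geom_bar uses stat_count by default, and stat_count draws with geom_bar by default, so the two calls in this sketch should produce the same layer (the names p_geom and p_stat are illustrative):

```r
library(ggplot2)

p_geom <- ggplot(mpg, aes(x = class)) + geom_bar()    # default stat: count
p_stat <- ggplot(mpg, aes(x = class)) + stat_count()  # default geom: bar

# Both layers compute the same bar heights.
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```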

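Stats can provide computed variables that can be referenced as ..<variable>.. (in ggplot2 3.3.0 and later, after_stat(<variable>) is the preferred form).

For stat_bin some of the computed variables are count, density, ncount, and ndensity.

By default, geom_histogram uses y = ..count.. for the bar heights.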

library(HistData)
ggplot(Galton) +
    geom_histogram(aes(x = parent),
                   binwidth = 1, fill = "grey", color = "black")

Using y = ..density.. produces a density-scaled axis.

library(HistData)
p <- ggplot(Galton) +
    geom_histogram(aes(x = parent, y = ..density..),
                   binwidth = 1, fill = "grey", color = "black")
p
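
The computed variables can be inspected with layer_data(). The sketch below uses mpg$hwy instead of the Galton data so that it needs only ggplot2, and it uses after_stat(density), the spelling preferred over ..density.. in ggplot2 3.3.0 and later; the bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

p_dens <- ggplot(mpg) +
    geom_histogram(aes(x = hwy, y = after_stat(density)),
                   binwidth = 2, fill = "grey", color = "black")

# One row per bin, with the variables computed by stat_bin.
d <- layer_data(p_dens)
head(d[, c("xmin", "xmax", "count", "density")])

# On the density scale the bar areas add up to one.
sum(d$density * (d$xmax - d$xmin))
```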

stat_function can be used to add a normal density curve.

m <- mean(Galton$parent)
s <- sd(Galton$parent)
p + stat_function(fun = function(x) dnorm(x, m, s), color = "red")

Position Adjustments

Some available position adjustments:

position_dodge        position_identity     position_nudge
position_dodge2       position_jitter       position_stack
position_fill         position_jitterdodge  

For bar charts these allow choosing between stacked and side-by-side charts.

The default is position_stack:

ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "stack")

position_dodge produces side-by-side bar charts:

ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "dodge")

position_fill rescales all bars to be equal height to help compare proportions within bars.

The y axis is still labeled "count" by default; a more descriptive label such as "proportion" can be set with labs(y = "proportion").

ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "fill")
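
A quick check that position = "fill" rescales each stack to height one. This is a sketch; the name p_fill and the axis label text are choices made here.

```r
library(ggplot2)

p_fill <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
    geom_bar(position = "fill") +
    labs(y = "proportion")   # replaces the default y axis label

# The top of every stacked bar is at 1.
d <- layer_data(p_fill)
tapply(d$ymax, as.integer(d$x), max)
```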

Using the counts to scale the widths produces a spine plot, a variant of a mosaic plot. This is easiest to do with the ggmosaic package.

position_jitter can be used with geom_point to avoid overplotting or break up rounding artifacts.

p <- ggplot(mpg, aes(x = displ, y = hwy))
p + geom_point(position = "jitter")

To jitter only horizontally you can use

p + geom_point(position = position_jitter(height=0))
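
The amount of jitter can be given explicitly, and a seed makes the result reproducible. A sketch, assuming a horizontal jitter of 0.1 is suitable for these data (the width and seed values are arbitrary, and the name p_jit is illustrative):

```r
library(ggplot2)

p_jit <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))

d <- layer_data(p_jit)
all(d$y == mpg$hwy)                  # vertical positions are unchanged
max(abs(d$x - mpg$displ)) <= 0.1     # horizontal shifts stay within width
```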

Coordinate Systems

Coordinate system functions include

coord_cartesian  coord_flip       coord_polar      coord_trans
coord_equal      coord_map        coord_quickmap   
coord_fixed      coord_munch      coord_sf         

We already used coord_flip and coord_polar.

The default coordinate system is coord_cartesian.

coord_cartesian can be used to zoom in on a particular region:

p + geom_point() + coord_cartesian(xlim=c(3,4))
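
Zooming with coord_cartesian differs from setting limits on the scale: the coordinate system only changes the visible window, while scale limits discard values outside the range before any stat is computed. A sketch of the difference (the names base, p_zoom, and p_lim are illustrative):

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

p_zoom <- base + coord_cartesian(xlim = c(3, 4))         # keeps all data
p_lim  <- base + scale_x_continuous(limits = c(3, 4))    # censors data outside 3 to 4

# Number of usable x values left in each layer.
n_zoom <- sum(!is.na(layer_data(p_zoom)$x))
n_lim  <- sum(!is.na(suppressWarnings(layer_data(p_lim))$x))
c(zoom = n_zoom, limits = n_lim)
```

The distinction matters most for layers such as geom_smooth, where the fitted curve depends on which observations are kept.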

coord_fixed and coord_equal fix the aspect ratio for a cartesian coordinate system.

The aspect ratio is the ratio of the number of physical display units per y unit to the number of physical display units per x unit.

The aspect ratio can be important for recognizing features and patterns.

In a PP plot the 45 degree line plays an important role, so using an aspect ratio of 1 is helpful:

library(gridExtra)
m <- mean(father.son$fheight)
s <- sd(father.son$fheight)
n <- nrow(father.son)
prop <- (1 : n) / n - 0.5 / n
pp1 <- ggplot(father.son) +
    geom_point(aes(x = prop, y = sort(pnorm(fheight, m, s))))
pp2 <- pp1 + coord_fixed(ratio = 1)
grid.arrange(pp1, pp2, nrow = 1)

Coordinate systems are particularly important for maps.

Polygons for many political and geographic boundaries are available through the map_data function.

usa <- map_data("state")

Polygon vertices are encoded by longitude and latitude. Plotting these in the default cartesian coordinate system usually does not work well:

m <- ggplot(usa, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "white", color = "black")
m

Using a fixed aspect ratio is better, but an aspect ratio of 1 does not work well:

m + coord_equal()

The problem is that away from the equator a one degree change in latitude corresponds to a larger distance than a one degree change in longitude.

The ratio of one degree longitude separation to one degree latitude separation at 41 degrees, the latitude of the middle of Iowa, is

longlat <- cos(41/90 * pi /2)
longlat
## [1] 0.7547096
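
The same calculation for an arbitrary latitude can be wrapped in a small function. This is a sketch; the name lonlat_ratio is introduced here and is not part of ggplot2.

```r
# Ratio of the distance covered by one degree of longitude to the
# distance covered by one degree of latitude, at latitude lat (degrees),
# treating the earth as a sphere.
lonlat_ratio <- function(lat) cos(lat * pi / 180)

lonlat_ratio(41)        # same value as longlat above
1 / lonlat_ratio(41)    # aspect ratio to pass to coord_fixed
```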

A better map is obtained using the aspect ratio 1 / longlat:

m + coord_fixed(1 / longlat)

The best approach is to use a coordinate system designed specifically for maps.

m + coord_map()

There are many projections used in map making; the default projection used by coord_map is the Mercator projection.

Proper map projections are non-linear; this is easier to see with the Lagrange projection:

m + coord_map("lagrange")

Scales

Scales are used for controlling the mapping of values to physical representations such as colors, shapes, and positions.

Scale functions are also responsible for producing guides for translating physical representations back to values, such as axis tick labels and legends.

There are 94 scale functions; some examples are

scale_color_gradient      scale_shape_manual     scale_x_log10
scale_color_identity      scale_size_area        scale_y_log10
scale_fill_gradient                              scale_x_sqrt
scale_fill_manual                                scale_y_sqrt

Start with a basic plot:

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p

Remove the tick marks and labels (this can also be done with theme settings):

p + scale_x_continuous(labels = NULL, breaks = NULL)

Change the tick locations and labels:

p + scale_x_continuous(labels = paste(c(2, 4, 6), "ltr"), breaks = c(2, 4, 6))

Use a logarithmic axis:

p + scale_x_log10(labels = paste(c(2, 4, 6), "ltr"),
                  breaks = c(2, 4, 6),
                  minor_breaks = c(3, 5, 7))

The Scales section in R for Data Science provides some more details.

Color assignment can also be controlled by scale functions. For example, for the presidential approval ratings data

pr_appr <- data.frame(pres = c("Obama", "Carter", "Clinton",
                               "G.W. Bush", "Reagan", "G.H.W. Bush", "Trump"),
                      appr = c(79, 78, 68, 65, 58, 56, 40),
                      party = c("D", "D", "D", "R", "R", "R", "R"),
                      year = c(2009, 1977, 1993, 2001, 1981, 1989, 2017))
pr_appr <- mutate(pr_appr, pres = reorder(pres, appr))

the common assignment of red for Republican and blue for Democrat can be obtained by

ggplot(pr_appr, aes(x = pres, y = appr, fill = party)) +
    geom_col() + coord_flip() +
    scale_fill_manual(values = c(R = "red", D = "blue")) 

Themes

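ggplot2 supports the notion of themes for adjusting non-data appearance aspects of a plot, such as fonts, background colors, grid lines, and legend placement.

Theme elements can be customized in several ways: by adding a theme() call to an individual plot, or by changing the session default with theme_set() or theme_update().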

The full documentation of the theme function lists many customizable elements.
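
Besides adding theme() to a single plot, the default theme for a session can be changed with theme_set() and theme_update(). A sketch, in which the particular settings are arbitrary and the name old is illustrative:

```r
library(ggplot2)

old <- theme_set(theme_bw())             # returns the previous default theme
theme_update(legend.position = "top")    # modify the current default

# This plot picks up the modified default theme.
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point()

theme_set(old)                           # restore the previous default
```

Saving the value returned by theme_set() makes it easy to undo the change afterwards.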

One simple example:

ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl),
               shape = 21, size = 3) +
    theme(legend.position = "top",
          axis.text = element_text(size = 12),
          axis.title = element_text(size = 14, face = "bold"))

Another example:

gthm <- theme(plot.background = element_rect(fill = "lightblue", color = NA),
              panel.background = element_rect(fill = "lightblue2"))
p + gthm

Some alternate complete themes provided by ggplot2 are

theme_bw        theme_gray      theme_minimal   theme_void
theme_classic   theme_grey      theme_dark      theme_light

p_bw <- p + theme_bw() + ggtitle("BW")
p_classic <- p + theme_classic() + ggtitle("Classic")
p_min <- p + theme_minimal() + ggtitle("Minimal")
p_void <- p + theme_void() + ggtitle("Void")
grid.arrange(p_bw, p_classic, p_min, p_void, nrow = 2)

The ggthemes package provides some additional themes. Some examples:

library(ggthemes)
p_econ <- p + theme_economist() + ggtitle("Economist")
p_wsj <- p + theme_wsj() + ggtitle("WSJ")
p_tufte <- p + theme_tufte() + ggtitle("Tufte")
p_few <- p + theme_few() + ggtitle("Few")
grid.arrange(p_econ, p_wsj, p_tufte, p_few, nrow = 2)

ggthemes also provides theme_map that removes unnecessary elements from maps:

m + coord_map() + theme_map()

The Themes section in R for Data Science provides some more details.

Facets

Faceting uses the small multiples approach to introduce additional variables.

For a single variable facet_wrap is usually used:

p <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy))
p + facet_wrap(~ class)

For two variables, each with a modest number of categories, facet_grid can be effective:

p + facet_grid(factor(cyl) ~ drv)
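
Both faceting functions accept arguments that control the layout and the axis scales. A sketch, with nrow = 2 and scales = "free_y" chosen arbitrarily (the name p_facet is illustrative):

```r
library(ggplot2)

p_facet <- ggplot(mpg) +
    geom_point(aes(x = displ, y = hwy)) +
    facet_wrap(~ class, nrow = 2, scales = "free_y")
p_facet

# One panel per vehicle class.
length(unique(layer_data(p_facet)$PANEL))
```

Free scales make within-panel patterns easier to see, at the cost of making comparisons across panels harder.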

Facet arrangement can also be used to convey other information, such as geographic location.

The geofacet package allows facets to be placed in approximate locations of different geographic regions.

An example for data from US states:

library(geofacet)
ggplot(state_unemp, aes(year, rate)) +
    geom_line() +
    facet_geo(~ state, grid = "us_state_grid2", label = "name") +
    scale_x_continuous(labels = function(x) paste0("'", substr(x, 3, 4))) +
    labs(title = "Seasonally Adjusted US Unemployment Rate 2000-2016",
         caption = "Data Source: bls.gov",
         x = "Year",
         y = "Unemployment Rate (%)") +
    theme(strip.text.x = element_text(size = 6))

Arrangement according to a calendar is also useful.

A More Complete Template

ggplot(data = <DATA>) +
    <GEOM>(mapping = aes(<MAPPINGS>),
           stat = <STAT>,
           position = <POSITION>) +
    < ... MORE GEOMS ... > +
    <COORDINATE_ADJUSTMENT> +
    <SCALE_ADJUSTMENT> +
    <FACETING> +
    <THEME_ADJUSTMENT>
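
One way the template might be filled in, as a sketch only; the choice of variables, colors, and theme here is arbitrary, and the name p_full is illustrative:

```r
library(ggplot2)

p_full <- ggplot(data = mpg) +
    geom_bar(mapping = aes(x = class, fill = drv),
             stat = "count",                       # <STAT>
             position = "dodge") +                 # <POSITION>
    coord_flip() +                                 # <COORDINATE_ADJUSTMENT>
    scale_fill_manual(values = c("4" = "grey40",   # <SCALE_ADJUSTMENT>
                                 f = "steelblue",
                                 r = "firebrick")) +
    facet_wrap(~ year) +                           # <FACETING>
    theme_minimal()                                # <THEME_ADJUSTMENT>
p_full
```

Here stat = "count" is the default for geom_bar and is written out only to show where it fits in the template.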

Interaction

The ggplotly function in the plotly package can be used to add some interactive features to a plot created with ggplot2.

library(plotly)
p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl),
               shape = 21, size = 3)
ggplotly(p)

Adding a text aesthetic allows the tooltip display to be customized:

p <- ggplot(mutate(mpg, cyl = factor(cyl))) +
    geom_point(aes(x = displ, y = hwy, fill = cyl,
                   text = paste(year, manufacturer, model)),
               shape = 21, size = 3)
## Warning: Ignoring unknown aesthetics: text
ggplotly(p, tooltip = "text")

Notes