Scalability

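library(ggplot2)
library(gridExtra)  # provides grid.arrange()

# Simulate 50,000 points and take nested subsets of 10,000 and 1,000.
n <- 50000
n50K <- data.frame(x = rnorm(n), y = rnorm(n))
n10K <- n50K[1 : 10000, ]
n1K <- n50K[1 : 1000, ]
p1 <- ggplot(n1K, aes(x, y)) + geom_point() + coord_equal()
p2 <- ggplot(n10K, aes(x, y)) + geom_point() + coord_equal()
grid.arrange(p1, p2, nrow = 1)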

Some Simple Options

Simple options to address overplotting:

Reducing the point size helps when the number of points is in the low tens of thousands:

ggplot(n10K, aes(x, y)) + geom_point(size = 0.1) + coord_equal()

Alpha blending can also be effective, on its own or in combination with point size adjustment:

ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.05, size = 0.5) + coord_equal()
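
To see why small alpha values are needed for large data sets, it can help to look at how overlapping translucent points accumulate. The following is a minimal sketch, assuming standard "over" compositing of identically colored points; the helper name stacked_opacity is hypothetical. Under that assumption, k overlapping points with opacity alpha combine to an opacity of 1 - (1 - alpha)^k.

```r
# Combined opacity of k overlapping points, each drawn with opacity alpha,
# assuming standard "over" alpha compositing (an assumption of this sketch).
stacked_opacity <- function(k, alpha) 1 - (1 - alpha)^k

k <- c(1, 5, 10, 20, 50, 100)
data.frame(k = k,
           alpha_0.05 = round(stacked_opacity(k, 0.05), 3),
           alpha_0.01 = round(stacked_opacity(k, 0.01), 3))
```

Under this assumption the combined opacity approaches 1 quickly as points pile up, so alpha has to shrink as the number of points grows if dense regions are to remain distinguishable from one another.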

Density Estimation Methods

Some methods based on density estimation or binning:

Density Contours

A 2D density estimate can be displayed in terms of its contours, or level curves.

p <- ggplot(n50K, aes(x, y)) + coord_equal()
pp <- geom_point(alpha = 0.05, size = 0.5)
dd <- geom_density_2d(color = "red")
p + dd
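
As a rough sketch of what lies behind such a plot: ggplot's 2D density stat computes a kernel density estimate on a grid using kde2d from the MASS package, and the contours are level curves of that estimated surface. The example below does the same steps by hand in base R. The sample size, the 50 x 50 grid, and the number of levels are illustrative choices, not the values ggplot uses.

```r
library(MASS)  # for kde2d()

set.seed(1)
d <- data.frame(x = rnorm(5000), y = rnorm(5000))

# Kernel density estimate evaluated on a 50 x 50 grid.
dens <- kde2d(d$x, d$y, n = 50)

# Level curves of the estimated density surface, as a list of polylines.
cl <- contourLines(dens$x, dens$y, dens$z, nlevels = 8)
range(sapply(cl, function(l) l$level))

# Base graphics version of the contour plot.
contour(dens, asp = 1)
```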

2D density estimate contours can be superimposed on a set of points or placed beneath a set of points:

p1 <- p + list(pp, dd)
p2 <- p + list(dd, pp)
grid.arrange(p1, p2, nrow = 1)

Density Levels Encoded with Point Size

Density levels can also be encoded in point size in a grid of points:

p + stat_density_2d(geom = "point", aes(size = after_stat(density)), n = 30, contour = FALSE)

  • This scales well computationally.

  • It introduces some distracting visual artifacts.

  • It does not easily support encoding a grouping with color or shape.

Hexagonal Binning

Hexagonal binning divides the plane into hexagonal bins and displays the number of points in each bin.

The default color scheme seems less than ideal:

p + geom_hex()
## Loading required package: methods

An alternative fill color choice:

p + geom_hex() + scale_fill_gradient(low = "gray", high = "black")

Again it is possible to use a scaled point representation:

p + stat_bin_hex(geom = "point", aes(size = after_stat(count)), fill = NA)

Hexagonal binning produces less visual distraction than a rectangular grid.
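
The bin counts behind these plots can also be inspected directly. The sketch below uses the hexbin package, which geom_hex relies on; the simulated data and the choice of xbins = 30 are assumptions made for illustration.

```r
library(hexbin)

set.seed(1)
x <- rnorm(10000)
y <- rnorm(10000)

# Bin the points into hexagons; xbins sets the number of bins across x.
hb <- hexbin(x, y, xbins = 30)

# One count per non-empty hexagon; together they account for every point.
summary(hb@count)
sum(hb@count) == length(x)

# Cell centers, e.g. for drawing scaled points by hand.
centers <- hcell2xy(hb)
head(data.frame(x = centers$x, y = centers$y, count = hb@count))
```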

Other Density Plots

The hdrcde package computes and plots density contours containing specified proportions of the data.

The hdr.boxplot.2d function plots these contours and shows the points that fall outside the outermost contour:

library(hdrcde)
## This is hdrcde 3.3
with(n50K, hdr.boxplot.2d(x, y, prob = c(0.1, 0.5, 0.75, 0.9)))

It should be possible to select the contour levels used in ggplot in a similar way.
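
One way this might be approached is sketched below, under stated assumptions. The helper hdr_levels is hypothetical and is not part of hdrcde or ggplot2: it estimates the density on a grid with MASS::kde2d, ranks the grid cells from highest to lowest density, and returns the density value at which the accumulated share of mass first reaches each requested proportion.

```r
library(MASS)  # for kde2d()

# Hypothetical helper: density levels whose contours enclose roughly
# the given proportions of the estimated probability mass.
hdr_levels <- function(x, y, prob = c(0.5, 0.9), n = 100) {
    dens <- kde2d(x, y, n = n)
    z <- sort(as.vector(dens$z), decreasing = TRUE)
    # Cumulative share of mass, highest-density grid cells first.
    mass <- cumsum(z) / sum(z)
    sapply(prob, function(p) z[which(mass >= p)[1]])
}

set.seed(1)
x <- rnorm(5000)
y <- rnorm(5000)
lev <- hdr_levels(x, y, prob = c(0.1, 0.5, 0.75, 0.9))
lev
```

The resulting values could then be tried as contour breaks, for example with geom_density_2d(breaks = lev), assuming a ggplot2 version whose contour layers accept a breaks argument. The grid and plotting range used here may differ slightly from the ones ggplot uses internally, so the enclosed proportions should be treated as approximate.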

Some Enhancements

Encoding Additional Variables

Scatter plots can encode information about other variables using

  • symbol color
  • symbol size
  • symbol shape

An example using the mpg data set:

p <- ggplot(mpg, aes(cty, hwy, color = factor(cyl)))
p + geom_point(aes(size = drv))
## Warning: Using size for a discrete variable is not advised.

Some encodings work better than others:

  • Size is not a good fit for a discrete variable

  • Even though cyl is numeric, it is best encoded as categorical.

Using shape instead of size for drv:

p + geom_point(aes(shape = drv))

Increasing the size makes the shapes and colors easier to distinguish:

p + geom_point(aes(shape = drv), size = 3)

Marginal Plots

The ggMarginal function in the ggExtra package attaches marginal histograms to (some) plots produced by ggplot:

library(ggExtra)
p <- ggplot(n50K, aes(x, y)) + geom_point()
ggMarginal(p, type = "histogram", bins = 50)

The default type is "density" for a marginal density plot.

For data sets of more modest size, rug plots along the axes can be useful:

ggplot(faithful, aes(eruptions, waiting)) +
    geom_point() +
    geom_rug(alpha = 0.05)

Adding a Smooth Curve

A useful enhancement is to add a smooth curve to a plot. The default method is a form of local averaging, and includes a representation of uncertainty:

ggplot(faithful, aes(eruptions, waiting)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
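
To make the local averaging and the uncertainty band more concrete, the fit can be reproduced approximately in base R with loess, the function named in the message above. This is a sketch rather than a description of ggplot's exact internals: the 50-point evaluation grid and the 95% level are assumptions chosen for illustration.

```r
# Local regression fit with the default loess settings.
fit <- loess(waiting ~ eruptions, data = faithful)

# Evaluate the fit, with standard errors, on a grid of eruption lengths.
grid <- data.frame(eruptions = seq(min(faithful$eruptions),
                                   max(faithful$eruptions),
                                   length.out = 50))
pred <- predict(fit, newdata = grid, se = TRUE)

# Approximate pointwise 95% confidence band around the fitted curve.
crit <- qt(0.975, pred$df)
band <- data.frame(grid,
                   fit = pred$fit,
                   lower = pred$fit - crit * pred$se.fit,
                   upper = pred$fit + crit * pred$se.fit)
head(band)
```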

Axis Transformation

Variable transformations are often helpful. They can be applied by

  • plotting the transformed data
  • plotting on transformed axes

Axis transformations ggplot supports:

  • log10 with scale_x_log10, scale_y_log10
  • square root with scale_x_sqrt, scale_y_sqrt
  • reversed axes with scale_x_reverse, scale_y_reverse

Base and lattice graphics support some of these as well.
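
The two approaches can be compared side by side. The sketch below uses the diamonds data from ggplot2; the alpha and size values are illustrative. Both versions place the points in the same positions, and the difference is in the axis labels: transformed data are labeled in log10 units, while transformed axes keep labels in the original units.

```r
library(ggplot2)

# Option 1: transform the data; axis labels show log10 values.
p_data <- ggplot(diamonds, aes(log10(carat), log10(price))) +
    geom_point(alpha = 0.05, size = 0.5)

# Option 2: transform the axes; labels stay in the original units.
p_axes <- ggplot(diamonds, aes(carat, price)) +
    geom_point(alpha = 0.05, size = 0.5) +
    scale_x_log10() + scale_y_log10()

# Base graphics counterpart of option 2.
plot(price ~ carat, data = diamonds, log = "xy", pch = ".")
```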

Example: Diamond Prices

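library(ggplot2)  # diamonds data
library(dplyr)    # group_by(), summarize()
library(tidyr)    # gather()

# Median and mean price, and number of diamonds, for each cut.
scd <- summarize(group_by(diamonds, cut),
                 med_price = median(price),
                 avg_price = mean(price),
                 n = length(price))

# Long format: one row per combination of cut and summary statistic.
gscd <- gather(scd, which, price, 2 : 3)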

ggplot(gscd) + geom_point(aes(x = price, y = cut, color = which), size = 2)

ggplot(diamonds) + geom_bar(aes(x = cut))

ggplot(diamonds) + geom_point(aes(x = carat, y = price))

A function to help trying things out:

dex <- function(data = diamonds,
                alpha = 1, size = 1, logs = FALSE,
                logx = logs, logy = logs)
{
    p0 <- ggplot(data, aes(x = carat, y = price))
    p <- p0 + geom_point(alpha = alpha, size = size)
    if (logx)
        p <- p + scale_x_log10()
    if (logy)
        p <- p + scale_y_log10()
    p
}

Some explorations:

library(ggplot2)
library(dplyr)

dex()
dex(alpha = 0.1)
dex(alpha = 0.1, size = 0.1)
dex(alpha = 0.01, size = 0.1)

p <- ggplot(diamonds) + geom_histogram(aes(carat), binwidth=0.01)
p

dF <- filter(diamonds, cut == "Fair")
p + geom_histogram(aes(carat), binwidth=0.01, data = dF, col = "red")

dI <- filter(diamonds, cut == "Ideal")
p + geom_histogram(aes(carat), binwidth=0.01, data = dI, col = "red")

p <- dex(alpha = 0.01, size = 0.5, logs = TRUE)
p

p + geom_point(data = dF, color = "red")
p + geom_point(data = dF, color = "red", size = 1)

p + geom_point(data = dI, color = "red", size = 1)

p + geom_point(data = dI, color = "red", size = 1, alpha = 0.01)

p + geom_smooth(aes(color = cut))

dex(alpha = 0.01, size = 0.5) + geom_smooth(aes(color = cut))

ggplot(diamonds) +
    geom_density(aes(x = carat, fill = cut), bw = .2, alpha = 0.3)

ggplot(filter(diamonds, cut %in% c("Fair", "Ideal"))) +
    geom_density(aes(x = carat, fill = cut), bw = .2, alpha = 0.3)

Exploring a sample:

d500 <- diamonds[sample(nrow(diamonds), 500),]
ggplot(d500, aes(x = cut)) + geom_bar()
dex(data = d500)
dex(data = d500) + geom_point(aes(color = cut))
dex(data = d500) + geom_point(data = filter(d500, cut == "Fair"), color = "red")
dex(data = d500) + geom_point(data = filter(d500, cut == "Ideal"), color = "red")

Explorations using facets:

dex(alpha = 0.01, size = 0.5) + facet_wrap(~cut)

p0 <- ggplot(diamonds, aes(x = carat, y = price))
dNoCut <- mutate(diamonds, cut = NULL)
p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut)
p1
p1 <- p0 + geom_point(alpha = 0.01, size = 1, data = dNoCut)
p1

p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05)
p + facet_wrap(~cut)
p + facet_wrap(~cut) + scale_x_log10() + scale_y_log10()

Facets with a sample:

p500 <- p1 + geom_point(data = d500, color = "red", size = 1)
p500
p500 + facet_wrap(~cut) + scale_x_log10() + scale_y_log10()
p500 + facet_wrap(~cut)
p1 + geom_smooth(aes(color = cut), se = FALSE)

Example: River Flows

river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat")
plot(river)