## Scalability

Scatter plots work well for hundreds of observations

Over-plotting becomes an issue once the number of observations gets into tens of thousands.

For some output formats storage also becomes an issue as the number of points plotted increases.

Some simulated data:

n <- 50000
## n50K <- data.frame(x = rnorm(n), y = rnorm(n))
x <- rnorm(n)
y <- x + 0.4 * x ^ 2 + rnorm(n)
y <- y / sd(y)
n50K <- data.frame(x = x, y = y)

n10K <- n50K[1 : 10000, ]
n1K <- n50K[1 : 1000, ]
p1 <- ggplot(n1K, aes(x, y)) +
geom_point() +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n1K)))
p2 <- ggplot(n10K, aes(x, y)) +
geom_point() +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n10K)))
library(patchwork)
p1 | p2

## Some Simple Options

• sampling

• reducing the point size

• alpha blending

Reducing the point size helps when the number of points is in the low tens of thousands:

ggplot(n10K, aes(x, y)) +
geom_point(size = 0.1) +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n10K)))

Alpha blending can also be effective, on its own or in combination with point size adjustment:

ggplot(n50K, aes(x, y)) +
geom_point(alpha = 0.01, size = 0.5) +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n50K)))

Experimentation is usually needed to identify a good point size and alpha level.

Both alpha blending and point size reduction inhibit the use of color for encoding a grouping variable.

The best choices may vary from one output format to another.

## Density Estimation Methods

Some methods based on density estimation or binning:

• Displaying contours of a 2D density estimate.

• Encoding density estimates in point size.

• Hexagonal binning.

### Density Contours

A 2D density estimate can be displayed in terms of its contours, or level curves.

p <- ggplot(n50K, aes(x, y)) +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n50K)))
pp <- geom_point(alpha = 0.01, size = 0.5)
dd <- geom_density_2d(color = "red")
p + dd

These are the contours of the estimated density surface:

d <- MASS::kde2d(n50K$x, n50K$y, n = 50)
v <- expand.grid(x = d$x, y = d$y)
v$z <- as.numeric(d$z)
lattice::wireframe(z ~ x + y,
data = v,
screen = list(z = 70,
x = -60))

2D density estimate contours can be superimposed on a set of points or placed beneath a set of points:

p1 <- p + list(pp, dd)
p2 <- p + list(dd, pp)
p1 | p2

### Density Levels Encoded with Point Size

Density levels can also be encoded in point size in a grid of points:

p + stat_density_2d(
aes(size = after_stat(density)),
geom = "point",
n = 30,
contour = FALSE) +
scale_size(range = c(0, 6))

This scales well computationally

It does not easily support encoding a grouping with color or shape.

It introduces some distracting visual artifacts that are related to some optical illusions seen earlier.

This effect can be reduced somewhat by jittering:

jit_amt <- 0.03
p + stat_density_2d(
aes(size = after_stat(density)),
geom = "point",
position = position_jitter(jit_amt,
jit_amt),
n = 30, contour = FALSE) +
scale_size(range = c(0, 6))

### Hexagonal Binning

Hexagonal binning divides the plane into hexagonal bins and displays the number of points in each bin:

p + geom_hex()

This also scales very well to larger data sets.

The default color scheme seems less than ideal.

An alternative fill color choice:

p + geom_hex() +
high = "black")

Again it is possible to use a scaled point representation:

p + stat_bin_hex(
geom = "point",
aes(size = after_stat(density)),
fill = NA)

This representation still produces some visual distractions, but less than a rectangular grid.

Again, jittering can help:

jit_amt <- 0.05
p + stat_bin_hex(
aes(size = after_stat(density)),
geom = "point",
position = position_jitter(jit_amt,
jit_amt),
fill = NA)

### Other Density Plots

The hdrcde package computes and plots density contours containing specified proportions of the data.

The hdr.boxplot.2d function plots these contours and shows the points not in the outermost contour:

library(hdrcde)
with(n50K,
hdr.boxplot.2d(x, y,
prob = c(0.1, 0.5, 0.75, 0.9)))

## Some Enhancements

Scatter plots can encode information about other variables using

• symbol color

• symbol size

• symbol shape

An example using the mpg data set:

p <- ggplot(mpg, aes(cty, hwy,
color = factor(cyl)))
p + geom_point(aes(size = drv))
## Warning: Using size for a discrete variable is not advised.

Some encodings work better than others:

• Size is not a good fit for a discrete variable

• Even though cyl is numeric, it is best encoded as categorical.

• Size and color interfere with each other as the color of smaller objects is harder to perceive.

• Similarly, size and shape interfere with each other.

Using shape instead of size for drv:

p + geom_point(aes(shape = drv))

Increasing the size makes shapes and the colors easier to distinguish:

p + geom_point(aes(shape = drv), size = 3)

### Marginal Plots

The ggMargin function in the ggExtra package attaches marginal histograms to (some) plots produced by ggplot:

library(ggExtra)
p <- ggplot(n50K, aes(x, y)) +
geom_point(alpha = 0.01, size = 0.5)
ggMarginal(p,
type = "histogram",
bins = 50,
fill = "lightgrey")

The default type is "density" for a marginal density plot:

ggMarginal(p, fill = "lightgrey")

For data sets of more modest size rug plots along the axes can be useful:

ggplot(faithful,
aes(eruptions, waiting)) +
geom_point() +
geom_rug(alpha = 0.2)

When the variables on the $$y$$ axis is a response variable a useful enhancement is to add a smooth curve to a plot.

The default method is a form of local averaging, and includes a representation of uncertainty:

p1 <- ggplot(faithful, aes(eruptions, waiting)) +
geom_point() +
geom_smooth() +
ggtitle("Old Faithful Eruptions")
p2 <- ggplot(n50K, aes(x, y)) +
pp +
geom_smooth() +
ggtitle(sprintf("%d Points", nrow(n50K)))
p1 | p2

### Axis Transformation

Variable transformations can be applied by

• plotting the transformed data

• plotting on transformed axes

Axis transformations ggplot supports:

• log10 with scale_x_log10, scale_y_log10

• square root with scale_x_sqrt, scale_y_sqrt

• reversed axes with scale_x_reverse, scale_y_reverse

Changing scales can make plot shapes simpler, but non-linear axes are harder to interpret.

library(gapminder)
gap <- filter(gapminder, year == 2007)
p <- ggplot(gap,
aes(x = gdpPercap,
y = lifeExp,
color = continent,
size = pop)) +
geom_point() +
scale_size_area(max_size = 8)
p1 <- p +
guides(color = "none",
size = "none")
p2 <- p +
scale_x_log10() +
guides(size = "none")
p1 + p2

## Conditioning

When the $$y$$ variable is a response it can also be useful to explore the conditional distribution of $$y$$ given different values of the $$x$$ variable.

One way to do this is to focus on the data for narrow ranges of the conditioning variable.

Two approaches for creating groups to focus on:

• Use the base function cut, or one of the gglot2 functions

• cut_interval
• cut_number
• cut_width
• Use the round function.

Rounding to an integer or using

cut_width(x, width = 1, center = 0)

produces these bins:

p <- ggplot(n50K, aes(x, y)) +
geom_point(alpha = 0.05, size = 0.5)
p + geom_vline(xintercept =
seq(-3.5, 3.5, by = 1),
linetype = 2,
color = "red")

Data within each group can then be shown using box plots (you need to use the group aesthetic to plot on a continuous axis):

n50K_trm <- filter(n50K, x > -2.5, x < 2.5)
mutate(n50K_trm, xrnd = round(x)) %>%
ggplot(aes(x, y)) +
geom_boxplot(aes(group = xrnd))

Using cut_width and violin plots:

mutate(n50K_trm,
xcut = cut_width(x,
width = 1,
center = 0)) %>%
group_by(xcut) %>%
mutate(xmed = median(x)) %>%
ungroup() %>%
ggplot(aes(x, y)) +
geom_violin(aes(group = xmed),
draw_quantiles = 0.5)

Another option for visualizing the conditional distributions is faceted density plots:

mutate(n50K_trm, xrnd = round(x)) %>%
ggplot(aes(x)) +
geom_density(aes(y)) +
facet_wrap(~ xrnd, ncol = 1)

Density ridges work as well:

library(ggridges)
mutate(n50K_trm, xrnd = round(x)) %>%
ggplot(aes(y, xrnd, group = xrnd)) +
geom_density_ridges(quantile_lines = TRUE, quantiles = 2)

Using a larger set of narrower bins centered on multiples of 1/2:

mutate(n50K_trm, xrnd = round(2 * x) / 2) %>%
ggplot(aes(y, xrnd, group = xrnd)) +
geom_density_ridges(quantile_lines = TRUE, quantiles = 2)

Sometimes it is useful to use a small number of narrow bins:

rdata <- data.frame(xmin = (-2 : 2) - 0.1,
xmax = (-2 : 2) + 0.1,
ymin = -Inf,
ymax = Inf)
p1 <- p + geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
data = rdata,
inherit.aes = FALSE,
fill = "red",
alpha = 0.2)

p2 <- mutate(n50K_trm, xrnd = round(x)) %>%
filter(abs(x - xrnd) < 0.1) %>%
ggplot(aes(y, xrnd, group = xrnd)) +
geom_density_ridges(quantile_lines = TRUE, quantiles = 2)
p1 | p2

For examining conditional distributions, bins need to be:

• narrow enough to focus on a particular $$x$$ value;

• wide enough to include enough observations for a useful visualization.

For smaller data sets it can also be useful to allow bins to overlap.

• ggplot does not provide support for this.

• lattice does support this with shingles.

The lattice funcitons shingle and equal.count create shingles that can be used in lattice-style faceting.

## Example: Diamond Prices

The diamonds data set contains prices and other attributes for 53,940 diamonds.

The cut variable indicates the quality of a diamond’s cut.

You might expect ‘better’ cuts to cost more, but that is not true on average:

mm <- group_by(diamonds, cut) %>%
summarize(med_price = median(price),
avg_price = mean(price),
n = length(price)) %>%
pivot_longer(2 : 3,
names_to = "which",
values_to = "price")
ggplot(mm, aes(x = price,
y = cut,
color = which)) +
geom_point(, size = 3)

The proportion of diamonds at each cut level is also perhaps surprising:

ggplot(diamonds) +
geom_bar(aes(x = cut),
fill = "deepskyblue3")

And the price distributions within each cut level differ in shape:

ggplot(diamonds, aes(x = price, y = cut)) +
geom_violin() +
geom_point(aes(color = which), data = mm)

Another important factor is the size, measured in carat.

ggplot(diamonds, aes(x = carat, y = cut)) +
geom_violin()

Histograms with a narrow bin width may help understand the unusual shapes.

ggplot(diamonds) +
geom_histogram(aes(carat),
binwidth = 0.01)

Try faceting on cut:

ggplot(diamonds) +
geom_histogram(aes(carat),
binwidth = 0.01) +
facet_wrap(~ cut, ncol = 1)

To focus on the shapes of the conditional distributions of carat given cut use y = after_stat(density):

ggplot(diamonds) +
geom_histogram(
aes(x = carat,
y = after_stat(density)),
binwidth = 0.01) +
facet_wrap(~ cut, ncol = 1)

Alternatively, you can facet with scales = "free_y":

ggplot(diamonds) +
geom_histogram(aes(carat),
binwidth = 0.01) +
facet_wrap(~ cut,
scales = "free_y",
ncol = 1)

Larger diamonds have a higher price:

ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()

There is a lot of over-plotting, so a good opportunity to try some of the techniques we have learned.

Some explorations:

Reduced point size:

p <- ggplot(diamonds,
aes(x = carat, y = price))
p + geom_point(size = 0.1)

Reduced point size and alpha level:

p + geom_point(size = 0.1, alpha = 0.1)

Try log scale for price:

p + geom_point(size = 0.1, alpha = 0.1) +
scale_y_log10()

Log scale for both:

p + geom_point(size = 0.1, alpha = 0.1) +
scale_x_log10() +
scale_y_log10()

p + geom_point(size = 0.1, alpha = 0.1) +
geom_smooth() +
scale_x_log10() +
scale_y_log10()

Separate smooths for each cut:

p + geom_point(size = 0.1, alpha = 0.1) +
geom_smooth(aes(color = cut)) +
scale_x_log10() +
scale_y_log10()

Exploring a sample:

d500 <- sample_n(diamonds, 500)

Distribution of cut in the sample:

ggplot(d500, aes(x = cut)) +
geom_bar()

ggplot(d500, aes(x = carat, y = price)) +
geom_point()

You can also take a stratified sample stratified on cut using group_by, sample_frac, and ungroup.

Explorations using facets:

p0 <- ggplot(diamonds, aes(x = carat, y = price))
dNoCut <- mutate(diamonds, cut = NULL)
p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut)
p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05)
## p + facet_wrap(~ cut)
p + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10()

Facets with a sample:

p500 <- p1 +
geom_point(data = d500,
color = "red",
size = 1)
## p500
## p500 + facet_wrap(~ cut)
p500 +
facet_wrap(~ cut) +
scale_x_log10() +
scale_y_log10()

Muted data with smooths:

p1 +
geom_smooth(aes(color = cut),
se = FALSE) +
scale_x_log10() +
scale_y_log10()

Conditioning, carat values near 1, 1.5, or 2:

drnd <- mutate(diamonds,
crnd = round(2 * carat) / 2)
## Look at a bar chart of count(drnd, crnd)
## Drop higher carat values from density ridges
## map cut to color or fill (need to use group = interaction(crnd, cut))
## try log scale for price
filter(drnd,
crnd <= 2, crnd >= 1,
carat >= crnd, carat <= crnd + 0.1) %>%
ggplot(aes(x = price, y = cut)) +
geom_density_ridges(quantile_lines = TRUE,
quantiles = 2) +
facet_wrap(~crnd, ncol = 1)

## Exercises

1. A plot of arrival delay against departure delay for the NYC flight data shows a lot of over-plotting, even for a 10% sample.

library(dplyr)
library(ggplot2)
library(nycflights13)
fl <- filter(flights, dep_delay < 120) %>%
sample_frac(0.1)
p <- ggplot(fl, aes(x = dep_delay, arr_delay)) +
geom_point()
p
## Warning: Removed 101 rows containing missing values (geom_point).

This masks the fact that three quarters of the flights in this sample have departure delays of less than 10 minutes. Superimposing density contours is one way to address this issue.

Which of the following adds four red density contours to the plot?

1. p + geom_density_2d(color = "red")
2. p + geom_density_2d(bins = 4, color = "red")
3. p + geom_hex(bins = 5)
4. p + geom_density_2d(bins = 5)
2. The diamonds data set is large enough for a scatter plot of price against carat to suffer from a significant amount of over-plotting. With p created as

    library(ggplot2)
p <- ggplot(diamonds, aes(x = carat, y = price))

which of the following is the best choice to address the over-plotting issue?

1. p + geom_point()
2. p + geom_point(size = 0.5)
3. p + geom_point(size = 0.1, alpha = 0.1)
4. p + geom_point(position = "jitter")
3. Consider the code

library(dplyr)
library(ggplot2)
filter(diamonds, carat < 3) %>%
mutate(crnd = round(carat)) %>%
filter(---) %>%
ggplot(aes(x = price, y = crnd, group = crnd)) +
geom_density_ridges()

Which replacement for --- produces a plot showing the conditional density of price given carat for carat values near one and the conditional density of price given carat for carat values near two?

1. abs(carat - crnd) < 0.1
2. carat < 0.1
3. abs(crnd) < 1
4. carat + crnd < 2