Scatter plots work well for hundreds of observations.

Overplotting becomes an issue once the number of observations gets into tens of thousands.

Storage also becomes an issue as the number of points plotted increases.
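
One way to see the storage cost, assuming the concern is the size of vector-format output such as PDF, where each point is written as a separate drawing instruction. This is a rough sketch using base graphics; `pdf_size` is a hypothetical helper, and the exact sizes depend on the device settings:

```r
# Hypothetical helper: write a scatter plot of n points to a temporary
# PDF file and return the file size in bytes.
pdf_size <- function(n) {
    f <- tempfile(fileext = ".pdf")
    on.exit(unlink(f))          # remove the file when the function exits
    pdf(f)
    plot(rnorm(n), rnorm(n))
    dev.off()
    file.size(f)
}

set.seed(123)
ns <- c(1000, 10000, 50000)
sizes <- sapply(ns, pdf_size)
data.frame(n = ns, bytes = sizes)
```

A raster format such as PNG has a size determined mainly by the image dimensions, so it is often the more practical choice for very large point counts.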

```
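library(ggplot2)
library(gridExtra)  # provides grid.arrange()

# Simulate 50,000 points and take nested subsets of 10,000 and 1,000
n <- 50000
n50K <- data.frame(x = rnorm(n), y = rnorm(n))
n10K <- n50K[1 : 10000, ]
n1K <- n50K[1 : 1000, ]
p1 <- ggplot(n1K, aes(x, y)) + geom_point() + coord_equal()
p2 <- ggplot(n10K, aes(x, y)) + geom_point() + coord_equal()
grid.arrange(p1, p2, nrow = 1)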
```

Simple options to address overplotting:

- sampling
- reducing the point size
- alpha blending
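
Sampling is the simplest of these to try. A minimal sketch, assuming simulated data like the example above; the sample size of 2,000 and the names `d` and `d_sub` are illustrative choices:

```r
library(ggplot2)

set.seed(123)  # for a reproducible sample
n <- 50000
d <- data.frame(x = rnorm(n), y = rnorm(n))

# Draw 2,000 rows without replacement and plot only those
d_sub <- d[sample(nrow(d), 2000), ]
p_sub <- ggplot(d_sub, aes(x, y)) + geom_point() + coord_equal()
p_sub
```

A random sample preserves the overall shape of the point cloud but can miss rare features such as small clusters or outliers.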

Reducing the point size helps when the number of points is in the low tens of thousands:

`ggplot(n10K, aes(x, y)) + geom_point(size = 0.1) + coord_equal()`

Alpha blending can also be effective, on its own or in combination with point size adjustment:

`ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.05, size = 0.5) + coord_equal()`

Experimentation is usually needed to identify a good point size and alpha level.
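
One way to organize the experimentation is to draw a small grid of candidate settings side by side. A sketch, assuming simulated data; the particular alpha and size values are arbitrary starting points:

```r
library(ggplot2)
library(gridExtra)

set.seed(123)
d <- data.frame(x = rnorm(50000), y = rnorm(50000))

# Candidate settings to compare: all combinations of two alpha levels
# and two point sizes
settings <- expand.grid(alpha = c(0.05, 0.2), size = c(0.1, 0.5))

plots <- lapply(seq_len(nrow(settings)), function(i) {
    ggplot(d, aes(x, y)) +
        geom_point(alpha = settings$alpha[i], size = settings$size[i]) +
        coord_equal() +
        ggtitle(sprintf("alpha = %g, size = %g",
                        settings$alpha[i], settings$size[i]))
})

grid.arrange(grobs = plots, nrow = 2)
```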

Both alpha blending and point size reduction inhibit the use of color for encoding a grouping variable.

Some methods based on density estimation or binning:

- Displaying contours of a 2D density estimate.
- Encoding density estimates in point size.
- Hexagonal binning.

A 2D density estimate can be displayed in terms of its *contours*, or *level curves*.

```
p <- ggplot(n50K, aes(x, y)) + coord_equal()
pp <- geom_point(alpha = 0.05, size = 0.5)
dd <- geom_density_2d(color = "red")
p + dd
```
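
The estimate behind the contours can also be computed directly. A minimal sketch using `MASS::kde2d`, which the `ggplot2` documentation names as the estimator used by `stat_density_2d`; the grid size `n = 50` and the variable names are illustrative:

```r
library(MASS)

set.seed(123)
x <- rnorm(5000)
y <- rnorm(5000)

# Kernel density estimate evaluated on a 50 x 50 grid spanning the data
k <- kde2d(x, y, n = 50)
dim(k$z)   # one density value per grid point

# Base graphics view of the level curves of the estimate
contour(k$x, k$y, k$z, xlab = "x", ylab = "y")
```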

2D density estimate contours can be superimposed on a set of points or placed beneath a set of points:

```
p1 <- p + list(pp, dd)
p2 <- p + list(dd, pp)
grid.arrange(p1, p2, nrow = 1)
```

Density levels can also be encoded in point size in a grid of points:

`p + stat_density_2d(geom = "point", aes(size = after_stat(density)), n = 30, contour = FALSE)`

This scales well computationally.

It introduces some distracting visual artifacts.

It does not easily support encoding a grouping with color or shape.

Hexagonal binning divides the plane into hexagonal bins and displays the number of points in each bin.
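
The bin counts can be inspected without plotting. A sketch using the `hexbin` package directly, assuming it is installed (`geom_hex` also depends on it); the choice `xbins = 30` is illustrative:

```r
library(hexbin)

set.seed(123)
x <- rnorm(50000)
y <- rnorm(50000)

# Bin into hexagons; xbins is the number of bins across the x range
hb <- hexbin(x, y, xbins = 30)

# Only non-empty cells are stored; the counts add up to the number of points
length(hb@count)
sum(hb@count)

# Centers of the non-empty hexagons, paired with their counts
centers <- hcell2xy(hb)
head(data.frame(x = centers$x, y = centers$y, count = hb@count))
```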

The default color scheme seems less than ideal:

```
p + geom_hex()
## Loading required package: methods
```

An alternative fill color choice:

`p + geom_hex() + scale_fill_gradient(low = "gray", high = "black")`

Again it is possible to use a scaled point representation:

`p + stat_bin_hex(geom = "point", aes(size = after_stat(count)), fill = NA)`

Hexagonal binning produces less visual distraction than a rectangular grid.
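
To judge this, the two binnings can be drawn side by side. A sketch, assuming simulated data and using `geom_bin_2d` for the rectangular grid; the bin count of 30 is an arbitrary choice:

```r
library(ggplot2)
library(gridExtra)

set.seed(123)
d <- data.frame(x = rnorm(50000), y = rnorm(50000))
p <- ggplot(d, aes(x, y)) + coord_equal()
fill <- scale_fill_gradient(low = "gray", high = "black")

# Same data and fill scale; only the bin shape differs
p_rect <- p + geom_bin_2d(bins = 30) + fill
p_hex  <- p + geom_hex(bins = 30) + fill
grid.arrange(p_rect, p_hex, nrow = 1)
```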

The `hdrcde` package computes and plots density contours containing specified proportions of the data.

The `hdr.boxplot.2d` function plots these contours and shows the points not in the outermost contour:

```
library(hdrcde)
## This is hdrcde 3.3
with(n50K, hdr.boxplot.2d(x, y, prob = c(0.1, 0.5, 0.75, 0.9)))
```

It should be possible to select the contour levels used in `ggplot` in a similar way.
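
One possible approach, offered as a sketch rather than as what `hdrcde` does internally: estimate the density on a grid, look up the estimated density at each observation, and use quantiles of those values as contour levels. The region where the density exceeds its `1 - p` quantile then contains roughly a proportion `p` of the points. The names `dens_at_obs` and `lev` and the grid size are illustrative:

```r
library(ggplot2)
library(MASS)

set.seed(123)
d <- data.frame(x = rnorm(5000), y = rnorm(5000))

# Density estimate on a 100 x 100 grid
k <- kde2d(d$x, d$y, n = 100)

# Approximate density at each observation: use the value at the
# lower-left grid node of the cell containing the point
ix <- findInterval(d$x, k$x, all.inside = TRUE)
iy <- findInterval(d$y, k$y, all.inside = TRUE)
dens_at_obs <- k$z[cbind(ix, iy)]

# Levels whose regions contain about 50%, 75% and 90% of the points
probs <- c(0.5, 0.75, 0.9)
lev <- quantile(dens_at_obs, 1 - probs)

# Long-format grid for geom_contour (x varies fastest, matching k$z)
grid <- expand.grid(x = k$x, y = k$y)
grid$z <- as.vector(k$z)

p_hdr <- ggplot(d, aes(x, y)) +
    geom_point(alpha = 0.1, size = 0.5) +
    geom_contour(aes(z = z), data = grid,
                 breaks = sort(unname(lev)), color = "red") +
    coord_equal()
p_hdr
```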

Scatter plots can encode information about other variables using

- symbol color
- symbol size
- symbol shape

An example using the `mpg` data set:

```
p <- ggplot(mpg, aes(cty, hwy, color = factor(cyl)))
p + geom_point(aes(size = drv))
## Warning: Using size for a discrete variable is not advised.
```

Some encodings work better than others:

- Size is not a good fit for a discrete variable.
- Even though `cyl` is numeric, it is best encoded as categorical.

Using `shape` instead of `size` for `drv`:

`p + geom_point(aes(shape = drv))`

Increasing the size makes the shapes and colors easier to distinguish:

`p + geom_point(aes(shape = drv), size = 3)`

The `ggMarginal` function in the `ggExtra` package attaches marginal histograms to (some) plots produced by `ggplot`:

```
library(ggExtra)
p <- ggplot(n50K, aes(x, y)) + geom_point()
ggMarginal(p, type = "histogram", bins = 50)
```

The default type is `"density"` for a marginal density plot.

For data sets of more modest size, rug plots along the axes can be useful:

```
ggplot(faithful, aes(eruptions, waiting)) +
    geom_point() +
    geom_rug(alpha = 0.05)
```

A useful enhancement is to add a smooth curve to a plot. The default method is a form of local averaging, and includes a representation of uncertainty:

```
ggplot(faithful, aes(eruptions, waiting)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
```
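
The smoother can also be specified explicitly. A sketch showing a loess fit with a chosen span next to a straight-line fit without the uncertainty band; the span value `0.3` is an arbitrary illustration:

```r
library(ggplot2)

p <- ggplot(faithful, aes(eruptions, waiting)) + geom_point()

# Explicit loess fit; a smaller span follows the data more closely
p_loess <- p + geom_smooth(method = "loess", formula = y ~ x, span = 0.3)

# A straight-line fit without the confidence band, for comparison
p_lm <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_loess
p_lm
```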

Variable transformations are often helpful. They can be applied by

- plotting the transformed data
- plotting on transformed axes
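
The two approaches give the same point pattern and differ only in how the axes are labeled. A sketch using the `diamonds` data from `ggplot2`; the alpha and size values are illustrative:

```r
library(ggplot2)
library(gridExtra)

# Transform the data: axis labels show log10 values
p_data <- ggplot(diamonds, aes(log10(carat), log10(price))) +
    geom_point(alpha = 0.1, size = 0.5)

# Transform the axes: axis labels stay in the original units
p_axes <- ggplot(diamonds, aes(carat, price)) +
    geom_point(alpha = 0.1, size = 0.5) +
    scale_x_log10() + scale_y_log10()

grid.arrange(p_data, p_axes, nrow = 1)
```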

Axis transformations `ggplot` supports:

- `log10` with `scale_x_log10`, `scale_y_log10`
- square root with `scale_x_sqrt`, `scale_y_sqrt`
- reversed axes with `scale_x_reverse`, `scale_y_reverse`

Base and `lattice` graphics support some of these as well.
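
A sketch of the base and `lattice` counterparts for log and reversed axes, using the `diamonds` data from `ggplot2`; the plotting character `"."` is just a convenient choice for a large data set:

```r
library(ggplot2)   # only for the diamonds data
library(lattice)

# Base graphics: log = "xy" puts both axes on a log scale
plot(price ~ carat, data = diamonds, log = "xy", pch = ".")

# Base graphics: a reversed axis obtained by reversing the axis limits
plot(price ~ carat, data = diamonds, pch = ".",
     xlim = rev(range(diamonds$carat)))

# lattice: log scales are requested through the scales argument
lp <- xyplot(price ~ carat, data = diamonds, pch = ".",
             scales = list(log = 10))
print(lp)
```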

```
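library(dplyr)  # summarize(), group_by()
library(tidyr)  # gather()

# Median and mean price, and group size, for each cut
scd <- summarize(group_by(diamonds, cut),
                 med_price = median(price),
                 avg_price = mean(price),
                 n = length(price))
# Stack the two price summaries (columns 2 and 3) into long form
gscd <- gather(scd, which, price, 2 : 3)
ggplot(gscd) + geom_point(aes(x = price, y = cut, color = which), size = 2)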
```

`ggplot(diamonds) + geom_bar(aes(x = cut))`

`ggplot(diamonds) + geom_point(aes(x = carat, y = price))`

A function to help trying things out:

```
dex <- function(data = diamonds,
                alpha = 1, size = 1, logs = FALSE,
                logx = logs, logy = logs)
{
    p0 <- ggplot(data, aes(x = carat, y = price))
    p <- p0 + geom_point(alpha = alpha, size = size)
    if (logx)
        p <- p + scale_x_log10()
    if (logy)
        p <- p + scale_y_log10()
    p
}
```

Some explorations:

```
library(ggplot2)
library(dplyr)
dex()
dex(alpha = 0.1)
dex(alpha = 0.1, size = 0.1)
dex(alpha = 0.01, size = 0.1)
p <- ggplot(diamonds) + geom_histogram(aes(carat), binwidth=0.01)
p
dF <- filter(diamonds, cut == "Fair")
p + geom_histogram(aes(carat), binwidth=0.01, data = dF, col = "red")
dI <- filter(diamonds, cut == "Ideal")
p + geom_histogram(aes(carat), binwidth=0.01, data = dI, col = "red")
p <- dex(alpha = 0.01, size = 0.5, logs = TRUE)
p
p + geom_point(data = dF, color = "red")
p + geom_point(data = dF, color = "red", size = 1)
p + geom_point(data = dI, color = "red", size = 1)
p + geom_point(data = dI, color = "red", size = 1, alpha = 0.01)
p + geom_smooth(aes(color = cut))
dex(alpha = 0.01, size = 0.5) + geom_smooth(aes(color = cut))
ggplot(diamonds) +
    geom_density(aes(x = carat, fill = cut), bw = .2, alpha = 0.3)
ggplot(filter(diamonds, cut %in% c("Fair", "Ideal"))) +
    geom_density(aes(x = carat, fill = cut), bw = .2, alpha = 0.3)
```

Exploring a sample:

```
d500 <- diamonds[sample(nrow(diamonds), 500),]
ggplot(d500, aes(x = cut)) + geom_bar()
dex(data = d500)
dex(data = d500) + geom_point(aes(color = cut))
dex(data = d500) + geom_point(data = filter(d500, cut == "Fair"), color = "red")
dex(data = d500) + geom_point(data = filter(d500, cut == "Ideal"), color = "red")
```

Explorations using facets:

```
dex(alpha = 0.01, size = 0.5) + facet_wrap(~cut)
p0 <- ggplot(diamonds, aes(x = carat, y = price))
dNoCut <- mutate(diamonds, cut = NULL)
p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut)
p1
p1 <- p0 + geom_point(alpha = 0.01, size = 1, data = dNoCut)
p1
p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05)
p + facet_wrap(~cut)
p + facet_wrap(~cut) + scale_x_log10() + scale_y_log10()
```

Facets with a sample:

```
p500 <- p1 + geom_point(data = d500, color = "red", size = 1)
p500
p500 + facet_wrap(~cut) + scale_x_log10() + scale_y_log10()
p500 + facet_wrap(~cut)
```

`p1 + geom_smooth(aes(color = cut), se = FALSE)`

```
river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat")
plot(river)
```