Scatter Plot Scalability and Enhancements

class: center, middle, title-slide

.title[
# Scatter Plot Scalability and Enhancements
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-03-20
]

---

layout: true

## Scalability
---

Scatter plots work well for hundreds of observations

Over-plotting becomes an issue once the number of observations gets
into tens of thousands.

For some output formats storage also becomes an issue as the number of
points plotted increases.

.pull-left.width-40[
Some simulated data:

.hide-code[

```r
n <- 50000
## n50K <- data.frame(x = rnorm(n), y = rnorm(n))
x <- rnorm(n)
y <- x + 0.4 * x ^ 2 + rnorm(n)
y <- y / sd(y)
n50K <- data.frame(x = x, y = y)

n10K <- n50K[1 : 10000, ]
n1K <- n50K[1 : 1000, ]
p1 <- ggplot(n1K, aes(x, y)) +
    geom_point() +
    coord_equal() +
    ggtitle(sprintf("%d Points", nrow(n1K)))
p2 <- ggplot(n10K, aes(x, y)) +
    geom_point() +
    coord_equal() +
    ggtitle(sprintf("%d Points", nrow(n10K)))
library(patchwork)
p1 | p2
```
]
]

.pull-right.width-60[
<img src="twoetc_files/figure-html/simulated-data-plots-1.png" style="display: block; margin: auto;" />
]

---
layout: true
## Some Simple Options
---

.pull-left[
Simple options to address over-plotting:
{{content}}
]

* sampling
{{content}}
--

* reducing the point size
{{content}}
--

* alpha blending
{{content}}
--

With no adjustments:

.pull-right[
.hide-code[

```r
p2 + xlim(c(-5, 5)) + ylim(c(-3, 10))
```

<img src="twoetc_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Reducing the point size helps when the number of points is in the low
tens of thousands:
]

.pull-right[
.hide-code[

```r
ggplot(n10K, aes(x, y)) +
    geom_point(size = 0.1) +
    coord_equal() +
    xlim(c(-5, 5)) +
    ylim(c(-3, 10)) +
    ggtitle(sprintf("%d Points", nrow(n10K)))
```

<img src="twoetc_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Alpha blending can also be effective, on its own or in combination
with point size adjustment:
]

.pull-right[
.hide-code[

```r
ggplot(n50K, aes(x, y)) +
    geom_point(alpha = 0.01, size = 0.5) +
    coord_equal() +
    xlim(c(-5, 5)) +
    ylim(c(-3, 10)) +
    ggtitle(sprintf("%d Points", nrow(n50K)))
```

<img src="twoetc_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]
]

---

Experimentation is usually needed to identify a good point size and
alpha level.

Both alpha blending and point size reduction inhibit the use of color
for encoding a grouping variable.

The best choices may vary from one output format to another.

---
layout: true
## Density Estimation Methods
---

Some methods based on density estimation or binning:

* Displaying contours of a 2D density estimate.

* Encoding density estimates in point size.

* Hexagonal binning.

---

### Density Contours

.pull-left[
A 2D density estimate can be displayed in terms of its _contours_, or
_level curves_.

.hide-code[

```r
p <- ggplot(n50K, aes(x, y)) +
    coord_equal() +
    ggtitle(sprintf("%d Points", nrow(n50K)))
pp <- geom_point(alpha = 0.01, size = 0.5)
dd <- geom_density_2d(color = "red")
p + dd
```

<img src="twoetc_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
These are the contours of the estimated density surface:

.hide-code[

```r
d <- MASS::kde2d(n50K$x, n50K$y, n = 50)
v <- expand.grid(x = d$x, y = d$y)
v$z <- as.numeric(d$z)
lattice::wireframe(z ~ x + y,
                   data = v,
                   screen = list(z = 70,
                                 x = -60))
```

<img src="twoetc_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]
]

---

2D density estimate contours can be superimposed on a set of points or
placed beneath a set of points:

.hide-code[

```r
p1 <- p + list(pp, dd)
p2 <- p + list(dd, pp)
p1 | p2
```

<img src="twoetc_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
### Density Levels Encoded with Point Size

Density levels can also be encoded in point size in a grid of points:
]

.pull-right[
.hide-code[

```r
p + stat_density_2d(
        aes(size = after_stat(density)),
        geom = "point",
        n = 30,
        contour = FALSE) +
    scale_size(range = c(0, 6))
```

<img src="twoetc_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
This scales well computationally
{{content}}
]
--

It does not easily support encoding a grouping with color or shape.
{{content}}
--

It introduces some distracting visual artifacts that are related to
some optical illusions [seen
earlier](percep.html#some-optical-illusions).
{{content}}
--

This effect can be reduced somewhat by jittering:

.pull-right[
.hide-code[

```r
jit_amt <- 0.03
p + stat_density_2d(
        aes(size = after_stat(density)),
        geom = "point",
        position = position_jitter(jit_amt,
                                   jit_amt),
        n = 30, contour = FALSE) +
    scale_size(range = c(0, 6))
```

<img src="twoetc_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Hexagonal Binning
{{content}}
]
--

Hexagonal binning divides the plane into hexagonal bins and displays
the number of points in each bin:

.pull-right[
.hide-code[

```r
p + geom_hex()
```

<img src="twoetc_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
This also scales very well to larger data sets.
{{content}}
]
--

The default color scheme seems less than ideal.
{{content}}
--

An alternative fill color choice:

.pull-right[
.hide-code[

```r
p + geom_hex() +
    scale_fill_gradient(low = "gray",
                        high = "black")
```

<img src="twoetc_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Again it is possible to use a scaled point representation:
]

.pull-right[
.hide-code[

```r
p + stat_bin_hex(
        geom = "point",
        aes(size = after_stat(density)),
        fill = NA)
```

<img src="twoetc_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
This representation still produces some visual distractions, but less
than a rectangular grid.

Again, jittering can help:
]

.pull-right[
.hide-code[

```r
jit_amt <- 0.05
p + stat_bin_hex(
        aes(size = after_stat(density)),
        geom = "point",
        position = position_jitter(jit_amt,
                                   jit_amt),
        fill = NA)
```

<img src="twoetc_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Other Density Plots

The `hdrcde` package computes and plots density contours containing
specified proportions of the data.
{{content}}
]
--

The `hdr.boxplot.2d` function plots these contours and shows the
points not in the outermost contour:

.pull-right[
.hide-code[

```r
library(hdrcde)
with(n50K,
     hdr.boxplot.2d(x, y,
                    prob = c(0.1, 0.5, 0.75, 0.9)))
```

<img src="twoetc_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

]

---
layout: true
## Some Enhancements
---

.pull-left[
### Encoding Additional Variables
{{content}}
]
--

Scatter plots can encode information about other variables using
{{content}}
--

* symbol color
{{content}}
--

* symbol size
{{content}}
--

* symbol shape
{{content}}
--

An example using the `mpg` data set:

.pull-right[
.hide-code[

```r
p <- ggplot(mpg, aes(cty, hwy,
                     color = factor(cyl)))
p + geom_point(aes(size = drv))
## Warning: Using size for a discrete variable is not advised.
```

<img src="twoetc_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]
]

---

Some encodings work better than others:

* Size is not a good fit for a discrete variable

* Even though `cyl` is numeric, it is best encoded as categorical.

* Size and color interfere with each other as the color of smaller
  objects is harder to perceive.

* Similarly, size and shape interfere with each other.

---

.pull-left[
Using `shape` instead of `size` for `drv`:
]

.pull-right[
.hide-code[

```r
p + geom_point(aes(shape = drv))
```

<img src="twoetc_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Increasing the size makes shapes and the colors easier to distinguish:
]

.pull-right[
.hide-code[

```r
p + geom_point(aes(shape = drv), size = 3)
```

<img src="twoetc_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Marginal Plots

The `ggMargin` function in the `ggExtra` package attaches marginal
histograms to (some) plots produced by `ggplot`:
]

.pull-right[
.hide-code[

```r
library(ggExtra)
p <- ggplot(n50K, aes(x, y)) +
    geom_point(alpha = 0.01, size = 0.5)
ggMarginal(p,
           type = "histogram",
           bins = 50,
           fill = "lightgrey")
```

<img src="twoetc_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
The default type is `"density"` for a marginal density plot:
]

.pull-right[
.hide-code[

```r
ggMarginal(p, fill = "lightgrey")
```

<img src="twoetc_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
For data sets of more modest size rug plots along the axes can be
useful:
]

.pull-right[
.hide-code[

```r
ggplot(faithful,
       aes(eruptions, waiting)) +
    geom_point() +
    geom_rug(alpha = 0.2)
```

<img src="twoetc_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]
]

---

### Adding a Smooth Curve

When the variables on the `$y$` axis is a response variable a useful
enhancement is to add a smooth curve to a plot.

The default method is a form of local averaging, and includes a
representation of uncertainty:

.hide-code[

```r
p1 <- ggplot(faithful, aes(eruptions, waiting)) +
    geom_point() +
    geom_smooth() +
    ggtitle("Old Faithful Eruptions")
p2 <- ggplot(n50K, aes(x, y)) +
    pp +
    geom_smooth() +
    ggtitle(sprintf("%d Points", nrow(n50K)))
p1 | p2
```

<img src="twoetc_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]

---

### Axis Transformation

Variable transformations are often helpful.

Variable transformations can be applied by

* plotting the transformed data

* plotting on transformed axes

Axis transformations `ggplot` supports:

* `log10` with `scale_x_log10`, scale_y_log10

* square root with `scale_x_sqrt`, `scale_y_sqrt`

* reversed axes with `scale_x_reverse`, `scale_y_reverse`

---

Changing scales can make plot shapes simpler, but non-linear axes are
harder to interpret.

.hide-code[

```r
library(gapminder)
gap <- filter(gapminder, year == 2007)
p <- ggplot(gap,
            aes(x = gdpPercap,
                y = lifeExp,
                color = continent,
                size = pop)) +
    geom_point() +
    scale_size_area(max_size = 8)
p1 <- p +
    guides(color = "none",
           size = "none")
p2 <- p +
    scale_x_log10() +
    guides(size = "none")
p1 + p2
```

<img src="twoetc_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]

---
layout: true
## Conditioning
---

When the `$y$` variable is a response it can also be useful to explore
the conditional distribution of `$y$` given different values of the `$x$`
variable.

One way to do this is to focus on the data for narrow ranges of the
conditioning variable.

Two approaches for creating groups to focus on:

* Use the `base` function `cut`, or one of the `gglot2` functions

* `cut_interval`
    * `cut_number`
    * `cut_width`

* Use the `round` function.

---

.pull-left[
Rounding to an integer or using

```r
cut_width(x, width = 1, center = 0)
```

produces these bins:
]

.pull-right[
.hide-code[

```r
p <- ggplot(n50K, aes(x, y)) +
    geom_point(alpha = 0.05, size = 0.5)
p + geom_vline(xintercept =
                   seq(-3.5, 4.5, by = 1),
               linetype = 2,
               color = "red") +
    scale_x_continuous(breaks = seq(-3, 4),
                       labels = as.character(seq(-3, 4)))
```

<img src="twoetc_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Data within each group can then be shown using box plots (you need to
use the `group` aesthetic to plot on a continuous axis):
]

.pull-right[
.hide-code[

```r
n50K_trm <- filter(n50K, x > -2.5, x < 2.5)
mutate(n50K_trm, xrnd = round(x)) |>
    ggplot(aes(x, y)) +
    geom_boxplot(aes(group = xrnd))
```

<img src="twoetc_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Using `cut_width` and violin plots:
]

.pull-right[
.hide-code[

```r
mutate(n50K_trm,
       xcut = cut_width(x,
                        width = 1,
                        center = 0)) |>
    group_by(xcut) |>
    mutate(xmed = median(x)) |>
    ungroup() |>
    ggplot(aes(x, y)) +
    geom_violin(aes(group = xmed),
                draw_quantiles = 0.5)
```

<img src="twoetc_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Another option for visualizing the conditional distributions is
faceted density plots:
]

.pull-right[
.hide-code[

```r
mutate(n50K_trm, xrnd = round(x)) |>
    ggplot(aes(x)) +
    geom_density(aes(y)) +
    facet_wrap(~ xrnd, ncol = 1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Density ridges work as well:
]

.pull-right[
.hide-code[

```r
library(ggridges)
mutate(n50K_trm, xrnd = round(x)) |>
    ggplot(aes(y, xrnd, group = xrnd)) +
    geom_density_ridges(quantile_lines = TRUE,
                        quantiles = 2)
```

<img src="twoetc_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Using a larger set of narrower bins centered on multiples of 1/2:
]

.pull-right[
.hide-code[

```r
mutate(n50K_trm, xrnd = round(2 * x) / 2) |>
    ggplot(aes(y, xrnd, group = xrnd)) +
    geom_density_ridges(quantile_lines = TRUE,
                        quantiles = 2)
```

<img src="twoetc_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
]
]

---

Sometimes it is useful to use a small number of narrow bins:

.hide-code[

```r
rdata <- data.frame(xmin = (-2 : 2) - 0.1,
                    xmax = (-2 : 2) + 0.1,
                    ymin = -Inf,
                    ymax = Inf)
p1 <- p + geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
                    data = rdata,
                    inherit.aes = FALSE,
                    fill = "red",
                    alpha = 0.2)

p2 <- mutate(n50K_trm, xrnd = round(x)) |>
    filter(abs(x - xrnd) < 0.1) |>
    ggplot(aes(y, xrnd, group = xrnd)) +
    geom_density_ridges(quantile_lines = TRUE, quantiles = 2)
p1 | p2
```

<img src="twoetc_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
]

---

For examining conditional distributions, bins need to be:

* narrow enough to focus on a particular `$x$` value;

* wide enough to include enough observations for a useful visualization.

For smaller data sets it can also be useful to allow bins to overlap.

* `ggplot` does not provide support for this.

* `lattice` does support this with _shingles_.

The lattice functions `shingle` and `equal.count` create shingles that
can be used in `lattice`-style faceting.

---
layout: true
## Example: Diamond Prices
---

.pull-left[
The `diamonds` data set contains prices and other attributes for
53,940 diamonds.
{{content}}
]
--

The `cut` variable indicates the _quality_ of a diamond's cut.
{{content}}
--

You might expect 'better' cuts to cost more, but that is not true on
average:

.pull-right[
.hide-code[

```r
mm <- group_by(diamonds, cut) |>
    summarize(med_price = median(price),
              avg_price = mean(price),
              n = length(price)) |>
    pivot_longer(2 : 3,
                 names_to = "which",
                 values_to = "price")
ggplot(mm, aes(x = price,
               y = cut,
               color = which)) +
    geom_point(, size = 3)
```

<img src="twoetc_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
The proportion of diamonds at each `cut` level is also perhaps surprising:
]

.pull-right[
.hide-code[

```r
ggplot(diamonds) +
    geom_bar(aes(x = cut),
             fill = "deepskyblue3")
```

<img src="twoetc_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
And the price distributions within each `cut` level differ in shape:
]

.pull-right[
.hide-code[

```r
ggplot(diamonds, aes(x = price, y = cut)) +
    geom_violin() +
    geom_point(aes(color = which), data = mm)
```

<img src="twoetc_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Another important factor is the size, measured in `carat`.
]

.pull-right[
.hide-code[

```r
ggplot(diamonds, aes(x = carat, y = cut)) +
    geom_violin()
```

<img src="twoetc_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Histograms with a narrow bin width may help understand the unusual shapes.
]

.pull-right[
.hide-code[

```r
ggplot(diamonds) +
    geom_histogram(aes(carat),
                   binwidth = 0.01)
```

<img src="twoetc_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Try faceting on `cut`:
]

.pull-right[
.hide-code[

```r
ggplot(diamonds) +
    geom_histogram(aes(carat),
                   binwidth = 0.01) +
    facet_wrap(~ cut, ncol = 1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
To focus on the shapes of the conditional distributions of carat given
cut use `y = after_stat(density)`:
]

.pull-right[
.hide-code[

```r
ggplot(diamonds) +
    geom_histogram(
        aes(x = carat,
            y = after_stat(density)),
        binwidth = 0.01) +
    facet_wrap(~ cut, ncol = 1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Alternatively, you can facet with `scales = "free_y"`:
]

.pull-right[
.hide-code[

```r
ggplot(diamonds) +
    geom_histogram(aes(carat),
                   binwidth = 0.01) +
    facet_wrap(~ cut,
               scales = "free_y",
               ncol = 1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Larger diamonds have a higher price:

.hide-code[

```r
ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point()
```

<img src="twoetc_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
There is a lot of over-plotting, so a good opportunity to try some of the
techniques we have learned.
]

---

Some explorations:

.pull-left[
Reduced point size:
.hide-code[

```r
p <- ggplot(diamonds,
            aes(x = carat, y = price))
p + geom_point(size = 0.1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />
]
]

--
.pull-right[
Reduced point size and alpha level:
.hide-code[

```r
p + geom_point(size = 0.1, alpha = 0.1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Try log scale for price:
.hide-code[

```r
p + geom_point(size = 0.1, alpha = 0.1) +
    scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
Log scale for both:
.hide-code[

```r
p + geom_point(size = 0.1, alpha = 0.1) +
    scale_x_log10() +
    scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Add a smooth:
.hide-code[

```r
p + geom_point(size = 0.1, alpha = 0.1) +
    geom_smooth() +
    scale_x_log10() +
    scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
Separate smooths for each cut:
.hide-code[

```r
p + geom_point(size = 0.1, alpha = 0.1) +
    geom_smooth(aes(color = cut)) +
    scale_x_log10() +
    scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Exploring a sample:

```r
d500 <- sample_n(diamonds, 500)
```
Distribution of `cut` in the sample:

.hide-code[

```r
ggplot(d500, aes(x = cut)) +
    geom_bar()
```

<img src="twoetc_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
.hide-code[

```r
ggplot(d500, aes(x = carat, y = price)) +
    geom_point()
```

<img src="twoetc_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />
]
{{content}}
]
--

You can also take a _stratified sample_ stratified on cut using
`group_by`, `sample_frac`, and `ungroup`.

---

.pull-left.width-30[
Explorations using facets:
]

.pull-right.width-65[
.hide-code[

```r
p0 <- ggplot(diamonds, aes(x = carat, y = price))
dNoCut <- mutate(diamonds, cut = NULL)
p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut)
p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05)
## p + facet_wrap(~ cut)
p + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left.width-30[
Facets with a sample:
]

.pull-right.width-65[
.hide-code[

```r
p500 <- p1 +
    geom_point(data = d500,
               color = "red",
               size = 1)
## p500
## p500 + facet_wrap(~ cut)
p500 +
    facet_wrap(~ cut) +
    scale_x_log10() +
    scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left.width-30[
Muted data with smooths:
]

.pull-right.width-65[
.hide-code[

```r
p1 +
    geom_smooth(aes(color = cut),
                se = FALSE) +
    scale_x_log10() +
    scale_y_log10()
```

<img src="twoetc_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Conditioning, `carat` values near 1, 1.5, or 2:
]

.pull-right[
.hide-code[

```r
drnd <- mutate(diamonds,
               crnd = round(2 * carat) / 2)
## Look at a bar chart of count(drnd, crnd)
## Drop higher carat values from density ridges
## map cut to color or fill (need to use group = interaction(crnd, cut))
## try log scale for price
filter(drnd,
       crnd <= 2, crnd >= 1,
       carat >= crnd, carat <= crnd + 0.1) |>
    ggplot(aes(x = price, y = cut)) +
    geom_density_ridges(quantile_lines = TRUE,
                        quantiles = 2) +
    facet_wrap(~crnd, ncol = 1)
```

<img src="twoetc_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />
]
]

<!--
Focus on carat values near one:

```r
filter(diamonds, carat >= 0.95, carat < 1.05) |>
    ggplot(aes(x = cut, y = price)) +
    geom_violin(scale = "count", draw_quantiles = 0.5)
## Try density ridges
```
-->

---
layout: false
## Reading

Chapter [_Visualizing associations among two or more quantitative
  variables_](https://clauswilke.com/dataviz/visualizing-associations.html)
  in [_Fundamentals of Data
  Visualization_](https://clauswilke.com/dataviz/).

---
layout: true
## Exercises
---

1) A plot of arrival delay against departure delay for the NYC flight
   data shows a lot of over-plotting, even for a 10% sample.

.pull-left[
.hide-code[

```r
library(dplyr)
library(ggplot2)
library(nycflights13)
fl <- filter(flights, dep_delay < 120) |>
    sample_frac(0.1)
p <- ggplot(fl, aes(x = dep_delay, arr_delay)) +
    geom_point()
p
## Warning: Removed 101 rows containing missing values or values outside the scale range
## (`geom_point()`).
```

<img src="twoetc_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" />
]
]

.pull-right[
This masks the fact that three quarters of the flights in this
sample have departure delays of less than 10
minutes. Superimposing density contours is one way to address this
issue.

Which of the following adds four red density contours to the plot?

*   a. `p + geom_density_2d(color = "red")`
*   b. `p + geom_density_2d(bins = 4, color = "red")`
*   c. `p + geom_hex(bins = 5)`
*   d. `p + geom_density_2d(bins = 5)`
]

---

2) The `diamonds` data set is large enough for a scatter plot of
`price` against `carat` to suffer from a significant amount of
over-plotting.  With `p` created as

```r
library(ggplot2)
p <- ggplot(diamonds, aes(x = carat, y = price))
```

which of the following is the best choice to address the
over-plotting issue?

*   a. `p + geom_point()`
*   b. `p + geom_point(size = 0.5)`
*   c. `p + geom_point(size = 0.1, alpha = 0.1)`
*   d. `p + geom_point(position = "jitter")`

---

3)  Consider the code

```r
library(dplyr)
library(ggplot2)
filter(diamonds, carat < 3) |>
    mutate(crnd = round(carat)) |>
    filter(---) |>
    ggplot(aes(x = price, y = crnd, group = crnd)) +
    geom_density_ridges()
```

Which replacement for `---` produces a plot showing the
conditional density of `price` given `carat` for `carat` values
near one and the conditional density of `price` given `carat` for
`carat` values near two?

*   a. `abs(carat - crnd) < 0.1`
*   b. `carat < 0.1`
*   c. `abs(crnd) < 1`
*   d. `carat + crnd < 2`

//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505