class: center, middle, title-slide .title[ # Scatter Plot Scalability and Enhancements ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-06 ] --- layout: true <link rel="stylesheet" href="stat4580.css" type="text/css" /> <style type="text/css"> .remark-code { font-size: 85%; } </style> <!-- title based on Wilke's chapter --> ## Scalability --- Scatter plots work well for hundreds of observations -- Over-plotting becomes an issue once the number of observations gets into tens of thousands. -- For some output formats storage also becomes an issue as the number of points plotted increases. -- Some simulated data: .hide-code[ ```r n <- 50000 ## n50K <- data.frame(x = rnorm(n), y = rnorm(n)) x <- rnorm(n) y <- x + 0.4 * x ^ 2 + rnorm(n) y <- y / sd(y) n50K <- data.frame(x = x, y = y) n10K <- n50K[1 : 10000, ] n1K <- n50K[1 : 1000, ] p1 <- ggplot(n1K, aes(x, y)) + geom_point() + coord_equal() + ggtitle(sprintf("%d Points", nrow(n1K))) p2 <- ggplot(n10K, aes(x, y)) + geom_point() + coord_equal() + ggtitle(sprintf("%d Points", nrow(n10K))) library(patchwork) p1 | p2 ``` <img src="twoetc_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Some Simple Options --- .pull-left[ Simple options to address over-plotting: {{content}} ] -- * sampling {{content}} -- * reducing the point size {{content}} -- * alpha blending {{content}} -- Reducing the point size helps when the number of points is in the low tens of thousands: -- .pull-right[ .hide-code[ ```r ggplot(n10K, aes(x, y)) + geom_point(size = 0.1) + coord_equal() + ggtitle(sprintf("%d Points", nrow(n10K))) ``` <img src="twoetc_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Alpha blending can also be effective, on its own or in combination with point size adjustment: .hide-code[ ```r ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.01, size = 0.5) + coord_equal() + ggtitle(sprintf("%d Points", nrow(n50K))) ``` <img src="twoetc_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ Experimentation is usually needed to identify a good point size and alpha level. {{content}} ] -- Both alpha blending and point size reduction inhibit the use of color for encoding a grouping variable. {{content}} -- The best choices may vary from one output format to another. --- layout: true ## Density Estimation Methods --- Some methods based on density estimation or binning: -- * Displaying contours of a 2D density estimate. -- * Encoding density estimates in point size. -- * Hexagonal binning. --- ### Density Contours .pull-left[ A 2D density estimate can be displayed in terms of its _contours_, or _level curves_. .hide-code[ ```r p <- ggplot(n50K, aes(x, y)) + coord_equal() + ggtitle(sprintf("%d Points", nrow(n50K))) pp <- geom_point(alpha = 0.01, size = 0.5) dd <- geom_density_2d(color = "red") p + dd ``` <img src="twoetc_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ These are the contours of the estimated density surface: .hide-code[ ```r d <- MASS::kde2d(n50K$x, n50K$y, n = 50) v <- expand.grid(x = d$x, y = d$y) v$z <- as.numeric(d$z) lattice::wireframe(z ~ x + y, data = v, screen = list(z = 70, x = -60)) ``` <img src="twoetc_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> ] ] --- 2D density estimate contours can be superimposed on a set of points or placed beneath a set of points: .hide-code[ ```r p1 <- p + list(pp, dd) p2 <- p + list(dd, pp) p1 | p2 ``` <img src="twoetc_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ ### Density Levels Encoded with Point Size Density levels can also be encoded in point size in a grid of points: ] .pull-right[ .hide-code[ ```r p + stat_density_2d( aes(size = after_stat(density)), geom = "point", n = 30, contour = FALSE) + scale_size(range = c(0, 6)) ``` <img src="twoetc_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ This scales well computationally {{content}} ] -- It does not easily support encoding a grouping with color or shape. {{content}} -- It introduces some distracting visual artifacts that are related to some optical illusions [seen earlier](percep.html#some-optical-illusions). {{content}} -- This effect can be reduced somewhat by jittering: -- .pull-right[ .hide-code[ ```r jit_amt <- 0.03 p + stat_density_2d( aes(size = after_stat(density)), geom = "point", position = position_jitter(jit_amt, jit_amt), n = 30, contour = FALSE) + scale_size(range = c(0, 6)) ``` <img src="twoetc_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Hexagonal Binning {{content}} ] -- Hexagonal binning divides the plane into hexagonal bins and displays the number of points in each bin: -- .pull-right[ .hide-code[ ```r p + geom_hex() ``` <img src="twoetc_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ This also scales very well to larger data sets. {{content}} ] -- The default color scheme seems less than ideal. {{content}} -- An alternative fill color choice: -- .pull-right[ .hide-code[ ```r p + geom_hex() + scale_fill_gradient(low = "gray", high = "black") ``` <img src="twoetc_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Again it is possible to use a scaled point representation: ] .pull-right[ .hide-code[ ```r p + stat_bin_hex( geom = "point", aes(size = after_stat(density)), fill = NA) ``` <img src="twoetc_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ This representation still produces some visual distractions, but less than a rectangular grid. Again, jittering can help: ] .pull-right[ .hide-code[ ```r jit_amt <- 0.05 p + stat_bin_hex( aes(size = after_stat(density)), geom = "point", position = position_jitter(jit_amt, jit_amt), fill = NA) ``` <img src="twoetc_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Other Density Plots The `hdrcde` package computes and plots density contours containing specified proportions of the data. {{content}} ] -- The `hdr.boxplot.2d` function plots these contours and shows the points not in the outermost contour: -- .pull-right[ .hide-code[ ```r library(hdrcde) with(n50K, hdr.boxplot.2d(x, y, prob = c(0.1, 0.5, 0.75, 0.9))) ``` <img src="twoetc_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] <!-- It should be possible to select the contour levels used in `ggplot` in a similar way. --> ] --- layout: true ## Some Enhancements --- .pull-left[ ### Encoding Additional Variables {{content}} ] -- Scatter plots can encode information about other variables using {{content}} -- * symbol color {{content}} -- * symbol size {{content}} -- * symbol shape {{content}} -- An example using the `mpg` data set: -- .pull-right[ .hide-code[ ```r p <- ggplot(mpg, aes(cty, hwy, color = factor(cyl))) p + geom_point(aes(size = drv)) ## Warning: Using size for a discrete variable is not advised. ``` <img src="twoetc_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] ] --- Some encodings work better than others: -- * Size is not a good fit for a discrete variable -- * Even though `cyl` is numeric, it is best encoded as categorical. -- * Size and color interfere with each other as the color of smaller objects is harder to perceive. -- * Similarly, size and shape interfere with each other. --- .pull-left[ Using `shape` instead of `size` for `drv`: ] .pull-right[ .hide-code[ ```r p + geom_point(aes(shape = drv)) ``` <img src="twoetc_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Increasing the size makes shapes and the colors easier to distinguish: ] .pull-right[ .hide-code[ ```r p + geom_point(aes(shape = drv), size = 3) ``` <img src="twoetc_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Marginal Plots The `ggMargin` function in the `ggExtra` package attaches marginal histograms to (some) plots produced by `ggplot`: ] .pull-right[ .hide-code[ ```r library(ggExtra) p <- ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.01, size = 0.5) ggMarginal(p, type = "histogram", bins = 50, fill = "lightgrey") ``` <img src="twoetc_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ The default type is `"density"` for a marginal density plot: ] .pull-right[ .hide-code[ ```r ggMarginal(p, fill = "lightgrey") ``` <img src="twoetc_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ For data sets of more modest size rug plots along the axes can be useful: ] .pull-right[ .hide-code[ ```r ggplot(faithful, aes(eruptions, waiting)) + geom_point() + geom_rug(alpha = 0.2) ``` <img src="twoetc_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ] ] --- ### Adding a Smooth Curve When the variables on the `\(y\)` axis is a response variable a useful enhancement is to add a smooth curve to a plot. -- The default method is a form of local averaging, and includes a representation of uncertainty: -- .hide-code[ ```r p1 <- ggplot(faithful, aes(eruptions, waiting)) + geom_point() + geom_smooth() + ggtitle("Old Faithful Eruptions") p2 <- ggplot(n50K, aes(x, y)) + pp + geom_smooth() + ggtitle(sprintf("%d Points", nrow(n50K))) p1 | p2 ``` <img src="twoetc_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> ] --- ### Axis Transformation Variable transformations are often helpful. -- Variable transformations can be applied by -- * plotting the transformed data -- * plotting on transformed axes -- Axis transformations `ggplot` supports: -- * `log10` with `scale_x_log10`, scale_y_log10 -- * square root with `scale_x_sqrt`, `scale_y_sqrt` -- * reversed axes with `scale_x_reverse`, `scale_y_reverse` --- Changing scales can make plot shapes simpler, but non-linear axes are harder to interpret. .hide-code[ ```r library(gapminder) gap <- filter(gapminder, year == 2007) p <- ggplot(gap, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + geom_point() + scale_size_area(max_size = 8) p1 <- p + guides(color = "none", size = "none") p2 <- p + scale_x_log10() + guides(size = "none") p1 + p2 ``` <img src="twoetc_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Conditioning --- When the `\(y\)` variable is a response it can also be useful to explore the conditional distribution of `\(y\)` given different values of the `\(x\)` variable. -- One way to do this is to focus on the data for narrow ranges of the conditioning variable. -- Two approaches for creating groups to focus on: -- * Use the `base` function `cut`, or one of the `gglot2` functions * `cut_interval` * `cut_number` * `cut_width` -- * Use the `round` function. --- .pull-left[ Rounding to an integer or using ```r cut_width(x, width = 1, center = 0) ``` produces these bins: ] .pull-right[ .hide-code[ ```r p <- ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.05, size = 0.5) p + geom_vline(xintercept = seq(-3.5, 3.5, by = 1), linetype = 2, color = "red") ``` <img src="twoetc_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Data within each group can then be shown using box plots (you need to use the `group` aesthetic to plot on a continuous axis): ] .pull-right[ .hide-code[ ```r n50K_trm <- filter(n50K, x > -2.5, x < 2.5) mutate(n50K_trm, xrnd = round(x)) %>% ggplot(aes(x, y)) + geom_boxplot(aes(group = xrnd)) ``` <img src="twoetc_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Using `cut_width` and violin plots: ] .pull-right[ .hide-code[ ```r mutate(n50K_trm, xcut = cut_width(x, width = 1, center = 0)) %>% group_by(xcut) %>% mutate(xmed = median(x)) %>% ungroup() %>% ggplot(aes(x, y)) + geom_violin(aes(group = xmed), draw_quantiles = 0.5) ``` <img src="twoetc_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Another option for visualizing the conditional distributions is faceted density plots: ] .pull-right[ .hide-code[ ```r mutate(n50K_trm, xrnd = round(x)) %>% ggplot(aes(x)) + geom_density(aes(y)) + facet_wrap(~ xrnd, ncol = 1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Density ridges work as well: ] .pull-right[ .hide-code[ ```r library(ggridges) mutate(n50K_trm, xrnd = round(x)) %>% ggplot(aes(y, xrnd, group = xrnd)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) ``` <img src="twoetc_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Using a larger set of narrower bins centered on multiples of 1/2: ] .pull-right[ .hide-code[ ```r mutate(n50K_trm, xrnd = round(2 * x) / 2) %>% ggplot(aes(y, xrnd, group = xrnd)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) ``` <img src="twoetc_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> ] ] --- Sometimes it is useful to use a small number of narrow bins: .hide-code[ ```r rdata <- data.frame(xmin = (-2 : 2) - 0.1, xmax = (-2 : 2) + 0.1, ymin = -Inf, ymax = Inf) p1 <- p + geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), data = rdata, inherit.aes = FALSE, fill = "red", alpha = 0.2) p2 <- mutate(n50K_trm, xrnd = round(x)) %>% filter(abs(x - xrnd) < 0.1) %>% ggplot(aes(y, xrnd, group = xrnd)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) p1 | p2 ``` <img src="twoetc_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> ] --- For examining conditional distributions, bins need to be: -- * narrow enough to focus on a particular `\(x\)` value; -- * wide enough to include enough observations for a useful visualization. -- For smaller data sets it can also be useful to allow bins to overlap. -- * `ggplot` does not provide support for this. -- * `lattice` does support this with _shingles_. -- The lattice functions `shingle` and `equal.count` create shingles that can be used in `lattice`-style faceting. --- layout: true ## Example: Diamond Prices --- .pull-left[ The `diamonds` data set contains prices and other attributes for 53,940 diamonds. {{content}} ] -- The `cut` variable indicates the _quality_ of a diamond's cut. {{content}} -- You might expect 'better' cuts to cost more, but that is not true on average: -- .pull-right[ .hide-code[ ```r mm <- group_by(diamonds, cut) %>% summarize(med_price = median(price), avg_price = mean(price), n = length(price)) %>% pivot_longer(2 : 3, names_to = "which", values_to = "price") ggplot(mm, aes(x = price, y = cut, color = which)) + geom_point(, size = 3) ``` <img src="twoetc_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ The proportion of diamonds at each `cut` level is also perhaps surprising: ] .pull-right[ .hide-code[ ```r ggplot(diamonds) + geom_bar(aes(x = cut), fill = "deepskyblue3") ``` <img src="twoetc_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ And the price distributions within each `cut` level differ in shape: ] .pull-right[ .hide-code[ ```r ggplot(diamonds, aes(x = price, y = cut)) + geom_violin() + geom_point(aes(color = which), data = mm) ``` <img src="twoetc_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Another important factor is the size, measured in `carat`. ] .pull-right[ .hide-code[ ```r ggplot(diamonds, aes(x = carat, y = cut)) + geom_violin() ``` <img src="twoetc_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Histograms with a narrow bin width may help understand the unusual shapes. ] .pull-right[ .hide-code[ ```r ggplot(diamonds) + geom_histogram(aes(carat), binwidth = 0.01) ``` <img src="twoetc_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Try faceting on `cut`: ] .pull-right[ .hide-code[ ```r ggplot(diamonds) + geom_histogram(aes(carat), binwidth = 0.01) + facet_wrap(~ cut, ncol = 1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ To focus on the shapes of the conditional distributions of carat given cut use `y = after_stat(density)`: ] .pull-right[ .hide-code[ ```r ggplot(diamonds) + geom_histogram( aes(x = carat, y = after_stat(density)), binwidth = 0.01) + facet_wrap(~ cut, ncol = 1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Alternatively, you can facet with `scales = "free_y"`: ] .pull-right[ .hide-code[ ```r ggplot(diamonds) + geom_histogram(aes(carat), binwidth = 0.01) + facet_wrap(~ cut, scales = "free_y", ncol = 1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Larger diamonds have a higher price: .hide-code[ ```r ggplot(diamonds, aes(x = carat, y = price)) + geom_point() ``` <img src="twoetc_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ There is a lot of over-plotting, so a good opportunity to try some of the techniques we have learned. ] --- Some explorations: .pull-left[ Reduced point size: .hide-code[ ```r p <- ggplot(diamonds, aes(x = carat, y = price)) p + geom_point(size = 0.1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ Reduced point size and alpha level: .hide-code[ ```r p + geom_point(size = 0.1, alpha = 0.1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Try log scale for price: .hide-code[ ```r p + geom_point(size = 0.1, alpha = 0.1) + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ Log scale for both: .hide-code[ ```r p + geom_point(size = 0.1, alpha = 0.1) + scale_x_log10() + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Add a smooth: .hide-code[ ```r p + geom_point(size = 0.1, alpha = 0.1) + geom_smooth() + scale_x_log10() + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ Separate smooths for each cut: .hide-code[ ```r p + geom_point(size = 0.1, alpha = 0.1) + geom_smooth(aes(color = cut)) + scale_x_log10() + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Exploring a sample: ```r d500 <- sample_n(diamonds, 500) ``` Distribution of `cut` in the sample: .hide-code[ ```r ggplot(d500, aes(x = cut)) + geom_bar() ``` <img src="twoetc_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ .hide-code[ ```r ggplot(d500, aes(x = carat, y = price)) + geom_point() ``` <img src="twoetc_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- You can also take a _stratified sample_ stratified on cut using `group_by`, `sample_frac`, and `ungroup`. --- .pull-left.width-30[ Explorations using facets: ] .pull-right.width-65[ .hide-code[ ```r p0 <- ggplot(diamonds, aes(x = carat, y = price)) dNoCut <- mutate(diamonds, cut = NULL) p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut) p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05) ## p + facet_wrap(~ cut) p + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-30[ Facets with a sample: ] .pull-right.width-65[ .hide-code[ ```r p500 <- p1 + geom_point(data = d500, color = "red", size = 1) ## p500 ## p500 + facet_wrap(~ cut) p500 + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-30[ Muted data with smooths: ] .pull-right.width-65[ .hide-code[ ```r p1 + geom_smooth(aes(color = cut), se = FALSE) + scale_x_log10() + scale_y_log10() ``` <img src="twoetc_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Conditioning, `carat` values near 1, 1.5, or 2: ] .pull-right[ .hide-code[ ```r drnd <- mutate(diamonds, crnd = round(2 * carat) / 2) ## Look at a bar chart of count(drnd, crnd) ## Drop higher carat values from density ridges ## map cut to color or fill (need to use group = interaction(crnd, cut)) ## try log scale for price filter(drnd, crnd <= 2, crnd >= 1, carat >= crnd, carat <= crnd + 0.1) %>% ggplot(aes(x = price, y = cut)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) + facet_wrap(~crnd, ncol = 1) ``` <img src="twoetc_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" /> ] ] <!-- Focus on carat values near one: ```r filter(diamonds, carat >= 0.95, carat < 1.05) %>% ggplot(aes(x = cut, y = price)) + geom_violin(scale = "count", draw_quantiles = 0.5) ## Try density ridges ``` --> --- layout: false ## Reading Chapter [_Visualizing associations among two or more quantitative variables_](https://clauswilke.com/dataviz/visualizing-associations.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). --- layout: true ## Exercises --- 1) A plot of arrival delay against departure delay for the NYC flight data shows a lot of over-plotting, even for a 10% sample. .pull-left[ .hide-code[ ```r library(dplyr) library(ggplot2) library(nycflights13) fl <- filter(flights, dep_delay < 120) %>% sample_frac(0.1) p <- ggplot(fl, aes(x = dep_delay, arr_delay)) + geom_point() p ## Warning: Removed 101 rows containing missing values (`geom_point()`). ``` <img src="twoetc_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" /> ] ] .pull-right[ This masks the fact that three quarters of the flights in this sample have departure delays of less than 10 minutes. Superimposing density contours is one way to address this issue. Which of the following adds four red density contours to the plot? * a. `p + geom_density_2d(color = "red")` * b. `p + geom_density_2d(bins = 4, color = "red")` * c. `p + geom_hex(bins = 5)` * d. `p + geom_density_2d(bins = 5)` ] --- 2) The `diamonds` data set is large enough for a scatter plot of `price` against `carat` to suffer from a significant amount of over-plotting. With `p` created as ```r library(ggplot2) p <- ggplot(diamonds, aes(x = carat, y = price)) ``` which of the following is the best choice to address the over-plotting issue? * a. `p + geom_point()` * b. `p + geom_point(size = 0.5)` * c. `p + geom_point(size = 0.1, alpha = 0.1)` * d. `p + geom_point(position = "jitter")` --- 3) Consider the code <!-- ## nolint start --> ```r library(dplyr) library(ggplot2) filter(diamonds, carat < 3) %>% mutate(crnd = round(carat)) %>% filter(---) %>% ggplot(aes(x = price, y = crnd, group = crnd)) + geom_density_ridges() ``` <!-- ## nolint end --> Which replacement for `---` produces a plot showing the conditional density of `price` given `carat` for `carat` values near one and the conditional density of `price` given `carat` for `carat` values near two? * a. `abs(carat - crnd) < 0.1` * b. `carat < 0.1` * c. `abs(crnd) < 1` * d. `carat + crnd < 2`
//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505