class: center, middle, title-slide

.title[
# Visualizing Two Numeric Variables
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-03-22
]

---
layout: true

<link rel="stylesheet" href="stat4580.css" type="text/css" />
<style type="text/css">
.remark-code { font-size: 85%; }
</style>
<!-- title based on Wilke's chapter -->

## Slope Graphs

---

The most widely used graph for visualizing the relationship between two numeric variables is the _scatter plot_.

--

But there is one alternative that can be useful and is increasingly popular: the _slope chart_ or _slope graph_.

--

### Tufte's Slope Graph

--

Two articles on slope graphs with examples:

* https://charliepark.org/slopegraphs/

* https://www.visualisingdata.com/2013/12/in-praise-of-slopegraphs/

---

.pull-left[
<img src="../img/tufteslope.gif" width="80%" style="display: block; margin: auto;" />
<!-- http://charliepark.org/images/slopegraphs/slopegraph.gif -->
]

--

.pull-right[
Tufte showed this example in _The Visual Display of Quantitative Information_.

{{content}}
]

--

Some features of the data that are easy to see:

{{content}}

--

* order of the countries within each year;
{{content}}

--

* how each country's values changed;
{{content}}

--

* how the rates of change compare;
{{content}}

--

* the country (Britain) that does not fit the general pattern.

---

The chart uses _no non-data ink_. (Tufte thinks this is important; others disagree.)

--

The chart in this form is well suited for small data sets or summaries with modest numbers of categories.

--

Scalability in this full form is limited, but it improves if labels and values are dropped.

--

The idea can be extended to multiple periods, though two periods or levels are most common when labeling is used.

--

Without labeling this becomes a _parallel coordinates plot_.
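
A Tufte-style slope graph with names and values on both sides can be sketched in `ggplot2` roughly as follows. This is only an illustration: the data frame `d`, its categories `A`, `B`, `C`, and all of its values are made up here and are not Tufte's data.

```r
library(ggplot2)

# Hypothetical values for three made-up categories in two periods.
d <- data.frame(
    name = rep(c("A", "B", "C"), times = 2),
    period = factor(rep(c("1970", "1979"), each = 3)),
    value = c(30, 25, 20, 35, 27, 18))

left <- subset(d, period == "1970")
right <- subset(d, period == "1979")

p_tufte <- ggplot(d, aes(x = period, y = value, group = name)) +
    geom_line() +
    # Name and value on the left, value and name on the right.
    geom_text(aes(label = paste(name, value)), data = left,
              hjust = "right", nudge_x = -0.05) +
    geom_text(aes(label = paste(value, name)), data = right,
              hjust = "left", nudge_x = 0.05) +
    scale_x_discrete(expand = expansion(mult = 0.5), position = "top") +
    # Remove most non-data ink: grid lines, axis titles, the y axis text.
    theme_minimal() +
    theme(panel.grid = element_blank(),
          axis.title = element_blank(),
          axis.text.y = element_blank())
p_tufte
```

With many categories the labels start to collide, which is the scalability limit noted above.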
---

.pull-left[
A multi-period example:
]

.pull-right[
<img src="../img/cancer_survival_nash.gif" width="60%" style="display: block; margin: auto;" />
<!-- http://charliepark.org/images/slopegraphs/cancer_survival_nash.gif -->
]

---

### Barley Mean Yields

A slope graph for the average yields at each experiment station for the two years 1931 and 1932:

.pull-left[
.hide-code[

```r
# Packages used throughout; attached here so the chunk runs on its own.
library(dplyr)
library(ggplot2)
theme_set(theme_minimal() + theme(text = element_text(size = 16)))
library(ggrepel)
library(forcats)
data(barley, package = "lattice")
barley_site_year <-
    group_by(barley, site, year) %>%
    summarize(avg_yield = mean(yield)) %>%
    mutate(year = fct_rev(year))
barley_site_year_1932 <-
    filter(barley_site_year, year == "1932")
ggplot(barley_site_year,
       aes(x = year, y = avg_yield, group = site)) +
    geom_line() +
    geom_text_repel(aes(label = site),
                    data = barley_site_year_1932,
                    hjust = "left",
                    direction = "y") +
    scale_x_discrete(expand = expansion(mult = c(0.1, .3)),
                     position = "top") +
    labs(x = NULL, y = "Average Yield") +
    theme(axis.line.y.left = element_line(),
          axis.ticks.y.left = element_line())
```

<img src="twonum_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]
]

--

.pull-right[
The anomalous result for Morris pops out very clearly.

{{content}}
]

--

This graph departs from the classic Tufte style:

{{content}}

--

* it uses an axis instead of showing the numbers;
{{content}}

--

* it shows labels on only one side.

{{content}}

--

This is similar to the style used [here](https://clauswilke.com/dataviz/visualizing-associations.html#associations-paired-data).
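
The Morris anomaly can also be checked numerically. A minimal sketch in base R, using only the `barley` data from `lattice`; the names `avg` and `change` are introduced here for illustration and are not part of the original slides.

```r
data(barley, package = "lattice")

# Site-by-year table of average yields; columns are named by year.
avg <- with(barley, tapply(yield, list(site, year), mean))

# Change in average yield from 1931 to 1932 for each site.
change <- avg[, "1932"] - avg[, "1931"]
sort(change)

# Sites whose average yield went up from 1931 to 1932.
names(change)[change > 0]
```

Morris is the only site with a positive change, which is what the single upward-sloping line in the graph shows.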
---

#### Creating the Graph

The first step is to compute the averages:

```r
barley_site_year <-
    group_by(barley, site, year) %>%
    summarize(avg_yield = mean(yield))
head(barley_site_year, 2)
## # A tibble: 2 × 3
## # Groups:   site [1]
##   site         year  avg_yield
##   <fct>        <fct>     <dbl>
## 1 Grand Rapids 1932       20.8
## 2 Grand Rapids 1931       29.1
```

--

The `year` variable is a factor with the levels in the wrong order, so we need to fix that:

```r
levels(barley_site_year$year)
## [1] "1932" "1931"
barley_site_year <-
    mutate(barley_site_year, year = fct_rev(year))
levels(barley_site_year$year)
## [1] "1931" "1932"
```

---

.pull-left[
Set the default theme to `theme_minimal` with larger text:

```r
theme_set(
    theme_minimal() +
    theme(text = element_text(size = 16)))
```

{{content}}
]

--

The core of a slope graph is produced by

```r
p <- ggplot(barley_site_year,
            aes(x = year,
                y = avg_yield,
                group = site)) +
    geom_line()
p
```

--

.pull-right[
<img src="twonum_files/figure-html/slope-core-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
Adding the labels on the 1932 side can be done as

```r
barley_site_year_1932 <-
    filter(barley_site_year, year == "1932")
p + geom_text(aes(label = site),
              data = barley_site_year_1932,
              hjust = "left")
```
]

.pull-right[
<img src="twonum_files/figure-html/slope-core-label-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
The label positions could use further adjusting.
Using `geom_text_repel` from the `ggrepel` package handles this well:

```r
library(ggrepel)
p <- p + geom_text_repel(
             aes(label = site),
             data = barley_site_year_1932,
             hjust = "left",
             direction = "y")
p
```
]

.pull-right[
<img src="twonum_files/figure-html/slope-core-repel-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
Adjust the `x` scale:

```r
p <- p + scale_x_discrete(
             expand = expansion(mult = c(0.1, .3)),
             position = "top")
p
```
]

.pull-right[
<img src="twonum_files/figure-html/slope-core-x-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
Final theme adjustments:

```r
p + labs(x = NULL, y = "Average Yield") +
    theme(axis.line.y.left = element_line(),
          axis.ticks.y.left = element_line())
```
]

.pull-right[
<img src="twonum_files/figure-html/scale-core-theme-1.png" style="display: block; margin: auto;" />
]

---

### Father-Son Heights

The `father.son` data set has 1078 observations, which is too many for a labeled slope graph, but the basic representation is useful:

--

.pull-left[
.hide-code[

```r
library(tidyr)
# father.son is assumed here to come from the UsingR package.
data(father.son, package = "UsingR")
fs <- mutate(father.son, id = seq_len(nrow(father.son))) %>%
    pivot_longer(1 : 2, names_to = "which", values_to = "height")
ggplot(fs, aes(x = which, y = height)) +
    geom_line(aes(group = id), alpha = 0.1) +
    scale_x_discrete(
        expand = expansion(mult = c(.1, .1)),
        labels = c("Father", "Son"),
        position = "top") +
    labs(x = NULL, y = "Height (Inches)")
```

<img src="twonum_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]
]

--

.pull-right[
This very clearly shows the famous _regression to the mean_ effect:

{{content}}
]

--

* taller parents tend to be taller than their children;

* shorter parents tend to be shorter than their children.

{{content}}

--

Conversely,

* taller children tend to be taller than their parents;

* shorter children tend to be shorter than their parents.
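
Both directions of the regression effect can be illustrated with simulated heights. This is a sketch under stated assumptions, not an analysis of `father.son`: the mean `mu`, standard deviation `sigma`, and correlation `rho` are made-up values, and the simulation gives fathers and sons the same mean and spread.

```r
set.seed(4580)
n <- 1000
mu <- 68      # assumed common mean height (inches)
sigma <- 2.7  # assumed common standard deviation
rho <- 0.5    # assumed father-son correlation

# Bivariate normal heights with correlation rho.
father <- rnorm(n, mu, sigma)
son <- mu + rho * (father - mu) + rnorm(n, 0, sigma * sqrt(1 - rho^2))

# Condition on the father's height: top and bottom quarters.
tall_f <- father > quantile(father, 0.75)
short_f <- father < quantile(father, 0.25)
diff_tall_f <- mean(son[tall_f] - father[tall_f])     # expected negative
diff_short_f <- mean(son[short_f] - father[short_f])  # expected positive

# Condition on the son's height instead.
tall_s <- son > quantile(son, 0.75)
diff_tall_s <- mean(son[tall_s] - father[tall_s])     # expected positive

c(diff_tall_f, diff_short_f, diff_tall_s)
```

The sign flips depending on which variable is used to select the group, which is why both sets of statements above hold at once.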
<!--
fs <- mutate(fs,
             TorSF = abs(height - mean(height)) > 1.5 * sd(height) &
                 which == "fheight")
ggplot(fs) +
    geom_line(aes(x = which, y = height, group = id, color = TorSF),
              alpha = 0.1) +
    scale_x_discrete(expand = c(.1, 0)) +
    scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black"))
-->

---

#### Creating the Graph

To make creating the graph easier we can convert the data frame into a longer form with variables

.pull-left[

* `height`, the height measurement

* `which`, `fheight` or `sheight`

* `id`, identifying the pair.

{{content}}
]

--

Add the `id` variable:

```r
fs <- mutate(father.son,
             id = seq_len(nrow(father.son)))
head(fs)
##    fheight  sheight id
## 1 65.04851 59.77827  1
## 2 63.25094 63.21404  2
## 3 64.95532 63.34242  3
## 4 65.75250 62.79238  4
## 5 61.13723 64.28113  5
## 6 63.02254 64.24221  6
```

--

.pull-right[
Pivot to the longer format:

```r
fs <- pivot_longer(fs,
                   1 : 2,
                   names_to = "which",
                   values_to = "height")
head(fs)
## # A tibble: 6 × 3
##      id which   height
##   <int> <chr>    <dbl>
## 1     1 fheight   65.0
## 2     1 sheight   59.8
## 3     2 fheight   63.3
## 4     2 sheight   63.2
## 5     3 fheight   65.0
## 6     3 sheight   63.3
```
]

---

.pull-left[
The basic plot is quite simple:

```r
ggplot(fs, aes(x = which,
               y = height,
               group = id)) +
    geom_line()
```
]

.pull-right[
<img src="twonum_files/figure-html/fs-basic-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
With an `alpha` adjustment to reduce over-plotting:

```r
ggplot(fs, aes(x = which,
               y = height,
               group = id)) +
*   geom_line(alpha = 0.1)
```
]

.pull-right[
<img src="twonum_files/figure-html/fs-alpha-1.png" style="display: block; margin: auto;" />
]

---

.pull-left[
With scale and label adjustments:

```r
ggplot(fs, aes(x = which,
               y = height,
               group = id)) +
    geom_line(alpha = 0.1) +
*   scale_x_discrete(
*       expand = expansion(mult = c(.1, .1)),
*       labels = c("Father", "Son"),
*       position = "top") +
*   labs(x = NULL, y = "Height (Inches)")
```
]

.pull-right[
<img src="twonum_files/figure-html/fs-theme-1.png"
style="display: block; margin: auto;" />
]

---
layout: true

## Scatter Plots

---

The most common way to show the relationship between two variables is a _scatter plot_.

--

A scatter plot of two variables maps the values of one variable to the vertical axis and the other to the horizontal axis of a Cartesian coordinate system and places a mark for each observation at the resulting point.

--

Some conventions:

--

* Plot `A` versus/against `B` means `A` is mapped to the vertical, or `\(y\)`, axis, and `B` to the horizontal, or `\(x\)`, axis.

--

* If we can think of variation in `A` as being partly explained by `B` then we usually plot `A` against `B`.

--

* If we can think of `B` as helping to predict `A`, then we usually plot `A` against `B`.

---

### Barley Yields

For a scatter plot of mean yield in 1932 against mean yield in 1931 for the different sites it is useful to have a data frame containing variables for each year.

--

This requires converting the data frame to a wider format.

```r
wide_barley_site_year <-
    pivot_wider(barley_site_year,
                names_from = "year",
                names_prefix = "avg_yield_",
                values_from = "avg_yield")
head(wide_barley_site_year)
## # A tibble: 6 × 3
## # Groups:   site [6]
##   site            avg_yield_1932 avg_yield_1931
##   <fct>                    <dbl>          <dbl>
## 1 Grand Rapids              20.8           29.1
## 2 Duluth                    25.7           30.3
## 3 University Farm           29.5           35.8
## 4 Morris                    41.5           29.3
## 5 Crookston                 31.2           43.7
## 6 Waseca                    41.9           54.3
```

---

.pull-left[
The basic scatter plot of `y = avg_yield_1932` against `x = avg_yield_1931`:
]

.pull-right[
.hide-code[

```r
p <- ggplot(wide_barley_site_year,
            aes(x = avg_yield_1931,
                y = avg_yield_1932)) +
    geom_point()
p
```

<img src="twonum_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adding labels using `geom_text_repel` identifies the Morris site:
]

.pull-right[
.hide-code[

```r
p <- p + geom_text_repel(aes(label = site))
p
```

<img src="twonum_files/figure-html/unnamed-chunk-13-1.png" style="display: block;
margin: auto;" />
]
]

---

.pull-left[
To recognize the reversal for Morris we can add the 45 degree line:
]

.pull-right[
.hide-code[

```r
p + geom_abline(aes(intercept = 0, slope = 1),
                linetype = 2)
```

<img src="twonum_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A 45 degree line also helps when viewing the full data:
]

.pull-right[
.hide-code[

```r
bw <- pivot_wider(barley,
                  names_from = "year",
                  names_prefix = "yield_",
                  values_from = "yield")
ggplot(bw, aes(x = yield_1931, y = yield_1932)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype = 2) +
    geom_point(data = filter(bw, site == "Morris"),
               color = "red") +
    ggtitle("Barley Yields",
            "Values for Morris are shown in red.") +
    labs(x = "1931", y = "1932")
```

<img src="twonum_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
A recent [blog post](https://eagereyes.org/blog/2020/in-praise-of-the-diagonal-reference-line) discusses the value of reference lines as plot annotations.

{{content}}
]

--

If the primary goal is to show the change from one year to the next then a _mean-difference plot_ is a good choice:

--

.pull-right[
.hide-code[

```r
ggplot(wide_barley_site_year,
       aes(x = (avg_yield_1932 + avg_yield_1931) / 2,
           y = avg_yield_1932 - avg_yield_1931)) +
    geom_point() +
    geom_text_repel(aes(label = site)) +
    geom_abline(aes(intercept = 0, slope = 0),
                linetype = 2)
```

<img src="twonum_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
For the full data:

.hide-code[

```r
ggplot(bw, aes(x = (yield_1932 + yield_1931) / 2,
               y = yield_1932 - yield_1931)) +
    geom_point() +
    geom_abline(aes(intercept = 0, slope = 0),
                linetype = 2)
```

<img src="twonum_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]
]

--

.pull-right[
The comparison of changes is now an aligned axis comparison.
{{content}}
]

--

Mean-difference plots are also known as

* Tukey mean-difference plots;

* MA-plots;

* [Bland-Altman plots](https://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot).

---

.pull-left[
Plotting the difference against the `x` variable is also often useful:
]

.pull-right[
.hide-code[

```r
ggplot(bw, aes(x = yield_1931,
               y = yield_1932 - yield_1931)) +
    geom_point() +
    geom_point(
        data = filter(bw, site == "Morris"),
        color = "red") +
    geom_abline(aes(intercept = 0, slope = 0),
                linetype = 2) +
    ggtitle(
        "Barley Yield Differences",
        "Values for Morris are shown in red.") +
    labs(x = "Yield in 1931",
         y = "Difference in Yield for 1932")
```

<img src="twonum_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
### Father and Son Heights

The basic scatter plot:
]

.pull-right[
.hide-code[

```r
p0 <- ggplot(father.son,
             aes(x = fheight, y = sheight))
p1 <- p0 + geom_point()
p1
```

<img src="twonum_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adding a line with slope one helps identify the regression to the mean phenomenon:
]

.pull-right[
.hide-code[

```r
p2 <- p1 +
    geom_abline(
        aes(intercept = mean(sheight) - mean(fheight),
            slope = 1),
        color = "red",
        linewidth = 1.5)
p2
```

<img src="twonum_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
Adding a regression line helps further:
]

.pull-right[
.hide-code[

```r
p2 + geom_smooth(method = "lm")
```

<img src="twonum_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
But for showing the regression effect it is hard to beat the scatter plot of `sheight - fheight` against `fheight`:
]

.pull-right[
.hide-code[

```r
ggplot(father.son) +
    geom_point(aes(x = fheight,
                   y = sheight - fheight)) +
    geom_hline(aes(yintercept = 0),
               linetype = 2)
```

<img src="twonum_files/figure-html/unnamed-chunk-22-1.png" style="display: block;
margin: auto;" />
]
]

---

.pull-left[
### Old Faithful Eruptions

A scatter plot of the waiting times until the next eruption against the duration of the current eruption for the `faithful` data set shows the two clusters corresponding to the short and long eruptions:
]

.pull-right[
.hide-code[

```r
ggplot(faithful) +
    geom_point(aes(x = eruptions, y = waiting))
```

<img src="twonum_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
For the `geyser` data set from the `MASS` package a plot of the two variables shows a different pattern:
]

.pull-right[
.hide-code[

```r
data(geyser, package = "MASS")
ggplot(geyser) +
    geom_point(aes(x = duration, y = waiting))
```

<img src="twonum_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]
]

---

.pull-left[
The reason for the difference is that in the `geyser` data set the waiting time reflects the time since the _previous_ eruption, not the time until the _next_ one.

{{content}}
]

--

For this ordering it is more natural to plot `duration` against `waiting`:

--

.pull-right[
.hide-code[

```r
ggplot(geyser) +
    geom_point(aes(x = waiting, y = duration))
```

<img src="twonum_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

{{content}}
]

--

How well does the waiting time predict whether the duration will be longer or shorter?

---

.pull-left[
The question the park service is more interested in: How well does duration predict waiting time until the next eruption?

{{content}}
]

--

We can adjust these data to pair durations with waiting times until the next eruption using the `lag` function from `dplyr`.
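
What the shift does can be seen on a small example. The sketch below uses made-up vectors (the values are hypothetical, not taken from `geyser`) and a base R expression that gives the same result as `dplyr::lag(duration)` with its default shift of one. Note that `stats::lag`, which is what `lag` refers to when `dplyr` is not attached, is meant for time series and leaves the values of a plain vector in place.

```r
# Hypothetical durations and waiting times for five successive eruptions;
# waiting[i] is the time since eruption i - 1, as in the geyser data.
duration <- c(4.0, 2.0, 4.5, 1.9, 4.3)
waiting <- c(80, 85, 55, 88, 52)

# Shift down by one position so each waiting time is paired with the
# duration of the eruption that preceded it.
prev_duration <- c(NA, head(duration, -1))
paired <- data.frame(prev_duration, waiting)
paired

# stats::lag only changes the time-series attributes, not the values.
unshifted <- as.vector(stats::lag(duration))
identical(unshifted, duration)
```

The first row has a missing value because the eruption before the first waiting time is not recorded.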
{{content}}

--

This produces the same basic pattern as for the `faithful` data set:

--

.pull-right[
.hide-code[

```r
ggplot(geyser) +
    geom_point(aes(x = lag(duration), y = waiting),
               na.rm = TRUE)
```

<img src="twonum_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]
]

---
layout: false

## Reading

Chapter [_Visualizing associations among two or more quantitative variables_](https://clauswilke.com/dataviz/visualizing-associations.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/).