---
title: "Visualizing Two Numeric Variables"
output:
  html_document:
    toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
```{r, include = FALSE}
library(UsingR)
library(lattice)
library(tidyverse)
library(gridExtra)
set.seed(12345)
```
## Slope Graphs
The most commonly used graph for visualizing the relationship between two
numeric variables is the _scatter plot_.
But there is one alternative that can be useful and is increasingly
popular: the _slope chart_ or _slope graph_.
### Tufte's Slope Graph
Two articles on slope graphs with examples:
* http://charliepark.org/slopegraphs/
* http://www.visualisingdata.com/2013/12/in-praise-of-slopegraphs/
Tufte showed this example in _The Visual Display of Quantitative
Information_:
![](img/tufteslope.gif)
Some features of the data that are easy to see:
* order of the countries within each year;
* how each country's values changed;
* how the rates of change compare;
* the country (Britain) that does not fit the general pattern.
The chart uses no non-data ink.
The chart in this form is well suited for small data sets or summaries
with modest numbers of categories.
Scalability in this full form is limited, but better if labels and
values are dropped.
The idea can be extended to multiple periods, though two periods or
levels are most common when labeling is used. Without labeling this
becomes a _parallel coordinates plot_.
![](img/cancer_survival_nash.gif)
### Barley Mean Yields
Averaging `yield` over the different varieties for each site and year
produces
```{r}
gbsy <- group_by(barley, site, year)
absy <- summarize(gbsy, avg_yield = mean(yield))
head(absy)
```
The `year` variable in the summary is a factor with the levels in the
wrong order, so we need to fix that:
```{r}
levels(absy$year)
absy <- mutate(absy, year = fct_rev(year))
levels(absy$year)
```
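As a small self-contained illustration of what `fct_rev` does, the sketch below uses a made-up two-level factor; the object names are hypothetical and not part of the analysis. Reversing the level order changes only how the levels are ordered, not the values stored in the vector. Setting the levels explicitly with `factor()` is an alternative that gives the same ordering here.

```{r}
library(forcats)

# Hypothetical toy factor whose levels are stored as "1932", "1931"
toy_year <- factor(c("1932", "1931", "1932"), levels = c("1932", "1931"))
levels(toy_year)

# fct_rev() reverses the level order; the data values are unchanged
toy_year_rev <- fct_rev(toy_year)
levels(toy_year_rev)
as.character(toy_year_rev)

# Alternative: state the desired level order explicitly
toy_year_explicit <- factor(toy_year, levels = c("1931", "1932"))
identical(levels(toy_year_rev), levels(toy_year_explicit))
```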
The core of a slope graph for these means is
```{r}
p <- ggplot(absy, aes(x = year, y = avg_yield, group = site)) + geom_line()
p
```
Adding the labels can be done as
```{r}
p + geom_text(aes(label = paste0(site, ", ", round(avg_yield, 1))),
              hjust = "outward")
```
The label positions could use further adjusting; using `geom_text_repel`
from the `ggrepel` package handles this well:
```{r}
library(ggrepel)
p <- p + geom_text_repel(aes(label = paste0(site, ", ", round(avg_yield, 1))),
                         hjust = "outward", direction = "y")
basic_barley_slopes <- p
p
```
Finally, clear out non-data ink and move axis labels to the top:
```{r}
p + theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank(),
          axis.title = element_blank(),
          panel.border = element_blank()) +
    scale_x_discrete(position = "top")
```
### Father-Son Heights
The `father.son` data set has ```r nrow(father.son)``` observations,
which is too large for the labeled slope graph, but the basic
representation is useful.
To make creating the graph easier we can convert the data frame into
one with a `height` variable, a variable indicating whether the height
is for father or son, and a variable identifying the pair:
```{r}
fs <- mutate(father.son, id = seq_len(nrow(father.son)))
head(fs)
fs <- gather(fs, who, height, 1:2)
head(fs)
```
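`gather` still works, but the `tidyr` documentation describes it as superseded by `pivot_longer`. The following is a sketch of the same reshaping with `pivot_longer`, assuming `tidyr` 1.0.0 or later is installed; the names `fs_id` and `fs_long` are introduced here only for illustration. The rows come out in a different order than with `gather` (the two heights of each pair are adjacent), which does not affect the plots below.

```{r}
library(UsingR)   # provides father.son
library(dplyr)
library(tidyr)

# Add a pair identifier, then stack the two height columns
fs_id <- mutate(father.son, id = seq_len(nrow(father.son)))
fs_long <- pivot_longer(fs_id, c(fheight, sheight),
                        names_to = "who", values_to = "height")
head(fs_long)
```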
The basic plot is quite simple:
```{r}
ggplot(fs, aes(x = who, y = height)) + geom_line(aes(group = id))
```
With an axis adjustment and using a reduced `alpha` level:
```{r}
ggplot(fs, aes(x = who, y = height)) +
    geom_line(aes(group = id), alpha = 0.1) +
    scale_x_discrete(expand = c(.1, 0))
```
This very clearly shows the famous _regression to the mean_ effect:
* taller parents tend to be taller than their children;
* shorter parents tend to be shorter than their children.
Conversely,
* taller children tend to be taller than their parents;
* shorter children tend to be shorter than their parents.
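The pattern can also be checked numerically. The sketch below is one possible check, not part of the original analysis: it takes the tallest and shortest quarter of fathers (the quartile cutoffs are an arbitrary choice) and averages the son-minus-father difference in each group, then does the same starting from the sons. All object names are illustrative.

```{r}
library(UsingR)   # provides father.son

# Son minus father, in the units of the data
diff_sf <- with(father.son, sheight - fheight)

# Condition on the fathers: tallest and shortest quarter
f_q <- quantile(father.son$fheight, c(0.25, 0.75))
tall_f  <- mean(diff_sf[father.son$fheight >= f_q[2]])
short_f <- mean(diff_sf[father.son$fheight <= f_q[1]])
c(tall_fathers = tall_f, short_fathers = short_f)

# Condition on the sons instead
s_q <- quantile(father.son$sheight, c(0.25, 0.75))
tall_s  <- mean(diff_sf[father.son$sheight >= s_q[2]])
short_s <- mean(diff_sf[father.son$sheight <= s_q[1]])
c(tall_sons = tall_s, short_sons = short_s)
```

A negative mean difference for tall fathers and a positive one for short fathers correspond to the first pair of bullets; the signs flip when conditioning on the sons.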
## Scatter Plots
A scatter plot of two variables maps the values of one variable
to the vertical axis and the other to the horizontal axis of a
Cartesian coordinate system and places a marker for each observation
at the resulting point.
Conventions:
* Plot `A` versus/against `B` means `A` is mapped to the vertical, or
$y$, axis, and `B` to the horizontal, or $x$, axis.
* If we can think of variation in `A` as being partly explained by `B`
then we usually plot `A` against `B`.
* If we can think of `B` as helping to predict `A`, then we usually
plot `A` against `B`.
### Barley Mean Yields
For a scatter plot of mean yield in 1932 against mean yield in 1931
for the different sites it is useful to have a data frame containing
variables for each year.
This requires the inverse of a `gather` operation, called a `spread`:
```{r}
sabsy <- spread(absy, year, avg_yield, sep="")
head(sabsy)
```
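The `tidyr` documentation describes `spread` as superseded by `pivot_wider`. The following is a sketch of the equivalent call, assuming `tidyr` 1.0.0 or later and `dplyr` 1.0.0 or later; it rebuilds the site-by-year means from `barley` so that it runs on its own, and the names `means_long` and `means_wide` are illustrative. The `names_prefix` argument plays the role of `sep = ""` in producing the column names `year1931` and `year1932`.

```{r}
library(lattice)  # provides barley
library(dplyr)
library(tidyr)

# Mean yield for each site and year, returned as an ungrouped data frame
means_long <- summarize(group_by(barley, site, year),
                        avg_yield = mean(yield), .groups = "drop")

# One row per site, one column per year
means_wide <- pivot_wider(means_long, names_from = year,
                          values_from = avg_yield, names_prefix = "year")
means_wide
```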
The basic scatter plot of `y = year1932` against `x = year1931`:
```{r}
p <- ggplot(sabsy, aes(x = year1931, y = year1932)) + geom_point()
p
```
Adding labels using `geom_text_repel` shows that the point standing apart from the rest is the Morris site:
```{r}
p <- p + geom_text_repel(aes(label = site), vjust = "top")
p
```
To make the reversal easier to see we can add the 45-degree line $y = x$; sites above it had a higher mean yield in 1932 than in 1931:
```{r}
p + geom_abline(aes(intercept = 0, slope = 1), linetype = 2)
```
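The comparison that the 45-degree line makes visible can also be done directly. The following is a minimal sketch using base R and the `barley` data from `lattice` so that it stands alone; the object names are illustrative.

```{r}
library(lattice)  # provides barley

# Matrix of mean yields: one row per site, one column per year
yield_means <- with(barley, tapply(yield, list(site, year), mean))
yield_means

# Sites above the 45-degree line: mean yield higher in 1932 than in 1931
above_line <- rownames(yield_means)[yield_means[, "1932"] > yield_means[, "1931"]]
above_line
```

For these data the only site returned is Morris.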
### Father and Son Heights
The basic scatter plot:
```{r}
p0 <- ggplot(father.son, aes(x = fheight, y = sheight))
p1 <- p0 + geom_point()
p1
```
Adding a line with slope one through the point of means helps identify
the regression to the mean phenomenon, since the point cloud is flatter
than this line:
```{r}
p2 <- p1 + geom_abline(aes(intercept = mean(sheight) - mean(fheight),
                           slope = 1),
                       linetype = 2, color = "red")
p2
```
Adding a regression line helps further:
```{r}
p2 + geom_smooth(method = "lm")
```
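To put a number on what the two lines show, the fitted slope can be compared with one. The sketch below is an illustration rather than part of the original notes: it fits the same least-squares line that `geom_smooth(method = "lm")` draws and checks the standard identity that the slope equals the correlation times the ratio of standard deviations. The object names are illustrative.

```{r}
library(UsingR)   # provides father.son

fs_fit <- lm(sheight ~ fheight, data = father.son)
fs_slope <- unname(coef(fs_fit)["fheight"])
fs_slope   # below one: the fitted line is flatter than the dashed line

# Least-squares slope = correlation * sd(y) / sd(x)
fs_cor <- with(father.son, cor(fheight, sheight))
all.equal(fs_slope, fs_cor * sd(father.son$sheight) / sd(father.son$fheight))

# The fitted line, like the dashed line, passes through the point of means
fs_xbar <- mean(father.son$fheight)
fs_ybar <- mean(father.son$sheight)
all.equal(unname(predict(fs_fit, data.frame(fheight = fs_xbar))), fs_ybar)
```

When the two standard deviations are similar the slope is close to the correlation, so any correlation below one gives a line flatter than the slope-one reference.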
But for showing the regression effect it is hard to beat the scatter
plot of `sheight - fheight` against `fheight`:
```{r}
ggplot(father.son) +
    geom_point(aes(x = fheight, y = sheight - fheight)) +
    geom_hline(aes(yintercept = 0), linetype = 2, color = "red")
```
### Old Faithful Eruptions
A scatter plot of the waiting times until the next eruption against
the duration of the current eruption for the `faithful` data set shows
the two clusters corresponding to the short and long eruptions:
```{r}
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
```
For the `geyser` data set from the `MASS` package a plot of the two
variables shows a different pattern:
```{r}
ggplot(geyser) + geom_point(aes(x = duration, y = waiting))
```
The reason for the difference is that in the `geyser` data set the
waiting time reflects the time since the _previous_ eruption, not the
time until the _next_ one.
For this time order it would be more natural to plot duration against
time since the last eruption:
```{r}
ggplot(geyser) + geom_point(aes(x = waiting, y = duration))
```
We can adjust these data to pair durations with waiting times until
the next eruption using the `lag` function from `dplyr`. This produces
the same basic pattern as for the `faithful` data set:
```{r}
ggplot(geyser) + geom_point(aes(x = lag(duration), y = waiting))
```
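To see why `lag(duration)` lines things up, here is a sketch with a tiny made-up data frame in the same layout as `geyser`, where each row holds the wait before an eruption and that eruption's duration. The values and the name `toy_geyser` are hypothetical.

```{r}
library(dplyr)

# Hypothetical values; row i = wait before eruption i, duration of eruption i
toy_geyser <- data.frame(waiting = c(80, 55, 78),
                         duration = c(4.0, 2.0, 4.5))

# dplyr::lag() shifts values down one position, so row i receives the
# duration of eruption i - 1, the eruption that this row's wait followed
toy_geyser <- mutate(toy_geyser, prev_duration = dplyr::lag(duration))
toy_geyser
```

The first row has no previous eruption, so its lagged value is `NA`; in the plot above `ggplot2` drops that point with a warning.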