The most used graph for visualizing the relationship between two numeric variables is the *scatter plot*.

But there is one alternative that can be useful and is increasingly popular: the *slope chart* or *slope graph*.

Two articles on slope graphs with examples:

- http://charliepark.org/slopegraphs/
- http://www.visualisingdata.com/2013/12/in-praise-of-slopegraphs/

Tufte showed this example in *The Visual Display of Quantitative Information*:

Some features of the data that are easy to see:

- order of the countries within each year;
- how each country’s values changed;
- how the rates of change compare;
- the country (Britain) that does not fit the general pattern.

The chart uses no non-data ink.

The chart in this form is well suited for small data sets or summaries with modest numbers of categories.

Scalability in this full form is limited, but better if labels and values are dropped.

The idea can be extended to multiple periods, though two periods or levels is most common when labeling is used. Without labeling this becomes a *parallel coordinates plot*.
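
As a minimal sketch of the multi-period extension, the same `geom_line` construction works once the period variable has more than two ordered levels. The data frame `toy` and its values are hypothetical, made up purely for illustration:

```r
library(ggplot2)

# Hypothetical toy data: three made-up groups measured in three periods.
toy <- data.frame(
  group  = rep(c("A", "B", "C"), each = 3),
  period = factor(rep(c("t1", "t2", "t3"), times = 3),
                  levels = c("t1", "t2", "t3")),
  value  = c(10, 12, 15, 14, 13, 11, 8, 9, 13)
)

# One line per group, drawn across the ordered periods.
p_multi <- ggplot(toy, aes(x = period, y = value, group = group)) +
  geom_line()
p_multi
```

With many groups and no labels, a display built this way is essentially a parallel coordinates plot.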

Averaging `yield` over the different varieties for each site and year produces

```
gbsy <- group_by(barley, site, year)
absy <- summarize(gbsy, avg_yield = mean(yield))
head(absy)
## # A tibble: 6 x 3
## # Groups: site [3]
## site year avg_yield
## <fct> <fct> <dbl>
## 1 Grand Rapids 1932 20.8
## 2 Grand Rapids 1931 29.1
## 3 Duluth 1932 25.7
## 4 Duluth 1931 30.3
## 5 University Farm 1932 29.5
## 6 University Farm 1931 35.8
```

The `year` variable in the summary is a factor with the levels in the wrong order, so we need to fix that:

```
levels(absy$year)
## [1] "1932" "1931"
absy <- mutate(absy, year = fct_rev(year))
levels(absy$year)
## [1] "1931" "1932"
```

The core of a slope graph for these means is

```
p <- ggplot(absy, aes(x = year, y = avg_yield, group = site)) + geom_line()
p
```

Adding the labels can be done as

```
p + geom_text(aes(label = paste0(site, ", ", round(avg_yield, 1))),
              hjust = "outward")
```

The label positions could use further adjusting; using `geom_text_repel` from the `ggrepel` package handles this well:

```
library(ggrepel)
p <- p + geom_text_repel(aes(label = paste0(site, ", ", round(avg_yield, 1))),
                         hjust = "outward", direction = "y")
basic_barley_slopes <- p
p
```

Finally, clear out non-data ink and move axis labels to the top:

```
p + theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          axis.text.y = element_blank(),
          axis.title = element_blank(),
          panel.border = element_blank()) +
    scale_x_discrete(position = "top")
```

The `father.son` data set has `1078` observations, which is too large for the labeled slope graph, but the basic representation is useful.

To make creating the graph easier we can convert the data frame into one with a `height` variable, a variable indicating whether the height is for father or son, and a variable identifying the pair:

```
fs <- mutate(father.son, id = seq_len(nrow(father.son)))
head(fs)
## fheight sheight id
## 1 65.04851 59.77827 1
## 2 63.25094 63.21404 2
## 3 64.95532 63.34242 3
## 4 65.75250 62.79238 4
## 5 61.13723 64.28113 5
## 6 63.02254 64.24221 6
fs <- gather(fs, who, height, 1:2)
head(fs)
## id who height
## 1 1 fheight 65.04851
## 2 2 fheight 63.25094
## 3 3 fheight 64.95532
## 4 4 fheight 65.75250
## 5 5 fheight 61.13723
## 6 6 fheight 63.02254
```

The basic plot is quite simple:

`ggplot(fs, aes(x = who, y = height)) + geom_line(aes(group = id))`

With an axis adjustment and using a reduced `alpha` level:

```
ggplot(fs, aes(x = who, y = height)) +
    geom_line(aes(group = id), alpha = 0.1) +
    scale_x_discrete(expand = c(.1, 0))
```

This very clearly shows the famous *regression to the mean* effect:

- taller parents tend to be taller than their children;
- shorter parents tend to be shorter than their children.

Conversely,

- taller children tend to be taller than their parents;
- shorter children tend to be shorter than their parents.
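
A small simulation can make both directions of the effect concrete. The sketch below does not use `father.son`; it draws hypothetical heights from a bivariate normal model whose mean, standard deviation, and correlation are assumed values chosen only for illustration:

```r
set.seed(1)

# Assumed, illustrative parameters (not estimated from father.son).
n   <- 10000
mu  <- 68     # common mean height, in inches
sdv <- 2.7    # common standard deviation
rho <- 0.5    # assumed father-son correlation

father <- rnorm(n, mu, sdv)
# Son heights with the same mean and sd, and correlation rho with father.
son <- mu + rho * (father - mu) + sqrt(1 - rho^2) * rnorm(n, 0, sdv)

tall_fathers <- father > mu + sdv
tall_sons    <- son > mu + sdv

# Average son-minus-father difference among tall fathers (negative:
# tall fathers tend to be taller than their sons).
diff_tall_fathers <- mean(son[tall_fathers] - father[tall_fathers])

# Average son-minus-father difference among tall sons (positive:
# tall sons tend to be taller than their fathers).
diff_tall_sons <- mean(son[tall_sons] - father[tall_sons])

c(tall_fathers = diff_tall_fathers, tall_sons = diff_tall_sons)
```

Both statements hold at once because the two groups are selected differently: conditioning on an extreme value of one variable pulls the expected value of the other toward the mean whenever the correlation is below one.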

A scatter plot of two variables maps the values of one variable to the vertical axis and the other to the horizontal axis of a Cartesian coordinate system and places a marker for each observation at the resulting point.

Conventions:

- Plot `A` versus/against `B` means `A` is mapped to the vertical, or \(y\), axis, and `B` to the horizontal, or \(x\), axis.
- If we can think of variation in `A` as being partly explained by `B`, then we usually plot `A` against `B`.
- If we can think of `B` as helping to predict `A`, then we usually plot `A` against `B`.

For a scatter plot of mean yield in 1932 against mean yield in 1931 for the different sites it is useful to have a data frame containing variables for each year.

This requires the inverse of a `gather` operation, called a `spread`:

```
sabsy <- spread(absy, year, avg_yield, sep="")
head(sabsy)
## # A tibble: 6 x 3
## # Groups: site [6]
## site year1931 year1932
## <fct> <dbl> <dbl>
## 1 Grand Rapids 29.1 20.8
## 2 Duluth 30.3 25.7
## 3 University Farm 35.8 29.5
## 4 Morris 29.3 41.5
## 5 Crookston 43.7 31.2
## 6 Waseca 54.3 41.9
```
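
That `spread` undoes `gather` can be checked on a small example. This is a minimal sketch on a hypothetical data frame `wide` with made-up values (not taken from `father.son` or `barley`):

```r
library(tidyr)

# Hypothetical toy data: made-up heights for three father-son pairs.
wide <- data.frame(id = 1:3,
                   fheight = c(65, 63, 70),
                   sheight = c(66, 64, 69))

# Wide to long: one row per (id, who) combination.
long <- gather(wide, who, height, fheight, sheight)
long

# Long back to wide: one column per distinct value of `who`.
back <- spread(long, who, height)
back
```

The tidyr documentation marks `gather` and `spread` as superseded by `pivot_longer` and `pivot_wider`; the older functions still work and are kept here to match the rest of these notes.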

The basic scatter plot of `y = year1932` against `x = year1931`:

```
p <- ggplot(sabsy, aes(x = year1931, y = year1932)) + geom_point()
p
```

Adding labels using `geom_text_repel` identifies the Morris site:

```
p <- p + geom_text_repel(aes(label = site), vjust = "top")
p
```

To recognize the reversal we can add the 45 degree line:

`p + geom_abline(aes(intercept = 0, slope = 1), linetype = 2)`

The basic scatter plot:

```
p0 <- ggplot(father.son, aes(x = fheight, y = sheight))
p1 <- p0 + geom_point()
p1
```

Adding a line with slope one helps identify the regression to the mean phenomenon:

```
p2 <- p1 + geom_abline(aes(intercept = mean(sheight) - mean(fheight),
                           slope = 1),
                       linetype = 2, color = "red")
p2
```
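
The intercept used for the dashed line follows from requiring a slope-one line to pass through the point of means. Writing \(\bar{x}\) for the mean of `fheight` and \(\bar{y}\) for the mean of `sheight` (notation introduced here only for this derivation):

\[
y - \bar{y} = 1 \cdot (x - \bar{x})
\quad\Longrightarrow\quad
y = x + (\bar{y} - \bar{x}),
\]

so the intercept is \(\bar{y} - \bar{x}\), which is what `mean(sheight) - mean(fheight)` computes. A point below this line is a son whose deviation from the sons' mean is smaller than his father's deviation from the fathers' mean.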

Adding a regression line helps further:

`p2 + geom_smooth(method = "lm")`

But for showing the regression effect it is hard to beat the scatter plot of `sheight - fheight` against `fheight`:

```
ggplot(father.son) +
    geom_point(aes(x = fheight, y = sheight - fheight)) +
    geom_hline(aes(yintercept = 0), linetype = 2, color = "red")
```

A scatter plot of the waiting times until the next eruption against the duration of the current eruption for the `faithful` data set shows the two clusters corresponding to the short and long eruptions:

`ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))`

For the `geyser` data set from the `MASS` package a plot of the two variables shows a different pattern:

`ggplot(geyser) + geom_point(aes(x = duration, y = waiting))`

The reason for the difference is that in the `geyser` data set the waiting time reflects the time since the *previous* eruption, not the time until the *next* one.

For this time order it would be more natural to plot duration against time since the last eruption:

`ggplot(geyser) + geom_point(aes(x = waiting, y = duration))`

We can adjust these data to pair durations with waiting times until the next eruption using the `lag` function from `dplyr`. This produces the same basic pattern as for the `faithful` data set:

```
ggplot(geyser) + geom_point(aes(x = lag(duration), y = waiting))
## Warning: Removed 1 rows containing missing values (geom_point).
```
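
The single removed row comes from the way `lag` shifts a vector. A minimal sketch, using a hypothetical vector `duration` of made-up values rather than the `geyser` data:

```r
library(dplyr)

# Hypothetical toy values standing in for a sequence of eruption durations.
duration <- c(4.0, 2.0, 4.5, 1.8)

# lag() moves each value one position later; the first entry has no
# predecessor, so it becomes NA.
lagged <- dplyr::lag(duration)
lagged

# lead() shifts the other way, leaving NA in the last position.
led <- dplyr::lead(duration)
led
```

Each waiting time is therefore paired with the duration of the eruption that preceded it, and the first observation, which has no preceding duration, is dropped from the plot.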