As we look at visualizations, a reminder of some considerations:

Task levels for visualization, from highest to lowest level:

- **Analyze**: Identify patterns, distributions, presence of outliers or clusters, other interesting features.
- **Search**: Look up aspects of a feature known in advance or revealed by the visualization.
- **Query**: Identify, compare features of individual items.

Each higher level builds on the levels below.

Scalability:

How well do these visualizations work for larger data sets?

Are there variations that can help for larger data sets?

One of the simplest visualizations of a single numerical variable with a modest number of observations and labels for the observations is a *dot plot*, or *Cleveland dot plot*:

```
library(ggplot2)
ggplot(Playfair) +
    geom_point(aes(x = population, y = city))
```

This visualization

- shows the overall distribution of the data, and
- makes it easy to locate the population of a particular city.

A useful variation is to order the vertical position by rank:

```
ggplot(Playfair) +
    geom_point(aes(x = population, y = reorder(city, population)))
```

With this ordering

- locating a particular city is a little more difficult, but
- the shape of the distribution is more apparent, and
- approximate median and quartiles can be read off.

This visualization is often very useful for group summaries.

Using labels can become impractical for larger numbers of observations:

`ggplot(citytemps) + geom_point(aes(x = temp, y = reorder(city, temp)))`

Instead we can use ranks:

`ggplot(citytemps) + geom_point(aes(x = temp, y = rank(temp)))`

Using ranks or percentiles this visualization in principle scales to much larger data sets:

```
ggplot(diamonds) +
    geom_point(aes(x = price, y = 100 * rank(price) / length(price)))
```
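To make the percentile computation concrete, here is a small sketch on a made-up vector. The values in `x` are hypothetical and chosen only for illustration; the expression is the same `100 * rank(x) / length(x)` used in the plot.

```r
# Hypothetical values, chosen only to illustrate the computation.
x <- c(10, 30, 20, 40)

# rank() gives each value's position in the sorted data (1 = smallest).
r <- rank(x)

# Convert the ranks to percentages of the sample size.
pct <- 100 * r / length(x)

data.frame(x = x, rank = r, pct = pct)
```

With tied values `rank()` assigns the average rank by default, so tied observations are plotted at the same height.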

With a data set of this size there is little visual difference between a dot plot and a plot that interpolates the points:

```
ggplot(diamonds) +
    geom_line(aes(x = price, y = 100 * rank(price) / length(price)))
```

Both the dot plot and the interpolated version use a lot of resources for computing and storing the plot, with diminishing visual returns.

A more effective approach for larger data sets is to fix a set of percentages and plot against the corresponding percentiles:

```
p <- seq(0, 1, length.out = 100)
dm <- data.frame(pct = 100 * p, price = quantile(diamonds$price, p))
ggplot(dm) + geom_line(aes(x = price, y = pct))
```

A similar result can be obtained with `stat_ecdf`.
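A minimal sketch of the `stat_ecdf` version follows. It uses the `diamonds` data from `ggplot2`; the object name `p_ecdf` and the percentage relabeling are illustrative choices, not part of the original notes.

```r
library(ggplot2)

# stat_ecdf() computes the empirical cumulative distribution function of
# the x aesthetic; the y axis is a proportion between 0 and 1.
p_ecdf <- ggplot(diamonds) +
    stat_ecdf(aes(x = price)) +
    ## relabel the proportions as percentages to match the earlier plots
    scale_y_continuous(labels = function(y) 100 * y) +
    labs(y = "pct")
p_ecdf
```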

The size of the dots can be used to encode an additional numeric variable.

We can compute the approximate area for the cities in the `Playfair` data frame as `pi * (diameter / 2) ^ 2`:

```
library(dplyr)
PlayfairA <- mutate(Playfair, area = pi * (diameter / 2) ^ 2)
ggplot(PlayfairA) +
    geom_point(aes(x = population, y = reorder(city, population),
                   size = area)) +
    scale_size_area()
```

For the `barley` data we have two measures for each site/variety combination, one for each year. It can be useful to

- place both measures for a site/variety combination on one line, and
- identify the year using color:

```
barley <- mutate(barley, sitevar = paste(site, variety, sep = ", "))
ggplot(barley) +
    geom_point(aes(x = yield, y = sitevar, color = year))
```

Reordering the lines based on the yield in 1931 may be useful.

First step: Isolate 1931 yields:

```
b31 <- filter(barley, year == 1931)
b31 <- select(b31, yield31 = yield, sitevar)
head(b31)
## yield31 sitevar
## 1 27.00000 University Farm, Manchuria
## 2 48.86667 Waseca, Manchuria
## 3 27.43334 Morris, Manchuria
## 4 39.93333 Crookston, Manchuria
## 5 32.96667 Grand Rapids, Manchuria
## 6 28.96667 Duluth, Manchuria
```

Second step: Use `left_join` to merge the 1931 yields with the barley data frame:

```
barley31 <- left_join(barley, b31)
## Joining, by = "sitevar"
head(barley31)
## yield variety year site sitevar
## 1 27.00000 Manchuria 1931 University Farm University Farm, Manchuria
## 2 48.86667 Manchuria 1931 Waseca Waseca, Manchuria
## 3 27.43334 Manchuria 1931 Morris Morris, Manchuria
## 4 39.93333 Manchuria 1931 Crookston Crookston, Manchuria
## 5 32.96667 Manchuria 1931 Grand Rapids Grand Rapids, Manchuria
## 6 28.96667 Manchuria 1931 Duluth Duluth, Manchuria
## yield31
## 1 27.00000
## 2 48.86667
## 3 27.43334
## 4 39.93333
## 5 32.96667
## 6 28.96667
```
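The join key can also be named explicitly, which documents the intent and avoids the "Joining, by" message. The sketch below assumes the `barley` data frame is the one shipped with the `lattice` package; `barley_sv` is a name introduced here only to avoid overwriting `barley`.

```r
library(dplyr)
library(lattice)  # assumed source of the barley data frame

# Rebuild the combined site/variety label used above.
barley_sv <- mutate(barley, sitevar = paste(site, variety, sep = ", "))

# One row per site/variety combination with its 1931 yield.
b31 <- filter(barley_sv, year == "1931")
b31 <- select(b31, yield31 = yield, sitevar)

# Naming the key explicitly silences the "Joining, by" message.
barley31 <- left_join(barley_sv, b31, by = "sitevar")
head(barley31)
```

Because each `sitevar` value appears exactly once in `b31`, the join keeps the number of rows unchanged and simply adds the `yield31` column.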

Now plot the data:

```
ggplot(barley31) +
    geom_point(aes(x = yield, y = reorder(sitevar, yield31), color = year))
```

Base graphics provides the `dotchart` function:

`dotchart(Playfair$population, labels = Playfair$city)`

The `lattice` package provides `dotplot`:

```
library(lattice)
dotplot(reorder(city, population) ~ population, data = Playfair)
```

Most lattice plots support a `group` argument that is usually mapped to color:

```
dotplot(reorder(sitevar, yield31) ~ yield,
        group = year, data = barley31, auto.key = TRUE)
```

Bar charts are most commonly used to show frequencies for categorical data. They are also usually drawn vertically:

`ggplot(diamonds) + geom_bar(aes(x = color))`

It is possible to persuade `geom_bar` to encode the value of a numeric variable as bar height by

- assigning the variable to the `y` aesthetic, and
- specifying `stat = "identity"`.

The default `stat` for `geom_bar` is `"count"`, which counts the observations at each `x` value, much as a histogram counts the observations in each bin, and uses the counts as the `y` aesthetic.

```
ggplot(Playfair) +
    geom_bar(aes(y = population, x = reorder(city, population)),
             stat = "identity")
```

Slightly simpler is to use `geom_col`:

```
ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population)))
```

To make labels readable we can flip the plot:

```
ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population))) +
    coord_flip()
```

[ Switching `x` and `y` doesn’t do what you want in versions of `ggplot2` before 3.3.0; later versions detect the orientation automatically.]

Reducing the bar width may be helpful:

```
ggplot(Playfair) +
    geom_col(aes(y = population, x = reorder(city, population)),
             width = 0.3) +
    coord_flip()
```

Creating this basic bar chart in `lattice` is easier:

`barchart(reorder(city, population) ~ population, data = Playfair)`

Base graphics also provides a bar chart with the `barplot` function:

```
opar <- par(mar = c(5, 5, 4, 2) + 0.1)
with(Playfair,
     barplot(population, horiz = TRUE, names.arg = city,
             las = 1, cex.names = 0.7, cex.axis = 0.7))
par(opar)
```

Bar charts seem to be used much more than dot plots in the popular media.

But they are less widely applicable, and have one dangerous feature, sometimes called the *zero baseline issue*.

Because of the way our perception works, when we look at a bar chart we focus on the length of the bar, that is, the distance from the baseline, even when the baseline is not meaningful.

Look at these views of temperatures in the subset of cities where temperatures are between 60 and 70 degrees.

```
library(gridExtra)
c6070 <- filter(citytemps, temp >= 60 & temp <= 70)
p1 <- barchart(city ~ temp, c6070)
p2 <- dotplot(city ~ temp, c6070)
grid.arrange(p1, p2, nrow = 1)
```

Take the Melbourne-Los Angeles pair and the Algiers-Addis Ababa pair.

Their temperature differences are about the same.

The bar chart makes the Algiers-Addis Ababa contrast look much more extreme than the Melbourne-Los Angeles contrast.

With a subset like this the focus is on the difference, not the ratio: in this context these temperatures are *interval data*.
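A small worked example may help show why ratios are not meaningful for interval data. The two temperatures below are made up for illustration, and `f_to_c` is a helper defined here, not something from the notes. Changing the unit changes the ratio, while the two differences stay proportional.

```r
# Convert degrees Fahrenheit to degrees Celsius.
f_to_c <- function(f) (f - 32) * 5 / 9

# Hypothetical temperatures, chosen only for illustration.
temps_f <- c(60, 70)
temps_c <- f_to_c(temps_f)

# The ratio depends on the arbitrary zero point of the scale ...
ratio_f <- temps_f[2] / temps_f[1]
ratio_c <- temps_c[2] / temps_c[1]

# ... but the differences differ only by the constant factor 5 / 9.
diff_f <- diff(temps_f)
diff_c <- diff(temps_c)

c(ratio_f = ratio_f, ratio_c = ratio_c, diff_f = diff_f, diff_c = diff_c)
```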

As another example, take the cities in the `Playfair` data with population between 100 and 500 thousand:

```
P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15)
p2 <- barchart(city ~ population, data = P15)
grid.arrange(p1, p2, nrow = 1)
```

Someone choosing to look at a restricted range like this is most likely focusing on the differences in population, not the ratio.

The ratio comparisons emphasized by the bar chart are not meaningful.

Setting the origin to zero rescues the bar chart:

```
P15 <- filter(Playfair, population >= 100 & population <= 500)
p1 <- dotplot(city ~ population, data = P15, origin = 0)
p2 <- barchart(city ~ population, data = P15, origin = 0)
grid.arrange(p1, p2, nrow = 1)
```

Either chart allows a ratio comparison or a difference comparison.

- The ratio comparison is a little easier with a bar chart.
- The difference comparison is harder because of the distraction created by the bars.
- For quantity measurements, ratios are usually at least meaningful, even if they might not be the primary focus.

Some notes:

- When using bar charts for positive numbers where ratios are meaningful the baseline should *always* be zero.
- The only exception is when there is a natural non-zero baseline value, such as 32 degrees Fahrenheit (i.e. 0 degrees Celsius) for temperatures.
- You may need to intervene with your software’s defaults to make this happen.
- Bar charts always push the viewer to ratio comparisons, whether they are meaningful or not.
- Using a non-zero baseline can therefore mislead the viewer.
- Some news organizations seem particularly prone to taking advantage of/falling prey to this issue.
- I used `lattice` in these examples because it is hard to get `geom_bar` or `geom_col` to use a non-zero baseline (which is a good thing!).
- You *can* create a bar chart with a non-zero baseline using `geom_segment`.
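As a rough sketch of the `geom_segment` approach, the following draws thick segments from a baseline of 32 degrees Fahrenheit to each value. The data frame `temps` is made up for illustration, and the names `baseline` and `p_seg` are introduced here. The `linewidth` argument assumes `ggplot2` 3.4.0 or later; older versions use `size` instead.

```r
library(ggplot2)

# Hypothetical data, made up for illustration only.
temps <- data.frame(city = c("A", "B", "C"),
                    temp = c(25, 40, 68))
baseline <- 32  # freezing point in degrees Fahrenheit

# Each bar is a thick horizontal segment from the baseline to the value.
p_seg <- ggplot(temps) +
    geom_segment(aes(x = baseline, xend = temp, y = city, yend = city),
                 linewidth = 8) +
    geom_vline(xintercept = baseline, linetype = 2)
p_seg
```

Values below the baseline produce segments extending to the left, so the direction of each bar shows whether the temperature is above or below freezing.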

Bar charts can be used for data containing both positive and negative values:

```
p1 <- ggplot(citytemps) +
    geom_point(aes(y = city, x = temp - 32)) +
    geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
    geom_col(aes(x = city, y = temp - 32)) +
    coord_flip()
grid.arrange(p1, p2, nrow = 1)
```

- This puts a strong emphasis on the baseline.
- This can be useful if the baseline is meaningful, such as
    - an average level
    - zero degrees Celsius
- Zero degrees Fahrenheit is not meaningful:

```
p1 <- ggplot(citytemps) +
    geom_point(aes(y = city, x = temp)) +
    geom_vline(xintercept = 0, lty = 2)
p2 <- ggplot(citytemps) +
    geom_col(aes(x = city, y = temp)) +
    coord_flip()
grid.arrange(p1, p2, nrow = 1)
```

`ggplot` Documentation:

- The `ggplot2` package includes help pages, but the variants available at http://docs.ggplot2.org/ are a bit more accessible.
- The definitive guide is Hadley Wickham’s book.
- Some useful recipes are available at http://www.cookbook-r.com/Graphs/.
- A more extensive collection is available in Winston Chang’s (2013) *R Graphics Cookbook*.
- *R for Data Science* also contains material on `ggplot2`.

Dot plots, and sometimes bar charts, can be very useful for showing group summaries. Two approaches for computing summaries:

- Use the `tapply`, `by`, and `aggregate` functions from base R.
- Use tools in the `tidyverse`, in particular from the `dplyr` package.
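For comparison, here is a minimal sketch of the base R approach. It assumes the `barley` data frame is the one shipped with the `lattice` package; the result names are illustrative.

```r
library(lattice)  # assumed source of the barley data frame

# tapply() returns a named vector of group means.
variety_means <- tapply(barley$yield, barley$variety, mean)

# aggregate() returns a data frame with one row per group.
variety_means_df <- aggregate(yield ~ variety, data = barley, FUN = mean)

variety_means
variety_means_df
```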

I will use the `dplyr` approach. This uses `group_by` to create a *grouped table*, followed by `summarize`.

Here is how to compute the average `yield` values for each `variety` in the `barley` data:

```
barley_by_variety <- group_by(barley, variety)
barley_variety_means <- summarize(barley_by_variety, avg_yield = mean(yield))
head(barley_variety_means)
## # A tibble: 6 x 2
## variety avg_yield
## <fct> <dbl>
## 1 Svansota 30.4
## 2 No. 462 35.4
## 3 Manchuria 31.5
## 4 No. 475 31.8
## 5 Velvet 33.1
## 6 Peatland 34.2
ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety)))
```

The ordering of the `variety` factor created by the two approaches is a little different.

An alternate way of specifying the `dplyr` computation uses the *pipe operator* `%>%`:

```
barley %>%
    group_by(variety) %>%
    summarize(avg_yield = mean(yield))
```

In this approach the result on the left of `%>%` is passed implicitly as the first argument to the function called on the right.
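The equivalence can be checked directly. This sketch uses the built-in `mtcars` data as a stand-in for the data in the text; the names `piped` and `nested` are introduced here for illustration.

```r
library(dplyr)

# mtcars ships with R; it stands in here for the data used in the text.
piped <- mtcars %>%
    group_by(cyl) %>%
    summarize(avg_mpg = mean(mpg))

# The same computation written as nested function calls.
nested <- summarize(group_by(mtcars, cyl), avg_mpg = mean(mpg))

# Both forms produce the same table.
all.equal(piped, nested)
```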

Some like this approach a lot; others do not. I do not care for it.

Base and lattice dot plots use only horizontal grid lines. This corresponds to the version introduced by W. S. Cleveland.

Lattice and ggplot allow features such as this to be customized using *themes*.

`ggplot2` provides a number of alternate themes; the `ggthemes` package provides more.
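As a brief sketch, two of the themes built into `ggplot2` can be applied by adding them to a plot. The data frame `means` is made up for illustration, and the object names are introduced here.

```r
library(ggplot2)

# Hypothetical group summaries, made up for illustration.
means <- data.frame(group = c("a", "b", "c"), avg = c(2.1, 3.4, 2.8))

p <- ggplot(means) + geom_point(aes(x = avg, y = group))

# A theme is added to a plot like any other component.
p_bw <- p + theme_bw()        # white background with grey grid lines
p_min <- p + theme_minimal()  # no panel border or background
p_bw
```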

The *Wall Street Journal* theme `ggthemes::theme_wsj` produces:

```
ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety))) +
    ggthemes::theme_wsj()
```

A theme to closely match the style used in Cleveland’s 1993 book *Visualizing Data* and used by the base and lattice functions can be defined as

```
theme_dotplotx <- function() {
    theme(
        ## remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        ## explicitly set the horizontal lines (or they will disappear too)
        panel.grid.major.y = element_line(color = "black", linetype = 3),
        axis.text.y = element_text(size = rel(1.2)),
        ## use a white background
        panel.background = element_rect(fill = "white", colour = NA),
        panel.border = element_rect(fill = NA, colour = "grey20"))
}
```

This produces

```
ggplot(barley_variety_means) +
    geom_point(aes(x = avg_yield, y = as.character(variety))) +
    theme_dotplotx()
```