---
title: "Introduction to Data Visualization"
output:
  html_document:
    toc: yes
    code_download: true
---
```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE,
fig.height = 5, fig.width = 6, fig.align = "center")
```
## Background
The goals of a visualization can vary:
* exploration and understanding
* presentation and explanation
* engagement
Usually
* initial data analysis tries to explore and understand the data;
* a scientific paper will use visualizations to present and explain results;
* magazines, newspapers, and on-line publications will use them to
attract and engage readers.
A project may involve all three.
Four terms are in common use:
* statistical graphics;
* data visualization;
* information visualization;
* infographics.
Some view these as interchangeable; others view them as a continuum.
Visualizations can be
* static, possibly printed on paper;
* dynamic, involving animation or interactive features.
Historically, only static graphics were available.
Static graphics remain very useful for exploration, especially if they
can be created quickly and easily.
Interactive graphics are very effective for engagement and are used
heavily in on-line publications.
Traditional scientific publications are mostly limited to static
visualizations, though on-line supplements are becoming more common.
We will focus primarily on static visualizations but will also look at
a few interactive options.
## Visualization in the Data Analysis Process
A data-driven project typically involves several cycles of
* importing, cleaning, and arranging the data (_data wrangling_)
* exploring and understanding the data
* modeling the data to separate signal from noise
* exploring residuals and other aspects of model/data mismatch
* possibly getting additional data
* communicating the results
A figure that is often used to capture these steps:
```{r, include = FALSE}
library(nomnoml)
```
Visualization can help at each stage and is often crucial for
* understanding the nature of the original data;
* understanding a complex fitted model;
* understanding departures from the model.
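As a minimal sketch of the last point, a plot of residuals against fitted values can show structure that a fitted line leaves behind. The built-in `cars` data and the straight-line model are assumptions chosen only for illustration.

```{r}
library(ggplot2)

# Fit a simple straight-line model to the built-in `cars` data
# (an illustrative choice, not a recommended model).
fit <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals for plotting.
resid_df <- data.frame(fitted = fitted(fit), residual = resid(fit))

# Residuals versus fitted values; the smooth helps reveal any
# systematic pattern the model has missed.
p_resid <- ggplot(resid_df, aes(x = fitted, y = residual)) +
    geom_hline(yintercept = 0, linetype = "dashed") +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p_resid
```

Curvature in the smooth, or spread that changes with the fitted value, would suggest a departure from the model.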
Visualizing the data should almost always come before modeling or
summarizing.
A famous example created by Anscombe (1973):
```{r, echo = FALSE, message = FALSE}
library(ggplot2)
anscombe1 <- reshape(anscombe, varying = 1 : 8,
                     direction = "long", sep = "", timevar = "set")
ggplot(anscombe1) +
    geom_point(aes(x = x, y = y)) +
    geom_smooth(aes(x = x, y = y), method = "lm", se = FALSE) +
    facet_wrap(~ set)
```
The regression lines for all four groups are essentially identical!
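The numerical summaries agree nearly as closely as the fitted lines. The sketch below computes a few of them from the built-in `anscombe` data frame; the name `anscombe_summary` is introduced here only for illustration.

```{r}
# Summary statistics and least-squares coefficients for each of the
# four Anscombe sets (columns x1..x4 and y1..y4 of `anscombe`).
anscombe_summary <- do.call(rbind, lapply(1:4, function(i) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    fit <- lm(y ~ x)
    data.frame(set = i,
               mean_x = mean(x), mean_y = mean(y),
               sd_x = sd(x), sd_y = sd(y),
               cor_xy = cor(x, y),
               intercept = unname(coef(fit)[1]),
               slope = unname(coef(fit)[2]))
}))
round(anscombe_summary, 2)
```

Only the plots reveal how different the four sets are.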
Another set of examples in the same spirit is provided by the package
[`datasauRus`](https://cran.r-project.org/package=datasauRus); the
package
[_vignette_](https://cran.r-project.org/package=datasauRus/vignettes/Datasaurus.html)
shows the examples.
## Some Historical Graphics
Constructing graphics quickly and easily relies heavily on computing,
but a computer isn't necessary to create effective graphics.
Many graphical ideas and elaborate statistical graphs were
created in the 1800s and before.
The following are some classical examples.
### William Playfair
Playfair's _The Commercial and Political Atlas_ and
_Statistical Breviary_ introduced a number of new graphs, including:
[A bar graph:](http://en.wikipedia.org/wiki/File:Playfair_Barchart.gif)
```{r, echo = FALSE, out.width = "80%"}
knitr::include_graphics(IMG("Playfair_Barchart.gif"))
```
[A pie chart:](http://en.wikipedia.org/wiki/File:Playfair-piechart.jpg)
```{r, echo = FALSE, out.width = "60%"}
knitr::include_graphics(IMG("Playfair-piechart.jpg"))
```
### Charles Joseph Minard
Minard developed
[many elaborate graphs](http://www.datavis.ca/gallery/minbib.php),
some available as
[thumbnail images](http://www.datavis.ca/gallery/minbib/), including
an illustration of
[Napoleon's Russia campaign](http://en.wikipedia.org/wiki/File:Minard.png)
```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("Minard.png"))
```
This can be [recreated
approximately](
`r WLNK("examples.html#minards-graph-of-napoleons-russia-campaign")`)
in R.
### Florence Nightingale
Florence Nightingale used a [polar area
diagram](img/Nightingale-mortality.jpg) to illustrate causes of death
among British troops in the Crimean War.
```{r, echo = FALSE, out.width = "60%"}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/1/17/Nightingale-mortality.jpg")
```
An approximate recreation in R is
[available](`r WLNK("amounts.html#polar-area-charts")`).
### John Snow
```{r, echo = FALSE, out.width = "90%"}
knitr::include_graphics(IMG("snowmap_1854.jpg"))
```
[John
Snow](http://sphweb.bumc.bu.edu/otlt/mph-modules/ep/ep713_history/EP713_History6.html)
used a [map](http://www.ph.ucla.edu/epi/snow/snowmap1.pdf) ([higher
resolution](http://www.ph.ucla.edu/epi/snow/snowmap1_1854_lge.htm))
to identify the source of the 1854 London cholera epidemic.
The data is
[available](http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/),
and has been used for some [interactive
visualizations](http://freakonometrics.hypotheses.org/19499).
A [short movie](http://www.snowthemovie.com/) has recently been
produced.
### Statistical Atlas of the United States
```{r, echo = FALSE}
knitr::include_graphics(IMG("us-atlas.png"))
```
A Statistical Atlas of the US from the late 1800s shows a number of
nice
[examples](https://web.archive.org/web/20161018162622/http://www.handsomeatlas.com/).
The complete atlases are also
[available](https://www.census.gov/history/www/reference/publications/statistical_atlases_1.html).
There is also [a project to show modern data in a similar
style](http://projects.flowingdata.com/atlas/).
### Some References
* Edward Tufte (1983), _The Visual Display of Quantitative
Information_.
* Michael Friendly (2008), "The Golden Age of Statistical
Graphics," _Statistical Science_ 23(4), 502–535.
* Michael Friendly's [Historical
Milestones](http://www.datavis.ca/gallery/historical.php)
* [A Wikipedia
entry](http://en.wikipedia.org/wiki/Information_graphics#History)
## Graphics Software
Most statistical systems provide software for producing static graphics.
Statistical static graphics software typically provides
* a variety of standard plots with reasonable default configurations
  for things like
    * bin widths
    * axis scaling
    * aspect ratio
* ability to customize plot attributes
* ability to add information to plots, such as
    * legends
    * additional points, lines
    * annotations
    * superimposed plots
* ability to produce new kinds of plots
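A minimal sketch of a few of these features, using `ggplot2` and the built-in `faithful` data (both assumptions made for illustration): the first plot relies on the default bin choice, the second sets the bin width explicitly and adds a reference line, an annotation, and axis labels.

```{r}
library(ggplot2)

# Default configuration: geom_histogram() chooses the number of bins
# and emits a message suggesting that a better value be picked.
p_default <- ggplot(faithful, aes(x = eruptions)) +
    geom_histogram()

# Customized version: explicit bin width, a dashed line at the mean,
# a text annotation near the top of the panel, and axis labels.
mean_eruption <- mean(faithful$eruptions)
p_custom <- ggplot(faithful, aes(x = eruptions)) +
    geom_histogram(binwidth = 0.25, fill = "grey70", color = "black") +
    geom_vline(xintercept = mean_eruption, linetype = "dashed") +
    annotate("text", x = mean_eruption, y = Inf, label = "mean",
             vjust = 1.5, hjust = -0.1) +
    labs(x = "Eruption duration (minutes)", y = "Count")

p_default
p_custom
```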
Some software is more flexible than others.
Non-statistical graph or chart software often emphasizes appearance
over content: results may look pretty, but content is hard to extract
(e.g. 3D pie charts).
Chart drawing packages can be used to produce good statistical graphs
but they may not make it easy.
Some newspapers and magazines have very good graphics departments, including
* New York Times
* Economist
* Guardian
* LA Times
Sometimes tools like Adobe Illustrator or
[Inkscape](https://inkscape.org/en/) can be used to edit and improve
graphics produced by statistical software.
[NY Times graphics creators](http://blog.revolutionanalytics.com/2010/12/data-visualization-practices-at-the-new-york-times.html)
often create initial graphs in R and enhance them in Adobe Illustrator.
## Graphics in R
R has several flexible static graphics systems, including
* _base graphics_, or _standard graphics_, provided by the `graphics`
package in the base R distribution;
* _trellis graphics_, or _lattice graphics_, provided by the `lattice`
package (one of the standard _Recommended_ packages typically
bundled with R);
* [`ggplot2`](http://ggplot2.tidyverse.org/) based on Wilkinson's _Grammar of
Graphics_ and available from CRAN.
We will mostly be using `ggplot2`.
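To give a rough feel for how the three systems differ, the sketch below draws the same scatterplot in each. The built-in `mtcars` data and the axis labels are assumptions chosen for illustration.

```{r}
library(lattice)
library(ggplot2)

# Base graphics: the plot is drawn immediately as a side effect.
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# lattice: xyplot() returns a "trellis" object that is drawn when printed.
p_lattice <- xyplot(mpg ~ wt, data = mtcars,
                    xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(p_lattice)

# ggplot2: a plot object is built up from layers and drawn when printed.
p_gg <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
print(p_gg)
```

Both `lattice` and `ggplot2` produce plot objects that can be stored and modified before drawing; base graphics draws directly on the current device.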
## Some Task Levels for Visualization
In evaluating visualization methods it can be useful to think about
several levels of tasks that might be accomplished with a visualization.
A useful list, from highest to lowest level:
* **Analyze**: Identify patterns, distributions, presence of outliers
or clusters, other interesting features.
* **Search**: Look up aspects of a feature known in advance or
revealed by the visualization.
* **Query**: Identify, compare features of individual items.
Each higher level builds on the levels below.
As we look at different methods it is useful to consider the tasks
they are suited for.
## Scalability
Data sets come in many different sizes and shapes.
Some techniques work well for smaller data sets but deteriorate in
effectiveness as size increases.
Sometimes modifications are available that slow the deterioration.
Other methods scale better, though usually at the expense of giving up
some level of detail.
As we look at different methods it is useful to consider the scale of data
sets they are suited for.
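A minimal sketch of these trade-offs, using simulated data (the sample size, the seed, and the data frame `big` are assumptions for illustration): a plain scatterplot suffers from overplotting, alpha blending slows the deterioration, and two-dimensional binning scales further at the cost of no longer showing individual points.

```{r}
library(ggplot2)

# Simulated data: two correlated variables, many observations.
set.seed(1)
n <- 20000
big <- data.frame(x = rnorm(n))
big$y <- big$x + rnorm(n)

# Plain scatterplot: the center becomes a solid blob.
p_points <- ggplot(big, aes(x = x, y = y)) +
    geom_point()

# Modification: small, semi-transparent points show density better.
p_alpha <- ggplot(big, aes(x = x, y = y)) +
    geom_point(alpha = 0.05, size = 0.5)

# Scalable alternative: counts in rectangular bins replace the points.
p_binned <- ggplot(big, aes(x = x, y = y)) +
    geom_bin2d(bins = 50)

p_points
p_alpha
p_binned
```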
## Some Interactive Visualizations
* [Bloomberg's hottest year on record
visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/)
* [Wealth & Health of Nations](http://www.gapminder.org/tools) at
[Gapminder](http://www.gapminder.org/) and Hans Rosling's [200 years
that changed the
world](https://www.gapminder.org/videos/200-years-that-changed-the-world/)
video.
* [Gun violence at the local
level](https://www.theguardian.com/us-news/ng-interactive/2017/jan/09/special-report-fixing-gun-violence-in-america)
* [Paths to the White House in
2012](http://www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html?_r=2&)
* [U.S. Population Pyramid From
1980-2050](https://www.visualcapitalist.com/us-population-pyramid-1980-2050/)
* LA Times years in graphics:
[2014](http://graphics.latimes.com/2014-in-graphics/) and
[2015](http://graphics.latimes.com/2015-in-graphics/)