---
title: "Introduction to Data Visualization"
output:
  html_document:
    toc: yes
    code_download: true
---

```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
```

## Background

The goals of a visualization can vary:

* exploration and understanding
* presentation and explanation
* engagement

Usually

* initial data analysis tries to explore and understand the data;
* a scientific paper will use visualizations to present and explain results;
* magazines, newspapers, and on-line publications will use them to attract and engage readers.

A project may involve all three.

Four terms are in common use:

* statistical graphics;
* data visualization;
* information visualization;
* infographics.

Some view these as interchangeable; others view them as a continuum.

Visualizations can be

* static, possibly printed on paper;
* dynamic, involving animation or interactive features.

Historically, only static graphics were available.

Static graphics remain very useful for exploration, especially if they can be created quickly and easily.

Interactive graphics are very effective for engagement and are used heavily in on-line publications.

Traditional scientific publications are mostly limited to static visualizations, though on-line supplements are becoming more common.

We will focus primarily on static visualizations but will also look at a few interactive options.

## Visualization in the Data Analysis Process

A data-driven project typically involves several cycles of

* importing, cleaning, and arranging the data (_data wrangling_)
* exploring and understanding the data
* modeling the data to separate signal from noise
* exploring residuals and other aspects of model/data mismatch
* possibly getting additional data
* communicating the results

A figure that is often used to capture these steps:

```{r, include = FALSE}
library(nomnoml)
```

```{nomnoml, echo = FALSE}
#padding: 25
#fontsize: 18
#fill: #E1DAFF; #D4A9FF
#stroke: #8515C7
#linewidth: 2

[Import] -> [Understand]

[Understand |
  [Wrangle] -> [Visualize]
  [Visualize] -> [Model]
  [Model] -> [Wrangle]
]

[Understand] -> [Communicate]
```

Visualization can help at each stage and is often crucial for

* understanding the nature of the original data;
* understanding a complex fitted model;
* understanding departures from the model.

Visualizing the data should almost always come before modeling or summarizing.

A famous example created by Anscombe (1973):

```{r, echo = FALSE, message = FALSE}
library(ggplot2)
anscombe1 <- reshape(anscombe, varying = 1 : 8, direction = "long",
                     sep = "", timevar = "set")
ggplot(anscombe1) +
    geom_point(aes(x = x, y = y)) +
    geom_smooth(aes(x = x, y = y), method = "lm", se = FALSE) +
    facet_wrap(~ set)
```

The regression lines for all four groups are essentially identical!

Another set of examples in the same spirit is provided by the package [`datasauRus`](https://cran.r-project.org/package=datasauRus); the package [_vignette_](https://cran.r-project.org/package=datasauRus/vignettes/Datasaurus.html) shows the examples.

## Some Historical Graphics

Easy construction of graphics is highly computational, but a computer isn't necessary.

Many graphical ideas and elaborate statistical graphs were created in the 1800s and before.

The following are some classical examples.

### William Playfair

[Playfair](https://en.wikipedia.org/wiki/William_Playfair)'s _The Commercial and Political Atlas_ and _Statistical Breviary_ (1801) introduced a number of new graphs, including:

[A bar graph:](https://en.wikipedia.org/wiki/File:Playfair_Barchart.gif)

```{r, echo = FALSE, out.width = "80%"}
knitr::include_graphics(IMG("Playfair_Barchart.gif"))
```

[A pie chart:](https://en.wikipedia.org/wiki/File:Playfair-piechart.jpg)

```{r, echo = FALSE, out.width = "60%"}
knitr::include_graphics(IMG("Playfair-piechart.jpg"))
```

### Charles Joseph Minard

Minard developed [many elaborate graphs](https://www.datavis.ca/gallery/minbib.php), some available as [thumbnail images](https://www.datavis.ca/gallery/minbib/), including an illustration of [Napoleon's Russia campaign](https://en.wikipedia.org/wiki/File:Minard.png):

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("Minard.png"))
```

This can be [recreated approximately](`r WLNK("examples.html#minards-graph-of-napoleons-russia-campaign")`) in R.

### Florence Nightingale

Florence Nightingale used a [polar area diagram](img/Nightingale-mortality.jpg) to illustrate causes of death among British troops in the Crimean War.

```{r, echo = FALSE, out.width = "60%"}
knitr::include_graphics(IMG("Nightingale-mortality.jpg"))
```

An approximate recreation in R is [available](`r WLNK("amounts.html#polar-area-charts")`).

### John Snow

```{r, echo = FALSE, out.width = "90%"}
knitr::include_graphics(IMG("snowmap_1854.jpg"))
```

[John Snow](https://sphweb.bumc.bu.edu/otlt/mph-modules/ep/ep713_history/EP713_History6.html) used a [map](https://web.archive.org/web/20131209205201/http://www.ph.ucla.edu/epi/snow/snowmap1.pdf) ([higher resolution](https://web.archive.org/web/20220807223704/http://www.ph.ucla.edu/epi/snow/snowmap1_1854_lge.htm)) to identify the source of the 1854 London cholera epidemic.
The data is [available](https://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/), and has been used for some [interactive visualizations](https://freakonometrics.hypotheses.org/19499).

A [short movie](http://www.snowthemovie.com/) was produced in 2013.

### Statistical Atlas of the United States

```{r, echo = FALSE}
knitr::include_graphics(IMG("us-atlas.png"))
```

A Statistical Atlas of the US from the late 1800s shows a number of nice [examples](https://web.archive.org/web/20161018162622/http://www.handsomeatlas.com/).

The complete atlases are also [available](https://www.census.gov/history/www/reference/publications/statistical_atlases_1.html).

[A project to show modern data in a similar style](http://projects.flowingdata.com/atlas/).

### Some References

* Edward Tufte (1983), _The Visual Display of Quantitative Information_.
* Michael Friendly (2008), "The Golden Age of Statistical Graphics," _Statistical Science_ 23(4), 502–535.
* Michael Friendly's [Historical Milestones](http://www.datavis.ca/gallery/historical.php).
* [A Wikipedia entry](https://en.wikipedia.org/wiki/Information_graphics#History).

## Graphics Software

Most statistical systems provide software for producing static graphics.

Statistical static graphics software typically provides

* a variety of standard plots with reasonable default configurations for things like
    * bin widths
    * axis scaling
    * aspect ratio
* ability to customize plot attributes
* ability to add information to plots, such as
    * legends
    * additional points, lines
    * annotations
    * superimposed plots
* ability to produce new kinds of plots

Some software is more flexible than others.

Non-statistical graph or chart software often emphasizes appearance over content: results may look pretty, but content is hard to extract (e.g. 3D pie charts).

Chart drawing packages can be used to produce good statistical graphs but they may not make it easy.

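
The capabilities listed above (defaults, customization, and added information) can be sketched with base R graphics. This is only an illustrative sketch: the choice of the built-in `faithful` data set, the number of breaks, and the label text are assumptions made for the example, not part of the original notes.

```{r}
# Default histogram: bin widths and axis scaling are chosen automatically.
h_default <- hist(faithful$eruptions)

# Customized version: more (narrower) bins, a density scale, clearer labels.
# breaks = 30 is only a suggestion to hist(), which picks "pretty" cut points.
h_custom <- hist(faithful$eruptions, breaks = 30, freq = FALSE,
                 main = "Old Faithful eruptions",
                 xlab = "Eruption duration (minutes)")

# Adding information to the existing plot.
m <- mean(faithful$eruptions)
lines(density(faithful$eruptions), lwd = 2)    # superimposed density estimate
abline(v = m, lty = 2)                         # additional line at the mean
text(x = m, y = 0.1, labels = "mean", pos = 4) # annotation next to the line
legend("top", legend = c("kernel density estimate", "mean"),
       lty = c(1, 2), lwd = c(2, 1), bty = "n")
```

The first call relies entirely on defaults; the later calls show how the same plot can be adjusted and then annotated step by step.
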
Some newspapers and magazines have very good graphics departments, including

* New York Times
* Economist
* Guardian
* LA Times

Sometimes tools like Adobe Illustrator or [Inkscape](https://inkscape.org/en/) can be used to edit and improve graphics produced by statistical software.

[NY Times graphics creators](https://blog.revolutionanalytics.com/2010/12/data-visualization-practices-at-the-new-york-times.html) often create initial graphs in R and enhance them in Adobe Illustrator.

## Graphics in R

R has several flexible static graphics systems, including

* _base graphics_, or _standard graphics_, provided by the `graphics` package in the base R distribution;
* _trellis graphics_, or _lattice graphics_, provided by the `lattice` package (one of the standard _Recommended_ packages typically bundled with R);
* [`ggplot2`](https://ggplot2.tidyverse.org/) based on Wilkinson's _Grammar of Graphics_ and available from CRAN.

We will mostly be using `ggplot2`.

## Some Task Levels for Visualization

In evaluating visualization methods it can be useful to think about several levels of tasks that might be accomplished with a visualization.

A useful list, from highest to lowest level:

* **Analyze**: Identify patterns, distributions, presence of outliers or clusters, other interesting features.
* **Search**: Look up aspects of a feature known in advance or revealed by the visualization.
* **Query**: Identify, compare features of individual items.

Each higher level builds on the levels below.

As we look at different methods it is useful to consider the tasks they are suited for.

## Scalability

Data sets come in many different sizes and shapes.

Some techniques work well for smaller data sets but deteriorate in effectiveness as size increases.

Sometimes modifications are available that slow the deterioration.

Other methods scale better, though usually at the expense of giving up some level of detail.

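
As a rough illustration of these points, the sketch below draws the same simulated data three ways with `ggplot2`. The data, the sample size, and the particular `alpha`, `size`, and `bins` values are assumptions chosen only for the example.

```{r}
library(ggplot2)

set.seed(1)                        # reproducible simulated data
n <- 50000
big <- data.frame(x = rnorm(n))
big$y <- big$x + rnorm(n)

# A plain scatter plot: at this size the points overplot into a solid blob.
p_points <- ggplot(big, aes(x, y)) + geom_point()

# A modification that slows the deterioration: small, semi-transparent points.
p_alpha <- ggplot(big, aes(x, y)) + geom_point(alpha = 0.05, size = 0.5)

# A method that scales better but gives up detail: counts in rectangular bins
# replace the individual points.
p_bins <- ggplot(big, aes(x, y)) + geom_bin2d(bins = 50)

p_points
p_alpha
p_bins
```

The binned version no longer shows individual observations, but what it draws depends on the number of bins rather than the number of points.
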
As we look at different methods it is useful to consider the scale of data sets they are suited for.

## Some Interactive Visualizations

* [Bloomberg's hottest year on record visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/)
* [Wealth & Health of Nations](https://www.gapminder.org/tools/) at [Gapminder](https://www.gapminder.org/) and Hans Rosling's [200 years that changed the world](https://www.gapminder.org/videos/200-years-that-changed-the-world/) video.
* [Gun violence at the local level](https://www.theguardian.com/us-news/ng-interactive/2017/jan/09/special-report-fixing-gun-violence-in-america)
* [Paths to the White House in 2012](https://archive.nytimes.com/www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html?_r=2)
* [U.S. Population Pyramid From 1980-2050](https://www.visualcapitalist.com/us-population-pyramid-1980-2050/)
* LA Times years in graphics: [2014](https://graphics.latimes.com/2014-in-graphics/), [2015](https://graphics.latimes.com/2015-in-graphics/), and [2020](https://www.latimes.com/projects/data-and-graphics-2020/).
* NY Times year in graphics: [2022](https://www.nytimes.com/interactive/2022/12/28/us/2022-year-in-graphics.html)
* [Some Tableau visualizations from 2022](https://www.tableau.com/blog/tableau-public-viz-wrap-interesting-data-visualizations-2022).