class: center, middle, title-slide

.title[
# Introduction to Data Visualization
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-06
]

---
layout: true

<link rel="stylesheet" href="stat4580.css" type="text/css" />

## Background

---

The goals of a visualization can vary:

* exploration and understanding

--

* presentation and explanation

--

* engagement

--

Usually

* initial data analysis tries to explore and understand the data;

--

* a scientific paper will use visualizations to present and explain results;

--

* magazines, newspapers, and on-line publications will use them to attract and engage readers.

--

A project may involve all three.

---

Four terms are in common use:

* statistical graphics;

--

* data visualization;

--

* information visualization;

--

* infographics.

--

Some view these as interchangeable; others view them as a continuum.

--

Visualizations can be

* static, possibly printed on paper;

--

* dynamic, involving animation or interactive features.

---

Historically, only static graphics were available.

--

Static graphics remain very useful for exploration, especially if they can be created quickly and easily.

--

Interactive graphics are very effective for engagement and are used heavily in on-line publications.

--

Traditional scientific publications are mostly limited to static visualizations, though on-line supplements are becoming more common.

--

We will focus primarily on static visualizations but will also look at a few interactive options.

---
layout: true

## Visualization in the Data Analysis Process

---

A data-driven project typically involves several cycles of

--

* importing, cleaning, and arranging the data (_data wrangling_)

--

* exploring and understanding the data

--

* modeling the data to separate signal from noise

--

* exploring residuals and other aspects of model/data mismatch

--

* possibly getting additional data

--

* communicating the results

---

.pull-left[

A figure that is often used to capture these steps:

<!-- ## nolint start -->
<center>
</center>
<!-- ## nolint end -->

]

--

.pull-right[

Visualization can help at each stage and is often crucial for

{{content}}

]

--

* understanding the nature of the original data;

{{content}}

--

* understanding a complex fitted model;

{{content}}

--

* understanding departures from the model.

{{content}}

--

Visualizing the data should almost always come before modeling or summarizing.

---

.pull-left[

A famous example created by Anscombe (1973):

<img src="data:image/png;base64,#visintro_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

]

--

.pull-right[

The regression lines for all four groups are essentially identical!

{{content}}

]

--

<!--
## simple base R version:
sp <- split(anscombe1, anscombe1$set)
t(sapply(sp, function(d) coef(lm(y ~ x, data = d))))

## tidyverse version:
nest(anscombe1, data = -set) %>%
    mutate(fit = lapply(data, function(d) tidy(lm(y ~ x, data = d)))) %>%
    unnest(fit) %>%
    select(set, term, estimate) %>%
    pivot_wider(names_from = "term", values_from = "estimate")
-->

Another set of examples in the same spirit is provided by the package [`datasauRus`](https://cran.r-project.org/package=datasauRus); the package [_vignette_](https://cran.r-project.org/package=datasauRus/vignettes/Datasaurus.html) shows the examples.

---
layout: true

## Some Historical Graphics

---

Easy construction of graphics is highly computational, but a computer isn't necessary.

--

Many graphical ideas and elaborate statistical graphs were created in the 1800s and before.

--

The following are some classical examples.

---

### William Playfair

[Playfair](https://en.wikipedia.org/wiki/William_Playfair)'s _The Commercial and Political Atlas_ (1786) and _Statistical Breviary_ (1801) introduced a number of new graphs, including:

.pull-left[

[A bar graph:](https://en.wikipedia.org/wiki/File:Playfair_Barchart.gif)

<img src="data:image/png;base64,#../img/Playfair_Barchart.gif" width="80%" style="display: block; margin: auto;" />

]

--

.pull-right[

[A pie chart:](https://en.wikipedia.org/wiki/File:Playfair-piechart.jpg)

<img src="data:image/png;base64,#../img/Playfair-piechart.jpg" width="60%" style="display: block; margin: auto;" />

]

---

### Charles Joseph Minard

Minard developed [many elaborate graphs](https://www.datavis.ca/gallery/minbib.php), some available as [thumbnail images](https://www.datavis.ca/gallery/minbib/), including an illustration of [Napoleon's Russia campaign](https://en.wikipedia.org/wiki/File:Minard.png).

<img src="data:image/png;base64,#../img/Minard.png" width="70%" style="display: block; margin: auto;" />

--

This can be [recreated approximately](../examples.html#minards-graph-of-napoleons-russia-campaign) in R.

---

### Florence Nightingale

Florence Nightingale used a [polar area diagram](img/Nightingale-mortality.jpg) to illustrate causes of death among British troops in the Crimean War.

<img src="data:image/png;base64,#../img/Nightingale-mortality.jpg" width="60%" style="display: block; margin: auto;" />

--

An approximate recreation in R is [available](../amounts.html#polar-area-charts).

---

### John Snow

.pull-left[

<img src="data:image/png;base64,#../img/snowmap_1854.jpg" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

[John Snow](https://sphweb.bumc.bu.edu/otlt/mph-modules/ep/ep713_history/EP713_History6.html) used a [map](https://web.archive.org/web/20131209205201/http://www.ph.ucla.edu/epi/snow/snowmap1.pdf) ([higher resolution](https://web.archive.org/web/20220807223704/http://www.ph.ucla.edu/epi/snow/snowmap1_1854_lge.htm)) to identify the source of the 1854 London cholera epidemic.

{{content}}

]

--

<!-- An [enhanced version](http://www.datavis.ca/gallery/historical.php#snow) is available at <http://www.datavis.ca>.-->

The data is [available](https://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/), and has been used for some [interactive visualizations](https://freakonometrics.hypotheses.org/19499).

{{content}}

--

A [short movie](http://www.snowthemovie.com/) was produced in 2013.

---

### Statistical Atlas of the United States

.pull-left[

<img src="data:image/png;base64,#../img/us-atlas.png" width="1525" style="display: block; margin: auto;" />

]

.pull-right[

A Statistical Atlas of the US from the late 1800s shows a number of nice [examples](https://web.archive.org/web/20161018162622/http://www.handsomeatlas.com/).

The complete atlases are also [available](https://www.census.gov/history/www/reference/publications/statistical_atlases_1.html).

{{content}}

]

--

[A project to show modern data in a similar style](http://projects.flowingdata.com/atlas/).

---

### Some References

* Edward Tufte (1983), _The Visual Display of Quantitative Information_.

* Michael Friendly (2008), "The Golden Age of Statistical Graphics," _Statistical Science_ 23(4), 502–535.

* Michael Friendly's [Historical Milestones](http://www.datavis.ca/gallery/historical.php) at <http://www.datavis.ca/>

* [A Wikipedia entry](https://en.wikipedia.org/wiki/Information_graphics#History)

---
layout: true

## Graphics Software

---

Most statistical systems provide software for producing static graphics.

--

Statistical static graphics software typically provides

--

* a variety of standard plots with reasonable default configurations for things like
    * bin widths
    * axis scaling
    * aspect ratio

--

* ability to customize plot attributes

--

* ability to add information to plots, such as
    * legends
    * additional points, lines
    * annotations
    * superimposed plots

--

* ability to produce new kinds of plots

---

Some software is more flexible than others.

--

<!-- Dynamic graphical software should provide similar flexibility but often does not. -->

Non-statistical graph or chart software often emphasizes appearance over content: results may look pretty, but content is hard to extract (e.g. 3D pie charts).

--

Chart drawing packages can be used to produce good statistical graphs but they may not make it easy.

--

Some newspapers and magazines have very good graphics departments, including

* New York Times
* Economist
* Guardian
* LA Times

--

Sometimes tools like Adobe Illustrator or [Inkscape](https://inkscape.org/en/) can be used to edit and improve graphics produced by statistical software.

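As a rough illustration of the capabilities listed earlier in this section (default configurations, customizable attributes, added information), here is a minimal base R sketch. It assumes the built-in `faithful` data set; the request for about 30 bins, the labels, and the temporary PDF file are arbitrary choices made for illustration only.

```r
## Sketch only: built-in `faithful` data; titles and break count are arbitrary.
out <- tempfile(fileext = ".pdf")
pdf(out)                          # file device, so the code runs non-interactively
op <- par(mfrow = c(1, 2))        # two plots side by side

## A standard plot with default configuration (bin widths chosen by R).
h_default <- hist(faithful$eruptions,
                  main = "Default bins", xlab = "Eruption time (min)")

## Customizing a plot attribute: request roughly 30 bins.
h_fine <- hist(faithful$eruptions, breaks = 30,
               main = "About 30 bins", xlab = "Eruption time (min)")

## Adding information: a reference line, a legend, and an annotation.
m <- mean(faithful$eruptions)
abline(v = m, lty = 2)
legend("top", legend = "mean", lty = 2, bty = "n")
text(m, max(h_fine$counts), labels = round(m, 2), pos = 4)

par(op)
invisible(dev.off())
```

The default histogram hides some of the detail that the finer bins reveal, which is one reason the ability to override defaults matters.
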
--

[NY Times graphics creators](https://blog.revolutionanalytics.com/2010/12/data-visualization-practices-at-the-new-york-times.html) often create initial graphs in R and enhance them in Adobe Illustrator.

---
layout: false

## Graphics in R

R has several flexible static graphics systems, including

--

* _base graphics_, or _standard graphics_, provided by the `graphics` package in the base R distribution;

--

* _trellis graphics_, or _lattice graphics_, provided by the `lattice` package (one of the standard _Recommended_ packages typically bundled with R);

--

* [`ggplot2`](https://ggplot2.tidyverse.org/) based on Wilkinson's _Grammar of Graphics_ and available from CRAN.

--

We will mostly be using `ggplot2`.

<!--
Some interactive exploratory graphics packages include

* `ggobi` for exploring point clouds in higher dimensions
* `rgl` for 3D rendering and viewing
* `iplots`, a Java-based dynamic graphics system for linked plots

### Some Internal Structure

Static graphics output is produced by a _device_. These can be

* screen devices, like `Windows`, `Quartz`, `X11`
* file devices, like `pdf` or `png`

Devices are used through a _device-independent layer_ implemented in the `grDevices` package.

Base graphics is implemented directly using this layer.

`lattice` and `ggplot2` are built on an intermediate framework known as _grid graphics_ and implemented in the `grid` package.

Using `grid` features is occasionally useful for arranging multiple `lattice` or `ggplot2` plots on a page.

<img src="data:image/png;base64,#../img/Rgraphics.jpg" width="1400" style="display: block; margin: auto;" />
-->

---

## Some Task Levels for Visualization

In evaluating visualization methods it can be useful to think about several levels of tasks that might be accomplished with a visualization.

--

A useful list, from highest to lowest level:

--

* **Analyze**: Identify patterns, distributions, presence of outliers or clusters, other interesting features.

--

* **Search**: Look up aspects of a feature known in advance or revealed by the visualization.

--

* **Query**: Identify, compare features of individual items.

--

Each higher level builds on the levels below.

--

As we look at different methods it is useful to consider the tasks they are suited for.

---

## Scalability

Data sets come in many different sizes and shapes.

--

Some techniques work well for smaller data sets but deteriorate in effectiveness as size increases.

--

Sometimes modifications are available that slow the deterioration.

--

Other methods scale better, though usually at the expense of giving up some level of detail.

--

As we look at different methods it is useful to consider the scale of data sets they are suited for.

---

## Some Interactive Visualizations

* [Bloomberg's hottest year on record visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/)

<!--
https://www.ncdc.noaa.gov/cag/time-series/
http://www.npr.org/sections/thetwo-way/2017/01/18/510405739/2016-was-the-hottest-year-yet-scientists-declare-->

--

* [Wealth & Health of Nations](https://www.gapminder.org/tools/) at [Gapminder](https://www.gapminder.org/) and Hans Rosling's [200 years that changed the world](https://www.gapminder.org/videos/200-years-that-changed-the-world/) video.

--

* [Gun violence at the local level](https://www.theguardian.com/us-news/ng-interactive/2017/jan/09/special-report-fixing-gun-violence-in-america)

--

<!-- Page is still there but data behind visual seems to be gone
* [Subway crime in New York](http://www.nydailynews.com/new-york/nyc-crime/daily-news-analysis-reveals-crime-rankings-city-subway-system-article-1.1836918)
-->

<!-- Paywalled
* [Who was helped by Obamacare](http://www.nytimes.com/interactive/2014/10/29/upshot/obamacare-who-was-helped-most.html?_r=0&abt=0002&abg=0)
-->

* [Paths to the White House in 2012](https://archive.nytimes.com/www.nytimes.com/interactive/2012/11/02/us/politics/paths-to-the-white-house.html?_r=2)

--

* [U.S.
Population Pyramid From 1980-2050](https://www.visualcapitalist.com/us-population-pyramid-1980-2050/)

--

* LA Times years in graphics: [2014](https://graphics.latimes.com/2014-in-graphics/), [2015](https://graphics.latimes.com/2015-in-graphics/), and [2020](https://www.latimes.com/projects/data-and-graphics-2020/).

<!-- NYT and Tableau vis examples: https://tldr.nettime.org/@tb/109609913272606761 -->

--

* NY Times year in graphics: [2022](https://www.nytimes.com/interactive/2022/12/28/us/2022-year-in-graphics.html)

--

* [Some Tableau visualizations from 2022](https://www.tableau.com/blog/tableau-public-viz-wrap-interesting-data-visualizations-2022).
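
---

## Checking the Anscombe Example in R

The claim made earlier about the Anscombe (1973) example, that the regression lines for the four groups are essentially identical, can be checked directly. This is a minimal sketch, assuming the wide-format `anscombe` data frame that ships with R (columns `x1` to `x4` and `y1` to `y4`) rather than the long-format data used to draw the figure in these notes.

```r
## Sketch only: fit y ~ x separately for each of the four Anscombe sets,
## using the built-in wide-format `anscombe` data frame.
fits <- sapply(1:4, function(i) {
    d <- data.frame(x = anscombe[[paste0("x", i)]],
                    y = anscombe[[paste0("y", i)]])
    coef(lm(y ~ x, data = d))     # intercept and slope for set i
})
colnames(fits) <- paste("set", 1:4)

## One row per set, one column per coefficient.
round(t(fits), 2)
```

All four intercepts come out close to 3 and all four slopes close to 0.5, even though the scatterplots look very different. That is the reason for plotting the data before relying on a fitted model.
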