Final Notes

class: center, middle, title-slide

.title[
# Final Notes
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2025-05-08
]

---

## Some Other Topics

Some topics we did not have time to look at:

* Working with models ([Chapter 6 in Healy,
  2018](https://socviz.co/modeling.html); [Tidy Modeling with R](https://www.tmwr.org/)).

* [Visualizing missing
  values](http://naniar.njtierney.com/articles/naniar-visualisation.html).

* Visualizing uncertainty ([Chapter 16 of Wilke,
  2019](https://clauswilke.com/dataviz/visualizing-uncertainty.html)
  and [below](#visualizing-uncertainty)).

* Plot annotation, plot ensembles, and dashboards ([Part II of Wilke,
  2019](https://clauswilke.com/dataviz/proportional-ink.html);
  [Chapter 5 of Healy, 2018](https://socviz.co/workgeoms.html);
  [below](#plot-annotation-plot-ensembles-and-dashboards)).

* Data Science Ethics ([below](#data-science-ethics)).

---
layout: true
## Visualizing Uncertainty: Hurricanes
---

name: visualizing-uncertainty

All estimates from data are associated with some degree of uncertainty.

Effectively communicating that uncertainty in visualizations is
challenging and an active area of
[research](http://space.ucmerced.edu/chapter).

The _cone of uncertainty_: (From Cairo (2019); images from a [blog
post](https://web.archive.org/web/20231202192241/http://www.thefunctionalart.com/2020/01/all-graphics-from-how-charts-lie-freely.html)
by the author.)

.pull-left[
<img src="../img/PDF10.Tropicalstorm.png" width="450" />
]

.pull-right[
The [NHC forecast cone](https://www.nhc.noaa.gov/aboutcone.shtml) is
designed so that two-thirds of historical official forecast errors over
a 5-year sample fall within the cone for a particular time point.
]
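
As a rough illustration of that construction, the sketch below picks a
cone radius as the two-thirds quantile of a set of forecast errors. The
errors are simulated, not NHC data, and the names `track_errors` and
`cone_radius` are hypothetical.

```r
# Simulated (hypothetical) track errors in nautical miles for one lead
# time; real values would come from forecast verification records.
set.seed(1)
track_errors <- rgamma(500, shape = 2, scale = 40)

# Radius chosen so that about two-thirds of the errors fall inside it.
cone_radius <- unname(quantile(track_errors, probs = 2 / 3))

# Share of the sample errors inside the radius.
mean(track_errors <= cone_radius)
```

By construction, roughly one-third of such errors fall outside the radius.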

---

When published in the media, these visualizations are routinely
misinterpreted, something like this:

.pull-left.width-65[
<img src="../img/PDF11.StormWRONGSize.png" width="573" />
]

---

A more effective representation might be something like this, showing
an _ensemble_ of possible tracks:

.pull-left.width-65[
<img src="../img/PDF13.StormLines.png" width="573" />
]

.pull-right.width-30[
An animated version may be more effective, if the presentation medium
permits.
{{content}}
]
--

Developing better visualizations for hurricane forecasting, especially
targeting the public, is an active area of research.

---
layout: true
## Visualizing Uncertainty: Chocolate Bars
---

.pull-left[
[Expert ratings](http://flavorsofcacao.com), on a scale from 0 to 5,
for chocolate bars manufactured in several countries:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-4-1.png)
]

---

.pull-left[
The standard deviations of the data distributions are comparable, but
the lengths of confidence intervals for the mean vary because of the
different sample sizes:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-5-1.png)
]
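
A small sketch of why the interval lengths differ: with a common
standard deviation, the half-width of a t-based interval for a mean
shrinks as the sample size grows. The sample sizes and the standard
deviation below are made up for illustration; they are not the
chocolate data.

```r
# Hypothetical sample sizes and a hypothetical common standard deviation.
n <- c(10, 40, 160)
s <- 0.5

# Half-width of a 95% t-based confidence interval for a mean.
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)

data.frame(n, half_width)
```

Quadrupling the sample size roughly halves the interval length.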

---

.pull-left[
The same plot with a reduced horizontal range:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-6-1.png)
]

---

.pull-left[
A more elaborate display with confidence intervals at several levels:
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-7-1.png)
]

---

.pull-left[
Confidence densities, or confidence distributions, as proposed in

> Adrian W. Bowman. Graphics for Uncertainty. J. R. Statist. Soc. A
> 182:1-16, 2018. [Link](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12379)
]

.pull-right[
![](wrapup_files/figure-html/unnamed-chunk-8-1.png)
]

---

One drawback of all of these methods:

> The least precise measurement draws the most attention.

These examples from Wilke's book use the [`ungeviz`
package](https://github.com/wilkelab/ungeviz) available on GitHub.

Another package providing some tools for uncertainty visualization is
the [`ggdist` package](https://mjskay.github.io/ggdist/).

---
layout: true
## Visualizing Uncertainty: Old Cars
---

Using the very old `mtcars` data set to illustrate estimating a smooth
relationship:

.pull-left[
![](wrapup_files/figure-html/unnamed-chunk-9-1.png)
]

.pull-right[
A default `geom_smooth` shows an estimate along with a point-wise
confidence band.
{{content}}
]
--

This may not give the best sense of the joint uncertainty: if the curve
is higher in some places it may need to be lower in others.
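
A minimal sketch of such a default smooth. The slide does not say which
`mtcars` columns are plotted, so `disp` and `mpg` are assumptions.

```r
library(ggplot2)

# Assumed variables: displacement on the x axis, miles per gallon on y.
p_smooth <- ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point() +
    # Defaults: a loess fit for a small data set, 95% point-wise band.
    geom_smooth()

p_smooth
```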

---

Showing an _ensemble_ of curves that all are plausible can be a better
choice.

.pull-left[
![](wrapup_files/figure-html/unnamed-chunk-10-1.png)
]

.pull-right[
This approach was shown earlier for visualizing possible hurricane paths.
{{content}}
]
--

This ensemble is generated using a _case-based bootstrap_.
{{content}}
--

These plots are called _ensemble plots_ (also spaghetti plots, for
obvious reasons).
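
A sketch of a case-based bootstrap ensemble, not necessarily how the
figure was made: the variables (`disp`, `mpg`), the loess smoother, and
the number of draws are all assumptions.

```r
library(ggplot2)

set.seed(42)
n_draws <- 20  # number of bootstrap curves; an arbitrary choice
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp),
                              length.out = 50))

# One bootstrap draw: resample whole rows (cases) with replacement,
# refit the smoother, and predict on a common grid.
one_draw <- function(i) {
    rows <- sample(nrow(mtcars), replace = TRUE)
    fit <- loess(mpg ~ disp, data = mtcars[rows, ])
    data.frame(draw = i, disp = grid$disp, mpg = predict(fit, grid))
}
curves <- do.call(rbind, lapply(seq_len(n_draws), one_draw))

# One line per draw; loess does not extrapolate, so grid points outside
# a resample's range are NA and are dropped quietly.
ggplot(mtcars, aes(disp, mpg)) +
    geom_point() +
    geom_line(aes(group = draw), data = curves, alpha = 0.4, na.rm = TRUE)
```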

---

If animation is available, an alternative is to show the curves one at
a time in an animation.

.pull-left[
![](wrapup_files/figure-html/unnamed-chunk-11-1.gif)
]

.pull-right[
Again, a bootstrap is used to produce the estimates.
{{content}}
]
--

This is an example of a _hypothetical outcomes plot_, or _HOP_, as
introduced in

> Hullman, Jessica, Paul Resnick, and Eytan Adar. "Hypothetical
> outcome plots outperform error bars and violin plots for inferences
> about reliability of variable ordering." PLOS ONE 10, no. 11 (2015).

---
layout: true
## Data Quality and Integrity
---

A visualization can accurately reflect data but still be misleading if
the data are faulty.

A [NY Times
article](https://www.nytimes.com/2021/05/03/health/covid-herd-immunity-vaccine.html)
from May 2021 shows a choropleth map of the estimated share of adults
who would "definitely" or "probably" get the COVID-19 vaccine.

.pull-left[
Cutoffs: 49  60   65  70  75  80  91 %

<img src="../img/map-1050.png" width="500" />
]

.pull-right[
The map may accurately reflect the estimates, but the estimates have
obvious problems.
]

---

.pull-left[
The data used for the map are available
[here](https://aspe.hhs.gov/reports/vaccine-hesitancy-covid-19-state-county-local-estimates).
{{content}}
]
--

Discussions on social media suggest that the state level data may be
more reasonable:

.pull-right.width-50[

![](wrapup_files/figure-html/unnamed-chunk-13-1.png)

]

---
layout: false
name: data-science-ethics
## Data Science Ethics

Some issues:

* Data misrepresentation

* Data falsification

* Data privacy

* Data scraping and terms of use

* Algorithmic bias

Some references:

* [Data science
  ethics](https://mdsr-book.github.io/mdsr2e/ch-ethics.html) chapter
  in: Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton
  (2021)  
  [_Modern Data Science with R, 2nd edition_](https://mdsr-book.github.io/mdsr2e/).

* [Data science ethics](https://datasciencebox.org/02-ethics.html) section of
  the online book
  [Data Science in a Box](https://datasciencebox.org/index.html)
  by Mine Çetinkaya-Rundel.

* David Martens (2022) [_Data Science Ethics: Concepts, Techniques, and
  Cautionary Tales_](https://amzn.to/4cYsaWq), Oxford University Press.

* Alberto Cairo (2019) _How Charts Lie: Getting Smarter about Visual
  Information_, W. W. Norton & Company.

---
name: plot-annotation-plot-ensembles-and-dashboards
	
## Plot Annotation, Plot Ensembles, and Dashboards

Plot annotations can create popout and help focus the viewer's
attention.

They may be increasingly important as images are shared online
without context.

Here is an example for the `mpg` data:

![](wrapup_files/figure-html/unnamed-chunk-14-1.png)
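
One way such an annotation might be built, as a sketch only: the code
behind the figure above is not shown, so the highlighted group
(two-seaters), the colours, and the label position are hypothetical
choices.

```r
library(ggplot2)

# Hypothetical choice of what to highlight: the two-seater cars.
two_seaters <- subset(mpg, class == "2seater")

p_annot <- ggplot(mpg, aes(displ, hwy)) +
    geom_point(colour = "grey60") +
    # Redraw the highlighted points in a contrasting colour for popout.
    geom_point(data = two_seaters, colour = "red") +
    # A text label placed above the highlighted cluster.
    annotate("text", x = 6.4, y = 32, label = "Two-seaters")

p_annot
```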

---
layout: true
## Plot Ensembles: Coffee
---

It is often useful to use several graphics to present an analysis.

Collections of related graphs are sometimes called _ensemble graphics_.

Online presentations of analyses involving multiple visualizations
and, typically, some interactive features are also called
_dashboards_.

To aid the viewer it is usually best to design these visualizations
together, with common axis choices and color mappings.
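
A sketch of one way to keep axes and colors consistent: define the
scales once and add the same objects to every plot. The `mpg` data
stand in for a real analysis, and the object names are hypothetical.

```r
library(ggplot2)

# Shared scales, defined once so every panel uses the same mappings.
drv_colours <- scale_colour_brewer(name = "Drive train", palette = "Dark2")
hwy_axis <- scale_y_continuous(name = "Highway mpg",
                               limits = range(mpg$hwy))

# Two related plots that reuse the same colour scale and y axis.
p_displ <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
    geom_point() + drv_colours + hwy_axis
p_cty <- ggplot(mpg, aes(cty, hwy, colour = drv)) +
    geom_point() + drv_colours + hwy_axis
```

The two plots can then be placed side by side with any layout tool
without the legend or axis meaning changing between panels.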

Fig 12.1 in Unwin (2015) provides a simple example:

---

<img src="wrapup_files/figure-html/unnamed-chunk-15-1.png" alt="A dashboard with three plots. A bar chart shows there are about 4 times as many Arabica samples as Robusta samples. A scatterplot of Caffeine against Fat content shows clear separation of the two groups. A parallel coordinates plot shows the 12 values measured on each group." style="float:right; padding:10px" />

Data on the chemical composition of coffee samples collected from
around the world, comprising 43 samples from 29 countries. Each sample
is either of the Arabica or Robusta variety. Twelve of the thirteen
chemical constituents reported in the study are given.  The omitted
variable is total chlorogenic acid; it is generally the sum of the
chlorogenic, neochlorogenic and isochlorogenic acid values.

Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In
_Association Scientifique Internationale du Café, 6th International
Colloquium on Coffee Chemistry_, Bogotá, Colombia, pp. 61-72.

---
layout: true
## Making a Point and Telling a Story
---

In a report, make sure each plot has a point and makes its point.

Make sure to think about:

* axis labels;

* titles and subtitles;

* captions;

* highlighting key features;

* accessibility (e.g. color choice; alt-text).

It is often good to make sure a figure can stand on its own
without asking the reader to search the text for explanations.
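
A sketch of how these pieces can be attached to a `ggplot2` plot with
`labs()`. The wording of the labels is illustrative only; the `alt`
argument needs a reasonably recent `ggplot2`.

```r
library(ggplot2)

p_labeled <- ggplot(mpg, aes(displ, hwy)) +
    geom_point() +
    labs(x = "Engine displacement (litres)",
         y = "Highway fuel economy (miles per gallon)",
         title = "Larger engines tend to have lower fuel economy",
         subtitle = "Cars from model years 1999 and 2008",
         caption = "Data: the mpg data set in ggplot2",
         # Alt text travels with the plot object.
         alt = paste("Scatterplot of highway miles per gallon against",
                     "engine displacement, showing a decreasing trend."))

# Retrieve the alt text, e.g. to check it before publishing.
get_alt_text(p_labeled)
```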

---

Communicating with data is like telling a story, with a starting
point, a journey, and an end.

Sometimes a single visualization can capture the full story.

More often, several visualizations will be needed.

Often it is good to:

* start with a high level overview;

* show how to look at some particular cases, e.g. with a single plot;

* build up to a more complete analysis, e.g. with a multi-panel plot.

With multiple visualizations it is good to make sure that:

* each one works well on its own;

* they work well together (e.g. use consistent styling, colors).

---

There is a [chapter of Wilke,
2019](https://clauswilke.com/dataviz/telling-a-story.html) with more
advice on this.

A recent book length treatment is

> Deborah Nolan and Sara Stoudt (2021) _Communicating with Data_,
> Oxford University Press.

---
layout: true
## Wrapping Up

Some of the areas we covered:
---

### Visualization

Many different types of graphs.

* Strengths, weaknesses.
* Pitfalls.
* Scalability.
* Creating these graphs in R.

Perception

* Channels and mappings; relative effectiveness.
* Using these to assess and design visualizations.
* Effective use of color.

A little on interaction, animation.

Emphasis on techniques useful for exploration, scientific reporting.

---

### Data Technologies

Reading different data formats.

Scraping data from the web.

Cleaning data.

Rearranging data for analysis.

Merging data from several sources.

### Reproducible research tools

`rmarkdown` for integrating code and reporting.

Version control, `git`, `GitLab`.

---
layout: false
## Learning More

Class notes will remain available, in some form, at the class web site.

Some books to look at:

* Alberto Cairo (2019) _How Charts Lie: Getting Smarter about Visual
  Information_, W. W. Norton & Company.

* Claus O. Wilke (2019) [_Fundamentals of Data
  Visualization_](https://clauswilke.com/dataviz/), O’Reilly,
  Inc. ([Book source on
  GitHub](https://github.com/clauswilke/dataviz); [supporting
  materials on GitHub](https://github.com/clauswilke/dviz.supp))

* Kieran Healy (2018) [_Data Visualization: A practical
  introduction_](https://socviz.co/), Princeton

* Winston Chang (2018) [_R Graphics Cookbook_, 2nd
  edition](https://r-graphics.org/), O’Reilly. ([Book source on
  GitHub](https://github.com/wch/rgcookbook))

Some blogs to check out:

* [Junk Charts](https://junkcharts.typepad.com/)

* [Flowing Data](https://flowingdata.com/)

Keep a critical eye out for good (and not so good) uses of data
visualization in the media.
