Some Other Topics

Some topics we did not have time to look at:

Visualizing Uncertainty: Hurricanes

All estimates from data are associated with some degree of uncertainty.

Effectively communicating that uncertainty in visualizations is challenging and an active area of research.

The cone of uncertainty: (From Cairo (2019); images from a blog post by the author.)

The NHC forecast cone is designed so that two-thirds of historical official forecast errors over a 5-year sample fall within the cone at each forecast time point.

When published in the media, these visualizations are routinely misinterpreted, something like this:

A more effective representation might be something like this, showing an ensemble of possible tracks:

An animated version may be more effective, if the presentation medium permits.

Developing better visualizations for hurricane forecasting, especially targeting the public, is an active area of research.

Visualizing Uncertainty: Chocolate Bars

Expert ratings, on a scale from 0 to 5, for chocolate bars manufactured in several countries:

The standard deviations of the data distributions are comparable, but the lengths of confidence intervals for the mean vary because of the different sample sizes:
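The effect of sample size on interval length can be sketched with simulated data. This is only an illustration: the ratings below are randomly generated, not the chocolate bar data, and the group names, sample sizes, mean, and standard deviation are all assumptions chosen for the example.

```r
# Simulated ratings (NOT the chocolate data): same SD, different sample sizes.
set.seed(1)
n <- c(small = 10, medium = 40, large = 160)
ratings <- lapply(n, function(k) rnorm(k, mean = 3.2, sd = 0.5))

# t-based confidence interval for the mean of one sample
ci_mean <- function(x, level = 0.95) {
    se <- sd(x) / sqrt(length(x))
    half <- qt(1 - (1 - level) / 2, df = length(x) - 1) * se
    c(mean = mean(x), lower = mean(x) - half, upper = mean(x) + half)
}

# One row per group: estimate, interval end points, and interval length
ci <- t(sapply(ratings, ci_mean))
ci_width <- ci[, "upper"] - ci[, "lower"]
print(round(cbind(n = n, ci, width = ci_width), 2))
```

Since the standard error is the standard deviation divided by the square root of the sample size, the group with the fewest observations gets the longest interval even though the spread of the data is about the same in every group.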

The same plot with a reduced horizontal range:

A more elaborate display with confidence intervals at several levels:

Confidence densities, or confidence distributions, as proposed in

Adrian W. Bowman. Graphics for Uncertainty. J. R. Statist. Soc. A 182:1-16, 2018.

One drawback of all of these methods:

The least precise measurement draws the most attention.

These examples from Wilke’s book use the ungeviz package available on GitHub.

Another package providing some tools for uncertainty visualization is the ggdist package.

Visualizing Uncertainty: Old Cars

Using the very old mtcars data set to illustrate estimating a smooth relationship:

A default geom_smooth shows an estimate along with a point-wise confidence band.

This may not give the best sense of the joint uncertainty: if the curve is higher in some places it may need to be lower in others.
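A default smooth with its point-wise band might be produced along these lines. The choice of variables (mpg against disp) is an assumption made for this sketch; the original figure may have used different ones.

```r
library(ggplot2)

# Assumed variables: mpg against disp.
p_smooth <- ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point() +
    # With a data set this small the default method is loess;
    # se = TRUE and level = 0.95 are the defaults for the band.
    geom_smooth()
print(p_smooth)
```

The band is computed separately at each x value, so it says nothing about which combinations of values along the curve are plausible together.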

Showing an ensemble of curves that all are plausible can be a better choice.

This approach was shown earlier for visualizing possible hurricane paths.

This ensemble is generated using a case-based bootstrap.
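A minimal sketch of a case-based bootstrap ensemble follows. The variables (mpg against disp), the loess smoother, the number of draws, and the helper name boot_curve are assumptions for illustration and are not necessarily what was used for the original figure.

```r
library(ggplot2)

set.seed(123)
n_boot <- 20
# Common grid on which every bootstrap curve is evaluated
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp),
                              length.out = 50))

# Case-based bootstrap: resample whole rows (cases) with replacement,
# refit the smoother, and predict on the common grid.
# loess does not extrapolate, so predictions outside the range of a
# resample are NA; loess may also warn on resamples with many ties.
boot_curve <- function(b) {
    idx <- sample(nrow(mtcars), replace = TRUE)
    fit <- loess(mpg ~ disp, data = mtcars[idx, ])
    data.frame(draw = b, disp = grid$disp,
               mpg = predict(fit, newdata = grid))
}
curves <- do.call(rbind, lapply(seq_len(n_boot), boot_curve))

# Overlay the bootstrap curves on the data, one line per draw
p_ens <- ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point() +
    geom_line(aes(group = draw), data = curves,
              alpha = 0.4, colour = "blue", na.rm = TRUE)
print(p_ens)
```

Each line is one plausible fit. Curves that run high in one region and low in another show the joint behavior that a point-wise band hides.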

These plots are called ensemble plots (also spaghetti plots, for obvious reasons).

If animation is available, an alternative is to show the curves one at a time in an animation.

Again, a bootstrap is used to produce the estimates.

This is an example of a hypothetical outcomes plot, or HOP, as introduced in

Hullman, Jessica, Paul Resnick, and Eytan Adar. “Hypothetical outcome plots outperform error bars and violin plots for inferences about reliability of variable ordering.” PLOS ONE 10, no. 11 (2015).

Data Quality and Integrity

A visualization can accurately reflect data but still be misleading if the data are faulty.

A NY Times article from May 2021 shows a choropleth map of the estimated share of adults who would “definitely” or “probably” get the COVID-19 vaccine.

Cutoffs: 49 60 65 70 75 80 91 %

The map may accurately reflect the estimates, but the estimates have obvious problems.

The data used for the map are available here.

Discussions on social media suggest that the state level data may be more reasonable:

Data Science Ethics

Some issues:

Some references:

Plot Annotation, Plot Ensembles, and Dashboards

Plot annotations can create popout and help focus the viewer’s attention.

They may be increasingly important as images are shared online without context.

Here is an example for the mpg data:
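One way such an annotated plot could be built is sketched below. This is not the original figure: highlighting the two-seater cars, the label text, and the label position are all choices made up for this illustration.

```r
library(ggplot2)

# Hypothetical choice of what to highlight: the two-seater cars in mpg.
two_seaters <- subset(mpg, class == "2seater")

p_annot <- ggplot(mpg, aes(x = displ, y = hwy)) +
    # Grey out the bulk of the data so the highlighted points pop out
    geom_point(colour = "grey60") +
    geom_point(data = two_seaters, colour = "red", size = 2) +
    # A text label placed by hand near the highlighted points
    annotate("text", x = 6.5, y = 31, label = "Two-seaters",
             colour = "red", hjust = 1) +
    labs(x = "Engine displacement (litres)",
         y = "Highway miles per gallon")
print(p_annot)
```

Using a muted color for the context points and a saturated color for both the highlighted points and the label ties the annotation to the data it describes.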

Plot Ensembles: Coffee

It is often useful to use several graphics to present an analysis.

Collections of related graphs are sometimes called ensemble graphics.

Online presentations of analyses involving multiple visualizations and, typically, some interactive features are also called dashboards.

To aid the viewer it is usually best to design these visualizations together, with common axis choices and color mappings.

Fig 12.1 in Unwin (2015) provides a simple example:

library(ggplot2)
library(GGally)      # ggparcoord
library(gridExtra)   # grid.arrange, arrangeGrob

## Common text size for all three plots
coffee_thm <- theme(text = element_text(size = 14))

data(coffee, package = "pgmm")
## Recode the numeric Variety as a labeled Type and shorten column names
coffee <- within(coffee, Type <- ifelse(Variety == 1,
                                        "Arabica", "Robusta"))
names(coffee) <- abbreviate(names(coffee), 8)
## Bar chart of sample counts by type
a <- ggplot(coffee, aes(x = Type)) + geom_bar(aes(fill = Type)) +
    scale_fill_manual(values = c("grey70", "red")) +
    guides(fill = "none") + ylab("") +
    coffee_thm
## Scatterplot with the same color mapping ('Caffine' is the data set's spelling)
b <- ggplot(coffee, aes(x = Fat, y = Caffine, colour = Type)) +
    geom_point(size = 3) +
    scale_colour_manual(values = c("grey70", "red")) +
    coffee_thm
## Parallel coordinates plot of the 12 measurements; Robusta is drawn last
c <- ggparcoord(coffee[order(coffee$Type), ], columns = 3 : 14,
                groupColumn = "Type", scale = "uniminmax") +
    xlab("") + ylab("") +
    theme(legend.position = "none") +
    scale_colour_manual(values = c("grey", "red")) +
    theme(axis.ticks.y = element_blank(),
          axis.text.y = element_blank()) +
    coffee_thm
## Top row: bar chart and scatterplot; bottom row: parallel coordinates
grid.arrange(arrangeGrob(a, b, ncol = 2, widths = c(1, 2)),
             c, nrow = 2)

A dashboard with three plots. A bar chart shows there are about 4 times as many Arabica samples as Robusta samples. A scatterplot of Caffeine against Fat content shows clear separation of the two groups. A parallel coordinates plot shows the 12 values measured on the samples in each group.

Data on the chemical composition of coffee samples collected from around the world, comprising 43 samples from 29 countries. Each sample is either of the Arabica or Robusta variety. Twelve of the thirteen chemical constituents reported in the study are given. The omitted variable is total chlorogenic acid; it is generally the sum of the chlorogenic, neochlorogenic and isochlorogenic acid values.

Streuli, H. (1973). Der heutige Stand der Kaffeechemie. In Association Scientifique International du Cafe, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61-72.

Making a Point and Telling a Story

In a report, make sure each plot has a point and makes its point.

Make sure to think about:

It is often good to make sure a figure can stand on its own without asking the reader to search the text for explanations.

Communicating with data is like telling a story, with a starting point, a journey, and an end.

Sometimes a single visualization can capture the full story.

More often, several visualizations will be needed.

Often it is good to:

With multiple visualizations it is good to make sure that:

Wilke (2019) has a chapter with more advice on this.

A recent book length treatment is

Deborah Nolan and Sara Stoudt (2021) Communicating with Data, Oxford University Press.

Wrapping Up

Some of the areas we covered:

Visualization

Many different types of graphs.

  • Strengths, weaknesses.
  • Pitfalls.
  • Scalability.
  • Creating these graphs in R.

Perception

  • Channels and mappings; relative effectiveness.
  • Using these to assess and design visualizations.
  • Effective use of color.

A little on interaction, animation.

Emphasis on techniques useful for exploration, scientific reporting.

Data Technologies

Reading different data formats.

Scraping data from the web.

Cleaning data.

Rearranging data for analysis.

Merging data from several sources.

Reproducible research tools

rmarkdown for integrating code and reporting.

Version control, git, GitLab.

Learning More

Class notes will remain available, in some form, at the class web site.

Some books to look at:

Some blogs to check out:

Keep a critical eye out for good (and not so good) uses of data visualization in the media.

