## Background

The goals of a visualization can vary:

• exploration and understanding

• presentation and explanation

• engagement

Usually

• initial data analysis tries to explore and understand the data;

• a scientific paper will use visualizations to present and explain results;

• magazines, newspapers, and on-line publications will use them to attract and engage readers.

A project may involve all three.

Four terms are in common use:

• statistical graphics;

• data visualization;

• information visualization;

• infographics.

Some view these as interchangeable; others view them as a continuum.

Visualizations can be

• static, possibly printed on paper;

• dynamic, involving animation or interactive features.

Historically, only static graphics were available.

Static graphics remain very useful for exploration, especially if they can be created quickly and easily.

Interactive graphics are very effective for engagement and are used heavily in on-line publications.

Traditional scientific publications are mostly limited to static visualizations, though on-line supplements are becoming more common.

We will focus primarily on static visualizations but will also look at a few interactive options.

## Visualization in the Data Analysis Process

A data-driven project typically involves several cycles of

• importing, cleaning, and arranging the data (data wrangling)

• exploring and understanding the data

• modeling the data to separate signal from noise

• exploring residuals and other aspects of model/data mismatch

• communicating the results

A figure that is often used to capture these steps:

Visualization can help at each stage and is often crucial for

• understanding the nature of the original data;

• understanding a complex fitted model;

• understanding departures from the model.
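One pass through the explore, model, and residual steps can be sketched in base R. This is only an illustration: the built-in `cars` data set and the straight-line model are assumptions chosen for brevity, not part of the original notes.

```r
# Built-in data: speed (mph) and stopping distance (ft) for 50 cars.
data(cars)

# Explore: look at the raw data before fitting anything.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Model: a straight-line fit as a first attempt to separate signal from noise.
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "blue")

# Residuals: plot departures from the model against the fitted values.
# Curvature or a funnel shape would suggest model/data mismatch.
res <- resid(fit)
plot(fitted(fit), res,
     xlab = "Fitted stopping distance (ft)", ylab = "Residual")
abline(h = 0, lty = 2)
```

In a real project these steps would usually be repeated, with the residual plot suggesting the next model to try.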

Visualizing the data should almost always come before modeling or summarizing.

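A famous example is the set of four small data sets created by Anscombe (1973):

The fitted regression lines for all four data sets are essentially identical, yet plots of the data show four very different patterns.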
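The claim about the regression lines can be checked with the `anscombe` data frame that ships with R (columns `x1` to `x4` and `y1` to `y4`). The sketch below is one possible way to do this in base R; the axis limits and layout are arbitrary choices.

```r
data(anscombe)

# Fit y_i ~ x_i separately for each of the four data sets and
# collect the intercept and slope of each fit.
coefs <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  unname(coef(lm(y ~ x)))
})
rownames(coefs) <- c("intercept", "slope")
colnames(coefs) <- paste0("set", 1:4)
print(round(coefs, 2))

# Plot the four data sets with their fitted lines on common axes.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Data set", i),
       xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "blue")
}
par(op)
```

All four intercepts are close to 3 and all four slopes are close to 0.5, but only the plots reveal the curvature and the outliers.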

Another set of examples in the same spirit is provided by the datasauRus package; the package vignette shows them.

## Some Historical Graphics

Constructing graphics quickly and easily depends heavily on computing, but a computer is not strictly necessary.

Many graphical ideas and elaborate statistical graphs were created in the 1800s and before.

The following are some classical examples.

### William Playfair

Playfair’s The Commercial and Political Atlas (1786) and Statistical Breviary (1801) introduced a number of new graphs, including:

A bar graph:

A pie chart:

### Charles Joseph Minard

Minard developed many elaborate graphs, some available as thumbnail images, including an illustration of Napoleon’s Russian campaign.

This can be recreated approximately in R.

### Florence Nightingale

Florence Nightingale used a polar area diagram to illustrate causes of death among British troops in the Crimean War.

An approximate recreation in R is available.

### John Snow

John Snow used a map to identify the source of the 1854 London cholera epidemic.

The data is available, and has been used for some interactive visualizations.

A short movie has recently been produced.

### Statistical Atlas of the United States

A Statistical Atlas of the US from the late 1800s shows a number of nice examples.

The complete atlases are also available.

## Graphics Software

Most statistical systems provide software for producing static graphics.

Statistical static graphics software typically provides

• a variety of standard plots with reasonable default configurations for things like

  • bin widths
  • axis scaling
  • aspect ratio

• ability to customize plot attributes

• ability to add information to plots, such as

  • legends
  • annotations
  • superimposed plots

• ability to produce new kinds of plots
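As a rough illustration of defaults, customization, and added information, the following base R sketch draws a histogram twice. The built-in `faithful` data set, the 0.1-minute bin width, and the label text are assumptions made for this example only.

```r
data(faithful)
x <- faithful$eruptions  # eruption durations in minutes

# Default configuration: hist() picks the bins itself (Sturges' rule).
h_default <- hist(x, main = "Default bins",
                  xlab = "Eruption duration (min)")

# Customized attribute: narrower bins, drawn on the density scale.
h_fine <- hist(x, breaks = seq(1.5, 5.5, by = 0.1), freq = FALSE,
               main = "Bin width 0.1 min",
               xlab = "Eruption duration (min)")

# Added information: a superimposed density estimate, an annotation
# marking the median, and a legend.
lines(density(x), col = "blue", lwd = 2)
abline(v = median(x), lty = 2)
legend("topleft",
       legend = c("Kernel density estimate", "Median"),
       col = c("blue", "black"), lty = c(1, 2), lwd = c(2, 1),
       bty = "n")
```

The default plot is a reasonable first look; the narrower bins make the two groups of eruption durations easier to see.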

Some software is more flexible than others.

Non-statistical graph or chart software often emphasizes appearance over content: results may look pretty, but content is hard to extract (e.g. 3D pie charts).

Chart drawing packages can be used to produce good statistical graphs, but they may not make it easy.

Some newspapers and magazines have very good graphics departments, including

• New York Times
• Economist
• Guardian
• LA Times

Sometimes tools like Adobe Illustrator or Inkscape can be used to edit and improve graphics produced by statistical software.

NY Times graphics creators often create initial graphs in R and then enhance them in Adobe Illustrator.

## Graphics in R

R has several flexible static graphics systems, including

• base graphics, or standard graphics, provided by the graphics package in the base R distribution;

• trellis graphics, or lattice graphics, provided by the lattice package (one of the standard Recommended packages typically bundled with R);

• ggplot2, based on Wilkinson’s Grammar of Graphics and available from CRAN.

We will mostly be using ggplot2.
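To give a feel for the three systems, here is the same scatterplot written in each of them. This is a minimal sketch: the `faithful` data set is an arbitrary choice, and the lattice and ggplot2 parts run only if those packages are installed.

```r
data(faithful)

# Base graphics: the plot is drawn immediately on the current device.
plot(waiting ~ eruptions, data = faithful)

# lattice: xyplot() returns a "trellis" object that is drawn when printed.
if (requireNamespace("lattice", quietly = TRUE)) {
  p_lattice <- lattice::xyplot(waiting ~ eruptions, data = faithful)
  print(p_lattice)
}

# ggplot2: a plot object is built up from data, aesthetic mappings,
# and layers, and is drawn when printed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_gg <- ggplot2::ggplot(faithful,
                          ggplot2::aes(x = eruptions, y = waiting)) +
    ggplot2::geom_point()
  print(p_gg)
}
```

In interactive use one would normally attach the packages with `library(lattice)` or `library(ggplot2)` instead of using the `::` prefix.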

## Some Task Levels for Visualization

In evaluating visualization methods it can be useful to think about several levels of tasks that might be accomplished with a visualization.

A useful list, from highest to lowest level:

• Analyze: Identify patterns, distributions, presence of outliers or clusters, other interesting features.

• Search: Look up aspects of a feature known in advance or revealed by the visualization.

• Query: Identify, compare features of individual items.

Each higher level builds on the levels below.

As we look at different methods it is useful to consider the tasks they are suited for.

## Scalability

Data sets come in many different sizes and shapes.

Some techniques work well for smaller data sets but deteriorate in effectiveness as size increases.

Sometimes modifications are available that slow the deterioration.

Other methods scale better, though usually at the expense of giving up some level of detail.
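A scatterplot is a simple case where these effects show up. The sketch below uses simulated data (an assumption, as are the sample size and the 50 by 50 grid) to compare a plain scatterplot, a modification that slows the deterioration (transparency), and a method that scales better but gives up individual points (binning).

```r
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)

# Plain scatterplot: with this many points the center is a solid blob.
plot(x, y, pch = 16, cex = 0.5)

# Modification: semi-transparent points let density show through,
# although this too saturates once n is large enough.
plot(x, y, pch = 16, cex = 0.5,
     col = adjustcolor("black", alpha.f = 0.05))

# Binning: count the points in each cell of a grid and show the counts.
# Individual points are lost, but the display does not degrade with n.
nbin <- 50
xbreaks <- seq(floor(min(x)), ceiling(max(x)), length.out = nbin + 1)
ybreaks <- seq(floor(min(y)), ceiling(max(y)), length.out = nbin + 1)
counts <- table(cut(x, xbreaks, include.lowest = TRUE),
                cut(y, ybreaks, include.lowest = TRUE))
z <- matrix(as.vector(counts), nrow = nbin)
image(xbreaks, ybreaks, z, xlab = "x", ylab = "y")
```

The grid size plays the same role here as the bin width does in a histogram: fewer cells give a smoother but coarser picture.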

As we look at different methods it is useful to consider the scale of data sets they are suited for.