---
title: "Data Visualization and Data Technologies"
output:
  html_document:
    toc: yes
    code_folding: show
    code_download: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE,
fig.height = 5, fig.width = 6, fig.align = "center")
options(htmltools.dir.version = FALSE)
library(ggplot2)
library(tidyverse)
library(here)
theme_set(theme_minimal() +
theme(text = element_text(size = 16)) +
theme(panel.border = element_rect(color = "grey30", fill = NA)))
set.seed(12345)
```
# Basics
## Objectives
Learn how to effectively use visualization for
* exploring and understanding data
* communicating and explaining insights
Learn how to use data technologies for
* acquiring data
* cleaning data
* organizing data
Learn how to do this in ways that are
* reproducible
* reusable
* shareable
## Topics
Data visualization
* some history of visualization
* learning the basic graph types
* how to create basic graphs in R
* human perception, and how it affects visualization
* using understanding of perception to guide evaluation and design
* dynamic and interactive visualizations
Data technologies
* basic data types
* reshaping and transforming data
* aggregating and summarizing data
* regular expressions for cleaning data
* harvesting data from the web
Reproducible research and collaboration
* literate programming and data analysis
* version control for collaboration
## Recommended Text Books
> Kieran Healy (2018) [_Data Visualization: A practical
> introduction_](https://socviz.co/), Princeton
> Paul Murrell (2009). [_Introduction to Data
> Technologies_](http://www.stat.auckland.ac.nz/~paul/ItDT/), Chapman
> & Hall/CRC.
> Hadley Wickham and Garrett Grolemund (2016), [_R for Data
> Science_](https://r4ds.had.co.nz/), O'Reilly.
> Claus O. Wilke (2019) [_Fundamentals of Data
> Visualization_](https://clauswilke.com/dataviz/), O’Reilly,
> Inc. ([Book source on
> GitHub](https://github.com/clauswilke/dataviz); [supporting
> materials on GitHub](https://github.com/clauswilke/dviz.supp))
## Prerequisites
An introductory statistics course.
A regression course.
Strongly recommended: Prior exposure to basic use of statistical
programming software, such as R or SAS, as obtained from a regression
course.
## Assessment
Quizzes
* Short quizzes will be posted on **ICON** after most
lectures.
Homework
* Homework assignments will be due approximately once a week.
* You will typically submit your work by pushing it to your
GitLab repository by 5:00 PM on the due date.
* Your homework solutions should be written as reports, using proper
sentences and paragraphs to present your results.
Project
* You will do a project developing a visual analysis of a data set
of your choosing.
* You can work on your own or in a group of up to three students.
* Your project should represent about 10 hours of work.
* A one page proposal for your project is due on Monday, March 21.
* A final report on your project is due on Friday, May 6.
* Your project may be shared with the class through the class web
page.
Your grade will be based on quizzes (10%), homework (70%) and the
project (20%).
## Tools
We will be using
* [R](https://www.r-project.org) for computing and graphics
* [R Markdown](https://rmarkdown.rstudio.com) for creating
reproducible reports.
* [`git`](https://git-scm.com/) and the
[UI GitLab](https://research-git.uiowa.edu) service for revision control
You will need an editor or IDE; you can use
* [RStudio](https://www.rstudio.com/products/rstudio/) for editing and more
* any other editor or IDE
To [access these tools](access.html) you can
* use the [UI IDAS RStudio Notebook
Server](https://notebooks.hpc.uiowa.edu/stat45800001/hub/home),
* use the CLAS Linux systems via the [FastX remote
desktop](https://fastx.divms.uiowa.edu),
* or install your own on your computer
For help installing your own a good place to start is
## First Steps: Do This Today!
Visit the [UI GitLab site](https://research-git.uiowa.edu)
and log in with your HawkID.
Make sure you can access the [UI IDAS RStudio Notebook
Server](https://notebooks.hpc.uiowa.edu/stat45800001/hub/home) with
your HawkID and password.
* If you cannot log into the RStudio server, please let your TA
or me know immediately.
Make sure you are able to log into the CLAS Linux systems with your
HawkID and password.
* The easiest way is to use the
[FastX](https://fastx.divms.uiowa.edu) client.
* If you cannot log into the CLAS workstations, please let your TA
or me know immediately.
Look at the [brief introduction to git](git.html) or the beginning
of to see what `git` is about and how to
get started with it.
Make sure you have access to R and try something like this:
```{r geyser, eval = FALSE}
with(faithful,
plot(eruptions, waiting,
xlab = "Eruption time (min)",
ylab = "Waiting time to next eruption (min)"))
```
The result is a plot that looks like this:
```{r, ref.label="geyser", echo=FALSE}
```
## Getting Set Up
Log into the [UI GitLab site](https://research-git.uiowa.edu)
to get your GitLab account
activated.
Decide where you want to work:
* [UI IDAS RStudio Notebook
Server](https://notebooks.hpc.uiowa.edu/stat45800001/hub/home)
* [FastX](https://clas.uiowa.edu/linux/help/fastx) for accessing
the CLAS Linux systems via the
[web interface](https://fastx.divms.uiowa.edu) or the
[desktop client](https://clas.uiowa.edu/linux/help/fastx/desktop).
* Your own computer.
Setup needed for IDAS RStudio Server:
* If you are registered then you should have an account
now. If you add the course late you should have an account
within a day.
* [Introduce yourself to Git](https://happygitwithr.com/hello-git.html).
Setup needed for CLAS Linux:
* Install the [desktop
client](https://clas.uiowa.edu/linux/help/fastx/desktop) if you
want to use it. Otherwise, use the [web
interface](https://fastx.divms.uiowa.edu).
* Your account will be set up automatically the first time you log in.
* [Introduce yourself to Git](https://happygitwithr.com/hello-git.html).
Setting up your own computer: (A good resource for help with this is
):
* Install the current version of R.
* You might have older versions from other courses (e.g. from
[Anaconda](https://www.anaconda.com/)).
* You will need to add packages as we go along.
* Install RStudio if you want to use it (highly recommended).
* Install Git.
* [Introduce yourself to Git](https://happygitwithr.com/hello-git.html).
Even if you decide to use your own computer you should make sure you can use
the RStudio server or CLAS systems as a backup.
# Some Examples
## Life Expectancy in the Americas in 2007
The data is from the [GapMinder](https://www.gapminder.org) project.
```{r, message = FALSE, class.source = "fold-hide"}
library(dplyr)
library(ggplot2)
library(gapminder)
le_am_2007 <- filter(gapminder, year == 2007, continent == "Americas") %>%
mutate(country = reorder(country, lifeExp))
knitr::kable(select(le_am_2007, country, lifeExp),
col.names = c("Country", "Life Expectancy (years)"),
digits = 1, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = "striped",
full_width = FALSE,
font_size = 14) %>%
kableExtra::scroll_box(height = "300px", width = "75%")
```
A _dot plot_:
```{r, message = FALSE, fig.height = 6, fig.width = 10, class.source = "fold-hide"}
thm <- theme_minimal() + theme(text = element_text(size = 16))
ggplot(le_am_2007, aes(y = country, x = lifeExp)) +
geom_point(fill = "lightblue") +
labs(x = "Life Expectancy (years)", y = NULL) +
thm + ggtitle("Dot Plot")
```
A _bar chart_:
```{r, message = FALSE, fig.height = 6, fig.width = 10, class.source = "fold-hide"}
ggplot(le_am_2007, aes(x = lifeExp, y = country)) +
geom_col(fill = "lightblue") +
labs(x = "Life Expectancy (years)", y = NULL) +
thm + ggtitle("Bar Chart")
```
Another (bad!) _bar chart_:
```{r, message = FALSE, fig.height = 6, fig.width = 10, class.source = "fold-hide"}
baseline <- 60
ticks <- c(0, 10, 20, 30)
ggplot(le_am_2007, aes(x = lifeExp - baseline, y = country)) +
geom_col(fill = "lightblue") +
labs(x = "Life Expectancy (years)", y = NULL) +
scale_x_continuous(breaks = ticks, labels = ticks + baseline) +
thm + ggtitle("Another Bar Chart")
```
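One way to see why this second bar chart can mislead is to compare the ratio of two bar lengths with and without the baseline of 60. The two life expectancy values below are hypothetical, chosen only to make the arithmetic easy.

```r
baseline <- 60
le <- c(low = 70, high = 80)  # hypothetical life expectancies in years

## With a zero baseline the bar lengths are the values themselves.
ratio_zero <- le[["high"]] / le[["low"]]

## With a baseline of 60 the bar lengths are the values minus 60.
ratio_shifted <- (le[["high"]] - baseline) / (le[["low"]] - baseline)

round(c(zero_baseline = ratio_zero, baseline_60 = ratio_shifted), 2)
```

In this made-up case the longer bar is about 1.14 times the shorter one when bars start at zero, but twice as long when they start at 60.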
We will look at:
* How to create these views using code that makes them easily reproducible.
* How to assess their advantages and disadvantages as visual
representations of the data
A data set with more variables for more countries and years is
available in the `gapminder` R package.
Data preparation steps:
* _Filter_ the larger data set down to the countries and year we want.
* _Select_ the country name and life expectancy variables.
We will look at how to carry out these steps with reproducible code.
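As a minimal sketch of the filter and select steps, the following uses `dplyr` on a small made-up data frame. The `toy` data and its values are hypothetical stand-ins for the `gapminder` data, not real figures.

```r
library(dplyr)

## Hypothetical toy data standing in for the gapminder data set;
## the values are made up for illustration.
toy <- tibble(country = c("A", "B", "C", "D"),
              continent = c("Americas", "Americas", "Europe", "Americas"),
              year = c(2007, 2002, 2007, 2007),
              lifeExp = c(75.3, 70.1, 80.2, 72.8),
              pop = c(1e6, 2e6, 3e6, 4e6))

## Filter: keep only the rows for the year and continent we want.
toy_am_2007 <- filter(toy, year == 2007, continent == "Americas")

## Select: keep only the country name and life expectancy columns.
toy_le <- select(toy_am_2007, country, lifeExp)
toy_le
```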
## Yearly Snowfall in Iowa City
How did the winter of 2018/9 compare to other years?
```{r snowfall, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 6, fig.width = 10}
```
The data are available from a [NOAA web service
API](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) as a _CSV_ file.
```{r, echo = FALSE}
head(select(ic_data_raw, year, month, element, starts_with("VALUE")))
```
Data preparation steps:
* Read in the CSV file.
* _Reshape_ the data to have columns `date`, `TMAX`, `TMIN`, `SNOW` and `PRCP`.
* _Filter_ out bogus dates created by the original format.
* _Convert_ units to more standard (American) ones (e.g. millimeters to inches).
* ...
Code is available [here](#ic-snowfall-example-code).
## Internet Adoption Across the World
An example from [Wilke (2019)](https://clauswilke.com/dataviz/) with
[World Bank
data](https://data.worldbank.org/indicator/IT.NET.USER.ZS).
```{r internet-heatmap, echo = FALSE, fig.height = 6, fig.width = 10}
```
The data are available in several formats (CSV, XML, Excel).
```{r, echo = FALSE}
head(as_tibble(internet_raw))
```
Data preparation:
* Read in the data.
* _Filter_ down to the countries we want.
* _Reshape_ to have columns `country`, `year`, and `users`.
* ...
Code is available [here](#internet-example-code).
## Iowa Wind Turbines
Data is available from the [U.S. Wind Turbine
Database](https://eerscmap.usgs.gov/uswtdb/).
```{r, fig.height = 6, fig.width = 10, message = FALSE, class.source = "fold-hide"}
library(sf)
data(US_counties_geoms, package = "dviz.supp")
data(wind_turbines, package = "dviz.supp")
sf_wt <- st_as_sf(wind_turbines, coords = c("xlong", "ylat"), crs = 4326)
sf_wt_IA <- filter(sf_wt, t_state == "IA")
sf_wt_IA <- mutate(sf_wt_IA,
p_year = ifelse(p_year > 0, p_year, NA),
year = cut(p_year,
breaks = c(0, 2005, 2010, 2015, 2020),
labels = c("before 2005", "2005-2009",
"2010-2014", "2015-2020"),
right = FALSE))
ggplot(filter(US_counties_geoms$lower48, STATEFP == 19)) +
geom_sf() +
geom_sf(data = sf_wt_IA, aes(fill = year, color = year), shape = 21) +
scale_fill_viridis_d(direction = -1, na.value = "red") +
ggthemes::theme_map() +
ggtitle("A Map") +
theme(legend.background = element_rect(fill = "transparent"),
plot.title = element_text(size = 16))
```
```{r, eval = FALSE, echo = FALSE}
## This fails with color = NA
ggplot(filter(US_counties_geoms$lower48, STATEFP == 19)) +
geom_sf() +
geom_sf(data = sf_wt_IA, aes(fill = year), shape = 21, color = NA) +
scale_fill_viridis_d(direction = -1, na.value = "red") +
ggthemes::theme_map() +
ggtitle("A Map") +
theme(legend.background = element_rect(fill = "transparent"),
plot.title = element_text(size = 16))
```
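The map code bins the project year with `cut()` and `right = FALSE`, so each interval includes its left end point and excludes its right one. A small sketch with hypothetical years placed on and around the break points shows how values are assigned:

```r
## Hypothetical project years, chosen to sit on and around the break points.
p_year <- c(2004, 2005, 2009, 2010, 2019, 2020)

year_bin <- cut(p_year,
                breaks = c(0, 2005, 2010, 2015, 2020),
                labels = c("before 2005", "2005-2009",
                           "2010-2014", "2015-2020"),
                right = FALSE)

data.frame(p_year, year_bin)
```

With these arguments a year of exactly 2020 lies outside the last interval and becomes `NA`; adding `include.lowest = TRUE` would be one way to include it.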
There are two data sets:
* _Shape_ information for drawing the map.
* Data on individual wind turbines.
Data preparation:
* Read in the data.
* Match up the _projection_ used for the map and location data.
* ...
## Iowa Population in 2010
```{r, include = FALSE}
## This would be needed if we didn't have a library(sf) earlier, so
## that the filter operation does the right thing, I think
## loadNamespace("sf")
```
```{r, fig.height = 6, fig.width = 10, class.source = "fold-hide"}
data(US_census, package = "dviz.supp")
uscounties <- mutate(US_counties_geoms$lower48,
FIPS = as.numeric(paste0(STATEFP, COUNTYFP)))
uscounties <- left_join(uscounties,
select(US_census, FIPS, pop2010),
"FIPS")
ggplot(filter(uscounties, STATEFP == 19)) +
geom_sf(aes(fill = pop2010)) +
scale_fill_viridis_c(direction = -1, trans = "log10",
labels = scales::comma) +
ggtitle("A Choropleth Map") +
ggthemes::theme_map() +
theme(legend.background = element_rect(fill = "transparent"),
plot.title = element_text(size = 16))
```
Again there are two data sets:
* Shape data for drawing the map.
* County population data from the 2010 census.
Data preparation:
* ...
* _Merge_ or _join_ the population data with the shape data.
* ...
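The join step can be sketched on a small example. The two tables below are hypothetical stand-ins for the shape and census data; the county codes are real Iowa FIPS codes, but the population values are made up.

```r
library(dplyr)

## Hypothetical stand-in for the county shape data: state and county
## codes stored as character strings.
shapes <- tibble(STATEFP = c("19", "19", "19"),
                 COUNTYFP = c("103", "113", "153"),
                 NAME = c("Johnson", "Linn", "Polk"))

## Hypothetical stand-in for the census table; populations are made up.
census <- tibble(FIPS = c(19103, 19153),
                 pop2010 = c(1000, 3000))

## Build a numeric key by pasting the state and county codes together.
shapes <- mutate(shapes, FIPS = as.numeric(paste0(STATEFP, COUNTYFP)))

## A left join keeps every shape row; unmatched counties get NA.
joined <- left_join(shapes, census, by = "FIPS")
joined
```

A left join is used so that a county with no matching population row still appears on the map, with a missing value.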
# Reproducibility
## Reproducible Reports and Analyses
Preparing a report on a data analysis project usually involves
* reading the data
* _wrangling_ the data into usable form
* visualizing, summarizing, and modeling
* writing a report that includes your results
To make your work reproducible for someone else, or for you when the
data changes, it is best to use code for the entire workflow.
[_R Markdown_](https://rmarkdown.rstudio.com/) is one technology that
supports this.
## Tools for Reproducibility
R Markdown files contain report text along with code to produce
numerical and graphical results.
Tools are available to
* convert an R Markdown file into a PDF or HTML report;
* extract the code run to produce the computational and graphical results.
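As a rough sketch of both steps, the code below writes a tiny R Markdown file to a temporary directory, extracts its code with `knitr::purl()`, and renders it to HTML with `rmarkdown::render()` when pandoc is available. The file name `demo.Rmd` and its contents are made up for illustration.

```r
## Three backticks, built here so the listing contains no literal fence.
fence <- strrep("`", 3)

## Write a small, hypothetical R Markdown file to a temporary directory.
rmd <- file.path(tempdir(), "demo.Rmd")
writeLines(c("---",
             "title: \"Demo\"",
             "output: html_document",
             "---",
             "",
             "Some report text.",
             "",
             paste0(fence, "{r}"),
             "x <- 1 + 1",
             "x",
             fence),
           rmd)

## Extract just the R code into a script.
r_file <- knitr::purl(rmd, output = file.path(tempdir(), "demo.R"),
                      quiet = TRUE)
readLines(r_file)

## Convert to an HTML report; this step needs pandoc, so check first.
if (rmarkdown::pandoc_available())
    html_file <- rmarkdown::render(rmd, quiet = TRUE)
```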
```{r, include = FALSE}
using_xaringan <- grepl("xaringan", names(rmarkdown::metadata$output)[[1]])
docwords <- if (using_xaringan) "These slides were" else "This page was"
```
`r docwords` generated from the R Markdown file [intro.Rmd](intro.Rmd).
You will be creating R Markdown files like this for your homework and
project.
Some R Markdown tutorials:
* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
by Yihui Xie is a book-length presentation.
* The [R Markdown Home Page](https://rmarkdown.rstudio.com) has a link
to a [tutorial](https://rmarkdown.rstudio.com/lesson-1.html).
# Code
## Code for the [Iowa City Snowfall Example](#ic-snowfall-example)
Read the data:
```{r snowfall-read, eval = FALSE}
library(tidyverse)
library(lubridate)
if (! file.exists(here("ic_noaa.csv.gz")))
download.file("http://www.stat.uiowa.edu/~luke/data/ic_noaa.csv.gz",
here("ic_noaa.csv.gz"))
ic_data_raw <- as_tibble(read.csv(here("ic_noaa.csv.gz"), header = TRUE))
```
Reshape from (very) wide to (too) long:
```{r snowfall-wide-to-long, eval = FALSE}
ic_data <- select(ic_data_raw, year, month, element, starts_with("VALUE"))
ic_data <- pivot_longer(ic_data, VALUE1:VALUE31,
                        names_to = "day",
                        values_to = "value")
```
Extract the day as a number:
```{r snowfall-extract-day, eval = FALSE}
ic_data <- mutate(ic_data, day = as.integer(sub("VALUE", "", day)))
```
Reshape from too long to tidy with one row per day, keeping only the
primary variables:
```{r snowfall-long-to-tidy, eval = FALSE}
corevars <- c("TMAX", "TMIN", "PRCP", "SNOW", "SNWD")
ic_data <- filter(ic_data, element %in% corevars)
ic_data <- pivot_wider(ic_data, names_from = "element", values_from = "value")
```
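The two reshaping steps above can be tried on a small example. The `wide` table below is a hypothetical miniature of the raw data, with only three day columns and made-up values.

```r
library(tidyr)
library(dplyr)

## Hypothetical wide data: one row per month and element,
## one column per day (only three days here, and made-up values).
wide <- tibble(year = 2019, month = 1,
               element = c("SNOW", "TMAX"),
               VALUE1 = c(0, -50), VALUE2 = c(25, -10), VALUE3 = c(0, 20))

## Wide to long: one row per month, element, and day.
long <- pivot_longer(wide, VALUE1:VALUE3,
                     names_to = "day", values_to = "value")
long <- mutate(long, day = as.integer(sub("VALUE", "", day)))

## Long to tidy: one row per day, one column per element.
tidy <- pivot_wider(long, names_from = "element", values_from = "value")
tidy
```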
Add a date variable for plotting and to help get rid of bogus days:
```{r snowfall-add-date, eval = FALSE}
ic_data <- mutate(ic_data, date = lubridate::make_date(year, month, day))
ic_data <- filter(ic_data, ! is.na(date))
```
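The bogus days arise because the wide format has 31 `VALUE` columns for every month, so the long data contains combinations such as February 30. The sketch below uses base R's `ISOdate()` as a stand-in for `make_date()` to show the idea: impossible combinations yield `NA` and can then be filtered out. The example values are hypothetical.

```r
## Hypothetical year/month/day combinations, two of them impossible.
d <- data.frame(year = 2019, month = c(2, 2, 4, 4), day = c(28, 30, 30, 31))

## ISOdate() returns NA for dates that do not exist.
d$date <- as.Date(ISOdate(d$year, d$month, d$day))

## Keep only the rows with a valid date.
d_valid <- d[!is.na(d$date), ]
d_valid
```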
Make units more standard (American):
```{r snowfall-fix-units, eval = FALSE}
mm2in <- function(x) x / 25.4
C2F <- function(x) 32 + 1.8 * x
ic_data <- transmute(ic_data, year, month, day, date,
PRCP = mm2in(PRCP / 10),
SNOW = mm2in(SNOW),
SNWD = mm2in(SNWD),
TMIN = C2F(TMIN / 10),
TMAX = C2F(TMAX / 10))
```
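A quick check of the two conversion helpers on hypothetical raw values may help. It assumes, as the code above does, that `PRCP` and the temperatures are recorded in tenths of a unit and `SNOW` in whole millimeters.

```r
mm2in <- function(x) x / 25.4      # millimeters to inches
C2F <- function(x) 32 + 1.8 * x    # degrees Celsius to Fahrenheit

## Hypothetical raw values in the units assumed by the code above.
prcp_in <- mm2in(254 / 10)   # 254 tenths of a mm = 25.4 mm
snow_in <- mm2in(254)        # 254 mm
tmax_f <- C2F(250 / 10)      # 250 tenths of a degree C = 25 C

c(PRCP = prcp_in, SNOW = snow_in, TMAX = tmax_f)
```

These work out to 1 inch of precipitation, 10 inches of snow, and 77 degrees Fahrenheit.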
Add a Month factor with abbreviated levels:
```{r snowfall-add-month, eval = FALSE}
ic_data <- mutate(ic_data,
Month = lubridate::month(month, label = TRUE, abbr = TRUE))
```
Associate January through June with the winter starting in the previous year:
```{r snowfall-wyear, eval = FALSE}
ic_data <- mutate(ic_data, wyear = ifelse(month <= 6, year - 1, year))
```
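The winter-year rule can be checked on a few hypothetical year and month pairs around the winter of 2018/9:

```r
## Hypothetical dates around the winter of 2018/9.
d <- data.frame(year = c(2018, 2018, 2019, 2019, 2019),
                month = c(7, 12, 1, 6, 7))

## Months 1-6 are assigned to the winter that started the year before.
d$wyear <- ifelse(d$month <= 6, d$year - 1, d$year)
d
```

July 2018 through June 2019 all receive `wyear` 2018, and July 2019 starts the next winter year.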
Compute the winter totals and the total for the 2018/9 winter:
```{r snowfall-winter-totals, eval = FALSE}
ic_snow <- group_by(ic_data, wyear) %>%
summarize(snow = sum(SNOW, na.rm = TRUE))
ic_snow_2018 <- filter(ic_snow, wyear == 2018)$snow
```
Create the histogram and show the 2018/9 total:
```{r snowfall-histogram, eval = FALSE}
ggplot(ic_snow) +
geom_histogram(aes(x = snow), bins = 15, fill = "grey", color = "black") +
geom_vline(xintercept = ic_snow_2018, color = "red", size = 2) +
scale_x_continuous(name = "Total Annual Snowfall (inches)",
sec.axis = dup_axis(name = NULL,
breaks = ic_snow_2018,
labels = "2018/9")) +
scale_y_continuous(expand = expansion(0, 0)) +
ggtitle("A Histogram") +
theme_minimal() +
theme(text = element_text(size = 16))
```
```{r snowfall, eval = FALSE, echo = FALSE}
library(lubridate)
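<<snowfall-read>>
<<snowfall-wide-to-long>>
<<snowfall-extract-day>>
<<snowfall-long-to-tidy>>
<<snowfall-add-date>>
<<snowfall-fix-units>>
<<snowfall-add-month>>
<<snowfall-wyear>>
<<snowfall-winter-totals>>
<<snowfall-histogram>>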
```
## Code for the [Internet Example](#internet-example)
From
[code](https://github.com/clauswilke/dataviz/blob/master/visualizing_amounts.Rmd)
for the [_Visualizing amounts_
chapter](https://clauswilke.com/dataviz/visualizing-amounts.html) in
Claus Wilke's [Fundamentals of Data
Visualization](https://clauswilke.com/dataviz/index.html).
Reading in the data:
```{r internet-heatmap-reading-data, eval = FALSE}
base_url <- "http://api.worldbank.org/v2/en/indicator/"
if (file.exists("internet.xls")) {
## if we have to go with Excel
## xls_url <- paste0(base_url, "IT.NET.USER.ZS?downloadformat=excel")
## download.file(xls_url, "internet.xls")
internet_raw <- readxl::read_xls("internet.xls",
skip = 2, .name_repair = "universal")
names(internet_raw) <- sub("\\.\\.\\.", "X", names(internet_raw))
} else {
csvfile <- here("internet.csv")
if (! file.exists(csvfile)) {
csv_url <- paste0(base_url, "IT.NET.USER.ZS?downloadformat=csv")
download.file(csv_url, "internet.zip")
unzip("internet.zip")
file.rename("API_IT.NET.USER.ZS_DS2_en_csv_v2_2255007.csv",
csvfile)
}
internet_raw <- read.csv(csvfile, skip = 4)
}
```
Reshape to longer format:
```{r internet-heatmap-reshape, eval = FALSE}
internet <- select(internet_raw, country = Country.Name, matches("X.+"))
internet <- pivot_longer(internet, -country,
names_to = "year", values_to = "users")
internet <- mutate(internet, year = as.integer(sub("X", "", year)))
```
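The `X` prefix removed here comes from `read.csv()`, which by default turns column names such as `1990` into syntactically valid R names. A small sketch with made-up CSV text:

```r
## Hypothetical CSV text with year columns, loosely like the World Bank file.
csv_text <- "Country Name,1990,1991\nA,0.1,0.2\nB,0.0,0.5"

raw <- read.csv(text = csv_text)
names(raw)

## Strip the prefix and convert the remaining text to integer years.
years <- as.integer(sub("X", "", names(raw)[-1]))
years
```

The column names come out as `Country.Name`, `X1990`, and `X1991`, which is why the reshaping code selects `Country.Name` and matches columns on `X.+`.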
Select some countries to include in the plot:
```{r internet-heatmap-select-countries, eval = FALSE}
country_list <- c("United States", "China", "India", "Japan", "Algeria",
"Brazil", "Germany", "France", "United Kingdom", "Italy",
"New Zealand", "Canada", "Mexico", "Chile", "Argentina",
"Norway", "South Africa", "Kenya", "Israel", "Iceland")
internet_short <- filter(internet, country %in% country_list)
## internet_short <- mutate(internet_short,
## users = ifelse(is.na(users), 0, users))
```
Get ordering by 2017 levels:
```{r internet-heatmap-2017-levels, eval = FALSE}
intr17 <- filter(internet_short, year == 2017)
levs <- arrange(intr17, users)$country
```
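The ordering matters because `levs` is later used as the factor levels for the `y` axis. A minimal base R sketch of the same idea, using hypothetical countries and values and `order()` in place of `arrange()`:

```r
## Hypothetical 2017 values for three countries.
country <- c("A", "B", "C")
users <- c(80, 20, 50)

## Countries sorted from lowest to highest value.
levs <- country[order(users)]
levs

## Using levs as the factor levels fixes the order used on a plot axis.
f <- factor(country, levels = levs)
levels(f)
```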
The basic plot:
```{r internet-heatmap-basic-plot, eval = FALSE}
p_inet <- ggplot(filter(internet_short, year > 1993),
aes(x = year,
y = factor(country, levs),
fill = users)) +
geom_tile(color = "white", size = 0.25)
```
Adjust color palette and guide:
```{r internet-heatmap-fill-color, eval = FALSE}
p_inet <- p_inet +
scale_fill_viridis_c(
option = "A", begin = 0.05, end = 0.98,
limits = c(0, 100),
name = "internet users / 100 people",
guide = guide_colorbar(
direction = "horizontal",
label.position = "bottom",
title.position = "top",
ticks = FALSE,
barwidth = grid::unit(3.5, "in"),
barheight = grid::unit(0.2, "in"),
order = 1))
```
Adjust `x` and `y` scales:
```{r internet-heatmap-x-y-scales, eval = FALSE}
p_inet <- p_inet +
scale_x_continuous(expand = c(0, 0), name = NULL) +
scale_y_discrete(name = NULL, position = "right")
```
Add layer for `NA` values:
```{r internet-heatmap-NA-layer, eval = FALSE}
p_inet <- p_inet +
ggnewscale::new_scale_fill() +
geom_tile(data = filter(internet_short, year > 1993, is.na(users)),
aes(fill = "NA"), color = "white") +
scale_fill_manual(values = "grey50") +
guides(fill = guide_legend(title = "",
label.position = "bottom",
title.position = "top",
keyheight = grid::unit(0.2, "in"),
keywidth = grid::unit(0.2, "in"),
order = 2))
```
Final plot with title and theme adjustments:
```{r internet-heatmap-finalplot, eval = FALSE}
p_inet +
ggtitle("A Heat Map",
"Countries ordered by percent internet users in 2017.") +
theme(text = element_text(size = 16)) +
theme(axis.line = element_blank(),
axis.ticks = element_blank(),
axis.ticks.length = grid::unit(1, "pt"),
legend.position = "top",
legend.justification = "left",
legend.title.align = 0.5,
legend.title = element_text(size = 12 * 12 / 14)
)
```
```{r internet-heatmap, eval = FALSE, echo = FALSE}
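<<internet-heatmap-reading-data>>
<<internet-heatmap-reshape>>
<<internet-heatmap-select-countries>>
<<internet-heatmap-2017-levels>>
<<internet-heatmap-basic-plot>>
<<internet-heatmap-fill-color>>
<<internet-heatmap-x-y-scales>>
<<internet-heatmap-NA-layer>>
<<internet-heatmap-finalplot>>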
```