Data Visualization and Data Technologies

class: center, middle, title-slide

.title[
# Data Visualization and Data Technologies
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-01-22
]

---

class: center, middle

# Basics

---

## Objectives

Learn how to effecitvely use visualization for

* exploring and understanding data
* communicating and explaining insights

Learn how to use data technologies for

* acquiring data
* cleaning data
* organizing data

Learn how to do this in ways that are

* reproducible
* reusable
* shareable

---

## Topics

Data visualization

* some history of visualization
* learning the basic graph types
* how to create basic graphs in R
* human perception, and how it affects visualization
* using understanding of perception to guide evaluation and design
* dynamic and interactive visualizations

Data technologies

* basic data types
* reshaping and transforming data
* aggregating and summarizing data
* merging several data sets
* regular expressions for cleaning data
* harvesting data from the web

Reproducible research and collaboration

* literate programming and data analysis
* version control for collaboration

---

## Recommended Text Books

> Kieran Healy (2018) [_Data Visualization: A practical
> introduction_](https://socviz.co/), Princeton

> Paul Murrell (2009). [_Introduction to Data
> Technologies_](http://www.stat.auckland.ac.nz/~paul/ItDT/), Chapman
> & Hall/CRC.

> Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (2023),
> [_R for Data Science (2nd Edition)_](https://r4ds.hadley.nz/),
> O'Reilly.

> Claus O. Wilke (2019) [_Fundamentals of Data
> Visualization_](https://clauswilke.com/dataviz/), O’Reilly,
> Inc. ([Book source on
> GitHub](https://github.com/clauswilke/dataviz); [supporting
> materials on GitHub](https://github.com/clauswilke/dviz.supp))

---
## Prerequisites

An introductory statistics course.

A regression course.

Strongly recommended: Prior exposure to basic use of statistical
programming software, such as R or SAS, as obtained from a regression
course.

---
## Assessment

Quizzes

* Short quizzes will be posted on **ICON** after most lectures.

Homework

* Homework assignments will be due approximately once a week.
* You will typically submit your work by pushing it to your
  GitLab repository by 5:00 PM on the due date.
* Your homework solutions should be written as reports, using proper
  sentences and paragraphs to present your results.

Project

* You will do a project developing a visual analysis of a data set
  of your choosing.
* You can work on your own or in a group of up to three students.
* Your project should represent about 10 hours of work for each student.
* A one page proposal for your project is due on Monday, March 18.
* A final report on your project is due on Friday, May 3.
* Your project may be shared with the class through the class web
  page.

Your grade will be based on quizzes (10%), homework (70%) and the
project (20%).

---
## Tools

We will be using

* [R](https://www.r-project.org) for computing and graphics
* [R Markdown](https://rmarkdown.rstudio.com) for creating
  reproducible reports.
* [`git`](https://git-scm.com/) and the [UI
  GitLab](https://research-git.uiowa.edu) service for revision control
  and submitting work.

You will need an editor or IDE; you can use

* [RStudio](https://posit.co/products/open-source/rstudio/) for editing and more
* any other editor or IDE

To [access these tools](access.html) you can

* use the [UI IDAS RStudio Notebook
  Server](https://notebooks.hpc.uiowa.edu/spring2024-stat-4580-0001/hub/home),
* use the CLAS Linux systems via the [FastX remote
  desktop](https://fastx.divms.uiowa.edu),
* or install your own on your computer

For help installing your own a good place to start is
<https://happygitwithr.com/>

---
layout: true
## First Steps: Do This Today!

---

Visit the [UI GitLab site](https://research-git.uiowa.edu) at
<https://research-git.uiowa.edu> and log in with your HawkID.

Make sure you can access the UI IDAS Rstudio Notebook Server with
your HawkID and password.

* The server is available at

<https://notebooks.hpc.uiowa.edu/spring2024-stat-4580-0001/hub/home>.

* If you cannot log into the RStudio server, please let your TA
  or me know immediately.

Make sure you are able to log into the CLAS Linux systems with your
HawkID and password.

* The easiest way is to use the
  [FastX](https://fastx.divms.uiowa.edu) client at
  <https://fastx.divms.uiowa.edu>.
* If you cannot log into the CLAS workstations, please let your TA
  or me know immediately.

Look at the [brief introduction to git](../git.html) or the
beginning of <https://happygitwithr.com> to see what `git` is about
and how to get started with it.

---

Make sure you have access to R and try someting like this:

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```

.pull-left[
The result is a plot that looks like this:
]

.pull-right[
<img src="intro_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]

---
layout: true

## Getting Set Up
---

Log into the [UI GitLab site](https://research-git.uiowa.edu) at
<https://research-git.uiowa.edu> to get your GitLab account
activated.

Decide where you want to work:

* [UI IDAS RStudio Notebook
  Server](https://notebooks.hpc.uiowa.edu/spring2024-stat-4580-0001/hub/home)
* [FastX](https://linux.clas.uiowa.edu/help/fastx) for accessing
  the CLAS Linux systems via the
  [web interface](https://fastx.divms.uiowa.edu) or the 
  [desktop client](https://linux.clas.uiowa.edu/help/fastx/desktop).
* Your own computer.

Setup needed for IDAS RStudio Server:

* If you are registered then you should have an account
  now. If you add the course late you should have an account
  within a day.
* [Introduce yourself to Git](https://happygitwithr.com/hello-git.html).

Setup needed for CLAS Linux:

* Install the [desktop
   client](https://linux.clas.uiowa.edu/help/fastx/desktop) if you
   want to use it. Otherwise, use the [web
   interface](https://fastx.divms.uiowa.edu).
* Your account will be set up automatically the first time you log in.
* [Introduce yourself to Git](https://happygitwithr.com/hello-git.html).

---

Setting up your own computer: (A good resource for help with this is
<https://happygitwithr.com>):

* Install the current version of R.
    * You might have older versions from other courses (e.g. from
      [Anaconda](https://www.anaconda.com/)).
    * You will need to add packages as we go along.

* Install RStudio if you want to use it (highly recommended).

* Install Git.

* [Introduce yourself to Git](https://happygitwithr.com/hello-git.html).

Even if you decide to use your own computer you should make sure you can use
the RStudio server or CLAS systems as a backup.

<!--
## Checking and Setting Up an SSH Key

* Using RStudio: Check/create your key by going to **Tools** > **Global** >
      **Options** > **GIT/SVN**.

* From a shell command line: use `ssh-keygen`.

* Open terminal or shell and do:
    
    ```bash
    cat <filename>.pub
    ```
  and copy to the clipboard.

* On the [UI GitLab web page](https://research-git.uiowa.edu) go to
  `<your icon at right>` > **Settings** > **SSH Keys** and paste
  in the public key.

* Check that you can clone your repository

* RStudio: go to **File** > **New Project** > **Version Control** > **Git**.
    * From a shell command line use `git clone`

* On Linux and Mac OS you can avoid the need to type your pass phrase
  many times by calling `ssh-add` in a shell. This should remember
  your pass phrase until you log out.

* In principle this is possible on Windows, but the setup is more
complicated.  This
[link](https://help.github.com/articles/working-with-ssh-key-passphrases/#platform-windows)
provides some hints.
-->

---
layout: false
class: center, middle
# Some Examples

---
layout:true
## Life Expectancy in the Americas in 2007

---

The data is from the [GapMinder](https://www.gapminder.org) project.

.hide-code[

```r
library(dplyr)
library(ggplot2)
library(gapminder)
le_am_2007 <- filter(gapminder, year == 2007, continent == "Americas") |>
    mutate(country = reorder(country, lifeExp))
knitr::kable(select(le_am_2007, country, lifeExp),
             col.names = c("Country", "Life Expectancy (years)"),
             digits = 1, format = "html") |>
    kableExtra::kable_styling(bootstrap_options = "striped",
                              full_width = FALSE,
                              font_size = 14) |>
    kableExtra::scroll_box(height = "300px", width = "75%")
```

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:300px; overflow-x: scroll; width:75%; "><table class="table table-striped" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Country </th>
   <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Life Expectancy (years) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Argentina </td>
   <td style="text-align:right;"> 75.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bolivia </td>
   <td style="text-align:right;"> 65.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Brazil </td>
   <td style="text-align:right;"> 72.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Canada </td>
   <td style="text-align:right;"> 80.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Chile </td>
   <td style="text-align:right;"> 78.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Colombia </td>
   <td style="text-align:right;"> 72.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Costa Rica </td>
   <td style="text-align:right;"> 78.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cuba </td>
   <td style="text-align:right;"> 78.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Dominican Republic </td>
   <td style="text-align:right;"> 72.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ecuador </td>
   <td style="text-align:right;"> 75.0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> El Salvador </td>
   <td style="text-align:right;"> 71.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Guatemala </td>
   <td style="text-align:right;"> 70.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Haiti </td>
   <td style="text-align:right;"> 60.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Honduras </td>
   <td style="text-align:right;"> 70.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Jamaica </td>
   <td style="text-align:right;"> 72.6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Mexico </td>
   <td style="text-align:right;"> 76.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Nicaragua </td>
   <td style="text-align:right;"> 72.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Panama </td>
   <td style="text-align:right;"> 75.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Paraguay </td>
   <td style="text-align:right;"> 71.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Peru </td>
   <td style="text-align:right;"> 71.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Puerto Rico </td>
   <td style="text-align:right;"> 78.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Trinidad and Tobago </td>
   <td style="text-align:right;"> 69.8 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> United States </td>
   <td style="text-align:right;"> 78.2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Uruguay </td>
   <td style="text-align:right;"> 76.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Venezuela </td>
   <td style="text-align:right;"> 73.7 </td>
  </tr>
</tbody>
</table></div>

]

---

A _dot plot_:

.hide-code[

```r
thm <- theme_minimal() + theme(text = element_text(size = 16))
ggplot(le_am_2007, aes(y = country, x = lifeExp)) +
    geom_point(fill = "lightblue") +
    labs(x = "Life Expectancy (years)", y = NULL) +
    thm + ggtitle("Dot Plot")
```

<img src="intro_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---

A _bar chart_:

.hide-code[

```r
ggplot(le_am_2007, aes(x = lifeExp, y = country)) +
    geom_col(fill = "lightblue") +
    labs(x = "Life Expectancy (years)", y = NULL) +
    thm + ggtitle("Bar Chart")
```

<img src="intro_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

Another (bad!)  _bar chart_:

.hide-code[

```r
baseline <- 60
ticks <- c(0, 10, 20, 30)
ggplot(le_am_2007, aes(x = lifeExp - baseline, y = country)) +
    geom_col(fill = "lightblue") +
    labs(x = "Life Expectancy (years)", y = NULL) +
    scale_x_continuous(breaks = ticks, labels = ticks + baseline) +
    thm + ggtitle("Another Bar Chart")
```

<img src="intro_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---

We will look at:

* How to create these views using code that makes them easily reproducible.

* How to assess their advantages and disadvantages as visual
  representations of the data

A data set with more variables for more countries and years is
available in the `gapminder` R package.

Data preparation steps:

* _Filter_ the larger data set down to the countries and year we want.

* _Select_ the country name and life expectancy variables.

We will look at how to carry out these steps with reproducible code.

---
layout: true

## Yearly Snowfall in Iowa City

---
name: ic-snowfall-example

How did the winter of 2018/9 compare to other years?

---

The data are available from a [NOAA web serice
API](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) as a
[_CSV_](https://en.wikipedia.org/wiki/Comma-separated_values) file.

```
## # A tibble: 6 × 34
##    year month element VALUE1 VALUE2 VALUE3 VALUE4 VALUE5 VALUE6 VALUE7 VALUE8
##   <int> <int> <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
## 1  1893     1 TMAX       -17    -28   -150    -44    -17   -106    -56    -67
## 2  1893     1 TMIN       -67   -161   -233   -156   -156   -206   -111   -211
## 3  1893     1 PRCP         0      0      0     64      0     38      0      0
## 4  1893     1 SNOW         0      0      0     64      0     38      0      0
## 5  1893     2 TMAX        11    -56   -106   -122     33     28   -161    -94
## 6  1893     2 TMIN      -233   -206   -217   -267   -122   -211   -256   -278
## # ℹ 23 more variables: VALUE9 <int>, VALUE10 <int>, VALUE11 <int>,
## #   VALUE12 <int>, VALUE13 <int>, VALUE14 <int>, VALUE15 <int>, VALUE16 <int>,
## #   VALUE17 <int>, VALUE18 <int>, VALUE19 <int>, VALUE20 <int>, VALUE21 <int>,
## #   VALUE22 <int>, VALUE23 <int>, VALUE24 <int>, VALUE25 <int>, VALUE26 <int>,
## #   VALUE27 <int>, VALUE28 <int>, VALUE29 <int>, VALUE30 <int>, VALUE31 <int>
```

---

Data preparation steps:

* Read in the CSV file.

* _Reshape_ the data to have columns `date`, `TMAX`, `TMIN`, `SNOW` and `PRCP`.

* _Filter_ out bogus dates created by the original format.

* _Convert_ units to more standard (American) ones (e.g. milimeters to inches).

* ...

Code is available [here](#ic-snowfall-example-code).

---
layout: true
## Internet Adoption Across the World

---
name: internet-example

An example from [Wilke (2019)](https://clauswilke.com/dataviz/) with
[World Bank
data](https://data.worldbank.org/indicator/IT.NET.USER.ZS).

---

The data are available in several formats (CSV, XML, Excel).

```
## # A tibble: 6 × 65
##   Country.Name Country.Code Indicator.Name      Indicator.Code X1960 X1961 X1962
##   <chr>        <chr>        <chr>               <chr>          <int> <lgl> <lgl>
## 1 Aruba        ABW          Individuals using … IT.NET.USER.ZS    NA NA    NA   
## 2 Afghanistan  AFG          Individuals using … IT.NET.USER.ZS    NA NA    NA   
## 3 Angola       AGO          Individuals using … IT.NET.USER.ZS    NA NA    NA   
## 4 Albania      ALB          Individuals using … IT.NET.USER.ZS    NA NA    NA   
## 5 Andorra      AND          Individuals using … IT.NET.USER.ZS    NA NA    NA   
## 6 Arab World   ARB          Individuals using … IT.NET.USER.ZS    NA NA    NA   
## # ℹ 58 more variables: X1963 <lgl>, X1964 <lgl>, X1965 <int>, X1966 <lgl>,
## #   X1967 <lgl>, X1968 <lgl>, X1969 <lgl>, X1970 <int>, X1971 <lgl>,
## #   X1972 <lgl>, X1973 <lgl>, X1974 <lgl>, X1975 <int>, X1976 <int>,
## #   X1977 <int>, X1978 <int>, X1979 <int>, X1980 <int>, X1981 <int>,
## #   X1982 <int>, X1983 <int>, X1984 <int>, X1985 <int>, X1986 <int>,
## #   X1987 <int>, X1988 <int>, X1989 <int>, X1990 <dbl>, X1991 <dbl>,
## #   X1992 <dbl>, X1993 <dbl>, X1994 <dbl>, X1995 <dbl>, X1996 <dbl>, …
```

---

Data preparation:

* Read in the data.

* _Filter_ down to the countries we want.

* _Reshape_ to have columns `country`, `year`, and `users`.

* ...

Code is available [here](#internet-example-code).

---
layout:true
## Iowa Wind Turbines

---

Data is available from the [U.S. Wind Turbine
Database](https://eerscmap.usgs.gov/uswtdb/).

.hide-code[

```r
library(sf)
data(US_counties_geoms, package = "dviz.supp")
wtfile <- "uswtdb_v6_1_20231128.csv.gz"
if (! file.exists(wtfile))
    download.file(paste0("https://stat.uiowa.edu/~luke/data/", wtfile), wtfile)
wind_turbines <- read.csv(wtfile)

sf_wt <- st_as_sf(wind_turbines, coords = c("xlong", "ylat"), crs = 4326)
sf_wt_IA <- filter(sf_wt, t_state == "IA")
sf_wt_IA <- mutate(sf_wt_IA,
                   p_year = ifelse(p_year > 0, p_year, NA),
                   year = cut(p_year,
                              breaks = c(0, 2005, 2010, 2015, 2020, Inf),
                              labels = c("before 2005", "2005-2009",
                                         "2010-2014", "2015-2020",
                                         "2021 and later"),
                              right = FALSE))

ggplot(filter(US_counties_geoms$lower48, STATEFP == 19)) +
    geom_sf() +
    geom_sf(data = sf_wt_IA, aes(fill = year),
            shape = 21, color = "black", stroke = 0.25) +
    scale_fill_viridis_d(direction = -1, na.value = "red") +
    ggthemes::theme_map() +
    ggtitle("A Map") +
    theme(legend.background = element_rect(fill = "transparent"),
          plot.title = element_text(size = 16))
```

<img src="intro_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---

There are two data sets:

* _Shape_ information for drawing the map.

* Data on individual wind turbines.

Data preparation:

* Read in the data.

* Match up the _projection_ used for the map and location data.

* ...

---
layout: true
## Iowa Population in 2010

---

.hide-code[

```r
data(US_census, package = "dviz.supp")
uscounties <- mutate(US_counties_geoms$lower48,
                     FIPS = as.numeric(paste0(STATEFP, COUNTYFP)))
uscounties <- left_join(uscounties,
                        select(US_census, FIPS, pop2010),
                        "FIPS")
## old-style crs object detected; please recreate object with a recent sf::st_crs()
ggplot(filter(uscounties, STATEFP == 19)) +
    geom_sf(aes(fill = pop2010)) +
    scale_fill_viridis_c(direction = -1, trans = "log10",
                         labels = scales::comma) +
    ggtitle("A Choropleth Map") +
    ggthemes::theme_map() +
    theme(legend.background = element_rect(fill = "transparent"),
          plot.title = element_text(size = 16))
```

<img src="intro_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---

Again there are two data sets:

* Shape data for drawing the map.

* County population data from the 2010 census.

Data preparation:

* ...

* _Merge_ or _join_ the population data with the shape data.

* ...

---
layout: false
class: center, middle
# Reproducibility

---
## Reproducible Reports and Analyses

Preparing a report on a data analysis project usually involves

* reading the data

* _wrangling_ the data into usable form

* visualizing, summarizing, and modeling

* writing a report that includes your results

To make your work reproducible for someone else, or for you when the
data changes, it is best to use code for the entire workflow.

[_R Markdown_](https://rmarkdown.rstudio.com/) is one technology that
supports this.

---
## Tools for Reproducibility

R Markdown files contain report text along with code to produce
numerical and graphical results.

Tools are available to

* convert an R Markdown file into a PDF or HTML report;

* extract the code used to produce the computational and graphical results.

These slides were generated from the R Markdown file [intro.Rmd](intro.Rmd).

You will be creating R Markdown files like this for your homework and
project.

Some R Markdown tutorials:

* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
  by Yihui Xie is a book-length presentation.

* The [R Markdown Home Page](https://rmarkdown.rstudio.com) has a link
  to a [tutorial](https://rmarkdown.rstudio.com/lesson-1.html).

---
class: center, middle

# Code

---
layout: true
## Code for the [Iowa City Snowfall Example](#ic-snowfall-example)

---
name: ic-snowfall-example-code

Read the data:

```r
library(tidyverse)
library(lubridate)
if (! file.exists(here("ic_noaa.csv.gz")))
    download.file("http://www.stat.uiowa.edu/~luke/data/ic_noaa.csv.gz",
                  here("ic_noaa.csv.gz"))
ic_data_raw <- as_tibble(read.csv(here("ic_noaa.csv.gz"), head = TRUE))
```
--

Reshape from (very) wide to (too) long:

```r
ic_data <- select(ic_data_raw, year, month, element, starts_with("VALUE"))
ic_data <- pivot_longer(ic_data,
                        names_to = "day",
                        values_to = "value", c(VALUE1 : VALUE31))
```
--

Extract the day as a number:

```r
ic_data <- mutate(ic_data, day = as.integer(sub("VALUE", "", day)))
```

---

Reshape from too long to tidy with one row per day, keeping only the
primary variables:

```r
corevars <- c("TMAX", "TMIN", "PRCP", "SNOW", "SNWD")
ic_data <- filter(ic_data, element %in% corevars)
ic_data <- pivot_wider(ic_data, names_from = "element", values_from = "value")
```

Add a date variable for plotting and to help get rid of bogus days:

```r
ic_data <- mutate(ic_data, date = lubridate::make_date(year, month, day))
ic_data <- filter(ic_data, ! is.na(date))
```

Make units more standard (American):

```r
mm2in <- function(x) x / 25.4
C2F <- function(x) 32 + 1.8 * x
ic_data <- transmute(ic_data, year, month, day, date,
                     PRCP = mm2in(PRCP / 10),
                     SNOW = mm2in(SNOW),
                     SNWD = mm2in(SNWD),
                     TMIN = C2F(TMIN / 10),
                     TMAX = C2F(TMAX / 10))
```
---

Add a Month factor with abbreviated levels:

```r
ic_data <- mutate(ic_data,
                  Month = lubridate::month(month, label = TRUE, abbr = TRUE))
```

Associate January through June with the winter starting in the previous year:

```r
ic_data <- mutate(ic_data, wyear = ifelse(month <= 6, year - 1, year))
```

Compute the winter totals and the total for the 2018/9 winter:

```r
ic_snow <- group_by(ic_data, wyear) |>
    summarize(snow = sum(SNOW, na.rm = TRUE))
ic_snow_2018 <- filter(ic_snow, wyear == 2018)$snow
```
---

Create the histogram and show the 2018/9 total:

```r
ggplot(ic_snow) +
    geom_histogram(aes(x = snow), bins = 15, fill = "grey", color = "black") +
    geom_vline(xintercept = ic_snow_2018, color = "red", size = 2) +
    scale_x_continuous(name = "Total Annual Snowfall (inches)",
                       sec.axis = dup_axis(name = NULL,
                                           breaks = ic_snow_2018,
                                           labels = "2018/9")) +
    scale_y_continuous(expand = expansion(0, 0)) +
    ggtitle("A Histogram") +
    theme_minimal() +
    theme(text = element_text(size = 16))
```

---
layout: true
## Code for the [Internet Example](#internet-example)

---
name: internet-example-code

From
[code](https://github.com/clauswilke/dataviz/blob/master/visualizing_amounts.Rmd)
for the [_Visualizing amounts_
chapter](https://clauswilke.com/dataviz/visualizing-amounts.html) in
Claus Wilke's [Fundamentals of Data
Visualization](https://clauswilke.com/dataviz/index.html).

Reading in the data:

```r
base_url <- "http://api.worldbank.org/v2/en/indicator/"
if (file.exists("internet.xls")) {
    ## if we have to go with Excel
    ## xls_url <- paste0(base_url, "IT.NET.USER.ZS?downloadformat=excel")
    ## download.file(xls_url, "internet.xls")
    internet_raw <- readxl::read_xls("internet.xls",
                                     skip = 2, .name_repair = "universal")
    names(internet_raw) <- sub("\\.\\.\\.", "X", names(internet_raw))
} else {
    csvfile <- here("internet.csv")
    if (! file.exists(csvfile)) {
        csv_url <- paste0(base_url, "IT.NET.USER.ZS?downloadformat=csv")
        download.file(csv_url, "internet.zip")
        unzip("internet.zip")
        file.rename("API_IT.NET.USER.ZS_DS2_en_csv_v2_2255007.csv",
                    csvfile)
    }

internet_raw <- read.csv(csvfile, skip = 4)
}
```

---

Reshape to longer format:

```r
internet <- select(internet_raw, country = Country.Name, matches("X.+"))
internet <- pivot_longer(internet, -country,
                         names_to = "year", values_to = "users")
internet <- mutate(internet, year = as.integer(sub("X", "", year)))
```
--

Select some countries to include in the plot:

```r
country_list <- c("United States", "China", "India", "Japan", "Algeria",
                  "Brazil", "Germany", "France", "United Kingdom", "Italy",
                  "New Zealand", "Canada", "Mexico", "Chile", "Argentina",
                  "Norway", "South Africa", "Kenya", "Israel", "Iceland")

internet_short <- filter(internet, country %in% country_list)
## internet_short <- mutate(internet_short,
##                          users = ifelse(is.na(users), 0, users))
```

---

Get ordering by 2017 levels:

```r
intr17 <- filter(internet_short, year == 2017)
levs <- arrange(intr17, users)$country
```
--

The basic plot:

```r
p_inet <- ggplot(filter(internet_short, year > 1993),
                 aes(x = year,
                     y = factor(country, levs),
                     fill = users)) +
    geom_tile(color = "white", size = 0.25)
```

---

Adjust color palette and guide:

```r
p_inet <- p_inet +
    scale_fill_viridis_c(
        option = "A", begin = 0.05, end = 0.98,
        limits = c(0, 100),
        name = "internet users / 100 people",
        guide = guide_colorbar(
            direction = "horizontal",
            label.position = "bottom",
            title.position = "top",
            ticks = FALSE,
            barwidth = grid::unit(3.5, "in"),
            barheight = grid::unit(0.2, "in"),
            order = 1))
```

---

Adjust `x` and `y` scales:

```r
p_inet <- p_inet +
    scale_x_continuous(expand = c(0, 0), name = NULL) +
    scale_y_discrete(name = NULL, position = "right")
```
--

Add layer for `NA` values:

```r
p_inet <- p_inet +
    ggnewscale::new_scale_fill() +
    geom_tile(data = filter(internet_short, year > 1993, is.na(users)),
              aes(fill = "NA"), color = "white") +
    scale_fill_manual(values = "grey50") +
    guides(fill = guide_legend(title = "",
                               label.position = "bottom",
                               title.position = "top",
                               keyheight = grid::unit(0.2, "in"),
                               keywidth = grid::unit(0.2, "in"),
                               order = 2))
```

---

Final plot with title and theme adjustments:

```r
p_inet +
    ggtitle("A Heat Map",
            "Countries ordered by percent internet users in 2017.") +
    theme(text = element_text(size = 16)) +
    theme(axis.line = element_blank(),
          axis.ticks = element_blank(),
          axis.ticks.length = grid::unit(1, "pt"),
          legend.position = "top",
          legend.justification = "left",
          legend.title.align = 0.5,
          legend.title = element_text(size = 12 * 12 / 14)
          )
```

//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505