---
title: "Data Technology Notes"
output:
html_document:
toc: yes
---
```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```
```{r, include = FALSE}
library(lattice)
library(tidyverse)
library(gridExtra)
set.seed(12345)
```
## Data Technologies
Class and homework examples often start with a nice, clean,
rectangular data set.
Once you have a source for your data, _data technologies_ are the
tools you need to get your data to that point.
The process of applying these tools is sometimes called [_data
wrangling_](http://r4ds.had.co.nz/wrangle-intro.html).
The main steps are usually
- importing the data;
- cleaning the data;
- merging information from several sources;
- transforming to specialized data types
- reshaping the data.
We have seen many of these steps already.
## Data Import
Your data may be available in a number of different forms, such as:
- as a CSV file;
- another form of text file;
- a spread sheet;
- a table on a web page;
- from a web service;
- in a SQL data base
- in some other type of data base or file format.
We have seen several examples of importing CSV files.
- CSV files are often fairly easy to deal with.
- They are typically meant for data import rather than human reading.
- Spread sheet programs typically provide a mechanism for exporting
data in CSV form.
- Some issues that you need to look for:
- identifying column names;
- identifing row names;
- use of a semicolon separator (common in Europe).
[Unemployment
data](http://www.stat.uiowa.edu/~luke/data/laus/laucntycur14-2018.txt)
from the [Local Area Unemployment Statistics (LAUS)
page](https://www.bls.gov/lau/) ofthe Bureau of Labor Statistics is an
example of a structured text file not in CSV format.
- These files usually have enough structure for automated import.
- Often they have features designed for a human reader that need to be
worked around.
Spread sheets are in principle designed for machine computation.
- Many spread sheets have features designed for a human reader that
need to be worked around.
- Exporting (the relevant part of) a spread sheet to CSV is sometimes
a good option.
- Using a tool that can import directly from the spread sheet is often
better: it makes the analysis easier to reproduce and repeat with
new data.
## Wrangling the Unemployment Data
The unemployment data is available as a [text
file](http://www.stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt).
- The fields are delimited with a `|` character and surrounded by
white space.
- There is a header and a footer that need to be removed.
- The counts use a comma to separate thousands.
- The column names are awkward.
- Some names contain an apostrophe, which can be confused with a quote
character.
The base function `read.table` and the `readr` function `read_table`
have many options for dealing with these issues.
We can use `read.table` with:
- `col.names` to specify column names;
- `skip = 6` to remove the header;
- `strip_white` to remove white space around items
- `quote = ''` to turn off interpreting any character as delimiting a
quoted string
- `fill = TRUE` to fill in columns in the footer.
```{r}
lausURL <- "http://www.stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt"
lausFile <- "laucntycur14-2017.txt"
if (! file.exists(lausFile))
download.file(lausURL, lausFile)
lausUS <- read.table(lausFile,
col.names = c("LAUSAreaCode", "State", "County",
"Title", "Period",
"LaborForce", "Employed",
"Unemployed", "UnempRate"),
quote = '', sep = "|", skip = 6, strip.white = TRUE,
fill = TRUE)
head(lausUS)
tail(lausUS)
```
Searching for a sequence of dashes allows the code to drop the footer
to work even if the number of rows changes.
```{r}
footstart <- grep("------", lausUS$LAUSAreaCode)
lausUS <- lausUS[1:(footstart - 1), ]
```
`str` show the data types we now have:
```{r}
str(lausUS)
```
Convert the counts to numbers by removing the commas and passing to
`as.numeric`:
```{r}
lausUS <- mutate(lausUS,
LaborForce = as.numeric(gsub(",", "", LaborForce)),
Employed = as.numeric(gsub(",", "", Employed)),
Unemployed = as.numeric(gsub(",", "", Unemployed)))
```
The expression
```r
gsub(",", "", x)
```
replaces all occurrences of the pattern `","` in the elements of `x`
with the empty string `""`. Using `sub` instead of `gsub` would
replace only the first occurrence in each element.
We could also separate out the state character codes and county names:
```{r}
lausUS <- mutate(lausUS,
StateCharCode = sub("^.*, ", "", Title),
CountyName = sub(", .*", "", Title))
```
We could also convert the `Period` variable to a date format.
## Refugee Arrival Data
The [Refugee Processing Center](http://www.wrapsnet.org/) provides
information from the United States Refugee Admissions Program (USRAP),
including
[arrivals by state and nationality](http://www.wrapsnet.org/admissions-and-arrivals/).
Three files are available locally:
- data from [early January 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls);
- data from [early April 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls).
- data from [early March 2018](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls).
These are Excel spread sheets.
The function `read_excel` in the `readxl` package provides a way to
import this kind of data into R.
```{r, include = FALSE}
if (! file.exists("Arrivals-2017-01-06.xls"))
download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls", "Arrivals-2017-01-06.xls")
if (! file.exists("Arrivals-2017-04-05.xls"))
download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls", "Arrivals-2017-04-05.xls")
if (! file.exists("Arrivals-2018-03-05.xls"))
download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls", "Arrivals-2018-03-05.xls")
```
Load the `readxl` package and define a short name for the file:
```{r}
library(readxl)
fname <- "Arrivals-2017-01-06.xls"
```
Read the `FY` line:
```{r}
read_excel(fname, skip = 13, n_max = 1)
year_line <- read_excel(fname, skip = 13,
col_names = FALSE, n_max = 1)
as.numeric(sub("FY ", "", year_line[[1]]))
```
A useful sanity check on the format might be:
```{r}
stopifnot(length(year_line) == 1 && grepl("FY [[:digit:]]+", year_line[[1]]))
year <- as.numeric(sub("FY ", "", year_line[[1]]))
```
Read in the data:
```{r}
d <- read_excel(fname, skip = 16)
head(d)
tail(d)
```
The last line needs to be dropped. Another sanity chack first is a good idea.
```{r}
stopifnot(all(is.na(tail(d, 1)[, -1])))
d <- d[seq_len(nrow(d) - 1), ]
```
```{r}
state_lines <- which(! is.na(d[[1]]))
d[state_lines, ]
```
We can use the first column as the destination:
```{r}
library(dplyr)
d <- rename(d, Dest = 1)
```
We need to replace the `NA` values in `Dest` by the previous non-`NA`
value. There are several options:
- use a `for` loop;
- calculate the gaps and use `rep`;
- use the `fill` function from `tidyr`.
A `for` loop solution:
```{r}
v0 <- d[[1]]
for (i in seq_along(v0)) {
if (! is.na(v0[i]))
s <- v0[i]
else
v0[i] <- s
}
```
A solution using `rep`:
```{r}
v1 <- rep(d[state_lines[-length(state_lines)], 1][[1]], diff(state_lines))
```
A search on `R carry forward` brings up the `fill` function in `tidyr`:
```{r}
v2 <- fill(d, Dest)[[1]]
```
All three approaches produce identical results:
```{r}
identical(v0, v1)
identical(v0, v2)
```
Using `fill` is easiest:
```{r}
d <- fill(d, Dest)
```
Dropping the state lines gives the date we need:
```{r}
head(d[-state_lines, ])
```
To be able to read new files from this source it is useful to wrap
this in a function:
```{r}
library(readxl)
library(dplyr)
library(tidyr)
readRefXLS <- function(fname) {
read_excel(fname, skip = 13, n_max = 1)
## read and check the FY line
year_line <- read_excel(fname, skip = 13,
col_names = FALSE, n_max = 1)
stopifnot(length(year_line) == 1 &&
grepl("FY [[:digit:]]+", year_line[[1]]))
year <- as.numeric(sub("FY ", "", year_line[[1]]))
d <- read_excel(fname, skip = 16)
## check and trim the last line
stopifnot(all(is.na(tail(d, 1)[, -1])))
d <- d[seq_len(nrow(d) - 1), ]
## identify the state summary lines
state_lines <- which(! is.na(d[[1]]))
## rename and fill first column
d <- rename(d, Dest = 1)
d <- fill(d, Dest)
## drop the state summaries and add the FY
d <- d[-state_lines, ]
d$FY <- year
d
}
```
This can read all the files:
```{r}
d1 <- readRefXLS("Arrivals-2017-01-06.xls")
d2 <- readRefXLS("Arrivals-2017-04-05.xls")
d3 <- readRefXLS("Arrivals-2018-03-05.xls")
```
Some explorations:
```{r}
td3 <- gather(d3, month, count, 3 : 14)
arr_by_dest <- summarize(group_by(td3, Dest), count = sum(count))
arr_by_nat <- summarize(group_by(td3, Nationality), count = sum(count))
arr_by_dest_nat <- summarize(group_by(td3, Dest, Nationality),
count = sum(count))
```
```{r}
ggplot(arr_by_dest, aes(count, Dest)) + geom_point()
ggplot(arr_by_dest, aes(count, reorder(Dest, count))) + geom_point()
ggplot(top_n(arr_by_dest, 5, count), aes(count, Dest)) + geom_point()
ggplot(arr_by_nat, aes(count, Nationality)) + geom_point()
ggplot(arr_by_nat, aes(count, reorder(Nationality, count))) + geom_point()
ggplot(top_n(arr_by_nat, 5, count), aes(count, Nationality)) + geom_point()
```
A useful visualization of flows is a [_Sankey
diagram_](https://en.wikipedia.org/wiki/Sankey_diagram). These are
also known as
- [alluvial diagrams](https://en.wikipedia.org/wiki/Alluvial_diagram);
- river digrams;
and some would argue that allucial diagrams and river diagrams are
different things.
Several R packages create these; including
- `alluvial`
- `riverplot`
- `googleVis`.
This code creates a Sankey diagram using `alluvial`:
```{r}
n_top <- 10
s_top <- 25
toprefs <- arrange(arr_by_nat, desc(count))$Nationality[1 : n_top]
topstates <- arrange(arr_by_dest, desc(count))$Dest[1 : s_top]
td3m <- mutate(td3,
Nationality = ifelse(Nationality %in% toprefs,
Nationality, "Other Nat."),
Dest = ifelse(Dest %in% topstates, Dest, "Other Dest."))
std3m <- summarize(group_by(td3m, Dest, Nationality), count = sum(count))
pal <- c(rainbow(n_top), "grey")
pal <- c(RColorBrewer::brewer.pal(n_top, "Paired"), "grey")
src <- c(toprefs, "Other Nat.")
library(alluvial)
with(std3m, alluvial(Nationality, Dest, freq = count,
col = pal[match(Nationality, src)],
cex = 0.8, alpha = 0.7))
```
Another option is a _chord diagram_:
```{r, fig.height = 10, message = FALSE}
library(circlize)
chordDiagram(select(std3m, Nationality, Dest, count))
```
## Gapminder Childhood Mortality Data
The `gapminder` package provides a subset of the data from the
[Gapminder](http://www.gapminder.org/) web site. Additional data sets
are [available](http://www.gapminder.org/data/).
- A data set on childhood mortality is available locally as a
[csv file](http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.csv)
or an
[Excel file](http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx).
- The numbers represent number of deaths within the first five years
per 1000 births.
Loading the data:
```{r}
if (! file.exists("gapminder-under5mortality.xlsx"))
download.file("http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx", "gapminder-under5mortality.xlsx")
gcm <- read_excel("gapminder-under5mortality.xlsx")
names(gcm)[1]
names(gcm)[1] <- "country"
```
A _tidy_ version is useful for working with `ggplot`.
```{r}
tgcm <- gather(gcm, year, u5mort, -1)
head(tgcm)
tgcm <- mutate(tgcm, year = as.numeric(year))
head(tgcm)
```
A multiple time series version may also be useful.
```{r}
gcmts <- ts(t(gcm[-1]), start = 1800)
colnames(gcmts) <- gcm$country
```
Some explorations:
```{r}
p <- ggplot(tgcm) + geom_line(aes(year, u5mort, group = country), alpha = 0.3)
p
plotly::ggplotly(p)
xyplot(gcmts, superpose = TRUE, auto.key = FALSE)
xyplot(gcmts, superpose = TRUE, auto.key = FALSE, col = "black")
```
Examining the missing values:
```{r}
anyNA(gcmts)
sum(is.na(gcmts))
length(gcmts)
image(time(gcmts), seq_len(ncol(gcmts)), is.na(gcmts), ylab = "series")
```
Re-ordering by the number of missing values may be useful:
```{r}
naord <- order(apply(gcmts, 2, function(x) sum(is.na(x))))
image(time(gcmts), seq_len(ncol(gcmts)), is.na(gcmts)[, naord], ylab = "series")
```
Some series look like the early values are estimates. Identifying and
removing the ones that are constant between 1800 and 1850:
```{r}
gcmtsNoNA <- gcmts[, ! apply(gcmts, 2, anyNA)]
i1850 <- which(time(gcmts) == 1850)
early_sd <- apply(gcmtsNoNA[1 : i1850, ], 2, sd)
filled_in <- names(which(early_sd == 0))
ggplot(filter(tgcm, ! country %in% filled_in)) +
geom_line(aes(year, u5mort, group = country), alpha = 0.3, na.rm = TRUE)
```
To show mortality rates by continent we can use the continent
identification in the `gapminder` data set of the `gapminder` package.
There are a number of countries in the new data set not in `gapminder`
```{r}
library(gapminder)
dim(gcm)
length(unique(gapminder$country))
head(setdiff(gcm$country, gapminder$country))
```
There are several options for adding the continent information from
`gapminder` to one of our data frames:
- look-up using match
- a left join or inner join using `left_join`
- a left join or inner join using `merge`.
Using `match`, look up the index of the first row in `gapminder` for
the country; then look up the corresponding continent:
```{r}
cidx <- match(tgcm$country, gapminder$country)
head(cidx)
cont <- gapminder$continent[cidx]
head(cont)
d1 <- mutate(tgcm, continent = cont)
```
Our left table will map country to continent, with one row per country:
```{r}
country_continent <- unique(select(gapminder, country, continent))
head(country_continent)
```
The left join using `left_join`:
```{r}
d2 <- left_join(tgcm, country_continent)
```
Joins in general may not produce the ordering you want; you need to
check or make sure by reordering:
```{r}
d1 <- arrange(d1, year, country)
d2 <- arrange(d2, year, country)
identical(unclass(d1), unclass(d2))
```
The left join using `merge`:
```{r}
d3 <- merge(tgcm, country_continent, all.x = TRUE)
d3 <- arrange(d3, year, country)
identical(d1, d3)
```
Mapping continent to `color`:
```{r}
ggplot(d3) +
geom_line(aes(year, u5mort, group = country, color = continent),
alpha = 0.7, na.rm = TRUE)
```
To compare child mortality to GDP per capita for the years covered by
the `gapminder` data frame we need to merge the mortality data into
`gapminder`.
Some country names in `gapminder` do not appear in the child mortality
data:
```{r}
setdiff(gapminder$country, tgcm$country)
grep("Korea", unique(tgcm$country), value = TRUE)
grep("Yemen", unique(tgcm$country), value = TRUE)
```
Change the names in `tgcm`` to match the names in `gapminder`:
```{r}
ink <- which(tgcm$country == "North Korea")
isk <- which(tgcm$country == "South Korea")
iym <- which(tgcm$country == "Yemen")
tgcm$country[ink] <- "Korea, Dem. Rep."
tgcm$country[isk] <- "Korea, Rep."
tgcm$country[iym] <- "Yemen, Rep."
```
Now left join the mortality data with `gapminder`; the key is the
combination of `country` and `year`:
Some plots:
```{r}
gm <- left_join(gapminder, tgcm)
```
```{r}
p <- ggplot(gm, aes(gdpPercap, u5mort, color = continent)) + geom_point()
p
p + scale_x_log10()
p + scale_x_log10() + facet_wrap(~ cut(year, 4))
```