---
title: "Data Technology Notes"
output:
  html_document:
    toc: yes
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

```{r, include = FALSE}
library(lattice)
library(tidyverse)
library(gridExtra)
set.seed(12345)
```

## Data Technologies

Class and homework examples often start with a nice, clean,
rectangular data set.

Once you have a source for your data, _data technologies_ are the
tools you need to get your data to that point.

The process of applying these tools is sometimes called [_data
wrangling_](https://r4ds.hadley.nz/intro.html#what-you-will-learn).

The main steps are usually

- importing the data;

- cleaning the data;

- merging information from several sources;

- transforming to specialized data types

- reshaping the data.

We have seen many of these steps already.

## Data Import

Your data may be available in a number of different forms, such as:

- as a CSV file;

- another form of text file;

- a spread sheet;

- a table on a web page;

- from a web service;

- in a SQL data base

- in some other type of data base or file format.


We have seen several examples of importing CSV files.

- CSV files are often fairly easy to deal with.

- They are typically meant for data import rather than human reading.

- Spread sheet programs typically provide a mechanism for exporting
  data in CSV form.

- Some issues that you need to look for:

    - identifying column names;

    - identifing row names;

    - use of a semicolon separator (common in Europe).

[Unemployment
data](https://stat.uiowa.edu/~luke/data/laus/laucntycur14-2018.txt)
from the [Local Area Unemployment Statistics (LAUS)
page](https://www.bls.gov/lau/) ofthe Bureau of Labor Statistics is an
example of a structured text file not in CSV format.

- These files usually have enough structure for automated import.

- Often they have features designed for a human reader that need to be
  worked around.

Spread sheets are in principle designed for machine computation.

- Many spread sheets have features designed for a human reader that
  need to be worked around.

- Exporting (the relevant part of) a spread sheet to CSV is sometimes
  a good option.

- Using a tool that can import directly from the spread sheet is often
  better: it makes the analysis easier to reproduce and repeat with
  new data.

## Wrangling the Unemployment Data

The unemployment data is available as a [text
file](https://stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt).

- The fields are delimited with a `|` character and surrounded by
  white space.

- There is a header and a footer that need to be removed.

- The counts use a comma to separate thousands.

- The column names are awkward.

- Some names contain an apostrophe, which can be confused with a quote
  character.

The base function `read.table` and the `readr` function `read_table`
have many options for dealing with these issues.

We can use `read.table` with:

- `col.names` to specify column names;
- `skip = 6` to remove the header;
- `strip_white` to remove white space around items
- `quote = ""` to turn off interpreting any character as delimiting a
  quoted string
- `fill = TRUE` to fill in columns in the footer.

```{r}
lausURL <- "https://stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt"
lausFile <- "laucntycur14-2017.txt"
if (! file.exists(lausFile))
    download.file(lausURL, lausFile)
lausUS <- read.table(lausFile,
                     col.names = c("LAUSAreaCode", "State", "County",
                                   "Title", "Period",
                                   "LaborForce", "Employed",
                                   "Unemployed", "UnempRate"),
                     quote = "", sep = "|", skip = 6, strip.white = TRUE,
                     fill = TRUE)
head(lausUS)
tail(lausUS)
```

Searching for a sequence of dashes allows the code to drop the footer
to work even if the number of rows changes.

```{r}
footstart <- grep("------", lausUS$LAUSAreaCode)
lausUS <- lausUS[1:(footstart - 1), ]
```

`str` show the data types we now have:
```{r}
str(lausUS)
```

Convert the counts to numbers by removing the commas and passing to
`as.numeric`:

```{r}
lausUS <- mutate(lausUS,
                 LaborForce = as.numeric(gsub(",", "", LaborForce)),
                 Employed = as.numeric(gsub(",", "", Employed)),
                 Unemployed = as.numeric(gsub(",", "", Unemployed)))
```

The expression

```r
gsub(",", "", x)
```

replaces all occurrences of the pattern `","` in the elements of `x`
with the empty string `""`. Using `sub` instead of `gsub` would
replace only the first occurrence in each element.

We could also separate out the state character codes and county names:

```{r}
lausUS <- mutate(lausUS,
                 StateCharCode = sub("^.*, ", "", Title),
                 CountyName = sub(", .*", "", Title))
```

We could also convert the `Period` variable to a date format.


## Refugee Arrival Data

The [Refugee Processing Center](http://www.wrapsnet.org/) provides
information from the United States Refugee Admissions Program (USRAP),
including
[arrivals by state and nationality](http://www.wrapsnet.org/admissions-and-arrivals/).

Three files are available locally:

- data from [early January 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls);

- data from [early April 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls).

- data from [early March 2018](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls).

These are Excel spread sheets.

The function `read_excel` in the `readxl` package provides a way to
import this kind of data into R.

<!--## nolint start: line_length -->
```{r, include = FALSE}
if (! file.exists("Arrivals-2017-01-06.xls"))
    download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls", "Arrivals-2017-01-06.xls")
if (! file.exists("Arrivals-2017-04-05.xls"))
    download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls", "Arrivals-2017-04-05.xls")
if (! file.exists("Arrivals-2018-03-05.xls"))
    download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls", "Arrivals-2018-03-05.xls")
```
<!--## nolint end -->

Load the `readxl` package and define a short name for the file:
```{r}
library(readxl)
fname <- "Arrivals-2017-01-06.xls"
```

Read the `FY` line:
```{r}
read_excel(fname, skip = 13, n_max = 1)
year_line <- read_excel(fname, skip = 13,
                        col_names = FALSE, n_max = 1)
as.numeric(sub("FY ", "", year_line[[1]]))
```

A useful sanity check on the format might be:
```{r}
stopifnot(length(year_line) == 1 && grepl("FY [[:digit:]]+", year_line[[1]]))
year <- as.numeric(sub("FY ", "", year_line[[1]]))
```

Read in the data:

```{r}
d <- read_excel(fname, skip = 16)
head(d)
tail(d)
```

The last line needs to be dropped. Another sanity chack first is a good idea.

```{r}
stopifnot(all(is.na(tail(d, 1)[, -1])))
d <- d[seq_len(nrow(d) - 1), ]
```

```{r}
state_lines <- which(! is.na(d[[1]]))
d[state_lines, ]
```

We can use the first column as the destination:

```{r}
library(dplyr)
d <- rename(d, Dest = 1)
```

We need to replace the `NA` values in `Dest` by the previous non-`NA`
value. There are several options:

- use a `for` loop;
- calculate the gaps and use `rep`;
- use the `fill` function from `tidyr`.

A `for` loop solution:

```{r}
v0 <- d[[1]]
for (i in seq_along(v0)) {
    if (! is.na(v0[i]))
        s <- v0[i]
    else
        v0[i] <- s
}
```

A solution using `rep`:

```{r}
v1 <- rep(d[state_lines[-length(state_lines)], 1][[1]], diff(state_lines))
```

A search on `R carry forward` brings up the `fill` function in `tidyr`:

```{r}
v2 <- fill(d, Dest)[[1]]
```

All three approaches produce identical results:

```{r}
identical(v0, v1)
identical(v0, v2)
```

Using `fill` is easiest:

```{r}
d <- fill(d, Dest)
```

Dropping the state lines gives the date we need:

```{r}
head(d[-state_lines, ])
```

To be able to read new files from this source it is useful to wrap
this in a function:

<!-- ## nolint start: object_usage -->
```{r}
library(readxl)
library(dplyr)
library(tidyr)

readRefXLS <- function(fname) {
    read_excel(fname, skip = 13, n_max = 1)

    ## read and check the FY line
    year_line <- read_excel(fname, skip = 13,
                            col_names = FALSE, n_max = 1)
    stopifnot(length(year_line) == 1 &&
              grepl("FY [[:digit:]]+", year_line[[1]]))
    year <- as.numeric(sub("FY ", "", year_line[[1]]))

    d <- read_excel(fname, skip = 16)

    ## check and trim the last line
    stopifnot(all(is.na(tail(d, 1)[, -1])))
    d <- d[seq_len(nrow(d) - 1), ]

    ## identify the state summary lines
    state_lines <- which(! is.na(d[[1]]))

    ## rename and fill first column
    d <- rename(d, Dest = 1)
    d <- fill(d, Dest)

    ## drop the state summaries and add the FY
    d <- d[-state_lines, ]
    d$FY <- year

    d
}
```
<!-- ## nolint end -->

This can read all the files:
```{r}
d1 <- readRefXLS("Arrivals-2017-01-06.xls")
d2 <- readRefXLS("Arrivals-2017-04-05.xls")
d3 <- readRefXLS("Arrivals-2018-03-05.xls")
```

Some explorations:
```{r}
td3 <- gather(d3, month, count, 3 : 14)

arr_by_dest <- summarize(group_by(td3, Dest), count = sum(count))
arr_by_nat <- summarize(group_by(td3, Nationality), count = sum(count))
arr_by_dest_nat <- summarize(group_by(td3, Dest, Nationality),
                             count = sum(count))
```

```{r}
ggplot(arr_by_dest, aes(count, Dest)) + geom_point()
ggplot(arr_by_dest, aes(count, reorder(Dest, count))) + geom_point()
ggplot(slice_max(arr_by_dest, count, n = 5), aes(count, Dest)) + geom_point()

ggplot(arr_by_nat, aes(count, Nationality)) + geom_point()
ggplot(arr_by_nat, aes(count, reorder(Nationality, count))) + geom_point()
ggplot(slice_max(arr_by_nat, count, n = 5), aes(count, Nationality)) +
    geom_point()
```

A useful visualization of flows is a [_Sankey
diagram_](https://en.wikipedia.org/wiki/Sankey_diagram). These are
also known as

- [alluvial diagrams](https://en.wikipedia.org/wiki/Alluvial_diagram);

- river digrams;

and some would argue that allucial diagrams and river diagrams are
different things.

Several R packages create these; including

- `alluvial`

- `riverplot`

- `googleVis`.

This code creates a Sankey diagram using `alluvial`:

```{r}
n_top <- 10
s_top <- 25

toprefs <- arrange(arr_by_nat, desc(count))$Nationality[1 : n_top]
topstates <- arrange(arr_by_dest, desc(count))$Dest[1 : s_top]

td3m <- mutate(td3,
               Nationality = ifelse(Nationality %in% toprefs,
                                    Nationality, "Other Nat."),
               Dest = ifelse(Dest %in% topstates, Dest, "Other Dest."))

std3m <- summarize(group_by(td3m, Dest, Nationality), count = sum(count))

pal <- c(rainbow(n_top), "grey")
pal <- c(RColorBrewer::brewer.pal(n_top, "Paired"), "grey")
src <- c(toprefs, "Other Nat.")

library(alluvial)
with(std3m, alluvial(Nationality, Dest, freq = count,
                     col = pal[match(Nationality, src)],
                     cex = 0.8, alpha = 0.7))

```

Another option is a _chord diagram_:

```{r, fig.height = 10, message = FALSE}
library(circlize)
chordDiagram(select(std3m, Nationality, Dest, count))
```

## Gapminder Childhood Mortality Data

The `gapminder` package provides a subset of the data from the
[Gapminder](https://www.gapminder.org/) web site. Additional data sets
are [available](https://www.gapminder.org/data/).

- A data set on childhood mortality is available locally as a
  [csv file](https://stat.uiowa.edu/~luke/data/gapminder-under5mortality.csv)
  or an
  [Excel file](http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx).

- The numbers represent number of deaths within the first five years
  per 1000 births.

Loading the data:

<!--## nolint start: line_length -->
```{r}
if (! file.exists("gapminder-under5mortality.xlsx"))
    download.file("http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx", "gapminder-under5mortality.xlsx")
gcm <- read_excel("gapminder-under5mortality.xlsx")
names(gcm)[1]
names(gcm)[1] <- "country"
```
<!-- ## nolint end -->

A _tidy_ version is useful for working with `ggplot`.

```{r}
tgcm <- gather(gcm, year, u5mort, -1)
head(tgcm)
tgcm <- mutate(tgcm, year = as.numeric(year))
head(tgcm)
```

A multiple time series version may also be useful.
```{r}
gcmts <- ts(t(gcm[-1]), start = 1800)
colnames(gcmts) <- gcm$country
```

Some explorations:
```{r}
p <- ggplot(tgcm) + geom_line(aes(year, u5mort, group = country), alpha = 0.3)
p
plotly::ggplotly(p)

xyplot(gcmts, superpose = TRUE, auto.key = FALSE)
xyplot(gcmts, superpose = TRUE, auto.key = FALSE, col = "black")
```

Examining the missing values:
```{r}
anyNA(gcmts)
sum(is.na(gcmts))
length(gcmts)

image(time(gcmts), seq_len(ncol(gcmts)), is.na(gcmts), ylab = "series")
```

Re-ordering by the number of missing values may be useful:

```{r}
naord <- order(apply(gcmts, 2, function(x) sum(is.na(x))))
image(time(gcmts), seq_len(ncol(gcmts)), is.na(gcmts)[, naord], ylab = "series")
```

Some series look like the early values are estimates. Identifying and
removing the ones that are constant between 1800 and 1850:

```{r}
gcmtsNoNA <- gcmts[, ! apply(gcmts, 2, anyNA)]

i1850 <- which(time(gcmts) == 1850)
early_sd <- apply(gcmtsNoNA[1 : i1850, ], 2, sd)
filled_in <- names(which(early_sd == 0))

ggplot(filter(tgcm, ! country %in% filled_in)) +
    geom_line(aes(year, u5mort, group = country), alpha = 0.3, na.rm = TRUE)
```

To show mortality rates by continent we can use the continent
identification in the `gapminder` data set of the `gapminder` package.

There are a number of countries in the new data set not in `gapminder`
```{r}
library(gapminder)
dim(gcm)
length(unique(gapminder$country))
head(setdiff(gcm$country, gapminder$country))
```

There are several options for adding the continent information from
`gapminder` to one of our data frames:

- look-up using match

- a left join or inner join using `left_join`

- a left join or inner join using `merge`.

Using `match`,  look up the index of the first row in `gapminder` for
the country; then look up the corresponding continent:

```{r}
cidx <- match(tgcm$country, gapminder$country)
head(cidx)

cont <- gapminder$continent[cidx]
head(cont)

d1 <- mutate(tgcm, continent = cont)
```

Our left table will map country to continent, with one row per country:

```{r}
country_continent <- unique(select(gapminder, country, continent))
head(country_continent)
```

The left join using `left_join`:

```{r}
d2 <- left_join(tgcm, country_continent)
```

Joins in general may not produce the ordering you want; you need to
check or make sure by reordering:

```{r}
d1 <- arrange(d1, year, country)
d2 <- arrange(d2, year, country)
identical(unclass(d1), unclass(d2))
```

The left join using `merge`:
```{r}
d3 <- merge(tgcm, country_continent, all.x = TRUE)
d3 <- arrange(d3, year, country)
identical(d1, d3)
```

Mapping continent to `color`:

```{r}
ggplot(d3) +
    geom_line(aes(year, u5mort, group = country, color = continent),
              alpha = 0.7, na.rm = TRUE)
```

To compare child mortality to GDP per capita for the years covered by
the `gapminder` data frame we need to merge the mortality data into
`gapminder`.

Some country names in `gapminder` do not appear in the child mortality
data:

```{r}
setdiff(gapminder$country, tgcm$country)
grep("Korea", unique(tgcm$country), value = TRUE)
grep("Yemen", unique(tgcm$country), value = TRUE)
```

Change the names in `tgcm`` to match the names in `gapminder`:

```{r}
ink <- which(tgcm$country == "North Korea")
isk <- which(tgcm$country == "South Korea")
iym <- which(tgcm$country == "Yemen")
tgcm$country[ink] <- "Korea, Dem. Rep."
tgcm$country[isk] <- "Korea, Rep."
tgcm$country[iym] <- "Yemen, Rep."
```

Now left join the mortality data with `gapminder`; the key is the
combination of `country` and `year`:

Some plots:
```{r}
gm <- left_join(gapminder, tgcm)
```

```{r}
p <- ggplot(gm, aes(gdpPercap, u5mort, color = continent)) + geom_point()
p
p + scale_x_log10()
p + scale_x_log10() + facet_wrap(~ cut(year, 4))
```

<!--
**** merge 'other' sources and destinations
**** sort differently?
E <- tapply(1 : nrow(arr_by_dest_nat), arr_by_dest_nat$Nationality, function(i) structure(arr_by_dest_nat$count[i], names = arr_by_dest_nat$Dest[i]))

toprefs <- arrange(summarize(group_by(arr_by_dest_nat, Nationality), count = sum(count)), desc(count))$Nationality[1:10]
topstates <- arrange(summarize(group_by(arr_by_dest_nat, Dest), count = sum(count)), desc(count))$Dest[1:10]

ed2 <- filter(arr_by_dest_nat, Nationality %in% toprefs & Dest %in% topstates)
ed2 <- rename(ed2, N1 = Nationality, N2 = Dest, Value = count)
ed2 <- mutate(ed2, ID = paste(N1, N2, sep = "->"))

plot(gvisSankey(select(ed2, N1, N2, Value), "N1", "N2", "Value"))

library(riverplot)
V <- data.frame(ID = unique(c(ed2$N1, ed2$N2)))
V <- mutate(V, x = ifelse(ID %in% ed2$N1, 1, 2))
plot(makeRiver(mutate(V, srt = 0, textpos = ifelse(x == 1, 2, 4)), as.data.frame(ed2)))
plot(makeRiver(mutate(V, srt = 0, textpos = ifelse(x == 1, 2, 4)), as.data.frame(ed2)), plot_area = 0.8, textcex = 0.8, xscale = 0.9)

pal <- rainbow(n_top)
pal <- RColorBrewer::brewer.pal(n_top, "Set3")
src <- unique(ed2$N1)

V <- mutate(V,
            srt = 0,
            textpos = ifelse(x == 1, 2, 4),
            col = ifelse(x == 1, pal[match(ID, src)], "grey90"))

rv <- makeRiver(V, as.data.frame(ed2))
out <- capture.output(plot(rv, plot_area = 0.8, textcex = 0.8, xscale = 0.9))

library(alluvial)
alluvial(ed2$N1, ed2$N2, freq = ed2$Value, col = pal[match(ed2$N1, src)],
         cex = 0.8)

ed2X <- filter(arr_by_dest_nat, Nationality %in% toprefs & Dest %in% topstates)
with(ed2X, alluvial(Nationality, Dest, freq = count,
                    col = pal[match(Nationality, src)], cex = 0.8))

td2X <- mutate(td2,
               Nationality = ifelse(Nationality %in% toprefs,
                                    Nationality, "Other Nat."),
               Dest = ifelse(Dest %in% topstates, Dest, "Other Dest."))

edX <- summarize(group_by(td2X, Dest, Nationality), count = sum(count))

pal <- c(rainbow(n_top), "grey")
src <- c(toprefs, "Other Nat.")

with(edX, alluvial(Nationality, Dest, freq = count,
                   col = pal[match(Nationality, src)], cex = 0.8))

alluvial:
http://stackoverflow.com/questions/9968433/sankey-diagrams-in-r

Join with gapminder data

web tex with election data
mention phantomjs/selenium stuff

https://youtu.be/vJG698U2Mvo
-
-->