--- title: "Data Technology Notes" output: html_document: toc: yes --- ```{r global_options, include = FALSE} knitr::opts_chunk$set(collapse = TRUE) ``` ```{r, include = FALSE} library(lattice) library(tidyverse) library(gridExtra) set.seed(12345) ``` ## Data Technologies Class and homework examples often start with a nice, clean, rectangular data set. Once you have a source for your data, _data technologies_ are the tools you need to get your data to that point. The process of applying these tools is sometimes called [_data wrangling_](https://r4ds.had.co.nz/wrangle-intro.html). The main steps are usually - importing the data; - cleaning the data; - merging information from several sources; - transforming to specialized data types - reshaping the data. We have seen many of these steps already. ## Data Import Your data may be available in a number of different forms, such as: - as a CSV file; - another form of text file; - a spread sheet; - a table on a web page; - from a web service; - in a SQL data base - in some other type of data base or file format. We have seen several examples of importing CSV files. - CSV files are often fairly easy to deal with. - They are typically meant for data import rather than human reading. - Spread sheet programs typically provide a mechanism for exporting data in CSV form. - Some issues that you need to look for: - identifying column names; - identifing row names; - use of a semicolon separator (common in Europe). [Unemployment data](https://stat.uiowa.edu/~luke/data/laus/laucntycur14-2018.txt) from the [Local Area Unemployment Statistics (LAUS) page](https://www.bls.gov/lau/) ofthe Bureau of Labor Statistics is an example of a structured text file not in CSV format. - These files usually have enough structure for automated import. - Often they have features designed for a human reader that need to be worked around. Spread sheets are in principle designed for machine computation. - Many spread sheets have features designed for a human reader that need to be worked around. - Exporting (the relevant part of) a spread sheet to CSV is sometimes a good option. - Using a tool that can import directly from the spread sheet is often better: it makes the analysis easier to reproduce and repeat with new data. ## Wrangling the Unemployment Data The unemployment data is available as a [text file](https://stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt). - The fields are delimited with a `|` character and surrounded by white space. - There is a header and a footer that need to be removed. - The counts use a comma to separate thousands. - The column names are awkward. - Some names contain an apostrophe, which can be confused with a quote character. The base function `read.table` and the `readr` function `read_table` have many options for dealing with these issues. We can use `read.table` with: - `col.names` to specify column names; - `skip = 6` to remove the header; - `strip_white` to remove white space around items - `quote = ""` to turn off interpreting any character as delimiting a quoted string - `fill = TRUE` to fill in columns in the footer. ```{r} lausURL <- "https://stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt" lausFile <- "laucntycur14-2017.txt" if (! file.exists(lausFile)) download.file(lausURL, lausFile) lausUS <- read.table(lausFile, col.names = c("LAUSAreaCode", "State", "County", "Title", "Period", "LaborForce", "Employed", "Unemployed", "UnempRate"), quote = "", sep = "|", skip = 6, strip.white = TRUE, fill = TRUE) head(lausUS) tail(lausUS) ``` Searching for a sequence of dashes allows the code to drop the footer to work even if the number of rows changes. ```{r} footstart <- grep("------", lausUS$LAUSAreaCode) lausUS <- lausUS[1:(footstart - 1), ] ``` `str` show the data types we now have: ```{r} str(lausUS) ``` Convert the counts to numbers by removing the commas and passing to `as.numeric`: ```{r} lausUS <- mutate(lausUS, LaborForce = as.numeric(gsub(",", "", LaborForce)), Employed = as.numeric(gsub(",", "", Employed)), Unemployed = as.numeric(gsub(",", "", Unemployed))) ``` The expression ```r gsub(",", "", x) ``` replaces all occurrences of the pattern `","` in the elements of `x` with the empty string `""`. Using `sub` instead of `gsub` would replace only the first occurrence in each element. We could also separate out the state character codes and county names: ```{r} lausUS <- mutate(lausUS, StateCharCode = sub("^.*, ", "", Title), CountyName = sub(", .*", "", Title)) ``` We could also convert the `Period` variable to a date format. ## Refugee Arrival Data The [Refugee Processing Center](http://www.wrapsnet.org/) provides information from the United States Refugee Admissions Program (USRAP), including [arrivals by state and nationality](http://www.wrapsnet.org/admissions-and-arrivals/). Three files are available locally: - data from [early January 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls); - data from [early April 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls). - data from [early March 2018](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls). These are Excel spread sheets. The function `read_excel` in the `readxl` package provides a way to import this kind of data into R. ```{r, include = FALSE} if (! file.exists("Arrivals-2017-01-06.xls")) download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls", "Arrivals-2017-01-06.xls") if (! file.exists("Arrivals-2017-04-05.xls")) download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls", "Arrivals-2017-04-05.xls") if (! file.exists("Arrivals-2018-03-05.xls")) download.file("http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls", "Arrivals-2018-03-05.xls") ``` Load the `readxl` package and define a short name for the file: ```{r} library(readxl) fname <- "Arrivals-2017-01-06.xls" ``` Read the `FY` line: ```{r} read_excel(fname, skip = 13, n_max = 1) year_line <- read_excel(fname, skip = 13, col_names = FALSE, n_max = 1) as.numeric(sub("FY ", "", year_line[[1]])) ``` A useful sanity check on the format might be: ```{r} stopifnot(length(year_line) == 1 && grepl("FY [[:digit:]]+", year_line[[1]])) year <- as.numeric(sub("FY ", "", year_line[[1]])) ``` Read in the data: ```{r} d <- read_excel(fname, skip = 16) head(d) tail(d) ``` The last line needs to be dropped. Another sanity chack first is a good idea. ```{r} stopifnot(all(is.na(tail(d, 1)[, -1]))) d <- d[seq_len(nrow(d) - 1), ] ``` ```{r} state_lines <- which(! is.na(d[[1]])) d[state_lines, ] ``` We can use the first column as the destination: ```{r} library(dplyr) d <- rename(d, Dest = 1) ``` We need to replace the `NA` values in `Dest` by the previous non-`NA` value. There are several options: - use a `for` loop; - calculate the gaps and use `rep`; - use the `fill` function from `tidyr`. A `for` loop solution: ```{r} v0 <- d[[1]] for (i in seq_along(v0)) { if (! is.na(v0[i])) s <- v0[i] else v0[i] <- s } ``` A solution using `rep`: ```{r} v1 <- rep(d[state_lines[-length(state_lines)], 1][[1]], diff(state_lines)) ``` A search on `R carry forward` brings up the `fill` function in `tidyr`: ```{r} v2 <- fill(d, Dest)[[1]] ``` All three approaches produce identical results: ```{r} identical(v0, v1) identical(v0, v2) ``` Using `fill` is easiest: ```{r} d <- fill(d, Dest) ``` Dropping the state lines gives the date we need: ```{r} head(d[-state_lines, ]) ``` To be able to read new files from this source it is useful to wrap this in a function: ```{r} library(readxl) library(dplyr) library(tidyr) readRefXLS <- function(fname) { read_excel(fname, skip = 13, n_max = 1) ## read and check the FY line year_line <- read_excel(fname, skip = 13, col_names = FALSE, n_max = 1) stopifnot(length(year_line) == 1 && grepl("FY [[:digit:]]+", year_line[[1]])) year <- as.numeric(sub("FY ", "", year_line[[1]])) d <- read_excel(fname, skip = 16) ## check and trim the last line stopifnot(all(is.na(tail(d, 1)[, -1]))) d <- d[seq_len(nrow(d) - 1), ] ## identify the state summary lines state_lines <- which(! is.na(d[[1]])) ## rename and fill first column d <- rename(d, Dest = 1) d <- fill(d, Dest) ## drop the state summaries and add the FY d <- d[-state_lines, ] d$FY <- year d } ``` This can read all the files: ```{r} d1 <- readRefXLS("Arrivals-2017-01-06.xls") d2 <- readRefXLS("Arrivals-2017-04-05.xls") d3 <- readRefXLS("Arrivals-2018-03-05.xls") ``` Some explorations: ```{r} td3 <- gather(d3, month, count, 3 : 14) arr_by_dest <- summarize(group_by(td3, Dest), count = sum(count)) arr_by_nat <- summarize(group_by(td3, Nationality), count = sum(count)) arr_by_dest_nat <- summarize(group_by(td3, Dest, Nationality), count = sum(count)) ``` ```{r} ggplot(arr_by_dest, aes(count, Dest)) + geom_point() ggplot(arr_by_dest, aes(count, reorder(Dest, count))) + geom_point() ggplot(slice_max(arr_by_dest, count, n = 5), aes(count, Dest)) + geom_point() ggplot(arr_by_nat, aes(count, Nationality)) + geom_point() ggplot(arr_by_nat, aes(count, reorder(Nationality, count))) + geom_point() ggplot(slice_max(arr_by_nat, count, n = 5), aes(count, Nationality)) + geom_point() ``` A useful visualization of flows is a [_Sankey diagram_](https://en.wikipedia.org/wiki/Sankey_diagram). These are also known as - [alluvial diagrams](https://en.wikipedia.org/wiki/Alluvial_diagram); - river digrams; and some would argue that allucial diagrams and river diagrams are different things. Several R packages create these; including - `alluvial` - `riverplot` - `googleVis`. This code creates a Sankey diagram using `alluvial`: ```{r} n_top <- 10 s_top <- 25 toprefs <- arrange(arr_by_nat, desc(count))$Nationality[1 : n_top] topstates <- arrange(arr_by_dest, desc(count))$Dest[1 : s_top] td3m <- mutate(td3, Nationality = ifelse(Nationality %in% toprefs, Nationality, "Other Nat."), Dest = ifelse(Dest %in% topstates, Dest, "Other Dest.")) std3m <- summarize(group_by(td3m, Dest, Nationality), count = sum(count)) pal <- c(rainbow(n_top), "grey") pal <- c(RColorBrewer::brewer.pal(n_top, "Paired"), "grey") src <- c(toprefs, "Other Nat.") library(alluvial) with(std3m, alluvial(Nationality, Dest, freq = count, col = pal[match(Nationality, src)], cex = 0.8, alpha = 0.7)) ``` Another option is a _chord diagram_: ```{r, fig.height = 10, message = FALSE} library(circlize) chordDiagram(select(std3m, Nationality, Dest, count)) ``` ## Gapminder Childhood Mortality Data The `gapminder` package provides a subset of the data from the [Gapminder](https://www.gapminder.org/) web site. Additional data sets are [available](https://www.gapminder.org/data/). - A data set on childhood mortality is available locally as a [csv file](https://stat.uiowa.edu/~luke/data/gapminder-under5mortality.csv) or an [Excel file](http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx). - The numbers represent number of deaths within the first five years per 1000 births. Loading the data: ```{r} if (! file.exists("gapminder-under5mortality.xlsx")) download.file("http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx", "gapminder-under5mortality.xlsx") gcm <- read_excel("gapminder-under5mortality.xlsx") names(gcm)[1] names(gcm)[1] <- "country" ``` A _tidy_ version is useful for working with `ggplot`. ```{r} tgcm <- gather(gcm, year, u5mort, -1) head(tgcm) tgcm <- mutate(tgcm, year = as.numeric(year)) head(tgcm) ``` A multiple time series version may also be useful. ```{r} gcmts <- ts(t(gcm[-1]), start = 1800) colnames(gcmts) <- gcm$country ``` Some explorations: ```{r} p <- ggplot(tgcm) + geom_line(aes(year, u5mort, group = country), alpha = 0.3) p plotly::ggplotly(p) xyplot(gcmts, superpose = TRUE, auto.key = FALSE) xyplot(gcmts, superpose = TRUE, auto.key = FALSE, col = "black") ``` Examining the missing values: ```{r} anyNA(gcmts) sum(is.na(gcmts)) length(gcmts) image(time(gcmts), seq_len(ncol(gcmts)), is.na(gcmts), ylab = "series") ``` Re-ordering by the number of missing values may be useful: ```{r} naord <- order(apply(gcmts, 2, function(x) sum(is.na(x)))) image(time(gcmts), seq_len(ncol(gcmts)), is.na(gcmts)[, naord], ylab = "series") ``` Some series look like the early values are estimates. Identifying and removing the ones that are constant between 1800 and 1850: ```{r} gcmtsNoNA <- gcmts[, ! apply(gcmts, 2, anyNA)] i1850 <- which(time(gcmts) == 1850) early_sd <- apply(gcmtsNoNA[1 : i1850, ], 2, sd) filled_in <- names(which(early_sd == 0)) ggplot(filter(tgcm, ! country %in% filled_in)) + geom_line(aes(year, u5mort, group = country), alpha = 0.3, na.rm = TRUE) ``` To show mortality rates by continent we can use the continent identification in the `gapminder` data set of the `gapminder` package. There are a number of countries in the new data set not in `gapminder` ```{r} library(gapminder) dim(gcm) length(unique(gapminder$country)) head(setdiff(gcm$country, gapminder$country)) ``` There are several options for adding the continent information from `gapminder` to one of our data frames: - look-up using match - a left join or inner join using `left_join` - a left join or inner join using `merge`. Using `match`, look up the index of the first row in `gapminder` for the country; then look up the corresponding continent: ```{r} cidx <- match(tgcm$country, gapminder$country) head(cidx) cont <- gapminder$continent[cidx] head(cont) d1 <- mutate(tgcm, continent = cont) ``` Our left table will map country to continent, with one row per country: ```{r} country_continent <- unique(select(gapminder, country, continent)) head(country_continent) ``` The left join using `left_join`: ```{r} d2 <- left_join(tgcm, country_continent) ``` Joins in general may not produce the ordering you want; you need to check or make sure by reordering: ```{r} d1 <- arrange(d1, year, country) d2 <- arrange(d2, year, country) identical(unclass(d1), unclass(d2)) ``` The left join using `merge`: ```{r} d3 <- merge(tgcm, country_continent, all.x = TRUE) d3 <- arrange(d3, year, country) identical(d1, d3) ``` Mapping continent to `color`: ```{r} ggplot(d3) + geom_line(aes(year, u5mort, group = country, color = continent), alpha = 0.7, na.rm = TRUE) ``` To compare child mortality to GDP per capita for the years covered by the `gapminder` data frame we need to merge the mortality data into `gapminder`. Some country names in `gapminder` do not appear in the child mortality data: ```{r} setdiff(gapminder$country, tgcm$country) grep("Korea", unique(tgcm$country), value = TRUE) grep("Yemen", unique(tgcm$country), value = TRUE) ``` Change the names in `tgcm`` to match the names in `gapminder`: ```{r} ink <- which(tgcm$country == "North Korea") isk <- which(tgcm$country == "South Korea") iym <- which(tgcm$country == "Yemen") tgcm$country[ink] <- "Korea, Dem. Rep." tgcm$country[isk] <- "Korea, Rep." tgcm$country[iym] <- "Yemen, Rep." ``` Now left join the mortality data with `gapminder`; the key is the combination of `country` and `year`: Some plots: ```{r} gm <- left_join(gapminder, tgcm) ``` ```{r} p <- ggplot(gm, aes(gdpPercap, u5mort, color = continent)) + geom_point() p p + scale_x_log10() p + scale_x_log10() + facet_wrap(~ cut(year, 4)) ```