Data Technologies

Class and homework examples often start with a nice, clean, rectangular data set.

Once you have a source for your data, data technologies are the tools you need to get your data to that point.

The process of applying these tools is sometimes called data wrangling.

The main steps are usually

We have seen many of these steps already.

Data Import

Your data may be available in a number of different forms, such as:

We have seen several examples of importing CSV files.

Unemployment data from the Local Area Unemployment Statistics (LAUS) page ofthe Bureau of Labor Statistics is an example of a structured text file not in CSV format.

Spread sheets are in principle designed for machine computation.

Wrangling the Unemployment Data

The unemployment data is available as a text file.

The base function read.table and the readr function read_table have many options for dealing with these issues.

We can use read.table with:

lausURL <- "http://www.stat.uiowa.edu/~luke/data/laus/laucntycur14-2017.txt"
lausFile <- "laucntycur14-2017.txt"
if (! file.exists(lausFile))
    download.file(lausURL, lausFile)
lausUS <- read.table(lausFile,
                     col.names = c("LAUSAreaCode", "State", "County",
                                   "Title", "Period",
                                   "LaborForce", "Employed",
                                   "Unemployed", "UnempRate"),
                     quote = '', sep = "|", skip = 6,
                     stringsAsFactors = FALSE, strip.white = TRUE,
                     fill = TRUE)
head(lausUS)
##      LAUSAreaCode State County              Title Period LaborForce
## 1 CN0100100000000     1      1 Autauga County, AL Nov-16     25,809
## 2 CN0100300000000     1      3 Baldwin County, AL Nov-16     89,507
## 3 CN0100500000000     1      5 Barbour County, AL Nov-16      8,213
## 4 CN0100700000000     1      7    Bibb County, AL Nov-16      8,645
## 5 CN0100900000000     1      9  Blount County, AL Nov-16     24,754
## 6 CN0101100000000     1     11 Bullock County, AL Nov-16      4,990
##   Employed Unemployed UnempRate
## 1   24,518      1,291       5.0
## 2   84,817      4,690       5.2
## 3    7,546        667       8.1
## 4    8,105        540       6.2
## 5   23,481      1,273       5.1
## 6    4,609        381       7.6
tail(lausUS)
##                                                                                                                                 LAUSAreaCode
## 45066                                                                                                                        CN7215300000000
## 45067 --------------------------------------------------------------------------------------------------------------------------------------
## 45068                                                                                                                     (p) = preliminary.
## 45069                                                                                            Dash indicates that data are not available.
## 45070                                                                                                                      SOURCE: BLS, LAUS
## 45071                                                                                                                       February 6, 2018
##       State County               Title    Period LaborForce Employed
## 45066    72    153 Yauco Municipio, PR Dec-17(p)     10,523    8,892
## 45067    NA     NA                                                  
## 45068    NA     NA                                                  
## 45069    NA     NA                                                  
## 45070    NA     NA                                                  
## 45071    NA     NA                                                  
##       Unemployed UnempRate
## 45066      1,631      15.5
## 45067                     
## 45068                     
## 45069                     
## 45070                     
## 45071

Searching for a sequence of dashes allows the code to drop the footer to work even if the number of rows changes.

footstart <- grep("------", lausUS$LAUSAreaCode)
lausUS <- lausUS[1:(footstart - 1),]

str show the data types we now have:

str(lausUS)
## 'data.frame':    45066 obs. of  9 variables:
##  $ LAUSAreaCode: chr  "CN0100100000000" "CN0100300000000" "CN0100500000000" "CN0100700000000" ...
##  $ State       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ County      : int  1 3 5 7 9 11 13 15 17 19 ...
##  $ Title       : chr  "Autauga County, AL" "Baldwin County, AL" "Barbour County, AL" "Bibb County, AL" ...
##  $ Period      : chr  "Nov-16" "Nov-16" "Nov-16" "Nov-16" ...
##  $ LaborForce  : chr  "25,809" "89,507" "8,213" "8,645" ...
##  $ Employed    : chr  "24,518" "84,817" "7,546" "8,105" ...
##  $ Unemployed  : chr  "1,291" "4,690" "667" "540" ...
##  $ UnempRate   : chr  "5.0" "5.2" "8.1" "6.2" ...

Convert the counts to numbers by removing the commas and passing to as.numeric:

lausUS <- mutate(lausUS,
                 LaborForce = as.numeric(gsub(",", "", LaborForce)),
                 Employed = as.numeric(gsub(",", "", Employed)),
                 Unemployed = as.numeric(gsub(",", "", Unemployed)))
## Warning in rlang::eval_tidy(~as.numeric(gsub(",", "", LaborForce)),
## <environment>): NAs introduced by coercion
## Warning in rlang::eval_tidy(~as.numeric(gsub(",", "", Employed)),
## <environment>): NAs introduced by coercion
## Warning in rlang::eval_tidy(~as.numeric(gsub(",", "", Unemployed)),
## <environment>): NAs introduced by coercion

The expression

gsub(",", "", x)

replaces all occurrences of the pattern "," in the elements of x with the empty string "". Using sub instead of gsub would replace only the first occurrence in each element.

We could also separate out the state character codes and county names:

lausUS <- mutate(lausUS,
                 StateCharCode = sub("^.*, ", "", Title),
                 CountyName = sub(", .*", "", Title),
                 stringsAsFactors = FALSE)

We could also convert the Period variable to a date format.

Refugee Arrival Data

The Refugee Processing Center provides information from the United States Refugee Admissions Program (USRAP), including arrivals by state and nationality.

Three files are available locally:

These are Excel spread sheets.

The function read_excel in the readxl package provides a way to import this kind of data into R.

Load the readxl package and define a short name for the file:

library(readxl)
fname <- "Arrivals-2017-01-06.xls"

Read the FY line:

read_excel(fname, skip = 13, n_max = 1)
## # A tibble: 0 x 1
## # … with 1 variable: `FY 2017` <lgl>
year_line <- read_excel(fname, skip = 13,
                        col_names = FALSE, n_max = 1)
## New names:
## * `` -> ...1
as.numeric(sub("FY ", "", year_line[[1]]))
## [1] 2017

A useful sanity check on the format might be:

stopifnot(length(year_line) == 1 && grepl("FY [[:digit:]]+", year_line[[1]]))
year <- as.numeric(sub("FY ", "", year_line[[1]]))

Read in the data:

d <- read_excel(fname, skip = 16)
## New names:
## * `` -> ...1
head(d)
## # A tibble: 6 x 18
##   ...1  Nationality   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May   Jun
##   <chr> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… <NA>            8     4     4     0     0     0     0     0     0
## 2 <NA>  Dem. Rep. …     4     0     0     0     0     0     0     0     0
## 3 <NA>  Iraq            0     0     4     0     0     0     0     0     0
## 4 <NA>  Somalia         4     4     0     0     0     0     0     0     0
## 5 Alas… <NA>           17     3     2     0     0     0     0     0     0
## 6 <NA>  Republic o…     3     0     0     0     0     0     0     0     0
## # … with 7 more variables: Jul <dbl>, Aug <dbl>, Sep <dbl>, Cases <dbl>,
## #   Inds <dbl>, State <dbl>, U.S. <dbl>
tail(d)
## # A tibble: 6 x 18
##   ...1  Nationality   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May   Jun
##   <chr> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA>  Sudan           0     0     3     0     0     0     0     0     0
## 2 <NA>  Syria          14     4     5     0     0     0     0     0     0
## 3 <NA>  Thailand        0     0     1     0     0     0     0     0     0
## 4 <NA>  Vietnam         7     0     0     0     0     0     0     0     0
## 5 Total <NA>         9945  8355  7371     0     0     0     0     0     0
## 6 Plea… <NA>           NA    NA    NA    NA    NA    NA    NA    NA    NA
## # … with 7 more variables: Jul <dbl>, Aug <dbl>, Sep <dbl>, Cases <dbl>,
## #   Inds <dbl>, State <dbl>, U.S. <dbl>

The last line needs to be dropped. Another sanity chack first is a good idea.

stopifnot(all(is.na(tail(d, 1)[, -1])))
d <- d[seq_len(nrow(d) - 1), ]
state_lines <- which(! is.na(d[[1]]))
d[state_lines, ]
## # A tibble: 50 x 18
##    ...1  Nationality   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May   Jun
##    <chr> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Alab… <NA>            8     4     4     0     0     0     0     0     0
##  2 Alas… <NA>           17     3     2     0     0     0     0     0     0
##  3 Ariz… <NA>          547   493   310     0     0     0     0     0     0
##  4 Arka… <NA>            0     0    14     0     0     0     0     0     0
##  5 Cali… <NA>          814   887   636     0     0     0     0     0     0
##  6 Colo… <NA>          192   217   180     0     0     0     0     0     0
##  7 Conn… <NA>           47    73    63     0     0     0     0     0     0
##  8 Dist… <NA>            2     0     0     0     0     0     0     0     0
##  9 Flor… <NA>          267   265   276     0     0     0     0     0     0
## 10 Geor… <NA>          362   304   331     0     0     0     0     0     0
## # … with 40 more rows, and 7 more variables: Jul <dbl>, Aug <dbl>,
## #   Sep <dbl>, Cases <dbl>, Inds <dbl>, State <dbl>, U.S. <dbl>

We can use the first column as the destination:

library(dplyr)
d <- rename(d, Dest = 1)

We need to replace the NA values in Dest by the previous non-NA value. There are several options:

A for loop solution:

v0 <- d[[1]]
for (i in seq_along(v0)) {
    if (! is.na(v0[i]))
        s <- v0[i]
    else
        v0[i] <- s
}

A solution using rep:

v1 <- rep(d[state_lines[-length(state_lines)], 1][[1]], diff(state_lines))

A search on R carry forward brings up the fill function in tidyr:

v2 <- fill(d, Dest)[[1]]

All three approaches produce identical results:

identical(v0, v1)
## [1] FALSE
identical(v0, v2)
## [1] TRUE

Using fill is easiest:

d <- fill(d, Dest)

Dropping the state lines gives the date we need:

head(d[-state_lines,])
## # A tibble: 6 x 18
##   Dest  Nationality   Oct   Nov   Dec   Jan   Feb   Mar   Apr   May   Jun
##   <chr> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… Dem. Rep. …     4     0     0     0     0     0     0     0     0
## 2 Alab… Iraq            0     0     4     0     0     0     0     0     0
## 3 Alab… Somalia         4     4     0     0     0     0     0     0     0
## 4 Alas… Republic o…     3     0     0     0     0     0     0     0     0
## 5 Alas… Somalia         4     0     2     0     0     0     0     0     0
## 6 Alas… Ukraine        10     3     0     0     0     0     0     0     0
## # … with 7 more variables: Jul <dbl>, Aug <dbl>, Sep <dbl>, Cases <dbl>,
## #   Inds <dbl>, State <dbl>, U.S. <dbl>

To be able to read new files from this source it is useful to wrap this in a function:

library(readxl)
library(dplyr)
library(tidyr)

readRefXLS <- function(fname) {
    read_excel(fname, skip = 13, n_max = 1)

    ## read and check the FY line
    year_line <- read_excel(fname, skip = 13,
                            col_names = FALSE, n_max = 1)
    stopifnot(length(year_line) == 1 &&
              grepl("FY [[:digit:]]+", year_line[[1]]))
    year <- as.numeric(sub("FY ", "", year_line[[1]]))

    d <- read_excel(fname, skip = 16)

    ## check and trim the last line
    stopifnot(all(is.na(tail(d, 1)[, -1])))
    d <- d[seq_len(nrow(d) - 1), ]

    ## identify the state summary lines
    state_lines <- which(! is.na(d[[1]]))

    ## rename and fill first column
    d <- rename(d, Dest = 1)
    d <- fill(d, Dest)

    ## drop the state summaries and add the FY
    d <- d[-state_lines,]
    d$FY <- year

    d
}

This can read all the files:

d1 <- readRefXLS("Arrivals-2017-01-06.xls")
## New names:
## * `` -> ...1
## New names:
## * `` -> ...1
d2 <- readRefXLS("Arrivals-2017-04-05.xls")
## New names:
## * `` -> ...1
## New names:
## * `` -> ...1
d3 <- readRefXLS("Arrivals-2018-03-05.xls")
## New names:
## * `` -> ...1
## New names:
## * `` -> ...1

Some explorations:

td3 <- gather(d3, month, count, 3 : 14)

arr_by_dest <- summarize(group_by(td3, Dest), count = sum(count))
arr_by_nat <- summarize(group_by(td3, Nationality), count = sum(count))
arr_by_dest_nat <- summarize(group_by(td3, Dest, Nationality), count = sum(count))
ggplot(arr_by_dest, aes(count, Dest)) + geom_point()

ggplot(arr_by_dest, aes(count, reorder(Dest, count))) + geom_point()

ggplot(top_n(arr_by_dest, 5, count), aes(count, Dest)) + geom_point()


ggplot(arr_by_nat, aes(count, Nationality)) + geom_point()

ggplot(arr_by_nat, aes(count, reorder(Nationality, count))) + geom_point()

ggplot(top_n(arr_by_nat, 5, count), aes(count, Nationality)) + geom_point()

A useful visualization of flows is a Sankey diagram. These are also known as

and some would argue that allucial diagrams and river diagrams are different things.

Several R packages create these; including

This code creates a Sankey diagram using alluvial:

n_top <- 10
s_top <- 25

toprefs <- arrange(arr_by_nat, desc(count))$Nationality[1 : n_top]
topstates <- arrange(arr_by_dest, desc(count))$Dest[1 : s_top]

td3m <- mutate(td3,
               Nationality = ifelse(Nationality %in% toprefs,
                                    Nationality, "Other Nat."),
               Dest = ifelse(Dest %in% topstates, Dest, "Other Dest."))

std3m <- summarize(group_by(td3m, Dest, Nationality), count = sum(count))

pal <- c(rainbow(n_top), "grey")
pal <- c(RColorBrewer::brewer.pal(n_top, "Paired"), "grey")
src <- c(toprefs, "Other Nat.")

library(alluvial)
with(std3m, alluvial(Nationality, Dest, freq = count,
                     col = pal[match(Nationality, src)],
             cex = 0.8, alpha = 0.7))

Another option is a chord diagram:

library(circlize)
chordDiagram(select(std3m, Nationality, Dest, count))

Gapminder Childhood Mortality Data

The gapminder package provides a subset of the data from the Gapminder web site. Additional data sets are available.

Loading the data:

if (! file.exists("gapminder-under5mortality.xlsx"))
   download.file("http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx", "gapminder-under5mortality.xlsx")
gcm <- read_excel("gapminder-under5mortality.xlsx")
names(gcm)[1]
## [1] "Under five mortality"
names(gcm)[1] <- "country"

A tidy version is useful for working with ggplot.

tgcm <- gather(gcm, year, u5mort, -1)
head(tgcm)
## # A tibble: 6 x 3
##   country               year   u5mort
##   <chr>                 <chr>   <dbl>
## 1 Abkhazia              1800.0    NA 
## 2 Afghanistan           1800.0   469.
## 3 Akrotiri and Dhekelia 1800.0    NA 
## 4 Albania               1800.0   375.
## 5 Algeria               1800.0   460.
## 6 American Samoa        1800.0    NA
tgcm <- mutate(tgcm, year = as.numeric(year))
head(tgcm)
## # A tibble: 6 x 3
##   country                year u5mort
##   <chr>                 <dbl>  <dbl>
## 1 Abkhazia               1800    NA 
## 2 Afghanistan            1800   469.
## 3 Akrotiri and Dhekelia  1800    NA 
## 4 Albania                1800   375.
## 5 Algeria                1800   460.
## 6 American Samoa         1800    NA

A multiple time series version may also be useful.

gcmts <- ts(t(gcm[-1]), start = 1800)
colnames(gcmts) <- gcm$country

Some explorations:

p <- ggplot(tgcm) + geom_line(aes(year, u5mort, group = country), alpha = 0.3)
p
## Warning: Removed 18644 rows containing missing values (geom_path).

plotly::ggplotly(p)