---
title: "A Grammar of Data Manipulation"
output:
  html_document:
    toc: yes
    code_download: true
---

```{r setup, include = FALSE, message = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
set.seed(12345)
library(dplyr)
library(ggplot2)
library(lattice)
```

## Background

The `dplyr` package provides a language, or grammar, for data manipulation. The language contains a number of _verbs_ that operate on tables in data frames.

The most commonly used verbs operate on a single data frame:

* [`select()`](https://dplyr.tidyverse.org/reference/select.html): pick variables by their names
* [`filter()`](https://dplyr.tidyverse.org/reference/filter.html): choose rows that satisfy some criteria
* [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html): create transformed or derived variables
* [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html): reorder the rows
* [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html): collapse rows down to summaries

There are also a number of `join` verbs that merge several data frames into one.

The `tidyr` package provides additional verbs, such as `pivot_longer` and `pivot_wider` for reshaping data frames.

The single table verbs can also be used with [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) and [`ungroup()`](https://dplyr.tidyverse.org/reference/group_by.html) to work on one group at a time instead of applying to the entire data frame.

The design of `dplyr` is strongly motivated by SQL.

* `dplyr` can also be used to operate on [tables stored in data bases](https://dbplyr.tidyverse.org/index.html).

Some data sets for illustration:

* EPA vehicle data used in HW4

```{r, message = FALSE, class.source = "fold-hide"}
library(readr)
if (! file.exists("vehicles.csv.zip"))
    download.file("http://www.stat.uiowa.edu/~luke/data/vehicles.csv.zip",
                  "vehicles.csv.zip")
newmpg <- read_csv("vehicles.csv.zip", guess_max = 100000)
```

* The `nycflights13` package provides data on all flights originating from one of the three main New York City airports in 2013 and heading to airports within the US.

```{r, class.source = "fold-hide"}
library(nycflights13)
```

* The `storms` data frame, included in `dplyr`, with data on hurricanes between 1975 and 2015.

The basic transformation verbs are described in the [Data Transformation chapter](https://r4ds.had.co.nz/transform.html) of [R for Data Science](https://r4ds.had.co.nz/). Utilities in `tidyr` are described in the [Tidy Data chapter](https://r4ds.had.co.nz/tidy-data.html).

## Selecting Variables

Data sets often contain hundreds or even thousands of variables. A useful first step is to select a group of interesting variables.

A reasonable selection of the EPA vehicle variables:

```{r}
newmpg1 <- select(newmpg,
                  make, model, year,
                  cty = city08, hwy = highway08,
                  trans = trany, cyl = cylinders,
                  fuel = fuelType1, displ)
head(newmpg1, 2)
```
Variables can be given new names in a `select` call; you can also rename them later with `rename`.
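For example (a sketch; `city_mpg` and `highway_mpg` are made-up names), `rename` uses the same `new = old` form as `select`:

```{r, eval = FALSE}
## rename cty and hwy after the fact; all other columns are kept unchanged
rename(newmpg1, city_mpg = cty, highway_mpg = hwy)
```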
Some variations (the documentation for [`select()`](https://dplyr.tidyverse.org/reference/select.html) gives full details): ```{r} select(newmpg1, year : trans) %>% head(2) ``` ```{r} select(newmpg1, -year, -trans) %>% head(2) ``` ```{r} select(newmpg1, - (year : trans)) %>% head(2) ``` Numerical column specifications can also be used: ```{r} select(newmpg1, 1, 3) %>% head(2) ``` ```{r} select(newmpg1, 1 : 3) %>% head(2) ``` ```{r} select(newmpg1, - (1 : 3)) %>% head(2) ``` Some _helper functions_ can also be used in the column specifications. The most useful ones are `starts_with`, `ends_with`, and `contains`. ```{r} select(newmpg, starts_with("fuel")) %>% head(2) ``` ```{r} select(newmpg, contains("city")) %>% head(2) ```
By default these helpers ignore case.
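For example, a request with the "wrong" case still matches (a small sketch of the idea):

```{r, eval = FALSE}
## matches city08, despite the difference in case
select(newmpg, contains("CITY")) %>% head(2)
```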
These can also be preceded by a minus modifier to omit columns: ```{r} select(newmpg1, -contains("m")) %>% head(2) ``` `contains` requires a literal string; `matches` is analogous but its argument is interpreted as a [_regular expression_](regex.html) pattern to be matched. ```{r} select(newmpg, matches("^[Ff]uel")) %>% head(2) ``` Sometimes it is useful to move a few columns to the front; the `everything` helper can be used for that. The original NYC `flights` table: ```{r} head(flights, 2) ``` Moving `air_time` to the front: ```{r} select(flights, air_time, everything()) %>% head(2) ``` The `where` helper function allows columns to be selected based on a predicate applied to the full column: ```{r} select(newmpg1, where(anyNA)) %>% head(2) ``` ```{r} select(flights, where(anyNA)) %>% head(2) ```
If there are missing values in your data it is usually a good idea to review them to make sure you understand what they mean.
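One way to get a quick overview, sketched here with `across`, is to count the missing values in each column:

```{r, eval = FALSE}
## number of NA values in each column of the reduced data
summarize(newmpg1, across(everything(), ~ sum(is.na(.x))))
```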
Columns can also be selected using list operations: ```{r} vars <- c("make", "model", "year", "city08", "trany", "cylinders", "fuelType1", "displ") newmpg[vars] %>% head(2) ``` ## Filtering Rows The `filter` function picks out the subset of rows that satisfy one or more conditions. All cars with city MPG at or above 130 and model year 2018: ```{r} filter(newmpg1, cty >= 130, year == 2018) ``` All flights from New York to Des Moines on January 1, 2013: ```{r} filter(flights, day == 1, month == 1, dest == "DSM") %>% select(year : dep_time, origin) ``` ### Comparisons The basic comparison operators are * `==`, `!=` * `<`, `<=` * `>`, `>=`
Be sure to use `==`, not a single `=` character.
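A single `=` inside `filter` is interpreted as a named argument; recent versions of `dplyr` catch this mistake and suggest the fix (a sketch; the exact message depends on the `dplyr` version):

```{r, eval = FALSE}
filter(flights, month = 1)
## Error in `filter()`:
## ! We detected a named input.
## This usually means that you've used `=` instead of `==`.
## Did you mean `month == 1`?
```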
Be careful about floating point equality tests: ```{r} x <- 1 - 0.9 x == 0.1 near(x, 0.1) ```
Equality comparisons can be used on character vectors and factors. Order comparisons can be used on character vectors, but may produce surprising results, or results that vary with different locale settings.
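A small illustration, shown without running since the result is locale-dependent: in the C locale all upper case letters sort before all lower case letters, while many other locales use dictionary-style ordering:

```{r, eval = FALSE}
"apple" < "Banana"
## FALSE in the C locale; TRUE in many dictionary-style locales
```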
`%in%` can be used to match against one of several options: ```{r} filter(flights, day %in% 1 : 2, month == 1, dest == "DSM") %>% select(year : dep_time, origin) ``` ### Logical Operations The basic logical operations on vectors are * `&` -- and * `|` -- or * `!` -- not * `xor` -- exclusive or Truth tables: ```{r} x <- c(TRUE, TRUE, FALSE, FALSE) y <- c(TRUE, FALSE, TRUE, FALSE) ! x x & y x | y xor(x, y) ```
Make sure not to confuse the vectorized logical operators `&` and `|` with the scalar flow control operators `&&` and `||`.
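The scalar forms accept only single logical values; since R 4.3.0, supplying longer vectors is an error rather than a silent use of the first elements (a sketch, not run):

```{r, eval = FALSE}
TRUE && FALSE  ## fine: both operands are single values
x && y         ## error in recent versions of R: x and y have length 4
```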
The previous `flights` example can also be written as

```{r}
filter(flights, day == 1 | day == 2, month == 1, dest == "DSM") %>%
    select(year : dep_time, origin)
```

It can also be written as

```{r}
filter(flights, (day == 1 | day == 2) & month == 1 & dest == "DSM") %>%
    select(year : dep_time, origin)
```

### Missing Values

`filter` only keeps rows for which the predicate evaluates to `TRUE`; rows where it evaluates to `FALSE` or `NA` are dropped.

Arithmetic and comparison operators will always produce `NA` values if one operand is `NA`.

```{r}
1 + NA
NA < 1
NA == NA
```

In a few cases a non-`NA` result can be determined:

```{r}
TRUE | NA
NA & FALSE
NA ^ 0
```

`is.na` can be used to also select rows with `NA` values.

Non-electric vehicles with zero or `NA` for `displ`:

```{r}
filter(newmpg1, displ == 0 | is.na(displ), fuel != "Electricity")
```

Flights with `NA` for both `dep_time` and `arr_time`:

```{r}
filter(flights, is.na(dep_time) & is.na(arr_time)) %>%
    head(2) %>%
    select(year : arr_time)
```

An alternative approach that does not use `filter` would be

```{r}
idx <- with(flights, is.na(dep_time) & is.na(arr_time))
flights[idx, ] %>%
    head(2) %>%
    select(year : arr_time)
```
Used at the end of a pipeline, `count(trans, cyl, displ, fuel)` is essentially an abbreviation of

```{r, eval = FALSE}
group_by(trans, cyl, displ, fuel) %>%
    summarize(n = n()) %>%
    ungroup()
```
Another approach that avoids specifying the variables with missing values: ```{r} incomplete_cases(newmpg1) %>% count(across(where(anyNA)), fuel) ``` Cylinders for electric vehicles: ```{r} filter(newmpg1, fuel == "Electricity") %>% count(is.na(cyl)) ``` Displacement for electric vehicles: ```{r} filter(newmpg1, fuel == "Electricity") %>% count(is.na(displ)) ``` Electric vehicles with non-missing displacement: ```{r} filter(newmpg1, fuel == "Electricity", ! is.na(displ)) ``` Some things to think about: * Should electric vehicles be dropped from analyses or visualizations involving cylinders or displacement? * Should `cyl` or `displ` be recoded, perhaps as zero? * Why are miles per gallon _not_ missing, or infinite, for electric vehicles? The [documentation for the data](https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle) has some pointers. ## Adding New Variables `mutate()` can be used to define new variables or modify existing ones. New variables are added at the end. Later variables can be defined in terms of earlier ones.
`mutate()` does not modify the original data frame; it creates a new one with the specified modifications.
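A quick check of this (a sketch; `long_haul` and `fl2` are made-up names):

```{r, eval = FALSE}
mutate(flights, long_haul = air_time > 360)        ## returns a modified copy
"long_haul" %in% names(flights)                    ## FALSE: flights is unchanged
fl2 <- mutate(flights, long_haul = air_time > 360) ## assign to keep the result
```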
```{r} fl <- select(flights, year, month, day, air_time, ends_with("delay")) mutate(fl, gain = dep_delay - arr_delay, air_hours = air_time / 60, gain_per_hour = gain / air_hours) %>% head(2) ``` `transmute()` keeps only the new variables. ```{r} transmute(flights, gain = dep_delay - arr_delay, air_hours = air_time / 60, gain_per_hour = gain / air_hours) %>% head(2) ``` Useful operators include: * Basic arithmetic operators, logarithms, etc. * Modular arithmetic operators `%/%` and `%%` * Logical operators and comparison operators. * Offset operators `lead()` and `lag()`. * cumulative aggregates such as `cumsum()`, `cummean()`. * Ranking operators, such as `min_rank()`, `percent_rank()`. * Predicates like `is.na`. A reasonable guess is that flights with `NA` values for both `dep_time` and `arr_time` were cancelled: ```{r} fl <- mutate(flights, cancelled = is.na(dep_time) & is.na(arr_time)) ``` For the `geyser` data set from the `MASS` package we used `lag` to obtain the duration of the previous eruption, ```{r} data(geyser, package = "MASS") geyser2 <- mutate(geyser, prev_duration = lag(duration)) head(geyser2, 4) ``` ```{r geyser-lag, eval = FALSE, warning = FALSE} thm <- theme_minimal() + theme(text = element_text(size = 16)) ggplot(geyser2) + geom_point(aes(x = prev_duration, y = waiting)) + thm ``` ```{r geyser-lag, echo = FALSE} ``` The top five vehicles in `cty` gas mileage (there may be more than five if there is a tie for fifth place): ```{r} mutate(newmpg1, cty_rank = min_rank(desc(cty))) %>% filter(cty_rank <= 5) ``` An alternative to explicitly computing ranks and filtering is the `slice_max()` utility function: ```{r} slice_max(newmpg1, cty, n = 5) ``` The `percent_rank` function produces a 'percentile rank' (but between 0 and 1). The top 5% of countries in life expectancy in 2007 from the `gapminder` data set: ```{r} library(gapminder) gm <- filter(gapminder, year == 2007) gm_top <- filter(gm, percent_rank(desc(lifeExp)) <= 0.05) gm_top ``` And the bottom 5%: ```{r} filter(gm, percent_rank(lifeExp) <= 0.05) ``` Within these results the original row order is retained. The `arrange` function can be used to change this. ## Arranging Rows `arrange()` reorders the rows according to one or more variables. By default the order is smallest to largest; using `desc()` reverses this. Additional variables will be used to break ties. Arranging the top 5% of countries in life expectancy in 2007 by life expectancy: ```{r} arrange(gm_top, desc(lifeExp)) ``` To order by continent name and order by life expectancy within continent: ```{r} arrange(gm_top, continent, desc(lifeExp)) ``` Missing values are sorted to the end: ```{r} df <- data.frame(x = c(5, 2, NA, 3)) arrange(df, x) arrange(df, desc(x)) ``` Using `is.na` you can arrange to put them at the beginning: ```{r} arrange(df, desc(is.na(x)), x) ``` ## Summarizing and Grouping `summarize()` collapses a table down to one row of summaries. The average and maximal departure and arrival delays for flights, along with the total number of flights: ```{r} summarize(flights, ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), max_arr_delay = max(arr_delay, na.rm = TRUE), n = n()) ``` ### Grouped Summaries `summarize()` is most commonly used with _grouped data_ to provide group level summaries. 
Grouping delay summaries by destination: ```{r, warning = FALSE} fl_dest <- group_by(flights, dest) %>% summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), max_arr_delay = max(arr_delay, na.rm = TRUE), n = n()) %>% ungroup() head(fl_dest, 5) ```
Calling `ungroup` is not always necessary but sometimes it is, so it is a good habit to get into.
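The reason: with more than one grouping variable, `summarize` by default drops only the last one, so its result is still grouped. A sketch using `group_vars` to show the remaining grouping:

```{r, eval = FALSE}
gs <- group_by(flights, origin, dest) %>%
    summarize(n = n())
group_vars(gs)           ## "origin": later verbs would act within origin groups
group_vars(ungroup(gs))  ## character(0)
```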
The group level summaries can be further transformed, for example to identify the top 10 destinations for average delays: ```{r} mutate(fl_dest, rank = min_rank(desc(ave_dep_delay))) %>% filter(rank <= 10) ``` ### Missing Values Most summaries will be `NA` if any of the data values being summarized are `NA`. Typically, summary functions can be called with `na.rm = TRUE` to produce the summary for all non-missing values. ```{r} summarize(flights, ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), max_arr_delay = max(arr_delay, na.rm = TRUE), n = n()) ``` You can also use `filter` to remove rows with `NA` values in the variables to be summarized: ```{r} filter(flights, ! is.na(dep_delay), ! is.na(arr_delay)) %>% summarize(ave_dep_delay = mean(dep_delay), ave_arr_delay = mean(arr_delay), max_dep_delay = max(dep_delay), max_arr_delay = max(arr_delay), n = n()) ```
Note that the two approaches produce different counts: in the first, `n()` counts all flights, including ones with missing delay values; in the second it counts only the flights remaining after the `filter` step.
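A quick comparison of the row counts (a sketch):

```{r, eval = FALSE}
nrow(flights)  ## all flights, including ones with missing delay values
nrow(filter(flights, ! is.na(dep_delay), ! is.na(arr_delay)))  ## a smaller count
```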
### Counts

The helper function `n()` provides a count of the number of rows in the table being summarized.

Including counts with grouped summaries is usually a good idea. Summaries based on small counts are likely to be much less reliable than ones based on larger counts.

This is reflected in higher variability among averages for destinations with fewer flights:

```{r ave_arr_delay, eval = FALSE}
ggplot(fl_dest) +
    geom_point(aes(x = n, y = ave_arr_delay), na.rm = TRUE) +
    thm
```
```{r ave_arr_delay, echo = FALSE}
```

A `plotly` version allows the individual destinations to be identified easily:

```{r ave_arr_delay_plotly, eval = FALSE}
library(plotly)
p <- ggplot(fl_dest) +
    geom_point(aes(x = n, y = ave_arr_delay, text = dest), na.rm = FALSE) +
    thm
ggplotly(p, tooltip = "text")
```
```{r ave_arr_delay_plotly, echo = FALSE, warning = FALSE}
```

It would be more user-friendly to show the airport name than the three letter FAA code.

### Useful Summaries

Some useful summary functions:

* Location: `mean`, `median`
* Spread: `sd`, `IQR`, `mad`
* Rank: `min`, `quantile`, `max`
* Position: `first`, `nth`, `last`
* Counts: `n`.

## Airline Arrival Delays

### Different Delay Measures

Sometimes it is useful to look at subsets or truncated versions of a variable.

Airlines like to arrange schedules so that flights often arrive early.

```{r, fig.height = 4, class.source = "fold-hide"}
ggplot(flights, aes(x = arr_delay, y = after_stat(density))) +
    geom_histogram(binwidth = 1, na.rm = TRUE, fill = "deepskyblue3") +
    xlim(c(-100, 250)) +
    thm
```

This reduces the average arrival delay but may not help travelers much.

Two options:

* Consider only the positive delays for the average.
* Treat negative delays as zero (zero-truncation).

The mean of the positive delays can be computed using _subsetting_:

```{r, eval = FALSE}
mean(arr_delay[arr_delay > 0])
```

The mean of the zero-truncated delays can be computed using the `pmax` function:

```{r, eval = FALSE}
mean(pmax(arr_delay, 0))
```

The summaries are

```{r}
fl_arr <- filter(flights, ! is.na(arr_delay))
fl_avs <- group_by(fl_arr, dest) %>%
    summarize(ave = mean(arr_delay),
              ave_pos = mean(arr_delay[arr_delay > 0]),
              ave_pos_zero = mean(pmax(arr_delay, 0)),
              n = n()) %>%
    ungroup()
head(fl_avs)
```

One approach to visualizing the results, with the three summaries coded by color:

* View the `fl_avs` table as a wide format with `ave`, `ave_pos`, and `ave_pos_zero` as three different measurement types.
* Use `pivot_longer()` to convert to a long format with a variable `which` to hold the summary type and a variable `delay` to hold the summary value:

```{r}
library(tidyr)
fl <- pivot_longer(fl_avs, 2 : 4, names_to = "which", values_to = "delay")
arrange(fl, dest)
```

A plot is then easy to create:

```{r aves_pivot, eval = FALSE}
ggplot(fl, aes(x = n, y = delay, color = which)) +
    geom_point() +
    thm
```
```{r aves_pivot, echo = FALSE}
```

An alternative:

* Use one layer per summary type.
* Specify the summary type for each layer as a `color` aesthetic value.
```{r aves_layer, eval = FALSE}
ggplot(fl_avs, aes(x = n)) +
    geom_point(aes(y = ave, color = "ave")) +
    geom_point(aes(y = ave_pos, color = "ave_pos")) +
    geom_point(aes(y = ave_pos_zero, color = "ave_pos_zero")) +
    thm
```
```{r aves_layer, echo = FALSE}
```

The one case with a missing value:

```{r}
incomplete_cases(fl_avs)
```

### Delay Proportions

Proportions satisfying a condition can be computed as a mean of the logical expression for the condition.

The proportion of all flights that were delayed:

```{r}
mean(fl_arr$arr_delay > 0)
```

Grouped by destination:

```{r}
fl_props <- group_by(fl_arr, dest) %>%
    summarize(
        pd0 = mean(arr_delay > 0),    ## proportion delayed
        pd10 = mean(arr_delay > 10),  ## more than 10 minutes
        pd20 = mean(arr_delay > 20),  ## more than 20 minutes
        n = n()) %>%
    ungroup()
fl_props
```

These can be visualized similarly:

```{r prop_delayed_pivot, eval = FALSE}
pivot_longer(fl_props, 2 : 4, names_to = "which", values_to = "prop") %>%
    ggplot(aes(x = n, y = prop, color = which)) +
    geom_point() +
    thm
```
```{r prop_delayed_pivot, echo = FALSE}
```

## Grouped Mutate and Filter

`mutate()`, `filter()`, `arrange()`, `slice_max()`, and `slice_min()` can be used on a group level.

### Standardizing Within Groups

`mutate` within groups can be used for standardizing within a group. For example, to compute the proportion of a continent's population living in each country for each year:

```{r}
gm <- group_by(gapminder, continent, year) %>%
    mutate(pop = as.numeric(pop), ppop = pop / sum(pop)) %>%
    ungroup()
```

For Oceania (in this data set) and years 1952 and 2007:

```{r}
gm_oc <- filter(gm, continent == "Oceania", year %in% c(1952, 2007)) %>%
    arrange(year)
gm_oc
```

A quick check that within year and continent the proportions add up to one:

```{r}
group_by(gm_oc, continent, year) %>%
    summarize(sum_ppop = sum(ppop))
```

Focus on the 2 countries in each continent with the highest proportions of the continent's population:

```{r, fig.width = 9, class.source = "fold-hide"}
gmt2 <- group_by(gm, continent, year) %>%
    slice_max(ppop, n = 2) %>%
    ungroup() %>%
    filter(continent != "Oceania") ## drop as there are only two
ggplot(gmt2, aes(x = year, y = ppop, group = country, color = country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ continent, scales = "free_y") +
    thm
```

Another approach:

* Create separate plots for each continent.
* Adjust the themes a little.
* Use `facet_wrap` to get the facet label.
* Arrange the plots, e.g. using `patchwork` operators.

A useful feature is that the `%+%` operator can be used to change the base data frame for a plot:

```{r, fig.height = 6.5, fig.width = 8, class.source = "fold-hide"}
pAf <- ggplot(filter(gmt2, continent == "Africa"),
              aes(x = year, y = ppop, col = country)) +
    geom_line() +
    geom_point() +
    facet_wrap(~ continent) +
    thm +
    theme(legend.position = "top", legend.title = element_blank())
thm_nx <- theme(axis.title.x = element_blank())
thm_ny <- theme(axis.title.y = element_blank())
pAm <- pAf %+% filter(gmt2, continent == "Americas")
pAs <- pAf %+% filter(gmt2, continent == "Asia")
pEu <- pAf %+% filter(gmt2, continent == "Europe") +
    guides(color = guide_legend(nrow = 2))
library(patchwork)
((pAf + thm_nx) | (pAm + thm_nx + thm_ny)) /
    (pAs | (pEu + thm_ny))
```

### Changes Over Time

Changes over time within a country can also be examined with a grouped `mutate` and the `lag` operator.
For variables collected over time the `lag` function can be used to compute the previous value of a variable:

```{r}
gm_us <- filter(gapminder, country == "United States") %>%
    transmute(year, lifeExp, prev_lifeExp = lag(lifeExp))
gm_us
```

This can then be used to compute the change from one period to the next:

```{r}
mutate(gm_us, le_change = lifeExp - prev_lifeExp)
```

Use this idea to compute changes in `lifeExp` and `gdpPercap` for all countries in a grouped `mutate`:

```{r}
gm <- group_by(gapminder, country) %>%
    mutate(
        le_change = lifeExp - lag(lifeExp),
        gdp_change = gdpPercap - lag(gdpPercap),
        gdp_pct_change = 100 * gdp_change / lag(gdpPercap)) %>%
    ungroup()
```

`slice_max()` can then be applied to data grouped by country to extract the rows with the highest `gdp_pct_change` for each country:

```{r}
group_by(gm, country) %>%
    slice_max(gdp_pct_change, n = 1) %>%
    ungroup() %>%
    select(1 : 3, contains("gdp"))
```

Grouping by continent allows the worst drop in life expectancy for each continent to be found:

```{r}
group_by(gm, continent) %>%
    slice_min(le_change, n = 1) %>%
    ungroup() %>%
    select(-contains("gdp"))
```

Without grouping, the row with the best percent GDP change overall can be found:

```{r}
slice_max(gm, gdp_pct_change, n = 1) %>%
    select(country, continent, year, gdp_pct_change)
```

The computation of the best GDP growth years by country can be combined into a single pipe:

```{r, results = "hide"}
group_by(gapminder, country) %>%
    mutate(gdp_change = gdpPercap - lag(gdpPercap),
           gdp_pct_change = 100 * gdp_change / lag(gdpPercap)) %>%
    ungroup() %>%
    group_by(country) %>%
    slice_max(gdp_pct_change, n = 1) %>%
    ungroup() %>%
    select(1 : 3, contains("gdp"))
```

Since the groupings of the two steps are the same and grouped `mutate` preserves the group structure, the middle `ungroup`/`group_by` can be dropped:

```{r, results = "hide"}
group_by(gapminder, country) %>%
    mutate(gdp_change = gdpPercap - lag(gdpPercap),
           gdp_pct_change = 100 * gdp_change / lag(gdpPercap)) %>%
    slice_max(gdp_pct_change, n = 1) %>%
    ungroup() %>%
    select(1 : 3, contains("gdp"))
```
You need to be very careful making an optimization like this!
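The shortcut is only safe because the second grouping matches the grouping that the grouped `mutate` leaves in place. If the second step grouped differently, say by continent to find the single best growth year on each continent, the regrouping could not be dropped (a sketch):

```{r, eval = FALSE}
## one row per continent; the ungroup/group_by pair is essential here
group_by(gapminder, country) %>%
    mutate(gdp_pct_change =
               100 * (gdpPercap - lag(gdpPercap)) / lag(gdpPercap)) %>%
    ungroup() %>%
    group_by(continent) %>%
    slice_max(gdp_pct_change, n = 1) %>%
    ungroup()
```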
## Joins

This interactive plot uses the `tooltip` to show the airport code for the destination:

```{r, class.source = "fold-hide"}
library(plotly)
p <- ggplot(fl_dest) +
    geom_point(aes(x = n, y = ave_arr_delay, text = dest)) +
    thm
ggplotly(p, tooltip = "text")
```

A nicer approach would be to show the airport name. The `airports` table provides this information:

```{r}
head(airports, 2)
```

We need to get this information into the `fl_dest` data frame.

As another example, we might want to explore whether aircraft age is related to delays. The `planes` table provides the year of manufacture:

```{r}
head(planes, 2)
```

Again, we need to merge this information with the data in the `flights` table.

One way to do this is with a _join operation_. Joins are a common operation in data bases.

* Murrell's _Data Technologies_ book describes data bases in Section 5.6 and data base queries using SQL in Chapter 7.
* _R for Data Science_ describes relational data in [Chapter 13](https://r4ds.had.co.nz/relational-data.html).

`dplyr` distinguishes between two kinds of joins:

* _mutating joins_;
* _filtering joins_.

These [animated explanations](https://www.garrickadenbuie.com/project/tidyexplain/) may be helpful.

### Mutating Joins

Mutating joins combine variables and rows from two tables by matching rows on _keys_.

* A _primary key_ is a variable, or combination of variables, in a table that uniquely identifies the rows of the table.
* A _foreign key_ is a variable, or combination of variables, that uniquely identifies a row in some table (usually, but not necessarily, a different table).

The simplest join operation is an _inner join_: only rows for which the key appears in both tables are retained.

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("join-inner.png"))
```

_Outer joins_ retain some other rows with `NA` values for new variables:

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("join-outer.png"))
```

Inner and left joins are the most common.

The key of one of the tables should usually be a _primary key_ that uniquely identifies the table's rows. For a left join, the right table key should usually be a primary key.

It is usually a good idea to make sure a primary key candidate really does uniquely identify rows.

For the `planes` table the `tailnum` variable is a primary key:

```{r}
count(planes, tailnum) %>%
    filter(n > 1)
```

For the `flights` data a primary key needs to use multiple variables. A reasonable guess is to use the date, carrier, and flight number, but this isn't quite enough:

```{r}
count(flights, year, month, day, flight, carrier) %>%
    filter(n > 1) %>%
    head(2)
```

Adding the `origin` airport code does produce a primary key:

```{r}
count(flights, year, month, day, flight, carrier, origin) %>%
    filter(n > 1)
```

To relate delays to aircraft age we can start with summaries by `tailnum` and `year` (`year` would be needed if we had data for more than 2013):

```{r}
fl <- filter(flights, ! is.na(dep_time), ! is.na(arr_time)) %>%
    group_by(year, tailnum) %>%
    summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE),
              ave_arr_delay = mean(arr_delay, na.rm = TRUE),
              n = n()) %>%
    ungroup()
```

```{r}
head(fl)
```

A reduced `planes` table with `year` renamed to avoid a conflict:

```{r}
pl <- select(planes, tailnum, plane_year = year)
```

```{r}
pl
```

Now

* use a left join with `tailnum` as the key to bring the `plane_year` value into the summary table
* and then compute the plane age:

```{r}
fl_age <- left_join(fl, pl, "tailnum") %>%
    mutate(plane_age = year - plane_year)
fl_age
```

```{r, include = FALSE}
## not sure if this is still needed
fl <- fl_age
```

An initial plot:

```{r plane-age-initial, eval = FALSE}
ggplot(fl_age, aes(x = plane_age, y = ave_dep_delay)) +
    geom_point() +
    thm
```
```{r plane-age-initial, echo = FALSE, warning = FALSE}
```

Jittering and adjusting the point size may help:

```{r plane-age-jitter, eval = FALSE}
ggplot(fl_age, aes(x = plane_age, y = ave_dep_delay)) +
    geom_point(position = "jitter", size = 0.1) +
    thm
```
```{r plane-age-jitter, echo = FALSE, warning = FALSE}
```

Adding a smooth and adjusting the ranges may also help:

```{r, plane-age-smooth, eval = FALSE}
ggplot(fl_age, aes(x = plane_age, y = ave_dep_delay)) +
    geom_point(position = "jitter", size = 0.1) +
    geom_smooth() +
    xlim(0, 30) +
    ylim(c(-20, 80)) +
    thm
```
```{r, plane-age-smooth, echo = FALSE, warning = FALSE}
```

Another option is to compute means or medians within years:

```{r plane-age-means, eval = FALSE}
flm <- group_by(fl_age, plane_age) %>%
    summarize(ave_dep_delay = mean(ave_dep_delay, na.rm = TRUE)) %>%
    ungroup()
ggplot(flm, aes(x = plane_age, y = ave_dep_delay)) +
    geom_point(na.rm = TRUE) +
    thm
```
```{r plane-age-means, echo = FALSE}
```

Both the smooth and the means show a modest increase over the first 10 years.

A left join can also be used to add the airport name from the `airports` table to the `fl_dest` table.

The primary key in the `airports` table is the three-letter airport code in the `faa` variable:

```{r}
head(airports, 2)
```

The foreign key in the `fl_dest` table to match is the `dest` variable:

```{r}
head(fl_dest, 2)
```

The `left_join` function allows this link to be made by specifying the key as `c("dest" = "faa")`:

```{r}
fl_dest <- left_join(fl_dest,
                     select(airports, faa, dest_name = name),
                     c("dest" = "faa"))
select(fl_dest, dest_name, everything()) %>% head(2)
```

Now a tooltip can provide a more accessible label:

```{r ave_arr_dest_name, eval = FALSE}
p <- ggplot(fl_dest) +
    geom_point(aes(x = n, y = ave_arr_delay,
                   text = paste(dest, dest_name, sep = ": "))) +
    thm
ggplotly(p, tooltip = "text")
```
```{r ave_arr_dest_name, echo = FALSE, warning = FALSE}
```

### Filtering Joins

Filtering joins only affect rows; they do not add variables.

* `semi_join(x, y)` returns all rows in `x` with a match in `y`:

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("join-semi.png"))
```

* `anti_join(x, y)` returns all rows in `x` without a match in `y`:

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("join-anti.png"))
```

`semi_join` is useful for matching filtered results back to a table.
The plot of delay against age showed some surprisingly old aircraft:

```{r}
top_age <- slice_max(fl_age, plane_age, n = 2)
top_age
```

We can identify these aircraft with a semi-join:

```{r}
semi_join(planes, top_age, "tailnum")
```

The corresponding flights can also be picked out with a semi-join:

```{r}
top_age_flights <- semi_join(flights, top_age, "tailnum")
head(top_age_flights, 2)
```

Looking at the carriers or destinations might help:

```{r}
select(top_age_flights, carrier, dest, everything()) %>%
    head(4)
```

The carrier summary:

```{r}
count(top_age_flights, carrier)
```

A check of the documentation for `planes` shows that American and one other airline report _fleet numbers_ rather than tail numbers.

Anti-joins are useful for detecting join problems. For example, many flights have tail numbers that do not match entries in `planes`:

```{r}
fl <- anti_join(flights, planes, "tailnum")
nrow(fl)
```

```{r}
count(fl, tailnum)
```

### Join Issues

* Make sure a primary key you want to use has no missing values.
* Make sure a primary key you want to use really uniquely identifies a row.
* Make sure every value of a foreign key you want to use to link to another table matches a row in the linked table (`anti_join` is a good way to check this).
* Check that your foreign key has no missing values.

## Data Base Interface

Current data on airline flights in the US is available from the Bureau of Transportation Statistics. The data are available in one compressed CSV file per month.

Loaded into R, one month of data requires about 300 Mb of memory.

Data for 1987--2008 was compiled for a DataExpo competition of the _American Statistical Association_ held in 2009.

The data are available on the Linux systems in the directory `/group/statsoft/data/DataExpo09` as compressed CSV files and as a SQLite data base.

Loading the data into R would require about 70 Gb of memory, which is not practical. Using the data base is a good alternative.

SQLite is a simple relational data base that works with data stored on a local disk. More sophisticated data bases allow data to be stored on multiple machines or in the cloud. These data bases are usually accessed by connecting to a server, which typically involves an authentication process. For SQLite the connection is set up by linking to a file.

The RSQLite package provides an interface to SQLite.

The `DBI` package provides a high level interface for using data bases:

* `dbConnect` creates a generic data base connection object.
* `dbListTables` returns a vector with the names of the available tables.
* `dbDisconnect` closes a data base connection.

Analogous packages exist for Python and Perl.

The first step is to connect to the data base.

```{r, eval = FALSE}
library(dplyr)
library(DBI)
library(RSQLite)
db <- dbConnect(SQLite(), "/group/statsoft/data/DataExpo09/ontime.sqlite3")
```

There is only one table:

```{r, eval = FALSE}
dbListTables(db)
## [1] "ontime"
```

`tbl` creates a tibble-like object for a data base table.

```{r, eval = FALSE}
fl <- tbl(db, "ontime")
```

This object is not a regular tibble, but many tibble operations work on it.
To find the variables in the table: ```{r, eval = FALSE} tbl_vars(fl) ## [1] "Year" "Month" "DayofMonth" ## [4] "DayOfWeek" "DepTime" "CRSDepTime" ## [7] "ArrTime" "CRSArrTime" "UniqueCarrier" ## [10] "FlightNum" "TailNum" "ActualElapsedTime" ## [13] "CRSElapsedTime" "AirTime" "ArrDelay" ## [16] "DepDelay" "Origin" "Dest" ## [19] "Distance" "TaxiIn" "TaxiOut" ## [22] "Cancelled" "CancellationCode" "Diverted" ## [25] "CarrierDelay" "WeatherDelay" "NASDelay" ## [28] "SecurityDelay" "LateAircraftDelay" ``` Many `dplyr` operations are supported: ```{r, eval = FALSE} sm <- summarize(fl, last_year = max(year), n = n()) ``` This computation is fast because it does not actually access the data base. An operation that needs the data, such as printing the result, will access the data base and take time: ```{r, eval = FALSE} sm ## # Source: lazy query [?? x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## last_year n ## ## 1 2008 118914458 ``` The `sm` object essentially contains a SQL query, [which can be examined](https://dbplyr.tidyverse.org/articles/sql-translation.html) using `sql_render` from the `dbplyr` package: ```{r, eval = FALSE} dbplyr::sql_render(sm) ## SELECT MAX(`year`) AS `last_year`, COUNT(*) AS `n` ## FROM `ontime` ``` Flights to CID in 2008: ```{r, eval = FALSE} fl_cid <- filter(fl, year == 2008, Dest == "CID") dbplyr::sql_render(fl_cid) ## SELECT * ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ``` Number of flights to CID in 2008: ```{r, eval = FALSE} sm <- summarize(fl_cid, n = n()) dbplyr::sql_render(sm) ## SELECT COUNT(*) AS `n` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ``` ```{r, eval = FALSE} sm ## # Source: lazy query [?? x 1] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## n ## ## 1 2961 ``` Breakdown by where the flights originated: ```{r, eval = FALSE} sm <- group_by(fl_cid, Origin) %>% summarize(n = n()) %>% ungroup() dbplyr::sql_render(sm) ## SELECT `Origin`, COUNT(*) AS `n` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ## GROUP BY `Origin` ``` ```{r, eval = FALSE} sm ## # Source: lazy query [?? x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## Origin n ## ## 1 ATL 119 ## 2 BNA 1 ## 3 CVG 292 ## 4 DEN 230 ## 5 DFW 519 ## 6 DTW 345 ## 7 MSP 241 ## 8 ORD 1214 ``` Many base R functions can be converted to SQL: ```{r, eval = FALSE} sm <- group_by(fl_cid, Origin) %>% summarize(ave_arr_del_pos = mean(pmax(ArrDelay, 0), na.rm = TRUE)) %>% ungroup() dbplyr::sql_render(sm) ## SELECT `Origin`, AVG(MAX(`ArrDelay`, 0.0)) AS `ave_arr_del_pos` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ## GROUP BY `Origin` ``` ```{r, eval = FALSE} sm_cid <- sm sm ## # Source: lazy query [?? 
x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## Origin ave_arr_del_pos ## ## 1 ATL 17.4 ## 2 BNA 18 ## 3 CVG 12.4 ## 4 DEN 12.1 ## 5 DFW 11.3 ## 6 DTW 15.0 ## 7 MSP 20.1 ## 8 ORD 21.6 ``` `mutate()` can also be used: ```{r, eval = FALSE} fl_cid_gain <- mutate(fl_cid, Gain = DepDelay - ArrDelay) dbplyr::sql_render(fl_cid_gain) ## SELECT `Year`, `Month`, `DayofMonth`, `DayOfWeek`, `DepTime`, `CRSDepTime`, `ArrTime`, `CRSArrTime`, `UniqueCarrier`, `FlightNum`, `TailNum`, `ActualElapsedTime`, `CRSElapsedTime`, `AirTime`, `ArrDelay`, `DepDelay`, `Origin`, `Dest`, `Distance`, `TaxiIn`, `TaxiOut`, `Cancelled`, `CancellationCode`, `Diverted`, `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`, `DepDelay` - `ArrDelay` AS `Gain` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ``` ```{r, eval = FALSE} group_by(fl_cid_gain, Origin) %>% summarize(mean_gain = mean(Gain)) %>% ungroup() ## # Source: lazy query [?? x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## Origin mean_gain ## ## 1 ATL 8.24 ## 2 BNA -26 ## 3 CVG -1.60 ## 4 DEN 0.278 ## 5 DFW 4.12 ## 6 DTW 3.04 ## 7 MSP -0.357 ## 8 ORD -1.36 ``` Once a filtered result is small enough it can be brought over as a regular tibble using ```{r, eval = FALSE} tb <- collect(fl_cid) ``` There are some limitations to operations that can be done on tables in data bases. Defining and using a function works for a tibble: ```{r, eval = FALSE} pos_mean <- function(x) mean(pmax(x, 0)) group_by(tb, Origin) %>% summarize(ave_arr_del_pos = pos_mean(ArrDelay)) %>% ungroup() ``` But this fails: ```{r, eval = FALSE} group_by(fl_cid, Origin) %>% summarize(ave_arr_del_pos = pos_mean(ArrDelay)) %>% ungroup() ## Error: no such function: pos_mean ``` Some operations may require some modifications. Recomputing `sm_cid`: ```{r, eval = FALSE} fl <- tbl(db, "ontime") fl_cid <- filter(fl, year == 2008, Dest == "CID") sm_cid <- group_by(fl_cid, Origin) %>% summarize(ave_arr_del_pos = mean(pmax(ArrDelay, 0), na.rm = TRUE)) %>% ungroup() ``` Joining with a local table does not work: ```{r, eval = FALSE} ap <- select(nycflights13::airports, faa, name) sm_cid <- left_join(sm_cid, ap, c(Origin = "faa")) ## Error in `auto_copy()`: ## ! `x` and `y` must share the same src. ## ℹ set `copy` = TRUE (may be slow). ``` But this does work: ```{r, eval = FALSE} sm_cid <- left_join(sm_cid, ap, c(Origin = "faa"), copy = TRUE) ``` This copies the `ap` table to the data base: ```{r, eval = FALSE} dbListTables(db) ## [1] "dbplyr_001" "ontime" "sqlite_stat1" "sqlite_stat4" ``` Finish by disconnecting from the data base: ```{r, eval = FALSE} dbDisconnect(db) ``` This is not necessary for SQLite but is for other data bases, so it is good practice. Some notes: * Building data base queries with `dplyr` is very efficient. * Running the queries can take time. * Transferring (partial) results to R can take time. * Inadvertently asking to transfer a result that is too large could cause problems. * Other data bases may arrange to cache query results. * Most data bases, including SQLite, support creating _indices_ that speed up certain queries at the expense of using more storage. The [`dbplyr` web site](https://dbplyr.tidyverse.org/index.html) provides more information on using data bases with `dplyr`. 
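As an illustration of the last note, an index on a column used in many queries can be created through `DBI` (a sketch only; it assumes write permission on the data base file, which the shared class copy does not provide):

```{r, eval = FALSE}
## speed up queries that filter on Dest, at the cost of extra storage
dbExecute(db, "CREATE INDEX ontime_dest ON ontime (Dest)")
```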
## Other Wrangling Tools

Some additional tools provided by `tidyr`:

* `separate()`: split a character variable into several variables;
* `fill()`: fill in missing values in a variable (e.g. carry forward);
* `complete()`: add missing category classes (e.g. zero counts).

We have already seen `pivot_longer()` and `pivot_wider()`.

### Separating Variables

A data set from the WHO on tuberculosis cases:

```{r}
head(who, 2)
```

Some of the remaining variable names:

```{r}
names(who) %>% tail(20)
```

The variable `new_sp_m2534`, for example, represents the number of cases for

* type `sp` (positive pulmonary smear);
* sex `m`;
* age between 25 and 34.

It would be better (tidier) to have separate variables for

* count
* type
* sex
* lower bound on age bracket
* upper bound on age bracket

A pipeline to clean this up involves pivoting to long form and several `separate` and `mutate` steps.

The original data frame:

```{r}
who
```

First pivot to longer form:

```{r}
who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65,
                          names_to = "key", values_to = "count")
```

```{r}
who_clean
```

Remove the `"new"` or `"new_"` prefix:

```{r}
who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65,
                          names_to = "key", values_to = "count") %>%
    mutate(key = sub("new_?", "", key))
```

```{r}
who_clean
```

Separate `key` into `type` and `sexage`:

```{r}
who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65,
                          names_to = "key", values_to = "count") %>%
    mutate(key = sub("new_?", "", key)) %>%
    separate(key, c("type", "sexage"))
```

```{r}
who_clean
```

Separate `sexage` into `sex` and `age`:

```{r}
who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65,
                          names_to = "key", values_to = "count") %>%
    mutate(key = sub("new_?", "", key)) %>%
    separate(key, c("type", "sexage")) %>%
    separate(sexage, c("sex", "age"), sep = 1)
```

```{r}
who_clean
```

Fix up the age categories:

```{r}
who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65,
                          names_to = "key", values_to = "count") %>%
    mutate(key = sub("new_?", "", key)) %>%
    separate(key, c("type", "sexage")) %>%
    separate(sexage, c("sex", "age"), sep = 1) %>%
    mutate(age = sub("014", "0014", age)) %>%
    mutate(age = sub("65", "65Inf", age))
```

```{r}
who_clean
```

Finally, split `age` into `age_lo` and `age_hi`:

```{r}
who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65,
                          names_to = "key", values_to = "count") %>%
    mutate(key = sub("new_?", "", key)) %>%
    separate(key, c("type", "sexage")) %>%
    separate(sexage, c("sex", "age"), sep = 1) %>%
    mutate(age = sub("014", "0014", age)) %>%
    mutate(age = sub("65", "65Inf", age)) %>%
    separate(age, c("age_lo", "age_hi"), sep = 2, convert = TRUE)
```

```{r}
who_clean
```

A plot of total numbers of cases for a few countries over time:

```{r who-plot, eval = FALSE}
filter(who_clean, iso3 %in% c("CAN", "DEU", "GBR", "USA")) %>%
    group_by(year, iso3) %>%
    summarize(count = sum(count, na.rm = TRUE)) %>%
    ungroup() %>%
    ggplot() +
    geom_line(aes(x = year, y = count, color = iso3)) +
    thm
```
```{r who-plot, echo = FALSE}
```

### Filling In Omitted Values

The [Refugee Processing Center](http://www.wrapsnet.org/) provides information from the United States Refugee Admissions Program (USRAP), including [arrivals by state and nationality](http://www.wrapsnet.org/admissions-and-arrivals/).

Three files are available locally:

* data from [early January 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls);
* data from [early April 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls);
* data from [early March 2018](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls). These are Excel spread sheets. Reading in the third file, with data for October 2017 through February 2018: ```{r, message = FALSE} ref_url <- "http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls" ref_file <- "Arrivals-2018-03-05.xls" if (! file.exists(ref_file)) download.file(ref_url, ref_file) library(readxl) ref <- read_excel(ref_file, skip = 16) ## skip header ref <- head(ref, -2) ## drop last two rows names(ref)[1] <- "Destination" ``` The `Destination` column needs filling in (this is common in spread sheet data): ```{r} ref ``` Fill down the Destination: ```{r} ref_clean <- fill(ref, Destination) ``` ```{r} ref_clean ``` ```{r, include = FALSE} ## check the totals ref.tot <- filter(ref_clean, is.na(Nationality)) %>% select(Destination, Oct : Sep) rt <- filter(ref_clean, ! is.na(Nationality)) %>% select(Destination, Nationality, Oct : Sep) %>% group_by(Destination) %>% summarize_if(is.numeric, sum) stopifnot(identical(rt, ref.tot)) ``` Drop the totals: ```{r} ref_clean <- fill(ref, Destination) %>% filter(! is.na(Nationality)) ``` ```{r} ref_clean ``` Pivot to a tidier form with one row per month: ```{r} ref_clean <- fill(ref, Destination) %>% filter(! is.na(Nationality)) %>% pivot_longer( Oct : Sep, names_to = "month", values_to = "count") ``` ```{r} ref_clean ``` Top 5 destination states: ```{r} group_by(ref_clean, Destination) %>% summarize(count = sum(count)) %>% ungroup() %>% slice_max(count, n = 5) ``` Top 5 origin nationalities: ```{r} group_by(ref_clean, Nationality) %>% summarize(count = sum(count)) %>% ungroup() %>% slice_max(count, n = 5) ``` ### Completing Missing Factor Level Combinations This example is based on a [blog post](https://win-vector.com/2017/02/21/the-zero-bug/) using data from the [Significant Earthquake Database](https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153). The data used is a selection for North America and Hawaii. Reading in the data: ```{r} if (! file.exists("earthquakes.tsv")) download.file("http://homepage.stat.uiowa.edu/~luke/data/earthquakes.tsv", "earthquakes.tsv") eq <- read.table("earthquakes.tsv", header = TRUE, sep = "\t") eq <- as_tibble(eq) ``` ```{r} select(eq, STATE, everything()) ``` Some `STATE` entries are blank, but the location name identifies the state: ```{r} filter(eq, STATE == "") %>% select(STATE, LOCATION_NAME) ``` One way to fix this: ```{r} badidx <- which(eq$STATE == "") badstate <- sub(":.*", "", eq$LOCATION_NAME[badidx]) eq$STATE[badidx] <- state.abb[match(tolower(badstate), tolower(state.name))] ``` Add full state names as a factor: ```{r} states <- data.frame(STATE = state.abb, STATE_NAME = state.name) eq <- left_join(eq, states, "STATE") eq <- mutate(eq, STATE_NAME = factor(STATE_NAME, state.name)) ``` Look at total number of earthquakes by state: ```{r} tbl <- count(eq, STATE_NAME) head(tbl, 6) ``` ```{r, fig.height = 5, fig.width = 8, class.source = "fold-hide"} library(forcats) pb <- ggplot(tbl, aes(y = fct_rev(STATE_NAME), x = n)) + geom_col() + thm + theme(axis.title.y = element_blank()) ps <- ggplot(tbl, aes(y = fct_rev(STATE_NAME), x = n)) + geom_point() + geom_segment(aes(xend = 0, yend = fct_rev(STATE_NAME))) + thm + theme(axis.title.y = element_blank()) pb | ps ``` Many states, including Iowa, are missing. Their counts should be zero. This often impacts visualizations. `complete()` can be used to fill in the zero values. 
```{r}
tbl <- count(eq, STATE_NAME) %>%
    complete(STATE_NAME, fill = list(n = 0))
tbl
```

```{r, fig.height = 7, fig.width = 8, class.source = "fold-hide"}
thm_ytxt_sm <- theme(axis.text.y = element_text(size = 10))
(pb %+% tbl + thm_ytxt_sm) | (ps %+% tbl + thm_ytxt_sm)
```

## Reading

Chapter [_Data transformation_](https://r4ds.had.co.nz/transform.html) of [R for Data Science](https://r4ds.had.co.nz/).

Chapter [_Tidy data_](https://r4ds.had.co.nz/tidy-data.html) of [R for Data Science](https://r4ds.had.co.nz/).

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](`r WLNK("tutorials/dplyr.Rmd")`).

You can run the tutorial with

```{r, eval = FALSE}
STAT4580::runTutorial("dplyr")
```

You can install the current version of the `STAT4580` package with

```{r, eval = FALSE}
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

## Exercises

1. To bring in `dplyr` and the `mpg` data, start by evaluating

    ```r
    library(dplyr)
    library(ggplot2)
    ```

    The `select` function allows variables to be specified in a variety of ways. Which of the following does **not** produce a data frame with only the variables `manufacturer`, `model`, `cty`, `hwy`?

    a. `select(mpg, 1:2, 7:8)`
    b. `select(mpg, starts_with("m"), ends_with("y"))`
    c. `select(mpg, 1:2, cty : hwy)`
    d. `select(mpg, -(displ : drv), -(fl : class))`

2. Consider the code

    ```r
    library(dplyr)
    library(ggplot2)
    filter(mpg, ---) %>% nrow()
    ```

    Which of the replacements for `---` computes the number of Ford vehicles with more than 6 cylinders in the `mpg` table?

    a. `model == "ford", cyl <= 6`
    b. `manufacturer == "ford" | cyl > 6`
    c. `manufacturer == "ford", cyl > 6`
    d. `manufacturer == "ford", cylinders > 6`

3. In the 2013 NYC flights data provided by the `nycflights13` package how many flights were there to Des Moines (FAA code DSM) from NYC in the first three months of 2013?

    a. 13
    b. 98
    c. 64
    d. 77

4. To bring in `dplyr` and the `mpg` data, start by evaluating

    ```r
    library(dplyr)
    library(ggplot2)
    ```

    Which of the following sorts the rows for `mpg` by increasing `cyl` value and, within each `cyl` value, sorts the rows from largest to smallest `hwy` value?

    a. `arrange(mpg, desc(hwy), cyl)`
    b. `arrange(mpg, cyl, desc(hwy))`
    c. `arrange(mpg, desc(cyl), hwy)`
    d. `arrange(mpg, cyl, hwy)`

5. To bring in `dplyr` and the `flights` data, start by evaluating

    ```r
    library(dplyr)
    library(nycflights13)
    ```

    The `dep_time` variable in the `flights` data set from the `nycflights13` package is convenient to look at (529 means 5:29 AM and 2212 means 10:12 PM), but hard to compute with. Which of the following adds variables `dep_hour` and `dep_min` containing the hour and minute of the departure time?

    a. `mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_time %% 60)`
    b. `mutate(flights, dep_hour = dep_time %/% 100, dep_min = dep_time %% 100)`
    c. `mutate(flights, dep_hour = dep_time / 100, dep_min = dep_time - dep_hour)`
    d. `mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_hour %% 60)`

6. Using the `gapminder` data set, the following code computes population totals for each continent and each year:

    ```{r, message = FALSE}
    library(dplyr)
    library(gapminder)
    cpops <- group_by(gapminder, continent, year) %>%
        summarize(pop = sum(pop)) %>%
        ungroup()
    ```

    To produce a plot to compare population growth over the years for the continents it is useful to standardize the population data for each continent, for example by dividing the population values by the average population size for each continent. One way to do this is with a grouped `mutate`. The first line of your result should be

    ```{r, include = FALSE}
    cpops_std <- group_by(cpops, continent) %>%
        mutate(stdpop = pop / mean(pop)) %>%
        ungroup()
    ```

    ```{r}
    head(cpops_std, 1)
    ```

    ```{r, fig.cap = ""}
    library(ggplot2)
    ggplot(cpops_std, aes(x = year, y = stdpop, color = continent)) +
        geom_line()
    ```

    Which of the following produces the correct result?

    a. `cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`
    b. `cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`
    c. `cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`
    d. `cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`

7. Another approach to the previous exercise first creates a table of population averages with

    ```r
    cpops_avg <- group_by(cpops, continent) %>%
        summarize(avg_pop = mean(pop))
    ```

    Then use a left join to add `avg_pop` to the `cpops` table, followed by an ordinary mutate step:

    ```r
    left_join(---) %>%
        mutate(stdpop = pop / avg_pop)
    ```

    Which is the correct replacement for `---`?

    a. `cpops_avg, cpops, "continent"`
    b. `cpops, cpops_avg, "continent"`
    c. `cpops, cpops_avg, "year"`
    d. `cpops_avg, cpops, "year"`