---
#############################################################
#                                                           #
# In RStudio click on "Run Document" to run this tutorial   #
#                                                           #
#############################################################
title: "A Grammar of Data Manipulation"
author: "Luke Tierney"
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include = FALSE}
library(learnr)
library(tidyverse)
knitr::opts_chunk$set(echo = FALSE, comment = "", warning = FALSE)
<>
<>
```

```{r stop_when_browser_closes, context = "server"}
# stop the app when the browser is closed (or, unfortunately, refreshed)
session$onSessionEnded(stopApp)
```

## Palmer Penguins

```{r, out.width = 500, fig.align = "center", fig.cap = "Artwork by @allison_horst"}
knitr::include_graphics("https://allisonhorst.github.io/palmerpenguins/man/figures/lter_penguins.png")
```

The [**palmerpenguins** package](https://allisonhorst.github.io/palmerpenguins/)
includes data for adult foraging Adélie, Chinstrap, and Gentoo penguins
observed on islands in the Palmer Archipelago near Palmer Station,
Antarctica. Data were collected and made available by Dr. Kristen Gorman
and the Palmer Station Long Term Ecological Research (LTER) Program.

```{r, echo = TRUE}
library(palmerpenguins)
penguins
```

In this tutorial you will explore the `penguins` data frame.

## Selecting Variables

A useful way to start exploring a data set is to select a few
variables and pass the result to the `summary()` function. Start with
the `species`, `island`, and `sex` variables:

```{r select-start, exercise = TRUE}
```

```{r select-start-solution}
select(penguins, species, island, sex) %>%
    summary()
```

Select the variable `species` and the two bill variables. There are
many ways to do this; one is to use `starts_with()`.

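As a hedged illustration of how `starts_with()` matches column names, here is a minimal sketch on a made-up tibble. The data frame `toy` and its column names are hypothetical and are not part of the `penguins` data.

```r
library(dplyr)

# Hypothetical toy data; the column names are invented for illustration.
toy <- tibble(
    id = 1:3,
    wing_length = c(10, 12, 11),
    wing_width = c(4, 5, 4),
    tail_length = c(6, 7, 8)
)

# starts_with() keeps every column whose name begins with the prefix.
wings <- select(toy, id, starts_with("wing"))
names(wings)
```

The same pattern should carry over to any prefix shared by a group of related variables.
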
```{r select-bills, exercise = TRUE}
```

```{r select-bills-solution}
select(penguins, species, starts_with("bill"))
```

Select the `species` variable and the length variables.

```{r select-lengths, exercise = TRUE}
```

```{r select-lengths-solution}
select(penguins, species, contains("length"))
```

Move the `sex` variable into the third position. The `everything()`
function may help.

```{r select-sex, exercise = TRUE}
```

```{r select-sex-solution}
select(penguins, species, island, sex, everything())
```

It is usually a good idea to make sure you understand any missing
values in your data. A first step is to find any variables that have
missing values. You can do this using the `where()` helper and the
`anyNA()` function.

```{r select-na, exercise = TRUE}
```

```{r select-na-solution}
select(penguins, where(anyNA))
```

## Filtering Rows

Filtering can be used to focus on a subset of the rows. Start with
looking at the penguins recorded on the island Biscoe.

```{r filter-biscoe, exercise = TRUE}
```

```{r filter-biscoe-solution}
filter(penguins, island == "Biscoe")
```

Now look at the female penguins found on Biscoe.

```{r filter-biscoe-female, exercise = TRUE}
```

```{r filter-biscoe-female-solution}
filter(penguins, island == "Biscoe", sex == "female")
```

You can further limit the subset to birds with body mass at least
5,000 grams.

```{r filter-biscoe-female-5K, exercise = TRUE}
```

```{r filter-biscoe-female-5K-solution}
filter(penguins, island == "Biscoe", sex == "female", body_mass_g >= 5000)
```

Looking at cases with missing values is also useful. Start by finding
cases where the bill length is missing.

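Missing values need `is.na()` rather than a comparison with `NA`. The sketch below, on a made-up tibble (`toy` and `len` are hypothetical names), shows the difference under that assumption.

```r
library(dplyr)

# Hypothetical toy data with two missing values in `len`.
toy <- tibble(id = 1:4, len = c(3.2, NA, 4.1, NA))

# Comparing with == NA yields NA for every row, so filter() keeps nothing.
eq_rows <- filter(toy, len == NA)

# is.na() returns TRUE exactly where the value is missing.
na_rows <- filter(toy, is.na(len))

nrow(eq_rows)
na_rows$id
```
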
```{r filter-na-bill-length, exercise = TRUE}
```

```{r filter-na-bill-length-solution}
filter(penguins, is.na(bill_length_mm))
```

Using `if_any()` and `is.na()` you can find all cases where at least
one variable has a missing value:

```{r filter-na-any, exercise = TRUE}
```

```{r filter-na-any-solution}
filter(penguins, if_any(everything(), is.na))
```

`count()` and `across()` can be used to count the distinct
missing/non-missing patterns in the data:

```{r count-na-non-na-pats, exercise = TRUE}
```

```{r count-na-non-na-pats-solution}
count(penguins, across(where(anyNA), is.na))
```

This includes a count for the pattern where no values are missing.
Combine the last two in a pipeline to see counts only for patterns
where there is at least one missing value.

```{r count-na-only-pats, exercise = TRUE}
```

```{r count-na-only-pats-solution}
filter(penguins, if_any(everything(), is.na)) %>%
    count(across(where(anyNA), is.na))
```

## Adding New Variables

Larger birds will have both larger bill length and bill depth values.
To help distinguish the species it may be useful to create a variable
`bill_ratio` that captures the shape as the ratio of bill length to
bill depth. You can create a data set that adds this variable using
`mutate()`.

```{r bill-ratio, exercise = TRUE}
```

```{r bill-ratio-solution}
mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
```

Converting a value to a ranking can also sometimes be useful. Create a
variable `mass_rank` with one for the heaviest bird, two for the second
heaviest, and so on. The `min_rank()` function can help.

```{r mass-rank, exercise = TRUE}
```

```{r mass-rank-solution}
mutate(penguins, mass_rank = min_rank(desc(body_mass_g)))
```

Use this variable to extract the rows for the penguins with the top 5
mass ranks.

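Before filtering on a rank it can help to see how `min_rank()` treats ties and missing values. This is a small sketch on a made-up vector (`mass` is hypothetical), not the penguin data.

```r
library(dplyr)

# Hypothetical masses with one tie and one missing value.
mass <- c(50, 70, 70, 40, NA)

# desc() reverses the ordering, so the largest value gets rank 1.
# Tied values share the smallest rank, the next rank is skipped,
# and a missing value stays missing.
ranks <- min_rank(desc(mass))
ranks

# A condition such as ranks <= 2 therefore keeps both tied values
# and drops the missing one once it is used inside filter().
top2 <- filter(tibble(mass, ranks), ranks <= 2)
top2
```

Because of ties, a filter on rank can return more rows than the cutoff suggests.
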
```{r mass-rank-top-5, exercise = TRUE}
```

```{r mass-rank-top-5-solution}
mutate(penguins, mass_rank = min_rank(desc(body_mass_g))) %>%
    filter(mass_rank <= 5)
```

Extracting the penguins with the top 5 mass ranks can also be done
without creating the `mass_rank` variable by using the `top_n()`
function or the `slice_max()` function.

```{r mass-top-5, exercise = TRUE}
```

```{r mass-top-5-solution}
top_n(penguins, 5, body_mass_g)
```

## Arranging Rows

Sort the rows so the years appear in increasing order and, within
years, the penguins are sorted by body mass with the heaviest birds
appearing first. Move the year and body mass variables to the
beginning so the result is easier to see.

```{r arrange-year-mass, exercise = TRUE}
```

```{r arrange-year-mass-solution}
select(penguins, year, body_mass_g, everything()) %>%
    arrange(year, desc(body_mass_g))
```

## Summarizing and Grouping

You can use `count` to find how many penguins were recorded on each
island.

```{r count-island, exercise = TRUE}
```

```{r count-island-solution}
count(penguins, island)
```

Similarly, you can compute the number in each species.

```{r count-species, exercise = TRUE}
```

```{r count-species-solution}
count(penguins, species)
```

Next, compute the average body mass for each species.

```{r avg-species-mass, exercise = TRUE}
```

```{r avg-species-mass-solution}
group_by(penguins, species) %>%
    summarize(avg_mass = mean(body_mass_g, na.rm = TRUE))
```

Using `top_n` is one way to find the penguins on each island with the
largest body mass.

```{r max-mass, exercise = TRUE}
```

```{r max-mass-solution}
group_by(penguins, island) %>%
    top_n(1, body_mass_g)
```

With the `desc` modifier you can find the penguins with the smallest
body masses.

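The effect of `desc()` inside `top_n()` can be sketched on made-up grouped data; `toy`, `site`, and `mass` are hypothetical names. The sketch also shows `slice_min()`, which current dplyr documents as the replacement for this use of the superseded `top_n()`.

```r
library(dplyr)

# Hypothetical toy data with two groups.
toy <- tibble(
    site = c("a", "a", "a", "b", "b"),
    mass = c(30, 10, 20, 5, 8)
)

by_site <- group_by(toy, site)

# top_n() picks the largest values of its weight, so wrapping the
# variable in desc() turns that into the smallest values.
lightest <- top_n(by_site, 1, desc(mass)) %>%
    ungroup()

# slice_min() states the same intent directly.
lightest2 <- slice_min(by_site, mass, n = 1) %>%
    ungroup()

lightest
lightest2
```
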
```{r min-mass, exercise = TRUE}
```

```{r min-mass-solution}
group_by(penguins, island) %>%
    top_n(1, desc(body_mass_g))
```

To find the proportions of male and female birds on each island you
can start by filtering out the birds where sex is not known, count the
number of each sex on each island, and use a grouped mutate to add the
proportions within each island.

```{r sex-prop, exercise = TRUE}
```

```{r sex-prop-solution}
filter(penguins, ! is.na(sex)) %>%
    count(island, sex) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup()
```

Similarly, you can compute the proportions of the three species within
each island.

```{r species-prop, exercise = TRUE}
```

```{r species-prop-solution}
count(penguins, island, species) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup()
```

## Completing Missing Combinations

This grouped bar graph of the species proportions on the different
islands looks a bit awkward since the species are not present on all
islands:

```{r, eval = TRUE, echo = TRUE}
count(penguins, island, species) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = island, y = p, fill = species)) +
    geom_col(position = "dodge")
```

It might be better to include zero proportions for the absent species.
You can add these zero values with `complete` from the `tidyr` package.

```{r species-complete, exercise = TRUE}
```

```{r species-complete-solution}
count(penguins, island, species) %>%
    complete(island, species, fill = list(n = 0)) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = island, y = p, fill = species)) +
    geom_col(position = "dodge")
```

## Exercises

### Exercise 1

To bring in `dplyr` and the `mpg` data, start by evaluating

```r
library(dplyr)
library(ggplot2)
```

The `select()` function allows variables to be specified in a variety
of ways.

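As a rough sketch of that variety, the three calls below pick the same columns from a made-up tibble by position, by name range, and by dropping the rest. The data frame `toy` and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical one-row tibble with columns a through e.
toy <- tibble(a = 1, b = 2, c = 3, d = 4, e = 5)

# By integer position.
by_position <- select(toy, 1:2, 5)

# By a range of adjacent names plus a single name.
by_range <- select(toy, a:b, e)

# By removing a range of names.
by_dropping <- select(toy, -(c:d))

names(by_position)
names(by_range)
names(by_dropping)
```

Positions depend on the column order, so name-based selections tend to be easier to check by eye.
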
```{r select-mpg, exercise = TRUE}
```

```{r select-mpg-question}
question(
    paste("Which of the following does **not** produce a data frame with",
          "only the variables `manufacturer`, `model`, `cty`, `hwy`?"),
    answer("`select(mpg, 1:2, 7:8)`", correct = TRUE),
    answer("`select(mpg, starts_with(\"m\"), ends_with(\"y\"))`"),
    answer("`select(mpg, 1:2, cty : hwy)`"),
    answer("`select(mpg, -(displ : drv), -(fl : class))`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 2

Consider the code

```{r num-fords, exercise = TRUE}
library(dplyr)
library(ggplot2)
filter(mpg, ---) %>% nrow()
```

```{r num-fords-question}
question(
    paste("Which of the replacements for `---` computes the number of",
          "Ford vehicles with more than 6 cylinders in the `mpg` table?"),
    answer("`model == \"ford\", cyl <= 6`"),
    answer("`manufacturer == \"ford\" | cyl > 6`"),
    answer("`manufacturer == \"ford\", cyl > 6`", correct = TRUE),
    answer("`manufacturer == \"ford\", cylinders > 6`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 3

Consider the 2013 NYC flights data provided by the `nycflights13`
package.

```{r nyc-dsm, exercise = TRUE}
library(nycflights13)
```

```{r nyc-dsm-question}
question(
    paste("How many flights were there to Des Moines (FAA code DSM) from",
          "NYC in the first three months of 2013?"),
    answer("13"),
    answer("98"),
    answer("64"),
    answer("77", correct = TRUE),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 4

Consider the `mpg` data set in the `ggplot2` package.

```{r mpg-sort, exercise = TRUE}
library(dplyr)
library(ggplot2)
```

```{r mpg-sort-question}
question(
    paste("Which of the following sorts the rows for `mpg` by increasing",
          "`cyl` value and, within each `cyl` value sorts the rows from",
          "largest to smallest `hwy` value."),
    answer("`arrange(mpg, desc(hwy), cyl)`"),
    answer("`arrange(mpg, cyl, desc(hwy))`", correct = TRUE),
    answer("`arrange(mpg, desc(cyl), hwy)`"),
    answer("`arrange(mpg, cyl, hwy)`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 5

Consider the `flights` table in the `nycflights13` package.

```{r dep-time, exercise = TRUE}
library(dplyr)
library(nycflights13)
```

The `dep_time` variable in the `flights` data set from the
`nycflights13` package is convenient to look at (529 means 5:29 AM and
2212 means 10:12 PM), but hard to compute with.

```{r dep-time-question}
question(
    paste("Which of the following adds variables `dep_hour` and `dep_min`",
          "containing hour and minute of the departure time?"),
    answer("`mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_time %% 60)`"),
    answer("`mutate(flights, dep_hour = dep_time %/% 100, dep_min = dep_time %% 100)`", correct = TRUE),
    answer("`mutate(flights, dep_hour = dep_time / 100, dep_min = dep_time - dep_hour)`"),
    answer("`mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_hour %% 60)`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 6

Using the `gapminder` data set, the following code computes population
totals for each continent and each year:

```{r cpop-std-setup, message = FALSE}
library(dplyr)
library(gapminder)
cpops <- group_by(gapminder, continent, year) %>%
    summarize(pop = sum(pop)) %>%
    ungroup()
```

```{r, eval = FALSE, echo = TRUE}
<>
```

To produce a plot to compare population growth over the years for the
continents it is useful to standardize the population data for each
continent, for example by dividing the population values by the
average population size for each continent.
One way to do this is with a grouped `mutate`. The first line of your
result should be

```{r, include = FALSE}
<>
cpops_std <- group_by(cpops, continent) %>%
    mutate(stdpop = pop / mean(pop)) %>%
    ungroup()
```

```{r}
head(cpops_std, 1)
```

A plot looks like

```{r, fig.cap = ""}
library(ggplot2)
ggplot(cpops_std, aes(x = year, y = stdpop, color = continent)) +
    geom_line()
```

```{r cpop-std, exercise = TRUE}
```

```{r cpop-std-question}
question(
    "Which of the following produces the correct result:",
    answer("`cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`"),
    answer("`cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`"),
    answer("`cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`", correct = TRUE),
    answer("`cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 7

Another approach to the previous exercise first creates a table of
population averages with

```{r cpops-join-setup}
cpops_avg <- group_by(cpops, continent) %>%
    summarize(avg_pop = mean(pop))
```

```{r eval = FALSE, echo = TRUE}
<>
```

Then use a left join to add `avg_pop` to the `cpops` table, followed
by an ordinary mutate step:

```{r cpops-join, exercise = TRUE}
left_join(---) %>%
    mutate(stdpop = pop / avg_pop)
```

```{r cpops-join-question}
question("Which is the correct replacement for `---`?",
    answer("`cpops_avg, cpops, \"continent\"`"),
    answer("`cpops, cpops_avg, \"continent\"`", correct = TRUE),
    answer("`cpops, cpops_avg, \"year\"`"),
    answer("`cpops_avg, cpops, \"year\"`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```
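
The summarize-then-join pattern can be sketched on made-up data to check that it agrees with a grouped mutate. The tables `sales` and `sales_avg` and their columns are hypothetical and stand in for any table with a grouping variable.

```r
library(dplyr)

# Hypothetical toy data: two regions observed in two years.
sales <- tibble(
    region = c("n", "n", "s", "s"),
    year = c(1, 2, 1, 2),
    amount = c(10, 30, 4, 12)
)

# One row per region holding that region's average.
sales_avg <- group_by(sales, region) %>%
    summarize(avg_amount = mean(amount))

# The left join copies each region's average onto every matching row;
# the standardization is then an ordinary, ungrouped mutate.
joined <- left_join(sales, sales_avg, by = "region") %>%
    mutate(std_amount = amount / avg_amount)

# The grouped-mutate version for comparison.
grouped <- group_by(sales, region) %>%
    mutate(std_amount = amount / mean(amount)) %>%
    ungroup()

joined$std_amount
all.equal(joined$std_amount, grouped$std_amount)
```

Keeping the detailed table as the first argument of `left_join()` preserves all of its rows, which is usually what is wanted when attaching group-level summaries.
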