---
#############################################################
#                                                           #
# In RStudio click on "Run Document" to run this tutorial   #
#                                                           #
#############################################################
title: "A Grammar of Data Manipulation"
author: "Luke Tierney"
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include = FALSE}
library(learnr)
library(tidyverse)
knitr::opts_chunk$set(echo = FALSE, comment = "", warning = FALSE)
<<load-packages>>
<<prepare-data>>
```

```{r stop_when_browser_closes, context = "server"}
# stop the app when the browser is closed (or, unfortunately, refreshed)
session$onSessionEnded(stopApp)
```

## Palmer Penguins

```{r, out.width = 500, fig.align = "center", fig.cap = "Artwork by @allison_horst"}
knitr::include_graphics("https://allisonhorst.github.io/palmerpenguins/man/figures/lter_penguins.png")
```
The [**palmerpenguins** package](https://allisonhorst.github.io/palmerpenguins/)
includes data for adult foraging Adélie, Chinstrap, and Gentoo
penguins observed on islands in the Palmer Archipelago near Palmer
Station, Antarctica. Data were collected and made available by
Dr. Kristen Gorman and the Palmer Station Long Term Ecological
Research (LTER) Program.

```{r, echo = TRUE}
library(palmerpenguins)
penguins
```

In this tutorial you will explore the `penguins` data frame.

## Selecting Variables

A useful way to start exploring a data set is to select a few
variables and pass the result to the `summary()` function.
Start with the `species`, `island`, and `sex` variables:

```{r select-start, exercise = TRUE}

```
```{r select-start-solution}
select(penguins, species, island, sex) %>% summary()
```

Select the variable `species` and the two bill variables. There are
many ways to do this; one is to use `starts_with()`.

```{r select-bills, exercise = TRUE}

```
```{r select-bills-solution}
select(penguins, species, starts_with("bill"))
```

Select the `species` variable and the length variables.

```{r select-lengths, exercise = TRUE}

```
```{r select-lengths-solution}
select(penguins, species, contains("length"))
```

Move the `sex` variable into the third position. The `everything()`
function may help.

```{r select-sex, exercise = TRUE}

```
```{r select-sex-solution}
select(penguins, species, island, sex, everything())
```

It is usually a good idea to make sure you understand any missing
values in your data. A first step is to find any variables that have
missing values. You can do this using the `where()` helper and the
`anyNA()` function.

```{r select-na, exercise = TRUE}

```
```{r select-na-solution}
select(penguins, where(anyNA))
```
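
As a small illustration of how `where()` works, the sketch below applies it to a made-up tibble named `toy` (hypothetical data, not part of this tutorial). `where()` takes a predicate function and keeps the columns for which that function returns `TRUE`.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
    id = 1:3,
    height = c(1.2, 2.5, 3.4),
    label = c("a", "b", "c")
)

# Keep only the numeric columns (id and height)
num_cols <- select(toy, where(is.numeric))
names(num_cols)

# Keep only the character columns (label)
chr_cols <- select(toy, where(is.character))
names(chr_cols)
```

Any function that takes a column and returns a single `TRUE` or `FALSE` can be used in the same way.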

## Filtering Rows

Filtering can be used to focus on a subset of the rows. Start by
looking at the penguins recorded on the island Biscoe.

```{r filter-biscoe, exercise = TRUE}

```

```{r filter-biscoe-solution}
filter(penguins, island == "Biscoe")
```

Now look at the female penguins found on Biscoe.

```{r filter-biscoe-female, exercise = TRUE}

```
```{r filter-biscoe-female-solution}
filter(penguins, island == "Biscoe", sex == "female")
```

You can further limit the subset to birds with a body mass of at
least 5,000 grams.

```{r filter-biscoe-female-5K, exercise = TRUE}

```
```{r filter-biscoe-female-5K-solution}
filter(penguins, island == "Biscoe", sex == "female", body_mass_g >= 5000)
```

Looking at cases with missing values is also useful. Start by finding
cases where the bill length is missing.

```{r filter-na-bill-length, exercise = TRUE}

```
```{r filter-na-bill-length-solution}
filter(penguins, is.na(bill_length_mm))
```

Using `if_any()` and `is.na()` you can find all cases where at least
one variable has a missing value:

```{r filter-na-any, exercise = TRUE}

```
```{r filter-na-any-solution}
filter(penguins, if_any(everything(), is.na))
```
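
To see how `if_any()` differs from its companion `if_all()`, here is a minimal sketch on a made-up tibble `toy` (hypothetical data, not part of this tutorial). `if_any()` keeps a row when the predicate holds for at least one of the selected columns; `if_all()` requires it to hold for every selected column.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
    x = c(1, NA, 3, NA),
    y = c("a", "b", NA, NA),
    z = 1:4
)

# Rows where x OR y is missing (rows 2, 3, and 4)
any_missing <- filter(toy, if_any(c(x, y), is.na))

# Rows where x AND y are both missing (row 4 only)
all_missing <- filter(toy, if_all(c(x, y), is.na))
```

The column `z` is carried along so the surviving rows are easy to identify.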

`count()` and `across()` can be used to count the distinct
missing/non-missing patterns in the data:

```{r count-na-non-na-pats, exercise = TRUE}

```
```{r count-na-non-na-pats-solution}
count(penguins, across(where(anyNA), is.na))
```

This includes a count for the pattern where no values are
missing. Combine the last two in a pipeline to see counts only for
patterns where there is at least one missing value.

```{r count-na-only-pats, exercise = TRUE}

```
```{r count-na-only-pats-solution}
filter(penguins, if_any(everything(), is.na)) %>%
    count(across(where(anyNA), is.na))
```


## Adding New Variables

Larger birds will have both larger bill length and bill depth
values. To help distinguish the species it may be useful to create a
variable `bill_ratio` that captures the shape as the ratio of bill
length to bill depth. You can create a data set that adds this
variable using `mutate()`.

```{r bill-ratio, exercise = TRUE}

```
```{r bill-ratio-solution}
mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
```

Converting a value to a ranking can also sometimes be useful.  Create
a variable `mass_rank` with one for the heaviest bird, two for the
second heaviest, and so on. The `min_rank()` function can help.

```{r mass-rank, exercise = TRUE}

```
```{r mass-rank-solution}
mutate(penguins, mass_rank = min_rank(desc(body_mass_g)))
```

Use this variable to extract the rows for the penguins with the top 5 mass
ranks.

```{r mass-rank-top-5, exercise = TRUE}

```
```{r mass-rank-top-5-solution}
mutate(penguins, mass_rank = min_rank(desc(body_mass_g))) %>%
    filter(mass_rank <= 5)
```
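
Because `min_rank()` gives tied values the same rank, filtering on a rank can return more rows than the cutoff suggests. The sketch below shows the tie behavior on a made-up vector `mass` (hypothetical values, chosen only for illustration).

```r
library(dplyr)

# Made-up masses; the two values of 20 are tied for heaviest
mass <- c(10, 20, 20, 5)

# desc() reverses the ordering so the largest value gets rank 1
ranks <- min_rank(desc(mass))
ranks
# → 3 1 1 4
```

Both tied values receive rank 1 and the next rank used is 3, so a filter such as `ranks <= 1` would keep two elements here.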

Extracting the penguins with the top 5 mass ranks can also be done
without creating the `mass_rank` variable by using the `top_n()`
function or the `slice_max()` function.

```{r mass-top-5, exercise = TRUE}

```
```{r mass-top-5-solution}
top_n(penguins, 5, body_mass_g)
```
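
As a rough sketch of the `slice_max()` alternative, the example below uses a made-up tibble `toy` (hypothetical data) to show how its `with_ties` argument controls whether tied rows are kept.

```r
library(dplyr)

# Made-up data: birds "b" and "c" are tied for the largest mass
toy <- tibble(
    name = c("a", "b", "c", "d"),
    mass = c(10, 20, 20, 5)
)

# Default with_ties = TRUE keeps every row tied for the top spot
top_with_ties <- slice_max(toy, mass, n = 1)

# with_ties = FALSE returns exactly n rows
top_no_ties <- slice_max(toy, mass, n = 1, with_ties = FALSE)
```

Unlike `top_n()`, `slice_max()` also returns its rows sorted from largest to smallest.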


## Arranging Rows

Sort the rows so the years appear in increasing order and, within
years, the penguins are sorted by body mass with the heaviest birds
appearing first. Move the year and body mass variables to the
beginning so the result is easier to see.

```{r arrange-year-mass, exercise = TRUE}

```
```{r arrange-year-mass-solution}
select(penguins, year, body_mass_g, everything()) %>%
    arrange(year, desc(body_mass_g))
```
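
The way `arrange()` combines several sort keys can be seen on a small made-up tibble `toy` (hypothetical data): rows are ordered by the first key, and later keys are used only to break ties within it.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
    g = c(2, 1, 2, 1),
    v = c(10, 30, 20, 40)
)

# Increasing g; within each g, decreasing v
sorted <- arrange(toy, g, desc(v))
sorted$v
# → 40 30 20 10
```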

## Summarizing and Grouping

You can use `count()` to find how many penguins were recorded on each island.

```{r count-island, exercise = TRUE}

```
```{r count-island-solution}
count(penguins, island)
```

Similarly, you can compute the number in each species.

```{r count-species, exercise = TRUE}

```
```{r count-species-solution}
count(penguins, species)
```

Next, compute the average body mass for each species.

```{r avg-species-mass, exercise = TRUE}

```
```{r avg-species-mass-solution}
group_by(penguins, species) %>%
    summarize(avg_mass = mean(body_mass_g, na.rm = TRUE))
```

Using `top_n` is one way to find the penguins on each island with the
largest body mass.

```{r max-mass, exercise = TRUE}

```
```{r max-mass-solution}
group_by(penguins, island) %>% top_n(1, body_mass_g)
```

With the `desc` modifier you can find the penguins with the smallest
body masses.

```{r min-mass, exercise = TRUE}

```
```{r min-mass-solution}
group_by(penguins, island) %>% top_n(1, desc(body_mass_g))
```


To find the proportions of male and female birds on each island you
can start by filtering out the birds where sex is not known, count the
number of each sex on each island, and use a grouped mutate to add the
proportions within each island.

```{r sex-prop, exercise = TRUE}

```
```{r sex-prop-solution}
filter(penguins, ! is.na(sex)) %>%
    count(island, sex) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup()
```
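
A grouped `mutate()` evaluates summary functions such as `sum()` separately within each group, which is what makes the proportions add up to one per group. Here is a minimal sketch on a made-up table of counts named `toy` (hypothetical data).

```r
library(dplyr)

# Made-up counts for two groups
toy <- tibble(
    group = c("a", "a", "b", "b"),
    n = c(1, 3, 2, 2)
)

# sum(n) is computed within each group, not over the whole table
props <- group_by(toy, group) %>%
    mutate(p = n / sum(n)) %>%
    ungroup()
props$p
# → 0.25 0.75 0.50 0.50
```

Without the `group_by()` step, `sum(n)` would be the total over all rows and the proportions would add up to one across the whole table instead.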

Similarly, you can compute the proportions of the three species within
each island.

```{r species-prop, exercise = TRUE}

```
```{r species-prop-solution}
count(penguins, island, species) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup()
```

## Completing Missing Combinations

This grouped bar graph of the species proportions on the different
islands looks a bit awkward since not every species is present on
every island:

```{r, eval = TRUE, echo = TRUE}
count(penguins, island, species) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = island,
               y = p,
               fill = species)) +
    geom_col(position = "dodge")
```

It might be better to include zero proportions for the absent species.
You can add these zero values with `complete` from the `tidyr` package.

```{r species-complete, exercise = TRUE}

```
```{r species-complete-solution}
count(penguins, island, species) %>%
    complete(island, species,
             fill = list(n = 0)) %>%
    group_by(island) %>%
    mutate(p = n / sum(n)) %>%
    ungroup() %>%
    ggplot(aes(x = island,
               y = p,
               fill = species)) +
    geom_col(position = "dodge")
```
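
To show what `complete()` does on its own, the sketch below applies it to a made-up table of counts named `toy` (hypothetical data). It adds a row for every combination of the listed variables that is absent, and `fill` supplies the value to use in place of `NA` for the new rows.

```r
library(dplyr)
library(tidyr)

# Made-up counts: the combination (s2, y) is absent
toy <- tibble(
    site = c("s1", "s1", "s2"),
    kind = c("x", "y", "x"),
    n = c(4, 6, 5)
)

# Add the missing (s2, y) row with n set to 0 instead of NA
full <- complete(toy, site, kind, fill = list(n = 0))
nrow(full)
# → 4
```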


## Exercises

### Exercise 1

To bring in `dplyr` and the `mpg` data, start by evaluating

```r
library(dplyr)
library(ggplot2)
```
The `select()` function allows variables to be specified in a variety of
ways.

```{r select-mpg, exercise = TRUE}
```

```{r select-mpg-question}
question(
    paste("Which of the following does **not** produce a data frame with",
          "only the variables `manufacturer`, `model`, `cty`, `hwy`?"),
    answer("`select(mpg, 1:2, 7:8)`", correct = TRUE),
    answer("`select(mpg, starts_with(\"m\"), ends_with(\"y\"))`"),
    answer("`select(mpg, 1:2, cty : hwy)`"),
    answer("`select(mpg, -(displ : drv), -(fl : class))`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 2

Consider the code

```{r num-fords, exercise = TRUE}
library(dplyr)
library(ggplot2)
filter(mpg, ---) %>% nrow()
```

```{r num-fords-question}
question(
    paste("Which of the replacements for `---` computes the number of",
          "Ford vehicles with more than 6 cylinders in the `mpg` table?"),
    answer("`model == \"ford\", cyl <= 6`"),
    answer("`manufacturer == \"ford\" | cyl > 6`"),
    answer("`manufacturer == \"ford\", cyl > 6`", correct = TRUE),
    answer("`manufacturer == \"ford\", cylinders > 6`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 3

Consider the 2013 NYC flights data provided by the `nycflights13` package.

```{r nyc-dsm, exercise = TRUE}
library(nycflights13)

```
```{r nyc-dsm-question}
question(
    paste("How many flights were there to Des Moines (FAA code DSM) from",
          "NYC in the first three months of 2013?"),
    answer("13"),
    answer("98"),
    answer("64"),
    answer("77", correct = TRUE),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 4

Consider the `mpg` data set in the `ggplot2` package.

```{r mpg-sort, exercise = TRUE}
library(dplyr)
library(ggplot2)

```
```{r mpg-sort-question}
question(
    paste("Which of the following sorts the rows for `mpg` by increasing",
          "`cyl` value and, within each `cyl` value, sorts the rows from",
          "largest to smallest `hwy` value?"),
    answer("`arrange(mpg, desc(hwy), cyl)`"),
    answer("`arrange(mpg, cyl, desc(hwy))`", correct = TRUE),
    answer("`arrange(mpg, desc(cyl), hwy)`"),
    answer("`arrange(mpg, cyl, hwy)`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```

### Exercise 5

Consider the `flights` table in the `nycflights13` package.

```{r dep-time, exercise = TRUE}
library(dplyr)
library(nycflights13)

```

The `dep_time` variable in the `flights` data set from the
`nycflights13` package is convenient to look at (529 means 5:29 AM
and 2212 means 10:12 PM), but hard to compute with.

```{r dep-time-question}
question(
    paste("Which of the following adds variables `dep_hour` and `dep_min`",
          "containing hour and minute of the departure time?"),
    answer("`mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_time %% 60)`"),
    answer("`mutate(flights, dep_hour = dep_time %/% 100, dep_min = dep_time %% 100)`", correct = TRUE),
    answer("`mutate(flights, dep_hour = dep_time / 100, dep_min = dep_time - dep_hour)`"),
    answer("`mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_hour %% 60)`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```


### Exercise 6

Using the `gapminder` data set, the following code computes population
totals for each continent and each year:

```{r cpop-std-setup, message = FALSE}
library(dplyr)
library(gapminder)
cpops <- group_by(gapminder, continent, year) %>%
    # convert to double first: summing the integer `pop` column can overflow
    summarize(pop = sum(as.numeric(pop))) %>%
    ungroup()
```
```{r, eval = FALSE, echo = TRUE}
<<cpop-std-setup>>
```
    
To produce a plot to compare population growth over the years for
the continents it is useful to standardize the population data for
each continent, for example by dividing the population values by
the average population size for each continent. One way to do this
is with a grouped `mutate`. The first line of your result should be
    
```{r, include = FALSE}
<<cpop-std-setup>>
cpops_std <- group_by(cpops, continent) %>%
    mutate(stdpop = pop / mean(pop)) %>%
    ungroup()
```
```{r}
head(cpops_std, 1)
```

A plot looks like

```{r, fig.cap = ""}
library(ggplot2)
ggplot(cpops_std, aes(x = year, y = stdpop, color = continent)) +
    geom_line()
```

```{r cpop-std, exercise = TRUE}


```
```{r cpop-std-question}
question(
    "Which of the following produces the correct result?",
    answer("`cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`"),
    answer("`cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`"),
    answer("`cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`", correct = TRUE),
    answer("`cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`"),
    random_answer_order = TRUE,
    allow_retry = TRUE
)
```


### Exercise 7

Another approach to the previous exercise first creates a table of
population averages with
   
```{r cpops-join-setup}
cpops_avg <- group_by(cpops, continent) %>%
    summarize(avg_pop = mean(pop))
```
```{r, eval = FALSE, echo = TRUE}
<<cpops-join-setup>>
```

Then use a left join to add `avg_pop` to the `cpops` table,
followed by an ordinary mutate step:
    
```{r cpops-join, exercise = TRUE}
left_join(---) %>%
    mutate(stdpop = pop / avg_pop)
```

```{r cpops-join-question}
question("Which is the correct replacement for `---`?",
         answer("`cpops_avg, cpops, \"continent\"`"),
         answer("`cpops, cpops_avg, \"continent\"`", correct = TRUE),
         answer("`cpops, cpops_avg, \"year\"`"),
         answer("`cpops_avg, cpops, \"year\"`"),
         random_answer_order = TRUE,
         allow_retry = TRUE
)
```