class: center, middle, title-slide .title[ # A Grammar of Data Manipulation ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-06 ] --- layout: true <link rel="stylesheet" href="stat4580.css" type="text/css" /> <style type="text/css"> .remark-code { font-size: 85%; } </style> ## Background --- The `dplyr` package provides a language, or grammar, for data manipulation. -- The language contains a number of _verbs_ that operate on tables in data frames. -- The most commonly used verbs operate on a single data frame: -- * [`select()`](https://dplyr.tidyverse.org/reference/select.html): pick variables by their names -- * [`filter()`](https://dplyr.tidyverse.org/reference/filter.html): choose rows that satisfy some criteria -- * [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html): create transformed or derived variables -- * [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.html): reorder the rows -- * [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.html): collapse rows down to summaries -- There are also a number of `join` verbs that merge several data frames into one. -- The `tidyr` package provides additional verbs, such as `pivot_longer` and `pivot_wider`, for reshaping data frames. --- The single table verbs can also be used with [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) and [`ungroup()`](https://dplyr.tidyverse.org/reference/group_by.html) to work a group at a time instead of applying to the entire data frame. -- The design of `dplyr` is strongly motivated by SQL. -- * `dplyr` can also be used to operate on [tables stored in data bases.](https://dbplyr.tidyverse.org/index.html) -- Some data sets for illustration: -- * EPA vehicle data used in HW4 .hide-code[ ```r library(readr) if (! file.exists("vehicles.csv.zip")) download.file("http://www.stat.uiowa.edu/~luke/data/vehicles.csv.zip", "vehicles.csv.zip") newmpg <- read_csv("vehicles.csv.zip", guess_max = 100000) ``` ] -- * The `nycflights13` package provides data on all flights originating from one of the three main New York City airports in 2013 and heading to airports within the US. .hide-code[ ```r library(nycflights13) ``` ] -- * The `storms` data frame, included in `dplyr`, with data on hurricanes between 1975 and 2015. -- The basic transformation verbs are described in the [Data Transformation chapter](https://r4ds.had.co.nz/transform.html) of [R for Data Science](https://r4ds.had.co.nz/). -- Utilities in `tidyr` are described in the [Tidy Data chapter](https://r4ds.had.co.nz/tidy-data.html). --- layout: true ## Selecting Variables --- Data sets often contain hundreds or even thousands of variables. -- A useful first step is to select a group of interesting variables. -- A reasonable selection of the EPA vehicle variables: ```r newmpg1 <- select(newmpg, make, model, year, cty = city08, hwy = highway08, trans = trany, cyl = cylinders, fuel = fuelType1, displ) head(newmpg1, 2) ## # A tibble: 2 × 9 ## make model year cty hwy trans cyl fuel displ ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 Alfa Romeo Spider Veloce 2000 1985 19 25 Manual 5-spd 4 Regu… 2 ## 2 Ferrari Testarossa 1985 9 14 Manual 5-spd 12 Regu… 4.9 ``` -- .alert[ Variables can be given new names in a `select` call; you can also rename them later with `rename`.
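For example, a minimal sketch (the new name `city_mpg` is chosen here purely for illustration):

```r
## rename() uses the same new_name = old_name form as select()
rename(newmpg1, city_mpg = cty) %>% head(2)
```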
] --- Some variations (the documentation for [`select()`](https://dplyr.tidyverse.org/reference/select.html) gives full details): ```r select(newmpg1, year : trans) %>% head(2) ## # A tibble: 2 × 4 ## year cty hwy trans ## <dbl> <dbl> <dbl> <chr> ## 1 1985 19 25 Manual 5-spd ## 2 1985 9 14 Manual 5-spd ``` -- ```r select(newmpg1, -year, -trans) %>% head(2) ## # A tibble: 2 × 7 ## make model cty hwy cyl fuel displ ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> ## 1 Alfa Romeo Spider Veloce 2000 19 25 4 Regular Gasoline 2 ## 2 Ferrari Testarossa 9 14 12 Regular Gasoline 4.9 ``` -- ```r select(newmpg1, - (year : trans)) %>% head(2) ## # A tibble: 2 × 5 ## make model cyl fuel displ ## <chr> <chr> <dbl> <chr> <dbl> ## 1 Alfa Romeo Spider Veloce 2000 4 Regular Gasoline 2 ## 2 Ferrari Testarossa 12 Regular Gasoline 4.9 ``` --- Numerical column specifications can also be used: ```r select(newmpg1, 1, 3) %>% head(2) ## # A tibble: 2 × 2 ## make year ## <chr> <dbl> ## 1 Alfa Romeo 1985 ## 2 Ferrari 1985 ``` -- ```r select(newmpg1, 1 : 3) %>% head(2) ## # A tibble: 2 × 3 ## make model year ## <chr> <chr> <dbl> ## 1 Alfa Romeo Spider Veloce 2000 1985 ## 2 Ferrari Testarossa 1985 ``` -- ```r select(newmpg1, - (1 : 3)) %>% head(2) ## # A tibble: 2 × 6 ## cty hwy trans cyl fuel displ ## <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 19 25 Manual 5-spd 4 Regular Gasoline 2 ## 2 9 14 Manual 5-spd 12 Regular Gasoline 4.9 ``` --- Some _helper functions_ can also be used in the column specifications. -- The most useful ones are `starts_with`, `ends_with`, and `contains`. .small-code[ ```r select(newmpg, starts_with("fuel")) %>% head(2) ## # A tibble: 2 × 5 ## fuelCost08 fuelCostA08 fuelType fuelType1 fuelType2 ## <dbl> <dbl> <chr> <chr> <chr> ## 1 2450 0 Regular Regular Gasoline <NA> ## 2 4650 0 Regular Regular Gasoline <NA> ``` ] -- .small-code[ ```r select(newmpg, contains("city")) %>% head(2) ## # A tibble: 2 × 12 ## city08 city08U cityA08 cityA08U cityCD cityE cityUF rangeCity rangeCityA UCity ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 19 0 0 0 0 0 0 0 0 23.3 ## 2 9 0 0 0 0 0 0 0 0 11 ## # ℹ 2 more variables: UCityA <dbl>, phevCity <dbl> ``` ] -- .alert[ By default these helpers ignore case. ] --- These can also be preceded by a minus modifier to omit columns: ```r select(newmpg1, -contains("m")) %>% head(2) ## # A tibble: 2 × 7 ## year cty hwy trans cyl fuel displ ## <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 1985 19 25 Manual 5-spd 4 Regular Gasoline 2 ## 2 1985 9 14 Manual 5-spd 12 Regular Gasoline 4.9 ``` -- `contains` requires a literal string; `matches` is analogous but its argument is interpreted as a [_regular expression_](regex.html) pattern to be matched. ```r select(newmpg, matches("^[Ff]uel")) %>% head(2) ## # A tibble: 2 × 5 ## fuelCost08 fuelCostA08 fuelType fuelType1 fuelType2 ## <dbl> <dbl> <chr> <chr> <chr> ## 1 2450 0 Regular Regular Gasoline <NA> ## 2 4650 0 Regular Regular Gasoline <NA> ``` --- Sometimes it is useful to move a few columns to the front; the `everything` helper can be used for that. 
-- The original NYC `flights` table: .small-code[ ```r head(flights, 2) ## # A tibble: 2 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] -- Moving `air_time` to the front: .small-code[ ```r select(flights, air_time, everything()) %>% head(2) ## # A tibble: 2 × 19 ## air_time year month day dep_time sched_dep_time dep_delay arr_time ## <dbl> <int> <int> <int> <int> <int> <dbl> <int> ## 1 227 2013 1 1 517 515 2 830 ## 2 227 2013 1 1 533 529 4 850 ## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, ## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] --- The `where` helper function allows columns to be selected based on a predicate applied to the full column: ```r select(newmpg1, where(anyNA)) %>% head(2) ## # A tibble: 2 × 3 ## trans cyl displ ## <chr> <dbl> <dbl> ## 1 Manual 5-spd 4 2 ## 2 Manual 5-spd 12 4.9 ``` -- ```r select(flights, where(anyNA)) %>% head(2) ## # A tibble: 2 × 6 ## dep_time dep_delay arr_time arr_delay tailnum air_time ## <int> <dbl> <int> <dbl> <chr> <dbl> ## 1 517 2 830 11 N14228 227 ## 2 533 4 850 20 N24211 227 ``` -- .alert[ If there are missing values in your data it is usually a good idea to review them to make sure you understand what they mean. ] --- Columns can also be selected using list operations: ```r vars <- c("make", "model", "year", "city08", "trany", "cylinders", "fuelType1", "displ") newmpg[vars] %>% head(2) ## # A tibble: 2 × 8 ## make model year city08 trany cylinders fuelType1 displ ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 Alfa Romeo Spider Veloce 2000 1985 19 Manual 5… 4 Regular … 2 ## 2 Ferrari Testarossa 1985 9 Manual 5… 12 Regular … 4.9 ``` --- layout: true ## Filtering Rows --- The `filter` function picks out the subset of rows that satisfy one or more conditions. -- All cars with city MPG at or above 130 and model year 2018: ```r filter(newmpg1, cty >= 130, year == 2018) ## # A tibble: 2 × 9 ## make model year cty hwy trans cyl fuel displ ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 Hyundai Ioniq Electric 2018 150 122 Automatic (A1) NA Elect… NA ## 2 Tesla Model 3 Long Range 2018 136 123 Automatic (A1) NA Elect… NA ``` -- All flights from New York to Des Moines on January 1, 2013: ```r filter(flights, day == 1, month == 1, dest == "DSM") %>% select(year : dep_time, origin) ## # A tibble: 1 × 5 ## year month day dep_time origin ## <int> <int> <int> <int> <chr> ## 1 2013 1 1 2119 EWR ``` --- .pull-left[ ### Comparisons The basic comparison operators are * `==`, `!=` * `<`, `<=` * `>`, `>=` ] -- .pull-right.smaller[ .alert[ Be sure to use `==`, not a single `=` character. ] ] -- .pull-left.smaller[ .alert[ Be careful about floating point equality tests: ```r x <- 1 - 0.9 x == 0.1 ## [1] FALSE near(x, 0.1) ## [1] TRUE ``` ] ] -- .pull-right.smaller[ .alert[ Equality comparisons can be used on character vectors and factors. Order comparisons can be used on character vectors, but may produce surprising results, or results that vary with different locale settings. 
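A small sketch of the issue; the value of the order comparison depends on the locale's collation rules, so it can differ between systems:

```r
"bravo" == "Bravo"  ## equality is unambiguous: FALSE
"bravo" < "Bravo"   ## TRUE in some locales, FALSE in others (e.g. the C locale)
```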
] ] --- `%in%` can be used to match against one of several options: ```r filter(flights, day %in% 1 : 2, month == 1, dest == "DSM") %>% select(year : dep_time, origin) ## # A tibble: 2 × 5 ## year month day dep_time origin ## <int> <int> <int> <int> <chr> ## 1 2013 1 1 2119 EWR ## 2 2013 1 2 1921 EWR ``` --- ### Logical Operations .pull-left[ The basic logical operations on vectors are * `&` -- and * `|` -- or * `!` -- not * `xor` -- exclusive or ] -- .pull-right[ Truth tables: <!-- # nolint start --> ```r x <- c(TRUE, TRUE, FALSE, FALSE) y <- c(TRUE, FALSE, TRUE, FALSE) ! x ## [1] FALSE FALSE TRUE TRUE x & y ## [1] TRUE FALSE FALSE FALSE x | y ## [1] TRUE TRUE TRUE FALSE xor(x, y) ## [1] FALSE TRUE TRUE FALSE ``` <!-- # nolint end --> ] -- .alert[ Make sure not to confuse the vectorized logical operators `&` and `|` with the scalar flow control operators `&&` and `||`. ] --- The previous `flights` example can also be written as ```r filter(flights, day == 1 | day == 2, month == 1, dest == "DSM") %>% select(year : dep_time, origin) ## # A tibble: 2 × 5 ## year month day dep_time origin ## <int> <int> <int> <int> <chr> ## 1 2013 1 1 2119 EWR ## 2 2013 1 2 1921 EWR ``` -- It can also be written as ```r filter(flights, (day == 1 | day == 2) & month == 1 & dest == "DSM") %>% select(year : dep_time, origin) ## # A tibble: 2 × 5 ## year month day dep_time origin ## <int> <int> <int> <int> <chr> ## 1 2013 1 1 2119 EWR ## 2 2013 1 2 1921 EWR ``` --- ### Missing Values `filter` only keeps values corresponding to `TRUE` predicate values; `FALSE` and `NA` values are dropped. -- .pull-left[ Arithmetic and comparison operators will always produce `NA` values if one operand is `NA`. <!-- # nolint start --> ```r 1 + NA ## [1] NA NA < 1 ## [1] NA NA == NA ## [1] NA ``` <!-- # nolint end --> ] -- .pull-right[ In a few cases a non-`NA` result can be determined: ```r TRUE | NA ## [1] TRUE NA & FALSE ## [1] FALSE NA ^ 0 ## [1] 1 ``` ] --- `is.na` can be used to also select rows with `NA` values. 
-- Non-electric vehicles with zero or `NA` for `displ`: ```r filter(newmpg1, displ == 0 | is.na(displ), fuel != "Electricity") ## # A tibble: 2 × 9 ## make model year cty hwy trans cyl fuel displ ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 Subaru RX Turbo 1985 22 28 Manual 5-spd NA Regular Gasoline NA ## 2 Subaru RX Turbo 1985 21 27 Manual 5-spd NA Regular Gasoline NA ``` -- Flights with `NA` for both `dep_time` and `arr_time`: ```r filter(flights, is.na(dep_time) & is.na(arr_time)) %>% head(2) %>% select(year : arr_time) ## # A tibble: 2 × 7 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 NA 1630 NA NA ## 2 2013 1 1 NA 1935 NA NA ``` --- An alternative approach that does not use `filter` would be ```r idx <- with(flights, is.na(dep_time) & is.na(arr_time)) flights[idx, ] %>% head(2) %>% select(year : arr_time) ## # A tibble: 2 × 7 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 NA 1630 NA NA ## 2 2013 1 1 NA 1935 NA NA ``` --- layout: true ## Exploring the EPA Data --- Sorting out fuel types: ```r select(newmpg, contains("fuelType")) %>% unique() ## # A tibble: 14 × 3 ## fuelType fuelType1 fuelType2 ## <chr> <chr> <chr> ## 1 Regular Regular Gasoline <NA> ## 2 Premium Premium Gasoline <NA> ## 3 Diesel Diesel <NA> ## 4 CNG Natural Gas <NA> ## 5 Gasoline or natural gas Regular Gasoline Natural Gas ## 6 Gasoline or E85 Regular Gasoline E85 ## 7 Electricity Electricity <NA> ## 8 Gasoline or propane Regular Gasoline Propane ## 9 Premium or E85 Premium Gasoline E85 ## 10 Midgrade Midgrade Gasoline <NA> ## 11 Premium Gas or Electricity Premium Gasoline Electricity ## 12 Regular Gas and Electricity Regular Gasoline Electricity ## 13 Premium and Electricity Premium Gasoline Electricity ## 14 Regular Gas or Electricity Regular Gasoline Electricity ``` --- Variables with missing values: ```r select(newmpg, where(anyNA)) %>% head(2) ## # A tibble: 2 × 18 ## cylinders displ drive eng_dscr mpgData trany guzzler trans_dscr tCharger ## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> ## 1 4 2 Rear-Wheel… (FFS) Y Manu… <NA> <NA> NA ## 2 12 4.9 Rear-Wheel… (GUZZLE… N Manu… T <NA> NA ## # ℹ 9 more variables: sCharger <chr>, atvType <chr>, fuelType2 <chr>, ## # rangeA <chr>, evMotor <chr>, mfrCode <chr>, c240Dscr <chr>, ## # c240bDscr <chr>, startStop <chr> ``` -- Among the reduced data set: ```r select(newmpg1, where(anyNA)) %>% names() ## [1] "trans" "cyl" "displ" ``` --- Unique missing value patterns: ```r incomplete_cases <- function(data) data[! 
complete.cases(data), ] select(newmpg1, trans, cyl, displ, fuel) %>% incomplete_cases() %>% unique() ## # A tibble: 9 × 4 ## trans cyl displ fuel ## <chr> <dbl> <dbl> <chr> ## 1 <NA> NA NA Electricity ## 2 <NA> 8 5.8 Regular Gasoline ## 3 <NA> 6 4.1 Regular Gasoline ## 4 Manual 5-spd NA NA Regular Gasoline ## 5 Manual 5-spd NA 1.3 Regular Gasoline ## 6 Automatic (A1) NA NA Electricity ## 7 Automatic (variable gear ratios) NA NA Electricity ## 8 Automatic (A1) NA 0 Electricity ## 9 Automatic (A2) NA NA Electricity ``` --- An alternative to defining `incomplete_cases()`: ```r select(newmpg1, trans, cyl, displ, fuel) %>% filter(if_any(everything(), is.na)) %>% unique() ## # A tibble: 9 × 4 ## trans cyl displ fuel ## <chr> <dbl> <dbl> <chr> ## 1 <NA> NA NA Electricity ## 2 <NA> 8 5.8 Regular Gasoline ## 3 <NA> 6 4.1 Regular Gasoline ## 4 Manual 5-spd NA NA Regular Gasoline ## 5 Manual 5-spd NA 1.3 Regular Gasoline ## 6 Automatic (A1) NA NA Electricity ## 7 Automatic (variable gear ratios) NA NA Electricity ## 8 Automatic (A1) NA 0 Electricity ## 9 Automatic (A2) NA NA Electricity ``` --- .pull-left.width-30[ Looking at the counts: ] .pull-right.code-80.width-70[ ```r select(newmpg1, trans, cyl, displ, fuel) %>% incomplete_cases() %>% count(trans, cyl, displ, fuel) ## # A tibble: 9 × 5 ## trans cyl displ fuel n ## <chr> <dbl> <dbl> <chr> <int> ## 1 Automatic (A1) NA 0 Electricity 1 ## 2 Automatic (A1) NA NA Electricity 422 ## 3 Automatic (A2) NA NA Electricity 43 ## 4 Automatic (variable gear ratios) NA NA Electricity 9 ## 5 Manual 5-spd NA 1.3 Regular Gasoline 1 ## 6 Manual 5-spd NA NA Regular Gasoline 2 ## 7 <NA> 6 4.1 Regular Gasoline 1 ## 8 <NA> 8 5.8 Regular Gasoline 1 ## 9 <NA> NA NA Electricity 9 ``` ] -- .alert[ .smaller[ `count(trans, cyl, displ, fuel)` is essentially an abbreviation of ```r group_by(trans, cyl, displ, fuel) %>% summarize(n = n()) %>% ungroup() ``` ] ] --- .pull-left.width-30[ Another approach that avoids specifying the variables with missing values: ] .pull-right.code-80.width-70[ ```r incomplete_cases(newmpg1) %>% count(across(where(anyNA)), fuel) ## # A tibble: 9 × 5 ## trans cyl displ fuel n ## <chr> <dbl> <dbl> <chr> <int> ## 1 Automatic (A1) NA 0 Electricity 1 ## 2 Automatic (A1) NA NA Electricity 422 ## 3 Automatic (A2) NA NA Electricity 43 ## 4 Automatic (variable gear ratios) NA NA Electricity 9 ## 5 Manual 5-spd NA 1.3 Regular Gasoline 1 ## 6 Manual 5-spd NA NA Regular Gasoline 2 ## 7 <NA> 6 4.1 Regular Gasoline 1 ## 8 <NA> 8 5.8 Regular Gasoline 1 ## 9 <NA> NA NA Electricity 9 ``` ] --- .pull-left[ Cylinders for electric vehicles: ```r filter(newmpg1, fuel == "Electricity") %>% count(is.na(cyl)) ## # A tibble: 1 × 2 ## `is.na(cyl)` n ## <lgl> <int> ## 1 TRUE 484 ``` ] -- .pull-right[ Displacement for electric vehicles: ```r filter(newmpg1, fuel == "Electricity") %>% count(is.na(displ)) ## # A tibble: 2 × 2 ## `is.na(displ)` n ## <lgl> <int> ## 1 FALSE 1 ## 2 TRUE 483 ``` ] -- Electric vehicles with non-missing displacement: ```r filter(newmpg1, fuel == "Electricity", ! is.na(displ)) ## # A tibble: 1 × 9 ## make model year cty hwy trans cyl fuel displ ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 Mitsubishi i-MiEV 2016 126 99 Automatic (A1) NA Electricity 0 ``` --- Some things to think about: -- * Should electric vehicles be dropped from analyses or visualizations involving cylinders or displacement? -- * Should `cyl` or `displ` be recoded, perhaps as zero? -- * Why are miles per gallon _not_ missing, or infinite, for electric vehicles? 
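If recoding seemed appropriate, a minimal sketch using `mutate()` and `if_else()` might look like the following (whether zero is a sensible code for electric vehicles is a judgment call, not something the data dictate):

```r
## recode cyl and displ as zero for electric vehicles (illustration only)
mutate(newmpg1,
       cyl   = if_else(fuel == "Electricity" & is.na(cyl), 0, cyl),
       displ = if_else(fuel == "Electricity" & is.na(displ), 0, displ))
```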
The [documentation for the data](https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle) has some pointers. --- layout: true ## Adding New Variables --- .pull-left[ `mutate()` can be used to define new variables or modify existing ones. {{content}} ] -- New variables are added at the end. {{content}} -- Later variables can be defined in terms of earlier ones. -- .pull-right[ .alert[ `mutate()` does not modify the original data frame; it creates a new one with the specified modifications. ] ] -- ```r fl <- select(flights, year, month, day, air_time, ends_with("delay")) mutate(fl, gain = dep_delay - arr_delay, air_hours = air_time / 60, gain_per_hour = gain / air_hours) %>% head(2) ## # A tibble: 2 × 9 ## year month day air_time dep_delay arr_delay gain air_hours gain_per_hour ## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2013 1 1 227 2 11 -9 3.78 -2.38 ## 2 2013 1 1 227 4 20 -16 3.78 -4.23 ``` --- `transmute()` keeps only the new variables. ```r transmute(flights, gain = dep_delay - arr_delay, air_hours = air_time / 60, gain_per_hour = gain / air_hours) %>% head(2) ## # A tibble: 2 × 3 ## gain air_hours gain_per_hour ## <dbl> <dbl> <dbl> ## 1 -9 3.78 -2.38 ## 2 -16 3.78 -4.23 ``` --- Useful operators include: -- .pull-left[ * Basic arithmetic operators, logarithms, etc. {{content}} ] -- * Modular arithmetic operators `%/%` and `%%` {{content}} -- * Logical operators and comparison operators. {{content}} -- * Offset operators `lead()` and `lag()`. -- .pull-right[ * cumulative aggregates such as `cumsum()`, `cummean()`. {{content}} ] -- * Ranking operators, such as `min_rank()`, `percent_rank()`. {{content}} -- * Predicates like `is.na`. -- .pull-left.width-70[ A reasonable guess is that flights with `NA` values for both `dep_time` and `arr_time` were cancelled: ```r fl <- mutate(flights, cancelled = is.na(dep_time) & is.na(arr_time)) ``` ] --- For the `geyser` data set from the `MASS` package we used `lag` to obtain the duration of the previous eruption, .pull-left.code-80[ ```r data(geyser, package = "MASS") geyser2 <- mutate(geyser, prev_duration = lag(duration)) head(geyser2, 4) ## waiting duration prev_duration ## 1 80 4.016667 NA ## 2 71 2.150000 4.016667 ## 3 57 4.000000 2.150000 ## 4 80 4.000000 4.000000 ``` {{content}} ] -- ```r thm <- theme_minimal() + theme(text = element_text(size = 16)) ggplot(geyser2) + geom_point(aes(x = prev_duration, y = waiting)) + thm ``` -- .pull-right[ ``` ## Warning: Removed 1 rows containing missing values (`geom_point()`). 
``` <img src="dplyr_files/figure-html/geyser-lag-1.png" style="display: block; margin: auto;" /> ] --- The top five vehicles in `cty` gas mileage (there may be more than five if there is a tie for fifth place): .code-80[ ```r mutate(newmpg1, cty_rank = min_rank(desc(cty))) %>% filter(cty_rank <= 5) ## # A tibble: 5 × 10 ## make model year cty hwy trans cyl fuel displ cty_rank ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <int> ## 1 Hyundai Ioniq Electric 2017 150 122 Auto… NA Elec… NA 1 ## 2 Hyundai Ioniq Electric 2018 150 122 Auto… NA Elec… NA 1 ## 3 Hyundai Ioniq Electric 2019 150 122 Auto… NA Elec… NA 1 ## 4 Tesla Model 3 Standard R… 2020 148 132 Auto… NA Elec… NA 5 ## 5 Tesla Model 3 Standard R… 2021 150 133 Auto… NA Elec… NA 1 ``` ] -- An alternative to explicitly computing ranks and filtering is the `slice_max()` utility function: .code-80[ ```r slice_max(newmpg1, cty, n = 5) ## # A tibble: 5 × 9 ## make model year cty hwy trans cyl fuel displ ## <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl> ## 1 Hyundai Ioniq Electric 2017 150 122 Auto… NA Elec… NA ## 2 Hyundai Ioniq Electric 2018 150 122 Auto… NA Elec… NA ## 3 Hyundai Ioniq Electric 2019 150 122 Auto… NA Elec… NA ## 4 Tesla Model 3 Standard Range Plus… 2021 150 133 Auto… NA Elec… NA ## 5 Tesla Model 3 Standard Range Plus 2020 148 132 Auto… NA Elec… NA ``` ] --- The `percent_rank` function produces a 'percentile rank' (but between 0 and 1). -- The top 5% of countries in life expectancy in 2007 from the `gapminder` data set: ```r library(gapminder) gm <- filter(gapminder, year == 2007) gm_top <- filter(gm, percent_rank(desc(lifeExp)) <= 0.05) gm_top ## # A tibble: 8 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Australia Oceania 2007 81.2 20434176 34435. ## 2 Hong Kong, China Asia 2007 82.2 6980412 39725. ## 3 Iceland Europe 2007 81.8 301931 36181. ## 4 Israel Asia 2007 80.7 6426679 25523. ## 5 Japan Asia 2007 82.6 127467972 31656. ## 6 Spain Europe 2007 80.9 40448191 28821. ## 7 Sweden Europe 2007 80.9 9031088 33860. ## 8 Switzerland Europe 2007 81.7 7554661 37506. ``` --- And the bottom 5%: ```r filter(gm, percent_rank(lifeExp) <= 0.05) ## # A tibble: 8 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 2007 43.8 31889923 975. ## 2 Angola Africa 2007 42.7 12420476 4797. ## 3 Lesotho Africa 2007 42.6 2012649 1569. ## 4 Mozambique Africa 2007 42.1 19951656 824. ## 5 Sierra Leone Africa 2007 42.6 6144562 863. ## 6 Swaziland Africa 2007 39.6 1133066 4513. ## 7 Zambia Africa 2007 42.4 11746035 1271. ## 8 Zimbabwe Africa 2007 43.5 12311143 470. ``` -- Within these results the original row order is retained. -- The `arrange` function can be used to change this. --- layout: true ## Arranging Rows --- `arrange()` reorders the rows according to one or more variables. -- By default the order is smallest to largest; using `desc()` reverses this. -- Additional variables will be used to break ties. -- Arranging the top 5% of countries in life expectancy in 2007 by life expectancy: ```r arrange(gm_top, desc(lifeExp)) ## # A tibble: 8 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Japan Asia 2007 82.6 127467972 31656. ## 2 Hong Kong, China Asia 2007 82.2 6980412 39725. ## 3 Iceland Europe 2007 81.8 301931 36181. ## 4 Switzerland Europe 2007 81.7 7554661 37506. ## 5 Australia Oceania 2007 81.2 20434176 34435. ## 6 Spain Europe 2007 80.9 40448191 28821. 
## 7 Sweden Europe 2007 80.9 9031088 33860. ## 8 Israel Asia 2007 80.7 6426679 25523. ``` --- To order by continent name and order by life expectancy within continent: ```r arrange(gm_top, continent, desc(lifeExp)) ## # A tibble: 8 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Japan Asia 2007 82.6 127467972 31656. ## 2 Hong Kong, China Asia 2007 82.2 6980412 39725. ## 3 Israel Asia 2007 80.7 6426679 25523. ## 4 Iceland Europe 2007 81.8 301931 36181. ## 5 Switzerland Europe 2007 81.7 7554661 37506. ## 6 Spain Europe 2007 80.9 40448191 28821. ## 7 Sweden Europe 2007 80.9 9031088 33860. ## 8 Australia Oceania 2007 81.2 20434176 34435. ``` --- .pull-left[ Missing values are sorted to the end: ```r df <- data.frame(x = c(5, 2, NA, 3)) arrange(df, x) ## x ## 1 2 ## 2 3 ## 3 5 ## 4 NA arrange(df, desc(x)) ## x ## 1 5 ## 2 3 ## 3 2 ## 4 NA ``` ] -- .pull-right[ Using `is.na` you can arrange to put them at the beginning: ```r arrange(df, desc(is.na(x)), x) ## x ## 1 NA ## 2 2 ## 3 3 ## 4 5 ``` ] --- layout: true ## Summarizing and Grouping --- `summarize()` collapses a table down to one row of summaries. -- The average and maximal departure and arrival delays for flights, along with the total number of flights: ```r summarize(flights, ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), max_arr_delay = max(arr_delay, na.rm = TRUE), n = n()) ## # A tibble: 1 × 5 ## ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n ## <dbl> <dbl> <dbl> <dbl> <int> ## 1 12.6 6.90 1301 1272 336776 ``` --- ### Grouped Summaries `summarize()` is most commonly used with _grouped data_ to provide group level summaries. -- Grouping delay summaries by destination: .pull-left.code-80.width-70[ ```r fl_dest <- group_by(flights, dest) %>% summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), max_arr_delay = max(arr_delay, na.rm = TRUE), n = n()) %>% ungroup() head(fl_dest, 5) ## # A tibble: 5 × 6 ## dest ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n ## <chr> <dbl> <dbl> <dbl> <dbl> <int> ## 1 ABQ 13.7 4.38 142 153 254 ## 2 ACK 6.46 4.85 219 221 265 ## 3 ALB 23.6 14.4 323 328 439 ## 4 ANC 12.9 -2.5 75 39 8 ## 5 ATL 12.5 11.3 898 895 17215 ``` ] -- .pull-right.width-30[ .alert[ Calling `ungroup` is not always necessary but sometimes it is, so it is a good habit to get into. ] ] --- The group level summaries can be further transformed, for example to identify the top 10 destinations for average delays: ```r mutate(fl_dest, rank = min_rank(desc(ave_dep_delay))) %>% filter(rank <= 10) ## # A tibble: 10 × 7 ## dest ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n rank ## <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> ## 1 ALB 23.6 14.4 323 328 439 9 ## 2 BHM 29.7 16.9 325 291 297 4 ## 3 CAE 35.6 41.8 222 224 116 1 ## 4 DSM 26.2 19.0 341 322 569 7 ## 5 JAC 26.5 28.1 198 175 25 6 ## 6 MSN 23.6 20.2 340 364 572 10 ## 7 OKC 30.6 30.6 280 262 346 3 ## 8 RIC 23.6 20.1 483 463 2454 8 ## 9 TUL 34.9 33.7 251 262 315 2 ## 10 TYS 28.5 24.1 291 281 631 5 ``` --- ### Missing Values Most summaries will be `NA` if any of the data values being summarized are `NA`. -- Typically, summary functions can be called with `na.rm = TRUE` to produce the summary for all non-missing values. 
```r summarize(flights, ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), max_dep_delay = max(dep_delay, na.rm = TRUE), max_arr_delay = max(arr_delay, na.rm = TRUE), n = n()) ## # A tibble: 1 × 5 ## ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n ## <dbl> <dbl> <dbl> <dbl> <int> ## 1 12.6 6.90 1301 1272 336776 ``` --- You can also use `filter` to remove rows with `NA` values in the variables to be summarized: ```r filter(flights, ! is.na(dep_delay), ! is.na(arr_delay)) %>% summarize(ave_dep_delay = mean(dep_delay), ave_arr_delay = mean(arr_delay), max_dep_delay = max(dep_delay), max_arr_delay = max(arr_delay), n = n()) ## # A tibble: 1 × 5 ## ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n ## <dbl> <dbl> <dbl> <dbl> <int> ## 1 12.6 6.90 1301 1272 327346 ``` -- .alert[ The two approaches do produce different counts. ] --- ### Counts .pull-left[ The helper function `n()` provides a count of the number of rows in the table being summarized. {{content}} ] -- Including counts with grouped summaries is usually a good idea. {{content}} -- Summaries based on small counts are likely to be much less reliable than ones based on larger counts. {{content}} -- This is reflected in higher variability among averages for destinations with fewer flights: ```r ggplot(fl_dest) + geom_point(aes(x = n, y = ave_arr_delay), na.rm = TRUE) + thm ``` -- .pull-right[ <img src="dplyr_files/figure-html/ave_arr_delay-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ A `plotly` version allows the individual destinations to be identified easily: ```r library(plotly) p <- ggplot(fl_dest) + geom_point(aes(x = n, y = ave_arr_delay, text = dest), na.rm = FALSE) + thm ggplotly(p, tooltip = "text") ``` ] .pull-right[
It would be more user-friendly to show the airport name than the three letter FAA code. ] --- ### Useful Summaries Some useful summary functions: -- * Location: `mean`, `median` -- * Spread: `sd`, `IQR`, `mad` -- * Rank: `min`, `quantile`, `max` -- * Position: `first`, `nth`, `last` -- * Counts: `n`. --- layout: true ## Airline Arrival Delays --- ### Different Delay Measures .pull-left[ Sometimes it is useful to look at subsets or truncated versions of a variable. {{content}} ] -- Airlines like to arrange schedules so they are often early. -- .pull-right[ .hide-code[ ```r ggplot(flights, aes(x = arr_delay, y = after_stat(density))) + geom_histogram(binwidth = 1, na.rm = TRUE, fill = "deepskyblue3") + xlim(c(-100, 250)) + thm ``` <img src="dplyr_files/figure-html/unnamed-chunk-60-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- This reduces the average arrival delay but may not help travelers much. --- .pull-left.width-45[ Two options: {{content}} ] -- * Consider only the positive delays for the average. {{content}} -- * Treat negative delays as zero (zero-truncation). {{content}} -- The mean of the positive delays can be computed using _subsetting_: ```r mean(arr_delay[arr_delay > 0]) ``` {{content}} -- The mean of the zero-truncated delays can be computed using the `pmax` function: ```r mean(pmax(arr_delay, 0)) ``` -- .pull-right.width-55[ The summaries are ```r fl_arr <- filter(flights, ! is.na(arr_delay)) fl_avs <- group_by(fl_arr, dest) %>% summarize(ave = mean(arr_delay), ave_pos = mean(arr_delay[arr_delay > 0]), ave_pos_zero = mean(pmax(arr_delay, 0)), n = n()) %>% ungroup() head(fl_avs) ## # A tibble: 6 × 5 ## dest ave ave_pos ave_pos_zero n ## <chr> <dbl> <dbl> <dbl> <int> ## 1 ABQ 4.38 41.9 17.7 254 ## 2 ACK 4.85 28.6 11.3 264 ## 3 ALB 14.4 52.1 22.9 418 ## 4 ANC -2.5 12.4 7.75 8 ## 5 ATL 11.3 37.8 17.8 16837 ## 6 AUS 6.02 40.6 16.6 2411 ``` ] --- .pull-left[ One approach to visualizing the results with color coding the three different summaries: {{content}} ] -- * View the `fl_avs` table as a wide format with `avs`, `ave_pos`, and `ave_pos_zero` as three different measurement types. {{content}} -- * Use `pivot_longer()` to convert to a long format with a variable `which` to hold the summary type and a variable `delay` to hold the summary value: -- .pull-right[ ```r library(tidyr) fl <- pivot_longer(fl_avs, 2 : 4, names_to = "which", values_to = "delay") arrange(fl, dest) ## # A tibble: 312 × 4 ## dest n which delay ## <chr> <int> <chr> <dbl> ## 1 ABQ 254 ave 4.38 ## 2 ABQ 254 ave_pos 41.9 ## 3 ABQ 254 ave_pos_zero 17.7 ## 4 ACK 264 ave 4.85 ## 5 ACK 264 ave_pos 28.6 ## 6 ACK 264 ave_pos_zero 11.3 ## 7 ALB 418 ave 14.4 ## 8 ALB 418 ave_pos 52.1 ## 9 ALB 418 ave_pos_zero 22.9 ## 10 ANC 8 ave -2.5 ## # ℹ 302 more rows ``` ] --- .pull-left[ A plot is then easy to create: ```r ggplot(fl, aes(x = n, y = delay, color = which)) + geom_point() + thm ``` ] .pull-right[ ``` ## Warning: Removed 1 rows containing missing values (`geom_point()`). ``` <img src="dplyr_files/figure-html/aves_pivot-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ An alternative: * Use one layer per summary type. * Specify the summary type for each layer as a `color` aesthetic value. ```r ggplot(fl_avs, aes(x = n)) + geom_point(aes(y = ave, color = "ave")) + geom_point(aes(y = ave_pos, color = "ave_pos")) + geom_point(aes(y = ave_pos_zero, color = "ave_pos_zero")) + thm ``` ] .pull-right[ ``` ## Warning: Removed 1 rows containing missing values (`geom_point()`). 
``` <img src="dplyr_files/figure-html/aves_layer-1.png" style="display: block; margin: auto;" /> ] --- The one missing value case: ```r incomplete_cases(fl_avs) ## # A tibble: 1 × 5 ## dest ave ave_pos ave_pos_zero n ## <chr> <dbl> <dbl> <dbl> <int> ## 1 LEX -22 NaN 0 1 ``` --- .pull-left.width-40[ ### Delay Proportions Proportions satisfying a condition can be computed as a mean of the logical expression for the condition: {{content}} ] -- The proportion of all flights that were delayed: ```r mean(fl_arr$arr_delay > 0) ## [1] 0.4063101 ``` {{content}} -- Grouped by destination: -- .pull-right.width-55.smaller[ ```r fl_props <- group_by(fl_arr, dest) %>% summarize( pd0 = mean(arr_delay > 0), ## proportion delayed pd10 = mean(arr_delay > 10), ## more than 10 minutes pd20 = mean(arr_delay > 20), ## more than 20 minutes n = n()) %>% ungroup() fl_props ## # A tibble: 104 × 5 ## dest pd0 pd10 pd20 n ## <chr> <dbl> <dbl> <dbl> <int> ## 1 ABQ 0.421 0.339 0.268 254 ## 2 ACK 0.394 0.246 0.167 264 ## 3 ALB 0.440 0.333 0.273 418 ## 4 ANC 0.625 0.125 0.125 8 ## 5 ATL 0.472 0.309 0.211 16837 ## 6 AUS 0.408 0.284 0.216 2411 ## 7 AVL 0.456 0.257 0.207 261 ## 8 BDL 0.350 0.262 0.223 412 ## 9 BGR 0.383 0.299 0.246 358 ## 10 BHM 0.450 0.361 0.305 269 ## # ℹ 94 more rows ``` ] --- .pull-left[ These can be visualized similarly: ```r pivot_longer(fl_props, 2 : 4, names_to = "which", values_to = "prop") %>% ggplot(aes(x = n, y = prop, color = which)) + geom_point() + thm ``` ] .pull-right[ <img src="dplyr_files/figure-html/prop_delayed_pivot-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Grouped Mutate and Filter --- `mutate()`, `filter()`, `arrange()`, `slice_max()`, and `slice_min()` can be used on a group level. -- ### Standardizing Within Groups -- `mutate` within groups can be used for standardizing within a group. -- For example, to compute the proportion of a continent's population living in each country for each year: ```r gm <- group_by(gapminder, continent, year) %>% mutate(pop = as.numeric(pop), ppop = pop / sum(pop)) %>% ungroup() ``` --- For Oceania (in this data set) and years 1952 and 2007: ```r gm_oc <- filter(gm, continent == "Oceania", year %in% c(1952, 2007)) %>% arrange(year) gm_oc ## # A tibble: 4 × 7 ## country continent year lifeExp pop gdpPercap ppop ## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> ## 1 Australia Oceania 1952 69.1 8691212 10040. 0.813 ## 2 New Zealand Oceania 1952 69.4 1994794 10557. 0.187 ## 3 Australia Oceania 2007 81.2 20434176 34435. 0.832 ## 4 New Zealand Oceania 2007 80.2 4115771 25185. 
0.168 ``` -- A quick check that that within year and continent the proportions add up to one: ```r group_by(gm_oc, continent, year) %>% summarize(sum_ppop = sum(ppop)) ## # A tibble: 2 × 3 ## # Groups: continent [1] ## continent year sum_ppop ## <fct> <int> <dbl> ## 1 Oceania 1952 1 ## 2 Oceania 2007 1 ``` --- .pull-left.width-40[ Focus on the 2 countries in each continent with the highest proportions of the continent's population: ] -- .pull-right.width-60[ .hide-code[ ```r gmt2 <- group_by(gm, continent, year) %>% slice_max(ppop, n = 2) %>% ungroup() %>% filter(continent != "Oceania") ## drop as there are only two ggplot(gmt2, aes(x = year, y = ppop, group = country, color = country)) + geom_line() + geom_point() + facet_wrap(~ continent, scales = "free_y") + thm ``` <img src="dplyr_files/figure-html/unnamed-chunk-71-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-40[ Another approach: {{content}} ] -- * Create separate plots for each continent. {{content}} -- * Adjust the themes a little. {{content}} -- * Use `facet_wrap` to get the facet label. {{content}} -- * Arrange the plots, e.g. using `patchwork` operators. {{content}} -- A useful feature is that the `%+%` operator can be used to change the base data frame for a plot: -- .pull-right.width-60[ .hide-code[ ```r pAf <- ggplot(filter(gmt2, continent == "Africa"), aes(x = year, y = ppop, col = country)) + geom_line() + geom_point() + facet_wrap(~ continent) + thm + theme(legend.position = "top", legend.title = element_blank()) thm_nx <- theme(axis.title.x = element_blank()) thm_ny <- theme(axis.title.y = element_blank()) pAm <- pAf %+% filter(gmt2, continent == "Americas") pAs <- pAf %+% filter(gmt2, continent == "Asia") pEu <- pAf %+% filter(gmt2, continent == "Europe") + guides(color = guide_legend(nrow = 2)) library(patchwork) ((pAf + thm_nx) | (pAm + thm_nx + thm_ny)) / (pAs | (pEu + thm_ny)) ``` <img src="dplyr_files/figure-html/unnamed-chunk-72-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ ### Changes Over Time {{content}} ] -- Changes over time within a country can also be examined with a grouped `mutate` and the `lag` operator. 
{{content}} -- For variables collected over time the `lag` function can be used to compute the previous value of a variable: -- .pull-right[ ```r gm_us <- filter(gapminder, country == "United States") %>% transmute(year, lifeExp, prev_lifeExp = lag(lifeExp)) gm_us ## # A tibble: 12 × 3 ## year lifeExp prev_lifeExp ## <int> <dbl> <dbl> ## 1 1952 68.4 NA ## 2 1957 69.5 68.4 ## 3 1962 70.2 69.5 ## 4 1967 70.8 70.2 ## 5 1972 71.3 70.8 ## 6 1977 73.4 71.3 ## 7 1982 74.6 73.4 ## 8 1987 75.0 74.6 ## 9 1992 76.1 75.0 ## 10 1997 76.8 76.1 ## 11 2002 77.3 76.8 ## 12 2007 78.2 77.3 ``` ] --- .pull-left[ This can then be used to compute the change from one period to the next: ] .pull-right[ ```r mutate(gm_us, le_change = lifeExp - prev_lifeExp) ## # A tibble: 12 × 4 ## year lifeExp prev_lifeExp le_change ## <int> <dbl> <dbl> <dbl> ## 1 1952 68.4 NA NA ## 2 1957 69.5 68.4 1.05 ## 3 1962 70.2 69.5 0.720 ## 4 1967 70.8 70.2 0.550 ## 5 1972 71.3 70.8 0.580 ## 6 1977 73.4 71.3 2.04 ## 7 1982 74.6 73.4 1.27 ## 8 1987 75.0 74.6 0.370 ## 9 1992 76.1 75.0 1.07 ## 10 1997 76.8 76.1 0.720 ## 11 2002 77.3 76.8 0.5 ## 12 2007 78.2 77.3 0.932 ``` ] --- .pull-left.width-40.smaller-code[ Use this idea to compute changes in `lifeExp` and `gdpPercap` for all countries in a grouped `mutate`: {{content}} ] -- ```r gm <- group_by(gapminder, country) %>% mutate( le_change = lifeExp - lag(lifeExp), gdp_change = gdpPercap - lag(gdpPercap), gdp_pct_change = 100 * gdp_change / lag(gdpPercap)) %>% ungroup() ``` {{content}} -- `slice_max()` can then be applied to data grouped by country to extract the rows with the highest `gdp_pct_change` for each country: -- .pull-right.small-code.width-60[ ```r group_by(gm, country) %>% slice_max(gdp_pct_change, n = 1) %>% ungroup() %>% select(1 : 3, contains("gdp")) ## # A tibble: 142 × 6 ## country continent year gdpPercap gdp_change gdp_pct_change ## <fct> <fct> <int> <dbl> <dbl> <dbl> ## 1 Afghanistan Asia 2007 975. 248. 34.1 ## 2 Albania Europe 2002 4604. 1411. 44.2 ## 3 Algeria Africa 1972 4183. 936. 28.8 ## 4 Angola Africa 2007 4797. 2024. 73.0 ## 5 Argentina Americas 2007 12779. 3982. 45.3 ## 6 Australia Oceania 1967 14526. 2309. 18.9 ## 7 Austria Europe 1957 8843. 2706. 44.1 ## 8 Bahrain Asia 2007 29796. 6392. 27.3 ## 9 Bangladesh Asia 2007 1391. 255. 22.4 ## 10 Belgium Europe 1972 16672. 3523. 26.8 ## # ℹ 132 more rows ``` ] --- Grouping by continent allows the worst drop in life expectancy for each continent to be found: ```r group_by(gm, continent) %>% slice_min(le_change, n = 1) %>% ungroup() %>% select(-contains("gdp")) ## # A tibble: 5 × 6 ## country continent year lifeExp pop le_change ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Rwanda Africa 1992 23.6 7290203 -20.4 ## 2 El Salvador Americas 1977 56.7 4282586 -1.51 ## 3 Cambodia Asia 1977 31.2 6978607 -9.10 ## 4 Montenegro Europe 2002 74.0 720230 -1.46 ## 5 Australia Oceania 1967 71.1 11872264 0.170 ``` -- Without grouping, the row with best percent GDP change overall can be computed: ```r slice_max(gm, gdp_pct_change, n = 1) %>% select(country, continent, year, gdp_pct_change) ## # A tibble: 1 × 4 ## country continent year gdp_pct_change ## <fct> <fct> <int> <dbl> ## 1 Libya Africa 1967 178. 
``` --- The computation of the best GDP growth years by country can be combined into a single pipe: .code-80[ ```r group_by(gapminder, country) %>% mutate(gdp_change = gdpPercap - lag(gdpPercap), gdp_pct_change = 100 * gdp_change / lag(gdpPercap)) %>% ungroup() %>% group_by(country) %>% slice_max(gdp_pct_change, n = 1) %>% ungroup() %>% select(1 : 3, contains("gdp")) ``` ] -- Since the groupings of the two steps are the same and grouped `mutate` preserves the group structure, the middle `ungroup`/`group_by` can be dropped: .pull-left.code-80.width-70[ ```r group_by(gapminder, country) %>% mutate(gdp_change = gdpPercap - lag(gdpPercap), gdp_pct_change = 100 * gdp_change / lag(gdpPercap)) %>% slice_max(gdp_pct_change, n = 1) %>% ungroup() %>% select(1 : 3, contains("gdp")) ``` ] -- .pull-right.width-30[ .alert[ You need to be very careful making an optimization like this! ] ] --- layout: true ## Joins --- <!-- Check out https://www.garrickadenbuie.com/project/tidyexplain/ --> .pull-left[ This interactive plot uses the `tooltip` to show the airport code for the destination: ] .pull-right[ .hide-code[ ```r library(plotly) p <- ggplot(fl_dest) + geom_point(aes(x = n, y = ave_arr_delay, text = dest)) + thm ## Warning in geom_point(aes(x = n, y = ave_arr_delay, text = dest)): Ignoring ## unknown aesthetics: text ggplotly(p, tooltip = "text") ```
] ] --- A nicer approach would be to show the airport name. -- The `airports` table provides this information: ```r head(airports, 2) ## # A tibble: 2 × 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New… ## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chi… ``` -- We need to get this information into the `fl_dest` data frame. --- As another example, we might want to explore whether aircraft age is related to delays. -- The `planes` table provides the year of manufacture: ```r head(planes, 2) ## # A tibble: 2 × 9 ## tailnum year type manufacturer model engines seats speed engine ## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> ## 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo… ## 2 N102UW 1998 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo… ``` -- Again, we need to merge this information with the data in the `flights` table. -- One way to do this is with a _join operation_. --- Joins are a common operation in data bases. -- * Murrell's _Data Technologies_ book describes data bases in Section 5.6 and data base queries using SQL in Chapter 7. -- * _R for Data Science_ describes relational data in [Chapter 13](https://r4ds.had.co.nz/relational-data.html). -- `dplyr` distinguishes between two kinds of joins: -- * _mutating joins_; -- * _filtering joins_. -- These [animated explanations](https://www.garrickadenbuie.com/project/tidyexplain/) may be helpful. --- ### Mutating Joins Mutating joins combine variables and rows from two tables by matching rows on _keys_. -- * A _primary key_ is a variable, or combination of variables, in a table that uniquely identifies the rows of the table. -- * A _foreign key_ is a variable, or combination of variables, that uniquely identifies a row in some table (usually, but not necessarily, a different table). -- .pull-left[ The simplest join operation is an _inner join_: Only rows for which the key appears in both tables are retained. ] .pull-right[ <img src="../img/join-inner.png" width="70%" style="display: block; margin: auto;" /> ] --- .pull-left[ _Outer joins_ retain some other rows with `NA` values for new variables: ] .pull-right[ <img src="../img/join-outer.png" width="70%" style="display: block; margin: auto;" /> ] --- Inner and left joins are the most common. -- The key of one of the tables should usually be a _primary key_ that uniquely identifies the table's rows. -- For a left join, the right table key should usually be a primary key. -- It is usually a good idea to make sure a primary key candidate really does uniquely identify rows. -- For the `planes` table the `tailnum` variable is a primary key: ```r count(planes, tailnum) %>% filter(n > 1) ## # A tibble: 0 × 2 ## # ℹ 2 variables: tailnum <chr>, n <int> ``` --- For the `flights` data a primary key needs to use multiple variables.
-- A reasonable guess is to use the date, carrier and flight number, but this isn't quite enough: ```r count(flights, year, month, day, flight, carrier) %>% filter(n > 1) %>% head(2) ## # A tibble: 2 × 6 ## year month day flight carrier n ## <int> <int> <int> <int> <chr> <int> ## 1 2013 6 8 2269 WN 2 ## 2 2013 6 15 2269 WN 2 ``` -- Adding the `origin` airport code does produce a primary key: ```r count(flights, year, month, day, flight, carrier, origin) %>% filter(n > 1) ## # A tibble: 0 × 7 ## # ℹ 7 variables: year <int>, month <int>, day <int>, flight <int>, ## # carrier <chr>, origin <chr>, n <int> ``` --- To relate delays to aircraft age we can start with summaries by `tailnum` and `year` (`year` would be needed if we had data for more than 2013): ```r fl <- filter(flights, ! is.na(dep_time), ! is.na(arr_time)) %>% group_by(year, tailnum) %>% summarize(ave_dep_delay = mean(dep_delay, na.rm = TRUE), ave_arr_delay = mean(arr_delay, na.rm = TRUE), n = n()) %>% ungroup() ``` -- ```r head(fl) ## # A tibble: 6 × 5 ## year tailnum ave_dep_delay ave_arr_delay n ## <int> <chr> <dbl> <dbl> <int> ## 1 2013 D942DN 31.5 31.5 4 ## 2 2013 N0EGMQ 8.49 9.98 354 ## 3 2013 N10156 17.8 12.7 146 ## 4 2013 N102UW 8 2.94 48 ## 5 2013 N103US -3.20 -6.93 46 ## 6 2013 N104UW 10.1 1.80 46 ``` --- A reduced `planes` table with `year` renamed to avoid a conflict: ```r pl <- select(planes, tailnum, plane_year = year) ``` -- ```r pl ## # A tibble: 3,322 × 2 ## tailnum plane_year ## <chr> <int> ## 1 N10156 2004 ## 2 N102UW 1998 ## 3 N103US 1999 ## 4 N104UW 1999 ## 5 N10575 2002 ## 6 N105UW 1999 ## 7 N107US 1999 ## 8 N108UW 1999 ## 9 N109UW 1999 ## 10 N110UW 1999 ## # ℹ 3,312 more rows ``` --- Now -- * use a left join with `tailnum` as the key to bring the `plane_year` value into the summary table -- * and then compute the plane age: -- ```r fl_age <- left_join(fl, pl, "tailnum") %>% mutate(plane_age = year - plane_year) fl_age ## # A tibble: 4,037 × 7 ## year tailnum ave_dep_delay ave_arr_delay n plane_year plane_age ## <int> <chr> <dbl> <dbl> <int> <int> <int> ## 1 2013 D942DN 31.5 31.5 4 NA NA ## 2 2013 N0EGMQ 8.49 9.98 354 NA NA ## 3 2013 N10156 17.8 12.7 146 2004 9 ## 4 2013 N102UW 8 2.94 48 1998 15 ## 5 2013 N103US -3.20 -6.93 46 1999 14 ## 6 2013 N104UW 10.1 1.80 46 1999 14 ## 7 2013 N10575 22.3 20.7 271 2002 11 ## 8 2013 N105UW 2.58 -0.267 45 1999 14 ## 9 2013 N107US -0.463 -5.73 41 1999 14 ## 10 2013 N108UW 4.22 -1.25 60 1999 14 ## # ℹ 4,027 more rows ``` --- .pull-left[ An initial plot: ```r ggplot(fl_age, aes(x = plane_age, y = ave_dep_delay)) + geom_point() + thm ``` ] .pull-right[ <img src="dplyr_files/figure-html/plane-age-initial-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ Jittering and adjusting the point size may help: ```r ggplot(fl_age, aes(x = plane_age, y = ave_dep_delay)) + geom_point(position = "jitter", size = 0.1) + thm ``` ] .pull-right[ <img src="dplyr_files/figure-html/plane-age-jitter-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ Adding a smooth and adjusting the ranges may also help: ```r ggplot(fl_age, aes(x = plane_age, y = ave_dep_delay)) + geom_point(position = "jitter", size = 0.1) + geom_smooth() + xlim(0, 30) + ylim(c(-20, 80)) + thm ``` ] .pull-right[ <img src="dplyr_files/figure-html/plane-age-smooth-1.png" style="display: block; margin: auto;" /> ] --- .pull-left[ Another option is to compute means or medians within years: ```r flm <- group_by(fl_age, plane_age) %>% summarize(ave_dep_delay = mean(ave_dep_delay, na.rm = TRUE)) %>% 
ungroup() ggplot(flm, aes(x = plane_age, y = ave_dep_delay)) + geom_point(na.rm = TRUE) + thm ``` ] .pull-right[ <img src="dplyr_files/figure-html/plane-age-means-1.png" style="display: block; margin: auto;" /> {{content}} ] -- Both the smooth and the means show a modest increase over the first 10 years. --- A left join can also be used to add the airport name from the airport table to the `fl_dest` table. -- The primary key in the `airport` table is the three-letter airport code in the `faa` variable: ```r head(airports, 2) ## # A tibble: 2 × 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New… ## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chi… ``` -- The foreign key in the `fl_dest` table to match is the `dest` variable: ```r head(fl_dest, 2) ## # A tibble: 2 × 6 ## dest ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n ## <chr> <dbl> <dbl> <dbl> <dbl> <int> ## 1 ABQ 13.7 4.38 142 153 254 ## 2 ACK 6.46 4.85 219 221 265 ``` --- The `left_join` function allows this link to be made by specifying the key as `c("dest" = "faa")`: ```r fl_dest <- left_join(fl_dest, select(airports, faa, dest_name = name), c("dest" = "faa")) select(fl_dest, dest_name, everything()) %>% head(2) ## # A tibble: 2 × 7 ## dest_name dest ave_dep_delay ave_arr_delay max_dep_delay max_arr_delay n ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> ## 1 Albuquerq… ABQ 13.7 4.38 142 153 254 ## 2 Nantucket… ACK 6.46 4.85 219 221 265 ``` --- .pull-left[ Now a tooltip can provide a more accessible label: ```r p <- ggplot(fl_dest) + geom_point(aes(x = n, y = ave_arr_delay, text = paste(dest, dest_name, sep = ": "))) + thm ggplotly(p, tooltip = "text") ``` ] .pull-right[
] --- ### Filtering Joins .pull-left[ Filtering joins only affect rows; they do not add variables. {{content}} ] -- * `semi_join(x, y)` returns all rows in `x` with a match in `y`: -- .pull-right[ <img src="../img/join-semi.png" width="70%" style="display: block; margin: auto;" /> ] -- .pull-left[ * `anti_join(x, y)` returns all rows in `x` without a match in `y`: ] .pull-right[ <img src="../img/join-anti.png" width="70%" style="display: block; margin: auto;" /> ] --- `semi_join` is useful for matching filtered results back to a table. -- The plot of delay against age showed some surprisingly old aircraft: ```r top_age <- slice_max(fl_age, plane_age, n = 2) top_age ## # A tibble: 3 × 7 ## year tailnum ave_dep_delay ave_arr_delay n plane_year plane_age ## <int> <chr> <dbl> <dbl> <int> <int> <int> ## 1 2013 N381AA 5.95 3.5 22 1956 57 ## 2 2013 N201AA 17.4 14.1 51 1959 54 ## 3 2013 N567AA 3.72 -4.30 61 1959 54 ``` -- We can identify these aircraft with a semi-join: ```r semi_join(planes, top_age, "tailnum") ## # A tibble: 3 × 9 ## tailnum year type manufacturer model engines seats speed engine ## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> ## 1 N201AA 1959 Fixed wing single… CESSNA 150 1 2 90 Recip… ## 2 N381AA 1956 Fixed wing multi … DOUGLAS DC-7… 4 102 232 Recip… ## 3 N567AA 1959 Fixed wing single… DEHAVILLAND OTTE… 1 16 95 Recip… ``` --- The corresponding flights can also be picked out with a semi-join: .code-80[ ```r top_age_flights <- semi_join(flights, top_age, "tailnum") head(top_age_flights, 2) ## # A tibble: 2 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 3 909 700 129 1103 850 ## 2 2013 1 3 NA 920 NA NA 1245 ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` ] -- Looking at the carriers or destinations might help: .code-80[ ```r select(top_age_flights, carrier, dest, everything()) %>% head(4) ## # A tibble: 4 × 19 ## carrier dest year month day dep_time sched_dep_time dep_delay arr_time ## <chr> <chr> <int> <int> <int> <int> <int> <dbl> <int> ## 1 AA ORD 2013 1 3 909 700 129 1103 ## 2 AA DFW 2013 1 3 NA 920 NA NA ## 3 AA DFW 2013 1 9 1423 1430 -7 1727 ## 4 AA DFW 2013 1 10 1238 1240 -2 1525 ## # ℹ 10 more variables: sched_arr_time <int>, arr_delay <dbl>, flight <int>, ## # tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ``` ] --- The carrier summary: ```r count(top_age_flights, carrier) ## # A tibble: 1 × 2 ## carrier n ## <chr> <int> ## 1 AA 139 ``` -- A check of the documentation for `planes` shows that American and one other airline report _fleet numbers_ rather than tail numbers. --- Anti-joins are useful for detecting join problems. -- For example, many flights have tail numbers that do not match entries in `planes`: -- .pull-left[ ```r fl <- anti_join(flights, planes, "tailnum") nrow(fl) ## [1] 52606 ``` ] -- .pull-right[ ```r count(fl, tailnum) ## # A tibble: 722 × 2 ## tailnum n ## <chr> <int> ## 1 D942DN 4 ## 2 N0EGMQ 371 ## 3 N14628 1 ## 4 N149AT 22 ## 5 N16632 11 ## 6 N17627 2 ## 7 N1EAMQ 210 ## 8 N200AA 34 ## 9 N24633 8 ## 10 N261AV 6 ## # ℹ 712 more rows ``` ] --- ### Join Issues -- * Make sure a primary key you want to use has no missing values. -- * Make sure a primary key you want to use really uniquely identifies a row. 
-- * Make sure every value of your foreign key you want to use to link to another table matches a row in the linked table (`anti_join` is a good way to do this). -- * Check that your foreign key has no missing values. <!-- - more examples - maybe separate section - with plots - joins - move over some of the later stuff - or maybe separate document --> --- layout: true ## Data Base Interface --- Current data on airline flights in the US is available from the Bureau of Transportation Statistics at <http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236> <!--https://transtats.bts.gov/PREZIP/_On_Time_Performance_2017_1.zip --> -- The data are available in one compressed CSV file per month. -- Loaded into R, one month of data requires about 300 Mb of memory. -- Data for 1987--2008 was compiled for a DataExpo competition of the _American Statistical Association_ held in 2009. -- The data are available on the Linux systems in the directory `/group/statsoft/data/DataExpo09` as compressed CSV files and as a SQLite data base. -- Loading the data into R would require about 70 Gb of memory, which is not practical. -- Using the data base is a good alternative. --- SQLite is a simple relational data base that works with data stored on a local disk. -- More sophisticated data bases allow data to be stored on multiple machines or in the cloud -- These data bases are usually accessed by connecting to a server, which usually involves an authentication process. -- For SQLite the connection is set up by linking to a file. -- The RSQLite package provides an interface to SQLite. -- The `DBI` package provides a high level interface for using data bases. * `dbConnect` creates a generic data base connection object. * `dbListTables` returns a vector with the names of the available tables. * `dbDisconnect` closes a data base connection. -- Analogous packages exist for Python and Perl. --- The first step is to connect to the data base. <!-- # nolint start --> ```r library(dplyr) library(DBI) library(RSQLite) db <- dbConnect(SQLite(), "/group/statsoft/data/DataExpo09/ontime.sqlite3") ``` <!-- # nolint end --> -- There is only one table: ```r dbListTables(db) ## [1] "ontime" ``` -- `tbl` creates a tibble-like object for a data base table. ```r fl <- tbl(db, "ontime") ``` -- This object is not a regular tibble, but many tibble operations work on it. --- To find the variables in the table: ```r tbl_vars(fl) ## [1] "Year" "Month" "DayofMonth" ## [4] "DayOfWeek" "DepTime" "CRSDepTime" ## [7] "ArrTime" "CRSArrTime" "UniqueCarrier" ## [10] "FlightNum" "TailNum" "ActualElapsedTime" ## [13] "CRSElapsedTime" "AirTime" "ArrDelay" ## [16] "DepDelay" "Origin" "Dest" ## [19] "Distance" "TaxiIn" "TaxiOut" ## [22] "Cancelled" "CancellationCode" "Diverted" ## [25] "CarrierDelay" "WeatherDelay" "NASDelay" ## [28] "SecurityDelay" "LateAircraftDelay" ``` --- Many `dplyr` operations are supported: ```r sm <- summarize(fl, last_year = max(year), n = n()) ``` -- This computation is fast because it does not actually access the data base. -- An operation that needs the data, such as printing the result, will access the data base and take time: ```r sm ## # Source: lazy query [?? 
x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## last_year n ## <int> <int> ## 1 2008 118914458 ``` --- The `sm` object essentially contains a SQL query, [which can be examined](https://dbplyr.tidyverse.org/articles/sql-translation.html) using `sql_render` from the `dbplyr` package: ```r dbplyr::sql_render(sm) ## <SQL> SELECT MAX(`year`) AS `last_year`, COUNT(*) AS `n` ## FROM `ontime` ``` --- Flights to CID in 2008: ```r fl_cid <- filter(fl, year == 2008, Dest == "CID") dbplyr::sql_render(fl_cid) ## <SQL> SELECT * ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ``` --- Number of flights to CID in 2008: ```r sm <- summarize(fl_cid, n = n()) dbplyr::sql_render(sm) ## <SQL> SELECT COUNT(*) AS `n` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ``` -- ```r sm ## # Source: lazy query [?? x 1] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## n ## <int> ## 1 2961 ``` --- Breakdown by where the flights originated: .small-code[ ```r sm <- group_by(fl_cid, Origin) %>% summarize(n = n()) %>% ungroup() dbplyr::sql_render(sm) ## <SQL> SELECT `Origin`, COUNT(*) AS `n` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ## GROUP BY `Origin` ``` ] -- .small-code[ ```r sm ## # Source: lazy query [?? x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## Origin n ## <chr> <int> ## 1 ATL 119 ## 2 BNA 1 ## 3 CVG 292 ## 4 DEN 230 ## 5 DFW 519 ## 6 DTW 345 ## 7 MSP 241 ## 8 ORD 1214 ``` ] --- Many base R functions can be converted to SQL: .small-code[ ```r sm <- group_by(fl_cid, Origin) %>% summarize(ave_arr_del_pos = mean(pmax(ArrDelay, 0), na.rm = TRUE)) %>% ungroup() dbplyr::sql_render(sm) ## <SQL> SELECT `Origin`, AVG(MAX(`ArrDelay`, 0.0)) AS `ave_arr_del_pos` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ## GROUP BY `Origin` ``` ] -- .small-code[ ```r sm_cid <- sm sm ## # Source: lazy query [?? x 2] ## # Database: sqlite 3.37.2 ## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3] ## Origin ave_arr_del_pos ## <chr> <dbl> ## 1 ATL 17.4 ## 2 BNA 18 ## 3 CVG 12.4 ## 4 DEN 12.1 ## 5 DFW 11.3 ## 6 DTW 15.0 ## 7 MSP 20.1 ## 8 ORD 21.6 ``` ] --- `mutate()` can also be used: .small-code[ <!-- ## nolint start: line_length --> ```r fl_cid_gain <- mutate(fl_cid, Gain = DepDelay - ArrDelay) dbplyr::sql_render(fl_cid_gain) ## <SQL> SELECT `Year`, `Month`, `DayofMonth`, `DayOfWeek`, `DepTime`, `CRSDepTime`, `ArrTime`, `CRSArrTime`, `UniqueCarrier`, `FlightNum`, `TailNum`, `ActualElapsedTime`, `CRSElapsedTime`, `AirTime`, `ArrDelay`, `DepDelay`, `Origin`, `Dest`, `Distance`, `TaxiIn`, `TaxiOut`, `Cancelled`, `CancellationCode`, `Diverted`, `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`, `DepDelay` - `ArrDelay` AS `Gain` ## FROM `ontime` ## WHERE ((`year` = 2008.0) AND (`Dest` = 'CID')) ``` <!-- ## nolint end --> ```r group_by(fl_cid_gain, Origin) %>% summarize(mean_gain = mean(Gain)) %>% ungroup() ## # Source: lazy query [?? 
x 2]
## # Database: sqlite 3.37.2
## # [/mnt/nfs/clasnetappvm/group/statsoft/data/DataExpo09/ontime.sqlite3]
## Origin mean_gain
## <chr> <dbl>
## 1 ATL 8.24
## 2 BNA -26
## 3 CVG -1.60
## 4 DEN 0.278
## 5 DFW 4.12
## 6 DTW 3.04
## 7 MSP -0.357
## 8 ORD -1.36
```
]

---

Once a filtered result is small enough, it can be brought over as a regular tibble using

```r
tb <- collect(fl_cid)
```

--

There are some limitations on the operations that can be done on tables in data bases. Defining and using a function works for a tibble:

```r
pos_mean <- function(x) mean(pmax(x, 0))
group_by(tb, Origin) %>%
    summarize(ave_arr_del_pos = pos_mean(ArrDelay)) %>%
    ungroup()
```

--

But this fails:

```r
group_by(fl_cid, Origin) %>%
    summarize(ave_arr_del_pos = pos_mean(ArrDelay)) %>%
    ungroup()
## Error: no such function: pos_mean
```

---

Some operations require slight modifications. Recomputing `sm_cid`:

```r
fl <- tbl(db, "ontime")
fl_cid <- filter(fl, year == 2008, Dest == "CID")
sm_cid <- group_by(fl_cid, Origin) %>%
    summarize(ave_arr_del_pos = mean(pmax(ArrDelay, 0), na.rm = TRUE)) %>%
    ungroup()
```

--

Joining with a local table does not work:

```r
ap <- select(nycflights13::airports, faa, name)
sm_cid <- left_join(sm_cid, ap, c(Origin = "faa"))
## Error in `auto_copy()`:
## ! `x` and `y` must share the same src.
## ℹ set `copy` = TRUE (may be slow).
```

---

But this does work:

```r
sm_cid <- left_join(sm_cid, ap, c(Origin = "faa"), copy = TRUE)
```

--

This copies the `ap` table to the data base:

```r
dbListTables(db)
## [1] "dbplyr_001" "ontime" "sqlite_stat1" "sqlite_stat4"
```

--

Finish by disconnecting from the data base:

```r
dbDisconnect(db)
```

--

This is not necessary for SQLite but is for other data bases, so it is good practice.

---

Some notes:

* Building data base queries with `dplyr` is very efficient.

--

* Running the queries can take time.

--

* Transferring (partial) results to R can take time.

--

* Inadvertently asking to transfer a result that is too large could cause problems.

--

* Other data bases may arrange to cache query results.

--

* Most data bases, including SQLite, support creating _indices_ that speed up certain queries at the expense of using more storage.

--

The [`dbplyr` web site](https://dbplyr.tidyverse.org/index.html) provides more information on using data bases with `dplyr`.

---
layout: true

## Other Wrangling Tools

---

Some additional tools provided by `tidyr`:

--

* `separate()`: split a character variable into several variables;

--

* `fill()`: fill in missing values in a variable (e.g. carry the last value forward);

--

* `complete()`: add rows for missing combinations of categories (e.g. zero counts).

--

We have already seen `pivot_longer()` and `pivot_wider()`.
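
--

As a quick illustration (not part of the original notes), here is a minimal sketch of these three verbs on a small made-up table; the `toy` tibble and its column names are hypothetical:

```r
## A tiny made-up table: a missing group value and a combined sex/age code.
library(dplyr)
library(tidyr)

toy <- tibble(group = c("A", NA, "B"),
              sexage = c("m014", "f014", "m014"),
              count = c(3, 1, 2))

toy %>%
    fill(group) %>%                                ## carry "A" forward into the NA
    separate(sexage, c("sex", "age"), sep = 1) %>% ## split "m014" into "m" and "014"
    complete(group, sex, age,                      ## add the missing ("B", "f", "014")
             fill = list(count = 0))               ## combination with a zero count
```

`fill()` carries the last non-missing `group` value forward, `separate()` splits the combined code, and `complete()` adds the absent `group`/`sex`/`age` combination as a zero count.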
--- ### Separating Variables A data set from the WHO on tuberculosis cases: .code-80[ ```r head(who, 2) ## # A tibble: 2 × 60 ## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanis… AF AFG 1980 NA NA NA NA ## 2 Afghanis… AF AFG 1981 NA NA NA NA ## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, ## # new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>, ## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, ## # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>, ## # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, ## # new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, ## # new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, … ``` ] --- Some of the remaining variable names: .small-code[ ```r names(who) %>% tail(20) ## [1] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564" ## [6] "new_ep_f65" "newrel_m014" "newrel_m1524" "newrel_m2534" "newrel_m3544" ## [11] "newrel_m4554" "newrel_m5564" "newrel_m65" "newrel_f014" "newrel_f1524" ## [16] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65" ``` ] -- The variable `new_sp_m2534`, for example, represents the number of cases for * type `sp` (positive pulmonary smear); * sex `m`; * age between 25 and 34. -- It would be better (tidier) to have separate variables for * count * type * sex * lower bound on age bracket * upper bound on age bracket -- A pipeline to clean this up involves pivoting to long form and several `separate` and `mutate` steps. --- .pull-left.width-40.small-code[ The original data frame: ] .pull-right.width-60.small-code[ ```r who ## # A tibble: 7,240 × 60 ## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Afghani… AF AFG 1980 NA NA NA NA ## 2 Afghani… AF AFG 1981 NA NA NA NA ## 3 Afghani… AF AFG 1982 NA NA NA NA ## 4 Afghani… AF AFG 1983 NA NA NA NA ## 5 Afghani… AF AFG 1984 NA NA NA NA ## 6 Afghani… AF AFG 1985 NA NA NA NA ## 7 Afghani… AF AFG 1986 NA NA NA NA ## 8 Afghani… AF AFG 1987 NA NA NA NA ## 9 Afghani… AF AFG 1988 NA NA NA NA ## 10 Afghani… AF AFG 1989 NA NA NA NA ## # ℹ 7,230 more rows ## # ℹ 52 more variables: new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, ## # new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>, ## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, ## # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>, ## # new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, ## # new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>, … ``` ] --- .pull-left.width-40.small-code[ First pivot to longer form: ```r who_clean <- * pivot_longer(who, * new_sp_m014 : newrel_f65, * names_to = "key", * values_to = "count") ``` ] .pull-right.width-60.small-code[ ```r who_clean ## # A tibble: 405,440 × 6 ## country iso2 iso3 year key count ## <chr> <chr> <chr> <dbl> <chr> <dbl> ## 1 Afghanistan AF AFG 1980 new_sp_m014 NA ## 2 Afghanistan AF AFG 1980 new_sp_m1524 NA ## 3 Afghanistan AF AFG 1980 new_sp_m2534 NA ## 4 Afghanistan AF AFG 1980 new_sp_m3544 NA ## 5 Afghanistan AF AFG 1980 new_sp_m4554 NA ## 6 Afghanistan AF AFG 1980 new_sp_m5564 NA ## 7 Afghanistan AF AFG 1980 new_sp_m65 NA ## 8 Afghanistan AF AFG 1980 new_sp_f014 NA ## 9 Afghanistan AF AFG 1980 new_sp_f1524 NA ## 10 Afghanistan AF AFG 1980 new_sp_f2534 NA ## # ℹ 405,430 more rows ``` ] --- .pull-left.width-40.small-code[ Remove `"new"` or `"new_"` prefix: ```r who_clean <- 
pivot_longer(who, new_sp_m014 : newrel_f65, names_to = "key", values_to = "count") %>% * mutate(key = sub("new_?", "", key)) ``` ] .pull-right.width-60.small-code[ ```r who_clean ## # A tibble: 405,440 × 6 ## country iso2 iso3 year key count ## <chr> <chr> <chr> <dbl> <chr> <dbl> ## 1 Afghanistan AF AFG 1980 sp_m014 NA ## 2 Afghanistan AF AFG 1980 sp_m1524 NA ## 3 Afghanistan AF AFG 1980 sp_m2534 NA ## 4 Afghanistan AF AFG 1980 sp_m3544 NA ## 5 Afghanistan AF AFG 1980 sp_m4554 NA ## 6 Afghanistan AF AFG 1980 sp_m5564 NA ## 7 Afghanistan AF AFG 1980 sp_m65 NA ## 8 Afghanistan AF AFG 1980 sp_f014 NA ## 9 Afghanistan AF AFG 1980 sp_f1524 NA ## 10 Afghanistan AF AFG 1980 sp_f2534 NA ## # ℹ 405,430 more rows ``` ] --- .pull-left.width-40.small-code[ Separate `key` into `type` and `sexage`: ```r who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65, names_to = "key", values_to = "count") %>% mutate(key = sub("new_?", "", key)) %>% * separate(key, c("type", "sexage")) ``` ] .pull-right.width-60.small-code[ ```r who_clean ## # A tibble: 405,440 × 7 ## country iso2 iso3 year type sexage count ## <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 1980 sp m014 NA ## 2 Afghanistan AF AFG 1980 sp m1524 NA ## 3 Afghanistan AF AFG 1980 sp m2534 NA ## 4 Afghanistan AF AFG 1980 sp m3544 NA ## 5 Afghanistan AF AFG 1980 sp m4554 NA ## 6 Afghanistan AF AFG 1980 sp m5564 NA ## 7 Afghanistan AF AFG 1980 sp m65 NA ## 8 Afghanistan AF AFG 1980 sp f014 NA ## 9 Afghanistan AF AFG 1980 sp f1524 NA ## 10 Afghanistan AF AFG 1980 sp f2534 NA ## # ℹ 405,430 more rows ``` ] --- .pull-left.width-40.small-code[ Separate `sexage` into `sex` and `age`: ```r who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65, names_to = "key", values_to = "count") %>% mutate(key = sub("new_?", "", key)) %>% separate(key, c("type", "sexage")) %>% * separate(sexage, c("sex", "age"), * sep = 1) ``` ] .pull-right.width-60.small-code[ ```r who_clean ## # A tibble: 405,440 × 8 ## country iso2 iso3 year type sex age count ## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 1980 sp m 014 NA ## 2 Afghanistan AF AFG 1980 sp m 1524 NA ## 3 Afghanistan AF AFG 1980 sp m 2534 NA ## 4 Afghanistan AF AFG 1980 sp m 3544 NA ## 5 Afghanistan AF AFG 1980 sp m 4554 NA ## 6 Afghanistan AF AFG 1980 sp m 5564 NA ## 7 Afghanistan AF AFG 1980 sp m 65 NA ## 8 Afghanistan AF AFG 1980 sp f 014 NA ## 9 Afghanistan AF AFG 1980 sp f 1524 NA ## 10 Afghanistan AF AFG 1980 sp f 2534 NA ## # ℹ 405,430 more rows ``` ] --- .pull-left.width-40.small-code[ Fix up age categories: ```r who_clean <- pivot_longer(who, new_sp_m014 : newrel_f65, names_to = "key", values_to = "count") %>% mutate(key = sub("new_?", "", key)) %>% separate(key, c("type", "sexage")) %>% separate(sexage, c("sex", "age"), sep = 1) %>% * mutate(age = sub("014", "0014", age)) %>% * mutate(age = sub("65", "65Inf", age)) ``` ] .pull-right.width-60.small-code[ ```r who_clean ## # A tibble: 405,440 × 8 ## country iso2 iso3 year type sex age count ## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> ## 1 Afghanistan AF AFG 1980 sp m 0014 NA ## 2 Afghanistan AF AFG 1980 sp m 1524 NA ## 3 Afghanistan AF AFG 1980 sp m 2534 NA ## 4 Afghanistan AF AFG 1980 sp m 3544 NA ## 5 Afghanistan AF AFG 1980 sp m 4554 NA ## 6 Afghanistan AF AFG 1980 sp m 5564 NA ## 7 Afghanistan AF AFG 1980 sp m 65Inf NA ## 8 Afghanistan AF AFG 1980 sp f 0014 NA ## 9 Afghanistan AF AFG 1980 sp f 1524 NA ## 10 Afghanistan AF AFG 1980 sp f 2534 NA ## # ℹ 405,430 more rows ``` ] --- 
.pull-left.width-40.small-code[

Finally, split `age` into `age_lo` and `age_hi`:

```r
who_clean <-
    pivot_longer(who,
                 new_sp_m014 : newrel_f65,
                 names_to = "key",
                 values_to = "count") %>%
    mutate(key = sub("new_?", "", key)) %>%
    separate(key, c("type", "sexage")) %>%
    separate(sexage, c("sex", "age"),
             sep = 1) %>%
    mutate(age = sub("014", "0014", age)) %>%
    mutate(age = sub("65", "65Inf", age)) %>%
*   separate(age, c("age_lo", "age_hi"),
*            sep = 2,
*            convert = TRUE)
```
]

.pull-right.width-60.small-code[

```r
who_clean
## # A tibble: 405,440 × 9
## country iso2 iso3 year type sex age_lo age_hi count
## <chr> <chr> <chr> <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 Afghanistan AF AFG 1980 sp m 0 14 NA
## 2 Afghanistan AF AFG 1980 sp m 15 24 NA
## 3 Afghanistan AF AFG 1980 sp m 25 34 NA
## 4 Afghanistan AF AFG 1980 sp m 35 44 NA
## 5 Afghanistan AF AFG 1980 sp m 45 54 NA
## 6 Afghanistan AF AFG 1980 sp m 55 64 NA
## 7 Afghanistan AF AFG 1980 sp m 65 Inf NA
## 8 Afghanistan AF AFG 1980 sp f 0 14 NA
## 9 Afghanistan AF AFG 1980 sp f 15 24 NA
## 10 Afghanistan AF AFG 1980 sp f 25 34 NA
## # ℹ 405,430 more rows
```
]

---

.pull-left.small-code[

A plot of the total number of cases for a few countries over time:

```r
filter(who_clean, iso3 %in% c("CAN", "DEU", "GBR", "USA")) %>%
    group_by(year, iso3) %>%
    summarize(count = sum(count, na.rm = TRUE)) %>%
    ungroup() %>%
    ggplot() +
    geom_line(aes(x = year, y = count, color = iso3)) +
    thm
```
]

.pull-right[
<img src="dplyr_files/figure-html/who-plot-1.png" style="display: block; margin: auto;" />
]

---

### Filling In Omitted Values

The [Refugee Processing Center](http://www.wrapsnet.org/) provides information from the United States Refugee Admissions Program (USRAP), including [arrivals by state and nationality](http://www.wrapsnet.org/admissions-and-arrivals/).

--

.pull-left.width-30[
Three files are available locally:
{{content}}
]

--

* data from [early January 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-01-06.xls);
{{content}}

--

* data from [early April 2017](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2017-04-05.xls);
{{content}}

--

* data from [early March 2018](http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls).
{{content}}

--

These are Excel spread sheets.

--

.pull-right.width-65[
Reading in the third file, with data for October 2017 through February 2018:

```r
ref_url <- "http://homepage.stat.uiowa.edu/~luke/data/Arrivals-2018-03-05.xls"
ref_file <- "Arrivals-2018-03-05.xls"
if (! file.exists(ref_file)) download.file(ref_url, ref_file)
library(readxl)
ref <- read_excel(ref_file, skip = 16) ## skip header
ref <- head(ref, -2) ## drop last two rows
names(ref)[1] <- "Destination"
```
]

---

.pull-left.width-30[
The `Destination` column needs filling in (this is common in spread sheet data):
]

.pull-right.width-70.smaller-code[

```r
ref
## # A tibble: 500 × 18
## Destination Nationality Oct Nov Dec Jan Feb Mar Apr May Jun
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama <NA> 0 0 3 1 4 0 0 0 0
## 2 <NA> Dem. Rep. … 0 0 3 0 4 0 0 0 0
## 3 <NA> El Salvador 0 0 0 1 0 0 0 0 0
## 4 Alaska <NA> 9 2 6 0 0 0 0 0 0
## 5 <NA> Russia 0 0 6 0 0 0 0 0 0
## 6 <NA> Ukraine 9 2 0 0 0 0 0 0 0
## 7 Arizona <NA> 26 47 71 73 110 0 0 0 0
## 8 <NA> Afghanistan 0 1 14 10 17 0 0 0 0
## 9 <NA> Algeria 0 0 0 0 1 0 0 0 0
## 10 <NA> Bhutan 2 3 8 2 0 0 0 0 0
## # ℹ 490 more rows
## # ℹ 7 more variables: Jul <dbl>, Aug <dbl>, Sep <dbl>, Cases <dbl>, Inds <dbl>,
## # State <dbl>, U.S.
<dbl> ``` ] --- .pull-left.width-30[ Fill down the Destination: ```r ref_clean <- * fill(ref, Destination) ``` ] .pull-right.width-70.smaller-code[ ```r ref_clean ## # A tibble: 500 × 18 ## Destination Nationality Oct Nov Dec Jan Feb Mar Apr May Jun ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Alabama <NA> 0 0 3 1 4 0 0 0 0 ## 2 Alabama Dem. Rep. … 0 0 3 0 4 0 0 0 0 ## 3 Alabama El Salvador 0 0 0 1 0 0 0 0 0 ## 4 Alaska <NA> 9 2 6 0 0 0 0 0 0 ## 5 Alaska Russia 0 0 6 0 0 0 0 0 0 ## 6 Alaska Ukraine 9 2 0 0 0 0 0 0 0 ## 7 Arizona <NA> 26 47 71 73 110 0 0 0 0 ## 8 Arizona Afghanistan 0 1 14 10 17 0 0 0 0 ## 9 Arizona Algeria 0 0 0 0 1 0 0 0 0 ## 10 Arizona Bhutan 2 3 8 2 0 0 0 0 0 ## # ℹ 490 more rows ## # ℹ 7 more variables: Jul <dbl>, Aug <dbl>, Sep <dbl>, Cases <dbl>, Inds <dbl>, ## # State <dbl>, U.S. <dbl> ``` ] --- .pull-left.width-30.smaller-code[ Drop the totals: ```r ref_clean <- fill(ref, Destination) %>% * filter(! is.na(Nationality)) ``` ] .pull-right.width-70.smaller-code[ ```r ref_clean ## # A tibble: 451 × 18 ## Destination Nationality Oct Nov Dec Jan Feb Mar Apr May Jun ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Alabama Dem. Rep. … 0 0 3 0 4 0 0 0 0 ## 2 Alabama El Salvador 0 0 0 1 0 0 0 0 0 ## 3 Alaska Russia 0 0 6 0 0 0 0 0 0 ## 4 Alaska Ukraine 9 2 0 0 0 0 0 0 0 ## 5 Arizona Afghanistan 0 1 14 10 17 0 0 0 0 ## 6 Arizona Algeria 0 0 0 0 1 0 0 0 0 ## 7 Arizona Bhutan 2 3 8 2 0 0 0 0 0 ## 8 Arizona Burma 6 3 7 1 9 0 0 0 0 ## 9 Arizona Burundi 0 0 0 5 7 0 0 0 0 ## 10 Arizona Colombia 0 0 2 0 0 0 0 0 0 ## # ℹ 441 more rows ## # ℹ 7 more variables: Jul <dbl>, Aug <dbl>, Sep <dbl>, Cases <dbl>, Inds <dbl>, ## # State <dbl>, U.S. <dbl> ``` ] --- .pull-left.width-30.smaller-code[ Pivot to a tidier form with one row per month: ```r ref_clean <- fill(ref, Destination) %>% filter(! is.na(Nationality)) %>% * pivot_longer( * Oct : Sep, * names_to = "month", * values_to = "count") ``` ] .pull-right.width-70.smaller-code[ ```r ref_clean ## # A tibble: 5,412 × 8 ## Destination Nationality Cases Inds State U.S. month count ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> ## 1 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Oct 0 ## 2 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Nov 0 ## 3 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Dec 3 ## 4 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Jan 0 ## 5 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Feb 4 ## 6 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Mar 0 ## 7 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Apr 0 ## 8 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 May 0 ## 9 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Jun 0 ## 10 Alabama Dem. Rep. Congo 2 7 0.875 0.000811 Jul 0 ## # ℹ 5,402 more rows ``` ] --- .pull-left[ Top 5 destination states: ```r group_by(ref_clean, Destination) %>% summarize(count = sum(count)) %>% ungroup() %>% slice_max(count, n = 5) ## # A tibble: 5 × 2 ## Destination count ## <chr> <dbl> ## 1 Ohio 750 ## 2 Texas 602 ## 3 Washington 562 ## 4 California 545 ## 5 New York 540 ``` ] -- .pull-right[ Top 5 origin nationalities: ```r group_by(ref_clean, Nationality) %>% summarize(count = sum(count)) %>% ungroup() %>% slice_max(count, n = 5) ## # A tibble: 5 × 2 ## Nationality count ## <chr> <dbl> ## 1 Dem. Rep. 
Congo 1968 ## 2 Bhutan 1884 ## 3 Burma 1240 ## 4 Ukraine 1021 ## 5 Eritrea 648 ``` ] --- ### Completing Missing Factor Level Combinations This example is based on a [blog post](https://win-vector.com/2017/02/21/the-zero-bug/) using data from the [Significant Earthquake Database](https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ngdc.mgg.hazards:G012153). <!--https://www.ngdc.noaa.gov/hazel/view/hazards/earthquake/search--> -- .pull-left.width-40.smaller-code[ The data used is a selection for North America and Hawaii. {{content}} ] -- Reading in the data: ```r if (! file.exists("earthquakes.tsv")) download.file("http://homepage.stat.uiowa.edu/~luke/data/earthquakes.tsv", "earthquakes.tsv") eq <- read.table("earthquakes.tsv", header = TRUE, sep = "\t") eq <- as_tibble(eq) ``` -- .pull-right.width-60.smaller-code[ ```r select(eq, STATE, everything()) ## # A tibble: 254 × 47 ## STATE I_D FLAG_TSUNAMI YEAR MONTH DAY HOUR MINUTE SECOND FOCAL_DEPTH ## <chr> <int> <chr> <int> <int> <int> <int> <int> <dbl> <int> ## 1 HI 6697 Tsu 1500 NA NA NA NA NA NA ## 2 MA 6013 Tsu 1668 4 13 NA NA NA NA ## 3 OR 9954 Tsu 1700 1 27 5 0 NA NA ## 4 MA 5828 Tsu 1755 11 18 9 11 35 NA ## 5 AK 5926 Tsu 1788 7 21 NA NA NA NA ## 6 AK 5927 Tsu 1788 8 6 NA NA NA NA ## 7 AK 7057 Tsu 1792 NA NA NA NA NA NA ## 8 CA 5857 Tsu 1806 3 25 8 NA NA NA ## 9 AR 1594 Tsu 1811 12 16 8 15 NA NA ## 10 AR 7058 Tsu 1811 12 16 14 15 NA NA ## # ℹ 244 more rows ## # ℹ 37 more variables: EQ_PRIMARY <dbl>, EQ_MAG_MW <dbl>, EQ_MAG_MS <dbl>, ## # EQ_MAG_MB <dbl>, EQ_MAG_ML <dbl>, EQ_MAG_MFA <dbl>, EQ_MAG_UNK <dbl>, ## # INTENSITY <int>, COUNTRY <chr>, LOCATION_NAME <chr>, LATITUDE <dbl>, ## # LONGITUDE <dbl>, REGION_CODE <int>, DEATHS <int>, DEATHS_DESCRIPTION <int>, ## # MISSING <lgl>, MISSING_DESCRIPTION <lgl>, INJURIES <int>, ## # INJURIES_DESCRIPTION <int>, DAMAGE_MILLIONS_DOLLARS <dbl>, … ``` ] --- .pull-left.code-80[ Some `STATE` entries are blank, but the location name identifies the state: ```r filter(eq, STATE == "") %>% select(STATE, LOCATION_NAME) ## # A tibble: 9 × 2 ## STATE LOCATION_NAME ## <chr> <chr> ## 1 "" MONTANA: CLARKSTON VALLEY ## 2 "" ILLINOIS: WEST SALEM ## 3 "" NEVADA: FALLON ## 4 "" CALIFORNIA: OCOTILLO ## 5 "" COLORADO: PAONIA ## 6 "" ALASKA: ALEUTIAN ISLANDS: FOX ISLANDS ## 7 "" OKLAHOMA: SPARKS ## 8 "" OKLAHOMA: SPARKS, PRAGUE ## 9 "" TEXAS: WEST ``` ] -- .pull-right[ One way to fix this: ```r badidx <- which(eq$STATE == "") badstate <- sub(":.*", "", eq$LOCATION_NAME[badidx]) eq$STATE[badidx] <- state.abb[match(tolower(badstate), tolower(state.name))] ``` ] --- .pull-left.code-80[ Add full state names as a factor: ```r states <- data.frame(STATE = state.abb, STATE_NAME = state.name) eq <- left_join(eq, states, "STATE") eq <- mutate(eq, STATE_NAME = factor(STATE_NAME, state.name)) ``` {{content}} ] -- Look at total number of earthquakes by state: ```r tbl <- count(eq, STATE_NAME) head(tbl, 6) ## # A tibble: 6 × 2 ## STATE_NAME n ## <fct> <int> ## 1 Alabama 1 ## 2 Alaska 78 ## 3 Arkansas 2 ## 4 California 100 ## 5 Colorado 2 ## 6 Connecticut 1 ``` -- .pull-right[ .hide-code[ ```r library(forcats) pb <- ggplot(tbl, aes(y = fct_rev(STATE_NAME), x = n)) + geom_col() + thm + theme(axis.title.y = element_blank()) ps <- ggplot(tbl, aes(y = fct_rev(STATE_NAME), x = n)) + geom_point() + geom_segment(aes(xend = 0, yend = fct_rev(STATE_NAME))) + thm + theme(axis.title.y = element_blank()) pb | ps ``` <img src="dplyr_files/figure-html/unnamed-chunk-163-1.png" style="display: block; margin: auto;" /> ] 
{{content}}
]

--

Many states, including Iowa, are missing.
{{content}}

--

Their counts should be zero.
{{content}}

--

This often impacts visualizations.

---

.pull-left[
`complete()` can be used to fill in the zero values.

```r
tbl <- count(eq, STATE_NAME) %>%
    complete(STATE_NAME, fill = list(n = 0))
tbl
## # A tibble: 50 × 2
## STATE_NAME n
## <fct> <int>
## 1 Alabama 1
## 2 Alaska 78
## 3 Arizona 0
## 4 Arkansas 2
## 5 California 100
## 6 Colorado 2
## 7 Connecticut 1
## 8 Delaware 0
## 9 Florida 0
## 10 Georgia 0
## # ℹ 40 more rows
```
]

--

.pull-right[
.hide-code[

```r
thm_ytxt_sm <- theme(axis.text.y = element_text(size = 10))
(pb %+% tbl + thm_ytxt_sm) | (ps %+% tbl + thm_ytxt_sm)
```

<img src="dplyr_files/figure-html/unnamed-chunk-165-1.png" style="display: block; margin: auto;" />
]
]

<!-- # nolint start Play around with `tee` ```r tee <- function(x, fun) { fun(x); x } tee(. %>% head(2) %>% print) %>% tee(browser) ## browserText gets the arg # nolint end ``` Mention `fill` and such. -->

<!-- options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows. options(tibble.width = Inf) will always print all columns, regardless of the width of the screen. ## **** try to read full table kills the process ## **** check if more reasonable on R-devel -->

---
layout: false

## Reading

Chapter [_Data transformation_](https://r4ds.had.co.nz/transform.html) of [R for Data Science](https://r4ds.had.co.nz/).

Chapter [_Tidy data_](https://r4ds.had.co.nz/tidy-data.html) of [R for Data Science](https://r4ds.had.co.nz/).

---
layout: false

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](../tutorials/dplyr.Rmd).

You can run the tutorial with

```r
STAT4580::runTutorial("dplyr")
```

You can install the current version of the `STAT4580` package with

```r
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

---
layout: true

## Exercises

---

1) To bring in `dplyr` and the `mpg` data, start by evaluating

```r
library(dplyr)
library(ggplot2)
```

The `select` function allows variables to be specified in a variety of ways. Which of the following does **not** produce a data frame with only the variables `manufacturer`, `model`, `cty`, `hwy`?

* a. `select(mpg, 1:2, 7:8)`
* b. `select(mpg, starts_with("m"), ends_with("y"))`
* c. `select(mpg, 1:2, cty : hwy)`
* d. `select(mpg, -(displ : drv), -(fl : class))`

---

2) Consider the code

```r
library(dplyr)
library(ggplot2)
filter(mpg, ---) %>% nrow()
```

Which of the replacements for `---` computes the number of Ford vehicles with more than 6 cylinders in the `mpg` table?

* a. `model == "ford", cyl <= 6`
* b. `manufacturer == "ford" | cyl > 6`
* c. `manufacturer == "ford", cyl > 6`
* d. `manufacturer == "ford", cylinders > 6`

---

3) In the 2013 NYC flights data provided by the `nycflights13` package, how many flights were there to Des Moines (FAA code DSM) from NYC in the first three months of 2013?

* a. 13
* b. 98
* c. 64
* d. 77

---

4) To bring in `dplyr` and the `mpg` data, start by evaluating

```r
library(dplyr)
library(ggplot2)
```

Which of the following sorts the rows of `mpg` by increasing `cyl` value and, within each `cyl` value, sorts the rows from largest to smallest `hwy` value?

* a. `arrange(mpg, desc(hwy), cyl)`
* b. `arrange(mpg, cyl, desc(hwy))`
* c. `arrange(mpg, desc(cyl), hwy)`
* d.
`arrange(mpg, cyl, hwy)`

---

5) To bring in `dplyr` and the `flights` data, start by evaluating

```r
library(dplyr)
library(nycflights13)
```

The `dep_time` variable in the `flights` data set from the `nycflights13` package is convenient to look at (529 means 5:29 AM and 2212 means 10:12 PM), but hard to compute with. Which of the following adds variables `dep_hour` and `dep_min` containing the hour and minute of the departure time?

* a. `mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_time %% 60)`
* b. `mutate(flights, dep_hour = dep_time %/% 100, dep_min = dep_time %% 100)`
* c. `mutate(flights, dep_hour = dep_time / 100, dep_min = dep_time - dep_hour)`
* d. `mutate(flights, dep_hour = dep_time %/% 60, dep_min = dep_hour %% 60)`

---

6) Using the `gapminder` data set, the following code computes population totals for each continent and each year:

```r
library(dplyr)
library(gapminder)
cpops <- group_by(gapminder, continent, year) %>%
    summarize(pop = sum(pop)) %>%
    ungroup()
```

To produce a plot to compare population growth over the years for the continents, it is useful to standardize the population data for each continent, for example by dividing the population values by the average population size for each continent. One way to do this is with a grouped `mutate`. The first line of your result should be

```r
head(cpops_std, 1)
## # A tibble: 1 × 4
## continent year pop stdpop
## <fct> <int> <dbl> <dbl>
## 1 Africa 1952 237640501 0.461
```

---

```r
library(ggplot2)
ggplot(cpops_std, aes(x = year, y = stdpop, color = continent)) +
    geom_line()
```

<img src="dplyr_files/figure-html/unnamed-chunk-172-1.png" style="display: block; margin: auto;" />

---

Which of the following produces the correct result?

* a. `cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`
* b. `cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`
* c. `cpops_std <- group_by(cpops, continent) %>% mutate(stdpop = pop / mean(pop)) %>% ungroup()`
* d. `cpops_std <- group_by(cpops, year) %>% mutate(stdpop = pop / mean(year)) %>% ungroup()`

---

7) Another approach to the previous exercise first creates a table of population averages with

```r
cpops_avg <- group_by(cpops, continent) %>%
    summarize(avg_pop = mean(pop))
```

Then use a left join to add `avg_pop` to the `cpops` table, followed by an ordinary mutate step:

```r
left_join(---) %>%
    mutate(stdpop = pop / avg_pop)
```

Which is the correct replacement for `---`?

* a. `cpops_avg, cpops, "continent"`
* b. `cpops, cpops_avg, "continent"`
* c. `cpops, cpops_avg, "year"`
* d. `cpops_avg, cpops, "year"`