1. Fleet City Gas Mileage

After reading the data into the variable newmpg, start by focusing on the non-electric vehicles for model years 2009 and later:

nm <- filter(newmpg, fuelType1 != "Electricity", year >= 2009)

Compute the average city mileage across models for each make and year:

nm <- summarize(group_by(nm, make, year), cty = mean(city08))

The result in nm is still grouped, so remove the grouping structure before identifying the top five for 2018:

nm <- ungroup(nm)
(tnm18 <- top_n(filter(nm, year == 2018), 5, cty))
## # A tibble: 5 x 3
##   make        year   cty
##   <chr>      <dbl> <dbl>
## 1 Honda       2018  27.1
## 2 Hyundai     2018  26.8
## 3 Mazda       2018  26.3
## 4 Mitsubishi  2018  26.4
## 5 Toyota      2018  25.2
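
In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which also returns the rows ordered from highest to lowest. A minimal sketch on made-up data (the tibble toy and its values are hypothetical, not taken from newmpg):

```r
library(dplyr)

# Hypothetical stand-in for the per-make averages; values are made up.
toy <- tibble(make = c("A", "B", "C", "D", "E", "F"),
              cty  = c(21.0, 27.5, 19.2, 25.1, 26.3, 24.8))

# Keep the three rows with the largest cty, highest first.
top3 <- slice_max(toy, cty, n = 3)
top3$make
```

Unlike the top_n() result above, which keeps the original row order, the rows here come back sorted by cty.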

The averages over the years for these makes can be extracted by filtering on make %in% tnm18$make:

tnm <- filter(nm, make %in% tnm18$make)

An alternative is to use semi_join:

tnm1 <- semi_join(nm, tnm18, "make")
identical(tnm, tnm1)
## [1] TRUE
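
To see why the two approaches agree, here is a small sketch on made-up data (avg and top are hypothetical tables). semi_join keeps the rows of its first argument that have a match in the second, adds no columns from the second table, and never duplicates rows:

```r
library(dplyr)

# Hypothetical per-make, per-year averages.
avg <- tibble(make = c("A", "A", "B", "B", "C", "C"),
              year = c(2017, 2018, 2017, 2018, 2017, 2018),
              cty  = c(20, 21, 25, 26, 18, 19))

# Hypothetical "top makes" table for 2018.
top <- tibble(make = c("A", "B"), year = 2018, cty = c(21, 26))

by_filter <- filter(avg, make %in% top$make)    # membership test on one column
by_join   <- semi_join(avg, top, by = "make")   # filtering join on the same key

# Both keep the four rows for makes A and B, with the columns of avg only.
by_join
```

The join form extends more naturally to keys made up of several columns, where %in% would need a combined key.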

A plot shows a steady increase in fleet average city gas mileage for all of these manufacturers over this period.

ggplot(tnm, aes(year, cty, color = make)) + geom_line()

2. Arrival Delays and Cancellations

After loading the nycflights13 data package, a useful first step is to add a canceled variable to the flights table:

flights <- mutate(flights, canceled = is.na(dep_time) & is.na(arr_time))

Compute the average arrival delay, proportion of canceled flights, and number of flights for each origin airport and scheduled departure hour:

fl <- summarize(group_by(flights, origin, hour),
                delay = mean(arr_delay, na.rm = TRUE),
                pcan = mean(canceled),
                n = n())
head(fl)
## # A tibble: 6 x 5
## # Groups:   origin [1]
##   origin  hour    delay    pcan     n
##   <chr>  <dbl>    <dbl>   <dbl> <int>
## 1 EWR        1  NaN     1           1
## 2 EWR        5   -5.75  0.00335   895
## 3 EWR        6   -3.27  0.0156  11133
## 4 EWR        7   -3.96  0.0136   8658
## 5 EWR        8    1.20  0.0195   9295
## 6 EWR        9   -0.373 0.0150   6084
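
Two details of this summary are worth spelling out: the mean of a logical vector is the proportion of TRUE values, and the mean of a vector with every value missing is NaN once na.rm = TRUE removes them all. A base R sketch with made-up vectors:

```r
# Four hypothetical flights, one of them canceled.
canceled <- c(FALSE, TRUE, FALSE, FALSE)
p_canceled <- mean(canceled)      # TRUE counts as 1, FALSE as 0, so this is 0.25

# A group whose only flight has a missing arrival delay.
only_delay <- c(NA_real_)
avg_delay <- mean(only_delay, na.rm = TRUE)   # nothing left to average: NaN
```

This is how a group can show delay = NaN together with pcan = 1.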

The first hour has only one flight, which is canceled:

filter(flights, hour == 1)
## # A tibble: 1 x 20
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     7    27       NA            106        NA       NA
## # … with 13 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, canceled <lgl>

This seems out of place relative to the many other flights in each of the hours of operation (no other flights are scheduled to depart between midnight and 5 AM). So set it aside for now:

fl <- filter(fl, hour != 1)

A plot of the average delays against departure hour:

ggplot(fl, aes(x = hour, y = delay, color = origin)) + geom_point()

Adding smooth fitted curves helps:

ggplot(fl, aes(x = hour, y = delay, color = origin)) +
    geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Delays increase over the day, tapering off a little in the later evening. Delays are similar across all three airports during the morning. For flights leaving in the late afternoon and early evening, flights from Newark experience greater delays and flights from JFK experience smaller delays. Early morning seems the best time to depart for an on-time arrival.

Cancellations also happen more often for flights leaving later than for flights leaving earlier. So again an early departure looks like a good idea.

ggplot(fl, aes(x = hour, y = pcan, color = origin)) +
    geom_point() + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To see whether the conclusions change for shorter or longer flights, add a classification of distance into short or long:

fl2 <- mutate(flights, type = ifelse(distance < 1000, "short", "long"))
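
A quick check of how this rule treats the boundary, using made-up distances (the vector distance below is hypothetical):

```r
# Hypothetical distances, including the cutoff value itself.
distance <- c(200, 999, 1000, 2475)

# Same rule as above: strictly below 1000 is "short", everything else "long".
type <- ifelse(distance < 1000, "short", "long")
type
```

A flight of exactly 1000 falls in the long group because the comparison is strict.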

Then redo the summaries:

fl2 <- summarize(group_by(fl2, origin, hour, type),
                 delay = mean(arr_delay, na.rm = TRUE),
                 pcan = mean(canceled),
                 n = n())
fl2 <- filter(fl2, hour != 1)

For the average delays on longer flights, all three airports follow a similar pattern of delays increasing throughout the day. For shorter flights, delays for flights out of Newark become substantially larger in the afternoon and evening than for the other two airports.

ggplot(fl2, aes(x = hour, y = delay, color = origin)) +
    geom_point() + geom_smooth(se = FALSE) + facet_wrap(~ type)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

For longer flights the proportion canceled varies little throughout the day. For shorter flights it increases slightly through most of the day.

ggplot(fl2, aes(x = hour, y = pcan, color = origin)) +
    geom_point() + geom_smooth(se = FALSE) + facet_wrap(~ type)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3. Departure Delays and Wind Speed

Box plots of wind speeds at the three NYC airports show a very high value for one measurement:

library(nycflights13)
ggplot(weather) + geom_boxplot(aes(y = wind_speed, x = origin))
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).


filter(weather, wind_speed > 1000)
## # A tibble: 1 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##   <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
## 1 EWR     2013     2    12     3  39.0  27.0  61.6      260      1048.
## # … with 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

This value is not plausible, so set it to NA:

weather <- mutate(weather,
                  wind_speed = ifelse(wind_speed > 1000, NA, wind_speed))
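
Measurements that are already missing pass through this recoding unchanged, because a comparison with NA is NA and ifelse returns NA wherever its test is NA. A base R sketch with made-up values:

```r
# Hypothetical wind speeds: one already missing, one implausible.
wind <- c(10.4, NA, 1048.4, 23.0)

# NA > 1000 is NA, so the missing value stays NA;
# the implausible value is replaced by NA; the rest are kept.
cleaned <- ifelse(wind > 1000, NA, wind)
cleaned
```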

Join the weather data to the flights data using origin and time_hour as the key. We don’t need the year, month, day, and hour, so drop them to simplify the result:

fl <- left_join(flights,
                select(weather, -(year : hour)),
                c("origin", "time_hour"))
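
A left join keeps every row of the first table and fills the added columns with NA where the second table has no match. A small sketch on made-up tables (fl_toy and wx_toy are hypothetical, and time_hour is a plain number here rather than a date-time):

```r
library(dplyr)

# Hypothetical flights: the LGA flight has no matching weather record.
fl_toy <- tibble(origin    = c("EWR", "JFK", "LGA"),
                 time_hour = c(1, 1, 2),
                 dep_delay = c(5, -2, 30))

# Hypothetical weather observations.
wx_toy <- tibble(origin     = c("EWR", "JFK"),
                 time_hour  = c(1, 1),
                 wind_speed = c(12, 8))

# All three flights are kept; wind_speed is NA for the unmatched one.
joined <- left_join(fl_toy, wx_toy, by = c("origin", "time_hour"))
joined
```

Flights without a weather record therefore remain in the joined table, with wind_speed missing.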

Check that this key is a good primary key for the weather table:

nrow(filter(count(weather, origin, time_hour), n > 1))
## [1] 0
(! anyNA(weather$origin)) & (! anyNA(weather$time_hour))
## [1] TRUE
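
The uniqueness part of this check can also be sketched in base R with anyDuplicated, which returns 0 when no row repeats and otherwise the index of the first repeated row. The data frames below are made up for illustration:

```r
# Hypothetical weather keys with no repeats.
wx_ok <- data.frame(origin = c("EWR", "EWR", "JFK"),
                    time_hour = c(1, 2, 1))

# Same keys plus a repeat of the second row.
wx_bad <- rbind(wx_ok, data.frame(origin = "EWR", time_hour = 2))

dup_ok  <- anyDuplicated(wx_ok[c("origin", "time_hour")])    # 0: key is unique
dup_bad <- anyDuplicated(wx_bad[c("origin", "time_hour")])   # 4: row 4 repeats row 2
```

A non-zero result would mean that a left join on this key could duplicate rows of the flights table.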

Compute the average departure delay and number of flights for each wind speed level:

flw <- summarize(group_by(fl, wind_speed),
                 delay = mean(dep_delay, na.rm = TRUE),
                 n = n())

A scatter plot of the average delay times for each wind speed:

ggplot(flw, aes(x = wind_speed, y = delay)) + geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

For wind speeds below 20 MPH the average delay increases nearly linearly with wind speed. For higher wind speeds the relation is more diffuse.

Using size to encode the number of departures at each wind speed:

ggplot(flw, aes(x = wind_speed, y = delay, size = n)) +
    geom_point() + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

The number of departures for wind speeds above 20 MPH is much lower than at lower wind speeds, so the averages are based on less data and are thus more variable. For lower wind speeds the relation between departure delay and wind speed seems quite solid:

ggplot(filter(flw, wind_speed <= 25),
       aes(x = wind_speed, y = delay, size = n)) +
    geom_point() + scale_size_area()