1. Cancellations and Destination Location

It is useful to add a canceled variable to the flights table, assuming that canceled flights are those with both dep_time and arr_time missing:

library(dplyr)
library(nycflights13)
flights <- mutate(flights, canceled = is.na(dep_time) & is.na(arr_time))
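
As a minimal sketch of how this definition behaves, the following applies the same test to a small made-up data frame (the `toy` values are hypothetical, not taken from `flights`). Only the row with both times missing is flagged; a row with a departure time but no arrival time, for instance, is not counted as canceled.

```r
# Hypothetical toy data for illustration only.
toy <- data.frame(
    dep_time = c(517, NA, 1230, NA),
    arr_time = c(830, NA, NA, 905)
)

# A row counts as canceled only when BOTH times are missing.
toy$canceled <- is.na(toy$dep_time) & is.na(toy$arr_time)
toy$canceled
```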

For each destination, using only flights from the first three months of the year, compute the number of flights, percent canceled, and average arrival delay:

fl3 <- filter(flights, month <= 3)
fl3 <- summarize(group_by(fl3, dest),
    n = n(),
    pcan = 100 * mean(canceled),
    delay = mean(arr_delay, na.rm = TRUE))
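
The two summaries above rely on standard R behavior: the mean of a logical vector is the proportion of `TRUE` values, and `na.rm = TRUE` drops missing values before averaging. A small sketch with made-up values (the vectors below are hypothetical):

```r
# Hypothetical values for four flights to one destination.
canceled  <- c(FALSE, TRUE, FALSE, FALSE)
arr_delay <- c(12, NA, -5, 20)   # the canceled flight has no arrival delay

# TRUE counts as 1 and FALSE as 0, so the mean is the fraction canceled.
pcan <- 100 * mean(canceled)

# Without na.rm = TRUE the result would be NA because of the missing delay.
delay <- mean(arr_delay, na.rm = TRUE)

c(pcan = pcan, delay = delay)
```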

To show the data on a map, add location information by joining with data from the airports table:

fl3 <- left_join(fl3, select(airports, faa, lat, lon, alt),
                 by = c("dest" = "faa"))
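
A left join keeps every row of the left table, so a destination with no match in the location table ends up with missing coordinates; `geom_point` drops such rows with a warning because they cannot be placed on a map. The sketch below uses small made-up tables (`dests`, `locs`, and the codes in them are hypothetical) and `anti_join` to list the unmatched destinations.

```r
library(dplyr)

# Hypothetical toy tables; "CCC" has no entry in the location table.
dests <- data.frame(dest = c("AAA", "BBB", "CCC"), n = c(10, 20, 30))
locs  <- data.frame(faa = c("AAA", "BBB"),
                    lat = c(40.1, 33.5),
                    lon = c(-75.2, -84.4))

# All three destinations are kept; lat and lon are NA for "CCC".
joined <- left_join(dests, locs, by = c("dest" = "faa"))

# Destinations that have no location information.
missing_loc <- anti_join(dests, locs, by = c("dest" = "faa"))
missing_loc
```

Applying the same `anti_join` to `fl3` and `airports` before joining would be one way to check which destinations lack coordinates.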

A map showing the cancellation percentages for the top 50 destinations:

library(ggplot2)
pm <- ggplot(top_n(fl3, 50, n), aes(x = lon, y = lat)) + 
    borders("state") + coord_map()
pm + geom_point(aes(size = pcan)) + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

Using alpha blending can help with the over-plotting along the east coast:

pm + geom_point(aes(size = pcan), alpha = 0.3) + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

Cancellation percentages are higher for airports closer to New York and for airports likely to be experiencing weather conditions similar to New York's.

Whether the average delay is 20 minutes or more can be encoded using color or shape:

pm + geom_point(aes(size = pcan, color = delay >= 20)) + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

pm + geom_point(aes(size = pcan, shape = delay >= 20)) + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

For a 15-minute cutoff there are a few more high-delay destinations:

pm + geom_point(aes(size = pcan, color = delay >= 15)) + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

pm + geom_point(aes(size = pcan, shape = delay >= 15)) + scale_size_area()
## Warning: Removed 1 rows containing missing values (geom_point).

The size and shape channels interfere with each other; color and size do not. Picking out the rarer shapes is also harder than spotting the different colors: color achieves better pop-out.

2. Wind Speed, Time of Day, and Departure Delays

Average delays increase approximately linearly until early evening:

library(nycflights13)
library(dplyr)
library(ggplot2)
weather <- mutate(weather,
                  wind_speed = ifelse(wind_speed > 1000, NA, wind_speed))
flights <- left_join(flights,
                     select(weather, -(year : hour)),
                     c("origin", "time_hour"))
fl <- summarize(group_by(flights, wind_speed, hour),
                avg_delay = mean(dep_delay, na.rm = TRUE),
                n = n())

p <- ggplot(fl, aes(x = hour, y = avg_delay)) +
    geom_point(aes(size = n)) + scale_size_area()
p
## Warning: Removed 2 rows containing missing values (geom_point).
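
The first step in the code above recodes implausibly large wind speeds as missing. A minimal sketch of that recoding on made-up values (the vector `ws` is hypothetical): values above the threshold become `NA`, and values that were already missing stay missing, because the comparison `NA > 1000` is itself `NA`.

```r
# Hypothetical wind speeds in mph, including one implausible value and one NA.
ws <- c(10.4, 1048.4, NA, 0)

# Values above 1000 become NA; everything else is kept unchanged.
cleaned <- ifelse(ws > 1000, NA, ws)
cleaned
```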

The pattern is roughly the same at all wind speeds:

p + facet_wrap(~ cut_number(wind_speed, 5))
## Warning: Removed 2 rows containing missing values (geom_point).
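
The facets here come from `cut_number`, which splits a numeric variable into intervals holding approximately equal numbers of observations, in contrast to equal-width bins. A small sketch on made-up values (the vector `x` is hypothetical):

```r
library(ggplot2)

# Ten hypothetical values split into five groups of equal size.
x <- 1:10
groups <- cut_number(x, 5)

# Each interval contains the same number of observations.
counts <- table(groups)
counts
```

Because the groups are balanced by count rather than by width, the intervals for a skewed variable such as wind speed can differ considerably in width.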

Conditioning on time of day shows that average delay varies little across moderate wind speeds once time of day is accounted for:

ggplot(filter(fl, wind_speed <= 25),
       aes(x = wind_speed, y = avg_delay)) +
    geom_point(aes(size = n)) + scale_size_area() +
    facet_wrap(~ cut_number(hour, 6))
## Warning: Removed 1 rows containing missing values (geom_point).

The association between departure delay and wind speed seen previously can be attributed to an association between wind speed and time of day:

w <- summarize(group_by(weather, hour),
               avg_wind_speed = mean(wind_speed, na.rm = TRUE))
ggplot(w, aes(x = hour, y = avg_wind_speed)) + geom_point()
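
The reasoning here is that time of day acts as a common driver of both variables. The sketch below illustrates the idea with simulated data only, not the flights data: all coefficients, noise levels, and the linear dependence on hour are assumptions chosen for illustration. Delay is generated from hour alone, yet it is clearly correlated with wind speed until hour is adjusted for.

```r
set.seed(1)
n <- 2000

# Simulated data: both wind speed and delay depend on hour, not on each other.
hour  <- sample(5:23, n, replace = TRUE)
wind  <- 5 + 0.5 * hour + rnorm(n, sd = 2)
delay <- 2 * hour + rnorm(n, sd = 5)

# Ignoring hour, wind speed and delay appear strongly associated.
marginal <- cor(wind, delay)

# After removing the linear effect of hour from both, little association remains.
adjusted <- cor(resid(lm(wind ~ hour)), resid(lm(delay ~ hour)))

c(marginal = marginal, adjusted = adjusted)
```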