---
title: "Assignment 7 Notes"
output:
  html_document:
    toc: yes
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

## 1. Cancellations and Destination Location

It is useful to add a `canceled` variable to the `flights` table, assuming that canceled flights are those with both `dep_time` and `arr_time` missing:

```{r, message = FALSE}
library(dplyr)
library(nycflights13)
flights <- mutate(flights, canceled = is.na(dep_time) & is.na(arr_time))
```

For each destination and the first three months, compute the number of flights, percent canceled, and average arrival delay:

```{r}
fl3 <- filter(flights, month <= 3)
fl3 <- summarize(group_by(fl3, dest),
                 n = n(),
                 pcan = 100 * mean(canceled),
                 delay = mean(arr_delay, na.rm = TRUE))
```

To show the data on a map, add location information by joining with data from the `airports` table:

```{r}
fl3 <- left_join(fl3,
                 select(airports, faa, lat, lon, alt),
                 c("dest" = "faa"))
```

A map showing the cancellation percentages for the top 50 destinations:

```{r}
library(ggplot2)
pm <- ggplot(top_n(fl3, 50, n), aes(x = lon, y = lat)) +
    borders("state") +
    coord_map()
pm + geom_point(aes(size = pcan)) + scale_size_area()
```

Using alpha blending can help with the over-plotting along the east coast:

```{r}
pm + geom_point(aes(size = pcan), alpha = 0.3) + scale_size_area()
```

Cancellation percentages are higher for closer airports and airports likely to be experiencing similar weather conditions.

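
One rough way to check the observation about closer airports numerically is to add the mean flight distance to the per-destination summary and look at its rank correlation with the cancellation percentage. This is only a sketch: the choice of the Spearman correlation and the names `by_dest`, `top50`, and `r` are assumptions made here for illustration, and the `distance` column of `flights` is at best a proxy for how similar the weather at a destination is.

```r
library(dplyr)
library(nycflights13)

# Per-destination summary for January through March, keeping the
# mean flight distance alongside the cancellation percentage.
by_dest <- flights %>%
  filter(month <= 3) %>%
  mutate(canceled = is.na(dep_time) & is.na(arr_time)) %>%
  group_by(dest) %>%
  summarize(n = n(),
            pcan = 100 * mean(canceled),
            dist = mean(distance))

# Restrict to the 50 busiest destinations, as in the maps above.
top50 <- top_n(by_dest, 50, n)

# Spearman rank correlation between distance and percent canceled
# (an assumed choice of summary; a negative value would be consistent
# with higher cancellation percentages at closer airports).
r <- cor(top50$dist, top50$pcan, method = "spearman")
r
```

A single correlation cannot separate distance from regional weather, so the maps remain the better tool for seeing the geographic pattern.
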
Whether the average delay is 20 minutes or more can be encoded using color or shape:

```{r}
pm + geom_point(aes(size = pcan, color = delay >= 20)) + scale_size_area()
pm + geom_point(aes(size = pcan, shape = delay >= 20)) + scale_size_area()
```

For a 15-minute cutoff there are a few more high-delay destinations:

```{r}
pm + geom_point(aes(size = pcan, color = delay >= 15)) + scale_size_area()
pm + geom_point(aes(size = pcan, shape = delay >= 15)) + scale_size_area()
```

The size and shape channels interfere with each other; color and size do not. Picking out the rarer shapes is also harder than spotting the different colors: color achieves better pop-out.

## 2. Wind Speed, Time of Day, and Departure Delays

Average departure delays increase approximately linearly with the hour of the day until early evening:

```{r, message = FALSE}
library(nycflights13)
library(dplyr)
library(ggplot2)
# Treat implausibly large wind speeds (over 1000) as missing
weather <- mutate(weather,
                  wind_speed = ifelse(wind_speed > 1000, NA, wind_speed))
# Attach the weather variables to each flight by origin and hour
flights <- left_join(flights,
                     select(weather, -(year : hour)),
                     c("origin", "time_hour"))
# Average departure delay for each wind speed and hour combination
fl <- summarize(group_by(flights, wind_speed, hour),
                avg_delay = mean(dep_delay, na.rm = TRUE),
                n = n())
p <- ggplot(fl, aes(x = hour, y = avg_delay)) +
    geom_point(aes(size = n)) +
    scale_size_area()
p
```

The pattern is roughly the same at all wind speeds:

```{r}
p + facet_wrap(~ cut_number(wind_speed, 5))
```

Conditioning on time of day shows that, at moderate wind speeds, average delay varies little with wind speed once time of day is accounted for:

```{r}
ggplot(filter(fl, wind_speed <= 25), aes(x = wind_speed, y = avg_delay)) +
    geom_point(aes(size = n)) +
    scale_size_area() +
    facet_wrap(~ cut_number(hour, 6))
```

The association between departure delay and wind speed seen previously can be attributed to an association between wind speed and time of day:

```{r}
w <- summarize(group_by(weather, hour),
               avg_wind_speed = mean(wind_speed, na.rm = TRUE))
ggplot(w, aes(x = hour, y = avg_wind_speed)) + geom_point()
```
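
The confounding argument above can also be sketched numerically by comparing the wind speed coefficient from a regression of departure delay on wind speed alone with the coefficient from a regression that also includes the hour of day. This is not part of the original analysis: the use of `lm`, the treatment of `hour` as a factor, and the names `wx`, `fw`, `marginal`, and `adjusted` are all assumptions made for illustration.

```r
library(dplyr)
library(nycflights13)

# Hourly wind speed by origin, with implausibly large values
# (over 1000) treated as missing, as in the chunk above.
wx <- weather %>%
  mutate(wind_speed = ifelse(wind_speed > 1000, NA, wind_speed)) %>%
  select(origin, time_hour, wind_speed)

# One row per flight with its scheduled hour, departure delay,
# and the wind speed at its origin for that hour.
fw <- flights %>%
  select(origin, time_hour, hour, dep_delay) %>%
  left_join(wx, c("origin", "time_hour"))

# Wind speed alone versus wind speed adjusted for hour of day.
# lm() drops rows with missing values by default.
marginal <- lm(dep_delay ~ wind_speed, data = fw)
adjusted <- lm(dep_delay ~ wind_speed + factor(hour), data = fw)

coef(marginal)["wind_speed"]
coef(adjusted)["wind_speed"]
```

If the association is mostly due to time of day, the adjusted coefficient should be noticeably closer to zero than the marginal one. The faceted plots above give a more detailed view than a single linear coefficient can.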