## General Issues

• Make sure you name your files as requested, including matching the specified use of upper and lower case. This matters on file systems that are case-sensitive.

• Make sure to commit your work to your local repository and push your commits to GitLab. We can only see what is on GitLab, not what is on your computer. You can check what we see by going to the GitLab web interface.

• Include your name and the date in the header of your .Rmd file using author: and date: tags.

• Your HTML file should be a report of your findings.

• Any graph you show should be discussed in your narrative.

• Any code you show should be discussed in your narrative.

• If you do not need to discuss a piece of code in the narrative, set the chunk option echo = FALSE to avoid showing it.
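As a rough sketch of the last point: code is hidden for a single chunk by adding echo = FALSE to that chunk's header, for example {r, echo = FALSE}. If most chunks should be hidden, one option (a workflow assumption, not something the assignment requires) is to make hiding the default in a setup chunk and switch back to echo = TRUE only for the chunks discussed in the narrative:

```r
# Setup chunk: hide code by default (requires the knitr package).
# Individual chunks can override this with echo = TRUE in their header.
knitr::opts_chunk$set(echo = FALSE)
```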

## 1. New York City Airport Names

The names and airport codes for the three New York City airports in the nycflights13 data are shown in the following table:

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(nycflights13)
nyc_faa <- unique(flights$origin)
tbl <- select(airports, faa, name) %>% filter(faa %in% nyc_faa)
names(tbl) <- c("Code", "Name")
kbl <- knitr::kable(tbl, format = "html")
kableExtra::kable_styling(kbl, full_width = FALSE)
| Code | Name                |
|------|---------------------|
| EWR  | Newark Liberty Intl |
| JFK  | John F Kennedy Intl |
| LGA  | La Guardia          |

## 2. Average and Median Departure Delays

tbl <-
    group_by(flights, origin) %>%
    summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
              med_dep_delay = median(dep_delay, na.rm = TRUE)) %>%
    ungroup()
names(tbl) <- c("Origin", "Average Delay", "Median Delay")
kbl <- knitr::kable(tbl, format = "html", digits = 1)
kableExtra::kable_styling(kbl, full_width = FALSE)
| Origin | Average Delay | Median Delay |
|--------|---------------|--------------|
| EWR    | 15.1          | -1           |
| JFK    | 12.1          | -1           |
| LGA    | 10.3          | -3           |

Airlines work very hard to have flights leave on time. In fact, the majority of flights at all three airports left early, so the median delays are negative. But the distributions of delay times are heavily skewed to the right, so the average departure delays are considerably larger.
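The gap between the two summaries can be reproduced with a small example. The delay values below are made up for illustration and are not taken from the nycflights13 data; they only mimic the pattern of many slightly early departures plus one long delay.

```r
# Made-up departure delays in minutes (negative = left early);
# illustration only, not taken from nycflights13.
delays <- c(-5, -3, -2, -1, 0, 2, 120)

median(delays)      # middle value; unaffected by the one long delay
mean(delays)        # pulled upward by the one long delay
mean(delays < 0)    # proportion of flights that left early
```

Here the median is -1 while the mean is about 15.9, even though more than half of the values are negative.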

## 3. Air Time Distributions

Four possible visualizations without much fine tuning:

library(ggplot2)
library(ggridges)
library(patchwork)
thm <- theme_minimal() + theme(text = element_text(size = 16))
p0 <- ggplot(flights, aes(x = air_time)) + thm
p1 <- p0 +
    geom_density(aes(color = origin), bw = 50) +
    ggtitle("Color")
p2 <- p0 +
    geom_density(aes(fill = origin), alpha = 0.4, bw = 50) +
    ggtitle("Fill with Alpha Blending")
p3 <- p0 +
    geom_density(bw = 50) + facet_wrap(~ origin, ncol = 1) +
    ggtitle("Facets")
p4 <- p0 +
    geom_density_ridges(aes(y = origin, height = after_stat(density)),
                        stat = "density", bw = 50) +
    scale_y_discrete(limits = c("LGA", "JFK", "EWR")) +
    ggtitle("Ridgeline")
(p1 | p2) / (p3 | p4)  # + plot_layout(guides = "collect")
## Warning: Removed 9430 rows containing non-finite values (stat_density).
## Removed 9430 rows containing non-finite values (stat_density).
## Removed 9430 rows containing non-finite values (stat_density).
## Removed 9430 rows containing non-finite values (stat_density).

Neither of the two single-panel views works particularly well in this case. For the plot using fill with alpha blending, the overlap is too large to allow the densities to be distinguished easily. The plot mapping origin to color works somewhat better, but the lines are still hard to follow. The faceted plot and the ridgeline plot are visually quite similar, and both work fairly well.

Flights out of La Guardia are mostly shorter, with very few taking over 300 minutes. Somewhat more long flights originate from Newark, and considerably more long flights originate from JFK.
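One way to back up the comparison of long flights with numbers is to compute, for each origin, the share of flights whose air time exceeds a cutoff. The sketch below uses only base R; the helper name share_long and the tiny data frame are hypothetical and only illustrate the calculation. The 300-minute cutoff is taken from the paragraph above.

```r
# Share of flights per origin with air time above a cutoff (base R only).
# share_long is a hypothetical helper written for this illustration.
share_long <- function(df, cutoff = 300) {
    ok <- !is.na(df$air_time)   # drop rows with missing air time
    tapply(df$air_time[ok] > cutoff, df$origin[ok], mean)
}

# Tiny made-up data frame, for illustration only
toy <- data.frame(
    origin   = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
    air_time = c(120, 330, 350, 340, 90, NA)
)
res <- share_long(toy)
res
```

With the packages used in this report loaded, share_long(flights) should apply the same calculation to the real data.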

## 4. Highway Fuel Economy Over the Years, Revisited

library(readr)
if (! file.exists("vehicles.csv.zip"))
    download.file("…",  # download call reconstructed; source URL missing from this copy
                  "vehicles.csv.zip")
newmpg <- read_csv("vehicles.csv.zip", guess_max = 100000)
newmpg3 <- filter(newmpg, year <= 2019, year >= 2000) %>%
    mutate(year = factor(year))

All four approaches, with only minimal tuning for the three new ones:

alpha <- 0.2
size <- 0.3
p1 <- ggplot(newmpg3, aes(x = highway08, y = year)) +
    geom_point(position = "jitter", size = size, alpha = alpha) +
    ylab(NULL) +
    thm
p2 <- ggplot(newmpg3, aes(y = highway08, x = year)) +
    geom_boxplot() +
    thm +
    coord_flip()
p3 <- ggplot(newmpg3, aes(y = highway08, x = year)) +
    geom_violin() +
    thm +
    coord_flip()
p4 <- ggplot(newmpg3, aes(x = highway08, y = year)) +
    geom_density_ridges() +
    thm

(p1 | p2) / (p3 | p4)
## Picking joint bandwidth of 1.18

The three new approaches do a better job of conveying the increase in fuel economy for the bulk of the vehicles. Both violin and ridgeline plots show the slight bimodal structure in the early years; box plots cannot reflect this. Box plots put heavy emphasis on the electric vehicles, which appear as outlier points; this can be reduced by decreasing the size of the outlier points. Strip plots also allow the emerging electric vehicles to be seen. Violin plots and ridgeline plots do not show them very well, because electric vehicles are still too small a proportion of the total. The fact that violin plots stop at the maximum helps somewhat. The current geom_density_ridges implementation does not do this but could in principle be modified to do so.
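A minimal sketch of reducing the emphasis box plots place on outlying points is shown below. It uses the outlier.size and outlier.alpha arguments of geom_boxplot. To keep the example self-contained, ggplot2's built-in mpg data stands in for the vehicles data; that substitution, the object name p_box, and the particular size and alpha values are assumptions made for illustration.

```r
library(ggplot2)

# ggplot2's built-in mpg data stands in for the EPA vehicles data here.
# Smaller, semi-transparent outlier points draw less attention.
p_box <- ggplot(mpg, aes(x = factor(year), y = hwy)) +
    geom_boxplot(outlier.size = 0.3, outlier.alpha = 0.2) +
    xlab(NULL) +
    coord_flip()
p_box
```

The same two arguments could be added to the geom_boxplot() call for p2 above.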