1. Abrasion Loss in Rubber Samples

library(lattice)
library(ggplot2)
library(dplyr)
theme_set(theme_minimal() +
          theme(text = element_text(size = 16)) +
          theme(panel.border = element_rect(color = "grey30", fill = NA)))
if (! file.exists("abrasion.csv"))
    download.file("http://www.stat.uiowa.edu/~luke/data/abrasion.csv",
                  "abrasion.csv")
abrasion <- read.csv("abrasion.csv")

A scatterplot matrix of the three variables:

splom(abrasion)

The plot of abrasion loss against hardness shows a weak decreasing relationship.

A coplot can be used to explore the joint relationship:

abrasion_hgroup <- mutate(abrasion, hgroup = cut_number(hardness, 4))
ggplot(abrasion_hgroup, aes(tensile.strength, abrasion.loss)) +
    geom_point() +
    facet_wrap(~ hgroup)

At each level of hardness, abrasion loss decreases as tensile strength increases.

Showing a muted version of the full data in the background provides context for comparing the facets:

ggplot(abrasion_hgroup, aes(tensile.strength, abrasion.loss)) +
    geom_point(data = abrasion, color = "grey") +
    geom_point(size = 2) +
    facet_wrap(~ hgroup)

As hardness increases, the abrasion loss/tensile strength relation moves lower.

2. Arrival and Departure Delays

Start by taking a 10% sample to make creating useful scatterplots easier:

library(nycflights13)
fl_30K <- sample_frac(flights, 0.1)

A plot of the full data shows that the bulk of flights have short arrival and departure delays. For those with longer delays, the arrival and departure delays are close to the same. This can be made easier to see by adding a 45 degree line:

ggplot(fl_30K, aes(dep_delay, arr_delay)) +
    geom_point(size = 0.5) +
    geom_abline(intercept = 0, slope = 1, color = "red") # not really needed
## Warning: Removed 938 rows containing missing values or values outside the scale range
## (`geom_point()`).

About 85% of the sample with non-missing departure delays had departure delays of at most 30 minutes.

Since delays are only reported to the nearest minute and the data set is still large there is a lot of heaping on the integer values, in particular for the departure delays. Alpha blending helps some, but adding jitter in the horizontal direction is also helpful.

p <- ggplot(filter(fl_30K, dep_delay <= 30),
            aes(dep_delay, arr_delay)) +
    geom_point(size = 0.5, alpha = 0.05, na.rm = TRUE,
               position = position_jitter(height = 0, width = 0.5))
p

At the right alpha level a step down at zero is visible. A histogram of the departure delays in the a five minute interval of zero suggests that there might be some generous rounding down for flights with a one minute delay:

ggplot(filter(fl_30K, dep_delay <= 5, dep_delay > -5), aes(dep_delay)) +
    geom_histogram(binwidth = 1, fill = "grey", color = "black")

There is still a lot of over-plotting, even with the reduced sample. Adding density estimates can help make the distribution of the delay times easier to see.

p1 <- p + geom_density2d(bins = 6, color = "red", na.rm = TRUE)
ggExtra::ggMarginal(p1)

The additions show that most departures and arrivals are not delayed, and that for the flights departing close to on time there is very little relationship between the arrival and departure times; they seem to be nearly independent.

Both the marginal density estimate for the departure delays and the joint density estimates suggest some slightly unusual behavior at the zero point.

3. Wind Speed, Time of Day, and Departure Delays

Join the weather data to the flights data using origin and time_hour as the key. Bring in only the variables we need:

fl <- left_join(flights, select(weather, origin, time_hour, wind_speed),
                c("origin", "time_hour"))

Check that this key is a good primary key for the weather table:

nrow(filter(count(weather, origin, time_hour), n > 1))
## [1] 0
(! anyNA(weather$origin)) & (! anyNA(weather$time_hour))
## [1] TRUE

Include only summer months, and drop rows with missing values or where the wind speed is above 30 mph:

library(tidyr)
fls <- filter(fl, month %in% 6:8) |>
    drop_na() |>
    filter(wind_speed <= 30)

A plot of the proportion of departures that are delayed at each wind speed:

group_by(fls, wind_speed) |>
    summarize(p_delay = mean(dep_delay > 0), n = n()) |>
    ungroup() |>
    ggplot(aes(wind_speed, p_delay)) +
    geom_point() +
    geom_smooth(aes(weight = n))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There appears to be an increase in the proportion of delayed departures associated with an increase in wind speed.

However, a coplot of the proportion of delayed departures against departure hour conditioned on wind speed shows that there is also a positive association with the departure hour, and that association does not vary much with wind speed:

group_by(fls, wind_speed, hour) |>
    summarize(p_delay = mean(dep_delay > 0), n = n()) |>
    ungroup() |>
    ggplot(aes(hour, p_delay)) +
    geom_point() +
    geom_smooth(aes(weight = n)) +
    facet_wrap(~ cut_number(wind_speed, 6))
## `summarise()` has grouped output by 'wind_speed'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

A reason may be that wind speed increases throughout the day until around 5 PM (17:00 hours):

ggplot(fls, aes(hour, wind_speed)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The coplot shows that the proportion of delayed departures increases roughly linearly until around 5 PM for all wind speed ranges. A linear regression of the proportion of delayed departures against hour and wind_speed for departures prior to 5 PM shows a very small coefficient for wind speed once time of day is accounted for:

fls_prop <- group_by(fls, wind_speed, hour) |>
    summarize(p_delay = mean(dep_delay > 0), n = n()) |>
    ungroup() |>
    filter(hour <= 17)
## `summarise()` has grouped output by 'wind_speed'. You can override using the
## `.groups` argument.
lm(p_delay ~ hour + wind_speed, weights = n, data = fls_prop)
## 
## Call:
## lm(formula = p_delay ~ hour + wind_speed, data = fls_prop, weights = n)
## 
## Coefficients:
## (Intercept)         hour   wind_speed  
##  -0.0259521    0.0379807   -0.0002924
---
title: "Assignment 8 Notes"
output:
  html_document:
    toc: yes
    code_download: true
    code_folding: "hide"
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

## 1. Abrasion Loss in Rubber Samples

```{r, message = FALSE}
library(lattice)
library(ggplot2)
library(dplyr)
theme_set(theme_minimal() +
          theme(text = element_text(size = 16)) +
          theme(panel.border = element_rect(color = "grey30", fill = NA)))
if (! file.exists("abrasion.csv"))
    download.file("http://www.stat.uiowa.edu/~luke/data/abrasion.csv",
                  "abrasion.csv")
abrasion <- read.csv("abrasion.csv")
```

A scatterplot matrix of the three variables:

```{r, fig.width = 7, fig.height = 7}
splom(abrasion)
```

The plot of abrasion loss against hardness shows a weak decreasing
relationship.

A coplot can be used to explore the joint relationship:

```{r, fig.width = 7, fig.height = 7}
abrasion_hgroup <- mutate(abrasion, hgroup = cut_number(hardness, 4))
ggplot(abrasion_hgroup, aes(tensile.strength, abrasion.loss)) +
    geom_point() +
    facet_wrap(~ hgroup)
```

At each level of hardness, abrasion loss decreases as tensile strength
increases.

Showing a muted version of the full data in the background provides
context for comparing the facets:

```{r, fig.width = 7, fig.height = 7}
ggplot(abrasion_hgroup, aes(tensile.strength, abrasion.loss)) +
    geom_point(data = abrasion, color = "grey") +
    geom_point(size = 2) +
    facet_wrap(~ hgroup)
```

As hardness increases, the abrasion loss/tensile strength relation
moves lower.


## 2. Arrival and Departure Delays

Start by taking a 10% sample to make creating useful scatterplots easier:

```{r, class.source = "fold-show"}
library(nycflights13)
fl_30K <- sample_frac(flights, 0.1)
```

A plot of the full data shows that the bulk of flights have short
arrival and departure delays. For those with longer delays, the
arrival and departure delays are close to the same. This can be made
easier to see by adding a 45 degree line:
  
```{r}
ggplot(fl_30K, aes(dep_delay, arr_delay)) +
    geom_point(size = 0.5) +
    geom_abline(intercept = 0, slope = 1, color = "red") # not really needed
```

About
`r round(100 * mean(fl_30K$dep_delay <= 30, na.rm = TRUE))`%
of the sample with non-missing departure delays had departure delays
of at most 30 minutes.

Since delays are only reported to the nearest minute and the data set
is still large there is a lot of heaping on the integer values, in
particular for the departure delays. Alpha blending helps some, but
adding jitter in the horizontal direction is also helpful.

```{r}
p <- ggplot(filter(fl_30K, dep_delay <= 30),
            aes(dep_delay, arr_delay)) +
    geom_point(size = 0.5, alpha = 0.05, na.rm = TRUE,
               position = position_jitter(height = 0, width = 0.5))
p
```

At the right alpha level a step down at zero is visible. A histogram
of the departure delays in the a five minute interval of zero suggests
that there might be some generous rounding down for flights with a one
minute delay:

```{r}
ggplot(filter(fl_30K, dep_delay <= 5, dep_delay > -5), aes(dep_delay)) +
    geom_histogram(binwidth = 1, fill = "grey", color = "black")
```

There is still a lot of over-plotting, even with the reduced sample.
Adding density estimates can help make the distribution of the delay
times easier to see.

```{r}
p1 <- p + geom_density2d(bins = 6, color = "red", na.rm = TRUE)
ggExtra::ggMarginal(p1)
```

The additions show that most departures and arrivals are not delayed,
and that for the flights departing close to on time there is very
little relationship between the arrival and departure times; they seem
to be nearly independent.

Both the marginal density estimate for the departure delays and the
joint density estimates suggest some slightly unusual behavior at the
zero point.


## 3. Wind Speed, Time of Day, and Departure Delays

Join the `weather` data to the `flights` data using `origin` and
`time_hour` as the key. Bring in only the variables we need:

```{r, class.source = "fold-show"}
fl <- left_join(flights, select(weather, origin, time_hour, wind_speed),
                c("origin", "time_hour"))
```

Check that this key is a good primary key for the weather table:

```{r, class.source = "fold-show"}
nrow(filter(count(weather, origin, time_hour), n > 1))
(! anyNA(weather$origin)) & (! anyNA(weather$time_hour))
```

Include only summer months, and drop rows with missing values or where
the wind speed is above 30 mph:

```{r, class.source = "fold-show"}
library(tidyr)
fls <- filter(fl, month %in% 6:8) |>
    drop_na() |>
    filter(wind_speed <= 30)
```

A plot of the proportion of departures that are delayed at each wind
speed:

```{r}
group_by(fls, wind_speed) |>
    summarize(p_delay = mean(dep_delay > 0), n = n()) |>
    ungroup() |>
    ggplot(aes(wind_speed, p_delay)) +
    geom_point() +
    geom_smooth(aes(weight = n))
```

There appears to be an increase in the proportion of delayed
departures associated with an increase in wind speed.

However, a coplot of the proportion of delayed departures against
departure hour conditioned on wind speed shows that there is also a
positive association with the departure hour, and that association
does not vary much with wind speed:

```{r}
group_by(fls, wind_speed, hour) |>
    summarize(p_delay = mean(dep_delay > 0), n = n()) |>
    ungroup() |>
    ggplot(aes(hour, p_delay)) +
    geom_point() +
    geom_smooth(aes(weight = n)) +
    facet_wrap(~ cut_number(wind_speed, 6))
```

A reason may be that wind speed increases throughout the day until
around 5 PM (17:00 hours):

```{r}
ggplot(fls, aes(hour, wind_speed)) + geom_smooth()
```

The coplot shows that the proportion of delayed departures
increases roughly linearly until around 5 PM for all wind speed
ranges. A linear regression of the proportion of delayed departures
against `hour` and `wind_speed` for departures prior to 5 PM shows a
very small coefficient for wind speed once time of day is accounted
for:

```{r, class.source = "fold-show"}
fls_prop <- group_by(fls, wind_speed, hour) |>
    summarize(p_delay = mean(dep_delay > 0), n = n()) |>
    ungroup() |>
    filter(hour <= 17)
lm(p_delay ~ hour + wind_speed, weights = n, data = fls_prop)
```
