---
title: "Assignment 4 Notes"
output:
  html_document:
    toc: yes
    code_download: true
    code_folding: "hide"
---
	  
```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, fig.align = "center")
```

## General Issues

* Make sure you name your files as requested, including matching the
  specified use of upper and lower case. This matters on file systems
  that are case-sensitive.

* Make sure to commit your work to your local repository and push your
  commits to GitLab. We can only see what is on GitLab, not what is on
  your computer. You can check what we see by going to the GitLab web
  interface.
 
* Include your name and the date in the header of your `.Rmd` file
  using `author:` and `date:` tags.

* Your HTML file should be a report of your findings.

    * Any graph you show should be discussed in your narrative.

    * Any code you show should be discussed in your narrative.

    * If you do not need to discuss a piece of code in the narrative,
      use the chunk option `echo = FALSE` to avoid showing it.

* If you load a file that you have included in your repository or that
  you download to your repository then you need to make sure the code
  in your Rmarkdown document uses a relative path, not an absolute
  one.  Absolute paths will only make sense on your computer, not on
  the computer of someone else who downloads your repository.

* If you want to check that your work is reproducible, you can download
  it to a computer other than the one you use for developing it.
  One option is the CLAS Linux systems accessed via
  [FastX](https://clas.uiowa.edu/linux/help/fastx). You can use
  RStudio there to set up a clean copy of your repository and then
  just pull your changes and check that they knit successfully.
  Using `STAT4580::checkHW` is a convenient way to do this.
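
As a rough illustration of the relative-path point above, here is a
minimal sketch. The `data/` folder and the helper `is_relative()` are
hypothetical, introduced only for this example, and the regular
expression is a simple heuristic rather than a complete test for
absolute paths.

```{r}
# Build a path relative to the repository root (hypothetical data/ folder).
rel_path <- file.path("data", "vehicles.csv.zip")

# Hypothetical helper: treat paths that start with "/", "~", or a
# Windows drive letter such as "C:" as absolute.
is_relative <- function(path) !grepl("^(/|~|[A-Za-z]:)", path)

is_relative(rel_path)
is_relative("/home/someone/project/data/vehicles.csv.zip")
```

A relative path like the first one resolves the same way in any copy
of the repository; the second only makes sense on one machine.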


## 1. Find a Better Visualization

The original:

<center>
![](img/abcnews_trumptransition.png)
</center>

Some issues:

* The white bars are supposed to represent the numbers, but are not
  using a zero base line -- the bar for Obama's 79% should be nearly
  twice as long as the bar for Trump's 40%.
* The blue and red bars are distracting at best, misleading at
  worst. They could represent the complementary proportion, but the
  lengths are wrong relative to the white bars and to each other.
* The placement of the GMA logo adds to the confusion.
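
To see how much a non-zero base line can distort the comparison,
consider the following small calculation. The base line of 30 is a
hypothetical value chosen for illustration; the base line actually
used in the original graphic is not known.

```{r}
appr_obama <- 79
appr_trump <- 40

# With a zero base line, bar lengths are proportional to the values.
ratio_zero <- appr_obama / appr_trump
ratio_zero

# Hypothetical non-zero base line (an assumption for illustration only).
baseline <- 30
ratio_shifted <- (appr_obama - baseline) / (appr_trump - baseline)
ratio_shifted
```

With a zero base line the ratio of bar lengths is just under 2; with
the hypothetical base line of 30 it grows to almost 5.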

A simple bar chart with a zero base line:

```{r}
library(dplyr)
library(ggplot2)
d <- data.frame(pres = c("Obama", "Carter", "Clinton",
                         "G.W. Bush", "Reagan", "G.H.W. Bush", "Trump"),
                appr = c(79, 78, 68, 65, 58, 56, 40),
                party = c("D", "D", "D", "R", "R", "R", "R"),
                year = c(2009, 1977, 1993, 2001, 1981, 1989, 2017))
d <- mutate(d, pres = reorder(pres, appr))

p <- ggplot(d, aes(x = pres, y = appr, fill = party)) +
    geom_col() + coord_flip()
p
```

* In recent years it has become common to represent Democrats as blue,
  Republicans as red.

* The default colors are close to red and blue, but their use is
  opposite to current convention.

This can be changed using `scale_fill_manual`:

```{r}
p + scale_fill_manual(values = c(R = "red", D = "blue"))
```

* Pure colors are very intense when used in larger areas.

* Pure warm colors, like red, are more intense than pure cool colors,
  like blue.

We can reduce the saturation and the value in the HSV color
representation to obtain less intense colors; this is commonly used in
red state/blue state maps:

```{r}
myred <- hsv(0, 0.6, 0.8)
myblue <- hsv(2 / 3, 0.6, 0.8)
p + scale_fill_manual(values = c(R = myred, D = myblue))
```
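
As a quick check on what these HSV settings mean, the sketch below
converts colors back to HSV coordinates using base R's `col2rgb` and
`rgb2hsv`. The variable names are introduced here for illustration
only.

```{r}
# Pure red in HSV coordinates: hue 0, full saturation, full value.
pure_red_hsv <- rgb2hsv(col2rgb("red"))
pure_red_hsv

# The muted red used above, converted back to HSV as a check.
# Small rounding differences arise from the 8-bit RGB representation.
muted_red <- hsv(0, 0.6, 0.8)
muted_red_hsv <- rgb2hsv(col2rgb(muted_red))
round(muted_red_hsv, 2)
```

The hue is unchanged; only the saturation and value are reduced.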

Some enhancements:

```{r}
p + scale_fill_manual(values = c(R = myred, D = myblue)) + theme_void() +
    geom_text(aes(y = 3, label = pres),
              size = 8, hjust = "left", color = "white") +
    geom_text(aes(y = appr - 3, label = appr),
              size = 8, hjust = "right", color = "white")
```

Some notes:

* A dot chart is a reasonable alternative in this case.

* Horizontal bar charts are the norm in these settings since they
  allow horizontal labels of reasonable size.

* Party is a nominal or categorical attribute, not a numeric attribute.
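
One way the dot chart alternative mentioned above might look is
sketched below. This is only one possible version: the object names
`d_dot` and `p_dot`, the point size, and the generic axis label
"Percent" are choices made for this example.

```{r}
library(ggplot2)

# Same values as in the bar chart above, under a new name so that the
# earlier objects are left untouched.
d_dot <- data.frame(pres = c("Obama", "Carter", "Clinton",
                             "G.W. Bush", "Reagan", "G.H.W. Bush", "Trump"),
                    appr = c(79, 78, 68, 65, 58, 56, 40),
                    party = c("D", "D", "D", "R", "R", "R", "R"))

# Order the presidents by their value so the dots form a clear ranking.
d_dot$pres <- reorder(d_dot$pres, d_dot$appr)

p_dot <- ggplot(d_dot, aes(x = appr, y = pres, color = party)) +
    geom_point(size = 4) +
    scale_color_manual(values = c(R = hsv(0, 0.6, 0.8),
                                  D = hsv(2 / 3, 0.6, 0.8))) +
    labs(x = "Percent", y = NULL) +
    theme_minimal()
p_dot
```

Because a dot chart encodes values by position rather than length, it
does not require a zero base line.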


## 2. EPA Fuel Economy Data

```{r, message = FALSE}
library(lubridate)
library(readr)
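# Refresh the local copy if it is missing or more than six months old.
# `%m+%` avoids the NA that `+ months(6)` gives for dates like August 31.
if (! file.exists("vehicles.csv.zip") ||
    file.mtime("vehicles.csv.zip") %m+% months(6) < now())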
    download.file("http://www.stat.uiowa.edu/~luke/data/vehicles.csv.zip",
                  "vehicles.csv.zip")
newmpg <- read_csv("vehicles.csv.zip", guess_max = 100000)
```

From the [documentation for the
data](https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle) the
appropriate variables seem to be:

  * `highway08` corresponds to `hwy` in `mpg`;
  * `cylinders` corresponds to `cyl` in `mpg`;
  * `displ` corresponds to `displ` in `mpg`;
  * `fuelType1` represents the primary fuel type, `fl` in `mpg`.
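
If it is convenient to work with the `mpg` names, the correspondence
above could be applied with `dplyr::select`, which renames as it
selects. The sketch below uses a tiny made-up data frame, `toy`, in
place of the real data; its values are invented for illustration.

```{r, message = FALSE}
library(dplyr)

# Made-up rows using the EPA column names (values are not real data).
toy <- data.frame(highway08 = c(30, 25),
                  cylinders = c(4, 6),
                  displ = c(2.0, 3.5),
                  fuelType1 = c("Regular Gasoline", "Premium Gasoline"))

# Select the four variables and give them the names used in `mpg`.
toy_mpg <- select(toy,
                  hwy = highway08,
                  cyl = cylinders,
                  displ,
                  fl = fuelType1)
names(toy_mpg)
```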

The primary fuel type counts are

```{r, message = FALSE}
library(dplyr)
tbl <- count(newmpg, fuelType1)
kbl <- knitr::kable(tbl, format = "html")
kableExtra::kable_styling(kbl, full_width = FALSE)
```

A bar chart of these numbers:

```{r}
thm <- theme_minimal() + theme(text = element_text(size = 16))
ggplot(tbl, aes(x = n, y = reorder(fuelType1, n))) +
    geom_col() +
    scale_x_continuous(expand = expansion(mult = c(0, .1))) +
    thm +
    ylab(NULL)
```

Regular gas is the dominant fuel type over all years, with premium second.
All other fuel types, including electricity, make up a small fraction.
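
To put rough numbers on "a small fraction", the counts can be turned
into percentage shares. The sketch below hard-codes the counts from
the table as rendered when these notes were last knit; the data file
is updated periodically, so current counts may differ. The name
`fuel_counts` is introduced only for this example.

```{r}
# Counts copied from the rendered table (a snapshot, not live data).
fuel_counts <- data.frame(
    fuelType1 = c("Diesel", "Electricity", "Midgrade Gasoline",
                  "Natural Gas", "Premium Gasoline", "Regular Gasoline"),
    n = c(1231, 353, 142, 60, 13517, 29384))

# Percentage share of each fuel type, rounded to one decimal place.
fuel_counts$share <- round(100 * fuel_counts$n / sum(fuel_counts$n), 1)
fuel_counts[order(-fuel_counts$n), ]
```

In this snapshot regular and premium gasoline together account for
about 96% of the models, leaving about 4% for all other fuel types.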


## 3. Fuel Type Over the Years

A filled bar chart shows changes in the primary fuel type used over
the years:
  
```{r}
newmpg2 <- filter(newmpg, year <= 2021) %>%
    mutate(year = factor(year))
ggplot(newmpg2, aes(y = year, fill = fuelType1)) +
    geom_bar(position = "fill") +
    scale_x_continuous(expand = c(0, 0)) +
    labs(x = "Proportion", y = NULL)
```

Regular gas was the predominant fuel type in the mid-1980s, but
premium's share has gradually increased to the point where almost as
many models use premium as regular. Diesel's popularity declined early
and had a small resurgence recently. The market share for electricity
is still quite small but is growing.
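
The proportions that `position = "fill"` displays can also be computed
directly, which is useful when exact shares are wanted for the
narrative. The sketch below shows the idea on a small made-up data
frame, `toy_years`; the rows are invented for illustration and are
not taken from the EPA data.

```{r, message = FALSE}
library(dplyr)

# Made-up data: one row per model, with its year and primary fuel type.
toy_years <- data.frame(
    year = c(1985, 1985, 1985, 1985, 2020, 2020, 2020, 2020),
    fuelType1 = c("Regular", "Regular", "Regular", "Premium",
                  "Regular", "Regular", "Premium", "Premium"))

# Count models per year and fuel type, then convert the counts to
# within-year proportions, which is what a filled bar chart shows.
toy_props <- toy_years %>%
    count(year, fuelType1) %>%
    group_by(year) %>%
    mutate(prop = n / sum(n)) %>%
    ungroup()
toy_props
```

Within each year the proportions sum to one, matching the full-width
bars in the filled bar chart.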


## 4. Highway Fuel Economy Over the Years

```{r}
newmpg3 <- filter(newmpg, year <= 2021, year >= 2000) %>%
    mutate(year = factor(year))
alpha <- 0.2
size <- 0.3
```

A strip chart is a useful way to look at the full data for a numeric
variable at several different levels of a discrete variable, but some
tuning is needed for larger data sets. For examining 22 years of
highway gas mileage data from the EPA data set, using
`alpha` = `r alpha` and `size` = `r size` along with jittering seems to
work reasonably well:
	
```{r}
ggplot(newmpg3, aes(x = highway08, y = year)) +
    geom_point(position = "jitter", size = size, alpha = alpha) +
    ylab(NULL) +
    thm
```

Over time the highway gas mileage distributions are shifting slightly
toward higher values, with the upper tails becoming gradually longer and an
increasing number of very high efficiency models (mostly electric).
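
One possible refinement, offered as a sketch rather than as part of
the analysis above: `position = "jitter"` adds noise in both
directions, so the mileage values themselves are perturbed slightly.
Using `position_jitter()` with `width = 0` restricts the jitter to the
`year` axis. The simulated data frame `toy_hwy` and the jitter height
of 0.3 are assumptions for illustration only.

```{r}
library(ggplot2)

# Simulated stand-in for the EPA data (values are not real).
set.seed(1)
toy_hwy <- data.frame(year = factor(rep(2000:2002, each = 200)),
                      highway08 = round(rnorm(600, mean = 25, sd = 5)))

# Jitter only along the discrete year axis; leave the mileage values as is.
p_strip <- ggplot(toy_hwy, aes(x = highway08, y = year)) +
    geom_point(position = position_jitter(width = 0, height = 0.3),
               size = 0.3, alpha = 0.2) +
    ylab(NULL)
p_strip
```

With integer-valued mileage this produces short vertical stripes
within each year; whether that or the two-way jitter reads better is
a judgment call.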

<!--
Local Variables: 
mode: poly-markdown+R
mode: flyspell
End:
-->
