General Issues

1. Life Expectancy and GDP Per Capita

One way to select the subset of four years:

library(dplyr)
library(ggplot2)
library(gapminder)
gap <- filter(gapminder, year %% 10 == 7 & year >= 1977)

Another possibility is

gap1 <- filter(gapminder, year %in% c(1977, 1987, 1997, 2007))
identical(gap, gap1)
## [1] TRUE
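
The modulo condition can be checked on the year values alone. This is a small sketch in base R; it assumes the gapminder years run from 1952 to 2007 in steps of five, and the vector `years` is built by hand for illustration rather than taken from the data.

```r
# Assumed year grid of the gapminder data: every fifth year, 1952 to 2007
years <- seq(1952, 2007, by = 5)

# Years ending in 7, restricted to 1977 or later
keep <- years %% 10 == 7 & years >= 1977
years[keep]
```

Only the four years listed in the `%in%` version survive the condition, which is why the two filters agree.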

A faceted plot of life expectancy against GDP per capita, with color encoding continent and area encoding population size:

ggplot(gap, aes(gdpPercap, lifeExp, color = continent, size = pop)) +
    geom_point() + scale_size_area(max_size=8) + facet_wrap(~year)

Using facets

These are essentially four frames from the Gapminder animation.
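
As a possible refinement, the Gapminder animation is usually drawn with income on a logarithmic axis. The sketch below adds `scale_x_log10()` to the same plot; the name `p_log` is introduced here for illustration, and whether the log scale is wanted for this exercise is an assumption.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Same four-year subset as above
gap <- filter(gapminder, year %in% c(1977, 1987, 1997, 2007))

# Identical encodings, with a log10 x axis added (an optional variation)
p_log <- ggplot(gap, aes(gdpPercap, lifeExp, color = continent, size = pop)) +
    geom_point() +
    scale_size_area(max_size = 8) +
    scale_x_log10() +
    facet_wrap(~year)
p_log
```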

2. Fuel Economy

A plot of city fuel economy against engine displacement, with color encoding the number of cylinders and shape encoding the transmission type:

mpg1 <- mutate(mpg, cyl = factor(cyl), trans = substr(trans, 1, 4))
ggplot(mpg1, aes(y = cty, x = displ, color = cyl, shape = trans)) +
    geom_point(size = 2.5)

Five cylinders may seem like an odd number; the cars with five cylinders in this data set are all from one manufacturer:

filter(mpg, cyl == 5)
## # A tibble: 4 x 11
##   manufacturer model  displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 volkswagen   jetta    2.5  2008     5 auto… f        21    29 r     comp…
## 2 volkswagen   jetta    2.5  2008     5 manu… f        21    29 r     comp…
## 3 volkswagen   new b…   2.5  2008     5 manu… f        20    28 r     subc…
## 4 volkswagen   new b…   2.5  2008     5 auto… f        20    29 r     subc…

A bar chart showing the distribution of transmission types within cylinder counts:

ggplot(mpg1, aes(x = cyl, fill = trans)) + geom_bar(position = "dodge")

or as a stacked bar chart standardized to show relative proportions:

ggplot(mpg1, aes(x = cyl, fill = trans)) + geom_bar(position = "fill")

Manual transmissions are about as common as automatic transmissions among four- and five-cylinder cars, but less common among cars with more cylinders.
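
The proportions behind this comparison can also be tabulated. The following is one way to do it with dplyr, rebuilding `mpg1` so the block stands alone; the names `counts` and `props` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

mpg1 <- mutate(mpg, cyl = factor(cyl), trans = substr(trans, 1, 4))

# Count cars for each cylinder/transmission combination
counts <- count(mpg1, cyl, trans)

# Convert the counts to proportions within each cylinder level
props <- ungroup(mutate(group_by(counts, cyl), prop = n / sum(n)))
props
```

Each `prop` value corresponds to the height of one segment in the `position = "fill"` bar chart.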

3. Fuel Economy Again

Reading the data:

library(readr)
if (! file.exists("vehicles.csv.zip"))
    download.file("http://www.stat.uiowa.edu/~luke/data/vehicles.csv.zip",
                  "vehicles.csv.zip")
newmpg <- read_csv("vehicles.csv.zip", guess_max = 10000)
## Warning: 10197 parsing failures.
##   row     col           expected actual               file
## 23158 mfrCode 1/0/T/F/TRUE/FALSE    ADX 'vehicles.csv.zip'
## 23159 mfrCode 1/0/T/F/TRUE/FALSE    ADX 'vehicles.csv.zip'
## 23160 mfrCode 1/0/T/F/TRUE/FALSE    ADX 'vehicles.csv.zip'
## 23161 mfrCode 1/0/T/F/TRUE/FALSE    ADX 'vehicles.csv.zip'
## 23162 mfrCode 1/0/T/F/TRUE/FALSE    BEX 'vehicles.csv.zip'
## ..... ....... .................. ...... ..................
## See problems(...) for more details.
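
The parsing failures appear to come from type guessing: mfrCode seems to be empty in the rows used for guessing, so it is read as logical and later values such as ADX are rejected. Since mfrCode is not used below, the warning is harmless here. A minimal sketch of one way to avoid it is to declare the column type explicitly; the toy file written here is made up for illustration and is not the real data.

```r
library(readr)

# Toy file (an assumption): a column that is empty at first, text later
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,mfrCode", "1,", "2,", "3,ADX"), tmp)

# Declare mfrCode as character; other columns are still guessed
toy <- read_csv(tmp, col_types = cols(mfrCode = col_character()))
toy$mfrCode
```

For the real file the same `col_types` argument could presumably be added to the `read_csv` call above, alongside `guess_max`.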

From the documentation for the data, the city08 variable seems a reasonable match to the cty variable in the mpg data set.

Select data for models from 2009 to the present, pull out the most useful variables, rename some of them, and make trans a factor:

newmpg1 <- filter(newmpg, year >= 2009)
newmpg1 <- select(newmpg1,
                  make, model, year,
                  cty = city08,
                  trans = trany,
                  cyl = cylinders,
                  displ)
newmpg1 <- mutate(newmpg1, trans = factor(trans))
summary(newmpg1)
##      make              model                year           cty        
##  Length:13413       Length:13413       Min.   :2009   Min.   :  8.00  
##  Class :character   Class :character   1st Qu.:2011   1st Qu.: 16.00  
##  Mode  :character   Mode  :character   Median :2014   Median : 19.00  
##                                        Mean   :2014   Mean   : 20.77  
##                                        3rd Qu.:2017   3rd Qu.: 22.00  
##                                        Max.   :2020   Max.   :150.00  
##                                                                       
##              trans           cyl             displ      
##  Automatic (S6) :2516   Min.   : 2.000   Min.   :0.000  
##  Manual 6-spd   :1729   1st Qu.: 4.000   1st Qu.:2.000  
##  Automatic (S8) :1562   Median : 6.000   Median :3.000  
##  Automatic 6-spd:1419   Mean   : 5.775   Mean   :3.301  
##  Automatic 7-spd: 634   3rd Qu.: 6.000   3rd Qu.:4.000  
##  Automatic 4-spd: 631   Max.   :16.000   Max.   :8.400  
##  (Other)        :4922   NA's   :169      NA's   :168

The levels of trans can be combined using substr again, or using fct_recode or fct_collapse from the forcats package.
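
The substr route can be sketched on a few of the labels shown in the summary above. The vector `labels` is typed in by hand for illustration; on the real data the same call would go inside `mutate`.

```r
# A handful of transmission labels copied from the output above
labels <- c("Automatic (S6)", "Manual 6-spd", "Automatic 6-spd", "Manual 5-spd")

# Keeping the first four characters leaves only "Auto" or "Manu"
short <- factor(substr(labels, 1, 4))
levels(short)
```

This relies on every label starting with either "Automatic" or "Manual"; a missing or differently spelled label would end up in a level of its own.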

Using fct_collapse along with grep:

library(forcats)
tlevs <- levels(newmpg1$trans)
head(grep("Auto", tlevs, value = TRUE))
## [1] "Automatic (A1)"    "Automatic (AM-S6)" "Automatic (AM-S7)"
## [4] "Automatic (AM-S8)" "Automatic (AM-S9)" "Automatic (AM5)"
head(grep("Manu", tlevs, value = TRUE))
## [1] "Manual 5-spd" "Manual 6-spd" "Manual 7-spd"
ntrans <- fct_collapse(newmpg1$trans,
                       Automatic = grep("Auto", tlevs, value = TRUE),
                       Manual = grep("Manu", tlevs, value = TRUE))
newmpg1 <- mutate(newmpg1, trans = ntrans)

There are many levels to cyl:

summary(factor(newmpg1$cyl))
##    2    3    4    5    6    8   10   12   16 NA's 
##   14  114 5178  183 4551 2776   97  323    8  169

Collapsing the lower and higher levels will make color encoding more effective:

ncyl <- fct_collapse(factor(newmpg1$cyl),
                     '2 or 3' = c("2", "3"),
                     '10 or more' = c("10", "12", "16"))
levels(ncyl)
## [1] "2 or 3"     "4"          "5"          "6"          "8"         
## [6] "10 or more"
newmpg2 <- mutate(newmpg1, cyl = ncyl)

An initial plot:

ggplot(newmpg2, aes(y = cty, x = displ, color = cyl, shape = trans)) +
    geom_point()
## Warning: Removed 168 rows containing missing values (geom_point).

The large cty value with displ zero and cyl equal to NA is worth a look:

filter(newmpg2, displ == 0)
## # A tibble: 1 x 7
##   make       model   year   cty trans     cyl   displ
##   <chr>      <chr>  <dbl> <dbl> <fct>     <fct> <dbl>
## 1 Mitsubishi i-MiEV  2016   126 Automatic <NA>      0

There are also vehicles with displ equal to NA:

filter(newmpg2, is.na(displ))
## # A tibble: 168 x 7
##    make          model                      year   cty trans    cyl   displ
##    <chr>         <chr>                     <dbl> <dbl> <fct>    <fct> <dbl>
##  1 Nissan        Leaf                       2011   106 Automat… <NA>     NA
##  2 smart         fortwo electric drive ca…  2011    94 Automat… <NA>     NA
##  3 smart         fortwo electric drive co…  2011    94 Automat… <NA>     NA
##  4 Mitsubishi    i-MiEV                     2012   126 Automat… <NA>     NA
##  5 Azure Dynami… Transit Connect Electric…  2012    62 Automat… <NA>     NA
##  6 Azure Dynami… Transit Connect Electric…  2012    62 Automat… <NA>     NA
##  7 Nissan        Leaf                       2012   106 Automat… <NA>     NA
##  8 BMW           Active E                   2011   107 Automat… <NA>     NA
##  9 CODA Automot… CODA                       2012    77 Automat… <NA>     NA
## 10 Ford          Focus Electric             2012   110 Automat… <NA>     NA
## # … with 158 more rows

These are dropped from the plot, although their cty values still affect the default axis range.

By encoding these as displ = 0, we can include these vehicles in the plot.

newmpg3 <- mutate(newmpg2, displ = ifelse(is.na(displ), 0, displ))

ggplot(newmpg3, aes(y = cty, x = displ, color = cyl, shape = trans)) +
    geom_point(na.rm = TRUE) +
    scale_shape_discrete(na.value = 21)
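
The NA replacement used above can also be written with `coalesce` from dplyr, which returns the first non-missing value at each position. This is only an alternative spelling of the same step; the vector `displ` below is made up for illustration.

```r
library(dplyr)

# Made-up displacement values, including a missing one
displ <- c(2.5, NA, 0, 3)

# Replace NA by 0, leaving all other values unchanged
filled <- coalesce(displ, 0)
filled

# Same result as the ifelse() version used above
identical(filled, ifelse(is.na(displ), 0, displ))
```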

Restricting to vehicles with positive displ values, which also excludes those whose displ was NA and is now zero, gives a plot comparable to the one for the mpg data:

newmpg4 <- filter(newmpg3, displ > 0)
p <- ggplot(newmpg4,
            aes(y = cty, x = displ, color = cyl, shape = trans)) +
    geom_point(na.rm = TRUE) +
    scale_shape_discrete(na.value = 21)
p

The distribution of transmission type within cylinder levels:

ggplot(newmpg1, aes(x = factor(cyl), fill = trans)) + geom_bar(position = "fill")