---
title: "Assignment 3 Notes"
output:
  html_document:
    toc: yes
    code_download: true
    code_folding: "hide"
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, fig.align = "center")
```

## General Issues

* Make sure your file names and file references use identical
  spelling, including upper/lower case. Your code will fail on a
  case-sensitive file system if you don't.

* Make sure to commit your work to your local repository and push your
  commits to GitLab. We can only see what is on GitLab, not what is on
  your computer. You can check what we see by going to the GitLab web
  interface.
 
* Include your name and the date in the header of your `.Rmd` file
  using `author:` and `date:` tags. You can use an inline chunk to
  have the date computed when the document is rendered. Your header
  should look something like this:

```{r, include = FALSE}
rinline <- function(code) {
    sprintf("`r %s`", code)
}
```
    ```
    ---
    title: "Assignment 3"
    author: "Fred Frog"
    date: "`r rinline("Sys.Date()")`"
    output: html_document
    ---
    ```

* If you want to increase the font size for the body text in your HTML
  output one option is to add this line after your document header:

    ```
    <style type="text/css"> body{ font-size: 12pt; } </style>
    ```
    Do _not_ use markdown headers for this. Markdown headers (lines
    starting with one or more `#` characters) should only be used for
    section and subsection headers.
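
For the first point above, a minimal sketch of an exact-case file name check. The helper name `has_exact_name` is hypothetical, and `"murders.csv"` is only an example file name. Unlike `file.exists()`, which can succeed for a differently-cased name on a case-insensitive file system, this compares against the directory listing exactly.

```{r}
# Hypothetical helper: TRUE only if `name` matches a directory entry
# exactly, including upper/lower case.
has_exact_name <- function(name, dir = ".") {
    name %in% list.files(dir, all.files = TRUE)
}

# Example call; the result depends on the contents of the working directory.
has_exact_name("murders.csv")
```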


## General Comments

* Your HTML file should be a report of your findings.

    * Any graph you show should be discussed in your narrative.

    * Any code you show should be discussed in your narrative.

    * If you do not need to discuss a piece of code in the narrative,
      use the chunk option `echo = FALSE` to avoid showing it.

    * Your report should also not contain raw R output unless you are
      discussing how R presents results. Output should be incorporated
      into your text with inline code or presented as nicely formatted
      tables.

* Your `.Rmd` file, and possibly supporting `.R` files, contain the
  code for your analysis.

    * If you need to update your code, or if a collaborator needs to
      update your code, that work will be done in your `.Rmd` file.

    * You should make sure the code in your `.Rmd` file is readable.

    * Following the [coding standards](coding.html) helps with this.

    * Please indent by 4 spaces for each level. I find this the most
      readable option.

* If you read a data file in your code make sure that you read it in a
  way that will work for someone else using your repository. If you
  want to read from a local file:

    * Make sure it is available locally either by downloading it as
      needed or including it in your repository.

    * Read the file with a relative path name, assuming your working
      directory will be the directory containing your `.Rmd` file.
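
As an illustration of presenting output as a formatted table rather than raw R output, here is a minimal sketch using `knitr::kable`. The summary chosen (mean highway mileage by class in the `mpg` data) and the name `hwy_by_class` are assumptions made for the example, not part of the assignment.

```{r}
library(knitr)

# Illustrative summary: mean highway mileage for each vehicle class
hwy_by_class <- aggregate(hwy ~ class, data = ggplot2::mpg, FUN = mean)

# Render as a formatted table instead of printing the data frame
kable(hwy_by_class, digits = 1,
      col.names = c("Class", "Mean highway mpg"))
```

A single value from such a summary can also be placed in the narrative with inline code instead of being printed.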

## 1. Choosing Between Faceting and Color

The faceted plot shows each of the seven groups in a sub-plot, or facet,
using the same axis scales for all plots.

```{r, fig.width = 8}
library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_wrap(~ class, nrow = 2)
```

The plots are small and there is some over-plotting. The over-plotting
could be reduced by reducing the point size.
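
A sketch of the same faceted plot with smaller points. The value `size = 0.8` and the name `p_small` are arbitrary choices for illustration; a suitable size depends on the figure dimensions.

```{r, fig.width = 8}
library(ggplot2)

# Same faceted plot as above, with a smaller point size to reduce
# over-plotting within each facet
p_small <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(size = 0.8) +
    facet_wrap(~ class, nrow = 2)
p_small
```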

A single plot that maps `class` to color benefits from a larger point size
to improve discriminability of the colors:

```{r, fig.width = 8}
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
    geom_point(size = 2.5)
```

The number of colors is large, which makes discrimination more
difficult, even with the increased point size. But once groups are
identified, their relative positions are easier to see in the colored
plot as all comparisons are within a common set of scales.

Faceting reduces plot size and thus increases over-plotting for larger
data sets. Reducing point size is an option that can be effective if
color and shape are not being used as channels. A significant drawback
of faceting is that some group comparisons are moved from common scale
comparisons to unaligned scale comparisons. This can sometimes be
alleviated by showing a muted image of the complete data in the
background.

Overall, color may have a slight edge in this data set. But it should
be kept in mind that color is not effective on all display devices or
for all viewers.

In larger data sets color becomes less effective as there will be a
considerable amount of over-plotting, given the point size needed to
support good color discrimination. Faceting will also suffer from more
over-plotting in larger data sets for a given point size, but there is more
flexibility to reduce point size.  The shape of the data also plays a
role, so both approaches are worth considering.
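
To get a feel for the larger-data case, the two approaches can be tried on a bigger data set. The sketch below assumes the `diamonds` data shipped with `ggplot2` (roughly 54,000 rows) as a stand-in; it is not part of the assignment, and the point sizes are illustrative.

```{r, fig.width = 8}
library(ggplot2)

# Color, using the point size that supported color discrimination above;
# with this many points, expect considerable over-plotting.
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
    geom_point(size = 2.5)

# Facets, where the point size can be reduced because color is not
# needed to identify the groups.
ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(size = 0.3) +
    facet_wrap(~ cut, nrow = 2)
```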


## 2. Faceting with Muted Full Data

The full data can be added as a background layer in a muted color,
such as a light grey:

```{r, message = FALSE, fig.width = 8}
library(ggplot2)
library(dplyr)
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(data = mutate(mpg, class = NULL), color = "lightgrey") +
    geom_point() +
    facet_wrap(~ class, nrow = 2)
```

With the full data in the background, group-to-whole comparisons are
again on aligned scales.  For example, it is easy to see that the
2-seaters are quite different from the other cars. Seeing this in the
basic faceted plot shown above is also possible, but it requires some
work.
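
The background layer works because its data has no `class` column: when a layer's data lacks the faceting variable, `facet_wrap()` repeats that layer in every panel. A sketch of the same plot without `dplyr`, dropping the column with base R (the name `mpg_all` is just for illustration):

```{r, fig.width = 8}
library(ggplot2)

# Background data: mpg without the faceting variable, so this layer is
# drawn in every facet
mpg_all <- mpg[, setdiff(names(mpg), "class")]

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(data = mpg_all, color = "lightgrey") +
    geom_point() +
    facet_wrap(~ class, nrow = 2)
```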


## 3. Gun Murders in US States

```{r, message = FALSE}
if (! file.exists("murders.csv"))
    download.file("https://www.stat.uiowa.edu/~luke/data/murders.csv",
                  "murders.csv")
murders <- read.csv("murders.csv")
```

The following graph shows a plot of the total number of gun murders
against the population of each state and the District of Columbia.
Log axes are used as the distributions of both variables are highly
skewed. The points are colored to show the region associated with each
state.

```{r}
ggplot(murders, aes(x = population, y = total, color = region)) +
    geom_point(size = 2.5) +
    scale_x_log10() +
    scale_y_log10()
```

The relationship between the number of murders and the population size
appears to be close to linear on the log-log scale. The states in the
southern region are mostly towards the top of the set of points: for a
given population size, the number of murders in southern states
appears to be higher than in others.
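
One way to check this visual impression numerically is to compare murder rates across regions. This is a minimal sketch: the helper `rate_per_100k` is hypothetical, and the code assumes the `murders` data frame read above has the columns `total`, `population`, and `region` used in the plot.

```{r}
# Hypothetical helper: murders per 100,000 residents
rate_per_100k <- function(total, population) {
    total / population * 1e5
}

# Median rate by region; skipped if the murders data has not been read
if (exists("murders")) {
    rate <- rate_per_100k(murders$total, murders$population)
    tapply(rate, murders$region, median)
}
```

The median is used here so that a single extreme state or district does not dominate a region's summary; other summaries would be equally reasonable.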

## 4. Comparing Some Visualizations

All three plots clearly show that the 5 cylinder group is the
smallest. Distinguishing the sizes of the other groups is more
challenging.

Plot B uses aligned scales. It is easy to see the ordering, even
though the values for 8, 6, and 4 cylinders are quite close.

Plot A relies on length comparisons; it seems possible to recognize
that the 8 cylinder group is the smallest among the 4, 6, and 8
cylinder groups, but determining which of the 4 and 6 cylinder groups
is smaller is very hard.

Plot C relies on area comparisons.  The sizes of the 4, 6, and 8
cylinder groups are very hard to distinguish.

For comparing the group sizes in this data set Plot B is best, followed
by Plot A, and then Plot C.
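
The three kinds of comparison can be tried out with a sketch like the one below. It does not reproduce the assignment's Plots A, B, and C, which are not shown in these notes; it assumes the group sizes are counts of cars by number of cylinders in `mpg` and uses a bar chart, a stacked bar, and a pie chart as stand-ins for aligned-scale, length, and area encodings.

```{r, fig.width = 8}
library(ggplot2)

# Aligned scale: one bar per group on a common baseline
ggplot(mpg, aes(x = factor(cyl))) +
    geom_bar()

# Unaligned lengths: segments stacked in a single bar
ggplot(mpg, aes(x = "", fill = factor(cyl))) +
    geom_bar()

# Angles and areas: the stacked bar drawn in polar coordinates
ggplot(mpg, aes(x = "", fill = factor(cyl))) +
    geom_bar(width = 1) +
    coord_polar(theta = "y")
```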
