## General Issues

• Make sure your file names and file references use identical spelling, including upper/lower case. Your code will fail on a case-sensitive file system if you don’t.

• Make sure to commit your work to your local repository and push your commits to GitLab. We can only see what is on GitLab, not what is on your computer. You can check what we see by going to the GitLab web interface.

• Include your name and the date in the header of your .Rmd file using author: and date: tags. You can use an inline chunk to have the date computed when the document is rendered. Your header should look something like this:

---
title: "Assignment 3"
author: "Fred Frog"
date: "`r Sys.Date()`"
output: html_document
---

• If you want to increase the font size for the body text in your HTML output, one option is to add this line after your document header:

<style type="text/css"> body{ font-size: 12pt; } </style>

Do not use markdown headers for this. Markdown headers (lines starting with one or more # characters) should only be used for section and subsection headers.

• Any graph you show should be discussed in your narrative.

• Any code you show should be discussed in your narrative.

• If you do not need to discuss a piece of code in the narrative, set the chunk option echo = FALSE to avoid showing it.

• Your report should also not contain raw R output unless you are discussing how R presents results. Output should be incorporated into your text with inline code or presented as nicely formatted tables.

• Your .Rmd file, and possibly supporting .R files, contain the code for your analysis.

• If you need to update your code, or if a collaborator needs to update your code, that work will be done in your .Rmd file.

• You should make sure the code in your .Rmd file is readable.

• Following the coding standards helps with this.

• Please indent by 4 spaces for each level. I find this the most readable option.

• If you read a data file in your code, make sure that you read it in a way that will work for someone else using your repository. If you want to read from a local file, read it with a relative path name, assuming your working directory will be the directory containing your .Rmd file.
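A minimal sketch of the relative-path advice, which also guards against the upper/lower case mismatches mentioned at the top of this list. The file name `survey.csv` and the `data/` subdirectory are hypothetical and stand in for whatever your repository contains:

```r
# Hypothetical layout: the repository holds report.Rmd and data/survey.csv.

# Not portable: an absolute path exists only on one machine.
# survey <- read.csv("C:/Users/fred/stat/data/survey.csv")

# Portable: a path relative to the directory containing the .Rmd file.
data_file <- file.path("data", "survey.csv")

# list.files() returns names exactly as stored, so this comparison is
# case-sensitive even on a case-insensitive file system.
if (basename(data_file) %in% list.files(dirname(data_file))) {
    survey <- read.csv(data_file)
} else {
    message("Expected ", data_file, " relative to ", getwd())
}
```

Running a check like this before pushing gives some reassurance that the same code will also find the file on a case-sensitive system.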

## 1. Choosing Between Faceting and Color

The faceted plot shows each of the seven groups in a sub-plot, or facet, using the same axis scales for all plots.

library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_wrap(~ class, nrow = 2)

The plots are small and there is some over-plotting. The over-plotting could be reduced by reducing the point size.
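As a rough sketch of that adjustment, the same faceted plot can be drawn with smaller points. The value `size = 1` and the object name `p_small` are only illustrative choices:

```r
library(ggplot2)

# Same faceted plot as above, with smaller points to reduce over-plotting.
# size = 1 is an illustrative value (the ggplot2 default is 1.5).
p_small <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(size = 1) +
    facet_wrap(~ class, nrow = 2)
p_small
```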

A single plot that maps class to color benefits from a larger point size to improve discriminability of the colors:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
    geom_point(size = 2.5)

The number of colors is large, which makes discrimination more difficult, even with the increased point size. But once groups are identified, their relative positions are easier to see in the colored plot as all comparisons are within a common set of scales.

Faceting reduces plot size and thus increases over-plotting for larger data sets. Reducing point size is an option that can be effective if color and shape are not being used as channels. A significant drawback of faceting is that some group comparisons are moved from common-scale comparisons to unaligned-scale comparisons. This can sometimes be alleviated by showing a muted image of the complete data in the background.

Overall, color may have a slight edge in this data set. But it should be kept in mind that color is not effective on all display devices or for all viewers.
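One possible mitigation, which is not part of the comparison above, is to use a palette designed with color-vision deficiency in mind. The sketch below assumes R 4.0 or later, where base R's `palette.colors()` provides the Okabe-Ito palette; the names `class_cols` and `p_cb` are illustrative:

```r
library(ggplot2)

# Seven colors from the Okabe-Ito palette, one per class.
# unname() drops the color names so that scale_color_manual() assigns
# the colors by position rather than trying to match them to class labels.
class_cols <- unname(palette.colors(n = 7, palette = "Okabe-Ito"))

p_cb <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
    geom_point(size = 2.5) +
    scale_color_manual(values = class_cols)
p_cb
```

A different palette does not change the underlying limitation: seven categories remain hard to tell apart on some displays.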

In larger data sets color becomes less effective as there will be a considerable amount of over-plotting, given the point size needed to support good color discrimination. Faceting will also suffer from more over-plotting in larger data sets for a given point size, but there is more flexibility to reduce point size. The shape of the data also plays a role, so both approaches are worth considering.

## 2. Faceting with Muted Full Data

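The full data can be added as a background layer in a muted color, such as a light grey. Removing the class variable from the background layer's data means that layer is not split by facet_wrap, so it is drawn in every facet: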

library(ggplot2)
library(dplyr)
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(data = mutate(mpg, class = NULL), color = "lightgrey") +
    geom_point() +
    facet_wrap(~ class, nrow = 2)

With the full data in the background, group-to-whole comparisons are again on aligned scales. For example, it is easy to see that the 2-seaters are quite different from the other cars. Seeing this in the basic faceted plot shown above is also possible, but it requires some work.
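To see how the grey layer ends up in every facet, it can help to count the rows ggplot2 draws for each layer. The sketch below is a check, assuming that a layer whose data lacks the faceting variable is repeated in each panel. It drops class with base R subsetting instead of dplyr, and the names `mpg_bg` and `p_bg` are illustrative:

```r
library(ggplot2)

# Background data: everything except the faceting variable `class`.
mpg_bg <- mpg[setdiff(names(mpg), "class")]

p_bg <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(data = mpg_bg, color = "lightgrey") +
    geom_point() +
    facet_wrap(~ class, nrow = 2)

# Rows drawn by the grey layer: the full data once per facet.
nrow(layer_data(p_bg, 1))
# Rows drawn by the foreground layer: each car once, in its own facet.
nrow(layer_data(p_bg, 2))
```

If the first count is the number of facets times the second, the background layer is being replicated as intended.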

## 3. Gun Murders in US States

if (! file.exists("murders.csv"))
    download.file("…", "murders.csv")  # URL elided in the source
murders <- read.csv("murders.csv")

The following graph plots the total number of gun murders against the population for each state and the District of Columbia. Log axes are used because the distributions of both variables are highly skewed. The points are colored to show the region associated with each state.

ggplot(murders, aes(x = population, y = total, color = region)) +
    geom_point(size = 2.5) +
    scale_x_log10() +
    scale_y_log10()

The relationship between the number of murders and the population size appears to be close to linear on the log-log scale. The states in the southern region are mostly towards the top of the set of points: for a given population size, the number of murders in southern states appears to be higher than in other regions.
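One way to follow up on this visual impression numerically is to compare murder rates across regions. The helper below is a hypothetical sketch, not part of the original analysis; it assumes only the columns `population`, `total`, and `region` used in the plot:

```r
# Hypothetical helper: gun murders per 100,000 residents, summarised by region.
# Assumes columns `population`, `total`, and `region`, as used in the plot.
rate_by_region <- function(d) {
    d$rate <- d$total / d$population * 1e5
    # Median rate within each region, sorted from highest to lowest.
    out <- aggregate(rate ~ region, data = d, FUN = median)
    out[order(out$rate, decreasing = TRUE), ]
}
```

Calling `rate_by_region(murders)` would then list the regions by median rate, which can be checked against what the plot suggests.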

## 4. Comparing Some Visualizations

All three plots clearly show that the 5 cylinder group is the smallest. Distinguishing the sizes of the other groups is more challenging.

Plot B uses aligned scales. It is easy to see the ordering, even though the values for 8, 6, and 4 cylinders are quite close.

Plot C relies on length comparisons; it seems possible to recognize that the 8 cylinder group is the smallest among the 4, 6, and 8 cylinder groups, but determining which of the 4 and 6 cylinder groups is smaller is very hard.

Plot A relies on area comparisons. The sizes of the 4, 6, and 8 cylinder groups are very hard to distinguish.

For comparing the group sizes in this data set Plot B is best, followed by Plot C, and then Plot A.
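The three plots are not reproduced here. As an illustration of an aligned-scale display, and assuming the groups are the `cyl` values in ggplot2's `mpg` data (an assumption, since the data set is not named in this section), a bar chart of group counts could be sketched as:

```r
library(ggplot2)

# Assumption: the groups are the `cyl` values in ggplot2's mpg data.
# Bars share a common baseline, so group sizes are compared on an aligned scale.
p_bar <- ggplot(mpg, aes(x = factor(cyl))) +
    geom_bar() +
    labs(x = "Cylinders", y = "Number of cars")
p_bar

# The counts behind the bars, for reference.
table(mpg$cyl)
```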