---
title: "Visualizing a Categorical Variable"
output:
  html_document:
    toc: yes
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

```{r, include = FALSE}
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
set.seed(12345)
```

## Categorical Data

Categorical data can be 

* nominal, qualitative
* ordinal

For visualization, the main difference is that ordinal data suggests a
particular display order.

Purely categorical data can come in a range of formats. The most common are

* raw data: individual observations;
* aggregated data: counts for each unique combination of levels;
* cross-tabulated data: counts arranged in a table or array.


### Raw Data

```{r, echo = FALSE}
ah <- as.data.frame(HairEyeColor)
raw <- ah[rep(seq_len(nrow(ah)), times = ah$Freq), ][-4]
raw <- raw[sample(seq_len(nrow(raw))), ]
row.names(raw) <- NULL
```

Raw data from a survey that records the hair color, eye color, and sex
of `r nrow(raw)` individuals might look like this:

```{r}
head(raw)
```

### Aggregated Data

One way to aggregate raw categorical data is to use `count` from `dplyr`:

```{r}
library(dplyr)
agg <- count(raw, Hair, Eye, Sex)
head(agg)
```

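To count every combination of levels without naming the variables, the
columns can be selected programmatically, for example with
`across(everything())`. (The older `count_` function, which accepted
variable names as strings, is deprecated in current versions of `dplyr`.)

```{r}
agg <- count(raw, across(everything()))
head(agg)
```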

### Cross-Tabulated Data

Cross-tabulated data can be produced from aggregate data using `xtabs`:

```{r}
xtabs(n ~ Hair + Eye + Sex, data = agg)
```

Cross-tabulated data can be produced from raw data using `table`:

```{r}
xtb <- table(raw)
xtb
```

* Both raw and aggregate data in this example are in _tidy_ form; the
  cross-tabulated data is not.

* Cross-tabulated data on $p$ variables is arranged in a $p$-way array.

* The cross-tabulated data can be converted to the tidy aggregate form
  using `as.data.frame`:

```{r}
class(xtb)
head(as.data.frame(xtb))
```

The variable `xtb` corresponds to the data set `HairEyeColor` in the
`datasets` package.
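
As a quick check of the round trip between the two forms, the following
sketch converts `HairEyeColor` to the tidy aggregate form and back with
`xtabs`. The names `hec_df` and `hec_tab` are introduced only for this
illustration.

```{r}
# Tidy aggregate form: one row per combination of levels, counts in Freq
hec_df <- as.data.frame(HairEyeColor)
head(hec_df)

# Back to a 3-way array: one dimension per variable
hec_tab <- xtabs(Freq ~ Hair + Eye + Sex, data = hec_df)
dim(hec_tab)

# The counts agree cell by cell with the original table
all(as.vector(hec_tab) == as.vector(HairEyeColor))
```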


### Working With Categorical Variables

Categorical variables are usually represented as:

* character vectors
* factors.

Some advantages of factors:

* more control over ordering of levels
* levels are preserved when forming subsets

Most plotting and modeling functions will convert character vectors to
factors with levels ordered alphabetically.
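
A small sketch of these points, using a made-up vector `sizes` that is
not part of the survey data:

```{r}
sizes <- c("small", "large", "medium", "small", "large")

# Default conversion: levels in alphabetical order
levels(factor(sizes))

# Explicit levels give the natural ordinal order
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_f)

# A subset of the factor keeps all levels, so the empty category
# still appears in a table
table(sizes_f[sizes_f != "medium"])

# The same subset of the character vector loses the category
table(sizes[sizes != "medium"])
```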

Some standard R functions for working with factors include

* `factor` creates a factor from another type of variable
* `levels` returns the levels of a factor
* `reorder` orders the levels by a summary of another variable
* `relevel` moves a particular level to the first position as a baseline
* `droplevels` removes levels not in the variable.
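
The following sketch applies these functions to a small hypothetical
data frame, `fdemo`, made up for illustration:

```{r}
fdemo <- data.frame(group = factor(c("a", "b", "c", "b", "c", "c")),
                    value = c(5, 1, 3, 2, 4, 6))
levels(fdemo$group)

# reorder: order levels by the mean of value within each level
levels(reorder(fdemo$group, fdemo$value))

# relevel: make "c" the first (baseline) level
levels(relevel(fdemo$group, ref = "c"))

# droplevels: discard levels with no remaining observations
levels(droplevels(fdemo$group[fdemo$group != "a"]))
```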

The `tidyverse` package `forcats` adds some more tools, including

* `fct_inorder` creates a factor with levels ordered by first appearance
* `fct_infreq` orders levels by decreasing frequency
* `fct_rev` reverses the levels
* `fct_recode` changes the names of factor levels
* `fct_relevel` moves one or more levels
* `fct_c` concatenates two or more factors, combining their levels
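
A brief sketch of these helpers on a made-up factor `eyes` (hypothetical
data, not the survey):

```{r}
library(forcats)

eyes <- factor(c("Hazel", "Blue", "Brown", "Blue", "Brown", "Brown"))
levels(eyes)                              # default: alphabetical

levels(fct_inorder(eyes))                 # order of first appearance
levels(fct_infreq(eyes))                  # most frequent level first
levels(fct_rev(eyes))                     # reversed level order
levels(fct_recode(eyes, Light = "Blue"))  # new name = "old name"
levels(fct_relevel(eyes, "Hazel"))        # move "Hazel" to the front
levels(fct_c(eyes, factor("Green")))      # concatenate two factors
```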


## Bar Charts For Frequencies

### Basics

The bar chart is often used to show the frequencies of a categorical
variable.

By default, `geom_bar` uses `stat = "count"` and maps its result to
the `y` aesthetic. This is suitable for raw data:

```{r}
ggplot(raw) + geom_bar(aes(x = Hair))
```

For a nominal variable it is often better to order the bars by
decreasing frequency:

```{r}
library(forcats)
ggplot(mutate(raw, Hair = fct_infreq(Hair))) + geom_bar(aes(x = Hair))
```

If the data have already been aggregated, then you need to specify
`stat = "identity"` as well as the variable containing the counts as
the `y` aesthetic:

```{r}
ggplot(agg) + geom_bar(aes(x = Hair, y = n), stat = "identity")
```

An alternative is to use `geom_col`.

For aggregated data, reordering can be based on the computed counts
using `reorder`:

```{r}
agg_ord <- mutate(agg, Hair = reorder(Hair, -n, sum))
```

* `-n` is used to order largest to smallest;
* the default summary used by `reorder` is `mean`; `sum` is better here.

```{r}
ggplot(agg_ord) + geom_col(aes(x = Hair, y = n))
```
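
The choice between `mean` and `sum` only matters when the levels have
different numbers of rows; when every level has the same number of
rows, as appears to be the case for `agg`, both give the same order.
The sketch below uses a hypothetical data frame, `uneven`, to show a
case where they differ:

```{r}
# Hypothetical aggregated counts: level "a" has one row, level "b" has three
uneven <- data.frame(g = factor(c("a", "b", "b", "b")),
                     n = c(10, 4, 4, 4))

# Default summary (mean): a = 10, b = 4, so "a" is ranked largest
levels(reorder(uneven$g, -uneven$n))

# Sum: a = 10, b = 12, so "b" is ranked largest, matching the
# stacked bar heights that geom_col would draw
levels(reorder(uneven$g, -uneven$n, sum))
```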

### Adding a Grouping Variable

Mapping the `Eye` variable to `fill` in `ggplot` produces a _stacked
bar chart_.

An alternative, specified with `position = "dodge"`, is a _side by
side_ bar chart, or a _clustered_ bar chart.

For the side by side chart in particular it may be useful to also
reorder the `Eye` color levels.

```{r}
ecols <- c(Brown = "brown2", Blue = "blue2",
           Hazel = "darkgoldenrod3", Green = "green4")
agg_ord <- mutate(agg,
                  Hair = reorder(Hair, -n, sum),
                  Eye = reorder(Eye, -n, sum))
p1 <- ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n, fill = Eye)) +
    scale_fill_manual(values = ecols)
p2 <- ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n, fill = Eye), position = "dodge") +
    scale_fill_manual(values = ecols)
grid.arrange(p1, p2, nrow = 1)
```

Faceting can be used to bring in additional variables:

```{r}
p1 + facet_wrap(~ Sex)
```

The counts shown here may not be the most relevant features for
understanding the joint distributions of these variables.


## Pie Charts and Doughnut Charts

Pie charts can be viewed as stacked bar charts in polar coordinates:

```{r}
hcols <- c(Black = "black", Brown = "brown3",
           Red = "brown1", Blond = "lightgoldenrod1")
p1 <- ggplot(agg_ord) +
    geom_col(aes(x = 1, y = n, fill = Hair), position = "fill") +
    coord_polar(theta = "y") +
    scale_fill_manual(values = hcols)
p2 <- ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n, fill = Hair)) +
    scale_fill_manual(values = hcols)
grid.arrange(p1, p2, nrow = 1)
```

The axes and grid lines are not helpful for the pie chart and can be
removed with some _theme_ settings.

Using faceting we can also separately show the distributions for men
and women:

```{r}
p3 <- p1 + facet_wrap(~ Sex) +
    theme_bw() +
    theme(axis.title = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank())
p3
```

Doughnut charts are a variant that has recently become popular in the
media:

```{r}
p4 <- p3 + xlim(0, 1.5)
p4
```

The center is often used for annotation:

```{r}
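# Use one row per Sex so each panel gets a single label
# rather than one overplotted label per data row
p4 + geom_text(aes(x = 0, y = 0, label = Sex),
               data = distinct(agg_ord, Sex)) +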
    theme(strip.background = element_blank(),
          strip.text = element_blank())
```


## Some Notes

* Pie charts are effective for judging part/whole relationships.
* Pie charts are not very effective for comparing proportions.
* 3D pie charts are popular and a very bad idea. An example
  ([Fig. 6.61](https://www.dropbox.com/s/tlehzi3kb6ikbsz/6.61.3DIllustration.png?dl=0))
  from Andy Kirk's book (2016),
  [_Data Visualization: A Handbook for Data Driven Design_](http://book.visualisingdata.com/home):

![](img/badpie.png)

Stacked bar charts with equal heights are an alternative for
representing part-whole relationships:

```{r}
ggplot(agg) +
    geom_col(aes(x = Sex, y = n, fill = Hair), position = "fill") +
    scale_fill_manual(values = hcols)
```
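
The proportions displayed by `position = "fill"` can also be computed
directly. This sketch works from the `HairEyeColor` table rather than
`agg`, and the names `hair_by_sex` and `hair_prop` are introduced only
for illustration:

```{r}
# Collapse the 3-way table over eye color: Hair (dim 1) by Sex (dim 3)
hair_by_sex <- margin.table(HairEyeColor, c(1, 3))
hair_by_sex

# Proportions of each hair color within each sex
hair_prop <- prop.table(hair_by_sex, margin = 2)
round(hair_prop, 3)

# Each column sums to one, like the equal-height bars
colSums(hair_prop)
```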

Another alternative is a _waffle chart_, sometimes also called a
_square pie chart_.

![](img/waffle.png)

The [`waffle`](https://github.com/hrbrmstr/waffle) package is one R
implementation of this idea.
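
The idea can also be sketched by hand with `geom_tile`. The code below
is a rough illustration that does not use the `waffle` package or its
API: it draws one square per respondent, and the grid height of 16 rows
and the names `waffle_src`, `people`, and `waffle_sketch` are choices
made for this example.

```{r}
library(ggplot2)

# One row per respondent, sorted so each hair color forms a contiguous block
waffle_src <- as.data.frame(HairEyeColor)
people <- data.frame(Hair = sort(rep(waffle_src$Hair, times = waffle_src$Freq)))

# Fill the grid column by column; with 592 respondents and 16 rows
# the grid has 37 full columns
n_rows <- 16
idx <- seq_len(nrow(people)) - 1
people$row <- idx %% n_rows
people$col <- idx %/% n_rows

waffle_cols <- c(Black = "black", Brown = "brown3",
                 Red = "brown1", Blond = "lightgoldenrod1")
waffle_sketch <- ggplot(people) +
    geom_tile(aes(x = col, y = row, fill = Hair), color = "white") +
    scale_fill_manual(values = waffle_cols) +
    coord_equal() +
    theme_void()
waffle_sketch
```

Using one square per individual avoids the rounding needed when
proportions are mapped to a fixed grid such as 10 by 10.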

<!--
pareto chart

spine plots
-->