---
title: "More on Categorical Data"
output:
  html_document:
    toc: yes
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

```{r, include = FALSE}
library(lattice)
library(tidyverse)
library(gridExtra)
set.seed(12345)
```

Visualizing the distribution of multiple categorical variables
involves visualizing counts and proportions.

Distributions can be viewed as

- joint distributions;
- conditional distributions.

The most common approaches use variants of bar and area charts.

The resulting plots are often called _mosaic plots_.

## Data Formats

Purely categorical data can be

* in raw form, one row per observation
* aggregated into counts for unique level combinations
* cross-tabulated

Data that includes categorical and numerical variables is usually in raw form.

* `count` from `dplyr` produces aggregated data from raw data.
* `as.data.frame` converts cross-tabulated data to  aggregated form.

The notes on [visualizing a categorical variable](catone.html) provide
more details and examples.


## Two Data Sets

### Hair and Eye Color

The cross-tabulated data in `HairEyeColor` was used
[previously](catone.html):

```{r}
HairEyeColorDF <- as.data.frame(HairEyeColor)
head(HairEyeColorDF)
```

Marginal distributions of the variables:

```{r, fig.height = 3}
grid.arrange(ggplot(HairEyeColorDF) + geom_col(aes(Sex, Freq)),
             ggplot(HairEyeColorDF) + geom_col(aes(Hair, Freq)),
             ggplot(HairEyeColorDF) + geom_col(aes(Eye, Freq)),
             nrow = 1)
```

### Arthritis Data

The `vcd` package includes the data frame `Arthritis` with several
variables for 84 patients in a clinical trial for a treatment for
rheumatoid arthritis.

- The `Improved` variable is the response.
- The predictors are `Treatment`, `Sex`, and `Age`.

```{r, message = FALSE}
library(vcd)
head(Arthritis)
```

Counts for the categorical predictors:
```{r}
xtabs(~ Treatment + Sex, data = Arthritis)
```
Age distribution:
```{r}
library(lattice)
histogram(~ Age | Sex * Treatment, data = Arthritis)
```

## Bar Charts

### Hair and Eye Color

Default bar charts show the individual count or joint proportions.

For the hair-eye color aggregated data counts:

```{r}
ggplot(HairEyeColorDF) +
    geom_col(aes(Eye, Freq, fill = Sex)) +
    facet_wrap(~ Hair)
```

Joint proportions:

```{r}
ggplot(mutate(HairEyeColorDF, Prop = Freq / sum(Freq))) +
    geom_col(aes(Eye, Prop, fill = Sex)) +
    facet_wrap(~ Hair)
```

* Differing frequencies of the hair colors are visible.
* Conditional distributions of eye color within hair color are harder
  to compare.

Showing conditional distributions requires computing proportions
within groups.

For the joint conditional distribution of sex and eye color given hair color:

```{r}
peh <- mutate(group_by(HairEyeColorDF, Hair), Prop = Freq / sum(Freq))
ggplot(peh) +
    geom_col(aes(Eye, Prop, fill = Sex)) +
    facet_wrap(~ Hair)
```

* It is easier to compare the skewness of the eye color distributions
  for black, brown, and red hair.

* Assessing the proportion of females or males withing the different
  groups is possible but challenging since it requires relative length
  comparisons.

To more clearly see the that the proportion of females among subjects
with blond hair and blue eyes is higher than for other hair/eye color
combinations we can look at the conditional distribution of sex given
hair and eye color

```{r}
pseh <- mutate(group_by(HairEyeColorDF, Hair, Eye), Prop = Freq / sum(Freq))
ggplot(pseh) +
    geom_col(aes(Eye, Prop, fill = Sex)) +
    facet_wrap(~ Hair, nrow = 1) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

This plot can also be obtained as

```{r, eval = FALSE}
ggplot(HairEyeColorDF) +
    geom_col(aes(Eye, Freq, fill = Sex), position = "fill") +
    facet_wrap(~ Hair, nrow = 1) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

This visualization no longer shows the that some of the hair/eye color
combinations are more common than others.


### Arthritis Data

For the raw arthritis data, `geom_bar` computes the aggregate counts
and produces a stacked bar chart by default:


```{r}
p <- ggplot(Arthritis, aes(Sex, fill = Improved)) + facet_wrap(~ Treatment)
p + geom_bar()
```

Specifying `position = "dodge"` produces a side-by-side plot:

```{r}
p + geom_bar(position = "dodge")
```

There are no cases of male patients on placebo reporting `Some`
improvement, resulting in wider bars for the other options.

One way to produce a zero height bar:

* aggregate with `count`, and
* use `complete` from `tidyr`


```{r}
library(tidyr)
atsi <- count(Arthritis, Treatment, Sex, Improved)
atsi <- complete(atsi, Treatment, Sex, Improved, fill = list(n = 0))
ggplot(atsi, aes(Sex, n, fill = Improved)) +
    geom_col(position = "dodge") +
    facet_wrap(~ Treatment)
```

Showing conditional distributions instead of joint proportions:

```{r}
patsi <- mutate(group_by(atsi, Treatment, Sex), prop = n / sum(n))
ggplot(patsi) +
    geom_col(aes(x = Sex, y = prop, fill = Improved),
             position = "dodge") +
    facet_wrap(~ Treatment)
```

Stacked bar charts with height one are another option for make
conditional distributions easier to compare:

```{r}
p + geom_bar(position = "fill")
```

Ordering of variables affects which comparisons are easier.

* A researcher might want to emphasize the differential response among
  males and females.
* A patient might prefer to be able to focus on whether the treatment
  is effective for them:

```{r}
ggplot(Arthritis, aes(Treatment, fill = Improved)) +
    geom_bar(position = "fill") +
    facet_wrap(~ Sex)
```

- The stacked bar chart is effective for two categories, and a few
  more if they are ordered.

- Providing a visual indication of uncertainty in the estimates is a
  challenge. The standard errors are around 0.1.

- The proportions of each treatment group that are male or female
  could be encoded in the bar width.

- The resulting plot is called a _spine plot_.

- Neither `ggplot2` nor `lattice` seem to make this easy.


## Spine Plot

_Spine plots_ are a special case of _mosaic plots_, and can be seen as
a generalization of stacked bar plots.

- The proportions for the categories of a predictor variable are
  encoded in the bar widths.

- Spine plots are provided by the base graphics function `spineplot`
  and the `vcd` function `spline`.

- `vcd` plots are built on the `grid` graphics system, like `lattice`
  and `ggplot2` graphics.

- The `ggmosaic` package provides support for mosaic plots in the
  `ggplot` framework.

Spine plots for the arthritis data using `spineplot`:

```{r}
library(forcats)
Arth <- mutate(Arthritis, Improved = fct_rev(Improved))
ArthP <- filter(Arth, Treatment == "Placebo")
ArthT <- filter(Arth, Treatment == "Treated")
opar <- par(mfrow = c(1, 2))
spineplot(Improved ~ Sex, data = ArthP, main = "Placebo")
spineplot(Improved ~ Sex, data = ArthT, main = "Treatment")
par(opar)
```

`spineplot` can use raw data or cross-tabulated data:

```{r}
spineplot(xtabs(Freq ~ Hair + Sex, HairEyeColorDF)[, 2 : 1])
```

Using `geom_mosaic` from `ggmosaic` and the raw arthritis data:

```{r, message = FALSE}
library(ggmosaic)
ggplot(Arth) +
    geom_mosaic(aes(x = product(Sex), fill = Improved)) +
    facet_wrap(~ Treatment)
```

For aggregate counts use the `weight` aesthetic:

```{r}
HDF <- mutate(HairEyeColorDF, Sex = fct_rev(Sex))
ggplot(HDF) +
    geom_mosaic(aes(weight = Freq, x = product(Hair), fill = Sex))
```


## Doubledecker Plots

_Doubledecker plots_ can be viewed as a generalization of spine plots
to multiple predictors.

Package `vcd` provides the `doubledecker` function.

```{r}
doubledecker(Improved ~ Treatment + Sex, data = Arthritis)
doubledecker(Improved ~ Sex + Treatment, data = Arthritis)
```

`doubledecker` also works with raw or cross-tabulated data:

```{r}
doubledecker(xtabs(Freq ~ Hair + Eye + Sex, HairEyeColorDF))
```

Using `ggmosaic`:

```{r}
ggplot(Arth) +
    geom_mosaic(aes(x = product(Sex, Treatment), fill = Improved),
                divider = ddecker()) +
    theme(axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(Arth) +
    geom_mosaic(aes(x = product(Treatment, Sex), fill = Improved),
                divider = ddecker()) +
    theme(axis.text.x = element_text(angle = 15, hjust = 1))

ggplot(HDF) +
    geom_mosaic(aes(weight = Freq, x = product(Eye, Hair), fill = Sex),
                divider = ddecker()) +
    theme(axis.text.x = element_text(angle = 35, hjust = 1))
```


## Mosaic Plots

_Mosaic plots_ recursively partition the axes to represent counts of
categorical variables as rectangles.

- Base graphics provides `mosaicplot`;
- `vcd` provides `mosaic`.

A Mosaic plot for the predictors `Sex` and `Treatment`:

```{r}
mosaicplot(~ Sex + Treatment, data = Arthritis)
```

Adding `Improved` to the joint distribution:

```{r}
vcd::mosaic(~ Sex + Treatment + Improved, data = Arthritis)
```

Identifying `Improved` as the response:

```{r}
vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis)
```

Matching the doubledecker plots:

```{r}
vcd::mosaic(Improved ~ Treatment + Sex, data = Arthritis,
            split_vertical = c(TRUE, TRUE, FALSE))
vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis,
            split_vertical = c(TRUE, TRUE, FALSE))
```

A mosaic plot for all bivariate marginals:

```{r}
pairs(xtabs(~ Sex + Treatment + Improved, data = Arthritis))
```


## Spinograms and CD Plots

_Spinograms_ and _CD plots_ show the conditional distribution of a
categorical variable given the value of a numeric variable.

- Spinograms use the same binning as a histogram and then create a
  spine plot.

- CD plots use a smoothing or density estimation approach.

A spinogram for `Improved` against `Age`:

```{r}
spineplot(Improved ~ Age, data = ArthT)
```

This is based on the histogram

```{r}
hist(ArthT$Age, breaks = 13)
```

Spinograms for each level of `Sex` and `Treated`:

```{r}
ArthTF <- filter(ArthT, Sex == "Female")
ArthTM <- filter(ArthT, Sex == "Male")
ArthPF <- filter(ArthP, Sex == "Female")
ArthPM <- filter(ArthP, Sex == "Male")

opar <- par(mfrow = c(2, 2))
spineplot(Improved ~ Age, data = ArthTM, main = "TM")
spineplot(Improved ~ Age, data = ArthTF, main = "TF")
spineplot(Improved ~ Age, data = ArthPM, main = "PM")
spineplot(Improved ~ Age, data = ArthPF, main = "PF")
par(opar)
```

CD plots estimate the conditional density of the `x` variable given the
levels of `y`, weighted by the marginal proportions of `y` and use
these to estimate cumulative probabilities.

- The slice at a particular `x` level visualizes the conditional
  distribution of `y` given `x` at that level.

- The `cd_plot` function from the `vcd` package produces a CD plot
  using `grid` graphics.

- The `cdplot` function from the base `graphics package provides the
  same plots using base graphics.

Analogous CD plots for the `Arthritis` data:

```{r, eval = FALSE}
cd_plot(Improved ~ Age, data = ArthT)

cd_plot(Improved ~ Age, data = ArthTM, main = "TM")
cd_plot(Improved ~ Age, data = ArthTF, main = "TF")
cd_plot(Improved ~ Age, data = ArthPM, main = "PM")
cd_plot(Improved ~ Age, data = ArthPF, main = "PF")
```

<!-- From
http://stackoverflow.com/questions/31077980/how-to-put-two-vcd-grid-graphics-in-a-single-plot -->

```{r, echo = FALSE}
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))

pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1))
cd_plot(Improved ~ Age, data = ArthTM, main = "TM", newpage = FALSE)
popViewport()

pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2))
cd_plot(Improved ~ Age, data = ArthTF, main = "TF", newpage = FALSE)
popViewport()

pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 1))
cd_plot(Improved ~ Age, data = ArthPM, main = "PM", newpage = FALSE)
popViewport()

pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2))
cd_plot(Improved ~ Age, data = ArthPF, main = "PF", newpage = FALSE)
popViewport()

popViewport()
```

It may be helpful to consider an age grouping like
```{r, eval = FALSE}
cut(Arthritis$Age, c(0, 40, 60, 100))
```

- This loses some information but may simplify the analysis.

- This can also allow the reduction of data to cell counts to help
  with very large data sets.


## Uncertainty Representation

Categorical data are often analyzed by fitting models representing
conditional independence structures.

- Plotting residuals from these models can help assess how well they
  fit.

- `mosaic` supports using color to represent magnitude of residuals
  for comparing to a simple independence model.

For the `Arthritis` data, assessing independence of `Treatment` and
`Improved` produces:

```{r}
vcd::mosaic(~ Treatment + Improved, data = Arthritis, gp = shading_max)
```

Another visualization of the residuals is the _association plot_
produced by `assoc`:

```{r}
assoc(~ Treatment + Improved, data = Arthritis, gp = shading_max)
```

## References

- The vignette [_Residual-Based Shadings in vcd_](https://cran.r-project.org/package=vcd/vignettes/residual-shadings.pdf) in the `vcd` package.

- Zeileis, Achim, David Meyer, and Kurt Hornik. "Residual-based
  shadings for visualizing (conditional) independence." Journal of
  Computational and Graphical Statistics 16, no. 3 (2007): 507-525.

<!--
- The vignette [_Working with categorical data with R and the vcd and
  vcdExtra
  packages_](https://cran.r-project.org/package=vcdExtra/vignettes/vcd-tutorial.pdf)
  in the `vcdExtra` package. -->

Several other experimental mosaic plot implementations are available
for `ggplot`.


## Some Other Visualizations

### Stream Graphs

-
  [Stream graphs](https://visualisingdata.com/2010/08/making-sense-of-streamgraphs/)
  are a generalization of stacked bar charts plotted against a numeric
  variable.

- In some cases the origins of the bars are shifted to improve some
  aspect of the overall visualization.

- An early example is the
  [Baby Name Voyager](https://www.bewitched.com/namevoyager.html).

- A [NY Times
  visualization](https://archive.nytimes.com/www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html?_r=0)
  of movie box office results is another example.

- Some R implementation on GitHub:

    - [`ggTimeSeries`](https://github.com/AtherEnergy/ggTimeSeries)
    - [`streamgraph`](https://hrbrmstr.github.io/streamgraph/), which uses D3.

After installing `streamgraph` with

```{r, eval = FALSE}
devtools::install_github("hrbrmstr/streamgraph")
```

a stream graph for movie genres (these are not mutually exclusive):

```{r}
library(streamgraph)
library(tidyverse)
genres <- c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance")
mymovies <- select(ggplot2movies::movies, year, one_of(genres))
mymovies_long <- gather(mymovies, genre, value, -year)
movie_counts <- count(mymovies_long, year, genre)
streamgraph(movie_counts, "genre", "n", "year")
```

### Alluvial plots

These are also known as

- _parallel sets_, or
- _Sankey diagrams_.

They can be viewed as a parallel coordinates plot for categorical data.

Using the `alluvial` package:

```{r}
library(alluvial)
pal <- RColorBrewer::brewer.pal(2, "Set1")
with(HDF,
     alluvial(Hair, Eye, Sex, freq = Freq, col = pal[as.numeric(Sex)]))

with(count(Arth, Improved, Treatment, Sex),
     alluvial(Improved, Treatment, Sex, freq = n, col = pal[as.numeric(Sex)]))
```

Using package `ggforce`:

```{r}
library(ggforce)
sHDF <- gather_set_data(HDF, 1 : 3)
sHDF <- mutate(sHDF, x = fct_inorder(factor(x)))
ggplot(sHDF, aes(x, id = id, split = y, value = Freq)) +
    geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) +
    geom_parallel_sets_axes(axis.width = 0.1) +
    geom_parallel_sets_labels(colour = "white") +
    scale_fill_manual(values = c(Male = pal[2], Female = pal[1]))

sArth <- mutate(Arth,
                Improved = factor(Improved, ordered = FALSE)) |>
    count(Improved, Treatment, Sex) |>
    gather_set_data(1 : 3)
sArth <- mutate(sArth, x = fct_inorder(factor(x)), Sex = fct_rev(Sex))
ggplot(sArth, aes(x, id = id, split = y, value = n)) +
    geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) +
    geom_parallel_sets_axes(axis.width = 0.1) +
    geom_parallel_sets_labels(colour = "white") +
    scale_fill_manual(values = c(Male = pal[2], Female = pal[1]))
```

<!--
ggplot(Arthritis) + geom_bar(aes(Improved, fill = Improved)) + facet_wrap(~Treatment + Sex)
ggplot(Arthritis) + geom_bar(aes(Improved, y = ..count.. / sum(..count..), fill = Improved)) + facet_wrap(~Treatment + Sex)
-->

<!--
joint vs conditional
variable order/what is primary comparison
-->