--- title: "More on Categorical Data" output: html_document: toc: yes --- ```{r global_options, include=FALSE} knitr::opts_chunk\$set(collapse=TRUE) ``` ```{r, include = FALSE} library(lattice) library(tidyverse) library(gridExtra) set.seed(12345) ``` Visualizing the distribution of multiple categorical variables involves visualizing counts and proportions. Distributions can be viewed as - joint distributions; - conditional distributions. The most common approaches use variants of bar and area charts. The resulting plots are often called _mosaic plots_. ## Data Formats Purely categorical data can be * in raw form, one row per observation * aggregated into counts for unique level combinations * cross-tabulated Data that includes categorical and numerical variables is usually in raw form. * `count` from `dplyr` produces aggregated data from raw data. * `as.data.frame` converts cross-tabulated data to aggregated form. The notes on [visualizing a categorical variable](catone.html) provide more details and examples. ## Two Data Sets ### Hair and Eye Color The cross-tabulated data in `HairEyeColor` was used [previously](catone.html): ```{r} HairEyeColorDF <- as.data.frame(HairEyeColor) head(HairEyeColorDF) ``` Marginal distributions of the variables: ```{r, fig.height = 3} grid.arrange(ggplot(HairEyeColorDF)+geom_col(aes(Sex, Freq)), ggplot(HairEyeColorDF)+geom_col(aes(Hair, Freq)), ggplot(HairEyeColorDF)+geom_col(aes(Eye, Freq)), nrow = 1) ``` ### Arthritis Data The `vcd` package includes the data frame `Arthritis` with several variables for 84 patients in a clinical trial for a treatment for rheumatoid arthritis. - The `Improved` variable is the response. - The predictors are `Treatment`, `Sex`, and `Age`. ```{r, message = FALSE} library(vcd) head(Arthritis) ``` Counts for the categorical predictors: ```{r} xtabs(~ Treatment + Sex, data = Arthritis) ``` Age distribution: ```{r} library(lattice) histogram(~ Age | Sex * Treatment, data = Arthritis) ``` ## Bar Charts ### Hair and Eye Color Default bar charts show the individual count or joint proportions. For the hair-eye color aggregated data counts: ```{r} ggplot(HairEyeColorDF) + geom_col(aes(Eye, Freq, fill = Sex)) + facet_wrap(~Hair) ``` Joint proportions: ```{r} ggplot(mutate(HairEyeColorDF, Prop = Freq / sum(Freq))) + geom_col(aes(Eye, Prop, fill = Sex)) + facet_wrap(~Hair) ``` * Differing frequencies of the hair colors are visible. * Conditional distributions of eye color within hair color are harder to compare. Showing conditional distributions requires computing proportions within groups. For the joint conditional distribution of sex and eye color given hair color: ```{r} peh <- mutate(group_by(HairEyeColorDF, Hair), Prop = Freq / sum(Freq)) ggplot(peh) + geom_col(aes(Eye, Prop, fill = Sex)) + facet_wrap(~Hair) ``` * It is easier to compare the skewness of the eye color distributions for black, brown, and red hair. * Assessing the proportion of females or males withing the different groups is possible but challenging since it requires relative length comparisons. To more clearly see the that the proportion of females among subjects with blond hair and blue eyes is higher than for other hair/eye color combinations we can look at the conditional distribution of sex given hair and eye color ```{r} pseh <- mutate(group_by(HairEyeColorDF, Hair, Eye), Prop = Freq / sum(Freq)) ggplot(pseh) + geom_col(aes(Eye, Prop, fill = Sex)) + facet_wrap(~Hair, nrow = 1) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` This plot can also be obtained as ```{r, eval = FALSE} ggplot(HairEyeColorDF) + geom_col(aes(Eye, Freq, fill = Sex), position = "fill") + facet_wrap(~Hair, nrow = 1) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` This visualization no longer shows the that some of the hair/eye color combinations are more common than others. ### Arthritis Data For the raw arthritis data, `geom_bar` computes the aggregate counts and produces a stacked bar chart by default: ```{r} p <- ggplot(Arthritis, aes(Sex, fill = Improved)) + facet_wrap(~Treatment) p + geom_bar() ``` Specifying `position = "dodge"` produces a side-by-side plot: ```{r} p + geom_bar(position = "dodge") ``` There are no cases of male patients on placebo reporting `Some` improvement, resulting in wider bars for the other options. One way to produce a zero height bar: * aggregate with `count`, and * use `complete` from `tidyr` ```{r} library(tidyr) atsi <- count(Arthritis, Treatment, Sex, Improved) atsi <- complete(atsi, Treatment, Sex, Improved, fill = list(n = 0)) ggplot(atsi, aes(Sex, n, fill = Improved)) + geom_col(position = "dodge") + facet_wrap(~Treatment) ``` Showing conditional distributions instead of joint proportions: ```{r} patsi <- mutate(group_by(atsi, Treatment, Sex), prop = n / sum(n)) ggplot(patsi) + geom_col(aes(x = Sex, y = prop, fill = Improved), position = "dodge") + facet_wrap(~Treatment) ``` Stacked bar charts with height one are another option for make conditional distributions easier to compare: ```{r} p + geom_bar(position = "fill") ``` Ordering of variables affects which comparisons are easier. * A researcher might want to emphasize the differential response among males and females. * A patient might prefer to be able to focus on whether the treatment is effective for them: ```{r} ggplot(Arthritis, aes(Treatment, fill = Improved)) + geom_bar(position = "fill") + facet_wrap(~Sex) ``` - The stacked bar chart is effective for two categories, and a few more if they are ordered. - Providing a visual indication of uncertainty in the estimates is a challenge. The standard errors are around 0.1. - The proportions of each treatment group that are male or female could be encoded in the bar width. - The resulting plot is called a _spine plot_. - Neither `ggplot2` nor `lattice` seem to make this easy. ## Spine Plot _Spine plots_ are a special case of _mosaic plots_, and can be seen as a generalization of stacked bar plots. - The proportions for the categories of a predictor variable are encoded in the bar widths. - Spine plots are provided by the base graphics function `spineplot` and the `vcd` function `spline`. - `vcd` plots are built on the `grid` graphics system, like `lattice` and `ggplot2` graphics. - The `ggmosaic` package provides support for mosaic plots in the `ggplot` framework. Spine plots for the arthritis data using `spineplot`: ```{r} library(forcats) Arth <- mutate(Arthritis, Improved = fct_rev(Improved)) ArthP <- filter(Arth, Treatment == "Placebo") ArthT <- filter(Arth, Treatment == "Treated") opar <- par(mfrow = c(1, 2)) spineplot(Improved ~ Sex, data = ArthP, main = "Placebo") spineplot(Improved ~ Sex, data = ArthT, main = "Treatment") par(opar) ``` `spineplot` can use raw data or cross-tabulated data: ```{r} spineplot(xtabs(Freq ~ Hair + Sex, HairEyeColorDF)[, 2:1]) ``` Using `geom_mosaic` from `ggmosaic` and the raw arthritis data: ```{r, message = FALSE} library(ggmosaic) ggplot(Arth) + geom_mosaic(aes(x = product(Sex), fill = Improved)) + facet_wrap(~Treatment) ``` For aggregate counts use the `weight` aesthetic: ```{r} HDF <- mutate(HairEyeColorDF, Sex = fct_rev(Sex)) ggplot(HDF) + geom_mosaic(aes(weight = Freq, x = product(Hair), fill = Sex)) ``` ## Doubledecker Plots _Doubledecker plots_ can be viewed as a generalization of spine plots to multiple predictors. Package `vcd` provides the `doubledecker` function. ```{r} doubledecker(Improved ~ Treatment + Sex, data = Arthritis) doubledecker(Improved ~ Sex + Treatment, data = Arthritis) ``` `doubledecker` also works with raw or cross-tabulated data: ```{r} doubledecker(xtabs(Freq ~ Hair + Eye + Sex, HairEyeColorDF)) ``` Using `ggmosaic`: ```{r} ggplot(Arth) + geom_mosaic(aes(x = product(Sex, Treatment), fill = Improved), divider = ddecker()) + theme(axis.text.x = element_text(angle = 15, hjust = 1)) ggplot(Arth) + geom_mosaic(aes(x = product(Treatment, Sex), fill = Improved), divider = ddecker()) + theme(axis.text.x = element_text(angle = 15, hjust = 1)) ggplot(HDF) + geom_mosaic(aes(weight = Freq, x = product(Eye, Hair), fill = Sex), divider = ddecker()) + theme(axis.text.x = element_text(angle = 35, hjust = 1)) ``` ## Mosaic Plots _Mosaic plots_ recursively partition the axes to represent counts of categorical variables as rectangles. - Base graphics provides `mosaicplot`; - `vcd` provides `mosaic`. A Mosaic plot for the predictors `Sex` and `Treatment`: ```{r} mosaicplot(~ Sex + Treatment, data = Arthritis) ``` Adding `Improved` to the joint distribution: ```{r} vcd::mosaic(~ Sex + Treatment + Improved, data = Arthritis) ``` Identifying `Improved` as the response: ```{r} vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis) ``` Matching the doubledecker plots: ```{r} vcd::mosaic(Improved ~ Treatment + Sex, data = Arthritis, split_vertical=c(TRUE, TRUE, FALSE)) vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis, split_vertical=c(TRUE, TRUE, FALSE)) ``` A mosaic plot for all bivariate marginals: ```{r} pairs(xtabs(~ Sex + Treatment + Improved, data = Arthritis)) ``` ## Spinograms and CD Plots _Spinograms_ and _CD plots_ show the conditional distribution of a categorical variable given the value of a numeric variable. - Spinograms use the same binning as a histogram and then create a spine plot. - CD plots use a smoothing or density estimation approach. A spinogram for `Improved` against `Age`: ```{r} spineplot(Improved ~ Age, data = ArthT) ``` This is based on the histogram ```{r} hist(ArthT\$Age, breaks = 13) ``` Spinograms for each level of `Sex` and `Treated`: ```{r} ArthTF <- filter(ArthT, Sex == "Female") ArthTM <- filter(ArthT, Sex == "Male") ArthPF <- filter(ArthP, Sex == "Female") ArthPM <- filter(ArthP, Sex == "Male") opar <- par(mfrow = c(2, 2)) spineplot(Improved ~ Age, data = ArthTM, main = "TM") spineplot(Improved ~ Age, data = ArthTF, main = "TF") spineplot(Improved ~ Age, data = ArthPM, main = "PM") spineplot(Improved ~ Age, data = ArthPF, main = "PF") par(opar) ``` CD plots estimate the conditional density of the `x` variable given the levels of `y`, weighted by the marginal proportions of `y` and use these to estimate cumulative probabilities. - The slice at a particular `x` level visualizes the conditional distribution of `y` given `x` at that level. - The `cd_plot` function from the `vcd` package produces a CD plot using `grid` graphics. - The `cdplot` function from the base `graphics package provides the same plots using base graphics. Analogous CD plots for the `Arthritis` data: ```{r, eval = FALSE} cd_plot(Improved ~ Age, data = ArthT) cd_plot(Improved ~ Age, data = ArthTM, main = "TM") cd_plot(Improved ~ Age, data = ArthTF, main = "TF") cd_plot(Improved ~ Age, data = ArthPM, main = "PM") cd_plot(Improved ~ Age, data = ArthPF, main = "PF") ``` ```{r, echo = FALSE} grid.newpage() pushViewport(viewport(layout = grid.layout(2, 2))) pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1)) cd_plot(Improved ~ Age, data = ArthTM, main = "TM", newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2)) cd_plot(Improved ~ Age, data = ArthTF, main = "TF", newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 1)) cd_plot(Improved ~ Age, data = ArthPM, main = "PM", newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2)) cd_plot(Improved ~ Age, data = ArthPF, main = "PF", newpage = FALSE) popViewport() popViewport() ``` It may be helpful to consider an age grouping like ```{r, eval = FALSE} cut(Arthritis\$Age, c(0, 40, 60, 100)) ``` - This loses some information but may simplify the analysis. - This can also allow the reduction of data to cell counts to help with very large data sets. ## Uncertainty Representation Categorical data are often analyzed by fitting models representing conditional independence structures. - Plotting residuals from these models can help assess how well they fit. - `mosaic` supports using color to represent magnitude of residuals for comparing to a simple independence model. For the `Arthritis` data, assessing independence of `Treatment` and `Improved` produces: ```{r} vcd::mosaic(~ Treatment + Improved, data = Arthritis, gp = shading_max) ``` Another visualization of the residuals is the _association plot_ produced by `assoc`: ```{r} assoc(~ Treatment + Improved, data = Arthritis, gp = shading_max) ``` ## References - The vignette [_Residual-Based Shadings in vcd_](https://cran.r-project.org/web/packages/vcd/vignettes/residual-shadings.pdf) in the `vcd` package. - Zeileis, Achim, David Meyer, and Kurt Hornik. "Residual-based shadings for visualizing (conditional) independence." Journal of Computational and Graphical Statistics 16, no. 3 (2007): 507-525. - The vignette [_Working with categorical data with R and the vcd and vcdExtra packages_](https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf) in the `vcdExtra` package. Several other experimental mosaic plot implementations are available for `ggplot`. ## Some Other Visualizations ### Stream Graphs - [Stream graphs](http://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/) are a generalization of stacked bar charts plotted against a numeric variable. - In some cases the origins of the bars are shifted to improve some aspect of the overall visualization. - An early example is the [Baby Name Voyager](http://www.babynamewizard.com/voyager). - A [NY Times visualization](http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html?_r=0) of movie box office results is another example. - Some R implementation on GitHub: - [`ggTimeSeries`](https://github.com/Ather-Energy/ggTimeSeries) - [`streamgraph`](https://hrbrmstr.github.io/streamgraph/), which uses D3. After installing `streamgraph` with ```{r, eval = FALSE} devtools::install_github("hrbrmstr/streamgraph") ``` a stream graph for movie genres (these are not mutually exclusive): ```{r} library(streamgraph) library(tidyverse) genres <- c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance") mymovies <- select(ggplot2movies::movies, year, one_of(genres)) mymovies_long <- gather(mymovies, genre, value, -year) movie_counts <- count(mymovies_long, year, genre) streamgraph(movie_counts, "genre", "n", "year") ``` ### Alluvial plots These are also known as - _parallel sets_, or - _Sankey diagrams_. They can be viewed as a parallel coordinates plot for categorical data. Using the `alluvial` package: ```{r} library(alluvial) pal = RColorBrewer::brewer.pal(2, "Set1") with(HDF, alluvial(Hair, Eye, Sex, freq = Freq, col = pal[as.numeric(Sex)])) with(count(Arth, Improved, Treatment, Sex), alluvial(Improved, Treatment, Sex,freq = n, col = pal[as.numeric(Sex)])) ``` Using package `ggforce`: ```{r} library(ggforce) sHDF <- gather_set_data(HDF, 1:3) sHDF <- mutate(sHDF, x = fct_inorder(x)) ggplot(sHDF, aes(x, id = id, split = y, value = Freq)) + geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes(axis.width = 0.1) + geom_parallel_sets_labels(colour = 'white') + scale_fill_manual(values = c(Male = pal[2], Female = pal[1])) sArth <- gather_set_data(count(Arth, Improved, Treatment, Sex), 1:3) sArth <- mutate(sArth, x = fct_inorder(x), Sex = fct_rev(Sex)) ggplot(sArth, aes(x, id = id, split = y, value = n)) + geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes(axis.width = 0.1) + geom_parallel_sets_labels(colour = 'white') + scale_fill_manual(values = c(Male = pal[2], Female = pal[1])) ```