--- title: "Visualizing Proportions" output: html_document: toc: yes code_folding: show code_download: true --- ```{r setup, include = FALSE, message = FALSE} source(here::here("setup.R")) knitr::opts_chunk$set(collapse = TRUE, message = FALSE, fig.height = 5, fig.width = 6, fig.align = "center") set.seed(12345) library(dplyr) library(ggplot2) library(lattice) library(gridExtra) library(patchwork) source(here::here("datasets.R")) ``` ## Categorical Data Categorical data can be * nominal, qualitative * ordinal For visualization, the main difference is that ordinal data suggests a particular display order. Purely categorical data can come in a range of formats. The most common are * raw data: individual observations; * aggregated data: counts for each unique combination of levels; * cross-tabulated data. ### Raw Data ```{r, echo = FALSE} ah <- as.data.frame(HairEyeColor) raw <- ah[rep(seq_len(nrow(ah)), times = ah$Freq), ][-4] raw <- raw[sample(seq_len(nrow(raw))), ] row.names(raw) <- NULL ``` Raw data for a survey of individuals that records hair color, eye color, and gender of `r nrow(raw)` individuals might look like this: ```{r} head(raw) ``` ### Aggregated Data One way to aggregate raw categorical data is to use `count` from `dplyr`: ```{r} library(dplyr) agg <- count(raw, Hair, Eye, Sex) head(agg) ``` ### Cross-Tabulated Data Cross-tabulated data can be produced from aggregate data using `xtabs`: ```{r} xtabs(n ~ Hair + Eye + Sex, data = agg) ``` Cross-tabulated data can be produced from raw data using `table`: ```{r} xtb <- table(raw) xtb ``` Both raw and aggregate data in this example are in _tidy_ form; the cross-tabulated data is not. Cross-tabulated data on $p$ variables is arranged in a $p$-way array. The cross-tabulated data can be converted to the tidy aggregate form using `as.data.frame`: ```{r} class(xtb) head(as.data.frame(xtb)) ``` The variable `xtb` corresponds to the data set `HairEyeColor` in the `datasets` package, ### Working With Categorical Variables Categorical variables are usually represented as: * character vectors * factors. Some advantages of factors: * more control over ordering of levels * levels are preserved when forming subsets * levels can reflect possible values not present in the data Most plotting and modeling functions will convert character vectors to factors with levels ordered alphabetically. Some standard R functions for working with factors include * `factor` creates a factor from another type of variable * `levels` returns the levels of a factor * `reorder` changes level order to match another variable * `relevel` moves a particular level to the first position as a base line * `droplevels` removes levels not in the variable. The `tidyverse` package `forcats` adds some more tools, including * `fct_inorder` creates a factor with levels ordered by first appearance * `fct_infreq` orders levels by decreasing frequency * `fct_rev` reverses the levels * `fct_recode` changes factor levels * `fct_relevel` moves one or more levels * `fct_c` merges two or more factors * `fct_collapse` merge some factor levels ## Bar Charts For Frequencies ### Basics A bar chart is often used to show the frequencies of a categorical variable. By default, `geom_bar` uses `stat = "count"` and maps its result to the `y` aesthetic. This is suitable for raw data: ```{r, class.source = "fold-hide"} thm <- theme_minimal() + theme(text = element_text(size = 16)) ggplot(raw) + geom_bar(aes(x = Hair), fill = "deepskyblue3") + thm ``` For a nominal variable it is often better to order the bars by decreasing frequency: ```{r, class.source = "fold-hide"} library(forcats) ggplot(mutate(raw, Hair = fct_infreq(Hair))) + geom_bar(aes(x = Hair), fill = "deepskyblue3") + thm ``` If the data have already been aggregated, then you need to either specify `stat = "identity"` as well as the variable containing the counts as the `y` aesthetic, or use `geom_col`: ```{r, class.source = "fold-hide"} ggplot(agg) + geom_col(aes(x = Hair, y = n), fill = "deepskyblue3") + thm ``` For aggregated data, reordering can be based on the computed counts using ```{r} agg_ord <- mutate(agg, Hair = reorder(Hair, -n, sum)) ``` * `-n` is used to order largest to smallest; * the default summary used by `reorder` is `mean`; `sum` is better here. ```{r, class.source = "fold-hide"} ggplot(agg_ord) + geom_col(aes(x = Hair, y = n), fill = "deepskyblue3") + thm ``` ### Adding a Grouping Variable Mapping the `Eye` variable to `fill` in `ggplot` produces a _stacked bar chart_. An alternative, specified with `position = "dodge"`, is a _side by side_ bar chart, or a _clustered_ bar chart. For the side by side chart in particular it may be useful to also reorder the `Eye` color levels. ```{r, fig.width = 8, class.source = "fold-hide"} ecols <- c(Brown = "brown2", Blue = "blue2", Hazel = "darkgoldenrod3", Green = "green4") agg_ord <- mutate(agg, Hair = reorder(Hair, -n, sum), Eye = reorder(Eye, -n, sum)) p1 <- ggplot(agg_ord) + geom_col(aes(x = Hair, y = n, fill = Eye)) + scale_fill_manual(values = ecols) + thm p2 <- ggplot(agg_ord) + geom_col(aes(x = Hair, y = n, fill = Eye), position = "dodge") + scale_fill_manual(values = ecols) + thm (p1 + guides(fill = "none")) | p2 ``` Faceting can be used to bring in additional variables: ```{r, fig.width = 8, class.source = "fold-hide"} p1 + facet_wrap(~ Sex) ``` The counts shown here may not be the most relevant features for understanding the joint distributions of these variables. ## Pie Charts and Doughnut Charts _Pie charts_ go by many different names (from a [Twitter thread](https://twitter.com/ElephantEating/status/1361039771414319106)): ```{r, echo = FALSE, out.width = "80%"} knitr::include_graphics(IMG("pienames.jpeg")) ``` Pie charts can be viewed as stacked bar charts in polar coordinates: ```{r, fig.width = 8, class.source = "fold-hide"} hcols <- c(Black = "black", Brown = "brown4", Red = "brown1", Blond = "lightgoldenrod1") p1 <- ggplot(agg_ord) + geom_col(aes(x = 1, y = n, fill = Hair), position = "fill") + coord_polar(theta = "y") + scale_fill_manual(values = hcols) + thm p2 <- ggplot(agg_ord) + geom_col(aes(x = Hair, y = n, fill = Hair)) + scale_fill_manual(values = hcols) + thm (p1 + guides(fill = "none")) | p2 ``` The axes and grid lines are not helpful for the pie chart and can be removed with some _theme_ settings. Using faceting we can also separately show the distributions for men and women: ```{r, fig.width = 8, class.source = "fold-hide"} pie_thm <- thm + theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank()) p3 <- p1 + facet_wrap(~ Sex) + pie_thm p3 ``` _Doughnut charts_ are a variant that has recently become popular in the media: ```{r, fig.width = 8, class.source = "fold-hide"} p4 <- p3 + xlim(0, 1.5) p4 ``` The center is often used for annotation: ```{r, fig.width = 8, class.source = "fold-hide"} p4 + geom_text(aes(x = 0, y = 0, label = Sex), size = 5) + theme(strip.background = element_blank(), strip.text = element_blank()) ``` An alternative to the polar coordinates approach uses `geom_arc_bar` and `stat_pie` from package `ggforce`: ```{r, fig.width = 8, class.source = "fold-hide"} #| warning: false library(ggforce) arrange(agg_ord, desc(Hair)) |> ggplot(aes(x0 = 0, y0 = 0, r0 = 0, r = 1, amount = n, fill = Hair)) + geom_arc_bar(stat = "pie", color = NA) + coord_fixed() + scale_fill_manual(values = hcols) + pie_thm + facet_wrap(~ Sex) ``` For doughnut charts: ```{r, fig.width = 8, class.source = "fold-hide"} arrange(agg_ord, desc(Hair)) |> ggplot(aes(x0 = 0, y0 = 0, r0 = 0.4, r = 1, amount = n, fill = Hair)) + geom_arc_bar(stat = "pie", color = NA) + geom_text(aes(x = 0, y = 0, label = Sex), size = 5) + coord_fixed() + scale_fill_manual(values = hcols) + pie_thm + theme(strip.background = element_blank(), strip.text = element_blank()) + facet_wrap(~ Sex) ``` ## Some Notes Pie charts are effective for judging part/whole relationships. Pie charts can be effective for comparing proportions to * one half * one quarter Pie charts are not very effective for comparing proportions to each other. 3D pie charts are popular and a very bad idea. An example ([Fig. 6.61](https://www.dropbox.com/s/tlehzi3kb6ikbsz/6.61.3DIllustration.png?dl=0)) from Andy Kirk's book (2016), [_Data Visualization: A Handbook for Data Driven Design_](http://book.visualisingdata.com/home): ```{r, echo = FALSE, out.width = "50%"} knitr::include_graphics(IMG("badpie.png")) ``` Pie charts are widely used for political data. * With the right ordering, pie charts are very good at showing which coalitions of parties can form a majority. * When no one candidate earns a majority of the votes, pie charts do not show which candidate has earned a plurality very well. * Good orientation and factor ordering can help. ```{r, fig.width = 8, class.source = "fold-hide"} elect <- geofacet::election |> group_by(candidate) |> summarize(votes = sum(votes)) p1 <- ggplot(elect) + geom_col(aes(x = 1, y = votes, fill = candidate), position = "fill") + coord_polar(theta = "y", start = -1) + xlim(c(-0.5, 1.5)) + scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80), Clinton = scales::muted("blue", 50, 70), Other = "grey")) + pie_thm p2 <- mutate(elect, candidate = factor(candidate, c("Clinton", "Other", "Trump"))) |> ggplot() + geom_col(aes(x = 1, y = votes, fill = candidate), position = "fill") + coord_polar(theta = "y") + xlim(c(-0.5, 1.5)) + scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80), Clinton = scales::muted("blue", 50, 70), Other = "grey")) + pie_thm p3 <- ggplot(elect) + geom_col(aes(x = candidate, y = 100 * (votes / sum(votes)), fill = candidate)) + scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80), Clinton = scales::muted("blue", 50, 70), Other = "grey")) + labs(y = "percent") + thm + theme(axis.text.x = element_blank(), axis.title.x = element_blank()) (p1 + guides(fill = "none")) + (p2 + guides(fill = "none")) + p3 ``` ## Some Alternatives ### Stacked Bar Charts Stacked bar charts with equal heights, or filled bar charts, are an alternative for representing part-whole relationships. * Top and bottom proportions are easy to compare. * Comparing proportions to one half and one quarter is harder. ```{r, class.source = "fold-hide"} ggplot(agg) + geom_col(aes(x = Sex, y = n, fill = Hair), position = "fill") + scale_fill_manual(values = hcols) + thm ``` ### Waffle Charts Another alternative is a _waffle chart_, sometimes also called a _square pie chart_. ```{r, echo = FALSE, out.width = "50%"} knitr::include_graphics(IMG("waffle.png")) ``` The [`waffle`](https://github.com/hrbrmstr/waffle) package is one R implementation of this idea. Currently the development version on GitHub is needed for the following examples. Showing the counts: ```{r, fig.width = 7, class.source = "fold-hide"} library(waffle) stopifnot(packageVersion("waffle") >= "1.0.1") ggplot(arrange(agg, Hair), aes(values = n, fill = Hair)) + geom_waffle(n_rows = 18, flip = TRUE, color = "white", size = 0.33, na.rm = FALSE) + coord_equal() + facet_wrap(~ Sex) + scale_fill_manual(values = hcols) + theme_minimal() + theme_enhance_waffle() ``` ```{r, eval = TRUE, echo = FALSE} ggplot(arrange(agg, Hair), aes(values = n, fill = Hair)) + geom_waffle(n_rows = 10, flip = TRUE, color = "white", size = 0.33, na.rm = FALSE) + coord_equal() + facet_grid(Sex ~ Eye) + scale_fill_manual(values = hcols) + theme_minimal() + theme_enhance_waffle() ``` Showing the proportions: ```{r, fig.width = 7, class.source = "fold-hide"} round_pct <- function(n) { pct <- 100 * (n / sum(n)) nn <- floor(pct) if (sum(nn) < 100) { rem <- pct - nn idx <- sort(order(rem), decreasing = TRUE)[seq_len(100 - sum(nn))] nn[idx] <- nn[idx] + 1 } nn } group_by(agg, Sex) |> mutate(pct = round_pct(n)) |> ungroup() |> ggplot(aes(values = pct, fill = Hair)) + geom_waffle(n_rows = 10, flip = TRUE, color = "white", size = 0.33, na.rm = FALSE) + coord_equal() + facet_wrap(~ Sex) + scale_fill_manual(values = hcols) + theme_minimal() + theme_enhance_waffle() ``` ```{r, eval = FALSE, echo = FALSE} group_by(agg, Sex, Eye) |> mutate(pct = round_pct(n)) |> ungroup() |> arrange(pct, Hair) |> ggplot(aes(values = pct, fill = Hair)) + geom_waffle(n_rows = 10, flip = TRUE, color = "white", size = 0.33, na.rm = FALSE) + coord_equal() + facet_grid(Sex ~ Eye) + scale_fill_manual(values = hcols) + theme_minimal() + theme_enhance_waffle() ``` ## Population Pyramids Bar charts for two groups can be shown back to back. ```{r, class.source = "fold-hide"} mutate(agg, Hair = reorder(Hair, n, sum)) |> ggplot(aes(x = ifelse(Sex == "Male", n, -n), y = Hair, fill = Sex)) + geom_col() + xlab("Count") + thm ``` This is often used for showing age distributions by sex for populations; the result is called a [_population pyramid_](https://www.visualcapitalist.com/us-population-pyramid-1980-2050/). Age distribution data for many countries and years is available from a [Census Bureau website](https://www.census.gov/data-tools/demo/idb/). Data files for 2020 for [Germany](https://stat.uiowa.edu/~luke/data/germany-2020.csv) and [Nigeria](https://stat.uiowa.edu/~luke/data/nigeria-2020.csv) are available locally. ```{r, class.source = "fold-hide"} if (! file.exists("germany-2020.csv")) download.file("https://stat.uiowa.edu/~luke/data/germany-2020.csv", "germany-2020.csv") if (! file.exists("nigeria-2020.csv")) download.file("https://stat.uiowa.edu/~luke/data/nigeria-2020.csv", "nigeria-2020.csv") gm_pop <- read.csv("germany-2020.csv", skip = 1) |> filter(Age != "Total") |> mutate(Age = fct_inorder(Age)) ni_pop <- read.csv("nigeria-2020.csv", skip = 1) |> filter(Age != "Total") |> mutate(Age = fct_inorder(Age)) ``` Combining the data sets allows a side by side comparison of the counts: ```{r, class.source = "fold-hide"} library(tidyr) pop2 <- bind_rows(mutate(gm_pop, Country = "Germany"), mutate(ni_pop, Country = "Nigeria")) |> select(Age, Male = Male.Population, Female = Female.Population, Country) |> pivot_longer(Male : Female, names_to = "Sex", values_to = "n") ggplot(pop2) + geom_col(aes(x = ifelse(Sex == "Male", n, -n), y = Age, fill = Sex)) + facet_wrap(~ Country) + scale_x_continuous( labels = function(n) scales::comma(abs(n))) + xlab("Count") + thm + theme(legend.position = "top") ``` The different shapes are evident, but are harder to see than they could be because of the difference in total population: ```{r, class.source = "fold-hide"} group_by(pop2, Country) |> summarize(Population = sum(n)) |> ungroup() |> mutate(Population = scales::comma(Population)) |> knitr::kable(format = "html", align = "lr") |> kableExtra::kable_styling(full_width = FALSE) ``` Using a group mutate we can compute sex/age group percentages within each country: ```{r, class.source = "fold-hide"} group_by(pop2, Country) |> mutate(pct = 100 * n / sum(n)) |> ungroup() |> ggplot() + geom_col(aes(x = ifelse(Sex == "Male", pct, -pct), y = Age, fill = Sex)) + facet_wrap(~ Country) + scale_x_continuous( labels = function(x) scales::percent(abs(x / 100))) + xlab("Percent") + thm + theme(legend.position = "top") ``` ```{r, eval = FALSE, echo = FALSE} p <- ggplot(gm_pop) + geom_col(aes(x = Male.Population, y = Age, fill = "Male")) + geom_col(aes(x = -Female.Population, y = Age, fill = "Female")) + thm p | (p %+% ni_pop) ``` ## Multiple Categorical Variables Visualizing the distribution of multiple categorical variables involves visualizing counts and proportions. Distributions can be viewed as * joint distributions; * conditional distributions. When one variable (or several) can be viewed as a response and others as predictors then it is common to focus on the conditional distribution of the response given the predictors. The most common approaches use variants of bar and area charts. The resulting plots are often called [_mosaic plots_](https://en.wikipedia.org/wiki/Mosaic_plot). ## Two Data Sets ### Hair and Eye Color ```{r} HairEyeColorDF <- as.data.frame(HairEyeColor) head(HairEyeColorDF) ``` Marginal distributions of the variables: ```{r, fig.height = 3, fig.width = 9, class.source = "fold-hide"} p1 <- ggplot(HairEyeColorDF) + geom_col(aes(Sex, Freq), fill = "deepskyblue3") + thm p2 <- mutate(HairEyeColorDF, Hair = reorder(Hair, -Freq, sum)) |> ggplot() + geom_col(aes(Hair, Freq), fill = "deepskyblue3") + thm p3 <- ggplot(HairEyeColorDF) + geom_col(aes(Eye, Freq), fill = "deepskyblue3") + thm p1 | p2 | p3 ``` ### Arthritis Data The `vcd` package includes the data frame `Arthritis` with several variables for 84 patients in a clinical trial for a treatment for rheumatoid arthritis. ```{r, message = FALSE} data(Arthritis, package = "vcd") head(Arthritis) ``` * The `Improved` variable is the response. * The predictors are `Treatment`, `Sex`, and `Age`. Counts for the categorical predictors: ```{r} xtabs(~ Sex, Arthritis) ``` ```{r} xtabs(~ Treatment, Arthritis) ``` ```{r} xtabs(~ Treatment + Sex, data = Arthritis) ``` Joint distribution of the predictors: ```{r, class.source = "fold-hide"} ggplot(Arthritis) + geom_histogram(aes(x = Age), binwidth = 10, fill = "deepskyblue3", color = "black") + facet_grid(Treatment ~ Sex) + thm ``` Conditional distribuiton of age, given sex and treatment: ```{r, class.source = "fold-hide"} ggplot(Arthritis) + geom_histogram(aes(x = Age, y = after_stat(density)), binwidth = 10, fill = "deepskyblue3", color = "black") + facet_grid(Treatment ~ Sex) + thm ``` ## Bar Charts ### Hair and Eye Color Default bar charts show the individual count or joint proportions. For the hair-eye color aggregated data counts: ```{r, class.source = "fold-hide"} ggplot(HairEyeColorDF) + geom_col(aes(x = Eye, y = Freq, fill = Sex)) + facet_wrap(~ Hair) + thm ``` Joint proportions: ```{r, class.source = "fold-hide"} ggplot(mutate(HairEyeColorDF, Prop = Freq / sum(Freq))) + geom_col(aes(x = Eye, y = Prop, fill = Sex)) + facet_wrap(~ Hair) + thm ``` * Differing frequencies of the hair colors are visible. * Conditional distributions of eye color within hair color are harder to compare. Showing conditional distributions requires computing proportions within groups. For the joint conditional distribution of sex and eye color given hair color: ```{r, class.source = "fold-hide"} group_by(HairEyeColorDF, Hair) |> mutate(Prop = Freq / sum(Freq)) |> ungroup() |> ggplot() + geom_col(aes(x = Eye, y = Prop, fill = Sex)) + facet_wrap(~ Hair) + thm ``` * It is easier to compare the skewness of the eye color distributions for black, brown, and red hair. * Assessing the proportion of females or males withing the different groups is possible but challenging since it requires relative length comparisons. To more clearly see the that the proportion of females among subjects with blond hair and blue eyes is higher than for other hair/eye color combinations we can look at the conditional distribution of sex given hair and eye color. ```{r, class.source = "fold-hide"} group_by(HairEyeColorDF, Hair, Eye) |> mutate(Prop = Freq / sum(Freq)) |> ungroup() |> ggplot() + geom_col(aes(x = Eye, y = Prop, fill = Sex)) + facet_wrap(~ Hair, nrow = 1) + thm + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` This plot can also be obtained using `position = "fill"`. ```{r, eval = FALSE, class.source = "fold-hide"} ggplot(HairEyeColorDF) + geom_col(aes(x = Eye, y = Freq, fill = Sex), position = "fill") + facet_wrap(~ Hair, nrow = 1) + thm + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` One drawback: This visualization no longer shows that some of the hair/eye color combinations are more common than others. ### Arthritis Data For the raw arthritis data, `geom_bar` computes the aggregate counts and produces a stacked bar chart by default: ```{r} p <- ggplot(Arthritis, aes(x = Sex, fill = Improved)) + facet_wrap(~ Treatment) p + geom_bar() + scale_fill_brewer(palette = "Blues") + thm ``` Specifying `position = "dodge"` produces a side-by-side plot: ```{r, class.source = "fold-hide"} p + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Blues") + thm ``` There are no cases of male patients on placebo reporting `Some` improvement, resulting in wider bars for the other options. One way to produce a zero height bar: * aggregate with `count`, and * use `complete` from `tidyr` ```{r, class.source = "fold-hide"} library(tidyr) comp_counts <- count(Arthritis, Treatment, Sex, Improved) |> complete(Treatment, Sex, Improved, fill = list(n = 0)) ggplot(comp_counts, aes(x = Sex, y = n, fill = Improved)) + geom_col(position = "dodge") + facet_wrap(~ Treatment) + scale_fill_brewer(palette = "Blues") + thm ``` Another option is to use the `preserve = "single"` option with `position_dodge`. ```{r, class.source = "fold-hide"} p + geom_bar(position = position_dodge( preserve = "single")) + scale_fill_brewer(palette = "Blues") + thm ``` Showing conditional distributions of `Improved` given different levels of `Treatment` and `Sex`: ```{r, class.source = "fold-hide"} group_by(comp_counts, Treatment, Sex) |> mutate(prop = n / sum(n)) |> ungroup() |> ggplot() + geom_col(aes(x = Sex, y = prop, fill = Improved), position = "dodge") + facet_wrap(~ Treatment) + scale_fill_brewer(palette = "Blues") + thm ``` Stacked bar charts with height one are another option to make these conditional distributions easier to compare: ```{r, class.source = "fold-hide"} p + geom_bar(position = "fill") + scale_fill_brewer(palette = "Blues") + thm ``` Ordering of variables affects which comparisons are easier. * A researcher might want to emphasize the differential response among males and females. * A patient might prefer to be able to focus on whether the treatment is effective for them: ```{r, class.source = "fold-hide"} ggplot(Arthritis, aes(x = Treatment, fill = Improved)) + geom_bar(position = "fill") + scale_fill_brewer(palette = "Blues") + thm + facet_wrap(~ Sex) ``` Some notes; * The stacked bar chart is effective for two categories, and a few more if they are ordered. * Providing a visual indication of uncertainty in the estimates is a challenge. The standard errors in this case are around 0.1. * The proportions of each treatment group that are male or female could be encoded in the bar widths. * The resulting plot is called a _spine plot_. * Basic `ggplot2` does not seem to make this easy. ## Spine Plots _Spine plots_ are a special case of [_mosaic plots_](https://en.wikipedia.org/wiki/Mosaic_plot), and can be seen as a generalization of stacked bar plots. For a spine plot the proportions for the categories of a predictor variable are encoded in the bar widths. The `ggmosaic` package provides support for mosaic plots in the `ggplot` framework. (It can be a little rough around the edges.) Spine plots are provided by the base graphics function `spineplot` and the `vcd` function `spine`. `vcd` plots are built on the `grid` graphics system, like `lattice` and `ggplot2` graphics. A spine plot for the distribution of `Improved` given `Sex` in the `Treated` group: ```{r, warning = FALSE, class.source = "fold-hide"} library(ggmosaic) filter(Arthritis, Treatment == "Treated") |> mutate(Improved = fct_rev(Improved)) |> ggplot() + geom_mosaic(aes(x = product(Sex), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + facet_wrap(~ Treatment) + thm + labs(x = "", y = "Improved") ``` Spine plots for `Treatment` groups using faceting: ```{r, class.source = "fold-hide"} library(ggmosaic) mutate(Arthritis, Improved = fct_rev(Improved)) |> ggplot() + geom_mosaic(aes(x = product(Sex), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + facet_wrap(~ Treatment) + thm + labs(x = "", y = "Improved") ``` Spine plots for the arthritis data, faceted on `Sex`: ```{r, class.source = "fold-hide"} library(ggmosaic) mutate(Arthritis, Improved = fct_rev(Improved)) |> ggplot() + geom_mosaic(aes(x = product(Treatment), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + facet_wrap(~ Sex) + thm + labs(x = "", y = "Improved") ``` This no longer shows the Female/Male imbalance. For aggregate counts use the weight aesthetic: ```{r, class.source = "fold-hide"} mutate(HairEyeColorDF, Sex = fct_rev(Sex)) |> ggplot() + geom_mosaic(aes(weight = Freq, x = product(Hair), fill = Sex)) + thm + labs(x = "Hair", y = "") ``` Spine plots of `Sex` within `Eye` color, faceted on `Hair` color: ```{r, class.source = "fold-hide"} mutate(HairEyeColorDF, Sex = fct_rev(Sex)) |> ggplot() + geom_mosaic(aes(weight = Freq, x = product(Eye), fill = Sex)) + thm + labs(x = "Eye", y = "") + facet_wrap(~ Hair, nrow = 1, scales = "free_x") + theme(legend.position = "top", axis.text.y = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(expand = c(0, 0)) ``` The relative sizes of the groups on the `x` (eye color) axis are shown within the facets. The sizes of the faceted variable (hair color) groups are not reflected. _Double decker plots_ try to address this. ## Doubledecker Plots _Doubledecker plots_ can be viewed as a generalization of spine plots to multiple predictors. Package `vcd` provides the `doubledecker` function. This function can use a formula interface. ```{r, class.source = "fold-hide"} arth_pal <- RColorBrewer::brewer.pal(3, "Blues") arth_gp <- grid::gpar(fill = arth_pal) vcd::doubledecker(Improved ~ Treatment + Sex, data = Arthritis, gp = arth_gp, margins = c(2, 5, 4, 2)) ``` ```{r, class.source = "fold-hide"} vcd::doubledecker(Improved ~ Sex + Treatment, data = Arthritis, gp = arth_gp, margins = c(2, 5, 4, 2)) ``` Using `ggmosaic`: ```{r, class.source = "fold-hide"} mutate(Arthritis, Improved = fct_rev(Improved)) |> ggplot() + geom_mosaic( aes(x = product(Sex, Treatment), fill = Improved), divider = ddecker()) + scale_fill_brewer(palette = "Blues", direction = -1) + thm + theme(axis.text.x = element_text(angle = 15, hjust = 1)) + labs(x = "", y = "") ``` ```{r, class.source = "fold-hide"} mutate(Arthritis, Improved = fct_rev(Improved)) |> ggplot() + geom_mosaic( aes(x = product(Treatment, Sex), fill = Improved), divider = ddecker()) + scale_fill_brewer(palette = "Blues", direction = -1) + thm + theme(axis.text.x = element_text(angle = 15, hjust = 1)) + labs(x = "", y = "") ``` ## Mosaic Plots _Mosaic plots_ recursively partition the axes to represent counts of categorical variables as rectangles. * Base graphics provides `mosaicplot`; * `vcd` provides `mosaic`. Both support a formula interface. A Mosaic plot for the predictors `Sex` and `Treatment`: ```{r, class.source = "fold-hide"} vcd::mosaic(~ Sex + Treatment, data = Arthritis) ``` Adding `Improved` to the joint distribution: ```{r, class.source = "fold-hide"} vcd::mosaic(~ Sex + Treatment + Improved, data = Arthritis) ``` Identifying `Improved` as the response: ```{r} vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis) ``` Matching the doubledecker plots: ```{r, class.source = "fold-hide"} vcd::mosaic( Improved ~ Treatment + Sex, data = Arthritis, split_vertical = c(TRUE, TRUE, FALSE)) ``` ```{r, class.source = "fold-hide"} vcd::mosaic( Improved ~ Sex + Treatment, data = Arthritis, split_vertical = c(TRUE, TRUE, FALSE)) ``` Some variants using `ggmosaic`: ```{r, class.source = "fold-hide"} ggplot(mutate(Arthritis, Sex = fct_rev(Sex))) + geom_mosaic( aes(x = product(Treatment, Sex))) + coord_flip() + labs(x = "", y = "") ``` ```{r, class.source = "fold-hide"} ggplot(mutate(Arthritis, Sex = fct_rev(Sex), Improved = fct_rev(Improved))) + geom_mosaic(aes(x = product(Improved, Treatment, Sex))) + coord_flip() ``` A mosaic plot for all bivariate marginals: ```{r, class.source = "fold-hide"} pairs(xtabs(~ Sex + Treatment + Improved, data = Arthritis)) ``` ## Spinograms and CD Plots _Spinograms_ and _CD plots_ show the conditional distribution of a categorical variable given the value of a numeric variable. * Spinograms use the same binning as a histogram and then create a spine plot. * CD plots use a smoothing or density estimation approach. A spinogram for `Improved` against `Age`: ```{r, class.source = "fold-hide"} ArthT <- filter(Arthritis, Treatment == "Treated") |> mutate(Improved = fct_rev(Improved)) arthT_gp <- grid::gpar(fill = rev(arth_gp$fill)) vcd::spine(Improved ~ Age, data = ArthT, gp = arthT_gp, breaks = 5) ``` An analogous plot created with `ggmosaic` by binning the `Age` variable: ```{r, class.source = "fold-hide"} Arth <- mutate(Arthritis, AgeBin = cut(Arthritis$Age, seq(20, by = 10, len = 7)), Improved = fct_rev(Improved)) filter(Arth, Treatment == "Treated") |> count(Improved, AgeBin) |> ggplot() + geom_mosaic(aes(weight = n, x = product(AgeBin), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + theme_minimal() + theme(axis.title = element_blank()) ``` A facet grid can be used to create spinograms for each of the `Sex`/`Treatment` combinations: ```{r, fig.width = 8, class.source = "fold-hide"} ggplot(count(Arth, Improved, Sex, Treatment, AgeBin)) + geom_mosaic(aes(weight = n, x = product(AgeBin), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + theme_minimal() + facet_grid(Treatment ~ Sex) + theme(axis.title = element_blank()) + theme(axis.text.x = element_text(angle = 35, hjust = 1), axis.text.y = element_blank()) ``` A [spinogram](https://flowingdata.com/2021/08/02/decline-of-u-s-vaccination-rate-compared-against-europes/) in the media (NYT, August 2021): ```{r, echo = FALSE, out.width = "70%"} knitr::include_graphics(IMG("Vaccination-rates-with-Marimekko-1536x925.png")) ``` Some plots in a [Twitter thread](https://twitter.com/jburnmurdoch/status/1503420660869214213?s=20&t=RZxOXiHZ0eXkV6WXMIuVVA): ```{r, echo = FALSE, out.width = "90%"} knitr::include_graphics(IMG("covid-spinogram.png")) ``` ```{r, echo = FALSE, out.width = "90%"} knitr::include_graphics(IMG("covid-mort.png")) ``` CD plots estimate the conditional density of the `x` variable given the levels of `y`, weighted by the marginal proportions of `y` and use these to estimate cumulative probabilities. * The slice at a particular `x` level visualizes the conditional distribution of `y` given `x` at that level. * `geom_density` with `position = stack` is one way to create a CD plot. * The `cd_plot` function from the `vcd` package produces a CD plot using `grid` graphics. * The `cdplot` function from the base `graphics` package provides the same plots using base graphics. CD plots for the `Treated` group: ```{r, fig.height = 6, class.source = "fold-hide"} filter(Arthritis, Treatment == "Treated") |> ggplot(aes(x = Age, fill = Improved)) + geom_density(position = "fill", bw = 5) + scale_fill_brewer(palette = "Blues") + facet_wrap(~ Sex, ncol = 1) + thm ``` CD plots for all combinations end up with one group of size one and one of size zero, which produces a non-useful plot for one combination: ```{r, fig.width = 8, fig.height = 6, class.source = "fold-hide"} count(Arthritis, Treatment, Sex, Improved) |> complete(Treatment, Sex, Improved, fill = list(n = 0)) |> filter(n < 2) ggplot(Arthritis, aes(x = Age, fill = Improved)) + geom_density(position = "fill", bw = 5) + scale_fill_brewer(palette = "Blues") + facet_grid(Treatment ~ Sex) + thm ``` ## Uncertainty Representation Categorical data are often analyzed by fitting models representing conditional independence structures. * Plotting residuals from these models can help assess how well they fit. * `vcd::mosaic` supports using color to represent magnitude of residuals for comparing to a simple independence model. For the `Arthritis` data, observed counts and expected counts under an independence model assuming `Treatment` and `Improved` are independent can be visualized as mosaic plots: ```{r, fig.width = 8, fig.height = 4, class.source = "fold-hide"} ## there are easier ways do do this ... v <- count(Arthritis, Treatment, Improved) pT <- group_by(v, Treatment) |> summarize(n = sum(n)) |> mutate(pT = n / sum(n)) |> select(-n) pI <- group_by(v, Improved) |> summarize(n = sum(n)) |> mutate(pI = n / sum(n)) |> select(-n) v <- left_join(v, pT, "Treatment") |> left_join(pI, "Improved") |> mutate(p = pT * pI, Treatment = fct_rev(Treatment)) po <- ggplot(v) + geom_mosaic(aes(weight = n, x = product(Improved, Treatment), fill = Improved)) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") + labs(title = "Observed Proportions") + thm + coord_flip() + theme(axis.text.y = element_text(angle = 90, hjust = 0)) pe <- ggplot(v) + geom_mosaic(aes(weight = p, x = product(Improved, Treatment), fill = Improved)) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") + labs(title = "Expected Proportions") + thm + coord_flip() + theme(axis.text.y = element_text(angle = 90, hjust = 0)) po + pe ``` A plot for assessing the fit of the residuals between the observed and expected data under a model assuming independence of `Treatment` and `Improved` produces: ```{r, class.source = "fold-hide"} vcd::mosaic(~ Treatment + Improved, data = Arthritis, gp = vcd::shading_max) ``` Another visualization of the residuals is the _association plot_ produced by `assoc`: ```{r, class.source = "fold-hide"} vcd::assoc(~ Treatment + Improved, data = Arthritis, gp = vcd::shading_max) ``` ## References > The vignette [_Residual-Based Shadings in > vcd_](https://cran.r-project.org/package=vcd/vignettes/residual-shadings.pdf) > in the `vcd` package. > Zeileis, Achim, David Meyer, and Kurt Hornik. "Residual-based > shadings for visualizing (conditional) independence." Journal of > Computational and Graphical Statistics 16, no. 3 (2007): 507-525. > The vignette [_Working with categorical data with R and the vcd and vcdExtra packages_](https://www.datavis.ca/courses/VCD/vcd-tutorial.pdf) in the `vcdExtra` package. Several other experimental mosaic plot implementations are available for `ggplot`. ## Some Other Visualizations ### Tree Maps Tree maps show hierarchically structured (or tree-tructured) data. * Each branch is represented by a rectangle. * Leaf node tiles have areas proportional to the value of a variable. * Tiles are often colored to reflect the value of another variable. The package `treemapify` provides a `ggplot`-based implementation. The data set `G20` includes some variables on the G-20 member countries: ```{r, class.source = "fold-hide"} library(treemapify) select(G20, region, country, gdp_mil_usd, hdi) |> knitr::kable(format = "html") |> kableExtra::kable_styling( full_width = FALSE) ``` A simple tree with only one level, the individual countries: ```{r, include = FALSE} library(nomnoml) ``` ```{nomnoml, echo = FALSE} #padding: 25 #fontsize: 18 #linewidth: 2 #direction: down [Root] -> [Germany] [Root] -> [France] [Root] -> [Japan] [Root] -> [China] [Root] -> [...] ``` A corresponding tree map based on `gdp_mil_usd`: ```{r, class.source = "fold-hide"} ggplot(G20, aes(area = gdp_mil_usd)) + geom_treemap() + geom_treemap_text(aes(label = country), color = "white") ``` A tree grouping by region: ```{nomnoml, echo = FALSE} #padding: 25 #fontsize: 18 #linewidth: 2 #direction: down [Root] -> [Europe] [Root] -> [Asia] [Root] -> [...] [Europe] -> [Germany] [Europe] -> [France] [Europe] -> [....] [Asia] -> [Japan] [Asia] -> [China] [Asia] -> [.....] ``` A corresponding tree map: ```{r, class.source = "fold-hide"} ggplot(G20, aes(area = gdp_mil_usd, subgroup = region)) + geom_treemap() + geom_treemap_text(aes(label = country), color = "white") + geom_treemap_subgroup_border( color = "red") + geom_treemap_subgroup_text(color = "red") ``` A tree map showing GDP values for the G-20 members, grouped by region, with fill mapped to the country's Human Development Index: ```{r, class.source = "fold-hide"} ggplot(G20, aes(area = gdp_mil_usd, fill = hdi, subgroup = region)) + geom_treemap() + geom_treemap_text(aes(label = country), color = "white") + geom_treemap_subgroup_border() + geom_treemap_subgroup_text( color = "lightgrey") ``` A treemap representing the distribution of eye color within hair color: ```{r, class.source = "fold-hide"} group_by(agg, Eye, Hair) |> summarize(n = sum(n)) |> ungroup() |> ggplot(aes(area = n, subgroup = Hair)) + geom_treemap(aes(fill = Eye), color = "white") + geom_treemap_subgroup_text() + geom_treemap_subgroup_border( color = "black", size = 6) + geom_treemap_text(aes(label = Eye), color = "grey90") + scale_fill_manual(values = ecols) + guides(fill = "none") ``` A treemap representing proportions for `Improved` within `Treatment` within `Sex` for the Arthritis data: ```{r, class.source = "fold-hide"} count(Arth, Treatment, Improved, Sex) |> ggplot(aes(area = n, subgroup = Sex, fill = Improved, subgroup2 = Treatment)) + geom_treemap() + geom_treemap_subgroup_text() + scale_fill_brewer(palette = "Blues", direction = -1) + geom_treemap_subgroup_border() + geom_treemap_subgroup2_text(place = "top", size = 20) ``` ### Alluvial plots These are also known as * _parallel sets_, or * _Sankey diagrams_. They can be viewed as a parallel coordinates plot for categorical data. Several implementations are available, including: * `geom_parallel_sets` from `ggforce`; * `geom_sankey` from [`ggsankey`](https://github.com/davidsjoberg/ggsankey); * `geom_alluvium` from `ggalluvial`. ```{r, echo = FALSE, eval = FALSE} HDF <- mutate(HairEyeColorDF, Sex = fct_rev(Sex)) library(alluvial) pal <- RColorBrewer::brewer.pal(3, "Set1") with(HDF, alluvial(Hair, Eye, Sex, freq = Freq, col = pal[as.numeric(Sex)])) ``` Hair/Eye color using the `ggforce` package: ```{r, class.source = "fold-hide"} pal <- RColorBrewer::brewer.pal(3, "Set1") HDF <- mutate(HairEyeColorDF, Sex = fct_rev(Sex)) library(ggforce) sHDF <- gather_set_data(HDF, 3 : 1) sHDF <- mutate(sHDF, x = fct_inorder(as.factor(x))) #**** simplify this? ggplot(sHDF, aes(x, id = id, split = y, value = Freq)) + geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes( axis.width = 0.1) + geom_parallel_sets_labels( colour = 'white') + scale_fill_manual( values = c(Male = pal[2], Female = pal[1])) + theme_void() + guides(fill = "none") ``` ```{r, echo = FALSE, eval = FALSE} with(count(Arth, Improved, Treatment, Sex), alluvial(Improved, Treatment, Sex, freq = n, col = pal[as.numeric(Sex)])) ``` Arthritis data with `ggforce`: ```{r, class.source = "fold-hide"} sArth <- mutate(Arth, Improved = factor(Improved, ordered = FALSE)) |> count(Improved, Treatment, Sex) |> gather_set_data(3 : 1) sArth <- mutate(sArth, x = fct_inorder(factor(x)), Sex = fct_rev(Sex)) ggplot(sArth, aes(x, id = id, split = y, value = n)) + geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes(axis.width = 0.1) + geom_parallel_sets_labels( colour = 'white') + scale_fill_manual( values = c(Male = pal[2], Female = pal[1])) + theme_void() + guides(fill = "none") ``` ```{r, eval = FALSE, echo = FALSE} ## refugee data (needs adjusting text sizes or labels) ## could also try chord diagram R <- count(ref.dtl, Destination, Nationality, wt = Cases) nats <- (count(R, Nationality, wt = n) |> slice_max(n, n = 10))$Nationality dsts <- (count(R, Destination, wt = n) |> slice_max(n, n = 10))$Destination mutate(R, Nationality = fct_other(Nationality, keep = nats), Destination = fct_other(Destination, keep = dsts)) |> gather_set_data(1 : 2) |> ggplot(aes(x, id = id, split = y, value = n)) + geom_parallel_sets(aes(fill = Nationality), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes(axis.width = 0.1) + geom_parallel_sets_labels(colour = 'white', size = 3) + theme_void() + guides(fill = "none") ``` ### Stream Graphs [Stream graphs](https://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/) are a generalization of stacked bar charts plotted against a numeric variable. In some cases the origins of the bars are shifted to improve some aspect of the overall visualization. An early example is the [Baby Name Voyager](https://www.bewitched.com/namevoyager.html). (A more recent variant is also [available](https://namerology.com/baby-name-grapher/).) A [NY Times visualization](https://archive.nytimes.com/www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html?_r=0) of movie box office results is another example. ([Blog post with a static version](https://flowingdata.com/2008/02/25/ebb-and-flow-of-box-office-receipts-over-past-20-years/)). Some R implementations on GitHub: * [`ggTimeSeries`](https://github.com/AtherEnergy/ggTimeSeries) * [`streamgraph`](https://hrbrmstr.github.io/streamgraph/) (uses D3) * [`ggstream`](https://github.com/davidsjoberg/ggstream). A stream graph for movie genres (these are not mutually exclusive): ```{r, warning = FALSE, class.source = "fold-hide"} ## install with: remotes::install_github("hrbrmstr/streamgraph") library(streamgraph) library(tidyverse) genres <- c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance") mymovies <- select(ggplot2movies::movies, year, one_of(genres)) mymovies_long <- pivot_longer( mymovies, -year, names_to = "genre", values_to = "value") movie_counts <- count(mymovies_long, year, genre) streamgraph(movie_counts, "genre", "n", "year") ``` ## Reading Chapters [_Visualizing proportions_](https://clauswilke.com/dataviz/visualizing-proportions.html) and [_Visualizing nested proportions_](https://clauswilke.com/dataviz/nested-proportions.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). ## Interactive Tutorial An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](`r WLNK("tutorials/proportions.Rmd")`). You can run the tutorial with ```{r, eval = FALSE} STAT4580::runTutorial("proportions") ``` You can install the current version of the `STAT4580` package with ```{r, eval = FALSE} remotes::install_gitlab("luke-tierney/STAT4580") ``` You may need to install the `remotes` package from CRAN first. ## Exercises 1. Figure A shows a bar char of the flights leaving NYC airports in 2013 for each day of the week. Figure B shows the market share of five major internet browsers in 2015. ```{r, message = FALSE, echo = FALSE, fig.height = 4, fig.width = 8} library(lubridate) library(dplyr) library(nycflights13) library(ggplot2) library(patchwork) thm <- theme_minimal() + theme(text = element_text(size = 15)) p1 <- mutate(flights, date = make_date(year, month, day), wday = wday(date, label = TRUE, abbr = TRUE)) |> ggplot(aes(x = wday)) + geom_bar(fill = "deepskyblue3") + scale_y_continuous(expand = c(0, 0)) + labs(title = "Flights from NYC in 2013", subtitle = "By Day of the Week", caption = "Figure A", x = NULL, y = "Number of Flights") + thm browsers2015 <- data.frame(Browser = c("Opera", "Safari", "Firefox", "Chrome", "IE"), share = c(2, 22, 21, 27, 29)) p2 <- ggplot(browsers2015, aes(x = Browser, y = share)) + geom_col(fill = "deepskyblue3") + scale_y_continuous(expand = c(0, 0)) + labs(title = "Browser Market Share", subtitle = "2015", caption = "Figure B", x = NULL, y = "Percent") + thm p1 | p2 ``` For which of these bar charts would it be better to reorder the categories so the bars are ordered from largest to smallest? a. Yes for Figure A. No for Figure B. b. No for Figure A. Yes for Figure B. c. Yes for both. d. No for both. 2. Consider the stacked bar chart `p1` and the spine plot `p2` for the hair and eye color data produced by the following code: ```{r, eval = FALSE} library(dplyr) library(ggplot2) library(ggmosaic) ecols <- c(Brown = "brown2", Blue = "blue2", Hazel = "darkgoldenrod3", Green = "green4") HairEyeColorDF <- as.data.frame(HairEyeColor) p0 <- ggplot(HairEyeColorDF) + scale_fill_manual(values = ecols) + theme_minimal() p1 <- p0 + geom_col(aes(x = Hair, y = Freq / sum(Freq), fill = Eye)) p2 <- p0 + geom_mosaic(aes(x = product(Hair), fill = Eye, weight = Freq)) ``` Use the two plots to answer: Which hair color has the highest proportion of individuals with green eyes? a. Black b. Brown c. Red d. Blond Which plot makes it easiest to answer this question? 3. Use the plots of the previous question to answer: The proportion of individuals with red hair is closest to: a. 5% b. 8% c. 12% d. 20% Which plot makes it easiest to answer this question?