class: center, middle, title-slide .title[ # Visualizing Proportions ] .author[ ### Luke Tierney ] .institute[ ### University of Iowa ] .date[ ### 2023-05-06 ] --- layout: true <link rel="stylesheet" href="stat4580.css" type="text/css" /> <style type="text/css"> .remark-code { font-size: 85%; } </style> <!-- title based on Wilke's chapter --> ## Categorical Data --- Categorical data can be -- * nominal, qualitative -- * ordinal -- For visualization, the main difference is that ordinal data suggests a particular display order. -- Purely categorical data can come in a range of formats. -- The most common are -- * raw data: individual observations; -- * aggregated data: counts for each unique combination of levels; -- * cross-tabulated data. --- ### Raw Data .pull-left[ Raw data for a survey of individuals that records hair color, eye color, and gender of 592 individuals might look like this: ] .pull-right[ ```r head(raw) ## Hair Eye Sex ## 1 Brown Blue Male ## 2 Brown Brown Male ## 3 Brown Hazel Male ## 4 Blond Green Female ## 5 Brown Brown Female ## 6 Brown Hazel Male ``` ] --- ### Aggregated Data .pull-left[ One way to aggregate raw categorical data is to use `count` from `dplyr`: ] .pull-right[ ```r library(dplyr) agg <- count(raw, Hair, Eye, Sex) head(agg) ## Hair Eye Sex n ## 1 Black Brown Male 32 ## 2 Black Brown Female 36 ## 3 Black Blue Male 11 ## 4 Black Blue Female 9 ## 5 Black Hazel Male 10 ## 6 Black Hazel Female 5 ``` ] <!-- The `count_` function from `dplyr` allows the variables to use to be read from the data: ```r agg <- count_(raw, names(raw)) head(agg) ``` Apparently the "modern" way to do this is ```r count(raw, !!! syms(names(raw))) ## Hair Eye Sex n ## 1 Black Brown Male 32 ## 2 Black Brown Female 36 ## 3 Black Blue Male 11 ## 4 Black Blue Female 9 ## 5 Black Hazel Male 10 ## 6 Black Hazel Female 5 ## 7 Black Green Male 3 ## 8 Black Green Female 2 ## 9 Brown Brown Male 53 ## 10 Brown Brown Female 66 ## 11 Brown Blue Male 50 ## 12 Brown Blue Female 34 ## 13 Brown Hazel Male 25 ## 14 Brown Hazel Female 29 ## 15 Brown Green Male 15 ## 16 Brown Green Female 14 ## 17 Red Brown Male 10 ## 18 Red Brown Female 16 ## 19 Red Blue Male 10 ## 20 Red Blue Female 7 ## 21 Red Hazel Male 7 ## 22 Red Hazel Female 7 ## 23 Red Green Male 7 ## 24 Red Green Female 7 ## 25 Blond Brown Male 3 ## 26 Blond Brown Female 4 ## 27 Blond Blue Male 30 ## 28 Blond Blue Female 64 ## 29 Blond Hazel Male 5 ## 30 Blond Hazel Female 5 ## 31 Blond Green Male 8 ## 32 Blond Green Female 8 ``` --> --- ### Cross-Tabulated Data .pull-left[ Cross-tabulated data can be produced from aggregate data using `xtabs`: ] .pull-right[ ```r xtabs(n ~ Hair + Eye + Sex, data = agg) ## , , Sex = Male ## ## Eye ## Hair Brown Blue Hazel Green ## Black 32 11 10 3 ## Brown 53 50 25 15 ## Red 10 10 7 7 ## Blond 3 30 5 8 ## ## , , Sex = Female ## ## Eye ## Hair Brown Blue Hazel Green ## Black 36 9 5 2 ## Brown 66 34 29 14 ## Red 16 7 7 7 ## Blond 4 64 5 8 ``` ] --- .pull-left[ Cross-tabulated data can be produced from raw data using `table`: ] .pull-right[ ```r xtb <- table(raw) xtb ## , , Sex = Male ## ## Eye ## Hair Brown Blue Hazel Green ## Black 32 11 10 3 ## Brown 53 50 25 15 ## Red 10 10 7 7 ## Blond 3 30 5 8 ## ## , , Sex = Female ## ## Eye ## Hair Brown Blue Hazel Green ## Black 36 9 5 2 ## Brown 66 34 29 14 ## Red 16 7 7 7 ## Blond 4 64 5 8 ``` ] --- .pull-left[ Both raw and aggregate data in this example are in _tidy_ form; the cross-tabulated date is not. {{content}} ] -- Cross-tabulated data on `\(p\)` variables is arranged in a `\(p\)`-way array. {{content}} -- The cross-tabulated data can be converted to the tidy aggregate form using `as.data.frame`: -- .pull-right[ ```r class(xtb) ## [1] "table" head(as.data.frame(xtb)) ## Hair Eye Sex Freq ## 1 Black Brown Male 32 ## 2 Brown Brown Male 53 ## 3 Red Brown Male 10 ## 4 Blond Brown Male 3 ## 5 Black Blue Male 11 ## 6 Brown Blue Male 50 ``` {{content}} ] -- The variable `xtb` corresponds to the data set `HairEyeColor` in the `datasets` package, --- ### Working With Categorical Variables Categorical variables are usually represented as: -- * character vectors -- * factors. -- Some advantages of factors: -- * more control over ordering of levels -- * levels are preserved when forming subsets -- * levels can reflect possible values not present in the data -- Most plotting and modeling functions will convert character vectors to factors with levels ordered alphabetically. --- Some standard R functions for working with factors include * `factor` creates a factor from another type of variable * `levels` returns the levels of a factor * `reorder` changes level order to match another variable * `relevel` moves a particular level to the first position as a base line * `droplevels` removes levels not in the variable. -- The `tidyverse` package `forcats` adds some more tools, including * `fct_inorder` creates a factor with levels ordered by first appearance * `fct_infreq` orders levels by decreasing frequency * `fct_rev` reverses the levels * `fct_recode` changes factor levels * `fct_relevel` moves one or more levels * `fct_c` merges two or more factors * `fct_collapse` merge some factor levels --- layout: true ## Bar Charts For Frequencies --- ### Basics .pull-left[ A bar chart is often used to show the frequencies of a categorical variable. {{content}} ] -- By default, `geom_bar` uses `stat = "count"` and maps its result to the `y` aesthetic. {{content}} -- This is suitable for raw data: -- .pull-right[ .hide-code[ ```r thm <- theme_minimal() + theme(text = element_text(size = 16)) ggplot(raw) + geom_bar(aes(x = Hair), fill = "deepskyblue3") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ For a nominal variable it is often better to order the bars by decreasing frequency: ] .pull-right[ .hide-code[ ```r library(forcats) ggplot(mutate(raw, Hair = fct_infreq(Hair))) + geom_bar(aes(x = Hair), fill = "deepskyblue3") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ If the data have already been aggregated, then you need to either specify `stat = "identity"` as well as the variable containing the counts as the `y` aesthetic, or use `geom_col`: ] .pull-right[ .hide-code[ ```r ggplot(agg) + geom_col(aes(x = Hair, y = n), fill = "deepskyblue3") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ For aggregated data, reordering can be based on the computed counts using ```r agg_ord <- mutate(agg, Hair = reorder(Hair, -n, sum)) ``` {{content}} ] -- * `-n` is used to order largest to smallest; {{content}} -- * the default summary used by `reorder` is `mean`; `sum` is better here. -- .pull-right[ .hide-code[ ```r ggplot(agg_ord) + geom_col(aes(x = Hair, y = n), fill = "deepskyblue3") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-45[ ### Adding a Grouping Variable Mapping the `Eye` variable to `fill` in `ggplot` produces a _stacked bar chart_. {{content}} ] -- An alternative, specified with `position = "dodge"`, is a _side by side_ bar chart, or a _clustered_ bar chart. {{content}} -- For the side by side chart in particular it may be useful to also reorder the `Eye` color levels. -- .pull-right.width-55[ .hide-code[ ```r ecols <- c(Brown = "brown2", Blue = "blue2", Hazel = "darkgoldenrod3", Green = "green4") agg_ord <- mutate(agg, Hair = reorder(Hair, -n, sum), Eye = reorder(Eye, -n, sum)) p1 <- ggplot(agg_ord) + geom_col(aes(x = Hair, y = n, fill = Eye)) + scale_fill_manual(values = ecols) + thm p2 <- ggplot(agg_ord) + geom_col(aes(x = Hair, y = n, fill = Eye), position = "dodge") + scale_fill_manual(values = ecols) + thm (p1 + guides(fill = "none")) | p2 ``` <img src="proportions_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-45[ Faceting can be used to bring in additional variables: ] .pull-right.width-55[ .hide-code[ ```r p1 + facet_wrap(~ Sex) ``` <img src="proportions_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- The counts shown here may not be the most relevant features for understanding the joint distributions of these variables. --- layout: true ## Pie Charts and Doughnut Charts --- .pull-left.width-40[ _Pie charts_ go by many different names (from a [Twitter thread](https://twitter.com/ElephantEating/status/1361039771414319106)): ] .pull-right.width-60[ <img src="../img/pienames.jpeg" width="80%" style="display: block; margin: auto;" /> ] --- Pie charts can be viewed as stacked bar charts in polar coordinates: .hide-code[ ```r hcols <- c(Black = "black", Brown = "brown4", Red = "brown1", Blond = "lightgoldenrod1") p1 <- ggplot(agg_ord) + geom_col(aes(x = 1, y = n, fill = Hair), position = "fill") + coord_polar(theta = "y") + scale_fill_manual(values = hcols) + thm p2 <- ggplot(agg_ord) + geom_col(aes(x = Hair, y = n, fill = Hair)) + scale_fill_manual(values = hcols) + thm (p1 + guides(fill = "none")) | p2 ``` <img src="proportions_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> ] -- The axes and grid lines are not helpful for the pie chart and can be removed with some _theme_ settings. --- Using faceting we can also separately show the distributions for men and women: .hide-code[ ```r pie_thm <- thm + theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank()) p3 <- p1 + facet_wrap(~ Sex) + pie_thm p3 ``` <img src="proportions_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] --- _Doughnut charts_ are a variant that has recently become popular in the media: .hide-code[ ```r p4 <- p3 + xlim(0, 1.5) p4 ``` <img src="proportions_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ] --- The center is often used for annotation: .hide-code[ ```r p4 + geom_text(aes(x = 0, y = 0, label = Sex), size = 5) + theme(strip.background = element_blank(), strip.text = element_blank()) ``` <img src="proportions_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> ] --- An alternative to the polar coordinates approach uses `geom_arc_bar` and `stat_pie` from package `ggforce`: .hide-code[ ```r library(ggforce) arrange(agg_ord, desc(Hair)) %>% ggplot(aes(x0 = 0, y0 = 0, r0 = 0, r = 1, amount = n, fill = Hair)) + geom_arc_bar(stat = "pie", color = NA) + coord_fixed() + scale_fill_manual(values = hcols) + pie_thm + facet_wrap(~ Sex) ``` <img src="proportions_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> ] --- For doughnut charts: .hide-code[ ```r arrange(agg_ord, desc(Hair)) %>% ggplot(aes(x0 = 0, y0 = 0, r0 = 0.4, r = 1, amount = n, fill = Hair)) + geom_arc_bar(stat = "pie", color = NA) + geom_text(aes(x = 0, y = 0, label = Sex), size = 5) + coord_fixed() + scale_fill_manual(values = hcols) + pie_thm + theme(strip.background = element_blank(), strip.text = element_blank()) + facet_wrap(~ Sex) ``` <img src="proportions_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Some Notes --- Pie charts are effective for judging part/whole relationships. -- Pie charts can be effective for comparing proportions to * one half * one quarter -- Pie charts are not very effective for comparing proportions to each other. -- 3D pie charts are popular and a very bad idea. An example ([Fig. 6.61](https://www.dropbox.com/s/tlehzi3kb6ikbsz/6.61.3DIllustration.png?dl=0)) from Andy Kirk's book (2016), [_Data Visualization: A Handbook for Data Driven Design_](http://book.visualisingdata.com/home): <img src="../img/badpie.png" width="50%" style="display: block; margin: auto;" /> --- Pie charts are widely used for political data. -- * With the right ordering, pie charts are very good at showing which coalitions of parties can form a majority. -- * When no one candidate earns a majority of the votes, pie charts do not show which candidate has earned a plurality very well. -- * Good orientation and factor ordering can help. -- .hide-code[ ```r elect <- geofacet::election %>% group_by(candidate) %>% summarize(votes = sum(votes)) p1 <- ggplot(elect) + geom_col(aes(x = 1, y = votes, fill = candidate), position = "fill") + coord_polar(theta = "y", start = -1) + xlim(c(-0.5, 1.5)) + scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80), Clinton = scales::muted("blue", 50, 70), Other = "grey")) + pie_thm p2 <- mutate(elect, candidate = factor(candidate, c("Clinton", "Other", "Trump"))) %>% ggplot() + geom_col(aes(x = 1, y = votes, fill = candidate), position = "fill") + coord_polar(theta = "y") + xlim(c(-0.5, 1.5)) + scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80), Clinton = scales::muted("blue", 50, 70), Other = "grey")) + pie_thm p3 <- ggplot(elect) + geom_col(aes(x = candidate, y = 100 * (votes / sum(votes)), fill = candidate)) + scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80), Clinton = scales::muted("blue", 50, 70), Other = "grey")) + labs(y = "percent") + thm + theme(axis.text.x = element_blank(), axis.title.x = element_blank()) (p1 + guides(fill = "none")) + (p2 + guides(fill = "none")) + p3 ``` <img src="proportions_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Some Alternatives --- ### Stacked Bar Charts .pull-left[ Stacked bar charts with equal heights, or filled bar charts, are an alternative for representing part-whole relationships. {{content}} ] -- * Top and bottom proportions are easy to compare. {{content}} -- * Comparing proportions to one half and one quarter is harder. -- .pull-right[ .hide-code[ ```r ggplot(agg) + geom_col(aes(x = Sex, y = n, fill = Hair), position = "fill") + scale_fill_manual(values = hcols) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> ] ] --- ### Waffle Charts Another alternative is a _waffle chart_, sometimes also called a _square pie chart_. <img src="../img/waffle.png" width="50%" style="display: block; margin: auto;" /> -- The [`waffle`](https://github.com/hrbrmstr/waffle) package is one R implementation of this idea. -- Currently the development version on GitHub is needed for the following examples. --- Showing the counts: .hide-code[ ```r library(waffle) stopifnot(packageVersion("waffle") >= "1.0.1") ggplot(arrange(agg, Hair), aes(values = n, fill = Hair)) + geom_waffle(n_rows = 18, flip = TRUE, color = "white", size = 0.33, na.rm = FALSE) + coord_equal() + facet_wrap(~ Sex) + scale_fill_manual(values = hcols) + theme_minimal() + theme_enhance_waffle() ``` <img src="proportions_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> ] <img src="proportions_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> --- Showing the proportions: .hide-code[ ```r round_pct <- function(n) { pct <- 100 * (n / sum(n)) nn <- floor(pct) if (sum(nn) < 100) { rem <- pct - nn idx <- sort(order(rem), decreasing = TRUE)[seq_len(100 - sum(nn))] nn[idx] <- nn[idx] + 1 } nn } group_by(agg, Sex) %>% mutate(pct = round_pct(n)) %>% ungroup() %>% ggplot(aes(values = pct, fill = Hair)) + geom_waffle(n_rows = 10, flip = TRUE, color = "white", size = 0.33, na.rm = FALSE) + coord_equal() + facet_wrap(~ Sex) + scale_fill_manual(values = hcols) + theme_minimal() + theme_enhance_waffle() ``` <img src="proportions_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Population Pyramids --- .pull-left[ Bar charts for two groups can be shown back to back. ] -- .pull-right[ .hide-code[ ```r mutate(agg, Hair = reorder(Hair, n, sum)) %>% ggplot(aes(x = ifelse(Sex == "Male", n, -n), y = Hair, fill = Sex)) + geom_col() + xlab("Count") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- This is often used for showing age distributions by sex for populations; the result is called a [_population pyramid_](https://www.visualcapitalist.com/us-population-pyramid-1980-2050/). --- .pull-left.width-40[ Age distribution data for many countries and years is available from a [Census Bureau website](https://www.census.gov/data-tools/demo/idb/). Data files for 2020 for [Germany](https://stat.uiowa.edu/~luke/data/germany-2020.csv) and [Nigeria](https://stat.uiowa.edu/~luke/data/nigeria-2020.csv) are available locally. .hide-code[ ```r if (! file.exists("germany-2020.csv")) download.file("https://stat.uiowa.edu/~luke/data/germany-2020.csv", "germany-2020.csv") if (! file.exists("nigeria-2020.csv")) download.file("https://stat.uiowa.edu/~luke/data/nigeria-2020.csv", "nigeria-2020.csv") gm_pop <- read.csv("germany-2020.csv", skip = 1) %>% filter(Age != "Total") %>% mutate(Age = fct_inorder(Age)) ni_pop <- read.csv("nigeria-2020.csv", skip = 1) %>% filter(Age != "Total") %>% mutate(Age = fct_inorder(Age)) ``` ] {{content}} ] -- Combining the data sets allows a side by side comparison of the counts: -- .pull-right.width-60[ .hide-code[ ```r library(tidyr) pop2 <- bind_rows(mutate(gm_pop, Country = "Germany"), mutate(ni_pop, Country = "Nigeria")) %>% select(Age, Male = Male.Population, Female = Female.Population, Country) %>% pivot_longer(Male : Female, names_to = "Sex", values_to = "n") ggplot(pop2) + geom_col(aes(x = ifelse(Sex == "Male", n, -n), y = Age, fill = Sex)) + facet_wrap(~ Country) + scale_x_continuous( labels = function(n) scales::comma(abs(n))) + xlab("Count") + thm + theme(legend.position = "top") ``` <img src="proportions_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-40[ The different shapes are evident, but are harder to see than they could be because of the difference in total population: .hide-code[ ```r group_by(pop2, Country) %>% summarize(Population = sum(n)) %>% ungroup() %>% mutate(Population = scales::comma(Population)) %>% knitr::kable(format = "html", align = "lr") %>% kableExtra::kable_styling(full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Country </th> <th style="text-align:right;"> Population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Germany </td> <td style="text-align:right;"> 80,159,662 </td> </tr> <tr> <td style="text-align:left;"> Nigeria </td> <td style="text-align:right;"> 214,028,302 </td> </tr> </tbody> </table> ] {{content}} ] -- Using a group mutate we can compute sex/age group percentages within each country: -- .pull-right.width-60[ .hide-code[ ```r group_by(pop2, Country) %>% mutate(pct = 100 * n / sum(n)) %>% ungroup() %>% ggplot() + geom_col(aes(x = ifelse(Sex == "Male", pct, -pct), y = Age, fill = Sex)) + facet_wrap(~ Country) + scale_x_continuous( labels = function(x) scales::percent(abs(x / 100))) + xlab("Percent") + thm + theme(legend.position = "top") ``` <img src="proportions_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> ] ] --- layout: false ## Multiple Categorical Variables Visualizing the distribution of multiple categorical variables involves visualizing counts and proportions. -- Distributions can be viewed as -- * joint distributions; -- * conditional distributions. -- When one variable (or several) can be viewed as a response and others as predictors then it is common to focus on the conditional distribution of the response given the predictors. -- The most common approaches use variants of bar and area charts. -- The resulting plots are often called [_mosaic plots_](https://en.wikipedia.org/wiki/Mosaic_plot). --- layout: true ## Two Data Sets --- .pull-left.width-35[ ### Hair and Eye Color ```r HairEyeColorDF <- as.data.frame(HairEyeColor) head(HairEyeColorDF) ## Hair Eye Sex Freq ## 1 Black Brown Male 32 ## 2 Brown Brown Male 53 ## 3 Red Brown Male 10 ## 4 Blond Brown Male 3 ## 5 Black Blue Male 11 ## 6 Brown Blue Male 50 ``` ] -- .pull-right.width-65[ Marginal distributions of the variables: .hide-code[ ```r p1 <- ggplot(HairEyeColorDF) + geom_col(aes(Sex, Freq), fill = "deepskyblue3") + thm p2 <- ggplot(HairEyeColorDF) + geom_col(aes(Hair, Freq), fill = "deepskyblue3") + thm p3 <- ggplot(HairEyeColorDF) + geom_col(aes(Eye, Freq), fill = "deepskyblue3") + thm p1 | p2 | p3 ``` <img src="proportions_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> ] ] --- ### Arthritis Data .pull-left[ The `vcd` package includes the data frame `Arthritis` with several variables for 84 patients in a clinical trial for a treatment for rheumatoid arthritis. {{content}} ] -- ```r data(Arthritis, package = "vcd") head(Arthritis) ## ID Treatment Sex Age Improved ## 1 57 Treated Male 27 Some ## 2 46 Treated Male 29 None ## 3 77 Treated Male 30 None ## 4 17 Treated Male 32 Marked ## 5 36 Treated Male 46 Marked ## 6 23 Treated Male 58 Marked ``` {{content}} -- * The `Improved` variable is the response. {{content}} -- * The predictors are `Treatment`, `Sex`, and `Age`. -- .pull-right[ Counts for the categorical predictors: ```r xtabs(~ Sex, Arthritis) ## Sex ## Female Male ## 59 25 ``` {{content}} ] -- ```r xtabs(~ Treatment, Arthritis) ## Treatment ## Placebo Treated ## 43 41 ``` {{content}} -- ```r xtabs(~ Treatment + Sex, data = Arthritis) ## Sex ## Treatment Female Male ## Placebo 32 11 ## Treated 27 14 ``` --- .pull-left.width-35[ Joint distribution of the predictors: ] .pull-right.width-65[ .hide-code[ ```r ggplot(Arthritis) + geom_histogram(aes(x = Age), binwidth = 10, fill = "deepskyblue3", color = "black") + facet_grid(Treatment ~ Sex) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-35[ Conditional distribuiton of age, given sex and treatment: ] .pull-right.width-65[ .hide-code[ ```r ggplot(Arthritis) + geom_histogram(aes(x = Age, y = after_stat(density)), binwidth = 10, fill = "deepskyblue3", color = "black") + facet_grid(Treatment ~ Sex) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Bar Charts --- .pull-left.width-40[ ### Hair and Eye Color Default bar charts show the individual count or joint proportions. {{content}} ] -- For the hair-eye color aggregated data counts: -- .pull-right.width-60[ .hide-code[ ```r ggplot(HairEyeColorDF) + geom_col(aes(x = Eye, y = Freq, fill = Sex)) + facet_wrap(~ Hair) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-60[ Joint proportions: .hide-code[ ```r ggplot(mutate(HairEyeColorDF, Prop = Freq / sum(Freq))) + geom_col(aes(x = Eye, y = Prop, fill = Sex)) + facet_wrap(~ Hair) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right.width-40[ * Differing frequencies of the hair colors are visible. {{content}} ] -- * Conditional distributions of eye color within hair color are harder to compare. --- Showing conditional distributions requires computing proportions within groups. -- For the joint conditional distribution of sex and eye color given hair color: .pull-left.width-60[ .hide-code[ ```r group_by(HairEyeColorDF, Hair) %>% mutate(Prop = Freq / sum(Freq)) %>% ungroup() %>% ggplot() + geom_col(aes(x = Eye, y = Prop, fill = Sex)) + facet_wrap(~ Hair) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right.width-40[ * It is easier to compare the skewness of the eye color distributions for black, brown, and red hair. {{content}} ] -- * Assessing the proportion of females or males withing the different groups is possible but challenging since it requires relative length comparisons. --- To more clearly see the that the proportion of females among subjects with blond hair and blue eyes is higher than for other hair/eye color combinations we can look at the conditional distribution of sex given hair and eye color. .pull-left[ .hide-code[ ```r group_by(HairEyeColorDF, Hair, Eye) %>% mutate(Prop = Freq / sum(Freq)) %>% ungroup() %>% ggplot() + geom_col(aes(x = Eye, y = Prop, fill = Sex)) + facet_wrap(~ Hair, nrow = 1) + thm + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` <img src="proportions_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ This plot can also be obtained using `position = "fill"`. .hide-code[ ```r ggplot(HairEyeColorDF) + geom_col(aes(x = Eye, y = Freq, fill = Sex), position = "fill") + facet_wrap(~ Hair, nrow = 1) + thm + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` ] {{content}} ] -- One drawback: This visualization no longer shows that some of the hair/eye color combinations are more common than others. --- .pull-left[ ### Arthritis Data For the raw arthritis data, `geom_bar` computes the aggregate counts and produces a stacked bar chart by default: ] .pull-right[ .hide-code[ ```r p <- ggplot(Arthritis, aes(x = Sex, fill = Improved)) + facet_wrap(~ Treatment) p + geom_bar() + scale_fill_brewer(palette = "Blues") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Specifying `position = "dodge"` produces a side-by-side plot: ] .pull-right[ .hide-code[ ```r p + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Blues") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- There are no cases of male patients on placebo reporting `Some` improvement, resulting in wider bars for the other options. --- .pull-left[ One way to produce a zero height bar: * aggregate with `count`, and * use `complete` from `tidyr` ] .pull-right[ .hide-code[ ```r library(tidyr) comp_counts <- count(Arthritis, Treatment, Sex, Improved) %>% complete(Treatment, Sex, Improved, fill = list(n = 0)) ggplot(comp_counts, aes(x = Sex, y = n, fill = Improved)) + geom_col(position = "dodge") + facet_wrap(~ Treatment) + scale_fill_brewer(palette = "Blues") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Another option is to use the `preserve = "single"` option with `position_dodge`. ] .pull-right[ .hide-code[ ```r p + geom_bar(position = position_dodge( preserve = "single")) + scale_fill_brewer(palette = "Blues") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Showing conditional distributions of `Improved` given different levels of `Treatment` and `Sex`: ] .pull-right[ .hide-code[ ```r group_by(comp_counts, Treatment, Sex) %>% mutate(prop = n / sum(n)) %>% ungroup() %>% ggplot() + geom_col(aes(x = Sex, y = prop, fill = Improved), position = "dodge") + facet_wrap(~ Treatment) + scale_fill_brewer(palette = "Blues") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-54-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Stacked bar charts with height one are another option to make these conditional distributions easier to compare: ] .pull-right[ .hide-code[ ```r p + geom_bar(position = "fill") + scale_fill_brewer(palette = "Blues") + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Ordering of variables affects which comparisons are easier. {{content}} ] -- * A researcher might want to emphasize the differential response among males and females. {{content}} -- * A patient might prefer to be able to focus on whether the treatment is effective for them: -- .pull-right[ .hide-code[ ```r ggplot(Arthritis, aes(x = Treatment, fill = Improved)) + geom_bar(position = "fill") + scale_fill_brewer(palette = "Blues") + thm + facet_wrap(~ Sex) ``` <img src="proportions_files/figure-html/unnamed-chunk-56-1.png" style="display: block; margin: auto;" /> ] ] --- Some notes; -- * The stacked bar chart is effective for two categories, and a few more if they are ordered. -- * Providing a visual indication of uncertainty in the estimates is a challenge. The standard errors in this case are around 0.1. -- * The proportions of each treatment group that are male or female could be encoded in the bar widths. -- * The resulting plot is called a _spine plot_. -- * Basic `ggplot2` does not seem to make this easy. --- layout: true ## Spine Plots --- _Spine plots_ are a special case of [_mosaic plots_](https://en.wikipedia.org/wiki/Mosaic_plot), and can be seen as a generalization of stacked bar plots. -- For a spine plot the proportions for the categories of a predictor variable are encoded in the bar widths. -- The `ggmosaic` package provides support for mosaic plots in the `ggplot` framework. (It can be a little rough around the edges.) -- Spine plots are provided by the base graphics function `spineplot` and the `vcd` function `spine`. -- `vcd` plots are built on the `grid` graphics system, like `lattice` and `ggplot2` graphics. <!-- spine plot in the wild: https://t.co/91xqkWXIfD; in img/energy-spine.png --> --- .pull-left[ A spine plot for the distribution of `Improved` given `Sex` in the `Treated` group: ] .pull-right[ .hide-code[ ```r library(ggmosaic) filter(Arthritis, Treatment == "Treated") %>% mutate(Improved = fct_rev(Improved)) %>% ggplot() + geom_mosaic(aes(x = product(Sex), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + facet_wrap(~ Treatment) + thm + labs(x = "", y = "Improved") ``` <img src="proportions_files/figure-html/unnamed-chunk-57-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Spine plots for `Treatment` groups using faceting: ] .pull-right[ .hide-code[ ```r library(ggmosaic) mutate(Arthritis, Improved = fct_rev(Improved)) %>% ggplot() + geom_mosaic(aes(x = product(Sex), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + facet_wrap(~ Treatment) + thm + labs(x = "", y = "Improved") ``` <img src="proportions_files/figure-html/unnamed-chunk-58-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Spine plots for the arthritis data, faceted on `Sex`: ] .pull-right[ .hide-code[ ```r library(ggmosaic) mutate(Arthritis, Improved = fct_rev(Improved)) %>% ggplot() + geom_mosaic(aes(x = product(Treatment), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + facet_wrap(~ Sex) + thm + labs(x = "", y = "Improved") ``` <img src="proportions_files/figure-html/unnamed-chunk-59-1.png" style="display: block; margin: auto;" /> ] {{content}} ] -- This no longer shows the Female/Male imbalance. --- .pull-left[ For aggregate counts use the weight aesthetic: ] .pull-right[ .hide-code[ ```r mutate(HairEyeColorDF, Sex = fct_rev(Sex)) %>% ggplot() + geom_mosaic(aes(weight = Freq, x = product(Hair), fill = Sex)) + thm + labs(x = "Hair", y = "") ``` <img src="proportions_files/figure-html/unnamed-chunk-60-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Spine plots of `Sex` within `Eye` color, faceted on `Hair` color: .hide-code[ ```r mutate(HairEyeColorDF, Sex = fct_rev(Sex)) %>% ggplot() + geom_mosaic(aes(weight = Freq, x = product(Eye), fill = Sex)) + thm + labs(x = "Eye", y = "") + facet_wrap(~ Hair, nrow = 1, scales = "free_x") + theme(legend.position = "top", axis.text.y = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1)) + scale_y_continuous(expand = c(0, 0)) ``` <img src="proportions_files/figure-html/unnamed-chunk-61-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ The relative sizes of the groups on the `x` (eye color) axis are shown within the facets. {{content}} ] -- The sizes of the faceted variable (hair color) groups are not reflected. {{content}} -- _Double decker plots_ try to address this. --- layout: true ## Doubledecker Plots --- _Doubledecker plots_ can be viewed as a generalization of spine plots to multiple predictors. Package `vcd` provides the `doubledecker` function. This function can use a formula interface. .pull-left[ .hide-code[ ```r arth_pal <- RColorBrewer::brewer.pal(3, "Blues") arth_gp <- grid::gpar(fill = arth_pal) vcd::doubledecker(Improved ~ Treatment + Sex, data = Arthritis, gp = arth_gp, margins = c(2, 5, 4, 2)) ``` <img src="proportions_files/figure-html/unnamed-chunk-62-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ .hide-code[ ```r vcd::doubledecker(Improved ~ Sex + Treatment, data = Arthritis, gp = arth_gp, margins = c(2, 5, 4, 2)) ``` <img src="proportions_files/figure-html/unnamed-chunk-63-1.png" style="display: block; margin: auto;" /> ] ] --- Using `ggmosaic`: .pull-left[ .hide-code[ ```r mutate(Arthritis, Improved = fct_rev(Improved)) %>% ggplot() + geom_mosaic( aes(x = product(Sex, Treatment), fill = Improved), divider = ddecker()) + scale_fill_brewer(palette = "Blues", direction = -1) + thm + theme(axis.text.x = element_text(angle = 15, hjust = 1)) + labs(x = "", y = "") ``` <img src="proportions_files/figure-html/unnamed-chunk-64-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ .hide-code[ ```r mutate(Arthritis, Improved = fct_rev(Improved)) %>% ggplot() + geom_mosaic( aes(x = product(Treatment, Sex), fill = Improved), divider = ddecker()) + scale_fill_brewer(palette = "Blues", direction = -1) + thm + theme(axis.text.x = element_text(angle = 15, hjust = 1)) + labs(x = "", y = "") ``` <img src="proportions_files/figure-html/unnamed-chunk-65-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Mosaic Plots --- _Mosaic plots_ recursively partition the axes to represent counts of categorical variables as rectangles. -- .pull-left[ * Base graphics provides `mosaicplot`; {{content}} ] -- * `vcd` provides `mosaic`. {{content}} -- Both support a formula interface. {{content}} -- A Mosaic plot for the predictors `Sex` and `Treatment`: -- .pull-right[ .hide-code[ ```r vcd::mosaic(~ Sex + Treatment, data = Arthritis) ``` <img src="proportions_files/figure-html/unnamed-chunk-66-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Adding `Improved` to the joint distribution: .hide-code[ ```r vcd::mosaic(~ Sex + Treatment + Improved, data = Arthritis) ``` <img src="proportions_files/figure-html/unnamed-chunk-67-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ .hide-code[ Identifying `Improved` as the response: <!-- ##vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis, gp = arth_gp)--> ```r vcd::mosaic(Improved ~ Sex + Treatment, data = Arthritis) ``` <img src="proportions_files/figure-html/unnamed-chunk-68-1.png" style="display: block; margin: auto;" /> ] ] --- Matching the doubledecker plots: .pull-left[ .hide-code[ ```r vcd::mosaic( Improved ~ Treatment + Sex, data = Arthritis, split_vertical = c(TRUE, TRUE, FALSE)) ``` <img src="proportions_files/figure-html/unnamed-chunk-69-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ .hide-code[ ```r vcd::mosaic( Improved ~ Sex + Treatment, data = Arthritis, split_vertical = c(TRUE, TRUE, FALSE)) ``` <img src="proportions_files/figure-html/unnamed-chunk-70-1.png" style="display: block; margin: auto;" /> ] ] --- Some variants using `ggmosaic`: .pull-left[ .hide-code[ ```r ggplot(mutate(Arthritis, Sex = fct_rev(Sex))) + geom_mosaic( aes(x = product(Treatment, Sex))) + coord_flip() + labs(x = "", y = "") ``` <img src="proportions_files/figure-html/unnamed-chunk-71-1.png" style="display: block; margin: auto;" /> ] ] -- .pull-right[ .hide-code[ ```r ggplot(mutate(Arthritis, Sex = fct_rev(Sex))) + geom_mosaic(aes(x = product(Improved, Treatment, Sex))) + coord_flip() ``` <img src="proportions_files/figure-html/unnamed-chunk-72-1.png" style="display: block; margin: auto;" /> ] ] --- A mosaic plot for all bivariate marginals: .hide-code[ ```r pairs(xtabs(~ Sex + Treatment + Improved, data = Arthritis)) ``` <img src="proportions_files/figure-html/unnamed-chunk-73-1.png" style="display: block; margin: auto;" /> ] --- layout: true ## Spinograms and CD Plots --- <!-- building a spinogram from scratch, more or less: ```r Arth <- mutate(Arthritis, AgeBin = cut(Age, seq(20, by = 10, len = 7)), Improved = fct_rev(Improved)) d <- filter(Arth, Treatment == "Treated") %>% count(AgeBin) %>% mutate(brk = (cumsum(lag(n, default = 0)) + 0.5 * n) / sum(n)) p <- ggplot(d) p + geom_col(aes(x = AgeBin, y = n)) p2 <- p + geom_col(aes(x = 0.5, y = n, color = fct_rev(AgeBin)), position = "fill", fill = NA) + guides(color = "none") + scale_y_continuous(breaks = d$brk, labels = d$AgeBin) p2 p2 + coord_flip() ``` ```r props <- filter(Arth, Treatment == "Treated") %>% count(AgeBin, Improved) %>% mutate(AgeBin = fct_drop(AgeBin)) %>% complete(AgeBin, Improved, fill = list(n = 0)) %>% group_by(AgeBin) %>% mutate(prop = n / sum(n)) %>% ungroup() %>% select(-n) props ``` --> .pull-left[ _Spinograms_ and _CD plots_ show the conditional distribution of a categorical variable given the value of a numeric variable. {{content}} ] -- * Spinograms use the same binning as a histogram and then create a spine plot. {{content}} -- * CD plots use a smoothing or density estimation approach. -- .pull-right[ A spinogram for `Improved` against `Age`: .hide-code[ ```r ArthT <- filter(Arthritis, Treatment == "Treated") %>% mutate(Improved = fct_rev(Improved)) arthT_gp <- grid::gpar(fill = rev(arth_gp$fill)) vcd::spine(Improved ~ Age, data = ArthT, gp = arthT_gp, breaks = 5) ``` <img src="proportions_files/figure-html/unnamed-chunk-74-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ An analogous plot created with `ggmosaic` by binning the `Age` variable: ] .pull-right[ .hide-code[ ```r Arth <- mutate(Arthritis, AgeBin = cut(Arthritis$Age, seq(20, by = 10, len = 7)), Improved = fct_rev(Improved)) filter(Arth, Treatment == "Treated") %>% count(Improved, AgeBin) %>% ggplot() + geom_mosaic(aes(weight = n, x = product(AgeBin), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + theme_minimal() + theme(axis.title = element_blank()) ``` <img src="proportions_files/figure-html/unnamed-chunk-75-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left.width-45[ A facet grid can be used to create spinograms for each of the `Sex`/`Treatment` combinations: ] .pull-right.width-55[ .hide-code[ ```r ggplot(count(Arth, Improved, Sex, Treatment, AgeBin)) + geom_mosaic(aes(weight = n, x = product(AgeBin), fill = Improved)) + scale_fill_brewer(palette = "Blues", direction = -1) + theme_minimal() + facet_grid(Treatment ~ Sex) + theme(axis.title = element_blank()) + theme(axis.text.x = element_text(angle = 35, hjust = 1), axis.text.y = element_blank()) ``` <img src="proportions_files/figure-html/unnamed-chunk-76-1.png" style="display: block; margin: auto;" /> ] ] --- A [spinogram](https://flowingdata.com/2021/08/02/decline-of-u-s-vaccination-rate-compared-against-europes/) in the media (NYT, August 2021): <img src="../img/Vaccination-rates-with-Marimekko-1536x925.png" width="70%" style="display: block; margin: auto;" /> --- Some plots in a [Twitter thread](https://twitter.com/jburnmurdoch/status/1503420660869214213?s=20&t=RZxOXiHZ0eXkV6WXMIuVVA): .pull-left[ <img src="../img/covid-spinogram.png" width="90%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="../img/covid-mort.png" width="90%" style="display: block; margin: auto;" /> ] --- CD plots estimate the conditional density of the `x` variable given the levels of `y`, weighted by the marginal proportions of `y` and use these to estimate cumulative probabilities. -- * The slice at a particular `x` level visualizes the conditional distribution of `y` given `x` at that level. -- * `geom_density` with `position = stack` is one way to create a CD plot. -- * The `cd_plot` function from the `vcd` package produces a CD plot using `grid` graphics. -- * The `cdplot` function from the base `graphics` package provides the same plots using base graphics. <!-- Building a CD plot in steps: ```r d <- filter(Arthritis, Treatment == "Treated", Sex == "Female") ## use y = after_stat(count) so area is number of rows p0 <- ggplot(d, aes(x = Age, y = after_stat(count))) + geom_density(bw = 5, fill = "grey") p0 ## separate count-weighted densities for each group with alpha blending p1 <- ggplot(d, aes(x = Age, y = after_stat(count), fill = Improved)) + geom_density(bw = 5, alpha = 0.5) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") p1 ## easier to see the separate densities with faceting p1 + facet_wrap(~ Improved) ## stack the densities ggplot(d, aes(x = Age, y = after_stat(count), fill = Improved)) + geom_density(position = "stack", bw = 5) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") ## rescale to height 1 with position = "fill" p2 <- ggplot(d, aes(x = Age, y = after_stat(count), fill = Improved)) + geom_density(position = "fill", bw = 5) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") p2 p2 + geom_vline(xintercept = 40, lty = 2) p2 + geom_vline(xintercept = 60, lty = 2) ``` --> --- .pull-left[ CD plots for the `Treated` group: ] .pull-right[ .hide-code[ ```r filter(Arthritis, Treatment == "Treated") %>% ggplot(aes(x = Age, fill = Improved)) + geom_density(position = "fill", bw = 5) + scale_fill_brewer(palette = "Blues") + facet_wrap(~ Sex, ncol = 1) + thm ``` <img src="proportions_files/figure-html/unnamed-chunk-80-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ CD plots for all combinations end up with one group of size one and one of size zero, which produces a non-useful plot for one combination: ] .pull-right[ .hide-code[ ```r count(Arthritis, Treatment, Sex, Improved) %>% complete(Treatment, Sex, Improved, fill = list(n = 0)) %>% filter(n < 2) ## # A tibble: 2 × 4 ## Treatment Sex Improved n ## <fct> <fct> <ord> <int> ## 1 Placebo Male Some 0 ## 2 Placebo Male Marked 1 ggplot(Arthritis, aes(x = Age, fill = Improved)) + geom_density(position = "fill", bw = 5) + scale_fill_brewer(palette = "Blues") + facet_grid(Treatment ~ Sex) + thm ## Warning: Groups with fewer than two data points have been dropped. ## Warning: Removed 1 rows containing missing values (`position_stack()`). ``` <img src="proportions_files/figure-html/unnamed-chunk-81-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## Uncertainty Representation --- .pull-left[ Categorical data are often analyzed by fitting models representing conditional independence structures. {{content}} ] -- * Plotting residuals from these models can help assess how well they fit. {{content}} -- * `vcd::mosaic` supports using color to represent magnitude of residuals for comparing to a simple independence model. {{content}} -- For the `Arthritis` data, observed counts and expected counts under an independence model assuming `Treatment` and `Improved` are independent can be visualized as mosaic plots: -- .pull-right[ .hide-code[ ```r ## there are easier ways do do this ... v <- count(Arthritis, Treatment, Improved) pT <- group_by(v, Treatment) %>% summarize(n = sum(n)) %>% mutate(pT = n / sum(n)) %>% select(-n) pI <- group_by(v, Improved) %>% summarize(n = sum(n)) %>% mutate(pI = n / sum(n)) %>% select(-n) v <- left_join(v, pT, "Treatment") %>% left_join(pI, "Improved") %>% mutate(p = pT * pI, Treatment = fct_rev(Treatment)) po <- ggplot(v) + geom_mosaic(aes(weight = n, x = product(Improved, Treatment), fill = Improved)) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") + labs(title = "Observed Proportions") + thm + coord_flip() + theme(axis.text.y = element_text(angle = 90, hjust = 0)) pe <- ggplot(v) + geom_mosaic(aes(weight = p, x = product(Improved, Treatment), fill = Improved)) + scale_fill_brewer(palette = "Blues") + guides(fill = "none") + labs(title = "Expected Proportions") + thm + coord_flip() + theme(axis.text.y = element_text(angle = 90, hjust = 0)) po + pe ``` <img src="proportions_files/figure-html/unnamed-chunk-82-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A plot for assessing the fit of the residuals between the observed and expected data under a model assuming independence of `Treatment` and `Improved` produces: ] .pull-right[ .hide-code[ ```r vcd::mosaic(~ Treatment + Improved, data = Arthritis, gp = vcd::shading_max) ``` <img src="proportions_files/figure-html/unnamed-chunk-83-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ Another visualization of the residuals is the _association plot_ produced by `assoc`: ] .pull-right[ .hide-code[ ```r vcd::assoc(~ Treatment + Improved, data = Arthritis, gp = vcd::shading_max) ``` <img src="proportions_files/figure-html/unnamed-chunk-84-1.png" style="display: block; margin: auto;" /> ] ] --- layout: true ## References --- > The vignette [_Residual-Based Shadings in > vcd_](https://cran.r-project.org/package=vcd/vignettes/residual-shadings.pdf) > in the `vcd` package. > Zeileis, Achim, David Meyer, and Kurt Hornik. "Residual-based > shadings for visualizing (conditional) independence." Journal of > Computational and Graphical Statistics 16, no. 3 (2007): 507-525. > The vignette [_Working with categorical data with R and the vcd and vcdExtra packages_](https://www.datavis.ca/courses/VCD/vcd-tutorial.pdf) in the `vcdExtra` package. Several other experimental mosaic plot implementations are available for `ggplot`. --- layout: true ## Some Other Visualizations --- .pull-left[ ### Tree Maps {{content}} ] -- Tree maps show hierarchically structured (or tree-tructured) data. {{content}} -- * Each branch is represented by a rectangle. {{content}} -- * Leaf node tiles have areas proportional to the value of a variable. {{content}} -- * Tiles are often colored to reflect the value of another variable. {{content}} -- The package `treemapify` provides a `ggplot`-based implementation. {{content}} -- The data set `G20` includes some variables on the G-20 member countries: -- .pull-right.scroll-box-20[ .hide-code[ ```r library(treemapify) select(G20, region, country, gdp_mil_usd, hdi) %>% knitr::kable(format = "html") %>% kableExtra::kable_styling( full_width = FALSE) ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> region </th> <th style="text-align:left;"> country </th> <th style="text-align:right;"> gdp_mil_usd </th> <th style="text-align:right;"> hdi </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> South Africa </td> <td style="text-align:right;"> 384315 </td> <td style="text-align:right;"> 0.629 </td> </tr> <tr> <td style="text-align:left;"> North America </td> <td style="text-align:left;"> United States </td> <td style="text-align:right;"> 15684750 </td> <td style="text-align:right;"> 0.937 </td> </tr> <tr> <td style="text-align:left;"> North America </td> <td style="text-align:left;"> Canada </td> <td style="text-align:right;"> 1819081 </td> <td style="text-align:right;"> 0.911 </td> </tr> <tr> <td style="text-align:left;"> North America </td> <td style="text-align:left;"> Mexico </td> <td style="text-align:right;"> 1177116 </td> <td style="text-align:right;"> 0.775 </td> </tr> <tr> <td style="text-align:left;"> South America </td> <td style="text-align:left;"> Brazil </td> <td style="text-align:right;"> 2395968 </td> <td style="text-align:right;"> 0.730 </td> </tr> <tr> <td style="text-align:left;"> South America </td> <td style="text-align:left;"> Argentina </td> <td style="text-align:right;"> 474954 </td> <td style="text-align:right;"> 0.811 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> China </td> <td style="text-align:right;"> 8227037 </td> <td style="text-align:right;"> 0.699 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> Japan </td> <td style="text-align:right;"> 5963969 </td> <td style="text-align:right;"> 0.912 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> South Korea </td> <td style="text-align:right;"> 1155872 </td> <td style="text-align:right;"> 0.909 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> India </td> <td style="text-align:right;"> 1824832 </td> <td style="text-align:right;"> 0.554 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> Indonesia </td> <td style="text-align:right;"> 878198 </td> <td style="text-align:right;"> 0.629 </td> </tr> <tr> <td style="text-align:left;"> Eurasia </td> <td style="text-align:left;"> Russia </td> <td style="text-align:right;"> 2021960 </td> <td style="text-align:right;"> 0.788 </td> </tr> <tr> <td style="text-align:left;"> Eurasia </td> <td style="text-align:left;"> Turkey </td> <td style="text-align:right;"> 794468 </td> <td style="text-align:right;"> 0.722 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> European Union </td> <td style="text-align:right;"> 16414483 </td> <td style="text-align:right;"> 0.876 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Germany </td> <td style="text-align:right;"> 3400579 </td> <td style="text-align:right;"> 0.920 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 2608699 </td> <td style="text-align:right;"> 0.893 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> United Kingdom </td> <td style="text-align:right;"> 2440505 </td> <td style="text-align:right;"> 0.875 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> Italy </td> <td style="text-align:right;"> 2014079 </td> <td style="text-align:right;"> 0.881 </td> </tr> <tr> <td style="text-align:left;"> Middle East </td> <td style="text-align:left;"> Saudi Arabia </td> <td style="text-align:right;"> 727307 </td> <td style="text-align:right;"> 0.782 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 1541797 </td> <td style="text-align:right;"> 0.938 </td> </tr> </tbody> </table> ] ] --- .pull-left[ A simple tree with only one level, the individual countries: <!-- # nolint start --> <center>
</center> <!-- # nolint end --> ] -- .pull-right[ A corresponding tree map based on `gdp_mil_usd`: .hide-code[ ```r ggplot(G20, aes(area = gdp_mil_usd)) + geom_treemap() + geom_treemap_text(aes(label = country), color = "white") ``` <img src="proportions_files/figure-html/unnamed-chunk-88-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A tree grouping by region: <center>
</center> ] .pull-right[ A corresponding tree map: .hide-code[ ```r ggplot(G20, aes(area = gdp_mil_usd, subgroup = region)) + geom_treemap() + geom_treemap_text(aes(label = country), color = "white") + geom_treemap_subgroup_border( color = "red") + geom_treemap_subgroup_text(color = "red") ``` <img src="proportions_files/figure-html/unnamed-chunk-90-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A tree map showing GDP values for the G-20 members, grouped by region, with fill mapped to the country's Human Development Index: ] .pull-right[ .hide-code[ ```r ggplot(G20, aes(area = gdp_mil_usd, fill = hdi, subgroup = region)) + geom_treemap() + geom_treemap_text(aes(label = country), color = "white") + geom_treemap_subgroup_border() + geom_treemap_subgroup_text( color = "lightgrey") ``` <img src="proportions_files/figure-html/unnamed-chunk-91-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A treemap representing the distribution of eye color within hair color: ] .pull-right[ .hide-code[ ```r group_by(agg, Eye, Hair) %>% summarize(n = sum(n)) %>% ungroup() %>% ggplot(aes(area = n, subgroup = Hair)) + geom_treemap(aes(fill = Eye), color = "white") + geom_treemap_subgroup_text() + geom_treemap_subgroup_border( color = "black", size = 6) + geom_treemap_text(aes(label = Eye), color = "grey90") + scale_fill_manual(values = ecols) + guides(fill = "none") ``` <img src="proportions_files/figure-html/unnamed-chunk-92-1.png" style="display: block; margin: auto;" /> ] ] --- .pull-left[ A treemap representing proportions for `Improved` within `Treatment` within `Sex` for the Arthritis data: ] .pull-right[ .hide-code[ ```r count(Arth, Treatment, Improved, Sex) %>% ggplot(aes(area = n, subgroup = Sex, fill = Improved, subgroup2 = Treatment)) + geom_treemap() + geom_treemap_subgroup_text() + scale_fill_brewer(palette = "Blues", direction = -1) + geom_treemap_subgroup_border() + geom_treemap_subgroup2_text(place = "top", size = 20) ``` <img src="proportions_files/figure-html/unnamed-chunk-93-1.png" style="display: block; margin: auto;" /> ] ] --- ### Alluvial plots These are also known as -- * _parallel sets_, or -- * _Sankey diagrams_. -- They can be viewed as a parallel coordinates plot for categorical data. -- Several implementations are available, including: <!-- * `alluvial` using base graphics; --> -- * `geom_parallel_sets` from `ggforce`; -- * `geom_sankey` from [`ggsankey`](https://github.com/davidsjoberg/ggsankey); -- * `geom_alluvium` from `ggalluvial`. <!-- Hair/Eye color using the `alluvial` package: --> --- .pull-left[ Hair/Eye color using the `ggforce` package: .hide-code[ ```r pal <- RColorBrewer::brewer.pal(3, "Set1") HDF <- mutate(HairEyeColorDF, Sex = fct_rev(Sex)) library(ggforce) sHDF <- gather_set_data(HDF, 3 : 1) sHDF <- mutate(sHDF, x = fct_inorder(as.factor(x))) #**** simplify this? ggplot(sHDF, aes(x, id = id, split = y, value = Freq)) + geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes( axis.width = 0.1) + geom_parallel_sets_labels( colour = 'white') + scale_fill_manual( values = c(Male = pal[2], Female = pal[1])) + theme_void() + guides(fill = "none") ``` <img src="proportions_files/figure-html/unnamed-chunk-95-1.png" style="display: block; margin: auto;" /> ] ] <!-- Arthritis data with `alluvial`: --> -- .pull-right[ Arthritis data with `ggforce`: .hide-code[ ```r sArth <- mutate(Arth, Improved = factor(Improved, ordered = FALSE)) %>% count(Improved, Treatment, Sex) %>% gather_set_data(3 : 1) sArth <- mutate(sArth, x = fct_inorder(factor(x)), Sex = fct_rev(Sex)) ggplot(sArth, aes(x, id = id, split = y, value = n)) + geom_parallel_sets(aes(fill = Sex), alpha = 0.5, axis.width = 0.1) + geom_parallel_sets_axes(axis.width = 0.1) + geom_parallel_sets_labels( colour = 'white') + scale_fill_manual( values = c(Male = pal[2], Female = pal[1])) + theme_void() + guides(fill = "none") ``` <img src="proportions_files/figure-html/unnamed-chunk-97-1.png" style="display: block; margin: auto;" /> ] ] --- --- .pull-left[ ### Stream Graphs {{content}} ] -- [Stream graphs](https://www.visualisingdata.com/2010/08/making-sense-of-streamgraphs/) are a generalization of stacked bar charts plotted against a numeric variable. {{content}} -- In some cases the origins of the bars are shifted to improve some aspect of the overall visualization. {{content}} -- An early example is the [Baby Name Voyager](https://www.bewitched.com/namevoyager.html). (A more recent variant is also [available](https://namerology.com/baby-name-grapher/).) {{content}} -- A [NY Times visualization](https://archive.nytimes.com/www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html?_r=0) of movie box office results is another example. ([Blog post with a static version](https://flowingdata.com/2008/02/25/ebb-and-flow-of-box-office-receipts-over-past-20-years/)). {{content}} -- Some R implementations on GitHub: {{content}} -- * [`ggTimeSeries`](https://github.com/AtherEnergy/ggTimeSeries) * [`streamgraph`](https://hrbrmstr.github.io/streamgraph/) (uses D3) * [`ggstream`](https://github.com/davidsjoberg/ggstream). -- .pull-right[ A stream graph for movie genres (these are not mutually exclusive): .hide-code[ ```r ## install with: remotes::install_github("hrbrmstr/streamgraph") library(streamgraph) library(tidyverse) genres <- c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance") mymovies <- select(ggplot2movies::movies, year, one_of(genres)) mymovies_long <- pivot_longer( mymovies, -year, names_to = "genre", values_to = "value") movie_counts <- count(mymovies_long, year, genre) streamgraph(movie_counts, "genre", "n", "year") ```
] ] <!-- nice example with Xmen characters from TidyTuesday https://github.com/Z3tt/TidyTuesday/blob/master/R/2020_27_ClaremontRunXMen.Rmd --> --- layout: false ## Reading Chapters [_Visualizing proportions_](https://clauswilke.com/dataviz/visualizing-proportions.html) and [_Visualizing nested proportions_](https://clauswilke.com/dataviz/nested-proportions.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). --- layout: false ## Interactive Tutorial An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](../tutorials/proportions.Rmd). You can run the tutorial with ```r STAT4580::runTutorial("proportions") ``` You can install the current version of the `STAT4580` package with ```r remotes::install_gitlab("luke-tierney/STAT4580") ``` You may need to install the `remotes` package from CRAN first. --- layout: true ## Exercises --- 1) Figure A shows a bar char of the flights leaving NYC airports in 2013 for each day of the week. Figure B shows the market share of five major internet browsers in 2015. <img src="proportions_files/figure-html/unnamed-chunk-102-1.png" style="display: block; margin: auto;" /> For which of these bar charts would it be better to reorder the categories so the bars are ordered from largest to smallest? * a. Yes for Figure A. No for Figure B. * b. No for Figure A. Yes for Figure B. * c. Yes for both. * d. No for both. --- 2) Consider the stacked bar chart `p1` and the spine plot `p2` for the hair and eye color data produced by the following code: ```r library(dplyr) library(ggplot2) library(ggmosaic) ecols <- c(Brown = "brown2", Blue = "blue2", Hazel = "darkgoldenrod3", Green = "green4") HairEyeColorDF <- as.data.frame(HairEyeColor) p0 <- ggplot(HairEyeColorDF) + scale_fill_manual(values = ecols) + theme_minimal() p1 <- p0 + geom_col(aes(x = Hair, y = Freq / sum(Freq), fill = Eye)) p2 <- p0 + geom_mosaic(aes(x = product(Hair), fill = Eye, weight = Freq)) ``` Use the two plots to answer: Which hair color has the highest proportion of individuals with green eyes? * a. Black * b. Brown * c. Red * d. Blond Which plot makes it easiest to answer this question? --- 3) Use the plots of the previous question to answer: The proportion of individuals with red hair is closest to: * a. 5% * b. 8% * c. 12% * c. 20% Which plot makes it easiest to answer this question? <!-- pareto chart - pie charts - bar charts - stacked bar charts - grouped bar charts - population pyramids - waffle charts, square pie charts - joint and conditional distributions - spine plots - spinograms? - double decker plots - cd_plots - mosaic plots - tree maps - sankey diagrams, alluvial charts, parallel sets - chord diagrams - stream graphs -->
//adapted from Emi Tanaka's gist at //https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505