---
title: "Visualizing a Categorical Variable"
output:
  html_document:
    toc: yes
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

```{r, include = FALSE}
library(dplyr)
library(ggplot2)
library(lattice)
library(gridExtra)
set.seed(12345)
```

## Categorical Data

Categorical data can be

* nominal, qualitative
* ordinal

For visualization, the main difference is that ordinal data suggests a particular display order.

Purely categorical data can come in a range of formats. The most common are

* raw data: individual observations;
* aggregated data: counts for each unique combination of levels;
* cross-tabulated data.

### Raw Data

```{r, echo = FALSE}
## Expand the aggregated HairEyeColor counts to one row per individual,
## drop the Freq column, and shuffle the rows.
ah <- as.data.frame(HairEyeColor)
raw <- ah[rep(seq_len(nrow(ah)), times = ah$Freq), ][-4]
raw <- raw[sample(seq_len(nrow(raw))), ]
row.names(raw) <- NULL
```

Raw data from a survey that records the hair color, eye color, and gender of `r nrow(raw)` individuals might look like this:

```{r}
head(raw)
```

### Aggregated Data

One way to aggregate raw categorical data is to use `count` from `dplyr`:

```{r}
library(dplyr)
agg <- count(raw, Hair, Eye, Sex)
head(agg)
```

The `count_` function from `dplyr` allows the names of the variables to count by to be read from the data. Note that `count_` has been deprecated since `dplyr` 0.7.0; with `dplyr` 1.1.0 or later, `count(raw, pick(everything()))` should give the same counts.

```{r}
agg <- count_(raw, names(raw))
head(agg)
```

### Cross-Tabulated Data

Cross-tabulated data can be produced from aggregate data using `xtabs`:

```{r}
xtabs(n ~ Hair + Eye + Sex, data = agg)
```

Cross-tabulated data can be produced from raw data using `table`:

```{r}
xtb <- table(raw)
xtb
```

* Both raw and aggregate data in this example are in _tidy_ form; the cross-tabulated data is not.
* Cross-tabulated data on $p$ variables is arranged in a $p$-way array.
* The cross-tabulated data can be converted to the tidy aggregate form using `as.data.frame`:

```{r}
class(xtb)
head(as.data.frame(xtb))
```

The variable `xtb` corresponds to the data set `HairEyeColor` in the `datasets` package.

### Working With Categorical Variables

Categorical variables are usually represented as:

* character vectors
* factors.

Some advantages of factors:

* more control over ordering of levels
* levels are preserved when forming subsets

Most plotting and modeling functions will convert character vectors to factors with levels ordered alphabetically.

Some standard R functions for working with factors include

* `factor` creates a factor from another type of variable
* `levels` returns the levels of a factor
* `reorder` changes level order to match another variable
* `relevel` moves a particular level to the first position as a base line
* `droplevels` removes levels not in the variable.

The `tidyverse` package `forcats` adds some more tools, including

* `fct_inorder` creates a factor with levels ordered by first appearance
* `fct_infreq` orders levels by decreasing frequency
* `fct_rev` reverses the levels
* `fct_recode` changes factor levels
* `fct_relevel` moves one or more levels
* `fct_c` concatenates two or more factors, combining their levels

## Bar Charts For Frequencies

### Basics

The bar chart is often used to show the frequencies of a categorical variable.

By default, `geom_bar` uses `stat = "count"` and maps its result to the `y` aesthetic.

This is suitable for raw data:

```{r}
ggplot(raw) +
    geom_bar(aes(x = Hair))
```

For a nominal variable it is often better to order the bars by decreasing frequency:

```{r}
library(forcats)
ggplot(mutate(raw, Hair = fct_infreq(Hair))) +
    geom_bar(aes(x = Hair))
```

If the data have already been aggregated, then you need to specify `stat = "identity"` and map the variable containing the counts to the `y` aesthetic:

```{r}
ggplot(agg) +
    geom_bar(aes(x = Hair, y = n), stat = "identity")
```

An alternative is to use `geom_col`.
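As a minimal, self-contained sketch of the `geom_col` alternative, the chunk below builds its own small aggregated table from the built-in `HairEyeColor` data instead of relying on `agg`. The names `agg_hair` and `p_col` are illustrative and not part of the original notes.

```{r}
library(ggplot2)

# Aggregate the built-in HairEyeColor table over eye color and sex;
# as.data.frame() on the resulting table yields the columns Hair and Freq
agg_hair <- as.data.frame(margin.table(HairEyeColor, 1))

# geom_col() uses stat = "identity" by default, so the counts in Freq
# are drawn directly as bar heights
p_col <- ggplot(agg_hair) +
    geom_col(aes(x = Hair, y = Freq))
p_col
```

This should produce the same kind of chart as `geom_bar` with `stat = "identity"`, without having to name the stat.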
For aggregated data reordering can be based on the computed counts using

```{r}
agg_ord <- mutate(agg, Hair = reorder(Hair, -n, sum))
```

* `-n` is used to order largest to smallest;
* the default summary used by `reorder` is `mean`; `sum` is better here.

```{r}
ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n))
```

### Adding a Grouping Variable

Mapping the `Eye` variable to `fill` in `ggplot` produces a _stacked bar chart_.

An alternative, specified with `position = "dodge"`, is a _side by side_ bar chart, or a _clustered_ bar chart.

For the side by side chart in particular it may be useful to also reorder the `Eye` color levels.

```{r}
ecols <- c(Brown = "brown2", Blue = "blue2",
           Hazel = "darkgoldenrod3", Green = "green4")
agg_ord <- mutate(agg,
                  Hair = reorder(Hair, -n, sum),
                  Eye = reorder(Eye, -n, sum))
p1 <- ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n, fill = Eye)) +
    scale_fill_manual(values = ecols)
p2 <- ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n, fill = Eye), position = "dodge") +
    scale_fill_manual(values = ecols)
grid.arrange(p1, p2, nrow = 1)
```

Faceting can be used to bring in additional variables:

```{r}
p1 + facet_wrap(~ Sex)
```

The counts shown here may not be the most relevant features for understanding the joint distributions of these variables.

## Pie Charts and Doughnut Charts

Pie charts can be viewed as stacked bar charts in polar coordinates:

```{r}
hcols <- c(Black = "black", Brown = "brown3",
           Red = "brown1", Blond = "lightgoldenrod1")
p1 <- ggplot(agg_ord) +
    geom_col(aes(x = 1, y = n, fill = Hair), position = "fill") +
    coord_polar(theta = "y") +
    scale_fill_manual(values = hcols)
p2 <- ggplot(agg_ord) +
    geom_col(aes(x = Hair, y = n, fill = Hair)) +
    scale_fill_manual(values = hcols)
grid.arrange(p1, p2, nrow = 1)
```

The axes and grid lines are not helpful for the pie chart and can be removed with some _theme_ settings.
Using faceting we can also show the distributions for men and women separately:

```{r}
p3 <- p1 + facet_wrap(~ Sex) +
    theme_bw() +
    theme(axis.title = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank())
p3
```

Doughnut charts are a variant that has recently become popular in the media:

```{r}
p4 <- p3 + xlim(0, 1.5)
p4
```

The center is often used for annotation:

```{r}
p4 + geom_text(aes(x = 0, y = 0, label = Sex)) +
    theme(strip.background = element_blank(),
          strip.text = element_blank())
```

## Some Notes

* Pie charts are effective for judging part/whole relationships.
* Pie charts are not very effective for comparing proportions.
* 3D pie charts are popular and a very bad idea.

An example ([Fig. 6.61](https://www.dropbox.com/s/tlehzi3kb6ikbsz/6.61.3DIllustration.png?dl=0)) from Andy Kirk's book (2016), [_Data Visualization: A Handbook for Data Driven Design_](http://book.visualisingdata.com/home):

![](img/badpie.png)

Stacked bar charts with equal heights are an alternative for representing part-whole relationships:

```{r}
ggplot(agg) +
    geom_col(aes(x = Sex, y = n, fill = Hair), position = "fill") +
    scale_fill_manual(values = hcols)
```

Another alternative is a _waffle chart_, sometimes also called a _square pie chart_.

![](img/waffle.png)

The [`waffle`](https://github.com/hrbrmstr/waffle) package is one R implementation of this idea.
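
The waffle idea can also be sketched with plain `ggplot2`, without the `waffle` package. The chunk below is a rough illustration under some assumptions that are not in the notes above: a 10 x 10 grid in which each square stands for about one percent, largest-remainder rounding so the squares total exactly 100, and the illustrative names `hair`, `squares`, `waffle_df`, and `p_waffle`.

```{r}
library(ggplot2)

# Hair color counts from the built-in HairEyeColor table (named vector)
hair <- apply(HairEyeColor, 1, sum)

# Allocate 100 squares by largest remainder so they sum to exactly 100
share <- 100 * hair / sum(hair)
squares <- floor(share)
extra <- 100 - sum(squares)
top <- order(share - squares, decreasing = TRUE)[seq_len(extra)]
squares[top] <- squares[top] + 1

# One row per square on a 10 x 10 grid, filled row by row
waffle_df <- data.frame(
    x = rep(1:10, times = 10),
    y = rep(1:10, each = 10),
    Hair = factor(rep(names(hair), times = squares),
                  levels = names(hair)))

hcols <- c(Black = "black", Brown = "brown3",
           Red = "brown1", Blond = "lightgoldenrod1")

# Draw each square as a tile with a white border
p_waffle <- ggplot(waffle_df) +
    geom_tile(aes(x = x, y = y, fill = Hair), color = "white") +
    scale_fill_manual(values = hcols) +
    coord_equal() +
    theme_void()
p_waffle
```

Because every square has the same area, proportions can be read by counting squares rather than by judging angles, which is the main appeal of this display over a pie chart.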