Categorical Data

Categorical data can be

• nominal, qualitative

• ordinal

For visualization, the main difference is that ordinal data suggests a particular display order.

Purely categorical data can come in a range of formats.

The most common are

• raw data: individual observations;

• aggregated data: counts for each unique combination of levels;

• cross-tabulated data.

Raw Data

Raw data for a survey of individuals that records hair color, eye color, and gender of 592 individuals might look like this:

head(raw)
##    Hair   Eye    Sex
## 1 Brown  Blue   Male
## 2 Brown Brown   Male
## 3 Brown Hazel   Male
## 4 Blond Green Female
## 5 Brown Brown Female
## 6 Brown Hazel   Male

Aggregated Data

One way to aggregate raw categorical data is to use count from dplyr:

library(dplyr)
agg <- count(raw, Hair, Eye, Sex)
##    Hair   Eye    Sex  n
## 1 Black Brown   Male 32
## 2 Black Brown Female 36
## 3 Black  Blue   Male 11
## 4 Black  Blue Female  9
## 5 Black Hazel   Male 10
## 6 Black Hazel Female  5

Cross-Tabulated Data

Cross-tabulated data can be produced from aggregate data using xtabs:

xtabs(n ~ Hair + Eye + Sex, data = agg)
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

Cross-tabulated data can be produced from raw data using table:

xtb <- table(raw)
xtb
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

Both raw and aggregate data in this example are in tidy form; the cross-tabulated date is not.

Cross-tabulated data on $$p$$ variables is arranged in a $$p$$-way array.

The cross-tabulated data can be converted to the tidy aggregate form using as.data.frame:

class(xtb)
## [1] "table"
##    Hair   Eye  Sex Freq
## 1 Black Brown Male   32
## 2 Brown Brown Male   53
## 3   Red Brown Male   10
## 4 Blond Brown Male    3
## 5 Black  Blue Male   11
## 6 Brown  Blue Male   50

The variable xtb corresponds to the data set HairEyeColor in the datasets package,

Working With Categorical Variables

Categorical variables are usually represented as:

• character vectors

• factors.

• more control over ordering of levels

• levels are preserved when forming subsets

• levels can reflect possible values not present in the data

Most plotting and modeling functions will convert character vectors to factors with levels ordered alphabetically.

Some standard R functions for working with factors include

• factor creates a factor from another type of variable
• levels returns the levels of a factor
• reorder changes level order to match another variable
• relevel moves a particular level to the first position as a base line
• droplevels removes levels not in the variable.

The tidyverse package forcats adds some more tools, including

• fct_inorder creates a factor with levels ordered by first appearance
• fct_infreq orders levels by decreasing frequency
• fct_rev reverses the levels
• fct_recode changes factor levels
• fct_relevel moves one or more levels
• fct_c merges two or more factors
• fct_collapse merge some factor levels

Bar Charts For Frequencies

Basics

A bar chart is often used to show the frequencies of a categorical variable.

By default, geom_bar uses stat = "count" and maps its result to the y aesthetic.

This is suitable for raw data:

thm <- theme_minimal() +
theme(text = element_text(size = 16))
ggplot(raw) +
geom_bar(aes(x = Hair),
fill = "deepskyblue3") +
thm

For a nominal variable it is often better to order the bars by decreasing frequency:

library(forcats)
ggplot(mutate(raw,
Hair = fct_infreq(Hair))) +
geom_bar(aes(x = Hair),
fill = "deepskyblue3") +
thm

If the data have already been aggregated, then you need to either specify stat = "identity" as well as the variable containing the counts as the y aesthetic, or use geom_col:

ggplot(agg) +
geom_col(aes(x = Hair,
y = n),
fill = "deepskyblue3") +
thm

For aggregated data, reordering can be based on the computed counts using

agg_ord <-
mutate(agg,
Hair = reorder(Hair, -n, sum))
• -n is used to order largest to smallest;

• the default summary used by reorder is mean; sum is better here.

ggplot(agg_ord) +
geom_col(aes(x = Hair,
y = n),
fill = "deepskyblue3") +
thm

Mapping the Eye variable to fill in ggplot produces a stacked bar chart.

An alternative, specified with position = "dodge", is a side by side bar chart, or a clustered bar chart.

For the side by side chart in particular it may be useful to also reorder the Eye color levels.

ecols <- c(Brown = "brown2",
Blue = "blue2",
Hazel = "darkgoldenrod3",
Green = "green4")
agg_ord <-
mutate(agg,
Hair = reorder(Hair, -n, sum),
Eye = reorder(Eye, -n, sum))
p1 <- ggplot(agg_ord) +
geom_col(aes(x = Hair,
y = n,
fill = Eye)) +
scale_fill_manual(values = ecols) +
thm
p2 <- ggplot(agg_ord) +
geom_col(aes(x = Hair,
y = n,
fill = Eye),
position = "dodge") +
scale_fill_manual(values = ecols) +
thm
(p1 + guides(fill = "none")) | p2

Faceting can be used to bring in additional variables:

p1 + facet_wrap(~ Sex)

The counts shown here may not be the most relevant features for understanding the joint distributions of these variables.

Pie Charts and Doughnut Charts

Pie charts can be viewed as stacked bar charts in polar coordinates:

hcols <- c(Black = "black", Brown = "brown4",
Red = "brown1", Blond = "lightgoldenrod1")
p1 <- ggplot(agg_ord) +
geom_col(aes(x = 1, y = n, fill = Hair), position = "fill") +
coord_polar(theta = "y") +
scale_fill_manual(values = hcols) +
thm
p2 <- ggplot(agg_ord) +
geom_col(aes(x = Hair, y = n, fill = Hair)) +
scale_fill_manual(values = hcols) +
thm
(p1 + guides(fill = "none")) | p2

The axes and grid lines are not helpful for the pie chart and can be removed with some theme settings.

Using faceting we can also separately show the distributions for men and women:

pie_thm <- thm +
theme(axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank())

p3 <- p1 + facet_wrap(~ Sex) + pie_thm
p3

Doughnut charts are a variant that has recently become popular in the media:

p4 <- p3 + xlim(0, 1.5)
p4

The center is often used for annotation:

p4 + geom_text(aes(x = 0, y = 0, label = Sex), size = 5) +
theme(strip.background = element_blank(),
strip.text = element_blank())

An alternative to the polar coordinates approach uses geom_arc_bar and stat_pie from package ggforce:

library(ggforce)
arrange(agg_ord, desc(Hair)) %>%
ggplot(aes(x0 = 0, y0 = 0, r0 = 0, r = 1, amount = n, fill = Hair)) +
geom_arc_bar(stat = "pie", color = NA) +
coord_fixed() +
scale_fill_manual(values = hcols) +
pie_thm +
facet_wrap(~ Sex)

For doughnut charts:

arrange(agg_ord, desc(Hair)) %>%
ggplot(aes(x0 = 0, y0 = 0, r0 = 0.4, r = 1, amount = n, fill = Hair)) +
geom_arc_bar(stat = "pie", color = NA) +
geom_text(aes(x = 0, y = 0, label = Sex), size = 5) +
coord_fixed() +
scale_fill_manual(values = hcols) +
pie_thm +
theme(strip.background = element_blank(),
strip.text = element_blank()) +
facet_wrap(~ Sex)

Some Notes

Pie charts are effective for judging part/whole relationships.

Pie charts can be effective for comparing proportions to

• one half
• one quarter

Pie charts are not very effective for comparing proportions to each other.

3D pie charts are popular and a very bad idea. An example (Fig. 6.61) from Andy Kirk’s book (2016), Data Visualization: A Handbook for Data Driven Design:

Pie charts are widely used for political data.

• With the right ordering, pie charts are very good at showing which coalitions of parties can form a majority.

• When no one candidate earns a majority of the votes, pie charts do not show which candidate has earned a plurality very well.

• Good orientation and factor ordering can help.

elect <- geofacet::election %>%
group_by(candidate) %>%

p1 <- ggplot(elect) +
geom_col(aes(x = 1, y = votes, fill = candidate), position = "fill") +
coord_polar(theta = "y", start = -1) +
xlim(c(-0.5, 1.5)) +
scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80),
Clinton = scales::muted("blue", 50, 70),
Other = "grey")) +
pie_thm

p2 <- mutate(elect,
candidate = factor(candidate,
c("Clinton", "Other", "Trump"))) %>%
ggplot() +
geom_col(aes(x = 1, y = votes, fill = candidate), position = "fill") +
coord_polar(theta = "y") +
xlim(c(-0.5, 1.5)) +
scale_fill_manual(values = c(Trump = scales::muted("red", 50, 80),
Clinton = scales::muted("blue", 50, 70),
Other = "grey")) +
pie_thm

(p1 + guides(fill = "none")) + p2

Some Alternatives

Stacked Bar Charts

Stacked bar charts with equal heights, or filled bar charts, are an alternative for representing part-whole relationships.

• Top and bottom proportions are easy to compare.

• Comparing proportions to one half and one quarter is harder.

ggplot(agg) +
geom_col(aes(x = Sex, y = n, fill = Hair), position = "fill") +
scale_fill_manual(values = hcols) +
thm

Waffle Charts

Another alternative is a waffle chart, sometimes also called a square pie chart.

The waffle package is one R implementation of this idea.

Currently the development version on GitHub is needed for the following examples.

Showing the counts:

library(waffle)
stopifnot(packageVersion("waffle") >= "1.0.1")
ggplot(arrange(agg, Hair), aes(values = n, fill = Hair)) +
geom_waffle(n_rows = 18, flip = TRUE, color = "white", size = 0.33) +
coord_equal() +
facet_wrap(~ Sex) +
scale_fill_manual(values = hcols) +
theme_minimal() +
theme_enhance_waffle()

Showing the proportions:

round_pct <- function(n) {
pct <- 100 * (n / sum(n))
nn <- floor(pct)
if (sum(nn) < 100) {
rem <- pct - nn
idx <- sort(order(rem), decreasing = TRUE)[seq_len(100 - sum(nn))]
nn[idx] <- nn[idx] + 1
}
nn
}

group_by(agg, Sex) %>%
mutate(pct = round_pct(n)) %>%
ungroup() %>%
ggplot(aes(values = pct, fill = Hair)) +
geom_waffle(n_rows = 10, flip = TRUE, color = "white", size = 0.33) +
coord_equal() +
facet_wrap(~ Sex) +
scale_fill_manual(values = hcols) +
theme_minimal() +
theme_enhance_waffle()

Population Pyramids

Bar charts for two groups can be shown back to back.

mutate(agg, Hair = reorder(Hair, n, sum)) %>%
ggplot(aes(x = ifelse(Sex == "Male", n, -n),
y = Hair,
fill = Sex)) +
geom_col() +
xlab("Count") +
thm

This is often used for showing age distributions by sex for populations; the result is called a population pyramid.

Age distribution data for many countries and years is available from a Census Bureau website.

Data files for 2020 for Germany and Nigeria are available locally.

if (! file.exists("germany-2020.csv"))
"germany-2020.csv")
if (! file.exists("nigeria-2020.csv"))
"nigeria-2020.csv")

gm_pop <- read.csv("germany-2020.csv", skip = 1) %>%
filter(Age != "Total") %>%
mutate(Age = fct_inorder(Age))

ni_pop <- read.csv("nigeria-2020.csv", skip = 1) %>%
filter(Age != "Total") %>%
mutate(Age = fct_inorder(Age))

Combining the data sets allows a side by side comparison of the counts:

library(tidyr)
pop2 <-
bind_rows(mutate(gm_pop, Country = "Germany"),
mutate(ni_pop, Country = "Nigeria")) %>%
select(Age,
Male = Male.Population,
Female = Female.Population, Country) %>%
pivot_longer(Male : Female,
names_to = "Sex",
values_to = "n")

ggplot(pop2) +
geom_col(aes(x = ifelse(Sex == "Male", n, -n),
y = Age,
fill = Sex)) +
facet_wrap(~ Country) +
scale_x_continuous(
labels = function(n) scales::comma(abs(n))) +
xlab("Count") +
thm +
theme(legend.position = "top")

The different shapes are evident, but are harder to see than they could be because of the difference in total population:

group_by(pop2, Country) %>%
summarize(Population = sum(n)) %>%
ungroup() %>%
mutate(Population = scales::comma(Population)) %>%
knitr::kable(format = "html", align = "lr") %>%
kableExtra::kable_styling(full_width = FALSE)
Country Population
Germany 80,159,662
Nigeria 214,028,302

Using a group mutate we can compute sex/age group percentages within each country:

group_by(pop2, Country) %>%
mutate(pct = 100 * n / sum(n)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = ifelse(Sex == "Male", pct, -pct),
y = Age,
fill = Sex)) +
facet_wrap(~ Country) +
scale_x_continuous(
labels = function(x) scales::percent(abs(x / 100))) +
xlab("Percent") +
thm +
theme(legend.position = "top")

Multiple Categorical Variables

Visualizing the distribution of multiple categorical variables involves visualizing counts and proportions.

Distributions can be viewed as

• joint distributions;

• conditional distributions.

When one variable (or several) can be viewed as a response and others as predictors then it is common to focus on the conditional distribution of the response given the predictors.

The most common approaches use variants of bar and area charts.

The resulting plots are often called mosaic plots.

Two Data Sets

Hair and Eye Color

HairEyeColorDF <-
as.data.frame(HairEyeColor)
##    Hair   Eye  Sex Freq
## 1 Black Brown Male   32
## 2 Brown Brown Male   53
## 3   Red Brown Male   10
## 4 Blond Brown Male    3
## 5 Black  Blue Male   11
## 6 Brown  Blue Male   50

Marginal distributions of the variables:

p1 <- ggplot(HairEyeColorDF) +
geom_col(aes(Sex, Freq), fill = "deepskyblue3") +
thm
p2 <- ggplot(HairEyeColorDF) +
geom_col(aes(Hair, Freq), fill = "deepskyblue3") +
thm
p3 <- ggplot(HairEyeColorDF) +
geom_col(aes(Eye, Freq), fill = "deepskyblue3") +
thm
p1 | p2 | p3

Arthritis Data

The vcd package includes the data frame Arthritis with several variables for 84 patients in a clinical trial for a treatment for rheumatoid arthritis.

data(Arthritis, package = "vcd")
##   ID Treatment  Sex Age Improved
## 1 57   Treated Male  27     Some
## 2 46   Treated Male  29     None
## 3 77   Treated Male  30     None
## 4 17   Treated Male  32   Marked
## 5 36   Treated Male  46   Marked
## 6 23   Treated Male  58   Marked
• The Improved variable is the response.

• The predictors are Treatment, Sex, and Age.

Counts for the categorical predictors:

xtabs(~ Sex, Arthritis)
## Sex
## Female   Male
##     59     25
xtabs(~ Treatment, Arthritis)
## Treatment
## Placebo Treated
##      43      41
xtabs(~ Treatment + Sex, data = Arthritis)
##          Sex
## Treatment Female Male
##   Placebo     32   11
##   Treated     27   14

Joint distribution of the predictors:

ggplot(Arthritis) +
geom_histogram(aes(x = Age),
binwidth = 10,
fill = "deepskyblue3",
color = "black") +
facet_grid(Treatment ~ Sex) +
thm

Conditional distribuiton of age, given sex and treatment:

ggplot(Arthritis) +
geom_histogram(aes(x = Age,
y = after_stat(density)),
binwidth = 10,
fill = "deepskyblue3",
color = "black") +
facet_grid(Treatment ~ Sex) +
thm

Bar Charts

Hair and Eye Color

Default bar charts show the individual count or joint proportions.

For the hair-eye color aggregated data counts:

ggplot(HairEyeColorDF) +
geom_col(aes(x = Eye, y = Freq, fill = Sex)) +
facet_wrap(~ Hair) +
thm

Joint proportions:

ggplot(mutate(HairEyeColorDF, Prop = Freq / sum(Freq))) +
geom_col(aes(x = Eye, y = Prop, fill = Sex)) +
facet_wrap(~ Hair) +
thm

• Differing frequencies of the hair colors are visible.

• Conditional distributions of eye color within hair color are harder to compare.

Showing conditional distributions requires computing proportions within groups.

For the joint conditional distribution of sex and eye color given hair color:

group_by(HairEyeColorDF, Hair) %>%
mutate(Prop = Freq / sum(Freq)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = Eye, y = Prop, fill = Sex)) +
facet_wrap(~ Hair) +
thm

• It is easier to compare the skewness of the eye color distributions for black, brown, and red hair.

• Assessing the proportion of females or males withing the different groups is possible but challenging since it requires relative length comparisons.

To more clearly see the that the proportion of females among subjects with blond hair and blue eyes is higher than for other hair/eye color combinations we can look at the conditional distribution of sex given hair and eye color.

group_by(HairEyeColorDF, Hair, Eye) %>%
mutate(Prop = Freq / sum(Freq)) %>%
ungroup() %>%
ggplot() +
geom_col(aes(x = Eye,
y = Prop,
fill = Sex)) +
facet_wrap(~ Hair, nrow = 1) +
thm +
theme(axis.text.x =
element_text(angle = 45,
hjust = 1))

This plot can also be obtained using position = "fill".

ggplot(HairEyeColorDF) +
geom_col(aes(x = Eye,
y = Freq,
fill = Sex),
position = "fill") +
facet_wrap(~ Hair, nrow = 1) +
thm +
theme(axis.text.x =
element_text(angle = 45,
hjust = 1))

One drawback: This visualization no longer shows that some of the hair/eye color combinations are more common than others.

Arthritis Data

For the raw arthritis data, geom_bar computes the aggregate counts and produces a stacked bar chart by default:

p <- ggplot(Arthritis, aes(x = Sex,
fill = Improved)) +
facet_wrap(~ Treatment)
p + geom_bar() +
scale_fill_brewer(palette = "Blues") +
thm

Specifying position = "dodge" produces a side-by-side plot:

p + geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Blues") +
thm

There are no cases of male patients on placebo reporting Some improvement, resulting in wider bars for the other options.

One way to produce a zero height bar:

• aggregate with count, and

• use complete from tidyr

library(tidyr)
comp_counts <-
count(Arthritis,
Treatment, Sex, Improved) %>%
complete(Treatment, Sex, Improved,
fill = list(n = 0))
ggplot(comp_counts,
aes(x = Sex, y = n, fill = Improved)) +
geom_col(position = "dodge") +
facet_wrap(~ Treatment) +
scale_fill_brewer(palette = "Blues") +
thm

Another option is to use the preserve = "single" option with position_dodge.

p + geom_bar(position =
position_dodge(
preserve = "single")) +
scale_fill_brewer(palette = "Blues") +
thm

Showing conditional distributions of Improved given different levels of Treatment and Sex:

group_by(comp_counts, Treatment, Sex) %>%
mutate(prop = n / sum(n)) %>%
ggplot() +
geom_col(aes(x = Sex,
y = prop,
fill = Improved),
position = "dodge") +
facet_wrap(~ Treatment) +
scale_fill_brewer(palette = "Blues") +
thm

Stacked bar charts with height one are another option to make these conditional distributions easier to compare:

p + geom_bar(position = "fill") +
scale_fill_brewer(palette = "Blues") +
thm

Ordering of variables affects which comparisons are easier.

• A researcher might want to emphasize the differential response among males and females.

• A patient might prefer to be able to focus on whether the treatment is effective for them:

ggplot(Arthritis, aes(x = Treatment, fill = Improved)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Blues") +
thm +
facet_wrap(~ Sex)

Some notes;

• The stacked bar chart is effective for two categories, and a few more if they are ordered.

• Providing a visual indication of uncertainty in the estimates is a challenge. The standard errors are around 0.1.

• The proportions of each treatment group that are male or female could be encoded in the bar widths.

• The resulting plot is called a spine plot.

• Basic ggplot2 does not seem to make this easy.

Spine Plots

Spine plots are a special case of mosaic plots, and can be seen as a generalization of stacked bar plots.

For a spine plot the proportions for the categories of a predictor variable are encoded in the bar widths.

The ggmosaic package provides support for mosaic plots in the ggplot framework. (It can be a little rough around the edges.)

Spine plots are provided by the base graphics function spineplot and the vcd function spine.

vcd plots are built on the grid graphics system, like lattice and ggplot2 graphics.

A spine plot for the distribution of Improved given Sex in the Treated group:

library(ggmosaic)
filter(Arthritis, Treatment == "Treated") %>%
mutate(Improved = fct_rev(Improved)) %>%
ggplot() +
geom_mosaic(aes(x = product(Sex),
fill = Improved)) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
facet_wrap(~ Treatment) +
thm + labs(x = "", y = "Improved")

Spine plots for Treatment groups using faceting:

library(ggmosaic)
mutate(Arthritis,
Improved = fct_rev(Improved)) %>%
ggplot() +
geom_mosaic(aes(x = product(Sex),
fill = Improved)) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
facet_wrap(~ Treatment) +
thm + labs(x = "", y = "Improved")

Spine plots for the arthritis data, faceted on Sex:

library(ggmosaic)
mutate(Arthritis,
Improved = fct_rev(Improved)) %>%
ggplot() +
geom_mosaic(aes(x = product(Treatment),
fill = Improved)) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
facet_wrap(~ Sex) +
thm + labs(x = "", y = "Improved")

For aggregate counts use the weight aesthetic:

mutate(HairEyeColorDF, Sex = fct_rev(Sex)) %>%
ggplot() +
geom_mosaic(aes(weight = Freq,
x = product(Hair),
fill = Sex)) +
thm + labs(x = "Hair", y = "")

Spine plots of Sex within Eye color, faceted on Hair color:

mutate(HairEyeColorDF, Sex = fct_rev(Sex)) %>%
ggplot() +
geom_mosaic(aes(weight = Freq,
x = product(Eye),
fill = Sex)) +
thm + labs(x = "Eye", y = "") +
facet_wrap(~ Hair,
nrow = 1,
scales = "free_x") +
theme(legend.position = "top",
axis.text.y = element_blank(),
axis.text.x =
element_text(angle = 45,
hjust = 1)) +
scale_y_continuous(expand = c(0, 0))

The relative sizes of the groups on the x (eye color) axis are shown within the facets.

The sizes of the faceted variable (hair color) groups are not reflected.

Double decker plots try to address this.

Doubledecker Plots

Doubledecker plots can be viewed as a generalization of spine plots to multiple predictors.

Package vcd provides the doubledecker function.

This function can use a formula interface.

arth_pal <-
RColorBrewer::brewer.pal(3, "Blues")
arth_gp <- grid::gpar(fill = arth_pal)
vcd::doubledecker(Improved ~ Treatment + Sex,
data = Arthritis,
gp = arth_gp,
margins = c(2, 5, 4, 2))

vcd::doubledecker(Improved ~ Sex + Treatment,
data = Arthritis,
gp = arth_gp,
margins = c(2, 5, 4, 2))

Using ggmosaic:

mutate(Arthritis,
Improved = fct_rev(Improved)) %>%
ggplot() +
geom_mosaic(
aes(x = product(Sex, Treatment),
fill = Improved),
divider = ddecker()) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
thm +
theme(axis.text.x =
element_text(angle = 15,
hjust = 1)) +
labs(x = "", y = "")

mutate(Arthritis,
Improved = fct_rev(Improved)) %>%
ggplot() +
geom_mosaic(
aes(x = product(Treatment, Sex),
fill = Improved),
divider = ddecker()) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
thm +
theme(axis.text.x =
element_text(angle = 15,
hjust = 1)) +
labs(x = "", y = "")

Mosaic Plots

Mosaic plots recursively partition the axes to represent counts of categorical variables as rectangles.

• Base graphics provides mosaicplot;

• vcd provides mosaic.

Both support a formula interface.

A Mosaic plot for the predictors Sex and Treatment:

vcd::mosaic(~ Sex + Treatment,
data = Arthritis)

Adding Improved to the joint distribution:

vcd::mosaic(~ Sex + Treatment + Improved,
data = Arthritis)

Identifying Improved as the response:

vcd::mosaic(Improved ~ Sex + Treatment,
data = Arthritis)

Matching the doubledecker plots:

vcd::mosaic(
Improved ~ Treatment + Sex,
data = Arthritis,
split_vertical = c(TRUE, TRUE, FALSE))

vcd::mosaic(
Improved ~ Sex + Treatment,
data = Arthritis,
split_vertical = c(TRUE, TRUE, FALSE))

Some variants using ggmosaic:

ggplot(mutate(Arthritis, Sex = fct_rev(Sex))) +
geom_mosaic(
aes(x = product(Treatment,
Sex))) +
coord_flip() +
labs(x = "", y = "")

ggplot(mutate(Arthritis, Sex = fct_rev(Sex))) +
geom_mosaic(aes(x = product(Improved,
Treatment,
Sex))) +
coord_flip()

A mosaic plot for all bivariate marginals:

pairs(xtabs(~ Sex + Treatment + Improved, data = Arthritis))

Spinograms and CD Plots

Spinograms and CD plots show the conditional distribution of a categorical variable given the value of a numeric variable.

• Spinograms use the same binning as a histogram and then create a spine plot.

• CD plots use a smoothing or density estimation approach.

A spinogram for Improved against Age:

ArthT <- filter(Arthritis,
Treatment == "Treated") %>%
mutate(Improved = fct_rev(Improved))
arthT_gp <-
grid::gpar(fill = rev(arth_gp$fill)) vcd::spine(Improved ~ Age, data = ArthT, gp = arthT_gp, breaks = 5) An analogous plot created with ggmosaic by binning the Age variable: Arth <- mutate(Arthritis, AgeBin = cut(Arthritis$Age,
seq(20, by = 10,
len = 7)),
Improved = fct_rev(Improved))
filter(Arth, Treatment == "Treated") %>%
count(Improved, AgeBin) %>%
ggplot() +
geom_mosaic(aes(weight = n,
x = product(AgeBin),
fill = Improved)) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
theme_minimal() +
theme(axis.title = element_blank())

A facet grid can be used to create spinograms for each of the Sex/Treatment combinations:

ggplot(count(Arth, Improved, Sex, Treatment, AgeBin)) +
geom_mosaic(aes(weight = n,
x = product(AgeBin),
fill = Improved)) +
scale_fill_brewer(palette = "Blues",
direction = -1) +
theme_minimal() +
facet_grid(Treatment ~ Sex) +
theme(axis.title = element_blank()) +
theme(axis.text.x = element_text(angle = 35,
hjust = 1),
axis.text.y = element_blank())

A spinogram in the media (NYT, August 2021):

CD plots estimate the conditional density of the x variable given the levels of y, weighted by the marginal proportions of y and use these to estimate cumulative probabilities.

• The slice at a particular x level visualizes the conditional distribution of y given x at that level.

• geom_density with position = stack is one way to create a CD plot.

• The cd_plot function from the vcd package produces a CD plot using grid graphics.

• The cdplot function from the base graphics package provides the same plots using base graphics.

CD plots for the Treated group:

filter(Arthritis, Treatment == "Treated") %>%
ggplot(aes(x = Age, fill = Improved)) +
geom_density(position = "fill", bw = 5) +
scale_fill_brewer(palette = "Blues") +
facet_wrap(~ Sex, ncol = 1) +
thm

CD plots for all combinations end up with one group of size one and one of size zero, which produces a non-useful plot for one combination:

count(Arthritis, Treatment, Sex, Improved) %>%
complete(Treatment, Sex, Improved,
fill = list(n = 0)) %>%
filter(n < 2)
## # A tibble: 2 × 4
##   Treatment Sex   Improved     n
##   <fct>     <fct> <ord>    <int>
## 1 Placebo   Male  Some         0
## 2 Placebo   Male  Marked       1

ggplot(Arthritis,
aes(x = Age, fill = Improved)) +
geom_density(position = "fill", bw = 5) +
scale_fill_brewer(palette = "Blues") +
facet_grid(Treatment ~ Sex) +
thm
## Warning: Groups with fewer than two data points have been dropped.
## Warning: Removed 1 rows containing missing values (position_stack).

Uncertainty Representation

Categorical data are often analyzed by fitting models representing conditional independence structures.

• Plotting residuals from these models can help assess how well they fit.

• mosaic supports using color to represent magnitude of residuals for comparing to a simple independence model.

For the Arthritis data, observed counts and expected counts under an independence model assuming Treatment and Improved are independent can be visualized as mosaic plots:

## there are easier ways do do this ...
v <- count(Arthritis, Treatment, Improved)
pT <- group_by(v, Treatment) %>%
summarize(n = sum(n)) %>%
mutate(pT = n / sum(n)) %>%
select(-n)
pI <- group_by(v, Improved) %>%
summarize(n = sum(n)) %>%
mutate(pI = n / sum(n)) %>%
select(-n)
v <- left_join(v, pT, "Treatment") %>%
left_join(pI, "Improved") %>%
mutate(p = pT * pI,
Treatment = fct_rev(Treatment))

po <- ggplot(v) +
geom_mosaic(aes(weight = n, x = product(Improved, Treatment),
fill = Improved)) +
scale_fill_brewer(palette = "Blues") +
guides(fill = "none") +
labs(title = "Observed Proportions") +
thm +
coord_flip() +
theme(axis.text.y = element_text(angle = 90, hjust = 0))

pe <- ggplot(v) +
geom_mosaic(aes(weight = p, x = product(Improved, Treatment),
fill = Improved)) +
scale_fill_brewer(palette = "Blues") +
guides(fill = "none") +
labs(title = "Expected Proportions") +
thm +
coord_flip() +
theme(axis.text.y = element_text(angle = 90, hjust = 0))

po + pe

A plot for assessing the fit of the residuals between the observed and expected data under a model assuming independence of Treatment and Improved produces:

vcd::mosaic(~ Treatment + Improved,
data = Arthritis,
gp = vcd::shading_max)

Another visualization of the residuals is the association plot produced by assoc:

vcd::assoc(~ Treatment + Improved,
data = Arthritis,
gp = vcd::shading_max)

References

The vignette Residual-Based Shadings in vcd in the vcd package.

Zeileis, Achim, David Meyer, and Kurt Hornik. “Residual-based shadings for visualizing (conditional) independence.” Journal of Computational and Graphical Statistics 16, no. 3 (2007): 507-525.

The vignette Working with categorical data with R and the vcd and vcdExtra packages in the vcdExtra package.

Several other experimental mosaic plot implementations are available for ggplot.

Some Other Visualizations

Tree Maps

Tree maps show hierarchically structured (or tree-tructured) data.

• Each branch is represented by a rectangle.

• Leaf node tiles have areas proportional to the value of a variable.

• Tiles are often colored to reflect the value of another variable.

The package treemapify provides a ggplot-based implementation.

The data set G20 includes some variables on the G-20 member countries:

library(treemapify)
select(G20, region,
country, gdp_mil_usd, hdi) %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(
full_width = FALSE)
region country gdp_mil_usd hdi
Africa South Africa 384315 0.629
North America United States 15684750 0.937
North America Mexico 1177116 0.775
South America Brazil 2395968 0.730
South America Argentina 474954 0.811
Asia China 8227037 0.699
Asia Japan 5963969 0.912
Asia South Korea 1155872 0.909
Asia India 1824832 0.554
Asia Indonesia 878198 0.629
Eurasia Russia 2021960 0.788
Eurasia Turkey 794468 0.722
Europe European Union 16414483 0.876
Europe Germany 3400579 0.920
Europe France 2608699 0.893
Europe United Kingdom 2440505 0.875
Europe Italy 2014079 0.881
Middle East Saudi Arabia 727307 0.782
Oceania Australia 1541797 0.938

A simple tree with only one level, the individual countries:

A corresponding tree map based on gdp_mil_usd:

ggplot(G20, aes(area = gdp_mil_usd)) +
geom_treemap() +
geom_treemap_text(aes(label = country),
color = "white")

A tree grouping by region:

A corresponding tree map:

ggplot(G20, aes(area = gdp_mil_usd,
subgroup = region)) +
geom_treemap() +
geom_treemap_text(aes(label = country),
color = "white") +
geom_treemap_subgroup_border(
color = "red") +
geom_treemap_subgroup_text(color = "red")

A tree map showing GDP values for the G-20 members, grouped by region, with fill mapped to the country’s Human Development Index:

ggplot(G20, aes(area = gdp_mil_usd,
fill = hdi,
subgroup = region)) +
geom_treemap() +
geom_treemap_text(aes(label = country),
color = "white") +
geom_treemap_subgroup_border() +
geom_treemap_subgroup_text(
color = "lightgrey")

A treemap representing the distribution of eye color within hair color:

group_by(agg, Eye, Hair) %>%
summarize(n = sum(n)) %>%
ungroup() %>%
ggplot(aes(area = n,
subgroup = Hair)) +
geom_treemap(aes(fill = Eye),
color = "white") +
geom_treemap_subgroup_text() +
geom_treemap_subgroup_border(
color = "black", size = 6) +
geom_treemap_text(aes(label = Eye),
color = "grey90") +
scale_fill_manual(values = ecols) +
guides(fill = "none")

A treemap representing proportions for Improved within Treatment within Sex for the Arthritis data:

count(Arth, Treatment, Improved, Sex) %>%
ggplot(aes(area = n,
subgroup = Sex, fill = Improved,
subgroup2 = Treatment)) +
geom_treemap() +
geom_treemap_subgroup_text() +
scale_fill_brewer(palette = "Blues",
direction = -1) +
geom_treemap_subgroup_border() +
geom_treemap_subgroup2_text(place = "top",
size = 20)

Alluvial plots

These are also known as

• parallel sets, or

• Sankey diagrams.

They can be viewed as a parallel coordinates plot for categorical data.

Several implementations available, including:

• geom_parallel_sets from ggforce;

• geom_sankey from ggsankey;

• geom_alluvium from ggalluvial.

Hair/Eye color using the ggforce package:

pal <- RColorBrewer::brewer.pal(3, "Set1")
HDF <- mutate(HairEyeColorDF,
Sex = fct_rev(Sex))
library(ggforce)
sHDF <- gather_set_data(HDF, 3 : 1)
sHDF <- mutate(sHDF, x = fct_inorder(x))
ggplot(sHDF, aes(x, id = id,
split = y,
value = Freq)) +
geom_parallel_sets(aes(fill = Sex),
alpha = 0.5,
axis.width = 0.1) +
geom_parallel_sets_axes(
axis.width = 0.1) +
geom_parallel_sets_labels(
colour = 'white') +
scale_fill_manual(
values = c(Male = pal[2],
Female = pal[1])) +
theme_void() + guides(fill = "none")

Arthritis data with ggforce:

sArth <- gather_set_data(count(Arth,
Improved,
Treatment,
Sex),
3 : 1)
sArth <- mutate(sArth,
x = fct_inorder(x),
Sex = fct_rev(Sex))
ggplot(sArth, aes(x,
id = id,
split = y,
value = n)) +
geom_parallel_sets(aes(fill = Sex),
alpha = 0.5,
axis.width = 0.1) +
geom_parallel_sets_axes(axis.width = 0.1) +
geom_parallel_sets_labels(
colour = 'white') +
scale_fill_manual(
values = c(Male = pal[2],
Female = pal[1])) +
theme_void() + guides(fill = "none")

Stream Graphs

Stream graphs are a generalization of stacked bar charts plotted against a numeric variable.

In some cases the origins of the bars are shifted to improve some aspect of the overall visualization.

An early example is the Baby Name Voyager.

A NY Times visualization of movie box office results is another example. (Blog post with a static version).

Some R implementations on GitHub:

A stream graph for movie genres (these are not mutually exclusive):

## install with: remotes::install_github("hrbrmstr/streamgraph")
library(streamgraph)
library(tidyverse)
genres <- c("Action", "Animation", "Comedy",
"Drama", "Documentary", "Romance")
mymovies <- select(ggplot2movies::movies,
year, one_of(genres))
mymovies_long <- pivot_longer(
mymovies, -year,
names_to = "genre",
values_to = "value")
movie_counts <- count(mymovies_long,
year, genre)
streamgraph(movie_counts, "genre", "n", "year")

Interactive Tutorial

An interactive learnr tutorial for these notes is available.

You can run the tutorial with

STAT4580::runTutorial("proportions")

You can install the current version of the STAT4580 package with

remotes::install_gitlab("luke-tierney/STAT4580")

You may need to install the remotes package from CRAN first.

Exercises

1. Figure A shows a bar char of the flights leaving NYC airports in 2013 for each day of the week. Figure B shows the market share of five major internet browsers in 2015.

For which of these bar charts would it be better to reorder the categories so the bars are ordered from largest to smallest?

1. Yes for Figure A. No for Figure B.
2. No for Figure A. Yes for Figure B.
3. Yes for both.
4. No for both.
2. Consider the stacked bar chart p1 and the spine plot p2 for the hair and eye color data produced by the following code:

library(dplyr)
library(ggplot2)
library(ggmosaic)
ecols <- c(Brown = "brown2", Blue = "blue2",
Hazel = "darkgoldenrod3", Green = "green4")
HairEyeColorDF <- as.data.frame(HairEyeColor)
p0 <- ggplot(HairEyeColorDF) +
scale_fill_manual(values = ecols) +
theme_minimal()

p1 <- p0 + geom_col(aes(x = Hair, y = Freq / sum(Freq), fill = Eye))

p2 <- p0 + geom_mosaic(aes(x = product(Hair), fill = Eye, weight = Freq))

Use the two plots to answer: Which hair color has the highest proportion of individuals with green eyes?

1. Black
2. Brown
3. Red
4. Blond

Which plot makes it easiest to answer this question?

3. Use the plots of the previous question to answer: The proportion of individuals with red hair is closest to:

1. 5%
2. 8%
3. 12%
4. 20%

Which plot makes it easiest to answer this question?