---
title: "Visualizing Amounts"
output:
  html_document:
    toc: yes
    code_folding: "hide"
    code_download: true
---

```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")

library(lattice)
library(tidyverse)
library(gridExtra)
library(scales)
set.seed(12345)
```

```{r, include = FALSE}
source(here::here("datasets.R"))
```


## Visualization and Comparison

A visualization of a single value without some comparative context is
rarely useful:

```{r, echo = FALSE}
library(ggplot2)
library(scales)
ggplot() +
    geom_point(aes(x = 0, y = 0),
               shape = 21, size = 150, fill = muted("blue")) +
    geom_text(aes(x = 0, y = 0, label = 200),
              size = 50, color = "white") +
    theme_void()
```

Useful visualizations almost always involve comparisons.

* position of a value on an axis
* relative position of two values
* relative magnitudes of two values

A simple and common setting: Visualizing a measurement for each of a
set of categories.

Some of the more common visualizations:

* Dot Charts (also called Cleveland Dot Charts)
* Bar Charts
* Grouped Bar Charts (Clustered Bar Charts)
* Stacked Bar Charts
* Diverging Bar Charts
* Heat Maps
* Bubble Charts

Some less frequently used visualizations:

* Waterfall Charts
* Polar Area Charts (Coxcomb Charts)
* Dumbbell Charts

Some questions to keep in mind:

* What comparisons and other assessments do these visualizations support?
* How do these visualizations scale to larger data sets?


## Dot Plots

### Basics

One of the simplest visualizations of a single numerical variable with
a modest number of observations and labels for the observations is a
_dot plot_, or _Cleveland dot plot_:

```{r, message = FALSE, class.source = "fold-hide"}
library(dplyr)
library(ggplot2)
library(gapminder)
le_am_2007 <- filter(gapminder, year == 2007, continent == "Americas")
thm <- theme_minimal() + theme(text = element_text(size = 16))
ggplot(le_am_2007, aes(y = country, x = lifeExp)) +
    geom_point(color = "deepskyblue3", size = 2) +
    labs(x = "Life Expectancy (years)", y = NULL) +
    thm
```

```{r, echo = FALSE, eval = FALSE}
## to order alphabetically from top to bottom the levels need to be
## in reverse alphabetical order, since the first level is drawn at
## the bottom of the axis:
le_am_2007 <-
    mutate(le_am_2007,
           country = factor(country, rev(sort(unique(country)))))
```

This visualization

* shows the overall distribution of the data, and
* makes it easy to locate the life expectancy of a particular country.

Unless there is a natural order to the categories (e.g. months of the
year or days of the week) it is usually better to reorder to make the
plot increasing or decreasing:

```{r, class.source = "fold-hide"}
ggplot(le_am_2007, aes(y = reorder(country, lifeExp), x = lifeExp)) +
    geom_point(color = "deepskyblue3", size = 2) +
    labs(x = "Life Expectancy (years)", y = NULL) +
    thm
```

* Locating a particular country is a little more difficult.
* But the shape of the distribution is more apparent.
* Approximate median and quartiles can be read off easily.

Dot plots are particularly appropriate for interval data.

* they often do not show the origin;
* they focus the viewer's attention on differences.

Dot plots are often very useful for group summaries like totals or
averages.

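The reordering used above relies on `reorder()`, which sorts the levels of a factor by a summary of a second variable within each level (the mean by default). The following is only a minimal sketch with hypothetical toy data; the data frame `toy` and its values are made up for illustration and are not part of the data sets used in these notes.

```{r}
## hypothetical toy data: two measurements for each of three categories
toy <- data.frame(cat = c("a", "b", "c", "a", "b", "c"),
                  val = c(5, 1, 3, 7, 2, 2))

## levels sorted by increasing mean of val within each category
toy_inc <- reorder(toy$cat, toy$val)
levels(toy_inc)

## negating the values and summarizing with min sorts the levels
## by decreasing maximum of val
toy_dec <- reorder(toy$cat, -toy$val, FUN = min)
levels(toy_dec)
```

When a category has several rows, the summary function determines the order, so it is worth choosing it deliberately.
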
For the `barley` data, total yield within each site, adding up across
all varieties and both years, can be computed as

```{r}
b_tot_site <-
    group_by(barley, site) %>%
    summarize(yield = sum(yield))
```

The totals can then be visualized in a dot plot:

```{r, class.source = "fold-hide"}
ggplot(b_tot_site, aes(x = yield, y = site)) +
    geom_point(color = "deepskyblue3", size = 2.5) +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    thm
```


### Larger Data Sets

For larger data sets, like the `citytemps` data with `r nrow(citytemps)` observations, over-plotting of labels becomes a problem:

```{r, class.source = "fold-hide"}
ggplot(citytemps, aes(x = temp, y = reorder(city, temp))) +
    geom_point(color = "deepskyblue3", size = 0.5) +
    labs(x = "Temperature (degrees F)", y = NULL) +
    thm
```

Reducing to 30 or 40 observations, e.g. by taking a sample or a
meaningful subset, can help:

```{r, class.source = "fold-hide"}
ct1 <- filter(citytemps, temp < 32) %>% sample_n(10)
ct2 <- filter(citytemps, temp >= 32) %>% sample_n(20)
ctsamp <- bind_rows(ct1, ct2)

ggplot(ctsamp, aes(x = temp, y = reorder(city, temp))) +
    geom_point(color = "deepskyblue3", size = 2) +
    labs(x = "Temperature (degrees F)", y = NULL) +
    thm
```


### Some Variations

The size of the dots can be used to encode an additional numeric variable.

This view uses area to encode population size:

```{r, class.source = "fold-hide"}
ggplot(le_am_2007,
       aes(y = reorder(country, lifeExp),
           x = lifeExp,
           size = pop / 1000000)) +
    geom_point(col = "deepskyblue3") +
    labs(x = "Life Expectancy (years)", y = NULL) +
    scale_size_area("Population\n(Millions)", max_size = 8) +
    thm +
    theme(legend.position = "top")
```

This is sometimes called a _bubble chart_.

Repeated measures, such as values for 2002 and 2007, can be shown and
distinguished by color, shape, or both.

Using color:

```{r gapminder-dotplot-two-year, echo = FALSE}
```

```{r gapminder-dotplot-two-year, eval = FALSE, class.source = "fold-hide"}
## filter down to data for 2002 and 2007
## for the Americas
le2 <- filter(gapminder, year >= 2002, continent == "Americas")

## make a factor Year to get a discrete color
## palette, not a continuous one
le2 <- mutate(le2, Year = factor(year))

ggplot(le2, aes(y = reorder(country, lifeExp),
                x = lifeExp,
                color = Year)) +
    geom_point() +
    labs(x = "Life Expectancy (years)", y = NULL) +
    thm +
    theme(legend.position = "top")
```

* All countries show some improvement in life expectancy.
* The small improvement for Jamaica _pops out_ a bit.

Using shape for the `barley` data:

```{r}
b_tot <- group_by(barley, site, year) %>%
    summarize(yield = sum(yield), .groups = "drop")
ggplot(b_tot, aes(x = yield, y = site, shape = year)) +
    geom_point(color = "deepskyblue3", size = 3) +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    thm +
    theme(legend.position = "top")
```

When there are two repeated values per class it can help to connect
the two dots.

This visually emphasizes the relative sizes of differences.

The result is sometimes called a _dumbbell chart_.

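One way to draw such a chart from data in wide form (one column per time point) is a `geom_segment()` layer for the connecting bar plus one `geom_point()` layer per end point. This is only a sketch under assumptions: the data frame `db_toy`, its column names, and its values are hypothetical and chosen for illustration.

```{r}
library(ggplot2)

## hypothetical data in wide form: one row per class,
## one column for each of the two time points
db_toy <- data.frame(class = c("a", "b", "c"),
                     before = c(10, 14, 9),
                     after = c(13, 12, 15))

p_db <- ggplot(db_toy, aes(y = class)) +
    ## the connecting bar of the dumbbell
    geom_segment(aes(x = before, xend = after, yend = class),
                 linewidth = 2, color = "grey") +
    ## the two end points, distinguished by color
    geom_point(aes(x = before, color = "before"), size = 4) +
    geom_point(aes(x = after, color = "after"), size = 4) +
    labs(x = "Value", y = NULL, color = NULL)
p_db
```

With data in long form, grouping a `geom_line()` layer by class gives the same picture without reshaping.
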
For the barley yields data:

```{r, class.source = "fold-hide"}
ggplot(b_tot, aes(x = yield, y = site)) +
    geom_line(aes(group = site), linewidth = 2, color = "grey") +
    geom_point(aes(color = year), size = 4) +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    thm +
    theme(legend.position = "top")
```

For the Gapminder life expectancy data:

```{r, echo = FALSE, eval = FALSE}
library(ggalt)
library(scales)
pivot_wider(b_tot, names_from = "year",
            values_from = "yield", names_prefix = "X") %>%
    ggplot(aes(x = X1931, y = site, xend = X1932)) +
    geom_dumbbell(size = 3, color = "grey",
                  color_x = muted("red"),
                  color_xend = muted("blue"),
                  dot_guide = TRUE)
```

```{r, class.source = "fold-hide"}
ggplot(le2, aes(y = reorder(country, lifeExp),
                x = lifeExp,
                color = Year)) +
    geom_line(aes(group = country), linewidth = 1.5, color = "grey") +
    geom_point(size = 2.5) +
    labs(x = "Life Expectancy (years)", y = NULL) +
    thm +
    theme(legend.position = "top")
```


### Variations in Appearance

The dot plots introduced by W. S. Cleveland in his 1993 book
_Visualizing Data_ use only horizontal grid lines.
This also corresponds to the dot plots provided by base and lattice
graphics and to the dot plot obtained using the Wall Street Journal
theme from the `ggthemes` package:

```{r, fig.height = 5.5, class.source = "fold-hide"}
ggplot(b_tot_site, aes(x = yield, y = site)) +
    geom_point(size = 2) +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    ggthemes::theme_wsj()
```

A theme to closely match the style used by Cleveland can be defined as

```{r}
theme_dotplotx <- function() {
    theme(
        ## remove the vertical grid lines
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        ## explicitly set the horizontal lines
        ## (or they will disappear too)
        panel.grid.major.y =
            element_line(color = "black", linetype = 3),
        axis.text.y = element_text(size = rel(1.2)),
        ## use a white background
        panel.background =
            element_rect(fill = "white", color = NA),
        panel.border =
            element_rect(fill = NA, color = "grey20"),
        ## increase text size
        text = element_text(size = 16))
}
```

This produces

```{r, class.source = "fold-hide"}
ggplot(b_tot_site, aes(x = yield, y = site)) +
    geom_point(size = 2) +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    theme_dotplotx()
```


## Bar Charts

### Basics

A basic bar chart:

```{r, class.source = "fold-hide"}
p <- ggplot(le_am_2007) +
    geom_col(aes(y = lifeExp, x = reorder(country, lifeExp)),
             fill = "deepskyblue3") +
    labs(y = "Life Expectancy (years)", x = NULL) +
    scale_y_continuous(
        expand = expansion(mult = c(0, .1))) +
    thm
p
```

The labels are a mess.

* One option is to write labels at an angle, but this makes them hard
  to read.
* Flipping the plot is usually a better option.

To make labels readable we can flip the plot:

```{r, class.source = "fold-hide"}
## Redoing the plot with `x` and `y` aesthetics reversed did not
## work in the past but does work in current ggplot2 versions.
## But as we already have the previous plot we can just use coord_flip().
p + coord_flip()
```


### Some Notes

Bar charts seem to be used much more than dot plots in the popular
media, but they are less widely applicable.

Research on visual perception has shown that viewers of bar charts
subconsciously, or _pre-attentively_, focus on relative lengths of
bars.

This means that for a bar chart to be appropriate and work well:

* Ratio comparisons have to make sense for your data.
* The relative bar lengths have to accurately reflect that data ratio.

This in turn implies that bar charts should _always_ have a zero base
line.

You may need to intervene with your software's defaults to make this
happen.

Creating a bar chart with a non-zero base line is possible in `ggplot`
but not easy:

```{r, message = FALSE, class.source = "fold-hide"}
baseline <- 60
ticks <- c(0, 10, 20, 30)
ggplot(le_am_2007,
       aes(x = lifeExp - baseline,
           y = reorder(country, lifeExp))) +
    geom_col(fill = "deepskyblue3") +
    labs(x = "Life Expectancy (years)", y = NULL) +
    scale_x_continuous(
        breaks = ticks,
        labels = ticks + baseline,
        expand = expansion(mult = c(0, .1))) +
    thm +
    labs(title = "A Bad Bar Chart")
```

* The visual impression is that the value for Bolivia is five times
  higher than the value for Haiti.
* This is _not_ an accurate representation of the data.

Bar charts always push the viewer to ratio comparisons, whether they
are meaningful or not.

Using a non-zero baseline can therefore
[mislead the viewer](https://www.storytellingwithdata.com/blog/2012/09/bar-charts-must-have-zero-baseline).

Some news organizations seem particularly prone to taking advantage
of/falling prey to this issue.

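The size of the distortion from a non-zero baseline can be worked out with simple arithmetic. The sketch below uses hypothetical values (`ex_vals` and `ex_base` are made-up numbers for illustration, not values from the data) to compare the ratio of two values with the ratio of the bar lengths a viewer would see.

```{r}
## hypothetical values, chosen only for illustration
ex_vals <- c(low = 61, high = 66)
ex_base <- 60

## ratio of the actual values
true_ratio <- ex_vals[["high"]] / ex_vals[["low"]]

## ratio of the bar lengths when the bars start at the baseline
bar_ratio <- (ex_vals[["high"]] - ex_base) / (ex_vals[["low"]] - ex_base)

c(true = true_ratio, bar = bar_ratio)
```

With these numbers the values differ by less than ten percent, yet one bar is six times as long as the other.
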
```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("bushtaxbar.jpeg"))
```

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("obamabar.jpeg"))
```

A recent
[blog post](https://www.statschat.org.nz/2020/02/02/graphs-dont-matter/)
discusses a court case in New Zealand about a misleading bar chart
with a non-zero baseline.

Another recent
[blog post](https://www.storytellingwithdata.com/blog/2020/2/19/what-is-a-bar-chart)
may be helpful.


### Data With Both Positive and Negative Values

Bar charts can be used for data containing both positive and negative
values:

```{r, fig.width = 10, warning = FALSE, class.source = "fold-hide"}
ctsampC <- mutate(ctsamp, temp = round((temp - 32) / 1.8))
p1 <- ggplot(ctsampC, aes(y = city, x = temp)) +
    geom_point(color = "deepskyblue3") +
    geom_vline(xintercept = 0, lty = 2) +
    labs(x = "Temperature (degrees C)", y = NULL) +
    thm
p2 <- ggplot(ctsampC, aes(x = temp, y = city)) +
    geom_col(fill = "deepskyblue3") +
    labs(x = "Temperature (degrees C)", y = NULL) +
    thm
library(patchwork)
p1 + p2
```

This puts a strong emphasis on the base line.

This can be useful if the baseline is meaningful, such as

* [an average level](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/time-series/global)
* zero degrees Celsius

Zero degrees Fahrenheit is not meaningful:

```{r, fig.width = 10, warning = FALSE, class.source = "fold-hide"}
p1 <- ggplot(ctsamp, aes(y = city, x = temp)) +
    geom_point(color = "deepskyblue3") +
    geom_vline(xintercept = 0, lty = 2) +
    labs(x = "Temperature (degrees F)", y = NULL) +
    thm
p2 <- ggplot(ctsamp, aes(x = temp, y = city)) +
    geom_col(fill = "deepskyblue3") +
    labs(x = "Temperature (degrees F)", y = NULL) +
    thm
p1 + p2
```


### Grouped Bar Charts

A _grouped bar chart_ can be used to show two measurements for each
category, e.g. values at two times.

```{r, class.source = "fold-hide"}
ggplot(le2) +
    geom_col(aes(x = lifeExp,
                 y = reorder(country, lifeExp),
                 fill = Year),
             position = "dodge") +
    labs(x = "Life Expectancy (years)", y = NULL) +
    scale_x_continuous(
        expand = expansion(mult = c(0, .1))) +
    thm
```

This visualization would be more effective with

* fewer countries and more years;
* more variation in ratios within and between categories.

In this case the dot plot is a much better choice.

For the `barley` data looking at total yields for each of the sites
and the two years:

```{r, class.source = "fold-hide"}
ggplot(b_tot) +
    geom_col(aes(x = yield, y = site, fill = year),
             position = "dodge") +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    scale_x_continuous(
        expand = expansion(mult = c(0, .1))) +
    thm
```


### Stacked Bar Charts

A _stacked bar chart_ is appropriate when adding the values within a
category to form a total makes sense.

For the `barley` data:

```{r, fig.height = 4, class.source = "fold-hide"}
ggplot(b_tot) +
    geom_col(aes(x = yield, y = site, fill = year),
             position = "stack") +
    labs(x = "Total Yield (bushels/acre)", y = NULL) +
    scale_x_continuous(
        expand = expansion(mult = c(0, .1))) +
    thm
```

* The combined bars show the totals.
* The bar segments show the contribution of each year within the sites.
* Comparing 1931 yields across sites is easy; comparing 1932 values is
  harder.

A stacked bar chart would make no sense for the two-year life
expectancy data.


### Category Order

Ordering of categories can change the visual effectiveness of bar charts.

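One reason order matters in a stacked bar chart is that only the group drawn first starts at zero in every bar; each later segment starts wherever the earlier ones end. The following sketch, using hypothetical values (`stack_vals` is made up for illustration), computes the segment start positions for the two possible stacking orders of two groups.

```{r}
## hypothetical values for two groups (A, B) in three categories
stack_vals <- rbind(A = c(x = 3, y = 5, z = 2),
                    B = c(x = 4, y = 1, z = 6))

## start position of each segment when A is drawn first, then B
start_AB <- rbind(A = c(x = 0, y = 0, z = 0), B = stack_vals["A", ])

## start position of each segment when B is drawn first, then A
start_BA <- rbind(B = c(x = 0, y = 0, z = 0), A = stack_vals["B", ])

start_AB
start_BA
```

Whichever group is stacked first gets the common baseline and is therefore the easiest to compare across categories.
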

#### Supermarket Sales

A small data set on multi-national supermarket chain sales:

```{r, class.source = "fold-hide"}
chains <- read.csv(textConnection("Chain, Total, Foreign
Walmart, 22, 5
Costco, 4.5, 2
Tesco, 3, 0.5
Carrefour, 2.5, 2"))
chains <- mutate(chains, Home = Total - Foreign)
tbl <- select(chains, Chain, Home, Foreign, Total)
kbl <- knitr::kable(tbl, format = "html")
kableExtra::kable_styling(kbl, full_width = FALSE)
```

The default alphabetical ordering of the chains in a bar chart is not
ideal:

```{r, class.source = "fold-hide"}
library(tidyr)
chains <- mutate(chains, Total = NULL)
lchains <- pivot_longer(chains, -Chain,
                        names_to = "Type",
                        values_to = "Sales")
p <- ggplot(lchains, aes(x = Sales, y = Chain, fill = Type)) +
    geom_col() +
    ylab(NULL) +
    scale_x_continuous(
        expand = expansion(mult = c(0, .1))) +
    scale_fill_brewer(
        palette = "Paired",
        guide = guide_legend(title = NULL, reverse = TRUE)) +
    thm +
    theme(legend.position = "top")
p
```

Reordering by total sales produces a better result:

```{r, class.source = "fold-hide"}
lchain_s <- mutate(lchains,
                   Chain = reorder(Chain, Sales, FUN = sum))
p %+% lchain_s
```

With this ordering, comparing the first category of sales, `Home`,
among chains is easier than comparing the `Foreign` sales as the
`Home` values have a common baseline.

If the goal of the visualization is to emphasize the foreign sales
then reversing the order would be better.

```{r, class.source = "fold-hide"}
lchain_st <- mutate(lchain_s,
                    Type = factor(Type, levels = c("Home", "Foreign")))
p %+% lchain_st +
    scale_fill_brewer(palette = "Paired",
                      guide = guide_legend(title = NULL, reverse = TRUE),
                      direction = -1)
```

This also makes it easy to see that Walmart's international sales
alone exceed the total sales of each of the other chains.

A _filled bar chart_ can be used to help compare the proportions of
total sales from foreign stores.

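In `ggplot2`, `position = "fill"` rescales each bar so that its segments sum to one. As a rough check, the proportions can also be computed directly. This sketch re-enters the figures from the table above in a separate data frame; the names `chain_sales` and `PropForeign` are introduced here for illustration and are not part of the code in these notes.

```{r}
## sales figures re-entered from the table above
chain_sales <- data.frame(
    Chain = c("Walmart", "Costco", "Tesco", "Carrefour"),
    Total = c(22, 4.5, 3, 2.5),
    Foreign = c(5, 2, 0.5, 2))

## proportion of each chain's total sales that is foreign
chain_sales$PropForeign <- with(chain_sales, Foreign / Total)

## chains sorted by increasing foreign proportion
chain_sales[order(chain_sales$PropForeign), c("Chain", "PropForeign")]
```

Sorting by this proportion gives an ordering of the chains that differs from the ordering by total sales.
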
```{r, class.source = "fold-hide"}
levs <- levels(with(chains, reorder(Chain, Foreign / (Foreign + Home))))
mutate(lchain_st, Chain = factor(Chain, levs)) %>%
    ggplot(aes(x = Sales, y = Chain, fill = Type)) +
    geom_col(position = "fill") +
    labs(x = "Proportion of Sales", y = NULL) +
    scale_x_continuous(
        expand = expansion(mult = c(0, .1))) +
    scale_fill_brewer(palette = "Paired",
                      guide = guide_legend(title = NULL, reverse = TRUE),
                      direction = -1) +
    thm +
    theme(legend.position = "top")
```


#### 2016 Election Results

Another example for filled bar charts is provided by 2016 presidential
election results by state.

This chart is produced with default settings.

The values for Clinton are easy to compare, as they are on a common
baseline.

The values for Other are also on a common baseline, but the numbers
for Trump are not.

```{r pres2016-dflt, e = FALSE, fig.height = 6.5, class.source = "fold-hide"}
p <- ggplot(geofacet::election,
            aes(x = votes, y = state, fill = candidate)) +
    geom_col(position = "fill") +
    scale_x_continuous(expand = c(0, 0)) +
    xlab(NULL) +
    geom_vline(xintercept = 0.5, linetype = 2)
p
```

Reordering the categories puts both Clinton and Trump on common
baselines and also matches the common color use:

```{r, fig.height = 6.5, class.source = "fold-hide"}
## move 'Other' to middle, align Clinton/Trump with colors
elect <- mutate(geofacet::election,
                candidate = factor(candidate,
                                   c("Trump", "Other", "Clinton")))
p %+% elect
```

Different orderings of the states can be used to achieve different goals.

Ordering by Clinton's percentage makes it easier to see where she had
an outright majority.

```{r, fig.height = 6.5, class.source = "fold-hide"}
elect_wide <- pivot_wider(select(elect, -votes),
                          names_from = candidate,
                          values_from = pct)

## ordered by Clinton pct
slevs <- arrange(elect_wide, Clinton) %>% pull(state)
p %+% mutate(elect, state = factor(state, slevs))
```

Another possibility is to order by winning margin between the two
major candidates.

```{r, fig.height = 6.5, class.source = "fold-hide"}
## ordered by winning margin
slevs <- arrange(elect_wide, Clinton - Trump) %>% pull(state)
p %+% mutate(elect, state = factor(state, slevs))
```


## Faceting

Faceting can be used with dot plots and bar charts as well:

```{r, class.source = "fold-hide"}
ggplot(barley) +
    geom_point(aes(x = yield, y = variety, color = year)) +
    facet_wrap(~ site) +
    theme(legend.position = "top")
```

These plots show lower yields for 1932 than for 1931 for all sites
except Morris.

* Cleveland suggests this may indicate a data entry error for Morris.
* A [paper from 2014](https://blog.revolutionanalytics.com/2014/07/theres-no-mistake-in-the-barley-data.html)
  suggests there may be no error.


## Heat Maps

Heat maps encode a numeric variable as the color of a tile placed at a
particular position in a grid.

Heat maps are also called _matrix charts_ or _image plots_.

Heat maps can be effective in cases where a line plot contains too much
over-plotting, such as

```{r, class.source = "fold-hide"}
gma <- filter(gapminder, continent == "Americas")
ggplot(gma) +
    geom_line(aes(x = year, y = lifeExp, color = country)) +
    theme(legend.position = "top", legend.title = element_blank())
```

A heat map of these data:

```{r, class.source = "fold-hide"}
ggplot(gma) +
    geom_tile(aes(x = year,
                  y = reorder(country, lifeExp),
                  fill = lifeExp)) +
    scale_x_continuous(expand = c(0, 0)) +
    scale_fill_viridis_c() +
    labs(y = NULL, fill = "Life Expectancy (years)") +
    theme(legend.position = "top")
```

Again it is useful to consider reordering of categories when there is
no natural order.

This heat map orders the countries by average life expectancy.


## Bubble Charts

A form of chart often seen in the popular press is the _bubble chart_.

The bubble chart uses areas of circles to represent magnitudes.

Position in the plane is usually not fully used or not used at all for
mapping attributes.

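Encoding a magnitude as area means the radius has to grow with the square root of the value, not with the value itself. A minimal sketch with hypothetical magnitudes (`bub_value` is made up for illustration) compares the two choices:

```{r}
## hypothetical magnitudes
bub_value <- c(1, 4, 9)

## radius that makes the circle area proportional to the value
r_area <- sqrt(bub_value / pi)

## radius proportional to the value itself
r_linear <- bub_value

## resulting areas relative to the smallest circle
area_ok <- (pi * r_area^2) / (pi * r_area[1]^2)
area_linear <- (pi * r_linear^2) / (pi * r_linear[1]^2)

rbind(value = bub_value, area_ok = area_ok, area_linear = area_linear)
```

When the radius is proportional to the value, the areas grow with the square of the value, which exaggerates the larger magnitudes.
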
In on-line publications further information on each of the bubbles is
often provided through interactions, such as a mouse-over tooltip.

Other chart forms are almost always better for encoding the magnitude
information.

It is also easy to get the encoding wrong ([Corrected version](`r IMG("shrinking-banks-correct.jpg")`)):

```{r, echo = FALSE, out.width = "60%"}
knitr::include_graphics(IMG("shrinking-banks-orig.jpg"))
```

`ggplot` bubble charts for the average `yield` values from the
`barley` data and for the 2007 population sizes for the gapminder
data:

```{r, echo = FALSE, message = FALSE, fig.width = 10}
## derived from
## http://stackoverflow.com/questions/38959093/packed-bubble-pie-charts-in-r
library(packcircles)
circles <- function(rad2) {
    stopifnot(all(rad2 > 0))
    rad <- sqrt(rad2)
    n <- length(rad)
    lim <- n * max(rad)
    lims <- c(-lim, lim)
    old.seed <- { runif(1); .Random.seed } # nolint
    set.seed(12345)
    v <- circleLayout(cbind(runif(n), runif(n), rad), lims, lims)
    ## restore the global random number state; a plain local
    ## assignment to .Random.seed would have no effect
    assign(".Random.seed", old.seed, envir = globalenv())
    layoutDF <- as.data.frame(v$layout)
    names(layoutDF) <- c("x", "y", "radius")
    list(layout = layoutDF, data = circlePlotData(v$layout))
}

bubble_theme <- function() {
    list(coord_equal(),
         theme_bw(),
         theme(axis.text = element_blank(),
               axis.ticks = element_blank(),
               axis.title = element_blank()))
}

## barley yields
absy <- group_by(barley, site, year) %>%
    summarise(avg_yield = mean(yield))
v <- circles(absy$avg_yield)
vv <- v$data
vv$year <- absy$year[vv$id]
p1 <- ggplot(vv) +
    geom_polygon(aes(x, y, group = id, fill = year)) +
    geom_text(aes(x = x, y = y, label = absy$site),
              data = v$layout, size = 1.5) +
    bubble_theme()

## gapminder populations for 2007:
library(gapminder)
library(patchwork)
gm2007 <- filter(gapminder, year == 2007)
v <- circles(gm2007$pop)
vv <- v$data
vv$continent <- gm2007$continent[vv$id]
p2 <- ggplot(vv) +
    geom_polygon(aes(x, y, group = id, fill = continent)) +
    coord_equal() +
    bubble_theme()
##grid.arrange(p1, p2, nrow = 1)
p1 + p2
```


## Waterfall Charts

Also called _cascade charts_.

These usually show the decomposition of a total into positive and
negative contributions from various sources.

New Zealand net immigration by region:

```{r, echo = FALSE, out.width = "70%"}
knitr::include_graphics(IMG("6.7.Waterfall.png"))
```

Internet subscribers by month
([blog post with `ggplot` code](https://web.archive.org/web/20190622211938/http://anhhoangduc.com/blog/create-waterfall-chart-with-ggplot2/)).

```{r, echo = FALSE, out.width = "65%"}
knitr::include_graphics(IMG("waterfall.png"))
```


## Polar Area Charts

Florence Nightingale's famous chart showing the effect of the
sanitation improvements in March/April 1855:

```{r, echo = FALSE, out.width = "95%"}
knitr::include_graphics(IMG("Coxcombs.jpg"))
```

This chart has been called a _polar area diagram_ or a _Coxcomb Chart_.

It can be viewed as

* a bar chart with the bars positioned in front of each other,
* transformed to polar coordinates,
* with magnitude mapped to area (i.e. square root of magnitude mapped
  to radius).

The data are available in the `HistData` package as data frame
`Nightingale`.

Some rearrangement of the data is needed to get it into a form
amenable for plotting.

```{r, class.source = "fold-hide"}
library(forcats)
data(Nightingale, package = "HistData")
Night <- mutate(Nightingale,
                Period = ifelse(Date < as.Date("1855-04-01"),
                                "Before", "After"),
                Period = factor(Period, c("Before", "After")),
                Month = factor(Month, month.abb)) %>%
    select(Date, Month, Year, Period, Disease, Wounds, Other) %>%
    pivot_longer(5 : 7, names_to = "Cause", values_to = "Deaths")

## Rearrange the Month levels to start with April.
Night3 <- mutate(Night, Month = fct_shift(Month, 3))
```

A standard stacked bar chart of the data:

```{r, fig.width = 8, class.source = "fold-hide"}
ggplot(Night3, aes(x = Month, y = Deaths, fill = Cause)) +
    geom_col() +
    facet_wrap(~ Period) +
    theme(legend.position = "top")
```

A grouped bar chart of the data:

```{r, fig.width = 8, class.source = "fold-hide"}
ggplot(Night3, aes(x = Month, y = Deaths, fill = Cause)) +
    geom_col(position = "dodge") +
    facet_wrap(~ Period) +
    theme(legend.position = "top")
```

Using `position = "identity"` the bars will be placed in front of each
other rather than side by side or stacked.

```{r, fig.width = 8, class.source = "fold-hide"}
p <- ggplot(Night3, aes(x = Month, y = Deaths, fill = Cause)) +
    geom_col(position = "identity", ## bars in front of each other
             width = 1,             ## remove space between bars
             color = "black",       ## bar border color
             linewidth = 0.1) +     ## bar border thickness
    facet_wrap(~ Period) +
    theme(legend.position = "top")
p
```

Problem: Larger bars may be placed in front of smaller ones.

Reordering the `Deaths` values from largest to smallest ensures that
larger bars do not cover smaller ones.

```{r, fig.width = 8, class.source = "fold-hide"}
p <- ggplot(arrange(Night3, desc(Deaths)),
            aes(x = Month, y = Deaths, fill = Cause)) +
    geom_col(position = "identity", ## bars in front of each other
             width = 1,             ## remove space between bars
             color = "black",       ## bar border color
             linewidth = 0.1) +     ## bar border thickness
    facet_wrap(~ Period) +
    theme(legend.position = "top")
p
```

Changing to polar coordinates and mapping square roots of `Deaths` to
radius produces the basic polar area chart:

```{r, fig.width = 8, class.source = "fold-hide"}
p %+%
    (arrange(Night3, desc(Deaths)) %>% mutate(Deaths = sqrt(Deaths))) +
    coord_polar()
```

Some adjustments bring the result closer to the original:

```{r, fig.width = 8}
p %+%
    (arrange(Night, desc(Deaths)) %>%
     mutate(Month = fct_shift(Month, 6),
            Period = factor(Period, c("After", "Before")),
            Deaths = sqrt(Deaths))) +
    coord_polar() +
    scale_fill_manual(
        values = c(Wounds = "pink",
                   Other = "darkgray",
                   Disease = "lightblue")) +
    theme(axis.title = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          panel.border = element_blank())
```


## Base and Lattice Graphics

Base graphics provides the `dotchart()` function for creating dot plots:

```{r, class.source = "fold-hide"}
with(le_am_2007,
     dotchart(sort(lifeExp), labels = country[order(lifeExp)]))
```

The `lattice` package provides `dotplot()`:

```{r, class.source = "fold-hide"}
library(lattice)
dotplot(reorder(country, lifeExp) ~ lifeExp, data = le_am_2007)
```

Most lattice plots support a `groups` argument that is usually mapped
to color:

```{r, class.source = "fold-hide"}
dotplot(reorder(country, lifeExp) ~ lifeExp,
        groups = year,
        data = le2,
        auto.key = TRUE)
```

Creating a basic bar chart in `lattice` uses the `barchart()` function.

```{r, class.source = "fold-hide"}
## By default lattice creates a bad bar chart with a
## non-zero base line for these data.
## Fix that by
## specifying xlim = c(0, 85).
barchart(reorder(country, lifeExp) ~ lifeExp,
         data = le_am_2007,
         xlim = c(0, 85))
```

Base graphics also provides a bar chart with the `barplot()` function.

```{r, class.source = "fold-hide"}
par(mar = c(5, 5, 4, 2) + 0.1)
with(le_am_2007,
     barplot(sort(lifeExp),
             horiz = TRUE,
             names.arg = country[order(lifeExp)],
             las = 1,
             cex.names = 0.7,
             cex.axis = 0.7))
```


## Reading

Chapter [_Visualizing amounts_](https://clauswilke.com/dataviz/visualizing-amounts.html)
in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/).


## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](`r WLNK("tutorials/amounts.Rmd")`).

You can run the tutorial with

```{r, eval = FALSE}
STAT4580::runTutorial("amounts")
```

You can install the current version of the `STAT4580` package with

```{r, eval = FALSE}
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.


## Exercises

1.
    A plot similar to this was featured in a CNN news story several
    years ago:

    ```{r, echo = FALSE}
    library(ggplot2)
    levs <- c("Democrats", "Republicans", "Independents")
    d <- data.frame(party = factor(levs, levs), pct = c(62, 54, 54))
    ggplot(d, aes(x = party, y = pct - 50)) +
        geom_col(width = 0.5) +
        scale_y_continuous(labels = seq(50, 64, by = 2),
                           breaks = seq(0, 14, by = 2),
                           expand = expansion(c(0, 0.18))) +
        labs(x = "Political Party", y = NULL,
             title = "Percent Who Agreed With Court") +
        theme(text = element_text(size = 20, face = "bold"),
              panel.grid.minor.x = element_blank(),
              panel.grid.major.x = element_blank(),
              panel.grid.minor.y = element_blank(),
              panel.grid.major.y = element_line(color = "black"),
              panel.background = element_rect(fill = "grey", color = NA),
              plot.background = element_rect(fill = "white",
                                             color = "black",
                                             linewidth = 2),
              plot.margin = margin(20, 20, 20, 20)) +
        coord_fixed(0.1)
    ```

    Which of the following is approximately correct:

    - About the same number of Democrats as Republicans agreed with
      the court.
    - About 15% more Democrats than Republicans agreed with the court.
    - About three times as many Democrats as Republicans agreed with
      the court.
    - About two times as many Democrats as Republicans agreed with
      the court.

2. Consider the stacked bar chart produced by the following code:

    ```{r, eval = FALSE, class.source = "fold-show"}
    library(tidyverse)
    mpg2 <- mutate(mpg,
                   class = fct_rev(fct_infreq(class)),
                   cyl = factor(cyl))
    p <- ggplot(mpg2, aes(y = class, fill = cyl)) +
        geom_bar()
    ```

    Which of these modifications makes it easiest to compare the
    count of 4-cylinder models within the different classes?

    a. `p %+% mutate(mpg2, cyl = factor(cyl, c(4, 5, 6, 8)))`
    b. `p %+% mutate(mpg2, cyl = factor(cyl, c(8, 6, 5, 4)))`
    c. `p %+% mutate(mpg2, cyl = factor(cyl, c(5, 6, 4, 8)))`
    d. `p %+% mutate(mpg2, cyl = factor(cyl, c(4, 6, 8, 5)))`

3.
    The bar chart produced by the following code has `x` axis labels
    that could be improved:

    ```{r, class.source = "fold-show"}
    library(gapminder)
    library(dplyr)
    library(ggplot2)
    library(scales)
    p <- filter(gapminder, year == 2007) %>%
        group_by(continent) %>%
        summarize(avgGdpPercap = mean(gdpPercap)) %>%
        ggplot(aes(x = avgGdpPercap, y = continent)) +
        geom_col() +
        labs(x = "Average GDP Per Capita", y = NULL) +
        theme_minimal() +
        theme(text = element_text(size = 16))
    ```

    There are a number of different options. Which of the following
    does **not** provide improved `x` axis labels?

    a. `p + scale_x_continuous(labels = label_comma())`
    b. `p + scale_x_continuous(labels = label_dollar())`
    c. `p + scale_x_continuous(labels = unit_format(scale = 1/1000, unit = "K", prefix = "$"))`
    d. `p + scale_x_continuous(labels = c("$10,000", "$20,000", "$30,000"))`

4. A stacked bar chart is appropriate if the combined bar heights of
    the stacked bars have a reasonable interpretation. Consider the
    following two plots:

    ```{r, fig.height = 4, fig.width = 8}
    library(gapminder)
    library(dplyr)
    library(ggplot2)
    library(patchwork)
    p1 <- filter(gapminder, year >= 2000) %>%
        group_by(continent, year) %>%
        summarize(avgLifeExp = mean(lifeExp), .groups = "drop") %>%
        ggplot(aes(x = avgLifeExp, y = continent, fill = factor(year))) +
        geom_col() +
        theme_minimal() +
        theme(text = element_text(size = 12)) +
        scale_x_continuous(expand = expansion(mult = c(0, .1))) +
        labs(x = "Average Life Expectancy", y = NULL, fill = "Year",
             title = "Average Life Expectancy\nby Continent for Two Years",
             tag = "P1:")
    p2 <- count(mpg, class, cyl) %>%
        ggplot(aes(x = n, y = class, fill = factor(cyl))) +
        geom_col() +
        theme_minimal() +
        theme(text = element_text(size = 12)) +
        scale_x_continuous(expand = expansion(mult = c(0, .1))) +
        labs(x = "Number of Models", y = NULL, fill = "Cylinders",
             title = "Number of Car Models\nby Class and Cylinder Count",
             tag = "P2:")
    p1 + p2
    ```

    Which of the following statements is true:

    a.
       P1 is an appropriate use of a stacked bar chart but P2 is not.
    b. P2 is an appropriate use of a stacked bar chart but P1 is not.
    c. Neither P1 nor P2 is an appropriate use of a stacked bar chart.
    d. Both P1 and P2 are appropriate uses of a stacked bar chart.