--- title: "Plots for Multiple Time Series" output: html_document: toc: yes code_folding: show code_download: true --- ```{r setup, include = FALSE} source(here::here("setup.R")) knitr::opts_chunk$set(collapse = TRUE, message = FALSE, fig.height = 5, fig.width = 6, fig.align = "center") options(htmltools.dir.version = FALSE) set.seed(12345) library(ggplot2) library(lattice) library(tidyverse) library(gridExtra) library(forecast) theme_set(theme_minimal() + theme(text = element_text(size = 16), panel.border = element_rect(color = "grey30", fill = NA))) ``` ## Multiple Time Series Some questions for multiple time series: - How do the series behave individually? - How are the series related? - Does one series lead or lag the others? Some options for plotting multiple series: - separate panels in a trellis display; - multiple series in a single plot - this will require standardizing if the scales vary substantially; - a multivatiate plot with time represented by connecting line segments or animation. Another option for two time series: - use a separate `y` axis for each. This is usually a bad idea. Some data structures for multiple time series: * `mts` objects created by `ts()`. - All series have to be numeric. - `autoplot` methods may be sufficient. * Regular data frames. - The series will have to be plotted in separate layers, - or the data frames will need converting into long form. ## Crimean War Casualties The `HistData` package contains the data used by Florence Nightingale to call attention to terrible sanitary conditions for the Britich army in Crimea. ```{r} data(Nightingale, package = "HistData") head(select(Nightingale, -Date)) ``` A plot of the `Disease`, `Wounds`, and `Other`: ```{r night-ts-auto, echo = FALSE, fig.width = 7.5} nDWO <- select(Nightingale, Disease, Wounds, Other) |> ts(start = c(1854, 4), frequency = 12) autoplot(nDWO) ``` ```{r night-ts-auto, eval = FALSE} ``` Using the rate variables produces a very similar pattern: ```{r night-rate-ts-auto, echo = FALSE, fig.width = 7.5} nDWO.rate <- select(Nightingale, ends_with("rate")) |> ts(start = c(1854, 4), frequency = 12) autoplot(nDWO.rate) ``` ```{r night-rate-ts-auto, eval = FALSE} ``` The army size does not vary enough over the time where the casualties are high to affect the pattern much: ```{r night-army-ts-auto, echo = FALSE} nArmy <- select(Nightingale, Army) |> ts(start = c(1854, 4), frequency = 12) autoplot(nArmy) ``` ```{r night-army-ts-auto, eval = FALSE} ``` Instead of showing three series in one plot you can also use faceting. ```{r night-auto-facet, echo = FALSE} autoplot(nDWO, facets = TRUE) ``` ```{r night-auto-facet, eval = FALSE} ``` To create the plot of `Disease`, `Wounds`, and `Other` using basic `ggplot` it is best to first convert the data into long form. You can create a long version of the data, using either `pivot_longer` or `broom::tidy`. Using `broom:tidy` on the `mts` object: ```{r night-ggplot-broom, echo = FALSE, fig.width = 7.5} broom::tidy(nDWO) |> rename(Date = index, Deaths = value, Cause = series) |> ggplot(aes(x = Date, y = Deaths, color = Cause)) + geom_line() ``` ```{r night-ggplot-broom, eval = FALSE} ``` Using `pivot_longer` on the original data frame: ```{r night-ggplot-pivot-longer, echo = FALSE, fig.width = 7.5} select(Nightingale, Date, Disease, Wounds, Other) |> pivot_longer(-Date, names_to = "Cause", values_to = "Deaths") |> ggplot(aes(x = Date, y = Deaths, color = Cause)) + geom_line() ``` ```{r night-ggplot-pivot-longer, eval = FALSE} ``` Using faceting instead of a single plot: ```{r night-ggplot-facet, echo = FALSE} select(Nightingale, Date, Disease, Wounds, Other) |> pivot_longer(-Date, names_to = "Cause", values_to = "Deaths") |> ggplot(aes(x = Date, y = Deaths)) + geom_line() + facet_wrap(~ Cause, ncol = 1) ``` ```{r night-ggplot-facet, eval = FALSE} ``` To focus on the shape of the curves it is often useful to allow the vertical axis ranges to adjust to the data: ```{r night-ggplot-facet-free-y, echo = FALSE} select(Nightingale, Date, Disease, Wounds, Other) |> pivot_longer(-Date, names_to = "Cause", values_to = "Deaths") |> ggplot(aes(x = Date, y = Deaths)) + geom_line() + facet_wrap(~ Cause, ncol = 1, scales = "free_y") ``` ```{r night-ggplot-facet-free-y, eval = FALSE} ``` This is particularly useful when the ranges are very different, for example if the `Army` series is included: ```{r night-ggplot-facet-free-y-army, echo = FALSE} select(Nightingale, Army, Date, Disease, Wounds, Other) |> pivot_longer(-Date, names_to = "series", values_to = "Deaths") |> ggplot(aes(Date, Deaths)) + geom_line() + facet_wrap(~ series, ncol = 1, scales = "free_y") ``` ```{r night-ggplot-facet-free-y-army, eval = FALSE} ``` ## Some Notes Using `autoplot` is often a quick way to get useful plots. Using basic `ggplot` graphics is a bit more work but provides more flexibility. ## Many Time Series Annual flow for dams in Italy: ```{r italy-dams-single, echo = FALSE, fig.width = 8} data(hydroSIMN, package = "nsRFA") annualflows <- mutate(annualflows, cod = factor(cod)) ggplot(annualflows, aes(x = anno, y = dato, group = cod, color = factor(cod))) + geom_line() ``` Clearly showing all series in a single plot is not helpful! ```{r italy-dams-single, eval = FALSE} ``` With small multiples some patterns can be seen: ```{r italy-dams-facet, echo = FALSE, fig.width = 9, fig.height = 9} ggplot(annualflows, aes(x = anno, y = dato, group = cod)) + geom_line() + facet_wrap(~ cod) + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` ```{r italy-dams-facet, eval = FALSE} ``` Small multiples can also be arranged in ways that convey spatial information: ```{r unemp-geofacet, echo = FALSE, fig.width = 9, fig.height = 9} library(geofacet) ggplot(state_unemp, aes(year, rate)) + geom_line() + facet_geo(~ state, grid = "us_state_grid2", label = "name") + scale_x_continuous(labels = function(x) paste0("'", substr(x, 3, 4))) + labs(title = "Seasonally Adjusted US Unemployment Rate 2000-2016", caption = "Data Source: bls.gov", x = "Year", y = "Unemployment Rate (%)") + theme(strip.text.x = element_text(size = 6), axis.text.y = element_text(size = 6), axis.text.x = element_blank()) ``` ```{r unemp-geofacet, eval = FALSE} ``` ## The Fall and Rise of Income Inequality A [2015 post](https://www.npr.org/sections/money/2015/02/11/384988128/the-fall-and-rise-of-u-s-inequality-in-2-graphs) on the NPR Planet Money site uses two graphs to show transitions in income inequality. The page includes a link to the [data source](http://topincomes.parisschoolofeconomics.eu/). This has been unavailable for some time. I tried to obtain the same data but may not have done it quite right. The closest source I could find is . The data is available locally: ```{r} income <- read.table("https://www.stat.uiowa.edu/~luke/data/nprincome.dat") ``` Plots over time: ```{r income-single, echo = FALSE, fig.width = 7.5} select(income, -year) |> ts(start = 1913) |> autoplot() ``` ```{r income-single, eval = FALSE} ``` The ranges are too different for this to be effective. One alternative: standardize the series to one, or 100%, at some point. Here the initial value in 1913 is used: ```{r income-single-rel, echo = FALSE, fig.width = 7.5} select(income, -year) |> transmute(U01rel = U01 / U01[1], L90rel = L90 / L90[1]) |> ts(start = 1913) |> autoplot() ``` ```{r income-single-rel, eval = FALSE} ``` Faceted plots, with free $y$ axes, are another option: ```{r income-facet, echo = FALSE} select(income, -year) |> ts(start = 1913) |> autoplot(facets = TRUE) ``` ```{r income-facet, eval = FALSE} ``` With two series another option is a _connected scatter plot_. ```{r income-connected-1, echo = FALSE} library(ggrepel) ggplot(income, aes(U01, L90)) + geom_path(linewidth = 2) + labs(x = "Average income for the top 1%", y = "Average income for the bottom 90%") ``` ```{r income-connected-1, eval = FALSE} ``` This simple form does not show the direction or scale of time. Color or annotation can be used to show direction and scale of time: ```{r income-connected-2, echo = FALSE, fig.width = 7.5, warning = FALSE} library(ggrepel) ggplot(income, aes(U01, L90, color = year)) + geom_path(size = 2) + geom_text_repel( aes(label = ifelse(year %% 5 == 0 | year == max(year), year, "")), color = "grey50") + labs(x = "Average income for the top 1%", y = "Average income for the bottom 90%") ``` ```{r income-connected-2, eval = FALSE} ``` Animated is another option: ```{r income-gganim, echo = FALSE, cache = TRUE, message = FALSE} p <- ggplot(income, aes(x = U01, y = L90, color = year)) + geom_path(size = 2, color = "lightgrey", data = mutate(income, year = NULL)) + geom_path(size = 2) + labs(title = "Year: {frame_along}", x = "Average income for the top 1%", y = "Average income for the bottom 90%") library(gganimate) animate(p + transition_reveal(year), fps = 5, end_pause = 10) ``` ```{r income-gganim, eval = FALSE} ``` ## Stack Loss Data Operational data obtained from 21 days of operation of a plant for the oxidation of ammonia (NH3) to nitric acid (HNO3). The dependent variable, `stack.loss` is 10 times the percentage of the ingoing ammonia to the plant that escapes from the absorption column unabsorbed. ```{r stackloss-splom, echo = FALSE} library(lattice) splom(stackloss) ``` ```{r stackloss-splom, eval = FALSE} ``` The data are collected over time: ```{r stackloss-splom-line, echo = FALSE, fig.height = 7} library(dplyr) sl <- mutate(stackloss, intex = seq_len(nrow(stackloss))) cpal <- colorRampPalette(c("#132B43", "#56B1F7")) cols <- cpal(nrow(stackloss) - 1) pfun <- function(x, y, ...) { n <- length(x) lsegments(x[-n], y[-n], x[-1], y[-1], col = cols, lwd = 3) } splom(stackloss, panel = pfun) ``` ```{r stackloss-splom-line, eval = FALSE} ``` Randomly reordering the indices can help calibrate whether there is a time effect: ```{r stackloss-splom-line-rnd, echo = FALSE, fig.height = 7} splom(sample_n(stackloss, nrow(stackloss)), panel = pfun) ``` ```{r stackloss-splom-line-rnd, eval = FALSE} ``` This is sometimes called _graphical inference_. > Hadley Wickham, Dianne Cook, Heike Hofmann, and Andreas Buja, > "Graphical Inference for Infovis," _IEEE Transactions on Visualization > and Computer Graphics_, Vol. 16, No. 6, November/December 2010. Picking one plot and embedding the true order among 19 random orderings: ```{r stack-loss-embed, echo = FALSE} N <- 20 k <- sample(N, 1) ord <- seq_len(nrow(stackloss)) ss <- function(i) { if (i == k) sl <- stackloss else sl <- sample_n(stackloss, nrow(stackloss)) mutate(sl, order = ord, sample = i) } d <- lapply(1 : N, ss) d <- do.call(rbind, d) library(ggplot2) ggplot(d, aes(x = Acid.Conc., y = stack.loss, color = order)) + geom_path(size = 0.8) + facet_wrap(~ sample) + guides(color = "none") ``` True order: plot `r k`. The code: ```{r stack-loss-embed, eval = FALSE} ``` ## Lynx-Hare Data Trapping data on Canadian lynx and snowshoe hare: ```{r lynxhare-facets, echo = FALSE} lynxhare <- read.table("https://www.stat.uiowa.edu/~luke/data/LynxHare.txt") names(lynxhare) <- c("year", "hare", "lynx") lhts <- ts(select(lynxhare, -year), start = 1845) autoplot(lhts, facets = TRUE) ``` Cycle patterns are similar. ```{r lynxhare-facets, eval = FALSE} ``` Lynx cycles often lag behind: ```{r lynxhare-single, echo = FALSE} autoplot(lhts) ``` ```{r lynxhare-single, eval = FALSE} ``` ## Things to Look For Changes in data definition. * Administrative definitions, such as employment/unemployment can change over time. * Administrative entities, such as countries and states, can change over time. * Instruments can change (satelite imagery replacing surface measurements). Outliers need not be globally outlying: departures from local patterns may also be important. Level shifts and changes in volatility. Shocks affecting level or volatility. ## Australian Airline Data [Weekly passenger loads](https://otexts.com/fpp2/graphics.html), in thousands, on Ansett Airlines between Australia’s two largest cities (available as `melsyd` in package `fpp`): ```{r australian-airlines, echo = FALSE, message = FALSE} data(melsyd, package = "fpp") autoplot(melsyd, facets = TRUE) ``` Some features: * Strike in 1989. * Increase in passenger load starting in second half of 1991. * Dip in economy, rise in business in 1992 corresponding to an experiment with fewer economy/more business seats. ```{r australian-airlines, eval = FALSE} ``` ## Weekly SP 500 Closing Returns Weekly closing returns of the SP 500 from 2003 to September, 2012: ```{r SP500, echo = FALSE, message = FALSE} library(xts) library(astsa) autoplot(sp500w) ``` Some features: * Increased volatility at the start of the financial crisis. * Volatility seems to be decreasing, but not very fast. ```{r SP500, eval = FALSE} ``` ## Forecasting Studying the past may help predict the future. Forecasting a time series usually involves choosing a model and running the model forward. Accuracy of forecasts decreases rapidly the farther ahead the forecast is made. An example from the `forecast` package: `WWWusage` is a time series of the numbers of users connected to the Internet. ```{r WWWusage, eval = FALSE} library(forecast) fit <- ets(WWWusage) fc <- forecast(WWWusage, model = fit) autoplot(fc) ``` ```{r WWWusage, echo = FALSE, message = FALSE} ``` ## Labeling Trend Lines For trend lines using a standard guide can be a little awkward. Average life expectancy against year for the continents in the `gapminder` data: ```{r trend-lines-guide, echo = FALSE, message = FALSE, fig.width = 7.5} library(gapminder) d <- group_by(gapminder, continent, year) |> summarize(avgLifeExp = mean(lifeExp)) |> ungroup() p <- ggplot(d, aes(x = year, y = avgLifeExp, color = continent)) + geom_line(size = 1) p ``` ```{r trend-lines-guide, eval = FALSE} ``` With the legend on the right, a small improvement is to order the levels to match the order of the right hand curve values: ```{r trend-lines-guide-reorder, echo = FALSE} d_last <- filter(d, year == max(year)) |> mutate(continent = reorder(continent, -avgLifeExp)) p %+% mutate(d, continent = factor(continent, levels(d_last$continent))) ``` ```{r trend-lines-guide-reorder, eval = FALSE} ``` _Direct labeling_ is often a better choice. One option is to use `geom_text` at the ends of the lines. This may need some additional adjustments to margins and clipping behavior: ```{r trend-lines-direct, echo = FALSE, fig.width = 7.5} p <- p + guides(color = "none") d_last <- filter(d, year == max(year)) p + geom_text(aes(label = continent), data = d_last, size = 5, hjust = -0.1) + coord_cartesian(clip = "off", expand = FALSE) + theme(plot.margin = margin(30, 60, 10, 10)) ``` ```{r trend-lines-direct, eval = FALSE} ``` Another option is to use a second axis: ```{r trend-lines-sec-axis, echo = FALSE, fig.width = 7.5} p + scale_y_continuous( sec.axis = dup_axis( breaks = d_last$avgLifeExp, labels = d_last$continent, name = NULL)) + scale_x_continuous(expand = c(0, 0)) ``` ```{r trend-lines-sec-axis, eval = FALSE} ``` It is also possible to place the labels along the lines. The [`geomtextpath` package](https://allancameron.github.io/geomtextpath/) provides support for this. ```{r trend-lines-text-path, echo = FALSE} library(geomtextpath) ggplot(d, aes(x = year, y = avgLifeExp, color = continent)) + geom_textline(aes(label = continent), linewidth = 1, vjust = -0.5, hjust = 0.95) + guides(color = "none") ``` ```{r trend-lines-text-path, eval = FALSE} ``` ## Bump Charts Top 10 countries in terms of GDP per capita in 2007: ```{r, class.source = "fold-hide"} top_gdp <- filter(gapminder, continent == "Europe", year == max(year)) |> slice_max(gdpPercap, n = 10) |> pull(country) d <- filter(gapminder, country %in% top_gdp) |> group_by(year) |> mutate( rank = min_rank(desc(gdpPercap))) |> ungroup() d_last <- filter(d, year == max(year)) ggplot(d, aes(year, gdpPercap, color = country)) + geom_line() ``` The dominant feature is the overall trend. One option: show deviations from the overall trend. Another option: show ranks within years. This is called a _bump chart_. A simple bump chart: ```{r, class.source = "fold-hide"} ggplot(d, aes(year, rank, color = country)) + geom_line(size = 1) + scale_y_reverse(breaks = 1:10) + geom_text(aes(label = country), data = d_last, size = 5, hjust = -0.1) + coord_cartesian(clip = "off", expand = FALSE) + theme(plot.margin = margin(30, 90, 10, 10)) + guides(color = "none") ``` The `ggbump` package provides `geom_bump` for drawing smooth bump chart lines: ```{r, class.source = "fold-hide"} library(ggbump) ggplot(d, aes(year, rank, color = country)) + geom_bump(size = 1) + scale_y_reverse(breaks = 1:10) + geom_text(aes(label = country), data = d_last, size = 5, hjust = -0.1) + coord_cartesian(clip = "off", expand = FALSE) + theme(plot.margin = margin(30, 90, 10, 10)) + guides(color = "none") ``` ## Other Approaches A [talk from 2019](https://visualisingdata.com/2019/04/talk-slides-design-of-time/) describes a wide range of options for visualizing temporal data. ## Reading Chapters [_Visualizing time series and other functions of an independent variable_](https://clauswilke.com/dataviz/time-series.html) and [_Visualizing trends_](https://clauswilke.com/dataviz/visualizing-trends.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). ## Exercises 1. Creating a time series or multiple time series object and using autoplot can be an efficient way to obtain line plots of time series data. The following code uses the `weather` table in the `nycflights13` package to create a data frame containing minimum and maximum temperatures for each month in 2013: ```{r} library(tidyverse) library(nycflights13) library(lubridate) library(forecast) temps <- group_by(weather, year, month) |> summarize(max_temp = max(temp, na.rm = TRUE), min_temp = min(temp, na.rm = TRUE)) |> ungroup() ``` Which of the following expressions produces a line plot showing the minimum and maximum temperature series (and no other series)? a. `autoplot(temps)` b. `autoplot(temps$max_temp, temps$min_temp)` c. `ts(temps) |> autoplot()` d. `select(temps, max_temp, min_temp) |> ts() |> autoplot()`