A lot of data is collected over time:

• economic performance measures;

• weather, climate data;

• medical information.

The objectives can be

• understanding the past – how we got where we are;

• understanding patterns in variation;

• forecasting what will happen next.

Even if time is not of primary interest, data are often collected with a time stamp.

Examining the behavior over time can reveal interesting features that may hint at confounding variables:

• temperature varying with time of day;

• humidity varying over the year.

Time series data can be collected at regularly spaced intervals:

• yearly, e.g. the gapminder data

• monthly, e.g. the river data

• daily, e.g. high temperatures.

Data might be missing

• irregularly, e.g. faulty instrument, holidays;

• regularly, e.g. weekends.

Time series can also be recorded at irregular times:

• blood pressure at doctor visits.

Variables recorded over time can be numerical or categorical.
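For a categorical series, one quick numeric summary is the pattern of runs, which base R's rle() computes. A small illustration with made-up daily weather states:

```r
## A toy categorical series of daily weather states
weather <- c("dry", "dry", "wet", "wet", "wet", "dry")
rle(weather)  ## runs: 2 x "dry", 3 x "wet", 1 x "dry"
```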

Basic Plots for a Single Numeric Time Series

Numeric time series are usually plotted as a line chart.

river <- scan("https://www.stat.uiowa.edu/~luke/data/river.dat")
plot(river, type = "l")

Using symbols in addition to lines is sometimes useful:

plot(river, type = "b")

Using ggplot:

library(ggplot2)
p <- ggplot(NULL, aes(y = river, x = seq_along(river)))
p + geom_line() + geom_point()

Using lattice:

library(lattice)
xyplot(river ~ seq_along(river), type = c("l", "p"))

Aspect Ratio

Aspect ratio has a significant effect on how well we perceive slope differences.

• Banking to 45 degrees: choose an aspect ratio so the magnitude of the important slopes is close to 45 degrees.

• This usually requires the horizontal axis to be longer than the vertical axis.
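The banking idea can be sketched numerically: compute the segment slopes in units of the plot region and pick the height-to-width ratio that makes the median absolute on-screen slope equal to one (45 degrees). This is only a rough sketch of the idea, not the algorithm lattice actually uses:

```r
## Crude 45-degree banking: aspect ratio (height / width) that makes
## the median absolute on-screen segment slope equal to 1
bank45 <- function(x, y) {
    dx <- diff(x) / diff(range(x))  # horizontal steps, normalized
    dy <- diff(y) / diff(range(y))  # vertical steps, normalized
    1 / median(abs(dy / dx))
}
y <- sin(seq(0, 20, length.out = 200))
bank45(seq_along(y), y)  ## well below 1: plot should be wide and short
```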

Controlling the aspect ratio:

• For base graphics use the asp argument to plot.

• lattice uses the aspect argument.

• For ggplot use the ratio argument to coord_fixed.

plot(river, asp = 5, type = "l")
xyplot(river ~ seq_along(river), aspect = 0.2, type = "l")
p + geom_line() + coord_fixed(ratio = 5)

Lattice supports several aspect ratio strategies:

• “fill” to avoid unused white space;
• “xy” to use a 45 degree banking rule;
• “iso” for isometric scales, e.g. distances.
• Some work may be needed to avoid excess white space around figures.

• Using multiple panels can help.

• In an interactive setting, zoom and pan support can help.

Multiple panels in lattice with overlap:

xyplot(river ~ seq_along(river) | equal.count(seq_along(river), 3, overlap = 0.1),
       type = "l", aspect = "xy", strip = FALSE,
       scales = list(relation = "sliced"),
       xlab = "months")

Multiple panels in ggplot:

nbr <- 3
dr <- data.frame(flow = river,
                 time = seq_along(river),
                 period = cut(seq_along(river), nbr))
ggplot(dr, aes(time, flow)) +
    geom_line() + ## coord_fixed(ratio = 5) +
    facet_wrap(~ period, nrow = nbr, scales = "free_x")

Adding zoom and pan support using plotly:

library(plotly)
pp <- p + geom_line() + coord_fixed(ratio = 5)
ggplotly(pp)

There is not always a single best aspect ratio.

The co2 data set in the datasets package contains monthly concentrations of CO2 for the years 1959 to 1997 recorded at the Mauna Loa observatory.

The co2 data is stored as an object of class ts:

str(co2)
##  Time-Series [1:468] from 1959 to 1998: 315 316 316 318 318 ...
attributes(co2)
## $tsp
## 1959.000 1997.917 12.000
##
## $class
## "ts"
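The tsp attribute (start, end, frequency) is all that turns a plain vector into a time series. A small constructed example:

```r
## A quarterly series: 8 values starting in the first quarter of 2020
x <- ts(1:8, start = c(2020, 1), frequency = 4)
tsp(x)        ## 2020.00 2021.75 4
frequency(x)  ## 4
time(x)       ## 2020.00, 2020.25, ..., 2021.75
```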

plot and xyplot have methods for ts objects that simplify time series plotting.

Two aspect ratios:

xyplot(co2)
xyplot(co2, aspect = "xy")

Is there a slowdown in the increase? A graph with more recent data is available. The full data is available as well.

Updated Mauna Loa CO2 Data

library(lubridate)
url <- "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt"
if (! file.exists("co2new.dat") ||
    now() > file.mtime("co2new.dat") + weeks(4))
    download.file(url, "co2new.dat")  ## refresh the cached copy if over four weeks old

An initial look:

co2new <- read.table("co2new.dat")
head(co2new)
##     V1 V2       V3     V4     V5     V6 V7
## 1 1958  3 1958.208 315.71 315.71 314.62 -1
## 2 1958  4 1958.292 317.45 317.45 315.29 -1
## 3 1958  5 1958.375 317.50 317.50 314.71 -1
## 4 1958  6 1958.458 -99.99 317.10 314.85 -1
## 5 1958  7 1958.542 315.86 315.86 314.98 -1
## 6 1958  8 1958.625 314.93 314.93 315.94 -1
names(co2new) <- c("year", "month", "decimal_data", "average",
                   "interpolated", "trend", "ndays")

plot(co2new$average)
head(co2new)
##   year month decimal_data average interpolated  trend ndays
## 1 1958     3     1958.208  315.71       315.71 314.62    -1
## 2 1958     4     1958.292  317.45       317.45 315.29    -1
## 3 1958     5     1958.375  317.50       317.50 314.71    -1
## 4 1958     6     1958.458  -99.99       317.10 314.85    -1
## 5 1958     7     1958.542  315.86       315.86 314.98    -1
## 6 1958     8     1958.625  314.93       314.93 315.94    -1

Missing values are encoded as -99.99. These need to be replaced with R's missing value representation:

library(dplyr)
filter(co2new, average < -90)
##   year month decimal_data average interpolated  trend ndays
## 1 1958     6     1958.458  -99.99       317.10 314.85    -1
## 2 1958    10     1958.792  -99.99       312.66 315.61    -1
## 3 1964     2     1964.125  -99.99       320.07 319.61    -1
## 4 1964     3     1964.208  -99.99       320.73 319.55    -1
## 5 1964     4     1964.292  -99.99       321.77 319.48    -1
## 6 1975    12     1975.958  -99.99       330.58 331.60     0
## 7 1984     4     1984.292  -99.99       346.84 344.27     2
co2new <- mutate(co2new, average = ifelse(average < -90, NA, average))
plot(co2new$average, type = "l")

The plot shows gaps for the missing values. The interpolated series fills the gaps:

plot(co2new$average, type = "l")
lines(co2new$interpolated, col = "red")

Converting the new data to a ts object makes adding the original data a little easier:

avts <- ts(co2new$average, start = c(1958, 3), frequency = 12)
plot(avts)
lines(co2, col = "red")

• For time series with a strong seasonal component it can be useful to look at a Seasonal Decomposition of Time Series by Loess (STL).

• The methodology was suggested by Cleveland and coworkers.

• The stl function computes such a decomposition; it requires a series without missing values.

avits <- ts(co2new$interpolated, start = c(1958, 3), frequency = 12)
co2stl <- stl(avits, s.window = 21)
head(co2stl$time.series)
##       seasonal    trend  remainder
## [1,]  1.1201682 314.9204 -0.3305750
## [2,]  2.2928419 314.9962  0.1609439
## [3,]  2.7604674 315.0720 -0.3324890
## [4,]  2.2105461 315.1478 -0.2583752
## [5,]  0.8756507 315.2170 -0.2326649
## [6,] -1.0881115 315.2862  0.7319123
plot(co2stl)

Time Series Objects

There are a number of specialized object classes for dealing with time series. Some of the most useful:

Class     Package       Features
ts, mts   base package  regular series
zoo       zoo package   irregular and regular series; autoplot support

The zoo package provides a method for the ggplot2 function autoplot that produces an appropriate plot for an object of class zoo:

library(zoo)
p <- autoplot(as.zoo(avts))
p

We can then add a layer for the original co2 data using geom_line. Creating a new data frame for the co2 data makes this easier:

zco2 <- data.frame(time = time(co2), average = co2)
p + geom_line(aes(x = time, y = average), data = zco2, color = "red")
## Don't know how to automatically pick scale for object of type yearmon. Defaulting to continuous.

Some Alternative Visualizations

GDP Growth Data

Bar charts are sometimes used for economic time series, in particular growth rate data. Some GDP growth rate data can be read from the web page:

library(rvest)
gdpurl <- "http://www.multpl.com/us-real-gdp-growth-rate/table/by-quarter"
gdppage <- read_html(gdpurl)
gdptbl <- html_table(gdppage)[[1]]
head(gdptbl)

Clean the data:

library(dplyr)
library(lubridate)
names(gdptbl) <- c("Date", "Value")
gdptbl <- mutate(gdptbl,
                 Value = as.numeric(sub("%", "", Value)),
                 Date = mdy(Date))
head(gdptbl)

Data prior to 1947 is annual, so drop earlier data:

gdptbl <- filter(gdptbl, Date >= make_date(1947, 1, 1))

Rearrange to put Date in increasing order:

gdptbl <- arrange(gdptbl, Date)

Create and plot a time series:

gdpts <- ts(gdptbl$Value, start = 1947, frequency = 4)
plot(gdpts)
abline(h = 0, lty = 2)

A bar chart with color used to distinguish positive and negative growth can be created from the original table gdptbl:

library(ggplot2)
ggplot(mutate(gdptbl, growth = ifelse(Value >= 0, "pos", "neg")),
       aes(x = Date, y = Value, fill = growth)) +
    geom_col() + scale_fill_manual(values = c(pos = "blue", neg = "red"))

The bar chart can also be created from the time series object:

gdpd <- data.frame(rate = as.numeric(gdpts),
                   time = as.numeric(time(gdpts)))
gdpd <- mutate(gdpd, growth = ifelse(rate >= 0, "pos", "neg"))
ggplot(gdpd, aes(time, rate, fill = growth)) +
    geom_col() + scale_fill_manual(values = c(pos = "blue", neg = "red"))

Airline Flights From NYC

The nycflights13 data set contains data on all flights leaving New York City in 2013. Similar data is available for other cities and years.

We can look at the number of departures for each day:

library(nycflights13)
library(ggplot2)
nf <- count(flights, year, month, day)
• It will be useful to have a combined date field in our data frame.

• The lubridate function make_date can be used for this:

library(lubridate)
nf <- mutate(nf, date = make_date(year, month, day))
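For reference, the same date construction can be done in base R without lubridate; a small sketch (the helper name mk_date is made up here):

```r
## Base-R equivalent of lubridate::make_date
mk_date <- function(year, month, day)
    as.Date(sprintf("%04d-%02d-%02d", year, month, day))
mk_date(2013, 2, 3)  ## "2013-02-03"
```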

Some plots:

library(ggplot2)
p <- ggplot(nf, aes(date, n)) + geom_line()
p
p + facet_wrap(~ wday(date, TRUE)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

A simple calendar plot:

monthweek <- function(d, w) ceiling((d - w) / 7) + 1
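A quick sanity check of monthweek on known dates (in January 2013 the 1st was a Tuesday, so its wday is 3):

```r
## (definition repeated so the chunk is self-contained)
monthweek <- function(d, w) ceiling((d - w) / 7) + 1
monthweek(1, 3)   ## Jan 1, 2013 (Tuesday, wday 3): week 1
monthweek(8, 3)   ## Jan 8, 2013 (also a Tuesday): week 2
monthweek(31, 5)  ## Jan 31, 2013 (Thursday, wday 5): week 5
```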

nf <- mutate(nf, wd = wday(date, label = TRUE))
nf <- mutate(nf, wd = factor(wd, levels = rev(levels(wd))))
nf <- mutate(nf, mw = monthweek(day, wday(date)))

ggplot(nf, aes(x = as.character(mw), y = wd, fill = n)) +
    geom_tile(color = "white") +
    facet_wrap(~ month(date, TRUE)) +
    ylab("") + xlab("Week of Month") +
    theme(panel.grid.major = element_blank())

A slightly better version:

nf2 <- mutate(nf, wd = wday(date, label = TRUE))
nf2 <- mutate(nf2, wd = factor(wd))
nf2 <- mutate(nf2, mw = factor(monthweek(day, wday(date))))
nf2 <- mutate(nf2, mw = factor(mw, rev(levels(mw))))

ggplot(nf2, aes(x = wd, y = mw, fill = n)) +
    geom_tile(color = "white") +
    facet_wrap(~ month(date, TRUE)) +
    ylab("") + xlab("Week of Month") +
    theme(panel.grid.major = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))

Multiple Time Series

Some options for plotting multiple series:

• separate panels in a trellis display;

• multiple series in a single plot - this will require standardizing if the scales vary substantially;

• a multivariate plot with time represented by connecting line segments or animation.

Another option for two time series:

• use a separate y axis for each.

This is usually a bad idea.
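A better alternative to dual axes is usually to standardize the series onto a common scale and plot them together; scale() centers each series and divides by its standard deviation. A sketch with simulated data:

```r
## Two simulated series on very different scales (toy data)
set.seed(1)
x <- ts(cumsum(rnorm(100)))
y <- ts(1000 + 50 * cumsum(rnorm(100)))
## Standardize each to mean 0, sd 1 and plot on common axes
z <- ts(cbind(x = as.numeric(scale(x)), y = as.numeric(scale(y))))
plot(z, plot.type = "single", col = c("black", "red"),
     ylab = "standardized value")
```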

Crimean War Casualties

The HistData package contains the data used by Florence Nightingale to call attention to terrible sanitary conditions for the British army in the Crimea.

library(lattice)
data(Nightingale, package = "HistData")
##         Date Month Year  Army Disease Wounds Other Disease.rate
## 1 1854-04-01   Apr 1854  8571       1      0     5          1.4
## 2 1854-05-01   May 1854 23333      12      0     9          6.2
## 3 1854-06-01   Jun 1854 28333      11      0     6          4.7
## 4 1854-07-01   Jul 1854 28722     359      0    23        150.0
## 5 1854-08-01   Aug 1854 30246     828      1    30        328.5
## 6 1854-09-01   Sep 1854 30290     788     81    70        312.2
##   Wounds.rate Other.rate
## 1         0.0        7.0
## 2         0.0        4.6
## 3         0.0        2.5
## 4         0.0        9.6
## 5         0.4       11.9
## 6        32.1       27.7

Lattice allows multiple series to be plotted with

xyplot(Disease + Wounds + Other ~ Date, data = Nightingale,
       type = "l", auto.key = TRUE)

Using the rate variables produces a very similar pattern:

xyplot(Disease.rate + Wounds.rate + Other.rate ~ Date, data = Nightingale,
       type = "l", auto.key = TRUE)

The army size does not vary enough over the time where the casualties are high to affect the pattern much:

xyplot(Army ~ Date, data = Nightingale, type = "l")

To use ggplot first create a long format data frame:

library(tidyr)
nd <- gather(Nightingale, series, value, 4, 8 : 10)
##         Date Month Year Disease Wounds Other series value
## 1 1854-04-01   Apr 1854       1      0     5   Army  8571
## 2 1854-05-01   May 1854      12      0     9   Army 23333
## 3 1854-06-01   Jun 1854      11      0     6   Army 28333
## 4 1854-07-01   Jul 1854     359      0    23   Army 28722
## 5 1854-08-01   Aug 1854     828      1    30   Army 30246
## 6 1854-09-01   Sep 1854     788     81    70   Army 30290
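The wide-to-long step can also be done in base R; on a toy frame (made-up values) the reshaping amounts to stacking the columns and recording which column each value came from:

```r
## Toy wide-format frame (made-up values)
wide <- data.frame(Date = 1:3, Disease = c(1, 12, 11), Wounds = c(0, 0, 81))
## Stack the two series into long format by hand
long <- data.frame(Date   = rep(wide$Date, 2),
                   series = rep(c("Disease", "Wounds"), each = nrow(wide)),
                   value  = c(wide$Disease, wide$Wounds))
long
```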

The series can then be graphed as

library(dplyr)
library(ggplot2)
ggplot(filter(nd, series != "Army"), aes(Date, value)) +
    geom_line(aes(color = series))

• The long format also supports creating faceted plots.

• Usually for time series this is done with
  • y scales free;
  • a single column and x scales aligned.
xyplot(value ~ Date | series, data = nd, type = "l", auto.key = TRUE,
       scales = list(y = "free"), layout = c(1, 4))

ggplot(nd, aes(Date, value)) + geom_line() +
    facet_wrap(~ series, scales = "free_y", ncol = 1)

Converting to a time series or zoo object is another option:

nts <- ts(select(Nightingale, Army, ends_with("rate")),
          start = c(1854, 4), frequency = 12)
library(zoo)
ntz <- as.zoo(nts)
plot(nts)
plot(nts[, -1], plot.type = "single")
xyplot(nts)
xyplot(nts[, -1], superpose = TRUE)
autoplot(ntz) + facet_free()
autoplot(ntz[, -1], facets = NULL)
## Don't know how to automatically pick scale for object of type yearmon. Defaulting to continuous.

In a faceted plot without free scales, variables need to be measured on comparable scales:

autoplot(ntz)
autoplot(ntz[, -1])

A free y scale can be requested using faceting functions, e.g.

autoplot(ntz) + facet_free()

Many Time Series

From Unwin’s book: Annual flow for dams in Italy:

library(nsRFA)
data(hydroSIMN)
annualflows <- mutate(annualflows, cod = factor(cod))

ggplot(annualflows, aes(anno, dato, group = cod, color = factor(cod))) +
    geom_line()

ggplot(annualflows, aes(anno, dato, group = cod)) +
    geom_line() + facet_wrap(~ cod)

Unemployment rate by state:

library(ggplot2)
library(geofacet)
ggplot(state_unemp, aes(year, rate)) +
    geom_line() +
    facet_geo(~ state, grid = "us_state_grid2", label = "name") +
    scale_x_continuous(labels = function(x) paste0("'", substr(x, 3, 4))) +
    labs(title = "Seasonally Adjusted US Unemployment Rate 2000-2016",
         caption = "Data Source: bls.gov",
         x = "Year",
         y = "Unemployment Rate (%)") +
    theme(strip.text.x = element_text(size = 6))

The Fall and Rise of Income Inequality

A 2015 post on the NPR Planet Money site uses two graphs to show transitions in income inequality.

The page includes a link to the data source. This has been unavailable for some time.

I tried to obtain the same data but may not have done it quite right. The closest source I could find is https://wid.world/data/.

library(lattice)
library(ggplot2)
library(tidyr)

Plots over time:

xyplot(U01 + L90 ~ year, data = income, type = "l", auto.key = TRUE)
xyplot(U01 / U01[1] + L90 / L90[1] ~ year, data = income, type = "l",
       auto.key = TRUE)

its <- ts(select(income, -year), start = 1913)
xyplot(its)

Comparative plot:

xyplot(L90 ~ U01, data = income)
xyplot(L90 ~ U01, data = income, type = c("l", "p"))

Stack Loss Data

Operational data obtained from 21 days of operation of a plant for the oxidation of ammonia (NH3) to nitric acid (HNO3).

The dependent variable, stack.loss, is 10 times the percentage of the ingoing ammonia that escapes from the absorption column unabsorbed.
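The stackloss data frame is one of base R's built-in datasets; its structure:

```r
str(stackloss)
## a data frame: 21 observations of Air.Flow, Water.Temp,
## Acid.Conc., and stack.loss
```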

Raw data:

library(lattice)
splom(stackloss)

The data are collected over time:

library(dplyr)
sl <- mutate(stackloss, index = 1:nrow(stackloss))
splom(stackloss, type = "l")

Randomly reordering the indices can help calibrate whether there is a time effect:

splom(stackloss[sample(nrow(stackloss)), ], type = "l")

This is sometimes called graphical inference.

Hadley Wickham, Dianne Cook, Heike Hofmann, and Andreas Buja, “Graphical Inference for Infovis,” IEEE Transactions on Visualization and Computer Graphics, Vol. 16, No. 6, November/December 2010.

Picking one plot and embedding the true order among 19 random orderings:

N <- 20
k <- sample(N, 1)

ss <- function(i) {
    if (i == k)
        sl <- stackloss
    else
        sl <- stackloss[sample(nrow(stackloss)), ]
    mutate(sl, sample = i)
}

d <- lapply(1 : N, ss)
d <- do.call(rbind, d)

xyplot(stack.loss ~ Acid.Conc. | sample, type = "l", data = d)

library(ggplot2)
ggplot(d, aes(x = Acid.Conc., y = stack.loss)) +
    geom_path() + facet_wrap(~ sample)

k
##  7

Lynx-Hare Data

Trapping data on Canadian lynx and snowshoe hare:

names(lynxhare) <- c("year", "hare", "lynx")
lhts <- ts(lynxhare[, -1], start = 1845)
lhtz <- as.zoo(lhts)

Cycle patterns are similar:

autoplot(lhtz) + facet_free()

Lynx cycles often lag behind:

autoplot(lhtz, facets = NULL)

Things to Look For

• Changes in data definition.

• Administrative definitions, such as employment/unemployment can change over time.

• Administrative entities, such as countries and states, can change over time.

• Instruments can change (satellite imagery replacing surface measurements).

• Outliers need not be globally outlying: departures from local patterns may also be important.

• Level shifts and changes in volatility.

• Shocks affecting level or volatility.
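These features can be illustrated with simulated data: a level shift combined with a volatility increase part way through a series (a toy example):

```r
## Simulate a series with a level shift and a volatility increase at t = 101
set.seed(42)
x <- rnorm(200)
x[101:200] <- 5 + 3 * x[101:200]  # new level 5, three times the spread
plot(ts(x))
abline(v = 101, lty = 2)
```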

library(fpp)
xyplot(melsyd)

Some features:

• Strike in 1989.
• Increase in passenger load starting in second half of 1991.
• Dip in economy, rise in business in 1992 corresponding to an experiment with fewer economy/more business seats.

Weekly closing returns of the S&P 500 from 2003 to September, 2012:

library(xts)
library(astsa)
autoplot(sp500w)

Some features:

• Increased volatility at the start of the financial crisis.

• Volatility seems to be decreasing, but not very fast.

Forecasting

• Studying the past may help predict the future.

• Forecasting a time series usually involves choosing a model and running the model forward.

• The accuracy of a forecast decreases rapidly the farther ahead the forecast is made.
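The "choose a model and run it forward" idea can also be sketched with base R's stats package, using Holt-Winters exponential smoothing on the co2 series (an alternative to the forecast package used below):

```r
## Fit a Holt-Winters model to the co2 series and forecast two years ahead
fit <- HoltWinters(co2)
fc <- predict(fit, n.ahead = 24, prediction.interval = TRUE)
plot(fit, fc)  # the prediction intervals widen as the horizon grows
```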

An example from the forecast package: WWWusage is a time series of the numbers of users connected to the Internet.

library(forecast)
fit <- ets(WWWusage)
fc <- forecast(WWWusage, model=fit)
## Model is being refit with current smoothing parameters but initial states are being re-estimated.
## Set 'use.initial.values=TRUE' if you want to re-use existing initial values.
plot(fc)

Other Approaches

A recent talk describes a wide range of options for visualizing temporal data.