---
title: "Data and Data Frames"
output:
  html_document:
    toc: yes
    code_download: true
---
```{r setup, include=FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
fig.height = 5, fig.width = 6, fig.align = "center")
options(width = 80)
```
## Data Structures and Data Attribute Types
Data comes in many different forms.
Some of the most common data structures are
* rectangular tables
* networks and trees
* geometries, regions
Other forms include
* collections of text
* video and audio recordings
* ...
We will work mostly with tables.
Many other forms can be reduced to tables.
[Stevens](https://en.wikipedia.org/wiki/Stanley_Smith_Stevens) (1945)
classifies [scales of
measurement](https://en.wikipedia.org/wiki/Level_of_measurement) for
attributes, or variables, as
* nominal (e.g. hair color)
* ordinal (e.g. dislike, neutral, like)
* interval (e.g. temperature) --- compare by difference
* ratio (e.g. counts) --- compare by ratio
These are sometimes grouped as
* qualitative: nominal
* quantitative: ordinal, interval, ratio
These can be viewed as _semantic classifications_.
_Computational considerations_ often classify variables as
* categorical
* integer, discrete
* real, continuous
Another consideration is that some scales may be cyclic:
* hours of the day
* angles, longitude
These distinctions can be important in choosing visual representations.
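To see why cyclic scales need some care, here is a minimal sketch of a distance computed on a cyclic scale. The helper `cyclic_diff()` is a hypothetical name introduced only for this illustration; it is not part of base R.

```{r}
## Hypothetical helper: shortest distance between two values on a
## cyclic scale with the given period.
cyclic_diff <- function(a, b, period) {
    d <- abs(a - b) %% period
    pmin(d, period - d)
}
cyclic_diff(23, 1, period = 24)    ## hours of the day: 2, not 22
cyclic_diff(350, 10, period = 360) ## angles in degrees: 20, not 340
```

A plain difference would treat 11 pm and 1 am as far apart; a visual representation on a linear axis has the same problem.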
Other typologies include one proposed by [Mosteller and Tukey
(1977)](https://en.wikipedia.org/wiki/Level_of_measurement#Mosteller_and_Tukey's_typology_(1977)):
1. Names
2. Grades (ordered labels like beginner, intermediate, advanced)
3. Ranks (orders with 1 being the smallest or largest, 2 the next
smallest or largest, and so on)
4. Counted fractions (bound by 0 and 1)
5. Counts (non-negative integers)
6. Amounts (non-negative real numbers)
7. Balances (any real number)
## Data Types in R
R variables can be of different types.
The most common types are
* `numeric` for real numbers
* `integer`
* `character` for text data or nominal data
* `factor` for nominal or ordinal data
Factors can be
* unordered, for nominal data
* ordered, for ordinal data
Factors are more efficient and powerful than `character` vectors for
representing nominal or ordinal data, but they can take a bit more
getting used to.
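A short sketch of these types, using the examples from the scales of measurement above. The variable names `hair` and `opinion` and their values are made up for illustration.

```{r}
class(3.14)     ## "numeric"
class(2L)       ## "integer"
class("brown")  ## "character"
## Unordered factor for nominal data; levels are sorted by default
hair <- factor(c("brown", "black", "brown"))
class(hair)
levels(hair)
## Ordered factor for ordinal data; the level order is given explicitly
opinion <- factor(c("like", "dislike", "neutral"),
                  levels = c("dislike", "neutral", "like"),
                  ordered = TRUE)
class(opinion)
opinion < "like"  ## comparisons use the level order
```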
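Conversion of factors with numeric-looking labels to numeric data
should always go through `as.character` first, since `as.numeric`
applied directly to a factor returns the internal integer codes rather
than the labels.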
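The following sketch, with made-up labels, shows what goes wrong when the `as.character` step is skipped. The variable name `num_labels` is introduced here only for illustration.

```{r}
num_labels <- factor(c("10", "20", "10", "5"))
levels(num_labels)                    ## sorted as text: "10" "20" "5"
as.numeric(num_labels)                ## level codes 1 2 1 3, not the labels
as.numeric(as.character(num_labels))  ## 10 20 10 5
```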
## Data Frames: Organizing Cases and Variables
Tabular data in R is usually stored as a _data frame_.
A data frame is a collection of _variables_, each with a value for
every _case_ or _observation_.
The `faithful` data set is a data frame:
```{r}
class(faithful)
names(faithful)
```
Most tools we work with in R use data organized in data frames.
Our `plot()` and `lm()` expressions from the
[introductory section](intro.html)
can also be written as
```{r, eval = FALSE}
plot(waiting ~ eruptions, data = faithful,
xlab = "Eruption time (min)",
ylab = "Waiting time to next eruption (min)")
fit <- lm(waiting ~ eruptions, data = faithful)
```
`plot()` only uses the `data` argument when the plot is specified as a
_formula_, like
```r
waiting ~ eruptions
```
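For comparison, here is a sketch of the same scatterplot requested three ways. The second and third forms are alternatives for the case where no formula is used; the chunk is not evaluated, to avoid drawing the same plot three times.

```{r, eval = FALSE}
## Formula interface: the variables are looked up in `data`.
plot(waiting ~ eruptions, data = faithful)
## Without a formula, pass the vectors themselves ...
plot(faithful$eruptions, faithful$waiting)
## ... or evaluate the call inside the data frame using with().
with(faithful, plot(eruptions, waiting))
```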
## Examining the Data in a Data Frame
`head()` provides an idea of what the raw data looks like:
```{r}
head(faithful)
```
`str()` is also useful for an overview of an object's structure:
```{r}
str(faithful)
```
Another useful function, available from the `dplyr` or `tibble`
packages, is `glimpse()`:
```{r}
library(dplyr)
glimpse(faithful)
```
`summary()` shows basic statistical properties of the variables:
```{r}
summary(faithful)
```
`summary()` with a character variable and a factor variable:
```{r}
ffl <- mutate(faithful,
type = ifelse(eruptions < 3, "short", "long"),
ftype = factor(type))
summary(ffl)
```
## Variables in a Data Frame
A data frame is a list, or vector, of variables:
```{r}
length(faithful)
```
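A quick way to confirm the list structure, as a small sketch using only base R:

```{r}
is.list(faithful)        ## TRUE
typeof(faithful)         ## "list"
sapply(faithful, class)  ## one class per variable
```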
The dollar sign `$` can be used to examine individual variables:
```{r}
class(faithful$eruptions)
class(faithful$waiting)
```
The variables can also be extracted by numerical or character index using the
element extraction operation `[[`:
```{r}
class(faithful[[1]])
class(faithful[["waiting"]])
```
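It may help to contrast `[[` with the single bracket `[`, which keeps the data frame structure instead of extracting the variable. A minimal sketch:

```{r}
class(faithful["waiting"])    ## a one-column data frame
class(faithful[["waiting"]])  ## the variable itself
identical(faithful[["waiting"]], faithful$waiting)
```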
## Dimensions
The numbers of rows and columns can be obtained using `nrow()` and `ncol()`:
```{r}
ncol(faithful)
nrow(faithful)
```
`dim()` returns a vector of the dimensions:
```{r}
dim(faithful)
```
## Simple Visualizations
`plot` has a _method_ for data frames that tries to provide a
reasonable default visualization for numeric data frames:
```{r}
plot(faithful)
```
## Sample Data Sets
The
[`datasets`](http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html)
package in the base R distribution contains a number of data sets you
can explore.
Another package with a useful collection of data sets is [`dslabs`](https://cran.r-project.org/package=dslabs).
Many other data sets are contained in contributed packages as examples.
There are also many contributed packages designed specifically for
making particular data sets available.
## Tidy Data
The useful concept of a _tidy_ data frame was introduced fairly
[recently](https://www.jstatsoft.org/article/view/v059i10) and is
described in a [chapter in _R for Data
Science_](http://r4ds.had.co.nz/tidy-data.html).
The idea is that
* every observation should correspond to a single row;
* every variable should correspond to a single column.
Tidy data is computationally convenient, and many of the tools we will
use are designed around tidy data frames.
A large range of these tools can be accessed by loading the
`tidyverse` package:
```{r, eval = FALSE}
library(tidyverse)
```
But for now I will load the needed packages individually.
The term _tidy_ is a little unfortunate.
* Data that is not _tidy_ isn't necessarily _bad_.
* For human reading, and for some computations, data in a wider format can
be better.
* For other computations data in a longer, or narrower, format can be better.
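As a small illustration of wider versus longer formats, the sketch below reshapes a made-up table in both directions. It assumes the `tidyr` functions `pivot_longer` and `pivot_wider`; the data and the names `toy_wide`, `toy_long`, and `toy_back` are invented for this example.

```{r}
library(tibble)
library(tidyr)
## Made-up wide table: one row per id, one column per year
toy_wide <- tibble(id = c("a", "b"),
                   y2020 = c(1, 2),
                   y2021 = c(3, 4))
## Longer format: one row per id/year combination
toy_long <- pivot_longer(toy_wide, -id,
                         names_to = "year",
                         values_to = "value")
toy_long
## Back to the wider format
toy_back <- pivot_wider(toy_long,
                        names_from = year,
                        values_from = value)
toy_back
```

Both forms hold the same information; which one is more convenient depends on the computation.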
## Tibbles
Many tools in the `tidyverse` produce slightly enhanced data frames
called _tibbles_:
```{r}
library(tibble)
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)
```
Tibbles print differently from standard data frames:
```{r, highlight.output=c(1, 3, 14)}
faithful_tbl
```
For the most part data frames and tibbles can be used interchangeably.
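One difference worth knowing about is how single-column subsetting with `[` behaves. This is a sketch of that one case, not a full list of differences; the name `tbl_demo` is introduced only for this example.

```{r}
library(tibble)
tbl_demo <- as_tibble(faithful)
class(faithful[, "waiting"])  ## "numeric": the data frame drops to a vector
class(tbl_demo[, "waiting"])  ## still a tibble
class(tbl_demo[["waiting"]])  ## `[[` extracts the vector in both cases
```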
## Tidying Data
Many data sets are in tidy form already.
If they are not, they can be put into tidy form.
The tools for this are part of _data technologies_.
The tasks involved are part of what is sometimes called _data
wrangling_.
## An Example: Global Average Surface Temperatures
Among many data sets available at
is data on monthly global
average surface temperatures over a number of years.
These data from 2017 were used for the widely cited
[Bloomberg hottest year visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/).
The current data are available in a formatted text file at
or as a
[_CSV_ (comma-separated values)](https://en.wikipedia.org/wiki/Comma-separated_values) file at
The CSV file is a little easier (for a computer program) to read in,
so we will work with that.
The numbers in the CSV file represent deviations in degrees Celsius
from the average temperature for the base period 1951-1980.
A copy of the file as it was on January 14, 2022, is available
[locally](http://homepage.divms.uiowa.edu/~luke/data/GLB.Ts+dSST.csv).
We can make sure it has been downloaded to our working directory with
```{r}
if (! file.exists("GLB.Ts+dSST.csv"))
download.file("http://homepage.divms.uiowa.edu/~luke/data/GLB.Ts+dSST.csv",
"GLB.Ts+dSST.csv")
```
Assuming this locally available file has been downloaded, we can read
in the data and drop some columns we don't need with
```{r}
library(readr)
gast <- read_csv("GLB.Ts+dSST.csv", skip = 1)[1 : 13]
```
* The function `read_csv` is from the `readr` package, which is part of the
`tidyverse`.
* An alternative is the base R function `read.csv`.
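Here is a minimal sketch of the base R alternative on a made-up CSV text with the same layout (a title line to skip, then a header). The values and the names `csv_text` and `toy_base` are invented for illustration; the `text` argument is used so the example does not depend on a file.

```{r}
## Made-up CSV text with a title line, mimicking the file layout
csv_text <- "Toy title line to skip
Year,Jan,Feb
2000,0.25,0.57
2001,0.43,0.44"
toy_base <- read.csv(text = csv_text, skip = 1)
class(toy_base)  ## "data.frame", not a tibble
str(toy_base)
```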
A look at the first few lines:
```{r}
head(gast, 5)
```
And the last few lines:
```{r}
tail(gast, 5)
```
The `print()` method for tibbles abbreviates the output.
It is neater and provides some useful additional information on
variable data types.
But it shows only the first few rows and may not explicitly show some
columns.
If some columns are skipped, you can ask to see more by calling
`print()` explicitly:
```{r}
print(tail(gast), width = 100)
```
The format with one column per month is compact and useful for viewing
and data entry.
But it is not in _tidy format_ since
* the monthly temperature variable is spread over 12 columns;
* the month variable is encoded in the column headings.
For obvious reasons this data format is often referred to as _wide
format_.
The tidy, or _long_, format would have three variables: `Year`,
`Month`, and `Temp`.
One way to put this data frame in tidy, or long, format uses `pivot_longer`
from the `tidyr` package:
```{r}
library(tidyr)
lgast <- pivot_longer(gast,
-Year, ## specifies the columns to use -- all but Year
names_to = "Month",
values_to = "Temp")
```
The first few rows of the result:
```{r}
head(lgast)
```
During plotting it is likely that the `Month` variable will be
converted to a _factor_.
By default, this will order levels alphabetically, which is not what we want:
```{r}
levels(as.factor(lgast$Month))
```
We can guard against this by converting `Month` to a factor with the
right levels now:
```{r}
lgast <- mutate(lgast, Month = factor(Month, levels = month.abb))
levels(lgast$Month)
```
We can now use this tidy version of the data to create a static
version of the [Bloomberg hottest year
visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/).
The basic framework is set up with
```{r}
library(ggplot2)
p <- ggplot(lgast) +
ggtitle("Global Average Surface Temperatures") +
theme(plot.title = element_text(hjust = 0.5))
```
`ggplot` objects only produce output when they are printed.
To see the plot in `p` we need to print it, for example by using a
line with only `p`.
```{r gast-base, eval = FALSE}
p
```
```{r gast-base, eval = TRUE, echo = FALSE}
```
Then add a layer with lines for each year (specified by the `group`
argument to `geom_line`).
```{r gast-lines, eval = FALSE}
p + geom_line(aes(x = Month,
y = Temp,
group = Year))
```
```{r gast-lines, eval = TRUE, echo = FALSE, warning = FALSE}
```
We can use color to distinguish the years.
```{r gast-color1, eval = FALSE}
p1 <- p +
geom_line(aes(x = Month,
y = Temp,
color = Year,
group = Year),
na.rm = TRUE)
p1
```
Saving the plot specification in the variable `p1` will make it easier
to experiment with color variations:
```{r gast-color1, eval = TRUE, echo = FALSE}
```
```{r, include = FALSE}
## Compute the past year and make sure it is in the file
library(lubridate)
past_year <- year(today()) - 1
past_year
stopifnot(past_year %in% lgast$Year)
```
One way to highlight the past year, `r past_year`:
```{r gast-color2, eval = FALSE}
lgast_last <- filter(lgast, Year == past_year)
p1 + geom_line(aes(x = Month,
y = Temp,
group = Year),
linewidth = 1, ## `size` is deprecated for lines as of ggplot2 3.4.0
color = "red",
data = lgast_last,
na.rm = TRUE)
```
```{r gast-color2, echo = FALSE}
```
A useful way to show more recent data in the context of the full data
is to show the full data in grey and the more recent years in black:
```{r gast-full-grey, eval = FALSE}
lgast2k <- filter(lgast, Year >= 2000)
ggplot(lgast, aes(x = Month,
y = Temp,
group = Year)) +
geom_line(color = "grey80") +
theme_minimal() +
geom_line(data = lgast2k)
```
```{r gast-full-grey, echo = FALSE, warning = FALSE}
```
If you want to update your plot later in the year then the current
year's entry may contain missing value indicators that you will have
to deal with.
The New York Times on January 18, 2018, published
[another visualization](https://www.nytimes.com/interactive/2018/01/18/climate/hottest-year-2017.html)
of these data showing average yearly temperatures ([via Google may work better](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj_4PmhnM31AhXFjYkEHfQxATEQFnoECAYQAQ&url=https%3A%2F%2Fwww.nytimes.com%2Finteractive%2F2018%2F01%2F18%2Fclimate%2Fhottest-year-2017.html&usg=AOvVaw3qouxHQAJCqrt_F3fbLEGE)).
To recreate this plot we first need to compute yearly average temperatures.
This is easy to do with the `summarize` and `group_by` functions from `dplyr`:
```{r}
library(dplyr)
atemp <- lgast %>%
group_by(Year) %>%
summarize(AveTemp = mean(Temp, na.rm = TRUE))
head(atemp)
```
Using `na.rm = TRUE` ensures that the mean is based on the available
months if data for some months is missing.
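The effect of `na.rm` can be seen on a small made-up vector (the name `toy_temps` and its values are for illustration only):

```{r}
toy_temps <- c(1, NA, 3)       ## made-up values with one missing month
mean(toy_temps)                ## NA: a missing value propagates
mean(toy_temps, na.rm = TRUE)  ## 2: mean of the available values
```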
A simple version of the plot is then produced by
```{r gast-nyt, eval = FALSE}
ggplot(atemp) +
geom_point(aes(x = Year, y = AveTemp))
```
```{r gast-nyt, echo = FALSE}
```
Another variation on the Bloomberg plot showing just a few years 20
years apart:
```{r gast-skip, eval = FALSE}
lg_by_20 <-
filter(lgast,
Year %in% seq(2020, by = -20, length.out = 5)) %>%
mutate(Year = factor(Year))
ggplot(lg_by_20, aes(x = Month,
y = Temp,
group = Year,
color = Year)) +
geom_line()
```
Converting `Year` to a factor results in a discrete color scale and legend.
```{r gast-skip, echo = FALSE}
```
## Handling Missing Values
The data for 2019 available in early 2020 is also available
[locally](http://homepage.divms.uiowa.edu/~luke/data/GLB.Ts+dSST-2019.csv).
```{r, echo = FALSE}
if (! file.exists("GLB.Ts+dSST-2019.csv"))
download.file("http://homepage.divms.uiowa.edu/~luke/data/GLB.Ts+dSST-2019.csv",
"GLB.Ts+dSST-2019.csv")
```
Assuming this locally available file has been downloaded, we can read
in the data and drop some columns we don't need with
```{r}
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", skip = 1)[1 : 13]
```
The last three columns are read as character variables:
```{r}
head(gast2019, 5)
```
The reason is that data for October through December were not
available:
```{r}
tail(gast2019, 2)
```
We want to convert these columns to numeric, with missing values
represented as `NA`.
One option is to handle them individually:
```{r}
gast2019$Oct <- as.numeric(gast2019$Oct)
tail(gast2019, 2)
```
Another option is to convert all character columns to numeric with
```{r}
gast2019 <- mutate(gast2019, across(where(is.character), as.numeric))
tail(gast2019, 2)
```
The warnings are benign and can be suppressed with the `warning =
FALSE` chunk option.
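To see where the warnings come from, here is a small sketch with a made-up character vector containing the `***` marker used in the file. The name `month_text` is introduced only for this example; `suppressWarnings()` is a base R alternative to the chunk option.

```{r}
month_text <- c("0.95", "***")            ## made-up values
as.numeric(month_text)                    ## 0.95 NA, with a coercion warning
suppressWarnings(as.numeric(month_text))  ## same result, no warning
```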
Since we know the missing value pattern `***` we can also avoid the
need to fix the data after the fact by specifying this at read time:
```{r}
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", na = "***", skip = 1)[1 : 13]
tail(gast2019, 2)
```
A plot highlighting the year 2019 shows only the months with available
data:
```{r gast-2019, eval = FALSE}
lgast2019 <- gast2019 %>%
pivot_longer(-Year,
names_to = "Month",
values_to = "Temp") %>%
mutate(Month = factor(Month, levels = month.abb))
ggplot(lgast2019, aes(x = Month,
y = Temp,
group = Year)) +
geom_line(color = "grey80",
na.rm = TRUE) +
geom_line(data = filter(lgast2019, Year == 2019),
na.rm = TRUE)
```
Adding `na.rm = TRUE` in the `geom_line` calls suppresses warnings;
the plot would be the same without these.
```{r gast-2019, echo = FALSE}
```
Outline of the tools used:
* Data processing:
    * Reading: `read_csv`, `read.csv`;
    * Reshaping: `pivot_longer`;
    * Cleaning: `is.character`, `as.numeric`, `mutate`, `across`, `factor`;
    * Summarizing: `group_by`, `summarize`.
* Visualization geometries:
    * `geom_line` for a line plot;
    * `geom_point` for a scatter plot.
## Reading
Stevens' classification of scales of measurement is described in a
[Wikipedia
article](https://en.wikipedia.org/wiki/Level_of_measurement).
A good introduction to the concept of _tidy data_ is provided in a
[chapter in _R for Data
Science_](http://r4ds.had.co.nz/tidy-data.html).
## Interactive Tutorial
An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](`r WLNK("tutorials/datafrm.Rmd")`).
You can run the tutorial with
```{r, eval = FALSE}
STAT4580::runTutorial("datafrm")
```
## Exercises
1. Which of the Stevens classifications (nominal, ordinal, interval, ratio)
   best characterizes these variables:
    a. Daily maximal temperatures in Iowa City.
    b. Population counts for Iowa counties.
    c. Education level of job applicants using the [Bureau of Labor
       Statistics
       classification](https://www.bls.gov/careeroutlook/2014/article/education-level-and-jobs.htm).
    d. Major of UI students.
2. Which of these data sets are in tidy form?
    a. The built-in data set `co2`
    b. The built-in data set `BOD`
    c. The `who` data set in package `tidyr` (`tidyr::who`)
    d. The `mpg` data set in package `ggplot2` (`ggplot2::mpg`)
The next exercises use the data in the variable `gapminder` in the package
`gapminder`. You can make it available with
```{r, eval = FALSE}
data(gapminder, package = "gapminder")
```
3. Use the function `str` to examine the value of the gapminder
variable. How many cases are there in the data set? How many of
the variables are factors?
4. Use the functions `class` and `names` to find the class and
variable names in the `gapminder` data.
5. Use `summary` to compute summary information for the variables.
6. Fill in the values for `---` needed to produce plots of life
expectancy against year for the countries in continent Oceania.
```{r, eval = FALSE}
library(dplyr)
library(ggplot2)
data(gapminder, package = "gapminder")
ggplot(filter(gapminder, continent == "Oceania"),
aes(x = ---, y = ---, color = country)) +
geom_line()
```