## Data Structures and Data Attribute Types

Data comes in many different forms.

Some of the most common data structures are

• rectangular tables
• networks and trees
• geometries, regions

Other forms include

• collections of text
• video and audio recordings

We will work mostly with tables.

Many other forms can be reduced to tables.

Stevens (1945) classifies scales of measurement for attributes, or variables, as

• nominal (e.g. hair color)
• ordinal (e.g. dislike, neutral, like)
• interval (e.g. temperature) — compare by difference
• ratio (e.g. counts) — compare by ratio

These are sometimes grouped as

• qualitative: nominal
• quantitative: ordinal, interval, ratio

These can be viewed as semantic classifications.

Computational considerations often classify variables as

• categorical
• integer, discrete
• real, continuous

Another consideration is that some scales may be cyclic:

• hours of the day
• angles, longitude

These distinctions can be important in choosing visual representations.
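
One way to respect a cyclic scale is to map values onto a circle, so that, for example, hour 23 and hour 1 are treated as close together. A small sketch with made-up hours (the variable names here are hypothetical):

```r
## Map hour-of-day onto the unit circle so 23:00 and 01:00 end up close.
hours <- c(23, 0, 1, 12)      # made-up example values
angle <- 2 * pi * hours / 24  # convert hours to angles in radians
pos <- data.frame(hour = hours, x = cos(angle), y = sin(angle))
pos
```

On the circle, hours 23 and 1 are near each other, while hours 0 and 12 are diametrically opposite.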

Other typologies include one proposed by Mosteller and Tukey (1977):

1. Names
2. Grades (e.g. freshman, sophomore, junior, senior)
3. Ranks (orders with 1 being the smallest or largest, 2 the next smallest or largest, and so on)
4. Counted fractions (bounded by 0 and 1)
5. Counts (non-negative integers)
6. Amounts (non-negative real numbers)
7. Balances (any real number)

## Data Types in R

R variables can be of different types.

The most common types are

• numeric for real numbers
• integer
• character for text data or nominal data
• factor for nominal or ordinal data

Factors can be

• unordered, for nominal data
• ordered, for ordinal data

Factors are more efficient and powerful for representing nominal or ordinal data than character data, but can take a bit more getting used to.
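
For example, creating an unordered factor for nominal data and an ordered factor for ordinal data (with made-up values):

```r
## An unordered factor for nominal data:
hair <- factor(c("brown", "black", "brown", "red"))

## An ordered factor for ordinal data, with the level order given explicitly:
opinion <- factor(c("like", "dislike", "neutral"),
                  levels = c("dislike", "neutral", "like"),
                  ordered = TRUE)

is.ordered(hair)        ## FALSE
is.ordered(opinion)     ## TRUE
opinion[1] > opinion[2] ## TRUE -- ordered factors support comparisons
```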

Membership predicates and coercion functions are

| Predicate    | Coercion     |
|--------------|--------------|
| is.numeric   | as.numeric   |
| is.integer   | as.integer   |
| is.character | as.character |
| is.factor    | as.factor    |
| is.ordered   | as.ordered   |
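
A small example of using a predicate together with the matching coercion function (the data here is made up):

```r
x <- c("3.6", "1.8", "3.33")  # numbers stored as character data
is.numeric(x)                 ## FALSE
y <- as.numeric(x)
is.numeric(y)                 ## TRUE
```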

Conversion of factors with numeric-looking labels to numeric data should always go through as.character first.
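
For example, converting such a factor directly with as.numeric() returns the internal level codes rather than the values the labels represent:

```r
f <- factor(c("10", "20", "30", "20"))
as.numeric(f)                ## 1 2 3 2 -- the internal level codes
as.numeric(as.character(f))  ## 10 20 30 20 -- the intended values
```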

## Data Frames: Organizing Cases and Variables

Tabular data in R is usually stored as a data frame.

A data frame is a collection of variables, each with a value for every case or observation.

The faithful data set is a data frame:

class(faithful)
## [1] "data.frame"
names(faithful)
## [1] "eruptions" "waiting"

Most tools we work with in R use data organized in data frames.

Our plot() and lm() expressions from the introductory section can also be written as

plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)")
fit <- lm(waiting ~ eruptions, data = faithful)

plot() only uses the data argument when the plot is specified as a formula, like

waiting ~ eruptions

## Examining the Data in a Data Frame

head() provides an idea of what the raw data looks like:

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

str() is also useful for an overview of an object’s structure:

str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

Another useful function, available from the dplyr or tibble packages, is glimpse():

library(dplyr)
glimpse(faithful)
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95…
## $ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, …

summary() shows basic statistical properties of the variables:

summary(faithful)
##    eruptions        waiting
##  Min.   :1.600   Min.   :43.0
##  1st Qu.:2.163   1st Qu.:58.0
##  Median :4.000   Median :76.0
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0

summary() with a character variable and a factor variable:

ffl <- mutate(faithful,
              type = ifelse(eruptions < 3, "short", "long"),
              ftype = factor(type))
summary(ffl)
##    eruptions        waiting         type             ftype
##  Min.   :1.600   Min.   :43.0   Length:272         long :175
##  1st Qu.:2.163   1st Qu.:58.0   Class :character   short: 97
##  Median :4.000   Median :76.0   Mode  :character
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0

## Variables in a Data Frame

A data frame is a list, or vector, of variables:

length(faithful)
## [1] 2

The dollar sign $ can be used to examine individual variables:

class(faithful$eruptions)
## [1] "numeric"
class(faithful$waiting)
## [1] "numeric"

The variables can also be extracted by numerical or character index using the element extraction operation [[:

class(faithful[[1]])
## [1] "numeric"
class(faithful[["waiting"]])
## [1] "numeric"

## Dimensions

The numbers of rows and columns can be obtained using nrow() and ncol():

ncol(faithful)
## [1] 2
nrow(faithful)
## [1] 272

dim() returns a vector of the dimensions:

dim(faithful)
## [1] 272 2

## Simple Visualizations

plot() has a method for data frames that tries to provide a reasonable default visualization for numeric data frames:

plot(faithful)

## Sample Data Sets

The datasets package in the base R distribution contains a number of data sets you can explore. Another package with a useful collection of data sets is dslabs. Many other data sets are contained in contributed packages as examples. There are also many contributed packages designed specifically for making particular data sets available.

## Tidy Data

The useful concept of a tidy data frame was introduced fairly recently and is described in a chapter in R for Data Science.

The idea is that

• every observation should correspond to a single row;
• every variable should correspond to a single column.

Tidy data is computationally convenient, and many of the tools we will use are designed around tidy data frames. A large range of these tools can be accessed by loading the tidyverse package:

library(tidyverse)

But for now I will load the needed packages individually.

The term tidy is a little unfortunate.

• Data that is not tidy isn’t necessarily bad.
• For human reading, and for some computations, data in a wider format can be better.
• For other computations data in a longer, or narrower, format can be better.
## Tibbles

Many tools in the tidyverse produce slightly enhanced data frames called tibbles:

library(tibble)
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"

Tibbles print differently from standard data frames:

faithful_tbl
## # A tibble: 272 × 2
##    eruptions waiting
##        <dbl>   <dbl>
##  1      3.6       79
##  2      1.8       54
##  3      3.33      74
##  4      2.28      62
##  5      4.53      85
##  6      2.88      55
##  7      4.7       88
##  8      3.6       85
##  9      1.95      51
## 10      4.35      85
## # … with 262 more rows

For the most part data frames and tibbles can be used interchangeably.

## Tidying Data

Many data sets are in tidy form already. If they are not, they can be put into tidy form. The tools for this are part of data technologies. The tasks involved are part of what is sometimes called data wrangling.

## An Example: Global Average Surface Temperatures

Among the many data sets available at https://data.giss.nasa.gov/gistemp/ is data on monthly global average surface temperatures over a number of years. These data from 2017 were used for the widely cited Bloomberg hottest year visualization.

The current data are available in a formatted text file at https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.txt or as a CSV (comma-separated values) file at https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv

The CSV file is a little easier (for a computer program) to read in, so we will work with that. The numbers in the CSV file represent deviations in degrees Celsius from the average temperature for the base period 1951-1980.

The file as available on January 14, 2022, has been saved locally. We can make sure it has been downloaded to our working directory with

if (! file.exists("GLB.Ts+dSST.csv"))
    download.file("http://homepage.divms.uiowa.edu/~luke/data/GLB.Ts+dSST.csv",
                  "GLB.Ts+dSST.csv")

Assuming this locally available file has been downloaded, we can read in the data and drop some columns we don’t need with

library(readr)
gast <- read_csv("GLB.Ts+dSST.csv", skip = 1)[1 : 13]

• The function read_csv is from the readr package, which is part of the tidyverse.
• An alternative is the base R function read.csv.

A look at the first few lines:

head(gast, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1880 -0.17 -0.24 -0.08 -0.15 -0.09 -0.2  -0.17 -0.09 -0.14 -0.23 -0.21 -0.17
## 2  1881 -0.19 -0.13  0.04  0.06  0.07 -0.18  0.01 -0.03 -0.15 -0.21 -0.18 -0.06
## 3  1882  0.17  0.14  0.05 -0.15 -0.13 -0.22 -0.16 -0.07 -0.15 -0.23 -0.16 -0.36
## 4  1883 -0.29 -0.36 -0.12 -0.18 -0.17 -0.06 -0.07 -0.13 -0.22 -0.11 -0.24 -0.11
## 5  1884 -0.13 -0.08 -0.36 -0.4  -0.33 -0.34 -0.32 -0.27 -0.27 -0.25 -0.33 -0.3

And the last few lines:

tail(gast, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2017  1.02  1.14  1.16  0.95  0.91  0.72  0.82  0.87  0.77  0.9   0.88  0.94
## 2  2018  0.81  0.85  0.89  0.89  0.83  0.78  0.83  0.77  0.8   1.02  0.82  0.92
## 3  2019  0.93  0.96  1.18  1.01  0.85  0.91  0.95  0.95  0.93  1.02  1     1.1
## 4  2020  1.17  1.25  1.17  1.13  1.02  0.93  0.91  0.88  0.99  0.89  1.11  0.82
## 5  2021  0.82  0.64  0.89  0.76  0.79  0.85  0.93  0.82  0.92  1     0.93  0.86

The print() method for tibbles abbreviates the output. It is neater and provides some useful additional information on variable data types. But it shows only the first few rows and may not explicitly show some columns.
If some columns are skipped, you can ask to see more by calling print() explicitly:

print(tail(gast), width = 100)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2016  1.17  1.37  1.36  1.11  0.95  0.8   0.85  1.02  0.9   0.89  0.92  0.87
## 2  2017  1.02  1.14  1.16  0.95  0.91  0.72  0.82  0.87  0.77  0.9   0.88  0.94
## 3  2018  0.81  0.85  0.89  0.89  0.83  0.78  0.83  0.77  0.8   1.02  0.82  0.92
## 4  2019  0.93  0.96  1.18  1.01  0.85  0.91  0.95  0.95  0.93  1.02  1     1.1
## 5  2020  1.17  1.25  1.17  1.13  1.02  0.93  0.91  0.88  0.99  0.89  1.11  0.82
## 6  2021  0.82  0.64  0.89  0.76  0.79  0.85  0.93  0.82  0.92  1     0.93  0.86

The format with one column per month is compact and useful for viewing and data entry. But it is not in tidy format since

• the monthly temperature variable is spread over 12 columns;
• the month variable is encoded in the column headings.

For obvious reasons this data format is often referred to as wide format. The tidy, or long, format would have three variables: Year, Month, and Temp.

One way to put this data frame in tidy, or long, format uses pivot_longer from the tidyr package:

library(tidyr)
lgast <- pivot_longer(gast,
                      -Year, ## specifies the columns to use -- all but Year
                      names_to = "Month",
                      values_to = "Temp")

The first few rows of the result:

head(lgast)
## # A tibble: 6 × 3
##    Year Month  Temp
##   <dbl> <chr> <dbl>
## 1  1880 Jan   -0.17
## 2  1880 Feb   -0.24
## 3  1880 Mar   -0.08
## 4  1880 Apr   -0.15
## 5  1880 May   -0.09
## 6  1880 Jun   -0.2

During plotting it is likely that the Month variable will be converted to a factor. By default, this will order levels alphabetically, which is not what we want:

levels(as.factor(lgast$Month))
##  [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"

We can guard against this by converting Month to a factor with the right levels now:

lgast <- mutate(lgast, Month = factor(Month, levels = month.abb))
levels(lgast$Month)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

We can now use this tidy version of the data to create a static version of the Bloomberg hottest year visualization.

The basic framework is set up with

library(ggplot2)
p <- ggplot(lgast) +
    ggtitle("Global Average Surface Temperatures") +
    theme(plot.title = element_text(hjust = 0.5))

ggplot objects only produce output when they are printed. To see the plot in p we need to print it, for example by using a line with only p:

p

Then add a layer with lines for each year (specified by the group argument to geom_line):

p + geom_line(aes(x = Month, y = Temp, group = Year))

We can use color to distinguish the years:

p1 <- p + geom_line(aes(x = Month, y = Temp, color = Year, group = Year),
                    na.rm = TRUE)
p1

Saving the plot specification in the variable p1 will make it easier to experiment with color variations.

One way to highlight the past year, 2021:

past_year <- 2021
lgast_last <- filter(lgast, Year == past_year)
p1 + geom_line(aes(x = Month, y = Temp, group = Year),
               size = 1, color = "red",
               data = lgast_last, na.rm = TRUE)

A useful way to show more recent data in the context of the full data is to show the full data in grey and the more recent years in black:

lgast2k <- filter(lgast, Year >= 2000)
ggplot(lgast, aes(x = Month, y = Temp, group = Year)) +
    geom_line(color = "grey80") +
    theme_minimal() +
    geom_line(data = lgast2k)

If you want to update your plot later in the year then the current year’s entry may contain missing value indicators that you will have to deal with.

The New York Times on January 18, 2018, published another visualization of these data showing average yearly temperatures (via Google may work better). To recreate this plot we first need to compute yearly average temperatures.
This is easy to do with the summarize and group_by functions from dplyr:

library(dplyr)
atemp <- lgast %>%
    group_by(Year) %>%
    summarize(AveTemp = mean(Temp, na.rm = TRUE))
head(atemp)
## # A tibble: 6 × 2
##    Year AveTemp
##   <dbl>   <dbl>
## 1  1880 -0.162
## 2  1881 -0.0792
## 3  1882 -0.106
## 4  1883 -0.172
## 5  1884 -0.282
## 6  1885 -0.328

Using na.rm = TRUE ensures that the mean is based on the available months if data for some months is missing.

A simple version of the plot is then produced by

ggplot(atemp) + geom_point(aes(x = Year, y = AveTemp))

Another variation on the Bloomberg plot shows just a few years, 20 years apart:

lg_by_20 <- filter(lgast, Year %in% seq(2020, by = -20, len = 5)) %>%
    mutate(Year = factor(Year))
ggplot(lg_by_20, aes(x = Month, y = Temp, group = Year, color = Year)) +
    geom_line()

Converting Year to a factor results in a discrete color scale and legend.

## Handling Missing Values

The data for 2019, as available in early 2020, is also available locally. Assuming this file has been downloaded, we can read in the data and drop some columns we don’t need with

gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", skip = 1)[1 : 13]

The last three columns are read as character variables:

head(gast2019, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1  1880 -0.17 -0.23 -0.08 -0.15 -0.08 -0.2  -0.17 -0.09 -0.13 -.22  -.20  -.16
## 2  1881 -0.18 -0.13  0.04  0.06  0.08 -0.17  0.02 -0.02 -0.14 -.20  -.17  -.05
## 3  1882  0.18  0.15  0.06 -0.15 -0.13 -0.21 -0.15 -0.06 -0.13 -.23  -.15  -.34
## 4  1883 -0.28 -0.36 -0.11 -0.17 -0.16 -0.07 -0.05 -0.12 -0.2  -.10  -.22  -.10
## 5  1884 -0.12 -0.07 -0.36 -0.39 -0.33 -0.34 -0.32 -0.27 -0.26 -.24  -.32  -.30

The reason is that data for October through December were not available:

tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81 1.02  .83   .92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 ***   ***   ***

We want to convert these columns to numeric, with missing values represented as NA.

One option is to handle them individually:

gast2019$Oct <- as.numeric(gast2019$Oct)
## Warning: NAs introduced by coercion
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02 .83   .92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93    NA ***   ***

Another option is to convert all character columns to numeric with

gast2019 <- mutate(gast2019, across(where(is.character), as.numeric))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02  0.83  0.92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    NA    NA

The warnings are benign and can be suppressed with the warning = FALSE chunk option.

Since we know that missing values are indicated by ***, we can also avoid the need to fix the data after the fact by specifying this at read time:

gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", na = "***", skip = 1)[1 : 13]
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02  0.83  0.92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    NA    NA

A plot highlighting the year 2019 shows only the months with available data:

lgast2019 <- gast2019 %>%
    pivot_longer(-Year,
                 names_to = "Month",
                 values_to = "Temp") %>%
    mutate(Month = factor(Month, levels = month.abb))
ggplot(lgast2019, aes(x = Month,
                      y = Temp,
                      group = Year)) +
    geom_line(color = "grey80",
              na.rm = TRUE) +
    geom_line(data = filter(lgast2019, Year == 2019),
              na.rm = TRUE)

Adding na.rm = TRUE in the geom_line calls suppresses warnings; the plot would be the same without these.

Outline of the tools used:

• Data processing:
  • Reshaping: pivot_longer;
  • Cleaning: is.character, as.numeric, mutate, across, factor;
  • Summarizing: group_by, summarize.
• Visualization geometries:
  • geom_line for a line plot;
  • geom_point for a scatter plot.

Stevens’ classification of scales of measurement is described in a Wikipedia article.

A good introduction to the concept of tidy data is provided in a chapter in R for Data Science.

## Interactive Tutorial

An interactive learnr tutorial for these notes is available.

You can run the tutorial with

STAT4580::runTutorial("datafrm")

## Exercises

1. Which of the Stevens classifications (nominal, ordinal, interval, ratio) best characterizes these variables:

1. Daily maximal temperatures in Iowa City.
2. Population counts for Iowa counties.
3. Education level of job applicants using the Bureau of Labor Statistics classification.
4. Major of UI students.
2. Which of these data sets are in tidy form?

1. The built-in data set co2
2. The built-in data set BOD
3. The who data set in package tidyr (tidyr::who)
4. The mpg data set in package ggplot2 (ggplot2::mpg)

The next exercises use the data in the variable gapminder in the package gapminder. You can make it available with

data(gapminder, package = "gapminder")
3. Use the function str to examine the value of the gapminder variable. How many cases are there in the data set? How many of the variables are factors?

4. Use the functions class and names to find the class and variable names in the gapminder data.

5. Use summary to compute summary information for the variables.

6. Fill in the values for --- needed to produce plots of life expectancy against year for the countries in continent Oceania.

library(dplyr)
library(ggplot2)
data(gapminder, package = "gapminder")
ggplot(filter(gapminder, continent == "Oceania"),
       aes(x = ---, y = ---, color = country)) +
    geom_line()