Data Structures and Data Attribute Types

Data comes in many different forms.

Some of the most common data structures are

Other forms include

We will work mostly with tables. Many other forms can be reduced to tables.

Stevens (1946) classifies scales of measurement for attributes, or variables, as nominal, ordinal, interval, or ratio.

These are sometimes grouped as

These can be viewed as semantic classifications

Computational considerations often classify variables as

Another consideration is that some scales may be cyclic:

These distinctions can be important in choosing visual representations.

Data Types in R

R variables can be of different types. The most common types are

Factors can be

Factors are more efficient and powerful than character vectors for representing nominal or ordinal data, but they can take a bit more getting used to.
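As a small illustration of the two kinds of data mentioned here, the sketch below builds an unordered factor for nominal data and an ordered factor for ordinal data. The vectors `eye_color` and `shirt_size` hold made-up values used only for this example.

```r
# Nominal data: an unordered factor. Levels default to sorted order.
eye_color <- factor(c("brown", "green", "blue", "green"))
levels(eye_color)

# Ordinal data: an ordered factor with an explicit level order.
shirt_size <- factor(c("small", "large", "medium", "small"),
                     levels = c("small", "medium", "large"),
                     ordered = TRUE)

# Order comparisons are only meaningful for ordered factors.
shirt_size < "large"

# Counts are reported in level order, not alphabetical order.
table(shirt_size)
```

Specifying the levels explicitly is what keeps "medium" between "small" and "large"; a character vector would sort alphabetically instead.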

Data Frames: Organizing Cases and Variables

Tabular data in R is usually stored as a data frame.

A data frame is a collection of variables, each with a value for every case or observation.
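To make the cases-by-variables layout concrete, here is a minimal sketch that builds a data frame by hand. The name `patients` and all of its values are invented for illustration.

```r
# Three cases (rows) and three variables (columns); the values are invented.
patients <- data.frame(
    id     = c(1L, 2L, 3L),
    height = c(172.5, 160.0, 181.2),
    smoker = c(FALSE, TRUE, FALSE)
)

# Each variable can have its own type, but all have one value per case.
str(patients)
nrow(patients)
```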

The faithful data set is a data frame:

class(faithful)
## [1] "data.frame"
names(faithful)
## [1] "eruptions" "waiting"

Most tools we work with in R use data organized in data frames.

Our plot and lm expressions from the introductory section can also be written as

plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)")
fit <- lm(waiting ~ eruptions, data = faithful)

plot only uses the data argument when the plot is specified as a formula, like waiting ~ eruptions.
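For comparison, the same fit and plot can be requested without the data argument by referring to the columns directly. This is a sketch using only base R and the faithful data; the names `fit_formula` and `fit_direct` are introduced here for illustration.

```r
# Formula interface: variables are looked up in `data`.
fit_formula <- lm(waiting ~ eruptions, data = faithful)

# Without `data`, the columns have to be named in full.
fit_direct <- lm(faithful$waiting ~ faithful$eruptions)

# Both fits give the same coefficients (only their names differ).
all.equal(unname(coef(fit_formula)), unname(coef(fit_direct)))

# For the non-formula form of plot, with() supplies the data frame.
with(faithful, plot(eruptions, waiting))
```

The formula version is usually easier to read, and it keeps the variable names short in the fitted model's output.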

Examining the Data in a Data Frame

head provides an idea of what the raw data looks like:

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

str is also useful for an overview of an object’s structure:

str(faithful)
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

summary shows basic statistical properties of the variables:

summary(faithful)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0

Variables in a Data Frame

A data frame is a list, or vector, of variables:

length(faithful)
## [1] 2

The dollar sign $ can be used to examine individual variables:

class(faithful$eruptions)
## [1] "numeric"
class(faithful$waiting)
## [1] "numeric"

The variables can also be extracted by numerical index using the element extraction operation [[:

class(faithful[[1]])
## [1] "numeric"
class(faithful[[2]])
## [1] "numeric"
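A few further checks, sketched in base R, show how the list structure behaves: [[ also accepts a variable name, and single-bracket indexing returns a smaller data frame rather than the variable itself.

```r
# A data frame is a list with one element per variable.
is.list(faithful)

# [[ accepts a name as well as a position; the result matches $.
identical(faithful[["waiting"]], faithful$waiting)

# Single [ keeps the data frame wrapper around the selected column.
class(faithful["waiting"])

# [[ extracts the column vector itself.
class(faithful[["waiting"]])
```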

Dimensions

The numbers of rows and columns can be obtained using nrow and ncol; dim returns a vector of the dimensions:

ncol(faithful)
## [1] 2
nrow(faithful)
## [1] 272
dim(faithful)
## [1] 272   2

Simple Visualizations

plot has a method for data frames that tries to provide a reasonable default visualization for numeric data frames:

plot(faithful)

Sample Data Sets

The datasets package in the base R distribution contains a number of data sets you can explore.

Another package with a useful collection of data sets is dslabs.

Many other data sets are contained in contributed packages as examples.

There are also many contributed packages designed specifically for making particular data sets available.

Tidy Data

The useful concept of a tidy data frame was introduced fairly recently and is described in a chapter in R for Data Science.

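The idea is that each variable has its own column, each observation has its own row, and each value has its own cell.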

Tidy data is computationally convenient, and many of the tools we will use are designed around tidy data frames.

A large range of these tools can be accessed by loading the tidyverse package:

library(tidyverse)

But for now I will load the needed packages individually.

Many tools in the tidyverse produce slightly enhanced data frames called tibbles:

library(tibble)
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"
faithful_tbl
## # A tibble: 272 x 2
##    eruptions waiting
##  *     <dbl>   <dbl>
##  1      3.6       79
##  2      1.8       54
##  3      3.33      74
##  4      2.28      62
##  5      4.53      85
##  6      2.88      55
##  7      4.7       88
##  8      3.6       85
##  9      1.95      51
## 10      4.35      85
## # ... with 262 more rows

For the most part data frames and tibbles can be used interchangeably.
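One place where the two differ is column selection with single brackets. The sketch below, which assumes the tibble package is installed, shows the difference on the faithful data.

```r
library(tibble)

faithful_tbl <- as_tibble(faithful)

# A data frame drops to a plain vector when one column is selected with [, j].
class(faithful[, 1])

# A tibble stays a tibble under the same indexing.
class(faithful_tbl[, 1])

# [[ behaves the same way for both and returns the column vector.
identical(faithful[[1]], faithful_tbl[[1]])
```

Code that relies on `df[, 1]` returning a vector is the most common place where swapping in a tibble changes behavior; using [[ or $ avoids the issue.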

Many data sets are in tidy form already.

If they are not, they can be put into tidy form.

The tools for this are part of data technologies.

The tasks involved are part of what is sometimes called data wrangling.

An Example: Global Average Surface Temperatures

Among many data sets available at https://data.giss.nasa.gov/gistemp/ is data on monthly global average surface temperatures over a number of years.

These are the data used for the Bloomberg hottest year visualization.

The data are available in a formatted text file at

https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt

or as a CSV (comma-separated values) file at

https://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.csv

The CSV file is a little easier to read in so we will work with that.

A copy of the file as it was available on January 16, 2019, has been saved on the course web site. We can make sure it has been downloaded to our working directory with

if (! file.exists("GLB.Ts+dSST.csv"))
    download.file("http://homepage.divms.uiowa.edu/~luke/data/GLB.Ts+dSST.csv",
                  "GLB.Ts+dSST.csv")

Assuming this locally available file has been downloaded, we can read in the data and drop some columns we don’t need with

library(readr)
gast <- read_csv("GLB.Ts+dSST.csv", skip = 1)[1 : 13]
head(gast)
## # A tibble: 6 x 13
##    Year    Jan    Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct    Nov Dec  
##   <int>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>
## 1  1880 -0.28  -0.17  -0.1  -0.19 -0.11 -0.22 -0.2  -0.08 -0.15 -0.22 -0.19  -.22 
## 2  1881 -0.14  -0.16   0.05  0.05  0.03 -0.19 -0.06 -0.02 -0.13 -0.2  -0.21  -.10 
## 3  1882  0.15   0.16   0.04 -0.18 -0.15 -0.25 -0.2  -0.05 -0.09 -0.24 -0.15  -.24 
## 4  1883 -0.31  -0.38  -0.12 -0.17 -0.19 -0.12 -0.08 -0.15 -0.2  -0.13 -0.22  -.15 
## 5  1884 -0.15  -0.08  -0.36 -0.42 -0.36 -0.4  -0.34 -0.26 -0.27 -0.23 -0.290 -.28 
## 6  1885 -0.580 -0.290 -0.24 -0.42 -0.41 -0.43 -0.35 -0.31 -0.23 -0.18 -0.19  -.04
tail(gast)
## # A tibble: 6 x 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov Dec  
##   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1  2013  0.66  0.55  0.65  0.53 0.570  0.65 0.570  0.66  0.77  0.67  0.78 .65  
## 2  2014  0.73  0.51  0.76  0.76 0.85   0.66 0.55   0.81  0.88  0.81  0.66 .78  
## 3  2015  0.81  0.86  0.9   0.74 0.75   0.79 0.71   0.78  0.81  1.07  1.02 1.10 
## 4  2016  1.15  1.34  1.3   1.07 0.91   0.77 0.82   1     0.87  0.89  0.9  .83  
## 5  2017  0.97  1.12  1.12  0.92 0.88   0.69 0.82   0.86  0.75  0.87  0.85 .88  
## 6  2018  0.76  0.84  0.91  0.87 0.81   0.74 0.78   0.72  0.75  0.98  0.77 ***

The function read_csv is from the readr package, which is part of the tidyverse. An alternative is the base function read.csv.

The print method for tibbles abbreviates the output. Here that might skip showing the Dec column. One way to see all columns is to explicitly call print:

print(tail(gast), width = 100)
## # A tibble: 6 x 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov Dec  
##   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1  2013  0.66  0.55  0.65  0.53 0.570  0.65 0.570  0.66  0.77  0.67  0.78 .65  
## 2  2014  0.73  0.51  0.76  0.76 0.85   0.66 0.55   0.81  0.88  0.81  0.66 .78  
## 3  2015  0.81  0.86  0.9   0.74 0.75   0.79 0.71   0.78  0.81  1.07  1.02 1.10 
## 4  2016  1.15  1.34  1.3   1.07 0.91   0.77 0.82   1     0.87  0.89  0.9  .83  
## 5  2017  0.97  1.12  1.12  0.92 0.88   0.69 0.82   0.86  0.75  0.87  0.85 .88  
## 6  2018  0.76  0.84  0.91  0.87 0.81   0.74 0.78   0.72  0.75  0.98  0.77 ***

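Each observation consists of a year, a month, and a temperature. The format with one column per month is compact and useful for viewing and data entry, but it is not in tidy format since each row holds twelve observations and the month is recorded in the column names rather than in a variable of its own.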

For obvious reasons this data format is often referred to as wide format. The tidy, or long, format would have three variables: Year, Month, and Temp.

One way to put this data frame in tidy, or long, format uses gather from the tidyr package:

library(tidyr)
lgast <- gather(gast, Month, Temp, -Year, factor_key = TRUE)
head(lgast)
## # A tibble: 6 x 3
##    Year Month Temp 
##   <int> <fct> <chr>
## 1  1880 Jan   -0.28
## 2  1881 Jan   -0.14
## 3  1882 Jan   0.15 
## 4  1883 Jan   -0.31
## 5  1884 Jan   -0.15
## 6  1885 Jan   -0.58

The factor_key = TRUE argument ensures that the key variable Month is created as a factor rather than a character vector:

head(lgast$Month)
## [1] Jan Jan Jan Jan Jan Jan
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
str(lgast$Month)
##  Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...

This preserves the order of the months and prevents them from being reordered alphabetically.
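More recent versions of tidyr also provide pivot_longer for the same reshaping. The following is a sketch under the assumption that tidyr 1.0.0 or later is installed; the small `wide` table is invented so the example runs without the downloaded file.

```r
library(tidyr)

# A tiny made-up table in the same wide layout as gast.
wide <- data.frame(Year = c(2001L, 2002L),
                   Jan  = c(0.10, 0.20),
                   Feb  = c(0.30, 0.40))

# One row per (Year, Month) pair; the month names move into a column.
long <- pivot_longer(wide, cols = -Year,
                     names_to = "Month", values_to = "Temp")

# Convert Month to a factor in calendar order, using the built-in month.abb.
long$Month <- factor(long$Month, levels = month.abb)
long
```

Here the conversion to a factor is done as a separate step with the base constant month.abb, which plays the role of factor_key = TRUE in the gather call.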

The Temp variable may have been read as class character because of a missing value code. If so, we need to convert it to numeric:

library(dplyr)
if (is.character(lgast$Temp)) {
    lgast <- mutate(lgast, Temp = as.numeric(Temp))
    filter(lgast, is.na(Temp))
}
## Warning in evalq(as.numeric(Temp), <environment>): NAs introduced by coercion
## # A tibble: 1 x 3
##    Year Month  Temp
##   <int> <fct> <dbl>
## 1  2018 Dec      NA
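The coercion step can be seen in isolation on a few strings like the ones in the Dec column. The sketch below uses only base R; the inline CSV text is a made-up fragment, and it also shows the alternative of declaring the missing value code when reading, here with the na.strings argument of read.csv (read_csv has a similar na argument).

```r
# Strings as they might appear in a column containing a missing value code.
raw <- c(".65", "1.10", "***")

# as.numeric turns anything it cannot parse into NA (with a warning).
temp <- suppressWarnings(as.numeric(raw))
temp
is.na(temp)

# Alternative: tell the reader which string marks a missing value,
# so the column is read as numeric from the start.
csv_text <- "Year,Nov,Dec\n2017,0.85,.88\n2018,0.77,***"
d <- read.csv(text = csv_text, na.strings = "***")
str(d)
```

Declaring the code up front avoids the coercion warning and keeps an unexpected non-numeric entry from being silently turned into NA.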

We can now use this tidy version to create a static version of the Bloomberg hottest year visualization. The basic framework is set up with

library(ggplot2)
p <- ggplot(lgast) +
     ggtitle("Global Average Surface Temperatures") +
     theme(plot.title = element_text(hjust = 0.5))

Then add a layer with lines for each year (specified by the group argument to geom_line).

p + geom_line(aes(x = Month, y = Temp, group = Year))
## Warning: Removed 1 rows containing missing values (geom_path).

We can use color to distinguish the years.

Saving the plot specification in the variable p1 will make it easier to experiment with color variations:

p1 <- p + geom_line(aes(x = Month, y = Temp, group = Year, color = Year))
p1
## Warning: Removed 1 rows containing missing values (geom_path).

There are several ways to change the default color scheme.

For a color gradient from black to red we can use

p1 + scale_color_gradient(low="black", high = "red")
## Warning: Removed 1 rows containing missing values (geom_path).

Another possibility is to use a continuous version of the "Reds" palette from http://colorbrewer2.org:

p2 <- p1 + scale_color_distiller(palette = "Reds", direction = 1)
p2
## Warning: Removed 1 rows containing missing values (geom_path).

Compute the past year and make sure it is in the file (this would usually be hidden in a report by using the chunk option include = FALSE):

library(lubridate)
past_year <- year(today()) - 1
past_year
## [1] 2018
stopifnot(past_year %in% lgast$Year)

One way to highlight the past year, 2018:

lgast_last <- filter(lgast, Year == past_year)
p2 + geom_line(aes(x = Month, y = Temp, group = Year),
               size = 1, color = "blue", data = lgast_last)
## Warning: Removed 1 rows containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_path).

The version of the data on the web may have been updated to include the missing December 2018 value. If you want to update your plot later in the year you will see similar missing value markers for the remaining months of 2019.

The New York Times on January 18, 2018, published another visualization of these data showing average yearly temperatures.

To recreate this plot we first need to compute yearly average temperatures. This is easy to do with the summarize and group_by functions from dplyr:

library(dplyr)
lgast_by_year <- group_by(lgast, Year)
atemp <- summarize(lgast_by_year,
                   AveTemp = mean(Temp, na.rm = TRUE))
head(atemp)
## # A tibble: 6 x 2
##    Year AveTemp
##   <int>   <dbl>
## 1  1880  -0.178
## 2  1881  -0.09 
## 3  1882  -0.10 
## 4  1883  -0.185
## 5  1884  -0.287
## 6  1885  -0.306

Using na.rm = TRUE ensures that the mean for 2018 is computed from the 11 months that are available, rather than being reported as missing.
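The effect of na.rm can be checked on a small vector, and the grouped mean can also be sketched in base R with tapply. The data frame `d` below is made up for illustration and is not part of the GISS data.

```r
temps <- c(0.76, 0.84, NA)

mean(temps)                 # NA: one missing value makes the mean missing
mean(temps, na.rm = TRUE)   # 0.8: the mean of the two available values

# Group-wise means in base R; extra arguments are passed on to mean.
d <- data.frame(Year = c(2017, 2017, 2018, 2018),
                Temp = c(0.85, 0.88, 0.77, NA))
yearly <- tapply(d$Temp, d$Year, mean, na.rm = TRUE)
yearly
```

Note that a mean based on fewer months is not directly comparable to a full-year mean, so it is worth flagging such years when presenting the results.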

A simple version of the plot is then produced by

ggplot(atemp) + geom_point(aes(x = Year, y = AveTemp))