---
title: "Data and Data Frames"
output:
  html_document:
    toc: yes
    code_download: true
---

```{r setup, include=FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
options(width = 80)
```

## Data Structures and Data Attribute Types

Data comes in many different forms. Some of the most common data structures are

* rectangular tables
* networks and trees
* geometries, regions

Other forms include

* collections of text
* video and audio recordings
* ...

We will work mostly with tables. Many other forms can be reduced to tables.

[Stevens](https://en.wikipedia.org/wiki/Stanley_Smith_Stevens) (1945) classifies [scales of measurement](https://en.wikipedia.org/wiki/Level_of_measurement) for attributes, or variables, as

* nominal (e.g. hair color)
* ordinal (e.g. dislike, neutral, like)
* interval (e.g. temperature) --- compare by difference
* ratio (e.g. counts) --- compare by ratio

These are sometimes grouped as

* qualitative: nominal
* quantitative: ordinal, interval, ratio

These can be viewed as _semantic classifications_.

_Computational considerations_ often classify variables as

* categorical
* integer, discrete
* real, continuous

Another consideration is that some scales may be cyclic:

* hours of the day
* angles, longitude

These distinctions can be important in choosing visual representations.

Other typologies include one proposed by [Mosteller and Tukey (1977)](https://en.wikipedia.org/wiki/Level_of_measurement#Mosteller_and_Tukey's_typology_(1977)):

1. Names
2. Grades (ordered labels like beginner, intermediate, advanced)
3. Ranks (orders with 1 being the smallest or largest, 2 the next smallest or largest, and so on)
4. Counted fractions (bound by 0 and 1)
5. Counts (non-negative integers)
6. Amounts (non-negative real numbers)
7. Balances (any real number)

## Data Types in R

R variables can be of different types.
The most common types are

* `numeric` for real numbers
* `integer`
* `character` for text data or nominal data
* `factor` for nominal or ordinal data

Factors can be

* unordered, for nominal data
* ordered, for ordinal data
Factors are more efficient and powerful than `character` vectors for representing nominal or ordinal data, but they can take a bit of getting used to.
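A small sketch (with made-up hair-color and rating values) of how factors work: a factor stores each distinct value once as a _level_ and represents the data internally as integer codes, and an ordered factor additionally supports order comparisons:

```r
## A character vector of nominal data
hair <- c("brown", "black", "brown", "blond")

## As a factor: the distinct values become levels, sorted alphabetically
hair_f <- factor(hair)
levels(hair_f)

## An ordered factor for ordinal data, with the level order given explicitly
rating <- factor(c("like", "dislike", "neutral"),
                 levels = c("dislike", "neutral", "like"),
                 ordered = TRUE)
rating[1] > rating[2]
```

Because each label is stored only once per level, factors can also be more memory-efficient than repeating long strings.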
Membership predicates and coercion functions are

| Predicate       | Coercion        |
|:----------------|:----------------|
| `is.numeric`    | `as.numeric`    |
| `is.integer`    | `as.integer`    |
| `is.character`  | `as.character`  |
| `is.factor`     | `as.factor`     |
| `is.ordered`    | `as.ordered`    |
Conversion of factors with numeric-looking labels to numeric data should always go through `as.character` first.
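A small sketch of why the detour through `as.character` matters: applied directly to a factor, `as.numeric` returns the internal integer level codes rather than the values the labels represent.

```r
f <- factor(c("10", "20", "10", "30"))

## Direct coercion yields the level codes 1, 2, 1, 3 -- not the labels
as.numeric(f)

## Going through as.character recovers the intended values 10, 20, 10, 30
as.numeric(as.character(f))
```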
## Data Frames: Organizing Cases and Variables

Tabular data in R is usually stored as a _data frame_. A data frame is a collection of _variables_, each with a value for every _case_ or _observation_.

The `faithful` data set is a data frame:

```{r}
class(faithful)
names(faithful)
```

Most tools we work with in R use data organized in data frames.

Our `plot()` and `lm()` expressions from the [introductory section](intro.html) can also be written as

```{r, eval = FALSE}
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)")
fit <- lm(waiting ~ eruptions, data = faithful)
```
`plot()` only uses the `data` argument when the plot is specified as a _formula_, like

```r
waiting ~ eruptions
```
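A brief contrast using the `faithful` data (a sketch; both calls produce essentially the same scatter plot): the formula method resolves variable names in `data`, while the default method takes x and y vectors directly and has no `data` argument.

```r
## Formula method: variable names are looked up in `data`
plot(waiting ~ eruptions, data = faithful)

## Default method: pass the vectors themselves
plot(faithful$eruptions, faithful$waiting)
```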
## Examining the Data in a Data Frame

`head()` provides an idea of what the raw data looks like:

```{r}
head(faithful)
```

`str()` is also useful for an overview of an object's structure:

```{r}
str(faithful)
```

Another useful function available from the `dplyr` or `tibble` packages is `glimpse()`:

```{r}
library(dplyr)
glimpse(faithful)
```

`summary()` shows basic statistical properties of the variables:

```{r}
summary(faithful)
```

`summary()` with a character variable and a factor variable:

```{r}
ffl <- mutate(faithful,
              type = ifelse(eruptions < 3, "short", "long"),
              ftype = factor(type))
summary(ffl)
```

## Variables in a Data Frame

A data frame is a list, or vector, of variables:

```{r}
length(faithful)
```

The dollar sign `$` can be used to examine individual variables:

```{r}
class(faithful$eruptions)
class(faithful$waiting)
```

The variables can also be extracted by numerical or character index using the element extraction operation `[[`:

```{r}
class(faithful[[1]])
class(faithful[["waiting"]])
```

## Dimensions

The numbers of rows and columns can be obtained using `nrow()` and `ncol()`:

```{r}
ncol(faithful)
nrow(faithful)
```

`dim()` returns a vector of the dimensions:

```{r}
dim(faithful)
```

## Simple Visualizations

`plot` has a _method_ for data frames that tries to provide a reasonable default visualization for numeric data frames:

```{r}
plot(faithful)
```

## Sample Data Sets

The [`datasets`](http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) package in the base R distribution contains a number of data sets you can explore. Another package with a useful collection of data sets is [`dslabs`](https://cran.r-project.org/package=dslabs).

Many other data sets are contained in contributed packages as examples. There are also many contributed packages designed specifically for making particular data sets available.
## Tidy Data

The useful concept of a _tidy_ data frame was introduced fairly [recently](https://www.jstatsoft.org/article/view/v059i10) and is described in a [chapter in _R for Data Science_](https://r4ds.had.co.nz/tidy-data.html).

The idea is that

* every observation should correspond to a single row;
* every variable should correspond to a single column.

Tidy data is computationally convenient, and many of the tools we will use are designed around tidy data frames.

A large range of these tools can be accessed by loading the `tidyverse` package:

```{r, eval = FALSE}
library(tidyverse)
```

But for now I will load the needed packages individually.
The term _tidy_ is a little unfortunate.

* Data that is not _tidy_ isn't necessarily _bad_.
* For human reading, and for some computations, data in a wider format can be better.
* For other computations data in a longer, or narrower, format can be better.
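A toy example (with invented quarterly values) showing the same data in the two layouts; `pivot_longer` from the `tidyr` package, used again later in these notes, does the conversion:

```r
library(tidyr)

## Wide: one row per year, one column per quarter
wide <- data.frame(Year = c(2021, 2022),
                   Q1 = c(10, 12),
                   Q2 = c(11, 14))

## Long (tidy): one row per (Year, Quarter) observation
long <- pivot_longer(wide, -Year,
                     names_to = "Quarter", values_to = "Value")
long
```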
## Tibbles

Many tools in the `tidyverse` produce slightly enhanced data frames called _tibbles_:

```{r}
library(tibble)
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)
```

Tibbles print differently from standard data frames:

```{r, highlight.output=c(1, 3, 14)}
faithful_tbl
```

For the most part data frames and tibbles can be used interchangeably.

## Tidying Data

Many data sets are in tidy form already. If they are not, they can be put into tidy form. The tools for this are part of _data technologies_. The tasks involved are part of what is sometimes called _data wrangling_.

## An Example: Global Average Surface Temperatures

Among many data sets available is data on monthly global average surface temperatures over a number of years. These data from 2017 were used for the widely cited [Bloomberg hottest year visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/).

The current data are available in a formatted text file or as a [_CSV_ (comma-separated values)](https://en.wikipedia.org/wiki/Comma-separated_values) file.

```{r download-GLB, echo = FALSE}
```

The first few lines of the CSV file:

```{r}
#| echo: false
#| comment: " "
readLines("GLB.Ts+dSST.csv", 6) |> writeLines()
```

The CSV file is a little easier (for a computer program) to read in, so we will work with that.

The numbers in the CSV file represent deviations in degrees Celsius from the average temperature for the base period 1951-1980.

The file available on January 19, 2023, is available [locally](https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST.csv). We can make sure it has been downloaded to our working directory with

```{r download-GLB}
if (! file.exists("GLB.Ts+dSST.csv"))
    download.file("https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST.csv",
                  "GLB.Ts+dSST.csv")
```

Assuming this locally available file has been downloaded, we can read in the data and drop some columns we don't need with

```{r}
library(readr)
gast <- read_csv("GLB.Ts+dSST.csv", skip = 1)[1 : 13]
```
* The function `read_csv` is from the `readr` package, which is part of the `tidyverse`.
* An alternative is the base R function `read.csv`.
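A minimal comparison on a couple of lines of made-up inline CSV text (rather than the downloaded file): `read.csv` returns a plain data frame, while `read_csv` returns a tibble, which is also a data frame.

```r
txt <- "Year,Jan,Feb
1880,-0.19,-0.24
1881,-0.30,-0.21"

## Base R: a plain data frame
d1 <- read.csv(text = txt)
class(d1)

## readr: a tibble; I() marks the string as literal data
library(readr)
d2 <- read_csv(I(txt), show_col_types = FALSE)
class(d2)
```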
A look at the first few lines:

```{r}
head(gast, 5)
```

And the last few lines:

```{r}
tail(gast, 5)
```

The `print()` method for tibbles abbreviates the output. It is neater and provides some useful additional information on variable data types. But it shows only the first few rows and may not explicitly show some columns. If some columns are skipped, you can ask to see more by calling `print()` explicitly:

```{r}
print(tail(gast), width = 100)
```

The format with one column per month is compact and useful for viewing and data entry. But it is not in _tidy format_ since

* the monthly temperature variable is spread over 12 columns;
* the month variable is encoded in the column headings.

For obvious reasons this data format is often referred to as _wide format_. The tidy, or _long_, format would have three variables: `Year`, `Month`, and `Temp`.

One way to put this data frame in tidy, or long, format uses `pivot_longer` from the `tidyr` package:

```{r}
library(tidyr)
lgast <- pivot_longer(gast,
                      -Year, ## specifies the columns to use -- all but Year
                      names_to = "Month",
                      values_to = "Temp")
```

The first few rows of the result:

```{r}
head(lgast)
```

During plotting it is likely that the `Month` variable will be converted to a _factor_. By default, this will order levels alphabetically, which is not what we want:

```{r}
levels(as.factor(lgast$Month))
```

We can guard against this by converting `Month` to a factor with the right levels now:

```{r}
lgast <- mutate(lgast, Month = factor(Month, levels = month.abb))
levels(lgast$Month)
```

We can now use this tidy version of the data to create a static version of the [Bloomberg hottest year visualization](https://www.bloomberg.com/graphics/hottest-year-on-record/). The basic framework is set up with

```{r}
library(ggplot2)
p <- ggplot(lgast) +
    ggtitle("Global Average Surface Temperatures") +
    theme(plot.title = element_text(hjust = 0.5))
```
`ggplot` objects only produce output when they are printed. To see the plot in `p` we need to print it, for example by using a line with only `p`.
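A tiny self-contained sketch (made-up points, hypothetical object `q`) of the difference between building and printing a plot:

```r
library(ggplot2)

## Building the plot object draws nothing
q <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x, y)) + geom_point()

## Printing draws it; at top level, evaluating `q` on its own auto-prints
print(q)
```

Inside a loop or a function an explicit `print()` is needed, since auto-printing only happens at the top level.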
```{r gast-base, eval = FALSE}
p
```

```{r gast-base, eval = TRUE, echo = FALSE}
```

Then add a layer with lines for each year (specified by the `group` argument to `geom_line`).

```{r gast-lines, eval = FALSE}
p + geom_line(aes(x = Month, y = Temp, group = Year))
```

```{r gast-lines, eval = TRUE, echo = FALSE, warning = FALSE}
```

We can use color to distinguish the years.

```{r gast-color1, eval = FALSE}
p1 <- p + geom_line(aes(x = Month, y = Temp, color = Year, group = Year),
                    na.rm = TRUE)
p1
```

Saving the plot specification in the variable `p1` will make it easier to experiment with color variations:

```{r gast-color1, eval = TRUE, echo = FALSE}
```

```{r, include = FALSE}
## Compute the past year and make sure it is in the file
library(lubridate)
past_year <- year(today()) - 1
past_year
stopifnot(past_year %in% lgast$Year)
```

One way to highlight the past year, `r past_year`:

```{r gast-color2, eval = FALSE}
lgast_last <- filter(lgast, Year == past_year)
p1 + geom_line(aes(x = Month, y = Temp, group = Year),
               linewidth = 1, color = "red",
               data = lgast_last, na.rm = TRUE)
```

```{r gast-color2, echo = FALSE}
```

A useful way to show more recent data in the context of the full data is to show the full data in grey and the more recent years in black:

```{r gast-full-grey, eval = FALSE}
lgast2k <- filter(lgast, Year >= 2000)
ggplot(lgast, aes(x = Month, y = Temp, group = Year)) +
    geom_line(color = "grey80") +
    theme_minimal() +
    geom_line(data = lgast2k)
```

```{r gast-full-grey, echo = FALSE, warning = FALSE}
```

If you want to update your plot later in the year then the current year's entry may contain missing value indicators that you will have to deal with.
The New York Times on January 18, 2018, published [another visualization](https://www.nytimes.com/interactive/2018/01/18/climate/hottest-year-2017.html) of these data showing average yearly temperatures ([via Google may work better](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj_4PmhnM31AhXFjYkEHfQxATEQFnoECAYQAQ&url=https%3A%2F%2Fwww.nytimes.com%2Finteractive%2F2018%2F01%2F18%2Fclimate%2Fhottest-year-2017.html&usg=AOvVaw3qouxHQAJCqrt_F3fbLEGE)).

To recreate this plot we first need to compute yearly average temperatures. This is easy to do with the `summarize` and `group_by` functions from `dplyr`:

```{r}
library(dplyr)
atemp <- lgast %>%
    group_by(Year) %>%
    summarize(AveTemp = mean(Temp, na.rm = TRUE))
head(atemp)
```

Using `na.rm = TRUE` ensures that the mean is based on the available months if data for some months is missing.

A simple version of the plot is then produced by

```{r gast-nyt, eval = FALSE}
ggplot(atemp) + geom_point(aes(x = Year, y = AveTemp))
```

```{r gast-nyt, echo = FALSE}
```

Another variation on the Bloomberg plot showing just a few years 20 years apart:

```{r gast-skip, eval = FALSE}
lg_by_20 <- filter(lgast, Year %in% seq(2020, by = -20, len = 5)) %>%
    mutate(Year = factor(Year))
ggplot(lg_by_20, aes(x = Month, y = Temp, group = Year, color = Year)) +
    geom_line()
```

Converting `Year` to a factor results in a discrete color scale and legend.

```{r gast-skip, echo = FALSE}
```

## Handling Missing Values

The data for 2019 available in early 2020 is also available [locally](https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST-2019.csv).

```{r, echo = FALSE}
if (! file.exists("GLB.Ts+dSST-2019.csv"))
    download.file("https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST-2019.csv",
                  "GLB.Ts+dSST-2019.csv")
```

Assuming this locally available file has been downloaded, we can read in the data and drop some columns we don't need with

```{r}
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", skip = 1)[1 : 13]
```

The last three columns are read as character variables:

```{r}
head(gast2019, 5)
```

The reason is that data for October through December were not available:

```{r}
tail(gast2019, 2)
```

We want to convert these columns to numeric, with missing values represented as `NA`. One option is to handle them individually:

```{r}
gast2019$Oct <- as.numeric(gast2019$Oct)
tail(gast2019, 2)
```

Another option is to convert all character columns to numeric with

```{r}
gast2019 <- mutate(gast2019, across(where(is.character), as.numeric))
tail(gast2019, 2)
```

The warnings are benign and can be suppressed with the `warning = FALSE` chunk option.

Since we know the missing value pattern `***` we can also avoid the need to fix the data after the fact by specifying this at read time:

```{r}
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", na = "***", skip = 1)[1 : 13]
tail(gast2019, 2)
```

A plot highlighting the year 2019 shows only the months with available data:

```{r gast-2019, eval = FALSE}
lgast2019 <- gast2019 %>%
    pivot_longer(-Year, names_to = "Month", values_to = "Temp") %>%
    mutate(Month = factor(Month, levels = month.abb))
ggplot(lgast2019, aes(x = Month, y = Temp, group = Year)) +
    geom_line(color = "grey80", na.rm = TRUE) +
    geom_line(data = filter(lgast2019, Year == 2019), na.rm = TRUE)
```

Adding `na.rm = TRUE` in the `geom_line` calls suppresses warnings; the plot would be the same without these.

```{r gast-2019, echo = FALSE}
```
Outline of the tools used:

* Data processing:
    * Reading: `read_csv`, `read.csv`;
    * Reshaping: `pivot_longer`;
    * Cleaning: `is.character`, `as.numeric`, `mutate`, `across`, `factor`;
    * Summarizing: `group_by`, `summarize`.
* Visualization geometries:
    * `geom_line` for a line plot;
    * `geom_point` for a scatter plot.
## Reading

Stevens' classification of scales of measurement is described in a [Wikipedia article](https://en.wikipedia.org/wiki/Level_of_measurement).

A good introduction to the concept of _tidy data_ is provided in a [chapter in _R for Data Science_](https://r4ds.had.co.nz/tidy-data.html).

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](`r WLNK("tutorials/datafrm.Rmd")`). You can run the tutorial with

```{r, eval = FALSE}
STAT4580::runTutorial("datafrm")
```

## Exercises

1. Which of the Stevens classifications (nominal, ordinal, interval, ratio) best characterizes these variables:

    a. Daily maximal temperatures in Iowa City.
    b. Population counts for Iowa counties.
    c. Education level of job applicants using the [Bureau of Labor Statistics classification](https://www.bls.gov/careeroutlook/2014/article/education-level-and-jobs.htm).
    d. Major of UI students.

2. Which of these data sets are in tidy form?

    a. The built-in data set `co2`
    b. The built-in data set `BOD`
    c. The `who` data set in package `tidyr` (`tidyr::who`)
    d. The `mpg` data set in package `ggplot2` (`ggplot2::mpg`)

The next exercises use the data in the variable `gapminder` in the package `gapminder`. You can make it available with

```{r, eval = FALSE}
data(gapminder, package = "gapminder")
```

3. Use the function `str` to examine the value of the gapminder variable. How many cases are there in the data set? How many of the variables are factors?

4. Use the functions `class` and `names` to find the class and variable names in the `gapminder` data.

5. Use `summary` to compute summary information for the variables.

6. Fill in the values for `---` needed to produce plots of life expectancy against year for the countries in continent Oceania.
```{r, eval = FALSE}
library(dplyr)
library(ggplot2)
data(gapminder, package = "gapminder")
ggplot(filter(gapminder, continent == "Oceania"),
       aes(x = ---, y = ---, color = country)) +
    geom_line()
```