Data and Data Frames

.title[
# Data and Data Frames
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-01-26
]

---

layout: true
## Data Structures and Data Attribute Types
---

Data comes in many different forms.

Some of the most common data structures are

* rectangular tables
* networks and trees
* geometries, regions

Other forms include

* collections of text
* video and audio recordings
* ...

We will work mostly with tables.

Many other forms can be reduced to tables.

---

[Stevens](https://en.wikipedia.org/wiki/Stanley_Smith_Stevens) (1945)
 classifies [scales of
 measurement](https://en.wikipedia.org/wiki/Level_of_measurement) for
 attributes, or variables, as

* nominal (e.g. hair color)
* ordinal (e.g. dislike, neutral, like)
* interval (e.g. temperature) --- compare by difference
* ratio (e.g. counts) --- compare by ratio

These are sometimes grouped as

* qualitative: nominal
* quantitative: ordinal, interval, ratio

These can be viewed as _semantic classifications_.

---

_Computational considerations_ often classify variables as

* categorical
* integer, discrete
* real, continuous

Another consideration is that some scales may be cyclic:

* hours of the day
* angles, longitude

These distinctions can be important in choosing visual representations.

---

Other typologies include one proposed by [Mosteller and Tukey
(1977)](https://en.wikipedia.org/wiki/Level_of_measurement#Mosteller_and_Tukey's_typology_(1977)):

1. Names
2. Grades (ordered labels like beginner, intermediate, advanced)
3. Ranks (orders with 1 being the smallest or largest, 2 the next
   smallest or largest, and so on)
4. Counted fractions (bound by 0 and 1)
5. Counts (non-negative integers)
6. Amounts (non-negative real numbers)
7. Balances (any real number)

---

R variables can be of different types.

The most common types are

* `numeric` for real numbers
* `integer`
* `character` for text data or nominal data
* `factor` for nominal or ordinal data

Factors can be

* unordered, for nominal data
* ordered, for ordinal data

.alert[
`factors` are more efficient and powerful for representing nominal or
ordinal data than `character` data but can take a bit more getting
used to.
]
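
---

A small sketch of the two kinds of factors. The responses below are made up for illustration and do not come from any data set used in these notes.

```r
## Hypothetical survey responses, invented for this example.
resp <- c("like", "dislike", "neutral", "like")

## Unordered factor: by default the levels are the sorted unique values.
f <- factor(resp)
levels(f)
## [1] "dislike" "like"    "neutral"

## Ordered factor: levels are given explicitly, from lowest to highest.
o <- factor(resp, levels = c("dislike", "neutral", "like"), ordered = TRUE)
o < "like"
## [1] FALSE  TRUE  TRUE FALSE
```

Comparisons such as `<` are only meaningful for ordered factors.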

---

Membership predicates and coercion functions are

| Predicate      | Coercion       |
|:---------------|:---------------|
| `is.numeric`   | `as.numeric`   |
| `is.integer`   | `as.integer`   |
| `is.character` | `as.character` |
| `is.factor`    | `as.factor`    |
| `is.ordered`   | `as.ordered`   |

.alert[
Conversion of factors with numeric-looking labels to numeric data
should always go through `as.character` first.
]
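
---

A minimal sketch of why the detour through `as.character` matters. The factor below is made up for illustration.

```r
## A factor whose labels look like numbers (invented example values).
f <- factor(c("10", "20", "10", "5"))

## Direct conversion returns the internal integer codes of the levels.
as.numeric(f)
## [1] 1 2 1 3

## Converting to character first recovers the numbers in the labels.
as.numeric(as.character(f))
## [1] 10 20 10  5
```

The levels are sorted as text (`"10"`, `"20"`, `"5"`), so the integer codes need not even preserve the numeric order.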

---
layout: true
## Data Frames: Organizing Cases and Variables
---

Tabular data in R is usually stored as a _data frame_.

A data frame is a collection of _variables_, each with a value for
every _case_ or _observation_.

The `faithful` data set is a data frame:

```r
class(faithful)
## [1] "data.frame"
names(faithful)
## [1] "eruptions" "waiting"
```

Most tools we work with in R use data organized in data frames.

---

Our `plot()` and `lm()` expressions from the
[introductory section](intro.html)
can also be written as

```r
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)")
fit <- lm(waiting ~ eruptions, data = faithful)
```
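--

Here the first argument,

```r
waiting ~ eruptions
```

is a _formula_; the variables it names are looked up in the data frame
given by `data`.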
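
---

A brief sketch of what the formula notation provides, using only base R and the built-in `faithful` data. The `$`-based call is shown only for comparison; the object names `fit_formula` and `fit_dollar` are chosen here for illustration.

```r
## A formula is an ordinary R object that records variable names.
f <- waiting ~ eruptions
class(f)
## [1] "formula"
all.vars(f)
## [1] "waiting"   "eruptions"

## The same model fit with and without the `data` argument.
fit_formula <- lm(waiting ~ eruptions, data = faithful)
fit_dollar  <- lm(faithful$waiting ~ faithful$eruptions)
all.equal(unname(coef(fit_formula)), unname(coef(fit_dollar)))
## [1] TRUE
```

The two fits agree; the version with `data =` keeps the variable names short and readable.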

---
layout: true
## Examining the Data in a Data Frame
---

`head()` provides an idea of what the raw data looks like:

```r
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
```

`str()` is also useful for an overview of an object's structure:

```r
str(faithful)
## 'data.frame':	272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
```

---

Another useful function available from the `dplyr` or `tibble`
packages is `glimpse()`:

```r
library(dplyr)
glimpse(faithful)
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95…
## $ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, …
```

`summary()` shows basic statistical properties of the variables:

```r
summary(faithful)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
```

---

`summary()` with a character variable and a factor variable:

```r
ffl <- mutate(faithful,
              type = ifelse(eruptions < 3, "short", "long"),
              ftype = factor(type))
summary(ffl)
##    eruptions        waiting         type             ftype    
##  Min.   :1.600   Min.   :43.0   Length:272         long :175  
##  1st Qu.:2.163   1st Qu.:58.0   Class :character   short: 97  
##  Median :4.000   Median :76.0   Mode  :character              
##  Mean   :3.488   Mean   :70.9                                 
##  3rd Qu.:4.454   3rd Qu.:82.0                                 
##  Max.   :5.100   Max.   :96.0
```

---
layout: false
## Variables in a Data Frame

A data frame is a list, or vector, of variables:

```r
length(faithful)
## [1] 2
```

The dollar sign `$` can be used to examine individual variables:

```r
class(faithful$eruptions)
## [1] "numeric"
class(faithful$waiting)
## [1] "numeric"
```

The variables can also be extracted by numerical or character index using the
  element extraction operation `[[`:

```r
class(faithful[[1]])
## [1] "numeric"
class(faithful[["waiting"]])
## [1] "numeric"
```

---
## Dimensions

The numbers of rows and columns can be obtained using `nrow()` and `ncol()`:

```r
ncol(faithful)
## [1] 2
nrow(faithful)
## [1] 272
```

`dim()` returns a vector of the dimensions:

```r
dim(faithful)
## [1] 272   2
```

---
## Simple Visualizations

`plot` has a _method_ for data frames that tries to provide a
reasonable default visualization for numeric data frames:

```r
plot(faithful)
```

---
## Sample Data Sets

The
[`datasets`](http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html)
package in the base R distribution contains a number of data sets you
can explore.

Another package with a useful collection of data sets is [`dslabs`](https://cran.r-project.org/package=dslabs).

Many other data sets are contained in contributed packages as examples.

There are also many contributed packages designed specifically for
making particular data sets available.

---
layout: true
## Tidy Data
---

The useful concept of a _tidy_ data frame was introduced by Hadley
Wickham in a [2014 paper](https://www.jstatsoft.org/article/view/v059i10)
and is described in a [chapter in _R for Data
Science_](https://r4ds.hadley.nz/data-tidy.html).

The idea is that

* every observation should correspond to a single row;
* every variable should correspond to a single column.

Tidy data is computationally convenient, and many of the tools we will
use are designed around tidy data frames.

A large range of these tools can be accessed by loading the
`tidyverse` package:

```r
library(tidyverse)
```

But for now I will load the needed packages individually.

---

* Data that is not _tidy_ isn't necessarily _bad_.
* For human reading, and  for some computations, data in a wider format can
  be better.
* For other computations data in a longer, or narrower, format can be better.
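
---

To make the wide/long distinction concrete, here is a small sketch using `pivot_longer` and `pivot_wider` from the `tidyr` package. The table and its column names are made up for illustration.

```r
library(tidyr)

## Tiny made-up wide table: one row per id, one column per measurement.
wide <- data.frame(id = 1:2, a = c(10, 20), b = c(30, 40))

## Longer format: one row per (id, measurement) pair.
long <- pivot_longer(wide, -id, names_to = "var", values_to = "value")
dim(long)
## [1] 4 3

## Converting back recovers the original wide table.
back <- pivot_wider(long, names_from = var, values_from = value)
all.equal(as.data.frame(back), wide)
## [1] TRUE
```

Neither form loses information; which one is more convenient depends on the task.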

---
layout: true
## Tibbles
---

Many tools in the `tidyverse` produce slightly enhanced data frames
called _tibbles_:

.pull-left.width-50[

```r
library(tibble)
faithful_tbl <- as_tibble(faithful)
class(faithful_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"
```
]
--
.pull-right.width-40.small-code[
Tibbles print differently from standard data frames:

```r
faithful_tbl
*## # A tibble: 272 × 2
##    eruptions waiting
*##        <dbl>   <dbl>
##  1      3.6       79
##  2      1.8       54
##  3      3.33      74
##  4      2.28      62
##  5      4.53      85
##  6      2.88      55
##  7      4.7       88
##  8      3.6       85
##  9      1.95      51
## 10      4.35      85
*## # ℹ 262 more rows
```
]

For the most part data frames and tibbles can be used interchangeably.
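
---

One difference that is worth knowing about, sketched with the `faithful` data: single-column indexing with `[`. The names `df` and `tb` are chosen here for illustration.

```r
library(tibble)

df <- faithful
tb <- as_tibble(faithful)

## With `[`, a data frame drops a single column to a vector by default,
## while a tibble stays a tibble.
class(df[, "waiting"])
## [1] "numeric"
class(tb[, "waiting"])
## [1] "tbl_df"     "tbl"        "data.frame"

## `[[` returns the same vector for both.
identical(df[["waiting"]], tb[["waiting"]])
## [1] TRUE
```

Using `[[` or `$` to extract a single variable avoids depending on this difference.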

---
layout: false
## Tidying Data

Many data sets are in tidy form already.

If they are not, they can be put into tidy form.

The tools for this are part of _data technologies_.

The tasks involved are part of what is sometimes called _data
wrangling_.

---
layout: true
## An Example: Global Average Surface Temperatures
---

Among many data sets available at
<https://data.giss.nasa.gov/gistemp/> is data on monthly global
average surface temperatures over a number of years.

These data from 2017 were used for the widely cited
[Bloomberg hottest year visualization](https://web.archive.org/web/20190202194432/https://www.bloomberg.com/graphics/hottest-year-on-record/).

The current data are available in a formatted text file at

<https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.txt>

or as a
[_CSV_ (comma-separated values)](https://en.wikipedia.org/wiki/Comma-separated_values) file at

<https://data.giss.nasa.gov/gistemp/tabledata_v4/GLB.Ts+dSST.csv>

The first few lines of the CSV file:

```
    Land-Ocean: Global Means
    Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
    1880,-.19,-.24,-.09,-.16,-.10,-.21,-.18,-.10,-.14,-.23,-.21,-.18,-.17,***,***,-.12,-.16,-.19
    1881,-.20,-.14,.03,.05,.06,-.19,.00,-.04,-.15,-.22,-.19,-.07,-.09,-.10,-.17,.05,-.08,-.19
    1882,.16,.14,.05,-.17,-.14,-.23,-.16,-.07,-.14,-.23,-.16,-.35,-.11,-.09,.08,-.09,-.15,-.18
    1883,-.29,-.37,-.12,-.18,-.17,-.08,-.06,-.14,-.21,-.11,-.23,-.11,-.17,-.19,-.34,-.16,-.09,-.18
```

---

The CSV file is a little easier (for a computer program) to read in,
so we will work with that.

The numbers in the CSV file represent deviations in degrees Celsius
from the average temperature for the base period 1951-1980.

The file available on January 15, 2024, is now available
[locally](https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST.csv).

We can make sure it has been downloaded to our working directory with

```r
if (! file.exists("GLB.Ts+dSST.csv"))
    download.file("https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST.csv",
                  "GLB.Ts+dSST.csv")
```
---

Assuming this locally available file has been downloaded, we can read
in the data and drop some columns we don't need with

```r
library(readr)
gast <- read_csv("GLB.Ts+dSST.csv", skip = 1)[1 : 13]
```
--

.alert[
* The function `read_csv` is from the `readr` package, which is part of the
  `tidyverse`.
* An alternative is the base R function `read.csv`.
]
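
---

A sketch of the base R alternative. To keep the example self-contained, the first lines of the CSV file shown earlier are embedded as text; with the downloaded file one would pass the file name instead. The name `gast_base` is chosen here for illustration.

```r
## First lines of the CSV file, embedded as text for a self-contained example.
txt <- "Land-Ocean: Global Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.19,-.24,-.09,-.16,-.10,-.21,-.18,-.10,-.14,-.23,-.21,-.18,-.17,***,***,-.12,-.16,-.19
1881,-.20,-.14,.03,.05,.06,-.19,.00,-.04,-.15,-.22,-.19,-.07,-.09,-.10,-.17,.05,-.08,-.19"

## Skip the title line and keep only Year and the twelve monthly columns.
## With the real file: read.csv("GLB.Ts+dSST.csv", skip = 1)[1:13]
gast_base <- read.csv(text = txt, skip = 1)[1:13]
dim(gast_base)
## [1]  2 13
```

The result is a standard data frame rather than a tibble.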

---

A look at the first few lines:

```r
head(gast, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1880 -0.19 -0.24 -0.09 -0.16 -0.1  -0.21 -0.18 -0.1  -0.14 -0.23 -0.21 -0.18
## 2  1881 -0.2  -0.14  0.03  0.05  0.06 -0.19  0    -0.04 -0.15 -0.22 -0.19 -0.07
## 3  1882  0.16  0.14  0.05 -0.17 -0.14 -0.23 -0.16 -0.07 -0.14 -0.23 -0.16 -0.35
## 4  1883 -0.29 -0.37 -0.12 -0.18 -0.17 -0.08 -0.06 -0.14 -0.21 -0.11 -0.23 -0.11
## 5  1884 -0.13 -0.07 -0.36 -0.4  -0.34 -0.36 -0.3  -0.27 -0.27 -0.25 -0.33 -0.3
```

And the last few lines:

```r
tail(gast, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2019  0.93  0.95  1.17  1.01  0.85  0.9   0.94  0.95  0.92  1     0.99  1.09
## 2  2020  1.17  1.24  1.17  1.13  1.01  0.92  0.9   0.87  0.98  0.88  1.1   0.8 
## 3  2021  0.81  0.64  0.89  0.75  0.78  0.84  0.92  0.82  0.92  1     0.94  0.86
## 4  2022  0.91  0.89  1.05  0.84  0.84  0.92  0.94  0.95  0.89  0.96  0.72  0.8 
## 5  2023  0.87  0.98  1.2   1     0.94  1.08  1.19  1.19  1.48  1.34  1.43  1.37
```

---

The `print()` method for tibbles abbreviates the output.

It is neater and provides some useful additional information on
variable data types.

But it shows only the first few rows and may not explicitly show some
columns.

If some columns are skipped, you can ask to see more by calling
`print()` explicitly:

```r
print(tail(gast), width = 100)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.88  0.89  0.82  0.77  0.82  0.77  0.8   1.01  0.82  0.91
## 2  2019  0.93  0.95  1.17  1.01  0.85  0.9   0.94  0.95  0.92  1     0.99  1.09
## 3  2020  1.17  1.24  1.17  1.13  1.01  0.92  0.9   0.87  0.98  0.88  1.1   0.8 
## 4  2021  0.81  0.64  0.89  0.75  0.78  0.84  0.92  0.82  0.92  1     0.94  0.86
## 5  2022  0.91  0.89  1.05  0.84  0.84  0.92  0.94  0.95  0.89  0.96  0.72  0.8 
## 6  2023  0.87  0.98  1.2   1     0.94  1.08  1.19  1.19  1.48  1.34  1.43  1.37
```

<!--
* The last values in the `Aug` - `Dec` columns are missing and coded as `***`.
* This causes these columns to be read as character vectors indicated
  by `<chr>`.
* The others have been read as numeric, coded `<dbl>` (for 
  _double precision_).
* We can fix these columns now or deal with them later.

One way to fix them now is to use `mutate` with `across`:

```r
gast <- mutate(gast, across(where(is.character), as.numeric))
head(gast)
## # A tibble: 6 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  1880 -0.19 -0.24 -0.09 -0.16 -0.1  -0.21 -0.18 -0.1  -0.14 -0.23 -0.21 -0.18
## 2  1881 -0.2  -0.14  0.03  0.05  0.06 -0.19  0    -0.04 -0.15 -0.22 -0.19 -0.07
## 3  1882  0.16  0.14  0.05 -0.17 -0.14 -0.23 -0.16 -0.07 -0.14 -0.23 -0.16 -0.35
## 4  1883 -0.29 -0.37 -0.12 -0.18 -0.17 -0.08 -0.06 -0.14 -0.21 -0.11 -0.23 -0.11
## 5  1884 -0.13 -0.07 -0.36 -0.4  -0.34 -0.36 -0.3  -0.27 -0.27 -0.25 -0.33 -0.3 
## 6  1885 -0.58 -0.34 -0.27 -0.42 -0.45 -0.44 -0.34 -0.31 -0.28 -0.23 -0.24 -0.11
```

Each observation consists of a year, a month, and a temperature.
-->

---

The format with one column per month is compact and useful for viewing
and data entry.

But it is not in _tidy format_ since

* the monthly temperature variable is spread over 12 columns;
* the month variable is encoded in the column headings.

For obvious reasons this data format is often referred to as _wide
format_.

The tidy, or _long_, format would have three variables: `Year`,
`Month`, and `Temp`.

---

One way to put this data frame in tidy, or long, format uses `pivot_longer`
from the `tidyr` package:

```r
library(tidyr)
lgast <- pivot_longer(gast,
                      -Year,  ## specifies the columns to use -- all but Year
                      names_to = "Month",
                      values_to = "Temp")
```
--

The first few rows of the result:

```r
head(lgast)
## # A tibble: 6 × 3
##    Year Month  Temp
##   <dbl> <chr> <dbl>
## 1  1880 Jan   -0.19
## 2  1880 Feb   -0.24
## 3  1880 Mar   -0.09
## 4  1880 Apr   -0.16
## 5  1880 May   -0.1 
## 6  1880 Jun   -0.21
```

---

During plotting it is likely that the `Month` variable will be
converted to a _factor_.

By default, this will order levels alphabetically, which is not what we want:

```r
levels(as.factor(lgast$Month))
##  [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
```

We can guard against this by converting `Month` to a factor with the
right levels now:

```r
lgast <- mutate(lgast, Month = factor(Month, levels = month.abb))
levels(lgast$Month)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
```

We can now use this tidy version of the data to create a static
version of the [Bloomberg hottest year
visualization](https://web.archive.org/web/20190202194432/https://www.bloomberg.com/graphics/hottest-year-on-record/).

---

The basic framework is set up with

.pull-left.small-code[

```r
library(ggplot2)
p <- ggplot(lgast) +
    ggtitle("Global Average Surface Temperatures") +
    theme(plot.title = element_text(hjust = 0.5))
```

To see the plot in `p` we need to print it, for example by using a
line with only `p`.
]
.pull-right[

```r
p
```
<img src="datafrm_files/figure-html/gast-base-1.png" style="display: block; margin: auto;" />
]

---

Then add a layer with lines for each year (specified by the `group`
argument to `geom_line`).

.pull-left.small-code[

```r
p + geom_line(aes(x = Month,
                  y = Temp,
*                 group = Year))
```
]
.pull-right[
<img src="datafrm_files/figure-html/gast-lines-1.png" style="display: block; margin: auto;" />
]

---

We can use color to distinguish the years.

.pull-left.small-code[

```r
p1 <- p +
    geom_line(aes(x = Month,
                  y = Temp,
*                 color = Year,
                  group = Year),
              na.rm = TRUE)
p1
```
Saving the plot specification in the variable `p1` will make it easier
to experiment with color variations.
]

.pull-right[
<img src="datafrm_files/figure-html/gast-color1-1.png" style="display: block; margin: auto;" />
]

---

One way to highlight the `past_year` 2023:

.pull-left.small-code[

```r
past_year <- 2023  ## the most recent year in the data
lgast_last <- filter(lgast, Year == past_year)
p1 + geom_line(aes(x = Month,
                   y = Temp,
                   group = Year),
               linewidth = 1,
               color = "red",
               data = lgast_last,
               na.rm = TRUE)
```
]
.pull-right[
<img src="datafrm_files/figure-html/gast-color2-1.png" style="display: block; margin: auto;" />
]

---

A useful way to show more recent data in the context of the full data
is to show the full data in grey and the more recent years in black:

.pull-left.small-code[

```r
lgast2k <- filter(lgast, Year >= 2000)
ggplot(lgast, aes(x = Month,
                  y = Temp,
                  group = Year)) +
    geom_line(color = "grey80") +
    theme_minimal() +
    geom_line(data = lgast2k)
```
]
.pull-right[
<img src="datafrm_files/figure-html/gast-full-grey-1.png" style="display: block; margin: auto;" />
]

If you want to update your plot later in the year then the current
year's entry may contain missing value indicators that you will have
to deal with.

---

The New York Times on January 18, 2018, published
[another visualization](https://www.nytimes.com/interactive/2018/01/18/climate/hottest-year-2017.html)
of these data showing average yearly temperatures ([via Google may work better](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj_4PmhnM31AhXFjYkEHfQxATEQFnoECAYQAQ&url=https%3A%2F%2Fwww.nytimes.com%2Finteractive%2F2018%2F01%2F18%2Fclimate%2Fhottest-year-2017.html&usg=AOvVaw3qouxHQAJCqrt_F3fbLEGE)).

To recreate this plot we first need to compute yearly average temperatures.

This is easy to do with the `summarize` and `group_by` functions from `dplyr`:

.pull-left.small-code[

```r
library(dplyr)
atemp <- lgast |>
    group_by(Year) |>
    summarize(AveTemp = mean(Temp, na.rm = TRUE))
head(atemp)
## # A tibble: 6 × 2
##    Year AveTemp
##   <dbl>   <dbl>
## 1  1880 -0.169 
## 2  1881 -0.0883
## 3  1882 -0.108 
## 4  1883 -0.172 
## 5  1884 -0.282 
## 6  1885 -0.334
```
]

.pull-right[
Using `na.rm = TRUE` ensures that the mean is based on the available
months if data for some months is missing.
]
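
---

A minimal illustration of the effect of `na.rm`, using made-up values:

```r
## Made-up values with one missing entry.
x <- c(0.5, NA, 1.5)

## By default a single NA makes the mean NA.
mean(x)
## [1] NA

## With na.rm = TRUE the mean is computed from the available values.
mean(x, na.rm = TRUE)
## [1] 1
```

A yearly average computed this way for a partial year is based on fewer months than the others, which is worth keeping in mind when comparing years.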

---

A simple version of the plot is then produced by

.pull-left.small-code[

```r
ggplot(atemp) +
    geom_point(aes(x = Year, y = AveTemp))
```
]
.pull-right[
<img src="datafrm_files/figure-html/gast-nyt-1.png" style="display: block; margin: auto;" />
]

---

A variation showing record years:

.pull-left.small-code[

```r
library(ggrepel)
atemp_rec <- filter(atemp, cummax(AveTemp) == AveTemp)
ggplot(atemp, aes(x = Year, y = AveTemp)) +
    geom_point() +
    geom_point(data = atemp_rec, color = "red") +
    geom_text_repel(aes(label = Year),
                    data = atemp_rec,
                    color = "blue")
```
]
.pull-right[
<img src="datafrm_files/figure-html/gast-nyt-rec-1.png" style="display: block; margin: auto;" />
]
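
---

The record years are picked out by comparing each value with the running maximum. A small sketch with made-up values shows how this works:

```r
## Made-up values, not temperature data.
x <- c(1, 3, 2, 3, 5, 4)

## Running maximum up to and including each position.
cummax(x)
## [1] 1 3 3 3 5 5

## A value is a record when it equals the running maximum.
cummax(x) == x
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
```

Note that with `==` a value that ties an earlier record (the second `3` here) is also flagged.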

---

Another variation on the Bloomberg plot showing just a few years 20
years apart:

.pull-left.small-code[

```r
lg_by_20 <-
    filter(lgast,
           Year %in% seq(2020, by = -20, length.out = 5)) |>
    mutate(Year = factor(Year))
ggplot(lg_by_20, aes(x = Month,
                     y = Temp,
                     group = Year,
                     color = Year)) +
    geom_line()
```
Converting `Year` to a factor results in a discrete color scale and legend.
]
.pull-right[
<img src="datafrm_files/figure-html/gast-skip-1.png" style="display: block; margin: auto;" />
]

---
layout: true
## Handling Missing Values
---

The data for 2019 available in early 2020 is also available
[locally](https://stat.uiowa.edu/~luke/data/GLB.Ts+dSST-2019.csv).

Assuming this locally available file has been downloaded, we can read
in the data and drop some columns we don't need with

```r
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", skip = 1)[1 : 13]
```

The last three columns are read as character variables:

```r
head(gast2019, 5)
## # A tibble: 5 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep Oct   Nov   Dec  
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1  1880 -0.17 -0.23 -0.08 -0.15 -0.08 -0.2  -0.17 -0.09 -0.13 -.22  -.20  -.16 
## 2  1881 -0.18 -0.13  0.04  0.06  0.08 -0.17  0.02 -0.02 -0.14 -.20  -.17  -.05 
## 3  1882  0.18  0.15  0.06 -0.15 -0.13 -0.21 -0.15 -0.06 -0.13 -.23  -.15  -.34 
## 4  1883 -0.28 -0.36 -0.11 -0.17 -0.16 -0.07 -0.05 -0.12 -0.2  -.10  -.22  -.10 
## 5  1884 -0.12 -0.07 -0.36 -0.39 -0.33 -0.34 -0.32 -0.27 -0.26 -.24  -.32  -.30
```

---

The reason is that data for October through December were not
available:

```r
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep Oct   Nov   Dec  
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81 1.02  .83   .92  
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 ***   ***   ***
```

We want to convert these columns to numeric, with missing values
represented as `NA`.

One option is to handle them individually:

```r
gast2019$Oct <- as.numeric(gast2019$Oct)
## Warning: NAs introduced by coercion
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct Nov   Dec  
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02 .83   .92  
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    ***   ***
```

---

Another option is to convert all character columns to numeric with

```r
gast2019 <- mutate(gast2019, across(where(is.character), as.numeric))
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(where(is.character), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02  0.83  0.92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    NA    NA
```

The warnings are benign and can be suppressed with the `warning =
FALSE` chunk option.
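
---

Outside of a knitr document, one possible alternative is to wrap the conversion in the base R function `suppressWarnings`. This is a sketch on made-up strings, and it is only appropriate when the cause of the warning is understood, as it is here.

```r
## Made-up character values in the style of the file, with one "***".
x <- c("1.02", ".83", "***")

## The "***" entry cannot be parsed and becomes NA; the warning is silenced.
y <- suppressWarnings(as.numeric(x))
y
## [1] 1.02 0.83   NA
```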

---

Since we know the missing value pattern `***` we can also avoid the
need to fix the data after the fact by specifying this at read time:

```r
gast2019 <- read_csv("GLB.Ts+dSST-2019.csv", na = "***", skip = 1)[1 : 13]
tail(gast2019, 2)
## # A tibble: 2 × 13
##    Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2018  0.82  0.85  0.9   0.89  0.83  0.78  0.83  0.76  0.81  1.02  0.83  0.92
## 2  2019  0.94  0.96  1.18  1.02  0.86  0.93  0.95  0.94  0.93 NA    NA    NA
```
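
---

The base R function `read.csv` has a corresponding argument, `na.strings`. A small sketch on embedded text, using only a few of the columns shown above; the name `d` is chosen here for illustration.

```r
## A small excerpt in the style of the 2019 file, embedded as text.
txt <- "Year,Oct,Nov,Dec
2018,1.02,.83,.92
2019,***,***,***"

## Declaring "***" as the missing value code lets the columns be read as numeric.
d <- read.csv(text = txt, na.strings = "***")
is.numeric(d$Oct)
## [1] TRUE
is.na(d$Oct)
## [1] FALSE  TRUE
```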

---

A plot highlighting the year 2019 shows only the months with available
data:

.pull-left.small-code[

```r
lgast2019 <- gast2019 |>
    pivot_longer(-Year,
                 names_to = "Month",
                 values_to = "Temp") |>
    mutate(Month = factor(Month, levels = month.abb))
ggplot(lgast2019, aes(x = Month,
                      y = Temp,
                      group = Year)) +
    geom_line(color = "grey80",
              na.rm = TRUE) +
    geom_line(data = filter(lgast2019, Year == 2019),
              na.rm = TRUE)
```

Adding `na.rm = TRUE` in the `geom_line` calls suppresses warnings;
the plot would be the same without these.
]

.pull-right[
<img src="datafrm_files/figure-html/gast-2019-1.png" style="display: block; margin: auto;" />
]

---
layout: false

* Data processing:
    * Reading: `read_csv`, `read.csv`;
    * Reshaping: `pivot_longer`;
    * Cleaning: `is.character`, `as.numeric`, `mutate`, `across`, `factor`;
    * Summarizing: `group_by`, `summarize`.
* Visualization geometries:
    * `geom_line` for a line plot;
    * `geom_point` for a scatter plot.

---
layout: false
## Reading

Stevens' classification of scales of measurement is described in a
[Wikipedia
article](https://en.wikipedia.org/wiki/Level_of_measurement).

A good introduction to the concept of _tidy data_ is provided in a
[chapter in _R for Data
Science_](https://r4ds.hadley.nz/data-tidy.html).

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](../tutorials/datafrm.Rmd).

You can run the tutorial with

```r
STAT4580::runTutorial("datafrm")
```

---
layout: true
## Exercises
---

1) Which of the Stevens classifications (nominal, ordinal, interval, ratio)
   best characterizes these variables:

*   a. Daily maximal temperatures in Iowa City.
*   b. Population counts for Iowa counties.
*   c. Education level of job applicants using the [Bureau of Labor
       Statistics
       classification](https://www.bls.gov/careeroutlook/2014/article/education-level-and-jobs.htm).
*   d. Major of UI students.

2) Which of these data sets are in tidy form?

*   a. The builtin data set `co2`
*   b. The builtin data set `BOD`
*   c. The `who` data set in package `tidyr` (`tidyr::who`)
*   d. The `mpg` data set in package `ggplot2` (`ggplot2::mpg`)

---

The next exercises use the data in the variable `gapminder` in the package
`gapminder`. You can make it available with

```r
data(gapminder, package = "gapminder")
```

3) Use the function `str` to examine the value of the `gapminder`
   variable.  How many cases are there in the data set? How many of
   the variables are factors?

4) Use the functions `class` and `names` to find the class and
   variable names in the `gapminder` data.

5) Use `summary` to compute summary information for the variables.

6) Fill in the values for `---` needed to produce plots of life
   expectancy against year for the countries in continent Oceania.

```r
library(dplyr)
library(ggplot2)
data(gapminder, package = "gapminder")
ggplot(filter(gapminder, continent == "Oceania"),
       aes(x = ---, y = ---, color = country)) +
    geom_line()
```
