A Brief Overview of R

.title[
# A Brief Overview of R
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2024-01-19
]

---

## Background

R is a language, or an environment, for data analysis and data visualization.

R is derived from the [_S_
language](https://en.wikipedia.org/wiki/S_(programming_language))
developed at AT&T Bell Laboratories.

R was originally developed for teaching at the University of Auckland,
New Zealand, by Ross Ihaka and Robert Gentleman.

R is now maintained by an international group of about 20
statisticians and computer scientists.

A great strength of R is the large number of extension packages that
have been developed.

The number available on [CRAN](https://cran.r-project.org) is now over
19,000.

---
layout: true
## Basic Usage
---
--

Interactive R uses a _command line interface_ (CLI).

The interface runs a _read-evaluate-print loop_ (REPL).

A simple interaction with the R interpreter:

```r
> 1 + 2
[1] 3
```

Values can be assigned to variables using the left-arrow assignment operator `<-`:

```r
> x <- c(1, 3, 5)
> x
[1] 1 3 5
```
--

---

Basic arithmetic operations work element-wise on vectors:

```r
> x + x
[1]  2  6 10
```

Scalars are _recycled_ to the length of the longer operand:

```r
> x + 1
[1] 2 4 6
```
--

```r
> 2 * x
[1]  2  6 10
```
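
Recycling is not limited to scalars: a shorter vector is reused as often as needed to match the longer one. A small sketch, where `y` is a made-up vector introduced only for this illustration:

```r
x <- c(1, 3, 5)
y <- c(10, 20, 30, 40, 50, 60)   # hypothetical values, twice as long as x
y + x                            # x is reused twice: 1 3 5 1 3 5
## [1] 11 23 35 41 53 65
```

R issues a warning when the longer length is not a multiple of the shorter one.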

---

Some ways to create new vectors:

```r
> c(1, 2, 3)
[1] 1 2 3
```

```r
> c("a", "b", "c")
[1] "a" "b" "c"
```

```r
> 1 : 3
[1] 1 2 3
```

These examples show a _prompt_ as you would see in the interpreter.

Usually R Markdown documents show code and results like this:

```r
2 * x
## [1]  2  6 10
```
--

This makes it easier to copy code for pasting it into another document
or the R console.
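
A few more ways to create vectors, shown in the same output style. This is a small sketch using the base functions `seq()`, `rep()`, and `seq_len()`; the particular arguments and the names `s`, `r`, and `n` are arbitrary choices for illustration:

```r
s <- seq(0, 1, by = 0.25)           # evenly spaced values
s
## [1] 0.00 0.25 0.50 0.75 1.00
r <- rep(c("a", "b"), times = 2)    # repeat a vector
r
## [1] "a" "b" "a" "b"
n <- seq_len(5)                     # the integers 1, 2, ..., 5
n
## [1] 1 2 3 4 5
```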

---
layout: false
## Data Frames

Data sets in R are often organized in _named lists_ of variables
called _data frames_.

The value of the variable `faithful` is a data frame with two
variables recorded for eruptions of the _Old Faithful_ geyser in Yellowstone
National Park:

* `eruptions`:  Eruption duration (minutes)
* `waiting`:  Waiting time to next eruption (minutes)

`head()` shows the first 6 rows:

```r
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
```
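
A few base functions help to inspect a data frame before working with it. The sketch below uses only the built-in `faithful` data; the names `d` and `n` are introduced here for illustration:

```r
d <- dim(faithful)       # number of rows and columns
d
## [1] 272   2
n <- names(faithful)     # variable names
n
## [1] "eruptions" "waiting"
is.list(faithful)        # a data frame is a list of equal-length columns
## [1] TRUE
```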

---

## A Simple Scatter Plot

.pull-left.small-code.width-50[

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```
]

--
.pull-right.width-40[

<img src="data:image/png;base64,#Rintro_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

]

--
.pull-left[
Several graphics systems are available for R.
]
.pull-right[
]

---

## Fitting a Linear Regression
.pull-left.width-45.small-code[

```r
fit <- with(faithful, lm(waiting ~ eruptions))
fit
## 
## Call:
## lm(formula = waiting ~ eruptions)
## 
## Coefficients:
## (Intercept)    eruptions  
##       33.47        10.73
```
{{content}}
]
--

You can also use the `data` argument to `lm()`:

```r
fit <- lm(waiting ~ eruptions, data = faithful)
```
{{content}}
--

`coef()` extracts the coefficients:

```r
coef(fit)
## (Intercept)   eruptions 
##    33.47440    10.72964
```
--

.pull-right.width-55.small-code[
`summary(fit)` provides more details:

```r
summary(fit)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115,	Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
```
]
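
The fitted model can also be used for prediction. A minimal sketch, assuming we want the expected waiting time after a hypothetical 3-minute eruption; `new_obs` and `pred` are names made up for this example:

```r
fit <- lm(waiting ~ eruptions, data = faithful)
new_obs <- data.frame(eruptions = 3)      # hypothetical new eruption
pred <- predict(fit, newdata = new_obs)   # intercept + slope * 3
pred                                      # roughly 65.7 minutes
```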

---
layout: true
## Adding the Regression Line to the Plot
---

Original plot:

.pull-left.small-code.width-50[

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```

]
.pull-right.width-40[

<img src="data:image/png;base64,#Rintro_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

]

---

With a regression line:

.pull-left.small-code.width-50[

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
*abline(coef(fit), col = "red", lwd = 3)
```

]

.pull-right.width-40[

<img src="data:image/png;base64,#Rintro_files/figure-html/geyser-with-line-1.png" style="display: block; margin: auto;" />
]

---
layout: false
## Packages and Package Libraries

Extension code and data sets are often made available in _packages_.

Packages are stored in folders or directories as collections called _libraries_.

`.libPaths()` will show you the libraries your R process will search.

`search()` shows what packages are attached to the global search path.

The `library()` function is used to find packages in the libraries and
attach them to the search path.

The expression `pkg::var` gets the value of variable `var` from
package `pkg` without attaching `pkg`.

You can install packages using the `install.packages` function or
the **Install Packages** item in the RStudio **Tools** menu.

By default packages are installed from [CRAN](https://cran.r-project.org).

It is also possible to use functions in the `remotes` package to
install packages hosted on [GitHub](https://github.com/) or
[GitLab](https://about.gitlab.com/).
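
The functions above can be tried out as in the following sketch. It assumes a default R session, in which the `stats` package is attached; the names `libs` and `attached` are introduced only for this example:

```r
libs <- .libPaths()          # library directories searched by this R process
attached <- search()         # what is currently on the search path
"package:stats" %in% attached
## [1] TRUE

# Use a function without attaching its package
stats::sd(c(1, 3, 5))
## [1] 2

# Attach a package only if it is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
    library(ggplot2)
} else {
    message("ggplot2 is not installed; try install.packages(\"ggplot2\")")
}
```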

---
layout: true
## A Useful Package: `ggplot2`

---

The `ggplot2` package provides a powerful alternative to the base
graphics system.

The geyser example can be done in `ggplot2` like this:

.pull-left.small-code.width-55[

```r
library(ggplot2)
ggplot(data = faithful) +
    geom_point(mapping = aes(x = eruptions, y = waiting)) +
    geom_smooth(mapping = aes(x = eruptions, y = waiting),
                method = "lm", se = FALSE)
```
]
--
.pull-right.width-40[
<img src="data:image/png;base64,#Rintro_files/figure-html/geyser-ggplot-1.png" style="display: block; margin: auto;" />
]

---

`ggplot2` is part of a useful collection of packages called the
[_tidyverse_](https://www.tidyverse.org/).

`ggplot` is based on the _Grammar of Graphics_.

* Plots are composed of _geometric objects_ (`geoms`).

* Variables are _mapped_ to _aesthetic features_ of geometric objects.

A basic template for creating a plot with `ggplot`:

```r
ggplot(data = <DATA>) + <GEOM>(mapping = aes(<MAPPINGS>))
```
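
As one possible instance of the template, the sketch below fills in `<DATA>`, `<GEOM>`, and `<MAPPINGS>` to draw a histogram of eruption durations. The choice of `geom_histogram()` and of 30 bins is arbitrary, and `p` is a name used only here:

```r
library(ggplot2)
p <- ggplot(data = faithful) +
    geom_histogram(mapping = aes(x = eruptions), bins = 30)
p   # printing the object draws the plot
```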

---
layout: true
## Subsetting and Extracting Components

---

The _subset operator_ `[` can be used to extract elements by index:

```r
month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.abb[1 : 3]
## [1] "Jan" "Feb" "Mar"
```

Subsetting can also be based on a logical expression that returns
`TRUE` or `FALSE` for each element:

```r
(starts_with_J <- substr(month.abb, 1, 1) == "J")
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
month.abb[starts_with_J]
## [1] "Jan" "Jun" "Jul"
```

* Ordinarily the value of an assignment is not printed.
* Placing the assignment expression in parentheses causes it to be
  printed.
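
Two further index forms are often useful: negative indices drop elements, and a logical condition can be written directly inside the brackets. A short sketch using the built-in `month.abb` and `month.name` vectors:

```r
month.abb[-(1 : 9)]                  # drop the first nine elements
## [1] "Oct" "Nov" "Dec"
month.abb[c(12, 1)]                  # indices can be given in any order
## [1] "Dec" "Jan"
month.abb[nchar(month.name) <= 4]    # months whose full name is short
## [1] "May" "Jun" "Jul"
```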

---

Individual elements can be extracted using the _element operator_ `[[`:

```r
month.abb[[3]]
## [1] "Mar"
```

Components of named lists, like _data frames_, can be extracted with
the `$` operator:

```r
names(faithful)
## [1] "eruptions" "waiting"
head(faithful, 4)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
head(faithful$eruptions, 4)
## [1] 3.600 1.800 3.333 2.283
```

```r
head(faithful[["eruptions"]], 4)
## [1] 3.600 1.800 3.333 2.283
```
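
The difference between `[` and `[[` matters for lists and data frames: `[` returns a smaller object of the same kind, while `[[` returns the component itself. A brief sketch; `sub` and `col` are names chosen for this example:

```r
sub <- faithful["eruptions"]     # still a data frame, with one column
col <- faithful[["eruptions"]]   # the numeric vector itself
class(sub)
## [1] "data.frame"
class(col)
## [1] "numeric"
```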

---
layout: false
## Functions

### Simple Functions

All computations in R are carried out by functions.

Defining a function allows you to avoid cutting and pasting code.

A simple function:

```r
ms <- function(x) list(mean = mean(x), sd = sd(x))
ms(faithful$eruptions)
## $mean
## [1] 3.487783
## 
## $sd
## [1] 1.141371
```
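
Function arguments can have default values. The sketch below defines `ms2`, a hypothetical variant of `ms` (not part of the original notes) that optionally drops missing values:

```r
# ms2 passes an optional na.rm flag on to mean() and sd()
ms2 <- function(x, na.rm = FALSE)
    list(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
res <- ms2(c(1, 3, NA, 5), na.rm = TRUE)
res$mean
## [1] 3
res$sd
## [1] 2
```

With the default `na.rm = FALSE`, a missing value in `x` makes both results `NA`.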

---

R supports several mechanisms for object-oriented programming based on
_generic functions_.

The most commonly used mechanism, called S3, allows a function to
_dispatch_ to a _method_ based on the _class_ of its first argument.

`plot` is a very simple generic function.

```r
plot
## function (x, y, ...) 
## UseMethod("plot")
## <bytecode: 0x55a82643f7b8>
## <environment: namespace:base>
```
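
You can define your own S3 generics and methods in the same way. The following is a minimal sketch; the generic `describe` and its methods are hypothetical names made up for illustration:

```r
# A generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...)
    paste("an object of class", class(x)[1])

# Method selected for objects of class "lm"
describe.lm <- function(x, ...)
    paste("a linear model with", length(coef(x)), "coefficients")

fit <- lm(waiting ~ eruptions, data = faithful)
describe(fit)
## [1] "a linear model with 2 coefficients"
describe(1 : 3)
## [1] "an object of class integer"
```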

---

For example, the `plot` method for linear model fit objects produces a
set of 4 plots commonly used to assess regression fits.

.pull-left.width-30[

```r
plot(fit)
```
]
.pull-right.width-60[
<img src="data:image/png;base64,#Rintro_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---
layout: true
### Lazy and Non-Standard Evaluation
---

An unusual but useful feature of R is that function arguments are not
evaluated until their value is needed, so they may not be evaluated at
all.

This is called _lazy evaluation_.

```r
log("A")
## Error in log("A"): non-numeric argument to mathematical function
f <- function(x) NULL
f(log("A"))
## NULL
```

Functions can also capture the expression of the arguments they were
called with:

```r
f <- function(x) deparse(substitute(x))
f(a + b)
## [1] "a + b"
```

Together these features allow functions to evaluate their arguments in
_non-standard_ ways.
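
One common use of captured expressions is to build labels, much as `plot()` derives default axis labels from its arguments. A small sketch; `labelled` is a hypothetical helper, not a base R function:

```r
# Combine the expression text with its evaluated value
labelled <- function(x) paste(deparse(substitute(x)), "=", x)
labelled(1 + 2)
## [1] "1 + 2 = 3"
```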

---

This is most commonly used to allow values for variables in arguments
to be found in a provided data frame.

The `with()` function is a simple example:

```r
mean(eruptions)
## Error in eval(expr, envir, enclos): object 'eruptions' not found
```

```r
with(faithful, mean(eruptions))
## [1] 3.487783
```

Non-standard evaluation of this type is used extensively in the _tidyverse_.
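
To see how the pieces fit together, here is a simplified sketch of a `with()`-like function. `my_with` is a hypothetical name; it captures the expression with `substitute()` and evaluates it with `eval()`, looking up variables in the data frame first and then in the caller's environment:

```r
my_with <- function(data, expr)
    eval(substitute(expr), data, parent.frame())

my_with(faithful, mean(eruptions))
## [1] 3.487783
```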

---
layout: true
## The Tidyverse
---

[_Tidyverse_](https://www.tidyverse.org/) functions are designed to
perform operations on data frames.

The `dplyr` package provides a _grammar for data manipulation_.

.pull-left.small-code[
A simple example: computing means and standard deviations for the
waiting times after the short (less than 3 minutes) and the long (3
minutes or more) eruptions:

```r
library(dplyr)
tmp <- mutate(faithful,
              type = ifelse(eruptions < 3,
                            "short",
                            "long"))
head(tmp)
##   eruptions waiting  type
## 1     3.600      79  long
## 2     1.800      54 short
## 3     3.333      74  long
## 4     2.283      62 short
## 5     4.533      85  long
## 6     2.883      55 short
```
]
--
.pull-right[

```r
summarize(group_by(tmp, type),
          mean = mean(waiting),
          sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84
```
.alert[
Tidyverse functions like to work with an enhanced form of data frame
called a _tibble_.
]
]

---

A computation like this can be viewed as a _transformation pipeline_
consisting of three stages:

* mutation (adding a new variable)
* grouping (splitting by `type`)
* summarizing within groups.
  
--

Tidyverse code often uses the _forward pipe operator_ `%>%` provided by
the `magrittr` package to express such a pipeline.

R versions 4.1.0 and later also provide a _native pipe operator_ `|>`.

The pipe operator allows a call `f(x)` to be written as

```r
x |> f()
```

The left-hand value is passed implicitly as the first argument to the
function called on the right.
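
Pipes can be chained, and further arguments can be supplied in the call on the right. A short sketch using only base functions, assuming R 4.1.0 or later; `total` and `first_two` are names chosen for this example:

```r
total <- c(1, 4, 9) |> sqrt() |> sum()   # same as sum(sqrt(c(1, 4, 9)))
total
## [1] 6
first_two <- faithful |> head(2)         # same as head(faithful, 2)
```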

---

Using the pipe operator, the code for computing means and standard
deviations can be written as

```r
faithful |>
    mutate(type = ifelse(eruptions < 3, "short", "long")) |>
    group_by(type) |>
    summarize(mean = mean(waiting),
              sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84
```
--

There are trade-offs:

* Manipulation pipelines expressed this way are often more compact
  than ones using intermediate variables and/or nested calls.

* With pipe notation there is no need to come up with intermediate
  variable names.

* Pipe notation obscures the function calls that are actually
  happening and this can make debugging harder.

---
layout: false
## Contrast to Point-and-Click Interfaces

* Even simple tasks require learning some of the R language.

* Once you can do simple tasks, you have learned some of the R language.

* More complicated tasks become easier.

* Even very complicated tasks become possible.

---

## R and Reproducibility

Analyses in R are carried out by running code describing the tasks to
perform.

This code can be

* audited to make sure the analysis is right;

* replayed to make sure the results are reproducible;

* reused after changes in the data or on new data.

_Literate data analysis_ tools like R Markdown provide support for
this.

---
layout: true
## Finding Out More
---

### Getting Help on Functions

* `help(mean)` will show help for the function `mean`.
* This can be abbreviated as `?mean`
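
A few related base functions can help when you do not know the exact function name. This is a brief sketch; the search strings are arbitrary examples, and the calls that open help pages or run examples are guarded so they run only in an interactive session:

```r
matches <- apropos("^mean")        # visible objects whose names start with "mean"
"mean" %in% matches
## [1] TRUE
args(sd)                           # show a function's arguments
## function (x, na.rm = FALSE) 
## NULL

if (interactive()) {
    example(mean)                  # run the examples from the help page
    help.search("linear model")    # search help pages; also ??"linear model"
}
```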

### Some R Introductions and Tutorials

* [An Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.html)
  introduces the language and shows how to use R for
  statistical analysis and graphics.
* Another
  [introduction to R](http://zoonek2.free.fr/UNIX/48_R/all.html) by
  Vincent Zoonekynd.
* The [Quick-R](https://www.statmethods.net/) web site, related to the
  book *R in Action*.
* [R For
  Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf).
* [TryR](https://www.pluralsight.com/search?q=R) at Codeschool.
* [swirl: Learn R in R](https://swirlstats.com/).
* [_Hands-On Programming with R_](https://rstudio-education.github.io/hopr/).
* [_R for Data Science_](https://r4ds.hadley.nz/).
* [Data Science Dojo YouTube
  tutorials](https://www.youtube.com/c/Datasciencedojo/playlists?view=50&sort=dd&shelf_id=2).
* [Tutorials at RStudio](https://education.rstudio.com/learn/).
* [R for the Rest of Us](https://rfortherestofus.com/).
* There are _many_ others.

---

### Introductions to the Tidyverse

* Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund
    (2023), [_R for Data Science (2nd
    Edition)_](https://r4ds.hadley.nz/), O'Reilly. ([Book source on
    GitHub](https://github.com/hadley/r4ds))

* [R Basics
    chapter](https://rafalab.dfci.harvard.edu/dsbook-part-1/R/R-basics.html)
    in Rafael A. Irizarry (2019), [Introduction to Data Science: _Data
    Analysis and Prediction Algorithms with
    R_](https://rafalab.dfci.harvard.edu/dsbook-part-1/), Chapman &
    Hall/CRC. ([Book source on
    GitHub](https://github.com/rafalab/dsbook-part-1))

### R Markdown Tutorials

* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
  by Yihui Xie is a book-length presentation.
* The [R Markdown Home Page](https://rmarkdown.rstudio.com) has a link
  to a [tutorial](https://rmarkdown.rstudio.com/lesson-1.html).

---
layout: false

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](../tutorials/Rintro.Rmd).

You can run the tutorial with

```r
STAT4580::runTutorial("Rintro")
```

You can install the current version of the `STAT4580` package with

```r
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

---

## Exercises

1) Compute the mean of the numbers 1, 3, 5, 8.

2) What is the mean of the `eruptions` variable in the `faithful` data
   frame?

3) Find the average of the first 50 eruption durations in the `faithful`
   data frame.

4) Use the `median` function to modify the pipe example in the
   [tidyverse section](#the-tidyverse) to include medians.