---
title: "A Brief Overview of R"
output:
  html_document:
    toc: yes
    code_download: true
---
```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
```
## Background
R is a language, or an environment, for data analysis and visualization.
R is derived from the [_S_
language](https://en.wikipedia.org/wiki/S_(programming_language))
developed at AT&T Bell Laboratories.
R was originally developed for teaching at the University of Auckland,
New Zealand, by Ross Ihaka and Robert Gentleman.
R is now maintained by an international group of about 20
statisticians and computer scientists.
A great strength of R is the large number of extension packages that
have been developed.
The number available on [CRAN](https://cran.r-project.org) is now over
19,000.
## Basic Usage
Interactive R uses a _command line interface_ (CLI).
The interface runs a _read-evaluate-print loop_ (REPL).
A simple interaction with the R interpreter:
```{r, prompt = TRUE, comment = ""}
1 + 2
```
Values can be assigned to variables using a left arrow `<-` combination:
```{r, prompt = TRUE, comment = ""}
x <- c(1, 3, 5)
x
```
The `=` sign can also be used for assignment, but `<-` is recommended.
Basic arithmetic operations work element-wise on vectors:
```{r, prompt = TRUE, comment = ""}
x + x
```
Scalars are _recycled_ to the length of the longer operand:
```{r, prompt = TRUE, comment = ""}
x + 1
```
```{r, prompt = TRUE, comment = ""}
2 * x
```
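Recycling is not limited to scalars. As a small sketch (the vectors `long` and `short` are made up for illustration), a shorter vector is repeated until it matches the length of the longer one:

```{r, prompt = TRUE, comment = ""}
# Illustrative vectors, not part of the example above
long <- c(10, 20, 30, 40)
short <- c(1, 2)
long + short  # short is reused as 1, 2, 1, 2
```

R issues a warning when the longer length is not a multiple of the shorter one.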
Some ways to create new vectors:
```{r, prompt = TRUE, comment = ""}
c(1, 2, 3)
```
```{r, prompt = TRUE, comment = ""}
c("a", "b", "c")
```
```{r, prompt = TRUE, comment = ""}
1 : 3
```
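A few other base R functions are commonly used to build vectors. The calls below are a short sketch with arbitrary values chosen for illustration:

```{r, prompt = TRUE, comment = ""}
seq(0, 1, by = 0.25)         # evenly spaced values from 0 to 1
seq_len(5)                   # the integers 1 through 5
rep(c("a", "b"), times = 2)  # repeat the whole vector twice
```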
These examples show a _prompt_ as you would see in the interpreter.
Usually Rmarkdown documents show code and results like this:
```{r}
2 * x
```
This makes it easier to copy code and paste it into another document
or the R console.
## Data Frames
Data sets in R are often organized in _named lists_ of variables
called _data frames_.
The value of the variable `faithful` is a data frame with two
variables recorded for eruptions of the _Old Faithful_ geyser in Yellowstone
National Park:
* `eruptions`: Eruption duration (minutes)
* `waiting`: Waiting time to next eruption (minutes)
`head()` shows the first 6 rows:
```{r}
head(faithful)
```
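Beyond `head()`, a few base R functions give a quick overview of a data frame. This is a minimal sketch using the same `faithful` data:

```{r}
dim(faithful)    # number of rows and columns
names(faithful)  # variable names
str(faithful)    # compact display of the structure
```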
## A Simple Scatter Plot
```{r geyser, eval = FALSE}
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```
```{r eval = TRUE, echo = FALSE, fig.height = 4}
op <- par(mar = c(4, 4, 0.1, 0.1))
<<geyser>>
par(op)
```
Several graphics systems are available for R.
`plot()` is part of _base graphics_.
## Fitting a Linear Regression
```{r}
fit <- with(faithful, lm(waiting ~ eruptions))
fit
```
You can also use the `data` argument to `lm()`:
```{r}
fit <- lm(waiting ~ eruptions, data = faithful)
```
`coef()` extracts the coefficients:
```{r}
coef(fit)
```
`summary(fit)` provides more details:
```{r}
summary(fit)
```
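A fitted model can also be used for prediction. The sketch below refits the model so the chunk stands on its own and uses `predict()` with a small data frame of eruption durations; the values in `new_eruptions` are hypothetical and chosen only for illustration:

```{r}
# Refit here so the chunk is self-contained
fit <- lm(waiting ~ eruptions, data = faithful)
# Hypothetical eruption durations (minutes) at which to predict
new_eruptions <- data.frame(eruptions = c(2, 3.5, 5))
predict(fit, newdata = new_eruptions)
```

The column name in `newdata` has to match the predictor name used in the model formula.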
## Adding the Regression Line to the Plot
Original plot:
```{r, eval = FALSE}
<<geyser>>
```
```{r, eval = TRUE, echo = FALSE}
<<geyser>>
```
With a regression line:
```{r geyser-with-line, eval = FALSE}
<<geyser>>
abline(coef(fit), col = "red", lwd = 3)
```
```{r geyser-with-line, eval = TRUE, echo = FALSE}
```
## Packages and Package Libraries
Extension code and data sets are often made available in _packages_.
Packages are stored in directories (folders) called _libraries_.
`.libPaths()` will show you the libraries your R process will search.
`search()` shows what packages are attached to the global search path.
The `library()` function is used to find packages in the libraries and
attach them to the search path.
The expression `pkg::var` gets the value of variable `var` from
package `pkg` without attaching `pkg`.
You can install packages using the `install.packages` function or
the **Install Packages** item in the RStudio **Tools** menu.
By default packages are installed from [CRAN](https://cran.r-project.org).
It is also possible to use functions in the `remotes` package to
install packages hosted on [GitHub](https://github.com/) or
[GitLab](https://about.gitlab.com/).
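The functions mentioned in this section can be tried directly. This is a small sketch; the output of the first two calls depends on your installation, and the file name `"report.Rmd"` is a made-up example:

```{r}
.libPaths()                    # library directories this R process searches
head(search(), 3)              # first few entries on the search path
tools::file_ext("report.Rmd")  # use a function from tools without attaching it
```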
## A Useful Package: `ggplot2`
The `ggplot2` package provides a powerful alternative to the base
graphics system.
The geyser example can be done in `ggplot2` like this:
```{r geyser-ggplot, eval = FALSE, echo = TRUE}
library(ggplot2)
ggplot(data = faithful) +
    geom_point(mapping = aes(x = eruptions, y = waiting)) +
    geom_smooth(mapping = aes(x = eruptions, y = waiting),
                method = "lm", se = FALSE)
```
```{r geyser-ggplot, eval = TRUE, echo = FALSE, message = FALSE}
```
`ggplot2` is part of a useful collection of packages called the
[_tidyverse_](https://www.tidyverse.org/).
`ggplot` is based on the _Grammar of Graphics_.
* Plots are composed of _geometric objects_ (`geoms`).
* Variables are _mapped_ to _aesthetic features_ of geometric objects.
A basic template for creating a plot with `ggplot`:
```{r, eval = FALSE}
ggplot(data = <DATA>) + <GEOM>(mapping = aes(<MAPPINGS>))
```
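As one possible instance of this pattern, the sketch below supplies `faithful` as the data, `geom_histogram()` as the geom, and maps `waiting` to the `x` aesthetic. The choice of a histogram and of `bins = 20` is an assumption made for illustration:

```{r, message = FALSE}
library(ggplot2)
# Data: faithful; geom: geom_histogram; mapping: waiting -> x
p <- ggplot(data = faithful) +
    geom_histogram(mapping = aes(x = waiting), bins = 20)
p
```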
## Subsetting and Extracting Components
The _subset operator_ `[` can be used to extract elements by index:
```{r}
month.abb
month.abb[1 : 3]
```
Subsetting can also be based on a logical expression that returns
`TRUE` or `FALSE` for each element:
```{r}
(starts_with_J <- substr(month.abb, 1, 1) == "J")
month.abb[starts_with_J]
```
The value of an assignment operation is the right hand side value.
* Ordinarily this value is not printed.
* Placing the assignment expression in parentheses causes it to be
printed.
Individual elements can be extracted using the _element operator_ `[[`:
```{r}
month.abb[[3]]
```
Components of named lists, such as _data frames_, can be extracted with
the `$` operator:
```{r}
names(faithful)
head(faithful, 4)
head(faithful$eruptions, 4)
```
The element operator can be used as well:
```{r}
head(faithful[["eruptions"]], 4)
```
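The difference between `[` and `[[` is easiest to see on a list or data frame. As a brief sketch using `faithful`: `[` returns a smaller object of the same kind, while `[[` returns the component itself. Logical subsetting works on rows as well:

```{r}
class(faithful["eruptions"])    # `[` keeps the data frame structure
class(faithful[["eruptions"]])  # `[[` returns the underlying numeric vector
# Rows for eruptions shorter than 3 minutes (threshold chosen for illustration)
head(faithful[faithful$eruptions < 3, ], 4)
```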
## Functions
### Simple Functions
All computations in R are carried out by functions.
Defining a function allows you to avoid cutting and pasting code.
A simple function:
```{r}
ms <- function(x) list(mean = mean(x), sd = sd(x))
ms(faithful$eruptions)
```
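Function arguments can have default values. The sketch below defines `ms_trim()`, a hypothetical variant of `ms()` introduced here only for illustration, which passes an optional trimming fraction on to `mean()`:

```{r}
# Hypothetical variant of ms() with an optional trimming fraction
ms_trim <- function(x, trim = 0) {
    list(mean = mean(x, trim = trim), sd = sd(x))
}
ms_trim(faithful$waiting)              # uses the default, trim = 0
ms_trim(faithful$waiting, trim = 0.1)  # 10% trimmed mean
```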
### Generic Functions and Object-Oriented Programming
R supports several mechanisms for object-oriented programming based on
_generic functions_.
The most commonly used mechanism, called S3, allows a function to
_dispatch_ to a _method_ based on the _class_ of its first argument.
`plot` is a very simple generic function.
```{r}
plot
```
For example, the `plot` method for linear model fit objects produces a
set of 4 plots commonly used to assess regression fits.
```{r geyser_lm_fit, eval = FALSE}
plot(fit)
```
```{r, echo = FALSE, fig.height = 6.5, fig.width = 7}
op <- par(mfrow = c(2, 2))
<<geyser_lm_fit>>
par(op)
```
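To see how S3 dispatch works, you can define a small generic of your own. In this sketch `describe()` and its methods are hypothetical names invented for illustration; the generic calls `UseMethod()`, and R then picks the method matching the class of the first argument:

```{r}
# Hypothetical generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")
# Fallback method used when no class-specific method exists
describe.default <- function(x, ...) {
    cat("An object of class", class(x)[1], "\n")
}
# Method for linear model fits (class "lm")
describe.lm <- function(x, ...) {
    cat("A linear model with", length(coef(x)), "coefficients\n")
}
describe(1:3)
describe(lm(waiting ~ eruptions, data = faithful))
```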
### Lazy and Non-Standard Evaluation
An unusual but useful feature of R is that function arguments are not
evaluated until their value is needed, so they may not be evaluated at
all.
This is called _lazy evaluation_.
```{r, error = TRUE}
log("A")
f <- function(x) NULL
f(log("A"))
```
Functions can also capture the expression of the arguments they were
called with:
```{r}
f <- function(x) deparse(substitute(x))
f(a + b)
```
Together these features allow functions to evaluate their arguments in
_non-standard_ ways.
This is most commonly used to allow values for variables in arguments
to be found in a provided data frame.
The `with()` function is a simple example:
```{r, error = TRUE}
mean(eruptions)
```
```{r}
with(faithful, mean(eruptions))
```
Non-standard evaluation of this type is used extensively in the _tidyverse_.
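A function in the style of `with()` can be sketched in a few lines by combining `substitute()` and `eval()`. The name `with_data()` is hypothetical and used only for this illustration:

```{r}
# Hypothetical helper: evaluate an expression with data frame columns in scope
with_data <- function(data, expr) {
    e <- substitute(expr)          # capture the unevaluated argument
    eval(e, data, parent.frame())  # look up names in data first, then the caller
}
with_data(faithful, mean(eruptions))
```

Because `expr` is never evaluated in the usual way, the fact that `eruptions` does not exist in the calling environment causes no error.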
## The Tidyverse
[_Tidyverse_](https://www.tidyverse.org/) functions are designed to
perform operations on data frames.
The `dplyr` package provides a _grammar for data manipulation_.
A simple example: computing means and standard deviations for the
waiting times after the short (less than 3 minutes) and the long (3
minutes or more) eruptions:
```{r, message = FALSE}
library(dplyr)
tmp <- mutate(faithful,
              type = ifelse(eruptions < 3,
                            "short",
                            "long"))
head(tmp)
```
```{r}
summarize(group_by(tmp, type),
          mean = mean(waiting),
          sd = sd(waiting))
```
Tidyverse functions like to work with an enhanced form of data frame
called a _tibble_.
A computation like this can be viewed as a _transformation pipeline_
consisting of three stages:
* mutation (adding a new variable)
* grouping (splitting by `type`)
* summarizing within groups.
Tidyverse code often uses the _forward pipe operator_ `%>%` provided by
the `magrittr` package to express such a pipeline.
R 4.1.0 and later also provides a _native pipe operator_ `|>`.
The pipe operator allows a call `f(x)` to be written as
```{r, eval = FALSE}
x %>% f()
```
The left hand value is passed implicitly as the first argument to the
function called on the right.
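The same idea can be tried with base R functions alone. This sketch uses the native pipe, so it assumes R 4.1.0 or later; the input vector is arbitrary:

```{r}
# Requires R 4.1.0 or later for the native pipe
sum(sqrt(c(1, 4, 9)))          # nested calls, read inside out
c(1, 4, 9) |> sqrt() |> sum()  # the same computation, read left to right
```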
Using the pipe operator, the code for computing means and standard
deviations can be written as
```{r}
faithful %>%
    mutate(type = ifelse(eruptions < 3, "short", "long")) %>%
    group_by(type) %>%
    summarize(mean = mean(waiting),
              sd = sd(waiting))
```
There are trade-offs:
* Manipulation pipelines expressed this way are often more compact
than ones using intermediate variables and/or nested calls.
* With pipe notation there is no need to come up with intermediate
variable names.
* Pipe notation obscures the function calls that are actually
happening and this can make debugging harder.
## Contrast to Point-and-Click Interfaces
* Even simple tasks require learning some of the R language.
* Once you can do simple tasks, you have learned some of the R language.
* More complicated tasks become easier.
* Even very complicated tasks become possible.
## R and Reproducibility
Analyses in R are carried out by running code describing the tasks to
perform.
This code can be
* audited to make sure the analysis is right;
* replayed to make sure the results are reproducible;
* reused after changes in the data or on new data.
_Literate data analysis_ tools like Rmarkdown provide support for
this.
## Finding Out More
### Getting Help on Functions
* `help(mean)` will show help for the function `mean`.
* This can be abbreviated as `?mean`
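
When you do not know the exact name of a function, a few other base R tools can help. This is a short sketch; the search terms are arbitrary examples:

```{r, eval = FALSE}
help.search("standard deviation")  # search installed help pages by keyword
apropos("^mean")                   # names of visible objects starting with "mean"
example(mean)                      # run the examples from the help page for mean
```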
### Some R Introductions and Tutorials
* [An Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.html)
introduces the language and shows how to use R for
statistical analysis and graphics.
* Another
[introduction to R](http://zoonek2.free.fr/UNIX/48_R/all.html) by
Vincent Zoonekynd.
* [Quick-R](https://www.statmethods.net/) web site related to *R
in Action* book.
* [R For
Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf).
* [TryR](https://www.pluralsight.com/search?q=R) at Codeschool.
* [swirl: Learn R in R](https://swirlstats.com/).
* [_Hands-On Programming with R_](https://rstudio-education.github.io/hopr/).
* [_R for Data Science_](https://r4ds.had.co.nz/).
* [Data Science Dojo YouTube
tutorials](https://www.youtube.com/c/Datasciencedojo/playlists?view=50&sort=dd&shelf_id=2).
* [Tutorials at RStudio](https://education.rstudio.com/learn/).
* [R for the Rest of Us](https://rfortherestofus.com/).
* There are _many_ others.
### Introductions to the Tidyverse
* Hadley Wickham and Garrett Grolemund (2016), [_R for Data
Science_](https://r4ds.had.co.nz/), O'Reilly. (Book source on
[GitHub](https://github.com/hadley/r4ds))
* [R Basics chapter](https://rafalab.dfci.harvard.edu/dsbook/r-basics.html)
in Rafael A. Irizarry (2019), [Introduction to Data Science: _Data
Analysis and Prediction Algorithms with
R_](https://rafalab.dfci.harvard.edu/dsbook/), Chapman & Hall/CRC. ([Book
source on GitHub](https://github.com/rafalab/dsbook))
### R Markdown Tutorials
* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
by Yihui Xie is a book-length presentation.
* The [R Markdown Home Page](https://rmarkdown.rstudio.com) has a link
to a [tutorial](https://rmarkdown.rstudio.com/lesson-1.html).
## Interactive Tutorial
An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](`r WLNK("tutorials/Rintro.Rmd")`).
You can run the tutorial with
```{r, eval = FALSE}
STAT4580::runTutorial("Rintro")
```
You can install the current version of the `STAT4580` package with
```{r, eval = FALSE}
remotes::install_gitlab("luke-tierney/STAT4580")
```
You may need to install the `remotes` package from CRAN first.
## Exercises
1. Compute the mean of the numbers 1, 3, 5, 8.
2. What is the mean of the `eruptions` variable in the `faithful` data
frame?
3. Find the average of the first 50 eruption durations in the `faithful`
data frame.
4. Use the `median` function to modify the pipe example in the
[tidyverse section](#the-tidyverse) to include medians.