## Background

R is a language, or an environment, for data analysis and visualization.

R is derived from the [S language](https://en.wikipedia.org/wiki/S_(programming_language)) developed at AT&T Bell Laboratories.

R was originally developed for teaching at the University of Auckland, New Zealand, by Ross Ihaka and Robert Gentleman.

R is now maintained by an international group of about 20 statisticians and computer scientists.

A great strength of R is the large number of extension packages that have been developed.

The number available on CRAN is now over 17,000.

## Basic Usage

Interactive R uses a command line interface (CLI).

The interface runs a read-evaluate-print loop (REPL).

A simple interaction with the R interpreter:

> 1 + 2
[1] 3

Values can be assigned to variables using the left arrow combination <-:

> x <- c(1, 3, 5)
> x
[1] 1 3 5

The = sign can also be used for assignment, but <- is recommended.
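One place where the two forms differ is inside a function call, where = matches an argument by name instead of creating a variable. A minimal sketch; the names y and z are chosen only for illustration:

```r
y <- c(2, 4, 6)   # assignment with the arrow
z = c(2, 4, 6)    # assignment with =; same effect at top level
identical(y, z)   # TRUE

# Inside a call, = names an argument; no variable called n is created
head(y, n = 2)    # 2 4
```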

Basic arithmetic operations work element-wise on vectors:

> x + x
[1]  2  6 10

Scalars are recycled to the length of the longer operand:

> x + 1
[1] 2 4 6
> 2 * x
[1]  2  6 10
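Recycling is not limited to scalars: any shorter vector is repeated to match the longer one. A small sketch, using the illustrative names a and b:

```r
a <- c(1, 3, 5, 7)
b <- c(10, 20)

# b is recycled to 10, 20, 10, 20 before the element-wise addition
a + b   # 11 23 15 27
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.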

Some ways to create new vectors:

> c(1, 2, 3)
[1] 1 2 3
> c("a", "b", "c")
[1] "a" "b" "c"
> 1 : 3
[1] 1 2 3
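A few other base R functions are often used for the same purpose. The calls below are a short sketch, with arbitrary argument values:

```r
seq(0, 1, by = 0.25)          # evenly spaced values: 0.00 0.25 0.50 0.75 1.00
rep(c("a", "b"), times = 2)   # repeat a vector: "a" "b" "a" "b"
seq_len(4)                    # the integers 1 through 4
numeric(3)                    # a vector of three zeros
```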

These examples show a prompt as you would see in the interpreter.

Usually R Markdown documents show code and results like this:

2 * x
## [1]  2  6 10

Output lines are prefixed with ##, so R treats them as comments. This makes it easier to copy code and paste it into another document or into the interpreter.

## Data Frames

Data sets in R are often organized in named lists of variables called data frames.

The value of the variable faithful is a data frame with two variables recorded for eruptions of the Old Faithful geyser in Yellowstone National Park:

• eruptions: Eruption duration (minutes)
• waiting: Waiting time to next eruption (minutes)

head() shows the first 6 rows:

head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
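Some other functions for a first look at a data frame are sketched below. The small data frame d is made up purely for illustration:

```r
dim(faithful)     # number of rows and columns: 272 2
names(faithful)   # variable names: "eruptions" "waiting"
str(faithful)     # compact summary of each variable

# A small data frame built by hand; the values are invented
d <- data.frame(id = 1:3, score = c(2.5, 4.0, 3.5))
nrow(d)           # 3
```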

## A Simple Scatter Plot

with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))

Several graphics systems are available for R.

plot() is part of base graphics.
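Base graphics includes other high-level plotting functions as well. As one hedged example, a histogram of the eruption durations could be drawn like this; the number of breaks and the title are arbitrary choices:

```r
# Histogram of eruption durations using base graphics
hist(faithful$eruptions,
     breaks = 20,
     main = "Old Faithful eruptions",
     xlab = "Eruption time (min)")
```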

## Fitting a Linear Regression

fit <- with(faithful, lm(waiting ~ eruptions))
fit
##
## Call:
## lm(formula = waiting ~ eruptions)
##
## Coefficients:
## (Intercept)    eruptions
##       33.47        10.73

You can also use the data argument to lm():

fit <- lm(waiting ~ eruptions, data = faithful)

coef() extracts the coefficients:

coef(fit)
## (Intercept)   eruptions
##    33.47440    10.72964
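The coefficients can be used to compute a fitted value by hand. The sketch below does this for an eruption of 4 minutes (an arbitrary value) and compares the result with predict():

```r
fit <- lm(waiting ~ eruptions, data = faithful)
b <- coef(fit)

# Predicted waiting time after a 4-minute eruption, computed by hand
by_hand <- b[["(Intercept)"]] + b[["eruptions"]] * 4
by_hand   # about 76.4

# The same prediction using predict()
predict(fit, newdata = data.frame(eruptions = 4))
```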

summary(fit) provides more details:

summary(fit)
##
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.0796  -4.4831   0.2122   3.9246  15.9719
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
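The object returned by summary() is a list, so individual pieces of this output can be extracted with the $ operator. A brief sketch, where s is an illustrative name:

```r
fit <- lm(waiting ~ eruptions, data = faithful)
s <- summary(fit)

s$r.squared      # multiple R-squared, about 0.8115
s$sigma          # residual standard error, about 5.914
s$coefficients   # matrix of estimates, standard errors, t and p values
```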

## Adding the Regression Line to the Plot

Original plot:

with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))

With a regression line:

with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
abline(coef(fit), col = "red", lwd = 3)
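abline() also accepts the fitted model object itself, and plot() accepts a formula with a data argument. An equivalent sketch using both variations:

```r
fit <- lm(waiting ~ eruptions, data = faithful)

# Formula interface to plot(), then the line taken directly from the fit
plot(waiting ~ eruptions, data = faithful,
     xlab = "Eruption time (min)",
     ylab = "Waiting time to next eruption (min)")
abline(fit, col = "red", lwd = 3)
```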

## Packages and Package Libraries

Extension code and data sets are often made available in packages.

Installed packages are stored in directories (folders) called libraries; each library is a collection of packages.

.libPaths() will show you the libraries your R process will search.

search() shows what packages are attached to the global search path.

The library() function is used to find packages in the libraries and attach them to the search path.

The expression pkg::var gets the value of variable var from package pkg without attaching pkg.

You can install packages using the install.packages function or the Install Packages item in the RStudio Tools menu.

By default packages are installed from CRAN.

It is also possible to use functions in the remotes package to install packages hosted on GitHub or GitLab.
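The functions mentioned in this section can be tried out as follows. This is only a sketch: the installation calls are commented out so that running it changes nothing, and the package and repository names in them are just examples.

```r
.libPaths()   # directories searched for installed packages
search()      # packages currently attached to the search path

# Use a function from a package without attaching the package
stats::median(c(2, 4, 9))   # 4

# Check whether a package is installed without attaching it
requireNamespace("ggplot2", quietly = TRUE)

# Installation from CRAN and from GitHub (commented out on purpose)
# install.packages("ggplot2")
# remotes::install_github("tidyverse/ggplot2")
```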

## A Useful Package: ggplot2

The ggplot2 package provides a powerful alternative to the base graphics system.

The geyser example can be done in ggplot2 like this:

library(ggplot2)
ggplot(data = faithful) +
    geom_point(mapping = aes(x = eruptions, y = waiting)) +
    geom_smooth(mapping = aes(x = eruptions, y = waiting),
                method = "lm", se = FALSE)

ggplot2 is part of a useful collection of packages called the tidyverse.

ggplot2 is based on the Grammar of Graphics.

• Plots are composed of geometric objects (geoms).

• Variables are mapped to aesthetic features of geometric objects.

A basic template for creating a plot with ggplot:

ggplot(data = <DATA>) + <GEOM>(mapping = aes(<MAPPINGS>))
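As an illustration of filling in the template, the sketch below uses faithful as the data, geom_histogram as the geom, and a single mapping of eruptions to x. The choice of 20 bins is arbitrary.

```r
library(ggplot2)

# <DATA> is faithful, <GEOM> is geom_histogram, <MAPPINGS> is x = eruptions
p <- ggplot(data = faithful) +
    geom_histogram(mapping = aes(x = eruptions), bins = 20)
p   # printing the plot object draws it
```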

## Subsetting and Extracting Components

The subset operator [ can be used to extract elements by index:

month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.abb[1 : 3]
## [1] "Jan" "Feb" "Mar"

Subsetting can also be based on a logical expression that returns TRUE or FALSE for each element:

(starts_with_J <- substr(month.abb, 1, 1) == "J")
##  [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
month.abb[starts_with_J]
## [1] "Jan" "Jun" "Jul"

The value of an assignment operation is the right hand side value.

• Ordinarily this value is not printed.
• Placing the assignment expression in parentheses causes it to be printed.
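A few more forms of index vector work with the subset operator. A short sketch:

```r
month.abb[-(1:9)]                       # negative indices drop elements: "Oct" "Nov" "Dec"
month.abb[c(12, 1)]                     # indices can be in any order: "Dec" "Jan"
which(substr(month.abb, 1, 1) == "J")   # positions of the TRUE values: 1 6 7
```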

Individual elements can be extracted using the element operator [[:

month.abb[[3]]
## [1] "Mar"

Components of named lists, such as data frames, can be extracted with the $ operator:

names(faithful)
## [1] "eruptions" "waiting"
head(faithful, 4)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
head(faithful$eruptions, 4)
## [1] 3.600 1.800 3.333 2.283

The element operator can be used as well:

head(faithful[["eruptions"]], 4)
## [1] 3.600 1.800 3.333 2.283
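The difference between the two operators shows up when they are applied to a data frame with the same name: the subset operator returns a smaller data frame, while the element operator returns the component itself. A quick check:

```r
class(faithful["eruptions"])     # "data.frame": [ keeps the list structure
class(faithful[["eruptions"]])   # "numeric": [[ extracts the component

# $ and [[ give the same result for a named component
identical(faithful$eruptions, faithful[["eruptions"]])   # TRUE
```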

## Functions

### Simple Functions

All computations in R are carried out by functions.

Defining a function allows you to avoid cutting and pasting code.

A simple function:

ms <- function(x) list(mean = mean(x), sd = sd(x))
ms(faithful$eruptions)
## $mean
## [1] 3.487783
##
## $sd
## [1] 1.141371
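Functions can take several arguments, and arguments can have default values. The function ms2 below is a hypothetical variant of ms, written with a body in braces, that passes an na.rm argument through to mean() and sd():

```r
# ms2 is an illustrative variant of ms; na.rm defaults to FALSE
ms2 <- function(x, na.rm = FALSE) {
    list(mean = mean(x, na.rm = na.rm),
         sd = sd(x, na.rm = na.rm))
}

ms2(c(1, 2, 3, NA))                 # both components are NA
ms2(c(1, 2, 3, NA), na.rm = TRUE)   # mean 2, sd 1
```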

### Generic Functions and Object-Oriented Programming

R supports several mechanisms for object-oriented programming based on generic functions.

The most commonly used mechanism, called S3, allows a function to dispatch to a method based on the class of its first argument.

plot is a very simple generic function.

plot
## function (x, y, ...)
## UseMethod("plot")
## <bytecode: 0x557e824c6c80>
## <environment: namespace:base>

For example, the plot method for linear model fit objects produces a set of 4 plots commonly used to assess regression fits.

plot(fit)
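Defining a new S3 generic follows the same pattern as plot. In the sketch below, describe is a hypothetical generic invented for illustration. Methods are ordinary functions named generic.class, and the default method is used when no more specific method exists.

```r
# describe() is a made-up generic, not part of R
describe <- function(x, ...) UseMethod("describe")

describe.default <- function(x, ...) {
    paste("an object of class", class(x)[1])
}

describe.lm <- function(x, ...) {
    paste("a linear model with", length(coef(x)), "coefficients")
}

fit <- lm(waiting ~ eruptions, data = faithful)
describe(fit)        # dispatches to describe.lm
describe(faithful)   # no data.frame method, so describe.default is used
```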

### Lazy and Non-Standard Evaluation

An unusual but useful feature of R is that function arguments are not evaluated until their value is needed, so they may not be evaluated at all.

This is called lazy evaluation.

log("A")
## Error in log("A"): non-numeric argument to mathematical function
f <- function(x) NULL
f(log("A"))
## NULL

Functions can also capture the expression of the arguments they were called with:

f <- function(x) deparse(substitute(x))
f(a + b)
## [1] "a + b"

Together these features allow functions to evaluate their arguments in non-standard ways.

This is most commonly used to allow values for variables in arguments to be found in a provided data frame.

The with() function is a simple example:

mean(eruptions)
## Error in mean(eruptions): object 'eruptions' not found
with(faithful, mean(eruptions))
## [1] 3.487783
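The idea behind with() can be sketched in a few lines using substitute() and eval(). The function my_with is a hypothetical, simplified stand-in, not the actual implementation of with():

```r
# my_with is an illustrative simplification of with()
my_with <- function(data, expr) {
    e <- substitute(expr)           # capture the unevaluated expression
    eval(e, data, parent.frame())   # look up variables in data first
}

my_with(faithful, mean(eruptions))   # same value as with(faithful, mean(eruptions))
```

Because arguments are evaluated lazily, mean(eruptions) is never evaluated in the caller's environment, where eruptions does not exist.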

Non-standard evaluation of this type is used extensively in the tidyverse.

## The Tidyverse

Tidyverse functions are designed to perform operations on data frames.

The dplyr package provides a grammar for data manipulation.

A simple example: computing means and standard deviations for the waiting times after the short (less than 3 minutes) and the long (3 minutes or more) eruptions:

library(dplyr)
tmp <- mutate(faithful,
              type = ifelse(eruptions < 3,
                            "short",
                            "long"))
head(tmp)
##   eruptions waiting  type
## 1     3.600      79  long
## 2     1.800      54 short
## 3     3.333      74  long
## 4     2.283      62 short
## 5     4.533      85  long
## 6     2.883      55 short
summarize(group_by(tmp, type),
          mean = mean(waiting),
          sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84

Tidyverse functions are designed to work with, and usually return, an enhanced form of data frame called a tibble.

A computation like this can be viewed as a transformation pipeline consisting of three stages:

• mutation (adding a new variable)
• grouping (splitting by type)
• summarizing within groups.

Tidyverse code often uses the forward pipe operator %>% provided by the magrittr package to express such a pipeline.

R 4.1.0 and later also provide a native pipe operator |>.

The pipe operator allows a call f(x) to be written as

x %>% f()

The left hand value is passed implicitly as the first argument to the function called on the right.
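A small example with base R functions, using the native pipe so that no package is needed (this assumes R 4.1.0 or later; the numbers are arbitrary):

```r
# Each step receives the previous result as its first argument
c(1, 4, 9) |> sqrt() |> sum()   # same as sum(sqrt(c(1, 4, 9))), which is 6

# Additional arguments follow the implicit first argument
c(1.234, 5.678) |> round(1)     # same as round(c(1.234, 5.678), 1)
```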

Using the pipe operator, the code for computing means and standard deviations can be written as

faithful %>%
    mutate(type = ifelse(eruptions < 3, "short", "long")) %>%
    group_by(type) %>%
    summarize(mean = mean(waiting),
              sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84

• Manipulation pipelines expressed this way are often more compact than ones using intermediate variables and/or nested calls.

• With pipe notation there is no need to come up with intermediate variable names.

• Pipe notation obscures the function calls that are actually happening and this can make debugging harder.
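For comparison, the same computation written as a single nested call, with no pipe and no intermediate variable, might look like the sketch below. The result name nested is chosen only for illustration. The calls must be read from the inside out, which is the readability cost that pipe notation avoids.

```r
library(dplyr)

# mutate() runs first, then group_by(), then summarize()
nested <- summarize(group_by(mutate(faithful,
                                    type = ifelse(eruptions < 3, "short", "long")),
                             type),
                    mean = mean(waiting),
                    sd = sd(waiting))
nested
```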

## Contrast to Point-and-Click Interfaces

• Even simple tasks require learning some of the R language.

• Once you can do simple tasks, you have learned some of the R language.

• More complicated tasks become easier.

• Even very complicated tasks become possible.

## R and Reproducibility

Analyses in R are carried out by running code describing the tasks to perform.

This code can be

• audited to make sure the analysis is right;

• replayed to make sure the results are reproducible;

• reused after changes in the data or on new data.

Literate data analysis tools like R Markdown provide support for this.

## Finding Out More

### Getting Help on Functions

• help(mean) will show help for the function mean.
• This can be abbreviated as ?mean
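A few related base R functions can help when the exact function name is not known. The calls below are a sketch with arbitrary search terms:

```r
args(sd)             # show the argument list of a function
apropos("^median")   # names on the search path matching a pattern
example(median)      # run the examples from a help page

# Search installed help pages by keyword; can be abbreviated as ??regression
# help.search("regression")
```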

## Interactive Tutorial

An interactive learnr tutorial for these notes is available.

You can run the tutorial with

STAT4580::runTutorial("Rintro")

You can install the current version of the STAT4580 package with

remotes::install_gitlab("luke-tierney/STAT4580")

You may need to install the remotes package from CRAN first.

## Exercises

1. Compute the mean of the numbers 1, 3, 5, 8.
1. What is the mean of the eruptions variable in the faithful data frame?
1. Find the average of the first 50 eruption durations in the faithful data frame.
1. Use the median function to modify the pipe example in the tidyverse section to include medians.