---
title: "A Brief Overview of R"
output:
  html_document:
    toc: yes
    code_download: true
---

```{r setup, include = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE,
                      fig.height = 5, fig.width = 6, fig.align = "center")
```

## Background

R is a language, or an environment, for data analysis and visualization.

R is derived from the
[_S_ language](https://en.wikipedia.org/wiki/S_(programming_language))
developed at AT&T Bell Laboratories.

R was originally developed for teaching at the University of Auckland,
New Zealand, by Ross Ihaka and Robert Gentleman.

R is now maintained by an international group of about 20
statisticians and computer scientists.

A great strength of R is the large number of extension packages that
have been developed.

The number available on [CRAN](https://cran.r-project.org) is now over
19,000.

## Basic Usage

Interactive R uses a _command line interface_ (CLI).

The interface runs a _read-evaluate-print loop_ (REPL).

A simple interaction with the R interpreter:

```{r, prompt = TRUE, comment = ""}
1 + 2
```

Values can be assigned to variables using a left arrow `<-` combination:

```{r, prompt = TRUE, comment = ""}
x <- c(1, 3, 5)
x
```

The `=` sign can also be used for assignment, but `<-` is recommended.
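
One reason `<-` is usually preferred is that `=` has a second role: inside a function call it names an argument rather than assigning a variable. The short sketch below illustrates the difference; the variable `y` is introduced only for this illustration.

```{r}
x <- c(1, 3, 5)        # same value as above
y = x * 2              # at top level, `=` assigns just like `<-`
y
mean(x = c(10, 20))    # inside a call, `=` only names the argument `x`
x                      # the variable `x` is unchanged
```
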

Basic arithmetic operations work element-wise on vectors:

```{r, prompt = TRUE, comment = ""}
x + x
```

Scalars are _recycled_ to the length of the longer operand:

```{r, prompt = TRUE, comment = ""}
x + 1
```

```{r, prompt = TRUE, comment = ""}
2 * x
```

Some ways to create new vectors:

```{r, prompt = TRUE, comment = ""}
c(1, 2, 3)
```

```{r, prompt = TRUE, comment = ""}
c("a", "b", "c")
```

```{r, prompt = TRUE, comment = ""}
1 : 3
```

These examples show a _prompt_ as you would see in the interpreter.

Usually Rmarkdown documents show code and results like this:

```{r}
2 * x
```

This makes it easier to copy code for pasting it into another document
or the R console.

## Data Frames

Data sets in R are often organized in _named lists_ of variables
called _data frames_.

The value of the variable `faithful` is a data frame with two
variables recorded for eruptions of the _Old Faithful_ geyser in
Yellowstone National Park:

* `eruptions`: Eruption duration (minutes)
* `waiting`: Waiting time to next eruption (minutes)

`head()` shows the first 6 rows:

```{r}
head(faithful)
```

## A Simple Scatter Plot

```{r geyser, eval = FALSE}
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```

```{r eval = TRUE, echo = FALSE, fig.height = 4}
op <- par(mar = c(4, 4, 0.1, 0.1))
<<geyser>>
par(op)
```

Several graphics systems are available for R.

`plot()` is part of _base graphics_.

## Fitting a Linear Regression

```{r}
fit <- with(faithful, lm(waiting ~ eruptions))
fit
```

You can also use the `data` argument to `lm()`:

```{r}
fit <- lm(waiting ~ eruptions, data = faithful)
```

`coef()` extracts the coefficients:

```{r}
coef(fit)
```

`summary(fit)` provides more details:

```{r}
summary(fit)
```

## Adding the Regression Line to the Plot

Original plot:

```{r, eval = FALSE}
<<geyser>>
```

```{r, eval = TRUE, echo = FALSE}
<<geyser>>
```

With a regression line:

```{r geyser-with-line, eval = FALSE}
<<geyser>>
abline(coef(fit), col = "red", lwd = 3)
```

```{r geyser-with-line, eval = TRUE, echo = FALSE}
```

## Packages and Package Libraries

Extension code and data sets are often made available in _packages_.

Packages are stored in folders or directories as collections called
_libraries_.

`.libPaths()` will show you the libraries your R process will search.

`search()` shows what packages are attached to the global search path.

The `library()` function is used to find packages in the libraries and
attach them to the search path.

The expression `pkg::var` gets the value of variable `var` from
package `pkg` without attaching `pkg`.

You can install packages using the `install.packages()` function or the
**Install Packages** item in the RStudio **Tools** menu.

By default packages are installed from
[CRAN](https://cran.r-project.org).

It is also possible to use functions in the `remotes` package to
install packages hosted on [GitHub](https://github.com/) or
[GitLab](https://about.gitlab.com/).

## A Useful Package: `ggplot2`

The `ggplot2` package provides a powerful alternative to the base
graphics system.

The geyser example can be done in `ggplot2` like this:

```{r geyser-ggplot, eval = FALSE, echo = TRUE}
library(ggplot2)
ggplot(data = faithful) +
    geom_point(mapping = aes(x = eruptions, y = waiting)) +
    geom_smooth(mapping = aes(x = eruptions, y = waiting),
                method = "lm", se = FALSE)
```

```{r geyser-ggplot, eval = TRUE, echo = FALSE, message = FALSE}
```

`ggplot2` is part of a useful collection of packages called the
[_tidyverse_](https://www.tidyverse.org/).

`ggplot` is based on the _Grammar of Graphics_.

* Plots are composed of _geometric objects_ (`geoms`).
* Variables are _mapped_ to _aesthetic features_ of geometric objects.

A basic template for creating a plot with `ggplot`:

```{r, eval = FALSE}
ggplot(data = <DATA>) +
    <GEOM>(mapping = aes(<MAPPINGS>))
```

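
As one illustration of filling in the template, the sketch below uses the `faithful` data frame as the data, `geom_histogram()` as the geom, and maps `eruptions` to the `x` aesthetic. The choice of geom, the name `p`, and `bins = 20` are assumptions made for this example only.

```{r}
library(ggplot2)
# data: faithful; geom: geom_histogram; mapping: eruptions on the x axis
p <- ggplot(data = faithful) +
    geom_histogram(mapping = aes(x = eruptions), bins = 20)
p   # printing the plot object draws it
```

Swapping in a different geom or a different mapping changes the plot without changing the overall shape of the code.
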
## Subsetting and Extracting Components

The _subset operator_ `[` can be used to extract elements by index:

```{r}
month.abb
month.abb[1 : 3]
```

Subsetting can also be based on a logical expression that returns
`TRUE` or `FALSE` for each element:

```{r}
(starts_with_J <- substr(month.abb, 1, 1) == "J")
month.abb[starts_with_J]
```

The value of an assignment operation is the right hand side value.

* Ordinarily this value is not printed.
* Placing the assignment expression in parentheses causes it to be printed.

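
Because an assignment has a value, it can itself be used inside a larger expression. The following is a small sketch of two common consequences; the names `a`, `b`, `n`, and `q1` are made up for this illustration.

```{r}
a <- b <- c(2, 4)                  # the value of `b <- c(2, 4)` is assigned to `a`
identical(a, b)
n <- length(q1 <- month.abb[1:3])  # assign `q1` and pass its value to length()
n
q1
```
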
Individual elements can be extracted using the _element operator_ `[[`:

```{r}
month.abb[[3]]
```

Components of named lists, such as _data frames_, can be extracted with
the `$` operator:

```{r}
names(faithful)
head(faithful, 4)
head(faithful$eruptions, 4)
```

The element operator can be used as well:

```{r}
head(faithful[["eruptions"]], 4)
```

## Functions

### Simple Functions

All computations in R are carried out by functions.

Defining a function allows you to avoid cutting and pasting code.

A simple function:

```{r}
ms <- function(x) list(mean = mean(x), sd = sd(x))
ms(faithful$eruptions)
```

### Generic Functions and Object-Oriented Programming

R supports several mechanisms for object-oriented programming based on
_generic functions_.

The most commonly used mechanism, called S3, allows a function to
_dispatch_ to a _method_ based on the _class_ of its first argument.

`plot` is a very simple generic function.

```{r}
plot
```

For example, the `plot` method for linear model fit objects produces a
set of 4 plots commonly used to assess regression fits.

```{r geyser_lm_fit, eval = FALSE}
plot(fit)
```

```{r, echo = FALSE, fig.height = 6.5, fig.width = 7}
op <- par(mfrow = c(2, 2))
<<geyser_lm_fit>>
par(op)
```

### Lazy and Non-Standard Evaluation

An unusual but useful feature of R is that function arguments are not
evaluated until their value is needed, so they may not be evaluated at
all. This is called _lazy evaluation_.

```{r, error = TRUE}
log("A")
f <- function(x) NULL
f(log("A"))
```

Functions can also capture the expression of the arguments they were
called with:

```{r}
f <- function(x) deparse(substitute(x))
f(a + b)
```

Together these features allow functions to evaluate their arguments in
_non-standard_ ways.

This is most commonly used to allow values for variables in arguments
to be found in a provided data frame.
The `with()` function is a simple example:

```{r, error = TRUE}
mean(eruptions)
```

```{r}
with(faithful, mean(eruptions))
```

Non-standard evaluation of this type is used extensively in the
_tidyverse_.

## The Tidyverse

[_Tidyverse_](https://www.tidyverse.org/) functions are designed to
perform operations on data frames.

The `dplyr` package provides a _grammar for data manipulation_.

A simple example: computing means and standard deviations for the
waiting times after the short (less than 3 minutes) and the long (3
minutes or more) eruptions:

```{r, message = FALSE}
library(dplyr)
tmp <- mutate(faithful, type = ifelse(eruptions < 3, "short", "long"))
head(tmp)
```

```{r}
summarize(group_by(tmp, type), mean = mean(waiting), sd = sd(waiting))
```

Tidyverse functions like to work with an enhanced form of data frame called a _tibble_.
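
The result of `summarize()` above is such a tibble. As a minimal sketch, an ordinary data frame can also be converted explicitly with `as_tibble()`, which `dplyr` re-exports from the `tibble` package; the name `faithful_tbl` is made up for this illustration.

```{r, message = FALSE}
library(dplyr)
faithful_tbl <- as_tibble(faithful)  # convert the base data frame to a tibble
class(faithful_tbl)                  # still a data frame, with extra tibble classes
faithful_tbl                         # printing shows only the first rows and the column types
```
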

A computation like this can be viewed as a _transformation pipeline_
consisting of three stages:

* mutation (adding a new variable)
* grouping (splitting by `type`)
* summarizing within groups.

Tidyverse code often uses the _forward pipe operator_ `%>%` provided
by the `magrittr` package to express such a pipeline.

R 4.1.0 and later also provides a _native pipe operator_ `|>`.

The pipe operator allows a call `f(x)` to be written as

```{r, eval = FALSE}
x %>% f()
```

The left hand value is passed implicitly as the first argument to the
function called on the right.

Using the pipe operator, the code for computing means and standard
deviations can be written as

```{r}
faithful %>%
    mutate(type = ifelse(eruptions < 3, "short", "long")) %>%
    group_by(type) %>%
    summarize(mean = mean(waiting), sd = sd(waiting))
```

There are trade-offs:

* Manipulation pipelines expressed this way are often more compact
  than ones using intermediate variables and/or nested calls.
* With pipe notation there is no need to come up with intermediate
  variable names.
* Pipe notation obscures the function calls that are actually
  happening and this can make debugging harder.

## Contrast to Point-and-Click Interfaces

* Even simple tasks require learning some of the R language.
* Once you can do simple tasks, you have learned some of the R language.
* More complicated tasks become easier.
* Even very complicated tasks become possible.

## R and Reproducibility

Analyses in R are carried out by running code describing the tasks to
perform.

This code can be

* audited to make sure the analysis is right;
* replayed to make sure the results are reproducible;
* reused after changes in the data or on new data.

_Literate data analysis_ tools like Rmarkdown provide support for this.

## Finding Out More

### Getting Help on Functions

* `help(mean)` will show help for the function `mean`.
* This can be abbreviated as `?mean`.

### Some R Introductions and Tutorials

* [An Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.html)
  introduces the language and shows how to use R for statistical
  analysis and graphics.
* Another [introduction to R](http://zoonek2.free.fr/UNIX/48_R/all.html)
  by Vincent Zoonekynd.
* [Quick-R](https://www.statmethods.net/), a web site related to the
  *R in Action* book.
* [R For Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf).
* [TryR](https://www.pluralsight.com/search?q=R) at Codeschool.
* [swirl: Learn R in R](https://swirlstats.com/).
* [_Hands-On Programming with R_](https://rstudio-education.github.io/hopr/).
* [_R for Data Science_](https://r4ds.had.co.nz/).
* [Data Science Dojo YouTube tutorials](https://www.youtube.com/c/Datasciencedojo/playlists?view=50&sort=dd&shelf_id=2).
* [Tutorials at RStudio](https://education.rstudio.com/learn/).
* [R for the Rest of Us](https://rfortherestofus.com/).
* There are _many_ others.

### Introductions to the Tidyverse

* Hadley Wickham and Garrett Grolemund (2016),
  [_R for Data Science_](https://r4ds.had.co.nz/), O'Reilly.
  (Book source on [GitHub](https://github.com/hadley/r4ds))
* [R Basics chapter](https://rafalab.dfci.harvard.edu/dsbook/r-basics.html)
  in Rafael A. Irizarry (2019),
  [Introduction to Data Science: _Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook/),
  Chapman & Hall/CRC.
  ([Book source on GitHub](https://github.com/rafalab/dsbook))

### R Markdown Tutorials

* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/)
  by Yihui Xie is a book-length presentation.
* The [R Markdown Home Page](https://rmarkdown.rstudio.com) has a link
  to a [tutorial](https://rmarkdown.rstudio.com/lesson-1.html).

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial
for these notes is [available](`r WLNK("tutorials/Rintro.Rmd")`).

You can run the tutorial with

```{r, eval = FALSE}
STAT4580::runTutorial("Rintro")
```

You can install the current version of the `STAT4580` package with

```{r, eval = FALSE}
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

## Exercises

1. Compute the mean of the numbers 1, 3, 5, 8.
2. What is the mean of the `eruptions` variable in the `faithful` data
   frame?
3. Find the average of the first 50 eruption durations in the
   `faithful` data frame.
4. Use the `median` function to modify the pipe example in the
   [tidyverse section](#the-tidyverse) to include medians.