class: center, middle, title-slide

.title[
# A Brief Overview of R
]
.author[
### Luke Tierney
]
.institute[
### University of Iowa
]
.date[
### 2023-05-06
]

---

<link rel="stylesheet" href="stat4580.css" type="text/css" />

## Background

--

R is a language, or an environment, for data analysis and visualization.

--

R is derived from the [_S_ language](https://en.wikipedia.org/wiki/S_(programming_language)) developed at AT&T Bell Laboratories.

--

R was originally developed for teaching at the University of Auckland, New Zealand, by Ross Ihaka and Robert Gentleman.

--

R is now maintained by an international group of about 20 statisticians and computer scientists.

--

A great strength of R is the large number of extension packages that have been developed.

--

The number available on [CRAN](https://cran.r-project.org) is now over 19,000.

---
layout: true

## Basic Usage

---

--

Interactive R uses a _command line interface_ (CLI).

--

The interface runs a _read-evaluate-print loop_ (REPL).

--

A simple interaction with the R interpreter:

```r
> 1 + 2
[1] 3
```

--

Values can be assigned to variables using a left arrow `<-` combination:

```r
> x <- c(1, 3, 5)
> x
[1] 1 3 5
```

--

.alert[
The `=` sign can also be used for assignment, but `<-` is recommended.
]

---

--

Basic arithmetic operations work element-wise on vectors:

```r
> x + x
[1]  2  6 10
```

--

Scalars are _recycled_ to the length of the longer operand:

```r
> x + 1
[1] 2 4 6
```

--

```r
> 2 * x
[1]  2  6 10
```

---

--

Some ways to create new vectors:

```r
> c(1, 2, 3)
[1] 1 2 3
```

--

```r
> c("a", "b", "c")
[1] "a" "b" "c"
```

--

```r
> 1 : 3
[1] 1 2 3
```

--

These examples show a _prompt_ as you would see in the interpreter.

--

Usually Rmarkdown documents show code and results like this:

```r
2 * x
## [1]  2  6 10
```

--

This makes it easier to copy code and paste it into another document or the R console.

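---

Recycling also applies when both operands are vectors and one is shorter than the other. A minimal sketch, using made-up values rather than the `x` defined earlier:

```r
# The shorter vector c(10, 20) is reused three times to match length 6.
1:6 + c(10, 20)
## [1] 11 22 13 24 15 26
```

R gives a warning when the longer length is not a multiple of the shorter one.
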
---
layout: false

## Data Frames

--

Data sets in R are often organized in _named lists_ of variables called _data frames_.

--

The value of the variable `faithful` is a data frame with two variables recorded for eruptions of the _Old Faithful_ geyser in Yellowstone National Park:

--

* `eruptions`: Eruption duration (minutes)

* `waiting`: Waiting time to next eruption (minutes)

--

`head()` shows the first 6 rows:

```r
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
```

---

## A Simple Scatter Plot

--

.pull-left.small-code.width-50[

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```
]

--

.pull-right.width-40[
<!-- ## nolint start -->
<img src="data:image/png;base64,#Rintro_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
<!-- ## nolint end -->
]

--

.pull-left[
Several graphics systems are available for R.
]

.pull-right[
]

--

.pull-left[
`plot()` is part of _base graphics_.
]

---

## Fitting a Linear Regression

.pull-left.width-45.small-code[

```r
fit <- with(faithful,
            lm(waiting ~ eruptions))
fit
## 
## Call:
## lm(formula = waiting ~ eruptions)
## 
## Coefficients:
## (Intercept)    eruptions  
##       33.47        10.73
```

{{content}}
]

--

You can also use the `data` argument to `lm()`:

```r
fit <- lm(waiting ~ eruptions, data = faithful)
```

{{content}}

--

`coef()` extracts the coefficients:

```r
coef(fit)
## (Intercept)   eruptions 
##    33.47440    10.72964
```

--

.pull-right.width-55.small-code[

`summary(fit)` provides more details:

```r
summary(fit)
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0796  -4.4831   0.2122   3.9246  15.9719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.4744     1.1549   28.98   <2e-16 ***
## eruptions    10.7296     0.3148   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.914 on 270 degrees of freedom
## Multiple R-squared:  0.8115,	Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
```
]

---
layout: true

## Adding the Regression Line to the Plot

---

Original plot:

.pull-left.small-code.width-50[
<!-- ## nolint start -->

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
```
<!-- ## nolint end -->
]

.pull-right.width-40[
<!-- ## nolint start -->
<img src="data:image/png;base64,#Rintro_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
<!-- ## nolint end -->
]

---

With a regression line:

.pull-left.small-code.width-50[
<!-- ## nolint start -->

```r
with(faithful,
     plot(eruptions, waiting,
          xlab = "Eruption time (min)",
          ylab = "Waiting time to next eruption (min)"))
*abline(coef(fit), col = "red", lwd = 3)
```
<!-- ## nolint end -->
]

.pull-right.width-40[
<img src="data:image/png;base64,#Rintro_files/figure-html/geyser-with-line-1.png" style="display: block; margin: auto;" />
]

---
layout: false

## Packages and Package Libraries

--

Extension code and data sets are often made available in _packages_.

--

Packages are stored in folders or directories as collections called _libraries_.

--

`.libPaths()` will show you the libraries your R process will search.

--

`search()` shows what packages are attached to the global search path.

--

The `library()` function is used to find packages in the libraries and attach them to the search path.

--

The expression `pkg::var` gets the value of variable `var` from package `pkg` without attaching `pkg`.

--

You can install packages using the `install.packages()` function or the **Install Packages** item in the RStudio **Tools** menu.

--

By default packages are installed from [CRAN](https://cran.r-project.org).

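--

A minimal sketch of these functions in use. The `stats` package and the numbers are chosen only for illustration, and the output of `.libPaths()` and `search()` depends on the local installation, so it is not shown:

```r
# Libraries searched by this R process (system dependent).
.libPaths()

# Packages currently attached to the search path (session dependent).
search()

# Call median() from the stats package with an explicit pkg::var reference.
stats::median(c(1, 3, 5, 8))
## [1] 4
```
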
--

It is also possible to use functions in the `remotes` package to install packages hosted on [GitHub](https://github.com/) or [GitLab](https://about.gitlab.com/).

---
layout: true

## A Useful Package: `ggplot2`

---

--

The `ggplot2` package provides a powerful alternative to the base graphics system.

--

The geyser example can be done in `ggplot2` like this:

.pull-left.small-code.width-55[

```r
library(ggplot2)
ggplot(data = faithful) +
    geom_point(mapping = aes(x = eruptions,
                             y = waiting)) +
    geom_smooth(mapping = aes(x = eruptions,
                              y = waiting),
                method = "lm",
                se = FALSE)
```
]

--

.pull-right.width-40[
<img src="data:image/png;base64,#Rintro_files/figure-html/geyser-ggplot-1.png" style="display: block; margin: auto;" />
]

---

--

`ggplot2` is part of a useful collection of packages called the [_tidyverse_](https://www.tidyverse.org/).

--

`ggplot2` is based on the _Grammar of Graphics_.

--

* Plots are composed of _geometric objects_ (`geoms`).

--

* Variables are _mapped_ to _aesthetic features_ of geometric objects.

--

A basic template for creating a plot with `ggplot2`:

<!-- # nolint start -->

```r
ggplot(data = <DATA>) +
    <GEOM>(mapping = aes(<MAPPINGS>))
```
<!-- # nolint end -->

---
layout: true

## Subsetting and Extracting Components

---

--

The _subset operator_ `[` can be used to extract elements by index:

```r
month.abb
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.abb[1 : 3]
## [1] "Jan" "Feb" "Mar"
```

--

Subsetting can also be based on a logical expression that returns `TRUE` or `FALSE` for each element:

```r
(starts_with_J <- substr(month.abb, 1, 1) == "J")
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
month.abb[starts_with_J]
## [1] "Jan" "Jun" "Jul"
```

--

.alert[
The value of an assignment operation is the right hand side value.

* Ordinarily this value is not printed.

* Placing the assignment expression in parentheses causes it to be printed.
]

---

Individual elements can be extracted using the _element operator_ `[[`:

.small-code[

```r
month.abb[[3]]
## [1] "Mar"
```
]

--

Components of named lists, like _data frames_, can be extracted with the `$` operator:

.small-code[

```r
names(faithful)
## [1] "eruptions" "waiting"
head(faithful, 4)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
head(faithful$eruptions, 4)
## [1] 3.600 1.800 3.333 2.283
```
]

--

.small-code[
The element operator can be used as well:

```r
head(faithful[["eruptions"]], 4)
## [1] 3.600 1.800 3.333 2.283
```
]

---
layout: false

## Functions

--

### Simple Functions

--

All computations in R are carried out by functions.

--

Defining a function allows you to avoid cutting and pasting code.

--

A simple function:

```r
ms <- function(x) list(mean = mean(x), sd = sd(x))
ms(faithful$eruptions)
## $mean
## [1] 3.487783
## 
## $sd
## [1] 1.141371
```

---
layout: true

### Generic Functions and Object-Oriented Programming

---

R supports several mechanisms for object-oriented programming based on _generic functions_.

--

The most commonly used mechanism, called S3, allows a function to _dispatch_ to a _method_ based on the _class_ of its first argument.

--

`plot` is a very simple generic function.

```r
plot
## function (x, y, ...) 
## UseMethod("plot")
## <bytecode: 0x55ab734fc1b0>
## <environment: namespace:base>
```

---

For example, the `plot` method for linear model fit objects produces a set of 4 plots commonly used to assess regression fits.

--

.pull-left.width-30[

```r
plot(fit)
```
]

.pull-right.width-60[
<img src="data:image/png;base64,#Rintro_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]

---
layout: true

### Lazy and Non-Standard Evaluation

---

An unusual but useful feature of R is that function arguments are not evaluated until their value is needed, so they may not be evaluated at all.

--

This is called _lazy evaluation_.

```r
log("A")
## Error in log("A"): non-numeric argument to mathematical function
f <- function(x) NULL
f(log("A"))
## NULL
```

--

Functions can also capture the expression of the arguments they were called with:

```r
f <- function(x) deparse(substitute(x))
f(a + b)
## [1] "a + b"
```

--

Together these features allow functions to evaluate their arguments in _non-standard_ ways.

---

This is most commonly used to allow values for variables in arguments to be found in a provided data frame.

--

The `with()` function is a simple example:

```r
mean(eruptions)
## Error in eval(expr, envir, enclos): object 'eruptions' not found
```

--

```r
with(faithful, mean(eruptions))
## [1] 3.487783
```

--

Non-standard evaluation of this type is used extensively in the _tidyverse_.

---
layout: true

## The Tidyverse

---
name: the-tidyverse

[_Tidyverse_](https://www.tidyverse.org/) functions are designed to perform operations on data frames.

--

The `dplyr` package provides a _grammar for data manipulation_.

.pull-left.small-code[
A simple example: computing means and standard deviations for the waiting times after the short (less than 3 minutes) and the long (3 minutes or more) eruptions:

```r
library(dplyr)
tmp <- mutate(faithful,
              type = ifelse(eruptions < 3,
                            "short",
                            "long"))
head(tmp)
##   eruptions waiting  type
## 1     3.600      79  long
## 2     1.800      54 short
## 3     3.333      74  long
## 4     2.283      62 short
## 5     4.533      85  long
## 6     2.883      55 short
```
]

--

.pull-right[

```r
summarize(group_by(tmp, type),
          mean = mean(waiting),
          sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84
```

.alert[
Tidyverse functions like to work with an enhanced form of data frame called a _tibble_.
]
]

---

A computation like this can be viewed as a _transformation pipeline_ consisting of three stages:

--

* mutation (adding a new variable)

* grouping (splitting by `type`)

* summarizing within groups.

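--

The three stages can also be written out one at a time with intermediate variables. This is only a sketch; the names `with_type`, `by_type`, and `res` are hypothetical and chosen for illustration:

```r
library(dplyr)

# Stage 1, mutation: add the new variable `type`.
with_type <- mutate(faithful, type = ifelse(eruptions < 3, "short", "long"))

# Stage 2, grouping: split the rows by `type`.
by_type <- group_by(with_type, type)

# Stage 3, summarizing: compute mean and sd of `waiting` within each group.
res <- summarize(by_type, mean = mean(waiting), sd = sd(waiting))
res
```

This should produce the same two-row summary as the nested `summarize(group_by(...))` call shown earlier.
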
--

Tidyverse code often uses the _forward pipe operator_ `%>%` provided by the `magrittr` package to express such a pipeline.

--

R 4.1.0 and later versions also provide a _native pipe operator_ `|>`.

--

The pipe operator allows a call `f(x)` to be written as

```r
x %>% f()
```

--

The left hand value is passed implicitly as the first argument to the function called on the right.

---

Using the pipe operator, the code for computing means and standard deviations can be written as

```r
faithful %>%
    mutate(type = ifelse(eruptions < 3, "short", "long")) %>%
    group_by(type) %>%
    summarize(mean = mean(waiting), sd = sd(waiting))
## # A tibble: 2 × 3
##   type   mean    sd
##   <chr> <dbl> <dbl>
## 1 long   80.0  5.99
## 2 short  54.5  5.84
```

--

There are trade-offs:

--

* Manipulation pipelines expressed this way are often more compact than ones using intermediate variables and/or nested calls.

--

* With pipe notation there is no need to come up with intermediate variable names.

--

* Pipe notation obscures the function calls that are actually happening and this can make debugging harder.

---
layout: false

## Contrast to Point-and-Click Interfaces

--

* Even simple tasks require learning some of the R language.

--

* Once you can do simple tasks, you have learned some of the R language.

--

* More complicated tasks become easier.

--

* Even very complicated tasks become possible.

---

## R and Reproducibility

Analyses in R are carried out by running code describing the tasks to perform.

--

This code can be

--

* audited to make sure the analysis is right;

--

* replayed to make sure the results are reproducible;

--

* reused after changes in the data or on new data.

--

_Literate data analysis_ tools like Rmarkdown provide support for this.

---
layout: true

## Finding Out More

---

### Getting Help on Functions

* `help(mean)` will show help for the function `mean`.

* This can be abbreviated as `?mean`.

--

### Some R Introductions and Tutorials

* [An Introduction to R](https://cran.r-project.org/doc/manuals/R-intro.html) introduces the language and shows how to use R for statistical analysis and graphics.

* Another [introduction to R](http://zoonek2.free.fr/UNIX/48_R/all.html) by Vincent Zoonekynd.

* [Quick-R](https://www.statmethods.net/), a web site related to the book *R in Action*.

* [R For Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf).

* [TryR](https://www.pluralsight.com/search?q=R) at Codeschool.

* [swirl: Learn R in R](https://swirlstats.com/).

* [_Hands-On Programming with R_](https://rstudio-education.github.io/hopr/).

* [_R for Data Science_](https://r4ds.had.co.nz/).

* [Data Science Dojo YouTube tutorials](https://www.youtube.com/c/Datasciencedojo/playlists?view=50&sort=dd&shelf_id=2).

* [Tutorials at RStudio](https://education.rstudio.com/learn/).

* [R for the Rest of Us](https://rfortherestofus.com/).

* There are _many_ others.

---

### Introductions to the Tidyverse

* Hadley Wickham and Garrett Grolemund (2016), [_R for Data Science_](https://r4ds.had.co.nz/), O'Reilly. (Book source on [GitHub](https://github.com/hadley/r4ds))

* [R Basics chapter](https://rafalab.dfci.harvard.edu/dsbook/r-basics.html) in Rafael A. Irizarry (2019), [Introduction to Data Science: _Data Analysis and Prediction Algorithms with R_](https://rafalab.dfci.harvard.edu/dsbook/), Chapman & Hall/CRC. ([Book source on GitHub](https://github.com/rafalab/dsbook))

### R Markdown Tutorials

* [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) by Yihui Xie is a book-length presentation.

* The [R Markdown Home Page](https://rmarkdown.rstudio.com) has a link to a [tutorial](https://rmarkdown.rstudio.com/lesson-1.html).

---
layout: false

## Interactive Tutorial

An interactive [`learnr`](https://rstudio.github.io/learnr/) tutorial for these notes is [available](../tutorials/Rintro.Rmd).

You can run the tutorial with

```r
STAT4580::runTutorial("Rintro")
```

You can install the current version of the `STAT4580` package with

```r
remotes::install_gitlab("luke-tierney/STAT4580")
```

You may need to install the `remotes` package from CRAN first.

---

## Exercises

1) Compute the mean of the numbers 1, 3, 5, 8.

<!-- The answer to Exercise 1 is closest to * 4.25 5.75 3.75 5.25 -->

2) What is the mean of the `eruptions` variable in the `faithful` data frame?

<!-- The answer to Exercise 2 is closest to * 3.49 3.35 3.87 3.16 -->

3) Find the average of the first 50 eruption durations in the `faithful` data frame.

<!-- The answer to Exercise 3 is closest to * 3.30 2.50 3.13 4.33 -->

4) Use the `median` function to modify the pipe example in the [tidyverse section](#the-tidyverse) to include medians.